I would think it's time to move this discussion to the generic AMD architecture thread. This discussion has nothing to do with consoles. We are speculating how the current GCN architecture works, and what kind of options AMD has to evolve it for Vega. The Vega rumors that started the 50% fatter CU speculation aren't even about a console product.
If I understood the GCN execution model correctly, there either have to be four separate 16 wide register files (16 KB each) per SIMD, or a multi-ported register file that is capable of 3 reads per cycle. I might of course be completely wrong as I am not a hardware engineer. I might have missed something.
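As a quick sanity check on those sizes, here is a trivial back-of-the-envelope calculation in Python. It assumes the commonly quoted GCN limits (up to 256 VGPRs per 64 wide wavefront, 4 bytes per lane); nothing else about the hardware is implied.

```python
# Quick check of where the 64 KB / 16 KB figures come from. Assumes the commonly
# quoted GCN limits: up to 256 VGPRs per wavefront, 64 lanes, 4 bytes per lane.
VGPRS = 256
LANES = 64
BYTES_PER_LANE = 4

simd_regfile = VGPRS * LANES * BYTES_PER_LANE   # 65536 bytes = 64 KB per SIMD
per_bank     = VGPRS * 16 * BYTES_PER_LANE      # 16384 bytes = 16 KB per 16 lane bank
print(simd_regfile // 1024, "KB per SIMD,", per_bank // 1024, "KB per bank")
```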
64 wide instructions take 4 cycles to finish (4 cycle latency). However, the SIMD starts executing a new 16 wide instruction every cycle and finishes a 16 wide "partial" instruction every cycle (SIMD throughput of 16x FMADD/cycle). If the SIMD first fetched whole 64 wide registers (one per cycle), execution would start at cycle 3 (0-based). The SIMD execution unit is only 16 wide, meaning that the work would be finished on cycle 7. In order to keep the 16 wide FMADD ALU filled with work every cycle, the next instruction would need to start its fetches 3 cycles before the previous one ends. However, the 64 wide register fetch would need all lanes of the previous instruction to be completed (assuming it has a dependency). GCN documents clearly specify that even dependent instructions can be executed one after another with no stalls.
So I would assume that the register fetches are also 16 wide and pipelined. This would completely hide the latency and result in 100% utilization of the FMADD unit. Parts of two consecutive instructions would always be in flight.
Let's assume four 16 KB register files. Split by 16 lane boundaries. Single read port each. Let's also assume register fetch takes a single cycle and execution takes a single cycle. FMADD is thus 3 register fetch cycles + 1 execute cycle. This is what we get.
Timeline:
1. Instruction A (0-15) fetches register 0
2. Instruction A (16-31) fetches register 0
2. Instruction A (0-15) fetches register 1
3. Instruction A (32-47) fetches register 0
3. Instruction A (16-31) fetches register 1
3. Instruction A (0-15) fetches register 2
4. Instruction A (48-63) fetches register 0
4. Instruction A (32-47) fetches register 1
4. Instruction A (16-31) fetches register 2
4. Instruction A (0-15) executes FMADD + stores result
5. Instruction A (48-63) fetches register 1
5. Instruction A (32-47) fetches register 2
5. Instruction A (16-31) executes FMADD + stores result
5. Instruction B (0-15) fetches register 0
6. Instruction A (48-63) fetches register 2
6. Instruction A (32-47) executes FMADD + stores result
6. Instruction B (16-31) fetches register 0
6. Instruction B (0-15) fetches register 1
7. Instruction A (48-63) executes FMADD + stores result
7. Instruction B (32-47) fetches register 0
7. Instruction B (16-31) fetches register 1
7. Instruction B (0-15) fetches register 2
8. Instruction B (48-63) fetches register 0
8. Instruction B (32-47) fetches register 1
8. Instruction B (16-31) fetches register 2
8. Instruction B (0-15) executes FMADD + stores result
9. Instruction B (48-63) fetches register 1
9. Instruction B (32-47) fetches register 2
9. Instruction B (16-31) executes FMADD + stores result
10. Instruction B (48-63) fetches register 2
10. Instruction B (32-47) executes FMADD + stores result
11. Instruction B (48-63) executes FMADD + stores result
The cycle number is at the beginning of each line. Steady state is cycles 4-8, where every cycle performs three register fetches and one execute; it continues as long as we have no stalls (memory waits). In steady state a new (64 wide) instruction is issued every 4 cycles, and one (64 wide) instruction retires every 4 cycles (just like described in the GCN documents).
Worth noting:
- One 16 wide FMADD gets executed every cycle. 100% ALU unit usage in steady state.
- Three 16 wide register fetches per cycle.
- All three register fetches in the same cycle come from separate 16 KB register files (split by 16 lanes). No need for big multi-ported register file.
- One 16 wide register write per cycle. Cycling through all four 16 KB register files.
Conclusion: Four small (fully separate) 16 KB register files per SIMD would work perfectly. I don't see how a single big 64 KB register file would work, unless it has 3 read ports. But I might have misunderstood something, as I am not a hardware designer.
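To sanity check the hand-written timeline, below is a minimal Python sketch that replays the same schedule for a longer chain of dependent FMADDs and verifies the two observations (single-ported 16 KB banks are enough, and the ALU stays 100% busy once the pipeline is full). It encodes only the assumptions made in this post (four banks, 1 cycle per fetch, 1 cycle per execute, a new instruction every 4 cycles), so treat it as an illustration of the idea, not a model of the real hardware.

```python
# Replays the pipelined schedule speculated above. Assumptions:
#   - four register banks (lanes 0-15, 16-31, 32-47, 48-63), one read + one write port each
#   - a 16 wide register fetch takes 1 cycle, the 16 wide FMADD + write back takes 1 cycle
#   - quarter k of an instruction starts k cycles after quarter 0
#   - a new 64 wide instruction issues every 4 cycles
from collections import defaultdict

NUM_INSTRUCTIONS = 8          # back-to-back (dependent) FMADDs
SRC_OPERANDS = 3              # FMADD reads three source registers

reads = defaultdict(int)      # (cycle, bank) -> number of 16 wide register reads
writes = defaultdict(int)     # (cycle, bank) -> number of 16 wide register writes
alu_busy = set()              # cycles in which the 16 wide ALU executes a quarter

for i in range(NUM_INSTRUCTIONS):
    issue = 4 * i                                  # new 64 wide instruction every 4 cycles
    for quarter in range(4):                       # quarter == bank == group of 16 lanes
        start = issue + quarter                    # quarters are staggered by one cycle
        for op in range(SRC_OPERANDS):             # one 16 wide operand fetch per cycle
            reads[(start + op, quarter)] += 1
        writes[(start + SRC_OPERANDS, quarter)] += 1   # execute FMADD + store result
        alu_busy.add(start + SRC_OPERANDS)

# Observation 1: no bank ever needs more than one read or one write in a cycle,
# so single-ported 16 KB banks are enough.
assert all(n == 1 for n in reads.values()) and all(n == 1 for n in writes.values())

# Observation 2: once the pipeline is full, a 16 wide FMADD executes every cycle.
last_execute = 4 * (NUM_INSTRUCTIONS - 1) + 3 + SRC_OPERANDS
assert all(c in alu_busy for c in range(SRC_OPERANDS, last_execute + 1))
print("Conflict free with single-ported banks, 100% ALU usage in steady state.")
```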
Now let's read AMD's cross-lane operations article:
http://gpuopen.com/amd-gcn-assembly-cross-lane-operations/
If we ignore the new DPP operations, all the older cross lane operations go through the LDS permutation hardware. You need an extra register to receive the result, and you need to use waitcnt to ensure completion before you read the result. This is similar to all variable latency memory ops. Direct access between the 16 lane register files is thus not required; the LDS permutation hardware takes care of mixing the data. This could be seen as a hint that at least earlier GCN designs could have used separate 16 KB register files (lane / 16).
The new DPP operations don't use the LDS permutation hardware (and don't need waitcnt) but are more limited. Some of them move data across 16 lane boundaries, so there must be some kind of data path for fetching data from the other 16 KB register files. DPP requires two wait states (= 2x NOP if there are no independent instructions to fill the gap). Two wait states = 8 cycles (a NOP takes 4 cycles like every other instruction). Maybe there's some new additional permute hardware inside the SIMD that does this, but it is either pipelined itself and/or sits a bit further away, which would explain the extra cycles. Separate 16 KB register files would thus still work.
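To illustrate why such a data path is needed, here is a toy Python model of a full-wavefront lane shift (in the spirit of the wave shift DPP controls described in the article; the exact GCN semantics are not claimed here). It only shows which lanes would have to read a value that lives in a different 16 lane bank.

```python
# Toy model of shifting the whole 64 wide wavefront by one lane. This is only
# meant to show which lanes need data from another 16 lane bank; it does not
# claim to match the exact semantics of any particular DPP control.
WAVE = 64
BANK = 16

values = list(range(WAVE))              # lane i initially holds the value i
shifted = [None] + values[:-1]          # each lane receives its lower neighbour's value

cross_bank = [lane for lane in range(WAVE)
              if shifted[lane] is not None and shifted[lane] // BANK != lane // BANK]
print(cross_bank)                       # [16, 32, 48]: reads that cross a bank boundary
```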
If this is all true, then GCN needs to execute the FMADD (ALU part) in a single cycle. The result needs to be written back to the register file immediately, as the next instruction might fetch it on the next cycle. I don't know whether this is possible, and if it is, how much it would limit the clock rate.
DISCLAIMER: If I have understood something wrong, please correct my assumptions about hardware engineering.
I have also been wondering why the 16 wide SIMD wouldn't have 16 wide split register files: four 16 KB register files (split to serve lanes 0-15, 16-31, 32-47 and 48-63). One possible way GCN's register files could be banked that would make sense with the 16-wide SIMDs and the 4-cycle cadence is that the registers are subdivided into four 16-wide banks. It removes conflicts between phases and the need for multiple ports, and it would match what the hardware is doing. It would also be a variation on the R600 register access method, which took 3 cycles while gathering components from 4 banks.