AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

I was talking about an architectural change (similar to Pascal P100, but in the reverse direction). In the current GCN architecture, SIMD count and register file capacity are obviously tied together.

If you added 50% extra SIMDs and registers to a single CU, there would be 50% more clients for the CU's shared resources: 4 texture samplers, 16 KB of L1 cache and 64 KB of LDS. There would be lots of L1 thrashing, occupancy would be horrible in shaders that use lots of LDS, and more shaders would be sampler (filtering) bound. You could counteract these issues by having 6 texture samplers, 24 KB of L1 cache and 96 KB of LDS in each CU. However, a 50% fatter CU like this would be less energy efficient than the smaller one, since the shared resources would serve more clients. There would be more synchronization/communication overhead and a longer distance to move the data. I am not convinced this is the right way to go.
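As a rough back-of-the-envelope check of that scaling argument, here is a minimal Python sketch; the baseline numbers are the per-CU figures quoted above, and the two "fat CU" variants are purely hypothetical:

```python
# Back-of-the-envelope sketch of shared-resource pressure in a GCN-style CU.
# Baseline figures are the per-CU numbers quoted above; the "fat" variants are
# hypothetical and only illustrate the clients-per-resource argument.

baseline = {"SIMDs": 4, "samplers": 4, "L1_KB": 16, "LDS_KB": 64}

# 50% more SIMDs, shared resources unchanged: every shared resource now serves
# 1.5x as many clients.
fat_same_shared = {"SIMDs": 6, "samplers": 4, "L1_KB": 16, "LDS_KB": 64}

# 50% more SIMDs with shared resources scaled up to match.
fat_scaled_shared = {"SIMDs": 6, "samplers": 6, "L1_KB": 24, "LDS_KB": 96}

for name, cu in [("baseline", baseline),
                 ("fat CU, same shared resources", fat_same_shared),
                 ("fat CU, scaled shared resources", fat_scaled_shared)]:
    print(f"{name}: {cu['SIMDs'] / cu['samplers']:.2f} SIMDs per sampler, "
          f"{cu['L1_KB'] / cu['SIMDs']:.1f} KB L1 per SIMD, "
          f"{cu['LDS_KB'] / cu['SIMDs']:.1f} KB LDS per SIMD")
```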

What about unified register, texture cache ?
 
Do you mean just a unified register file between all the SIMDs, or a unified register file between all the SIMDs that is also the vector cache?
The bandwidth, capacity, and area differences between the SIMD-local register files, shared storage like the LDS, and the vector cache are very significant.
So sharing what hasn't been shared before, or moving accesses that used to be low-overhead into the pool with the highest cost, will either constrain register operand sourcing or require a large expansion of the cache portion.
 

I mean a unified cache per SM(X, ...) that serves both cached registers and texture data.
Sharing might be a good idea, as both need massive bandwidth, and a larger shared cache is always desirable in compute-only or texturing-only cases.
 
I'm a bit confused. A unified cache for registers?
The L1 vector data cache is already shared by all SIMDs in a CU. It can cache all data transfers in and out of the CU, apart from exports to the ROPs and scalar loads and writes. That means it already caches textures as well as the usual buffers one accesses with graphics or compute shaders (excluding constant buffers accessed by the scalar unit and the framebuffer accessed through the ROPs).
 
You could introduce a new buffer that stores the data of in-flight loads. Currently you need to reserve a destination VGPR for memory instructions. This is needed because you don't know the memory latency; you issue s_waitcnt before using the loaded registers.

Instead you could have a CU-wide load buffer that keeps the loaded data until it is copied to a VGPR just before its first use. Data would be copied from this buffer to the target VGPR when the s_waitcnt completes. This would allow the compiler to use that VGPR for other purposes instead of it serving as a dummy storage slot. This would practically increase the usable register count, as the average register lifetime would be much shorter: there would be no need to extend register lifetimes to hide memory latency, because the separate CU-wide buffer would hold incoming loads. It would actually allow more latency hiding, as the compiler could be more aggressive in moving loads away from their uses. This kind of load buffer wouldn't need to be as fast as registers, since a data load (and the s_waitcnt before reading it) is a much less frequent action than addressing a register.

Data would still be loaded into the L1 cache first, and this new load buffer would be filled from L1, just like VGPRs are filled from L1 today. It would just be temporary, short-lifetime storage (replacing the current mechanism of allocating VGPRs to hold incoming data until the s_waitcnt).
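As a rough illustration of why this would shorten VGPR lifetimes, here is a toy Python sketch; the load count, the single-slot assumption and the buffer sizing are invented for illustration and are not a model of real GCN scheduling:

```python
# Toy comparison (not real GCN behaviour): how many destination VGPRs a wave keeps
# pinned while hiding memory latency, with and without the hypothetical CU-wide
# load buffer described above. All numbers are invented for illustration.

IN_FLIGHT_LOADS = 8                  # assumed loads hoisted ahead of their first use
BYTES_PER_VGPR_PER_WAVE = 64 * 4     # 64 lanes * 4 bytes

# Current scheme: each in-flight load reserves its destination VGPR from issue
# until the s_waitcnt before first use, so all of them stay live at once.
pinned_current = IN_FLIGHT_LOADS

# Load-buffer scheme: returned data waits in the shared buffer and is copied into
# a VGPR only at the s_waitcnt, so only the value about to be used needs a slot.
pinned_buffered = 1

buffer_bytes_needed = IN_FLIGHT_LOADS * BYTES_PER_VGPR_PER_WAVE

print(f"VGPRs pinned per wave (current):      {pinned_current}")
print(f"VGPRs pinned per wave (load buffer):  {pinned_buffered}")
print(f"CU load-buffer bytes needed per wave: {buffer_bytes_needed}")
```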

I don't understand the "register cache" either. GPUs normally don't spill registers to memory, so there's no need to cache them. I could understand a tiny L0 register cache that holds the hot registers (reducing register file accesses for recently used registers). IIRC Nvidia introduced an L0 register cache in Maxwell (it holds previously used registers). This reduces register file traffic and bank conflicts, and Nvidia could simplify their register file design because of it.
 
I'm a bit confused. A unified cache for registers?
The L1 vector data cache is already shared by all SIMDs in a CU. It can cache all data transfers in and out of the CU, apart from exports to the ROPs and scalar loads and writes. That means it already caches textures as well as the usual buffers one accesses with graphics or compute shaders (excluding constant buffers accessed by the scalar unit and the framebuffer accessed through the ROPs).

That may be true for Vega but not for Volta.
 
I don't understand the "register cache" either. GPUs normally don't spill registers to memory, so there's no need to cache them. I could understand a tiny L0 register cache that holds the hot registers (reducing register file accesses for recently used registers). IIRC Nvidia introduced an L0 register cache in Maxwell (it holds previously used registers). This reduces register file traffic and bank conflicts, and Nvidia could simplify their register file design because of it.

With a register cache I refer to the CPU analog of an L1D cache, with limited real registers...
 
A GPU doesn't need an L1D cache for that purpose: there's no stack in memory, and variables in shader code are pure registers with no memory backing. There's no spilling either (unless the shader is awfully written). GPUs do not perform frequent register<->memory moves.
 

Sure, but that doesn't mean it will be like that forever.
 
The GPU register file (256 KB per GCN CU) is much bigger than the L1D cache (16 KB per CU). One VGPR at full 40-wave occupancy takes 40 * 64 * 4 = 10240 bytes. Swapping just two registers (for all waves) to memory would trash the L1 cache completely. We would need MUCH larger caches to do CPU-style memory stack + memory-backed register programming on a GPU.
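For reference, the arithmetic behind that claim, using the GCN per-CU figures quoted above:

```python
# Arithmetic behind the claim above, using the GCN per-CU figures quoted in the post.
WAVES_PER_CU = 40          # full occupancy
LANES_PER_WAVE = 64
BYTES_PER_LANE = 4
L1_BYTES = 16 * 1024       # 16 KB L1 vector data cache per CU
RF_BYTES = 256 * 1024      # 256 KB of VGPRs per CU

bytes_per_vgpr_all_waves = WAVES_PER_CU * LANES_PER_WAVE * BYTES_PER_LANE
print(bytes_per_vgpr_all_waves)                 # 10240 bytes for a single VGPR
print(2 * bytes_per_vgpr_all_waves > L1_BYTES)  # True: two spilled VGPRs exceed the L1
print(RF_BYTES / L1_BYTES)                      # 16.0: register file is 16x the L1
```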
 
GPU registers are just big chunks of SRAM with register semantics, that is, values are explicitly loaded and stored to the common memory pool. The reason for the big size is to tolerate main memory access latencies. It would make zero sense to stick a cache in between it and main memory.

Cheers
 
Just an elision of another "may", maybe?
"That may be true for Vega but may not be for Volta."

So in the end this statement would read like: "The weather might be good tomorrow; it could be the same or different the day after..."

Based on what we know from GCN, it is likely that the feature will be in Vega, but how this offers any conclusion about Volta escapes me.
 
GPU registers are just big chunks of SRAM with register semantics, that is, values are explicitly loaded and stored to the common memory pool. The reason for the big size is to tolerate main memory access latencies. It would make zero sense to stick a cache in between it and main memory.
A multi-tiered register file would make lots of sense. As you said, the GPU register file is just a big SRAM pool, and it is used for multiple purposes. Load instructions use the register file as temporary storage. Compilers try to separate the load from the use to hide latency; in between the load and the use, the only purpose of the register is to store data, and it is never accessed. Registers also store various data during kernel execution. In many cases data is stored across long loops/branches and used hundreds of instructions later (thousands of cycles). Only a small portion of the register file is actually being used by the execution units at any given time (reading operands & writing results).

Having a huge 64 KB (per SIMD) register file near the execution units is expensive. It needs to provide one register per cycle (on 3 cycles out of the 4-cycle cadence) to the execution units. It should be more power efficient (plus lower latency and higher clocks) to have a smaller "L1" register file (for example 16 KB) near the execution units and a bigger (for example 80 KB) "L2" register file a bit further away (not connected to the execution units). This larger register file would store all the registers that are not accessed in the immediate future. Data movement between these two register files could be controlled by instructions (moving data between the L1<->L2 register files), which the compiler would emit automatically. All loads would go into the L2 register file, as allocating a register early to hide load latency is one of the biggest causes of VGPR pressure; by definition these registers are used solely to store data until all 64 lanes in the wave are loaded.

This example would provide 1.5x the registers per SIMD, but would require a much smaller (full speed) L1 register file near the SIMD execution units. The L2 register files could be half clocked and/or have a latency of 2 cycles to provide data to the L1 register file. On the other hand, the smaller L1 register file + execution units could be clocked higher. I am not a hardware engineer, but I would expect AMD's register file (plus the execution cadence of 4 cycles, which are closely tied together) to be one of the things preventing them from reaching clock rates similar to Maxwell/Pascal. Nvidia simplified their register file for Maxwell (direct modulo mapping) and added a register reuse cache to avoid bank conflict problems. Both reduce power and allow higher clocks. AMD needs to do something to follow Nvidia.
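A minimal sketch of the capacity arithmetic in this example; the 16 KB / 80 KB split is the hypothetical one proposed above, not anything announced:

```python
# Capacity arithmetic for the hypothetical two-level register file sketched above.
CURRENT_RF_KB = 64          # per-SIMD VGPR file in GCN today
L1_RF_KB = 16               # proposed small, fast file next to the execution units
L2_RF_KB = 80               # proposed larger, slower file further away

total_kb = L1_RF_KB + L2_RF_KB
print(f"total per-SIMD register capacity: {total_kb} KB "
      f"({total_kb / CURRENT_RF_KB:.1f}x current)")                  # 96 KB, 1.5x
print(f"full-speed SRAM next to the SIMD shrinks from {CURRENT_RF_KB} KB "
      f"to {L1_RF_KB} KB ({L1_RF_KB / CURRENT_RF_KB:.0%})")          # 25%
```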

Update:
Seems that this is already a well researched topic (gains are nice): https://www.cs.utexas.edu/users/skeckler/pubs/micro11.pdf
 
Optimizing SGEMM for Maxwell:
https://github.com/NervanaSystems/maxas/wiki/SGEMM

On Maxwell there are 4 register banks, but unlike on Kepler (also with 4 banks) the assignment of banks to numbers is very simple. The Maxwell assignment is just the register number modulo 4. On Kepler it is possible to arrange the 64 FFMA instructions to eliminate all bank conflicts. On Maxwell this is no longer possible. Maxwell, however, provides something to make up for this and at the same time offers the capability to significantly reduce register bank traffic and overall chip power draw. This is the operand reuse cache. The operand reuse cache has 8 bytes of data per source operand slot. An instruction like FFMA has 3 source operand slots. Each time you issue an instruction there is a flag you can use to specify if each of the operands is going to be used again. So the next instruction that uses the same register in the same operand slot will not have to go to the register bank to fetch its value. And with this feature you can see how a register bank conflict can be averted.

This sounds very close to LRF (last result file) from this paper: https://www.cs.utexas.edu/users/skeckler/pubs/micro11.pdf.

If you look at the LRF results, you see that a size=1 LRF already brings most of the gains of this whole system, but with a very limited amount of complexity. The ORF (the middle tier of the 3-tier register file) is nice, but not as important as the LRF. That explains why Nvidia added a size=1 LRF to Maxwell.

In this paper the main register file was, however, fully clocked and ported (no performance reduction compared to ORF and LRF access). If the main register file were larger and further away, the 3-level design would be the best choice. I wonder how Nvidia managed to scale their register files 2x larger in Pascal P100 :)
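To make the quoted reuse-cache mechanism a bit more concrete, here is a toy Python model; it simplifies by assuming the reuse flag is set whenever the same register recurs in the same operand slot, and the FFMA-style sequence is invented:

```python
# Toy model of a per-operand-slot reuse cache, as described in the maxas wiki quote.
# Simplification: we assume the reuse flag is set whenever the same register shows
# up again in the same operand slot; the FFMA-style sequence below is invented.

def count_bank_fetches(instructions):
    """instructions: list of (srcA, srcB, srcC) register names, one per operand slot."""
    reuse_slots = [None, None, None]   # last register fetched through each slot
    fetches = 0
    for srcs in instructions:
        for slot, reg in enumerate(srcs):
            if reuse_slots[slot] == reg:
                continue               # hit in the reuse cache, no bank access
            fetches += 1               # miss: read the register bank
            reuse_slots[slot] = reg
    return fetches

# Slot 0 keeps re-reading R8, the kind of pattern an SGEMM inner loop is scheduled for.
seq = [("R8", "R16", "R0"),
       ("R8", "R17", "R1"),
       ("R8", "R18", "R2"),
       ("R9", "R18", "R3")]

print("bank fetches with reuse cache:   ", count_bank_fetches(seq))  # 9
print("bank fetches without reuse cache:", 3 * len(seq))             # 12
```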
 