I was talking about an architectural change (similar to Pascal P100, but in the reverse direction). In the current GCN architecture, SIMD count and register file capacity are obviously tied together.
If you added 50% extra SIMDs and registers to a single CU, there would be 50% more clients for the CU's shared resources: 4 texture samplers, 16 KB of L1 cache and 64 KB of LDS. There would be lots of L1 thrashing, occupancy would be horrible in shaders that use lots of LDS, and more shaders would be sampler (filtering) bound. You could counteract these issues by having 6 texture samplers, 24 KB of L1 cache and 96 KB of LDS in each CU. However, a 50% fatter CU like this would be less energy efficient than the smaller one, since the shared resources are shared among more clients. There would be more synchronization/communication overhead and longer distances to move the data. I am not convinced this is the right way to go.
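To put rough numbers on it, here is a tiny C sketch of the per-SIMD share of those shared resources in the three cases (the per-CU figures are the ones above; the "per-SIMD share" framing and the three configuration names are just my illustration):

    #include <stdio.h>

    /* Per-SIMD share of CU-shared resources for three hypothetical CU layouts:
       the current 4-SIMD GCN CU, a 6-SIMD CU with unchanged shared resources,
       and a 6-SIMD CU with the shared resources scaled up by 50%. */
    static void print_share(const char *name, int simds, int samplers,
                            int l1_kb, int lds_kb)
    {
        printf("%-24s %d SIMDs: %.2f samplers, %.2f KB L1, %.2f KB LDS per SIMD\n",
               name, simds,
               (double)samplers / simds,
               (double)l1_kb / simds,
               (double)lds_kb / simds);
    }

    int main(void)
    {
        print_share("current CU",             4, 4, 16, 64);
        print_share("+50% SIMDs only",        6, 4, 16, 64);
        print_share("+50% SIMDs + resources", 6, 6, 24, 96);
        return 0;
    }

The middle line is the problem case: per-SIMD share of everything drops by a third. The last line restores the ratios, but only by building the fatter, less efficient CU described above.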
What about unified register, texture cache?
Do you mean just a unified register file between all the SIMDs, or a unified register file between all the SIMDs that is also the vector cache?
The bandwidth, capacity, and area differences between the SIMD-local register files, shared storage like the LDS, and the vector cache are very significant.
So sharing what hasn't been shared before, or moving accesses that used to be low-overhead into the pool with the highest cost, will either constrain register operand sourcing or require a large expansion of the cache portion.
I mean a unified cache per SM(X, ...) that serves both cached registers and texture data.
Sharing might be a good idea, as both need massive bandwidth, and a shared larger cache is always desirable in compute-only or texturing-only cases.
I'm a bit confused. Unified cache for registers?
The L1 vector data cache is already shared by all SIMDs in a CU. It can cache all data transfers in and out of the CU, apart from exports to the ROPs and scalar data loads and writes. That means it already caches textures as well as the usual buffers one accesses with graphics or compute shaders (the exceptions being constant buffers accessed through the scalar unit and the framebuffer accessed through the ROPs).
You could introduce a new buffer that stores the data of in-flight loads. Currently you need to reserve a VGPR as the destination of each memory instruction. This is needed because you don't know the memory latency; you issue s_waitcnt before using the loaded registers.
Instead you could have a CU-wide load buffer that keeps the loaded data until it is copied to a VGPR just before the first use. Data would be copied from this buffer to the target VGPR when the s_waitcnt is ready. This would allow the compiler to use that VGPR for other purposes instead of serving as a dummy storage slot. It would practically increase the usable register count, as the average register lifetime would be much shorter: there would be no need to extend register lifetimes to hide memory latency, since the separate CU-wide buffer would hold the incoming loads. It would actually allow more latency hiding, as the compiler could be more aggressive in moving loads further away from their uses. This kind of load buffer wouldn't need to be as fast as registers, since loading data (and the s_waitcnt before reading it) is a much less frequent action than addressing a register.
Data would still be loaded into the L1 cache first, and this new load buffer would be filled from L1, just like VGPRs are filled from L1 now. It would just be temporary, short-lifetime storage (replacing the current mechanism of allocating VGPRs to hold incoming data until s_waitcnt).
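Something like this toy C model is what I mean. Every name and size here is made up (this is not how GCN works today, it is only the proposal above); it just shows where the VGPR lifetime moves:

    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of the hypothetical CU-wide load buffer.
       Today: buffer_load writes straight into a VGPR, so that VGPR is live
       from issue until the s_waitcnt before the first use.
       Proposal: the in-flight result sits in a small CU-wide buffer and is
       only copied into a VGPR when the wait completes, so the VGPR is live
       only from the wait to the last use. */

    #define ENTRIES 16          /* capacity of the hypothetical load buffer */
    #define VGPRS   8           /* tiny register file for the example       */

    typedef struct { int valid; uint32_t data; } LoadEntry;

    static LoadEntry load_buffer[ENTRIES];
    static uint32_t  vgpr[VGPRS];

    /* Issue a load: reserve a buffer slot, no VGPR is consumed yet. */
    static int issue_load(uint32_t value_from_memory)
    {
        for (int i = 0; i < ENTRIES; i++) {
            if (!load_buffer[i].valid) {
                load_buffer[i].valid = 1;
                load_buffer[i].data  = value_from_memory; /* arrives after some latency */
                return i;                                 /* ticket the wave waits on    */
            }
        }
        return -1; /* buffer full: the wave would stall, like waitcnt today */
    }

    /* The s_waitcnt-equivalent: copy the completed load into its target VGPR. */
    static void wait_and_claim(int ticket, int target_vgpr)
    {
        vgpr[target_vgpr] = load_buffer[ticket].data;
        load_buffer[ticket].valid = 0;   /* slot can be reused immediately */
    }

    int main(void)
    {
        int t = issue_load(42);   /* load issued early to hide latency         */
        vgpr[0] = 7;              /* v0 is still usable for other work here ... */
        wait_and_claim(t, 0);     /* ... and only becomes the load's home now   */
        printf("v0 = %u\n", vgpr[0]);
        return 0;
    }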
I don't understand the "register cache" either. GPUs normally don't spill registers to memory, so there's no need to cache them. I could understand a tiny L0 register cache that holds the hot registers (reducing register file accesses for recently used registers). IIRC Nvidia introduced an L0 register cache in Maxwell (it holds previously used registers). This reduces register file traffic and bank conflicts. Nvidia could simplify their register file design because of this.
With a register cache I refer to the CPU analog of a L1D cache, with limited real registers...
Volta? How do you know anything about Volta?
GPU doesn't need L1D cache for that purpose, as there's no stack in memory and variables in shader code are pure registers with no memory backing. No spilling either (unless shader is awfully written). GPUs do not perform frequent register<->memory moves.
GPU register file (256 KB in a GCN CU) is much bigger than the L1D cache (16 KB in a GCN CU). One VGPR (at full 40 wave occupancy) takes 40 * 64 * 4 = 10240 bytes. Swapping just two registers (for all waves) out to memory would already thrash the L1 cache completely (2 * 10240 bytes = 20 KB, more than the whole 16 KB L1). We would need MUCH larger caches to do CPU-style memory stack + memory-backed register programming on a GPU.
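Quick C check of that footprint for a few register counts (same 40 waves * 64 lanes * 4 bytes figures as above, compared against the 16 KB L1):

    #include <stdio.h>

    int main(void)
    {
        const int waves = 40, lanes = 64, bytes_per_lane = 4; /* full GCN CU occupancy */
        const int l1_bytes = 16 * 1024;                       /* GCN CU L1 data cache  */

        for (int vgprs = 1; vgprs <= 4; vgprs++) {
            int footprint = vgprs * waves * lanes * bytes_per_lane;
            printf("%d VGPR(s) spilled for all waves = %6d bytes (%.1fx the 16 KB L1)\n",
                   vgprs, footprint, (double)footprint / l1_bytes);
        }
        return 0;
    }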
Sure but that doesn't mean it will be like that forever.
It's true for current GCN GPUs and very likely Vega too.
That may be true for Vega but not for Volta.
I don't. I speculate.
That's not speculation, that's a statement of fact.
"That may be true for Vega but likely not for Volta."
Just an elision of another may - maybe?
"That may be true for Vega but may not be for Volta."
GPU registers are just big chunks of SRAM with register semantics, that is, values are explicitly loaded and stored to the common memory pool. The reason for the big size is to tolerate main memory access latencies. It would make zero sense to stick a cache in between it and main memory.
A multi-tiered register file would make lots of sense. As you said, the GPU register file is just a big SRAM pool. It is used for multiple purposes. Load instructions use register files as temporary storage: compilers try to separate the load from the use to hide latency, and in between load and use the only purpose of the register is to store data, it is never accessed. Registers also store various data during kernel execution. In many cases data is stored over long loops/branches and used hundreds of instructions later (thousands of cycles). Only a small portion of the register file is currently used by the execution units (read operands & write results).
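As a plain C stand-in for a shader (the hot/cold labels are only my annotation, there is no such mechanism in current hardware), this is the kind of live-range split a tiered register file could exploit:

    #include <stdio.h>

    /* Illustrative only: a shader-style loop, annotated with how long each value
       stays live and how often it is actually read. A multi-tiered register file
       would keep the per-iteration operands in the small fast tier and park the
       long-lived but rarely touched values in a bigger, slower tier. */
    float shade(const float *texels, int n)
    {
        float material_bias  = texels[0];  /* "cold": live across the whole loop, */
        float material_scale = texels[1];  /*  but read only once after it ends   */

        float accum = 0.0f;                /* "hot": read and written every iteration */
        for (int i = 2; i < n; i++) {
            float s = texels[i];           /* "hot": short live range, one use        */
            accum += s * s;                /* the inner loop touches only accum and s */
        }
        return accum * material_scale + material_bias;  /* cold values finally used */
    }

    int main(void)
    {
        float t[6] = {0.5f, 2.0f, 1.0f, 2.0f, 3.0f, 4.0f};
        printf("%f\n", shade(t, 6));
        return 0;
    }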
On Maxwell there are 4 register banks, but unlike on Kepler (also with 4 banks) the assignment of banks to register numbers is very simple: the Maxwell bank is just the register number modulo 4. On Kepler it is possible to arrange the 64 FFMA instructions to eliminate all bank conflicts; on Maxwell this is no longer possible.
Maxwell, however, provides something to make up for this, and at the same time offers the capability to significantly reduce register bank traffic and overall chip power draw: the operand reuse cache. The operand reuse cache has 8 bytes of data per source operand slot. An instruction like FFMA has 3 source operand slots. Each time you issue an instruction, there is a flag you can use to specify whether each of the operands is going to be used again, so the next instruction that uses the same register in the same operand slot will not have to go to the register bank to fetch its value. And with this feature you can see how a register bank conflict can be averted.
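Here is a small C model of those two rules (bank = register number modulo 4, one reuse slot per source operand position). The instruction encoding and the register numbers are made up for the example; only the rules come from the description above:

    #include <stdio.h>

    /* Toy model of the Maxwell operand fetch rules described above:
       - 4 register banks, bank = register number % 4
       - one small reuse slot per source operand position
       - a per-operand "reuse" flag keeps the fetched register in that slot,
         so the next instruction reading the same register in the same slot
         does not touch its bank.
       Counts how many times two operands of one instruction hit the same bank. */

    #define SLOTS 3   /* e.g. FFMA has 3 source operand slots */

    typedef struct {
        int reg[SLOTS];     /* source register number per slot, -1 = unused */
        int reuse[SLOTS];   /* reuse flag per slot                          */
    } Instr;

    static int run(const Instr *prog, int n)
    {
        int reuse_slot[SLOTS] = { -1, -1, -1 };  /* register cached per slot */
        int conflicts = 0;

        for (int i = 0; i < n; i++) {
            int bank_used[4] = { 0, 0, 0, 0 };
            for (int s = 0; s < SLOTS; s++) {
                int r = prog[i].reg[s];
                if (r < 0) continue;
                if (reuse_slot[s] == r) continue;   /* served from reuse cache  */
                int bank = r % 4;
                if (bank_used[bank]) conflicts++;   /* second fetch from a bank */
                bank_used[bank] = 1;
                reuse_slot[s] = prog[i].reuse[s] ? r : -1;
            }
        }
        return conflicts;
    }

    int main(void)
    {
        /* R4 and R8 share bank 0, so fetching both in one instruction conflicts
           unless R4 was flagged for reuse in its operand slot. */
        Instr no_reuse[] = {
            { {4, 1, 2}, {0, 0, 0} },
            { {4, 8, 2}, {0, 0, 0} },  /* R4 (bank 0) and R8 (bank 0) both fetched */
        };
        Instr with_reuse[] = {
            { {4, 1, 2}, {1, 0, 0} },  /* flag R4 for reuse in slot 0              */
            { {4, 8, 2}, {0, 0, 0} },  /* R4 comes from the reuse slot, no conflict */
        };
        printf("conflicts without reuse: %d\n", run(no_reuse, 2));
        printf("conflicts with reuse:    %d\n", run(with_reuse, 2));
        return 0;
    }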