Dynamic register allocation in GPUs

Huh, curious that this approach ended up being efficient at all. It implies that the M3 is backing the register file with a set-associative cache rather than a simple adder+mux. There's definitely a trade-off there between better occupancy and increased complexity in the by-far hottest path, considering that you usually need multiple registers per instruction and thread. I can't see that paying off in terms of required transistors per cache line.
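
To make the contrast concrete, here is a toy Python sketch of the two addressing schemes; the classes, sizes, and replacement policy are purely illustrative and not a claim about Apple's actual hardware:

```python
# Toy contrast between static and dynamic register addressing.
# Everything here is illustrative; it is not Apple's actual design.

class StaticRegisterFile:
    """Classic scheme: base + index, i.e. one adder and one mux, no tags."""
    def __init__(self, size):
        self.cells = [0] * size

    def read(self, base, reg_index):
        return self.cells[base + reg_index]


class AssociativeRegisterCache:
    """Dynamic scheme: every access compares tags across the ways of a
    set, and a miss costs a trip to some backing store."""
    def __init__(self, num_sets, num_ways, backing):
        self.sets = [[(None, 0)] * num_ways for _ in range(num_sets)]
        self.backing = backing  # e.g. a dict standing in for the next tier

    def read(self, thread_id, reg_index):
        tag = (thread_id, reg_index)
        ways = self.sets[hash(tag) % len(self.sets)]
        for way_tag, value in ways:        # parallel tag compare in hardware
            if way_tag == tag:
                return value               # hit
        value = self.backing.get(tag, 0)   # miss: slow path
        ways[0] = (tag, value)             # naive replacement
        return value


rf = StaticRegisterFile(256)
print(rf.read(base=16, reg_index=3))

cache = AssociativeRegisterCache(8, 4, backing={(0, 3): 42})
print(cache.read(thread_id=0, reg_index=3))   # miss, then filled: 42
print(cache.read(thread_id=0, reg_index=3))   # hit: 42
```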

Is an associative cache that much more expensive compared to a traditional banked register file? It is my understanding that they still have operand caches for recently used registers. The rest seems functionally similar to the usual operand collectors, and is likely done at the pre-scheduling stage. There are enough waves in flight to hide cache access, and each instruction can be delayed until the data has been collected. Cache misses can probably be handled efficiently as well. Stall the instructions until you fall below a certain occupancy threshold, then invoke an interrupt that will shuffle the data around. Sure, it won’t be fast, but still likely much better than a round trip to the thread scheduler.
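
A minimal sketch of that scheduling idea, with the miss rate, the occupancy threshold, and all class structure invented for illustration:

```python
import random

OCCUPANCY_THRESHOLD = 0.5   # invented number, purely illustrative

class Wave:
    def __init__(self, n_instructions):
        self.remaining = n_instructions
    def has_work(self):
        return self.remaining > 0
    def operands_collected(self, cache):
        # stand-in for the operand collector: a miss just stalls this wave
        return cache.lookup(self)
    def issue_next(self):
        self.remaining -= 1

class RegisterCache:
    def lookup(self, wave):
        return random.random() > 0.1   # assumed 10% miss rate
    def run_slow_fill_handler(self):
        pass                           # shuffle data between cache tiers

def schedule(waves, cache):
    while True:
        active = [w for w in waves if w.has_work()]
        if not active:
            return
        ready = [w for w in active if w.operands_collected(cache)]
        if len(ready) < OCCUPANCY_THRESHOLD * len(active):
            cache.run_slow_fill_handler()   # the "interrupt" described above
            continue
        ready[0].issue_next()               # other ready waves hide the latency

schedule([Wave(100) for _ in range(8)], RegisterCache())
```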
 
> Is an associative cache that much more expensive compared to a traditional banked register file? It is my understanding that they still have operand caches for recently used registers. The rest seems functionally similar to the usual operand collectors, and is likely done at the pre-scheduling stage. There are enough waves in flight to hide cache access, and each instruction can be delayed until the data has been collected. Cache misses can probably be handled efficiently as well. Stall the instructions until you fall below a certain occupancy threshold, then invoke an interrupt that will shuffle the data around. Sure, it won’t be fast, but still likely much better than a round trip to the thread scheduler.
It's more a matter of competing goals. The register file needs to be accessed in one cycle. That means huge SRAM cells of something like 16 transistors, before even thinking about checking tags. So your cache can't have high associativity, because that's too slow for an RF, and it can't be all that high capacity either, because it's using humongous SRAM cells.
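
Back-of-the-envelope, using the textbook 6T figure for single-ported SRAM and the ~16T figure above for a multi-ported cell, the capacity penalty alone is stark; the slice size and tag geometry below are invented for illustration:

```python
# Rough transistor budget for a hypothetical 64 KB register file slice.
# 6T is the textbook single-port SRAM cell; 16T is the multi-ported
# figure from the comment above. All of this is back-of-the-envelope.
BITS = 64 * 1024 * 8

print(f"6T single-port: {BITS * 6 / 1e6:.1f} M transistors")
print(f"16T multi-port: {BITS * 16 / 1e6:.1f} M transistors")

# Tags add storage on top: e.g. a 16-bit tag per 128-bit line,
# stored in the same fat cells (numbers invented for illustration).
LINE_BITS, TAG_BITS = 128, 16
tag_bits = (BITS // LINE_BITS) * TAG_BITS
print(f"tag storage: {tag_bits * 16 / 1e6:.1f} M transistors extra")
```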

AFAIK operand cache hits are found by the compiler and baked into the instruction; the hardware isn't doing any checks at runtime.
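
Roughly how such a compile-time pass could look; the FIFO policy, cache size, and instruction encoding here are all made up:

```python
OC_SIZE = 4  # entries in the hypothetical operand cache

def mark_operand_cache_hits(instructions):
    """instructions: list of (dst_reg, [src_regs]). Returns the same
    list with each source annotated True if a small operand cache,
    modeled at compile time, would still hold it."""
    oc = []          # most-recent-first list of cached register numbers
    annotated = []
    for dst, srcs in instructions:
        annotated.append((dst, [(r, r in oc) for r in srcs]))
        # model the cache: results enter the OC, oldest entry falls out
        if dst in oc:
            oc.remove(dst)
        oc.insert(0, dst)
        del oc[OC_SIZE:]
    return annotated

prog = [(2, [0, 1]), (3, [2, 0]), (4, [3, 2])]
for inst in mark_operand_cache_hits(prog):
    print(inst)   # hit bits are fixed here, so no runtime tag checks
```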
 
> The register file needs to be accessed in one cycle.
Not just that. That access needs to be provided on 2 read ports and 1 write port just to keep the overall lower bound for instruction latency reasonable. Add on top one more read and write port, with looser latency bounds, for communication with the next cache tier. That's a lot to service in a single cycle, and the tag checks can't be trivially partitioned into banks under those constraints. The only way I could imagine it working is if the virtual address space for registers used a different hash scheme from global memory, such that at least the 3-4 fast ports rely on bank-internal associativity only, rather than having to provide associativity spanning the whole register file. That leaves 1 or 2 "slow" ports that need to be scheduled for checking their tags against the whole file, but each bank might just be able to check more than one tag per cycle - just not the whole 4-5.

Probably also slightly more banks than the 4 you would usually find on a register file that needs to support register-based access only.
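
One way to picture that bank-local scheme, with the hash function and sizes invented: derive the bank purely from the register number, so the fast ports only ever compare tags within one bank's handful of ways:

```python
NUM_BANKS, NUM_WAYS = 8, 2   # invented sizes

class BankedRegisterCache:
    def __init__(self):
        # one small fully-associative set per bank
        self.banks = [[(None, 0)] * NUM_WAYS for _ in range(NUM_BANKS)]

    @staticmethod
    def bank_of(reg):
        # register-only hash: two ports hitting different registers in
        # different banks never compete for the same tag comparators
        return reg % NUM_BANKS

    def read(self, thread, reg):
        for tag, value in self.banks[self.bank_of(reg)]:
            if tag == (thread, reg):
                return value          # bank-local hit, NUM_WAYS compares
        return None                   # miss: defer to a slow whole-file port

    def fill(self, thread, reg, value):
        bank = self.banks[self.bank_of(reg)]
        bank.insert(0, ((thread, reg), value))   # naive MRU placement
        del bank[NUM_WAYS:]

c = BankedRegisterCache()
c.fill(thread=0, reg=5, value=7)
print(c.read(thread=0, reg=5))   # 7, checked against 2 tags, not the whole file
```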
 
Given how large GPU register files are, I guess this works at least somewhat efficiently because most registers are, by the sheer numbers, rarely accessed. So the ability to spill would let you get away with a much smaller actual register file/L0/lowest level of the hierarchy.
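
A quick sanity check of that intuition, assuming (and this is purely an assumption) a Zipf-like skew in register accesses:

```python
import random

NUM_REGS, L0_SIZE, ACCESSES = 256, 16, 100_000
weights = [1 / (i + 1) for i in range(NUM_REGS)]   # assumed Zipf-like skew

l0, hits = [], 0
for reg in random.choices(range(NUM_REGS), weights, k=ACCESSES):
    if reg in l0:
        hits += 1
        l0.remove(reg)
    l0.insert(0, reg)   # simple LRU: most recent entry in front
    del l0[L0_SIZE:]

print(f"L0 hit rate with {L0_SIZE} of {NUM_REGS} registers: {hits / ACCESSES:.0%}")
```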

I wonder if this could work on a kind of instruction-block basis. You already know which registers are going to be accessed (apart from indexed accesses, if supported) up until the next branch, or even until the next possible stall. Rather than checking tags per instruction, you could do it at the beginning of each block. Within a block, then, instructions only access a limited set of registers. And at each block boundary you could switch to executing another warp/wave/simdgroup while doing cache transfers for the current one.
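
A sketch of that block-based idea, with all the plumbing invented: gather the block's register set once, kick off the fills, and hand execution to another wave while they complete:

```python
class RegisterCache:
    def __init__(self):
        self.resident = set()     # (wave_id, reg) pairs currently cached
        self.pending = set()

    def start_fill(self, wave_id, regs):
        self.pending |= {(wave_id, r) for r in regs}

    def complete_fills(self):     # models transfers finishing over time
        self.resident |= self.pending
        self.pending = set()

def block_registers(block):
    """block: list of (dst, [srcs]) up to the next branch or stall."""
    return {r for _, srcs in block for r in srcs} | {dst for dst, _ in block}

def try_run_block(wave_id, block, cache):
    missing = {(wave_id, r) for r in block_registers(block)} - cache.resident
    if missing:
        cache.start_fill(wave_id, {r for _, r in missing})
        return False              # caller switches to another ready wave
    for _ in block:
        pass                      # execute with plain indexed access, no tags
    return True

cache = RegisterCache()
block = [(2, [0, 1]), (3, [2, 0])]
print(try_run_block(0, block, cache))   # False: fills kicked off
cache.complete_fills()
print(try_run_block(0, block, cache))   # True: whole block runs tag-free
```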
 
> It's more a matter of competing goals. The register file needs to be accessed in one cycle. That means huge SRAM cells of something like 16 transistors, before even thinking about checking tags. So your cache can't have high associativity, because that's too slow for an RF, and it can't be all that high capacity either, because it's using humongous SRAM cells.

I don't see why this would need to be the case at all. GPUs have enough instructions in flight to wait a couple of cycles before data becomes available.

> AFAIK operand cache hits are found by the compiler and baked into the instruction; the hardware isn't doing any checks at runtime.

Doesn't Nvidia, at least, do wave reordering to minimize bank conflicts? AFAIK most manufacturers do something similar. They record the next instruction for multiple waves in a buffer and then schedule from that buffer based on available resources. You can start fetching the register data well before the instruction is dispatched.
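
A toy version of that pick logic, with the banking and encoding invented: each wave's buffered next instruction carries its source registers, and the scheduler dispatches the subset whose operands are ready and whose banks don't collide this cycle:

```python
NUM_BANKS = 4   # invented

def bank(reg):
    return reg % NUM_BANKS

def pick(buffered):
    """buffered: list of (wave_id, src_regs, operands_ready) entries."""
    used_banks, picks = set(), []
    for wave_id, srcs, ready in buffered:
        banks = {bank(r) for r in srcs}
        if ready and not (banks & used_banks):
            picks.append(wave_id)     # dispatch this wave this cycle
            used_banks |= banks       # reserve its register banks
    return picks

buffered = [(0, [1, 5], True), (1, [2, 6], True), (2, [1, 9], True)]
print(pick(buffered))   # [0, 1]: wave 2 collides with wave 0 on bank 1
```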
 