AMD Vega 10, Vega 11, Vega 12 and Vega 20 Rumors and Discussion

Optimizing SGEMM for Maxwell:
https://github.com/NervanaSystems/maxas/wiki/SGEMM



This sounds very close to the LRF (Last Result File) from this paper: https://www.cs.utexas.edu/users/skeckler/pubs/micro11.pdf.

If you look at the LRF results, you see that a size=1 LRF already brings most of the gains of the whole system, but with a very limited amount of complexity.
The LRF with size=1 actually resembles the PV and PS (previous vector/scalar) pipeline register functionality of AMD's old VLIW architectures, which also contained the value of the last result of an ALU instruction. ;)
It was also software/compiler controlled and saved accesses to the register file (and avoided the bank conflicts that were possible with the VLIW architectures). I would claim it is pretty similar to nV's LRF.
edit: According to the linked paper, the LRF can be used a bit more flexibly. But the basic idea remains the same.
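edit 2: To put a rough number on the size=1 case, here's a toy Python model (my own illustration, not from the paper - the register names and instruction stream are made up): a single-entry last-result latch serves any operand that matches the previous instruction's destination, skipping the main register file (MRF) read.

```python
# Toy model of a size=1 last result file (LRF): the destination of the
# previous ALU instruction is kept in one latch, so a following
# instruction reading it skips the main register file (MRF) read.
# Instruction = (dest_reg, [src_regs]); register names are invented.

def count_mrf_reads(instrs, lrf_size=1):
    lrf = []                      # most recent results, newest last
    mrf_reads = saved = 0
    for dest, srcs in instrs:
        for s in srcs:
            if s in lrf:
                saved += 1        # operand forwarded from the LRF
            else:
                mrf_reads += 1    # operand must come from the MRF
        lrf = (lrf + [dest])[-lrf_size:]
    return mrf_reads, saved

# A dependent chain: each op consumes the previous result.
stream = [("v2", ["v0", "v1"]),
          ("v3", ["v2", "v0"]),
          ("v4", ["v3", "v1"]),
          ("v5", ["v4", "v2"])]   # v2 misses: it already left the size=1 LRF

print(count_mrf_reads(stream))    # (5, 3): 3 of 8 reads served by the LRF
```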
 
FWIW it's pretty interesting how this really works with Maxwell. The open-source driver just recently added support for the scheduling control bits and operand reuse cache bits - it's definitely necessary to get good performance out of it.
Anyway, if you're interested in the details the bits are here: https://cgit.freedesktop.org/mesa/mesa/commit/?id=f519c47f7d47d88ecf3b5e8f28fdffaa12f684d3
Maxwell's register file has many more bank conflicts than Kepler's if the operand cache is not used. Maxwell definitely needs a better compiler, as the cache is fully software controlled, not automated. 1.5x gains sound completely reasonable, but a 3.5x worst case perf hit (without the cache) is rather extreme. I'd guess the RF changes are even larger than people think. Would be nice to know more details.
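For illustration, a toy Python model of the problem (entirely my own assumptions - 4 banks, bank = register index % 4, one port each; the real Maxwell bank mapping isn't public): a software "reuse" flag pulls an operand from the cache instead of the register file, avoiding same-bank collisions.

```python
# Toy model of a banked register file with a software-controlled
# operand reuse cache. Assumed for illustration: 4 single-ported
# banks, bank = reg % 4; same-bank reads in one instruction collide.

NUM_BANKS = 4

def extra_cycles(instrs):
    """instrs: list of (src_regs, reuse_flags) pairs."""
    cache = set()                 # operands held in the reuse cache
    stalls = 0
    for srcs, reuse_flags in instrs:
        banks = []
        for reg, reuse in zip(srcs, reuse_flags):
            if reg in cache:
                continue          # served from the cache, no RF port used
            banks.append(reg % NUM_BANKS)
            if reuse:
                cache.add(reg)    # hold this operand for later instructions
        stalls += len(banks) - len(set(banks))   # same-bank collisions
    return stalls

# Same FFMA-like op twice; r4, r8, r0 all land in bank 0.
no_reuse   = [((4, 8, 0), (0, 0, 0)), ((4, 8, 0), (0, 0, 0))]
with_reuse = [((4, 8, 0), (1, 1, 0)), ((4, 8, 0), (0, 0, 0))]
print(extra_cycles(no_reuse), extra_cycles(with_reuse))   # 4 2
```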
 
Maxwell's register file has many more bank conflicts than Kepler's if the operand cache is not used. Maxwell definitely needs a better compiler, as the cache is fully software controlled, not automated. 1.5x gains sound completely reasonable, but a 3.5x worst case perf hit (without the cache) is rather extreme. I'd guess the RF changes are even larger than people think. Would be nice to know more details.
Don't forget this patch isn't just about the operand reuse cache - so the 3.5 times perf improvement isn't just from operand reuse avoiding bank conflicts.
 
Instead you could have a CU-wide load buffer that keeps the loaded data until it is copied to a VGPR just before its first use. Data would be copied from this buffer to the target VGPR when the corresponding s_waitcnt is reached. This would allow the compiler to use this VGPR for other purposes instead of serving as a dummy storage slot. This would practically increase the usable register count, as the average register lifetime would be much shorter. There would be no need to extend register lifetimes to hide memory latency. The separate (CU-wide) buffer would keep the incoming loads. This would actually allow more latency hiding, as the compiler could be more aggressive in moving loads away from their use. This kind of load buffer wouldn't need to be as fast as the registers, as loading data (and the s_waitcnt before reading it) is a much less frequent action than addressing a register.
This would change where VMCNT is decremented, since it would have to be tracked in the memory pipeline, which may add complexity I'll delve into later.
I think there's already some amount of buffering just to get the data from a 64-wide scatter/gather coalesced and moved into the register file.

How deep do you think this CU-wide buffer would need to be?
I suppose the worst case within one wavefront is firing off 15 vector memory ops to max out VMCNT, and then setting a waitcnt of 0. The ISA doc does say that execution can continue once the outstanding count is at the specified value or lower, but on the other hand issuing a 16th operation would exceed the currently documented representation for that value.

Potentially another reason for trying to move data into the register file sooner is to reduce the burden on the vector memory pipeline. If the waitcnt value hitting 0 became the point at which the program could continue, the vector memory path might be on the hook for up to 40*14*64*4 bytes of data before one more returning load could get one of the wavefronts to VMCNT=0 and data could start moving.
Anything less, and there might be deadlock where no wavefront can buffer enough loads to satisfy its waitcnt, or lower if something were done like N loads, *foo, workgroup_barrier, wait_cnt=0. (ed: if back-pressure throttles load issue)
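Worked out, that upper bound is already sizable (this is just the multiplication from the figures above, nothing more):

```python
# Back-of-envelope for the worst case described above: every resident
# wave parks at a waitcnt with 14 loads still outstanding.
waves_per_cu = 40   # max wavefronts resident on one CU
outstanding  = 14   # loads still pending per wave at the waitcnt
lanes        = 64   # lanes per wavefront
bytes_per_ln = 4    # one dword per lane

total = waves_per_cu * outstanding * lanes * bytes_per_ln
print(total, "bytes =", total // 1024, "KB per CU")   # 143360 bytes = 140 KB
```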

Moving data into the register file as soon as one row can be filled would allow the memory pipeline to vary in its capability, perhaps down to one register's worth of data at a time, without worrying about forward progress.

In terms of sequencing, a vector load is potentially updating information in the vector, scalar, and memory domains. With the current method, there's probably a queue of some kind already that allows the SIMD to start loading data concurrently into the file and then decrementing VMCNT. If s_waitcnt became the point where this starts happening, it might for one thing complicate the process, because now there's an implicit waitcnt for s_waitcnt: the count would reach 0 based on what happened in the memory pipeline, and now a SIMD needs to detect this and then arbitrate access to load/forward the data from another domain. The pipeline logic that works for forwarding with the hard-wired latency of the register file will not have lead time for when VMCNT decrements, which might mean an additional stall.

In other scenarios, there might be a new class of manual NOP states for s_waitcnt, since it would be initiating the update of the register file. Unlike other wait states on vector registers, it would be a more universal need for vector NOPs in front of a waitcnt.
 
In this paper, however, the main register file was fully clocked and ported (no perf reduction compared to ORF and LRF accesses). If the main register file were larger and further away, the 3-level design would be the best choice. I wonder how Nvidia managed to scale their register files 2x larger in Pascal P100 :)
Sorry, not sure if I am missing the joke (the internet and forums can cause lost-in-translation situations) or if it is a serious consideration, as it seems a bit of a fudge anyway and you probably know this.
The register file size is the same as Maxwell's at 256KB per SM, but they overcome part of the problem by doubling the number of SMs relative to the CUDA core count in the P100; however, this was only ever done with the P100.
The issue they seem to be looking to overcome in future is register (and related BW) pressure from multi-precision cores, for even greater accelerated dot-product flexibility.
Cheers
 
The issue they seem to be looking to overcome in future is register (and related BW) pressure from multi-precision cores, for even greater accelerated dot-product flexibility.
That paper also mentions a 40-70% reduction in accesses to the MRF. With any ~50% reduction I'd seriously be considering SMT in addition to the energy savings: oversubscribe the RF, feeding two execution units off the same narrow path, as each has an op cache. I'd assume this is what P100 is doing to double its numbers, but I haven't looked into it.

As for Vega, it's roughly how I was imagining the scalars being fed, if that pans out. The advantage with the scalar is that a single vector operand is sufficient to feed it.
 
If s_waitcnt became the point where this starts happening, it might for one thing complicate the process, because now there's an implicit waitcnt for s_waitcnt: the count would reach 0 based on what happened in the memory pipeline, and now a SIMD needs to detect this and then arbitrate access to load/forward the data from another domain. The pipeline logic that works for forwarding with the hard-wired latency of the register file will not have lead time for when VMCNT decrements, which might mean an additional stall.

In other scenarios, there might be a new class of manual NOP states for s_waitcnt, since it would be initiating the update of the register file. Unlike other wait states on vector registers, it would be a more universal need for vector NOPs in front of a waitcnt.
The time (in cycles) from a load to the next waitcnt tends to be small, because there is no reason to switch out the wave (the waitcnt is the reason). In the common case, the waitcnt instruction is reached (and issued) before the data is ready. Issuing the waitcnt instruction would inform the load buffer to start moving data to the registers (at this point you can be sure nobody accesses those registers anymore). After the waitcnt has informed the load buffer that the registers are now available, the wave would start waiting (as it does now), and obviously some other wave would take over the SIMD. There wouldn't be two waits or manual NOP states. The single waitcnt instruction would return when the data is ready in the registers, just like it does right now.

This would sometimes add more latency. But the L1 load latency is already roughly 100 cycles. A few extra cycles wouldn't matter much.
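To make the ordering concrete, here's a little Python sketch of the handshake I have in mind (all names invented; this is purely my proposal, not real hardware):

```python
from collections import deque

class LoadBuffer:
    """Sketch of the proposed per-SIMD load buffer handshake."""
    def __init__(self):
        self.slots = {}        # slot id -> returned data (None = in flight)
        self.drain = deque()   # (slot, target VGPR) copies queued by waitcnt

    def issue_load(self, slot):
        self.slots[slot] = None          # memory op in flight (vmcnt++)

    def memory_returns(self, slot, data):
        self.slots[slot] = data          # parked in the buffer, not in a VGPR

    def waitcnt(self, copies):
        # s_waitcnt marks the point where the target VGPRs are dead, so
        # the buffer may start draining into them; the wave then sleeps
        # and another wave takes over the SIMD.
        self.drain.extend(copies)

    def drain_one(self, vgprs):
        slot, vgpr = self.drain.popleft()
        vgprs[vgpr] = self.slots.pop(slot)   # fixed-latency copy into the RF
        return not self.drain                # wave wakes once drain is empty

lb, vgprs = LoadBuffer(), {}
lb.issue_load(0)
lb.memory_returns(0, 3.14)
lb.waitcnt([(0, "v7")])
print(lb.drain_one(vgprs), vgprs)            # True {'v7': 3.14}
```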
How deep do you think this CU-wide buffer would need to be?
I suppose the worst case within one wavefront is firing off 15 vector memory ops to max out VMCNT, and then setting a waitcnt of 0. The ISA doc does say that execution can continue once the outstanding count is at the specified value or lower, but on the other hand issuing a 16th operation would exceed the currently documented representation for that value.

Potentially another reason for trying to move data into the register file sooner is to reduce the burden on the vector memory pipeline. If the waitcnt value hitting 0 became the point at which the program could continue, the vector memory path might be on the hook for up to 40*14*64*4 bytes of data before one more returning load could get one of the wavefronts to VMCNT=0 and data could start moving.
Anything less, and there might be deadlock where no wavefront can buffer enough loads to satisfy its waitcnt, or lower if something were done like N loads, *foo, workgroup_barrier, wait_cnt=0. (ed: if back-pressure throttles load issue)
I think the easiest way would be full software management of this buffer, similar to the register file. A load instruction would specify both the target VGPR and a load buffer slot. The load buffer could be allocated just like the register file: the shader compiler outputs the peak usage. The load buffer should be private to the SIMD (one for each of the four SIMDs in a CU), as there is no data sharing between waves. Let's say the load buffer is 16 KB per SIMD (that is 1/4 of the SIMD's 64 KB register file size). Maximum occupancy (10 waves per SIMD) would allow a maximum of 6 loads in flight per wave (worth remembering that the max occupancy case only gives 24 VGPRs to each lane, so a saving of 6 VGPRs is a big deal). A more common occupancy of 8 would result in 8 load buffer slots per wave, and an occupancy of 4 would give 16 load buffer slots per wave. This is kind of nice behavior, since less occupancy means less latency hiding by other waves, so you need more latency hiding inside the shader (and the load buffer slots provide exactly that).
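The slot math above in code form (a trivial back-of-envelope, assuming one dword per lane per slot):

```python
# Load buffer slots per wave at different occupancies, assuming a
# 16 KB per-SIMD buffer; one slot = 64 lanes x 4 bytes = 256 bytes
# (one VGPR's worth of data).
buffer_bytes = 16 * 1024
slot_bytes   = 64 * 4

for occupancy in (10, 8, 4):
    slots = buffer_bytes // (occupancy * slot_bytes)
    print(f"{occupancy} waves/SIMD -> {slots} load slots per wave")
# 10 -> 6, 8 -> 8, 4 -> 16: fewer waves, more in-shader latency hiding
```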

But this proposal is getting close to a two-tier register file. Basically, in this proposal the further-away "register file" (called the "load buffer") is limited to keeping memory load data. It would likely be better to introduce mov instructions to manually move data between these two register file levels. This way the secondary register file could serve other purposes as well. It feels counter-intuitive to have a smaller, limited-purpose secondary register file, instead of having a bigger secondary register file and cutting the primary register file size down to save power (and reduce latency -> higher clocks).

Just throwing wild ideas around... After giving this more thought, I'd say a two-tier register file should always be the better solution. Loads would go to the furthest (largest & slowest) register file level, the one that's not directly connected to the execution units. All register file tiers should however still be private to the SIMD. There is no point in sharing resources that have no shared data (waves don't bounce between SIMDs). The biggest advantage of having SIMD-private register files (all levels) is fixed-latency movs between register file levels (no need for waitcnt). I would prefer that these movs used a separate port (co-issued with other waves) instead of wasting vector instruction slots. They could even require some wait states (like DPP source data) if that makes it possible to further simplify the second-tier (big) register file design (saving power). Two wait states after a load is nothing. Any half-good compiler should be able to find two independent instructions after a load (as loads are very infrequent in GPU shader code).

In the future we might have different-size SIMDs (for example 64, 32, 16) inside the same CU, with one selected based on the execution mask (popcnt). In this case the slowest register file could be shared between the SIMDs in the CU. This would increase power consumption, but would make moving waves between narrow and wide SIMDs easy. There was a paper about this somewhere...
 
If you look at the LRF results, you see that a size=1 LRF already brings most of the gains of the whole system, but with a very limited amount of complexity.
I'd be surprised if such games aren't already played. It would be simple to latch the past few results (say four) in the vector ALU for immediate use, deferring register reads (and possibly writes, if the results are intermediate).

Also,

The LRF with size=1 actually resembles the PV and PS (previous vector/scalar) pipeline register functionality of AMD's old VLIW architectures, which also contained the value of the last result of an ALU instruction.

Cheers
 
I'd be surprised if such games aren't already played. It would be simple to latch the past few results (say four) in the vector ALU for immediate use, deferring register reads (and possibly writes, if the results are intermediate).
I was thinking about the same. But the four-cycle cadence (16 lanes each) of the GCN SIMD already avoids all register file bank conflicts (just by splitting the 64 KB register file into four 16 KB register files that each serve 16 lanes - doing four reads at the same time = 3 cycles to fetch the 3 operands of a multiply-add, one cycle to execute). If registers were cached it would surely reduce power, but it wouldn't allow any register file simplification, because the strict 4-cycle cadence already demands zero bank conflicts. Nvidia clearly simplified their register file a lot in Maxwell, making it more prone to bank conflicts but more energy efficient. The single-slot operand cache avoided most of the bank conflict issues of the new register file (size=1 operand cache = around a 30% reduction in RF accesses) while at the same time saving power = win-win.

In my mental model of the 4-cycle GCN cadence, the write must be delayed (to the next execution cycle, as there's always a free bank during that cycle). I don't know if my model matches the actual GCN hardware at all... However, I realized that my mental model requires a one-slot register cache, as a written register can be immediately read by the subsequent instruction (GCN doesn't require any wait states in this case). This wouldn't even need any fancy logic. It is easy to see whether the previous instruction writes to the same register the next one is reading; in this case you could just skip the register file load. The write wouldn't be skipped, because somebody might read that register later. There are no "reuse" tags in GCN assembly - developers would see them immediately (AMD's PC tools show the assembly, and people have even modified it with OpenGL hacks). With "reuse" tags the write could also be skipped if the result was only used by the next instruction. Maybe GCN already has a single-slot register operand cache. But unfortunately it doesn't allow huge optimizations like Maxwell's, unless the 4-cycle cadence is not actually how the hardware operates.
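Here's my mental model written out as a per-cycle schedule (pure speculation - I don't know if real GCN hardware works this way):

```python
# Toy schedule for one 16-lane group and its single-ported 16 KB bank
# under the 4-cycle cadence: three operand reads, one execute cycle,
# and the previous instruction's writeback slipped into the execute
# cycle, when the bank port is otherwise idle. Speculative model.

def schedule(n_instr):
    events = []
    for i in range(n_instr):
        base = 4 * i
        events += [(base + 0, f"port: read src0   (instr {i})"),
                   (base + 1, f"port: read src1   (instr {i})"),
                   (base + 2, f"port: read src2   (instr {i})"),
                   (base + 3, f"alu:  execute     (instr {i})")]
        if i > 0:  # delayed writeback uses the otherwise idle port
            events.append((base + 3, f"port: write dest  (instr {i-1})"))
    return sorted(events)

for cycle, event in schedule(3):
    print(cycle, event)
# Exactly one port access per cycle: zero bank conflicts by construction.
```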
 
In my mental model of the 4-cycle GCN cadence, the write must be delayed (to the next execution cycle, as there's always a free bank during that cycle). I don't know if my model matches the actual GCN hardware at all... However, I realized that my mental model requires a one-slot register cache, as a written register can be immediately read by the subsequent instruction (GCN doesn't require any wait states in this case).

I'd be extremely surprised if a result couldn't be immediately forwarded in the same execution unit.

Given the structure of the CUs' register file and vector ALUs, local forwarding wouldn't buy any performance, just save power.

Nvidia seems to capitalize on the saved register file traffic by using it to feed more ALUs.

Cheers
 
Instead you could have a CU-wide load buffer that keeps the loaded data until it is copied to a VGPR just before its first use. Data would be copied from this buffer to the target VGPR when the corresponding s_waitcnt is reached. This would allow the compiler to use this VGPR for other purposes instead of serving as a dummy storage slot. This would practically increase the usable register count, as the average register lifetime would be much shorter. There would be no need to extend register lifetimes to hide memory latency. The separate (CU-wide) buffer would keep the incoming loads. This would actually allow more latency hiding, as the compiler could be more aggressive in moving loads away from their use. This kind of load buffer wouldn't need to be as fast as the registers, as loading data (and the s_waitcnt before reading it) is a much less frequent action than addressing a register.
It feels like they already did something like this. IIRC VMEM reads are guaranteed to be written back in program order. It is hard to imagine that the MLP (memory-level parallelism) would not be exploited, so a reordering facility for out-of-order memory accesses seems inevitable.
 
I'd be extremely surprised if a result couldn't be immediately forwarded in the same execution unit.

Given the structure of the CUs' register file and vector ALUs, local forwarding wouldn't buy any performance, just save power.

Nvidia seems to capitalize on the saved register file traffic by using it to feed more ALUs.
Exactly my thoughts as well.

The GCN execution model is different from anything else. The programmer/compiler doesn't need to care about RF bank conflicts or instruction latencies at all (except for special cases). And no scoreboarding either. All basic instructions execute in 1 cycle (4 cycles wall clock) and can use any input registers. At first it feels like GCN is leaving performance & efficiency on the table, but this 4-cycle cadence also allows a very efficient register file design (if my mental model is right, a single read port is enough on each 16 KB register file). Nvidia's more flexible execution model needs hardware & software solutions to avoid bank conflicts (including the operand cache). Nvidia's solution, however, allows more optimizations to be done at compile time. This is usually a good thing for power efficiency. But then again, AMD doesn't need to do similar optimizations at runtime at all. As a software engineer I don't know enough HW details to actually tell which design is more energy efficient. All we know is that Nvidia made a pretty good efficiency jump with Maxwell. But Maxwell also introduced the tiled rasterizer, and it's hard to say which contributes more in game workloads.
 
Nvidia seems to capitalize on the saved register file traffic by using it to feed more ALUs.

Cheers
They feed a similar number of CUDA cores as in the past (putting aside the general Pascal design expansion that increases this slightly to 5 SMs per GPC rather than the historical 4 SMs per GPC) but change the CUDA-core-to-SM ratio for their flagship Pascal; for P100 it is 64 cores per SM, while for every other Nvidia GPU since Maxwell it is 128 cores per SM.
The full-core GP102 P40 has 30 SMs with 3,840 cores, but the cut-down P100 has 56 SMs, giving it the largest GPU register file at 14.3MB for 3,584 cores.
Just mentioning this because it depends on whether a paper discusses Pascal generally or the P100 specifically; also, Jonah has mentioned in the past that other improvements are needed to take pressure off the register file and its bandwidth if they intend to give greater flexibility with mixed precision.

Edit:
NextPlatform table with GPU specs, but without the full-core GP102 P40, which has 30 SMs and so a smaller GPU register file.
[Image: nvidia-tesla-gpu-capabilities-table.jpg]

Cheers
 
They feed the same number of CUDA cores as in the past (putting aside the general Pascal design expansion that increases this slightly to 5 SMs per GPC rather than the historical 4 SMs per GPC) but change the CUDA-core-to-SM ratio; for P100 it is 64 cores per SM, while for every other Nvidia GPU it is 128 cores per SM.
The full-core GP102 P40 has 30 SMs with 3,840 cores, but the cut-down P100 has 56 SMs, giving it the largest GPU register file at 14.3MB for 3,584 cores.
Yes. P100 register files are the same size, but each register file serves only half as many threads (warps). Thus P100 has twice as many registers per thread (warp), and also twice as much groupshared memory per thread (warp). In the end, P100 has much larger register files (in total) than any other GPU. I am just wondering whether they managed to do this efficiently with a multi-tier register file. More registers per computational resource isn't cheap. I doubt they just brute-force scaled up their register files without improving the efficiency by some means.
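In numbers (just the public per-SM figures):

```python
# Public per-SM figures: the RF stayed at 256 KB per SM, but P100
# halved the CUDA cores per SM, doubling register capacity per core.
rf_per_sm_kb = 256
for name, cores_per_sm in (("GM200/GP102 SM", 128), ("GP100 SM", 64)):
    print(name, (rf_per_sm_kb * 1024) // cores_per_sm, "bytes of RF per core")
# GM200/GP102 SM 2048, GP100 SM 4096
```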
 
Yes. P100 register files are the same size, but each register file serves only half as many threads (warps). Thus P100 has twice as many registers per thread (warp), and also twice as much groupshared memory per thread (warp). In the end, P100 has much larger register files (in total) than any other GPU. I am just wondering whether they managed to do this efficiently with a multi-tier register file. More registers per computational resource isn't cheap. I doubt they just brute-force scaled up their register files without improving the efficiency by some means.
Not sure we are disagreeing, Sebbi; I am just pointing out how they do this with the P100 (which is also a distinction from every other Pascal GPU), and that it is a relationship to the SM rather than to the ALUs/cores. Anyway, I included the chart in the previous post, while you were responding, that shows why it relates to the SM.
Originally the discussion was about capacity/size/larger, on page 45.
The total register file size is SM count x 256KB for Maxwell and Pascal.

Cheers
 
Not sure we are disagreeing, Sebbi; I am just pointing out how they do this with the P100 (which is also a distinction from every other Pascal GPU), and that it is a relationship to the SM rather than to the ALUs/cores. Anyway, I included the chart in the previous post, while you were responding, that shows why it relates to the SM.
The P100 SM seems otherwise similar to the GM200 SM, except it only has half the CUDA cores. This still doesn't mean that only the CUDA core count changed. It might be that a new register file design allowed (or forced) them to make this change. Or it might simply be that P100 is designed mainly for FP64 workloads (FP64 registers take twice as much space). No matter what the reason, it is still clear that P100 would benefit a lot from more power-efficient register files (as the total register count of P100 is so large).
 
The P100 SM seems otherwise similar to the GM200 SM, except it only has half the CUDA cores. This still doesn't mean that only the CUDA core count changed. It might be that a new register file design allowed (or forced) them to make this change. Or it might simply be that P100 is designed mainly for FP64 workloads (FP64 registers take twice as much space). No matter what the reason, it is still clear that P100 would benefit a lot from more power-efficient register files (as the total register count of P100 is so large).
The difference is the SM count and the CUDA-core-to-SM ratio.
Only the P100 has 64 CUDA cores per SM; all other Pascal GPUs, including the full-core P40, and all Maxwell GPUs have 128 CUDA cores per SM. It is made a little more complicated in that for Pascal generally Nvidia moved to 5 SMs per GPC, whereas with Maxwell it was 4 SMs per GPC.
256KB x 56 SMs is how you get the GPU register file size for the P100.
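Spelled out (using the SM counts above; GP100's full die is 60 SMs, of which the P100 enables 56):

```python
# Total register file per GPU = SM count x 256 KB per SM.
kb_per_sm = 256
for name, sms in (("P100 (56 of 60 SMs enabled)", 56),
                  ("P40 (full GP102, 30 SMs)", 30)):
    print(name, sms * kb_per_sm, "KB")
# P100: 14336 KB (the ~14.3 MB quoted above); P40: 7680 KB (7.5 MiB)
```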
Cheers
 