Hardware implementation of threading models in contemporary GPUs

dr_ribit

I recently found these very interesting slides on GPU threading by Andy Glew dated 2009. He describes various hardware models that can be used to implement hardware threading. One particularly interesting takeaway for me was what he calls the vector lane threading (SIMT) model, where different threads (lanes) have their own (potentially different) program counter, but the ALU can only execute one data-parallel instruction at a time. If I understand this model correctly, the scheduler will select a bunch of lanes with equal PC values and execute the corresponding instruction in a data-parallel fashion. Again, if I understood this correctly, the advantage is that you can save the expensive control logic (area+power), while still being able to execute divergent programs somewhat efficiently (especially if your ALUs are pipelined). Furthermore, this model can be extended to an (N)IMT model where ALUs can execute up to N different instructions (grouped by masks) — but here my understanding is a bit more hazy (does this ability only kick in if we have divergence? or is there something else going on?)

Now, more than 10 years have passed since those slides were written and I was curious how things are done today. On one hand, it seems like modern GPUs (NVIDIA for a while, AMD since RDNA2) are capable of limited superscalar execution under certain conditions. From what I gather these GPUs have two sets of data-parallel ALUs and can issue up to two instructions per cycle for each thread/lane, but how does this work in practice? E.g. what about dependency tracking and stuff like that (I don't imagine GPUs use CPU-like reorder buffers, right?).

On the other hand, modern GPUs expose their SIMD nature more explicitly by offering instructions that can operate across threads/lanes (e.g. warp/group vote, broadcast, shift etc. instructions). I find it difficult to reconcile the existence of such instructions with the possibility that each thread has its own PC, as they simply wouldn't make any sense if hardware lanes can be in a different state of execution.

I had a quick look at the reverse-engineered Apple G13 documentation, which seems to me like a very straightforward in-order machine that uses traditional wide SIMD to implement scalar threading. Divergence appears to be handled via an execution mask, and the mask is controlled via a per-thread counter that stores how often a thread was "suspended" (e.g. failing an if condition increases the counter; a thread is masked if the counter is not zero). This doesn't seem at all like the SIMT that Andy Glew describes — there is only one PC for all the threads, and only the active set of threads is executed. It's just masked SIMD with some additional tricks for mask generation and control flow tracking. Then again, Apple GPUs are very simple compared to what Nvidia and AMD ship.

Is there more detailed information, on a technical level (but still understandable to an amateur like myself), about their threading model? I read the white papers etc., but I can't shake the feeling that they are mostly marketing material that doesn't explain at all how things actually work. Like, yeah, you can dual-issue, I got that, but how do you do that exactly? Is your hardware capable of detecting the data dependencies, or is this some sort of VLIW where dependencies are tracked by the compiler, or some other scheme entirely?
 
I find it difficult to reconcile the existence of such instructions with the possibility that each thread has its own PC, as they simply wouldn't make any sense if hardware lanes can be in a different state of execution.
Modern GPUs execute instructions across a number of threads simultaneously in groups called waves/thread groups/subgroups (every API calls them something different.) Each GPU arch has different size(s), but typically these are in groups of 8-64, and yes each thread in that group has the same PC. It keeps the thread scheduler fairly simple, and opens up the possibility of subgroup operations like vote/broadcast/etc. that you mentioned.
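
For concreteness, here's roughly what those subgroup operations look like in CUDA (warp size 32; the kernel and buffer names are just made up for illustration); the Vulkan/Metal/HLSL subgroup and wave intrinsics are the same idea:

```
// Minimal CUDA sketch of warp-level (subgroup) vote and broadcast. These
// primitives only make sense because all 32 lanes named by `mask` reach this
// point together with the same PC.
__global__ void subgroup_demo(const float* in, int* vote_out, float* bcast_out)
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;              // lane index within the warp
    int warp = gid >> 5;                      // global warp index
    float x  = in[gid];

    unsigned mask = 0xffffffffu;              // all 32 lanes participate
    // Vote: one bit per lane, set if that lane's predicate is true.
    unsigned ballot = __ballot_sync(mask, x > 0.0f);
    // Broadcast: every lane reads lane 0's value of x.
    float leader = __shfl_sync(mask, x, 0);

    if (lane == 0) {
        vote_out[warp]  = __popc(ballot);     // how many lanes were positive
        bcast_out[warp] = leader;
    }
}
```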

The idea of running all threads with independent PCs and detecting shared PCs to execute all together is interesting. It would indeed be optimal for divergent control flow, but would probably require a massively powerful (and power hungry) scheduler, likely why no one that I know of has gone with such an implementation.
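
Just to make that cost concrete, a software model of such a scheduler might look like the sketch below (purely hypothetical, not how any shipping GPU works): every cycle it has to inspect all the lane PCs, group them, and pick a group to issue together.

```
// Hypothetical host-side model of a "group lanes by PC" scheduler. It just
// shows what the hardware would have to do every cycle: compare all lane PCs
// and find a group that can issue as one SIMD operation.
#include <cstdint>
#include <map>
#include <vector>

struct Lane { uint64_t pc; bool done; };

// Returns the lane indices to issue this cycle: the largest group of
// not-yet-finished lanes that currently share the same PC.
std::vector<int> pick_lanes(const std::vector<Lane>& lanes)
{
    std::map<uint64_t, std::vector<int>> by_pc;
    for (int i = 0; i < (int)lanes.size(); ++i)
        if (!lanes[i].done)
            by_pc[lanes[i].pc].push_back(i);

    std::vector<int> best;
    for (auto& entry : by_pc)
        if (entry.second.size() > best.size())
            best = entry.second;              // biggest PC-sharing group wins
    return best;   // the shared instruction is then executed for exactly these lanes
}
```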
 
Modern GPUs execute instructions across a number of threads simultaneously in groups called waves/thread groups/subgroups (every API calls them something different.) Each GPU arch has different size(s), but typically these are in groups of 8-64, and yes each thread in that group has the same PC. It keeps the thread scheduler fairly simple, and opens up the possibility of subgroup operations like vote/broadcast/etc. that you mentioned.

That’s the basic knowledge, yes, but I was hoping to hear some more detail. For example, it doesn’t explain how some GPUs can issue multiple instructions for the same thread simultaneously. Also, isn’t Intel using a different type of architecture?

The idea of running all threads with independent PCs and detecting shared PCs to execute all together is interesting. It would indeed be optimal for divergent control flow, but would probably require a massively powerful (and power hungry) scheduler, likely why no one that I know of has gone with such an implementation.

Your statement seems to contradict the slides I linked, where Nvidia is described as using that kind of architecture, and that was over 10 years ago.

BTW, I saw that a similar question was posted on the Nvidia support forums and the Nvidia employee was like “yeah, it’s more complicated but we don’t talk about that…”
 
it doesn’t explain how some GPUs can issue multiple instructions for the same thread simultaneously.
From my understanding, at least on AMD with the VOPD instructions, it's encoded into the instruction itself. Like the compiler says "these two multiplies don't depend on each other, so I'll write a single instruction that performs two independent multiplies".
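
As a rough source-level illustration (not AMD's actual codegen, and the kernel is made up): two independent multiplies like the ones below are the sort of thing the RDNA 3 compiler can pack into one dual-issue VOPD encoding such as v_dual_mul_f32, precisely because neither result feeds the other.

```
// Two independent multiplies with no data dependence between them. When
// compiled for RDNA 3 (e.g. via HIP), the compiler may encode them as a single
// VOPD dual-issue instruction; on other targets they remain two ordinary ops.
__global__ void independent_muls(const float* a, const float* b, float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float p = a[i] * 2.0f;   // candidate for one VOPD slot
    float q = b[i] * 3.0f;   // candidate for the other slot: no dependence on p
    x[i] = p;
    y[i] = q;
}
```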

If you go really down in the details, GCN (pre-RDNA) had a pipelined "waterfall" approach where even though from a software point of view there were 64 threads all working on the same instruction at the same time, at the silicon level it was only executing 16 threads at a time on a given clock cycle. But after 4 clock cycles all 64 threads would be done, and there was no way to really tell what was going on without some architectural white papers. I'm sure NV and Intel have similar voodoo going on underneath the simplistic wavefront/subgroup model, but they've got a lot fewer white papers documenting this stuff than AMD has.

Also, isn’t Intel using a different type of architecture?
Intel uses the same basic "all these threads are operating in lockstep on this PC" as everyone else, but they have a much more flexible system for choosing wavefront size than AMD or NV. For example their Vulkan driver reports a subgroup size of 32 for all graphics stages, and a user-configurable 8/16/32 size for compute shaders. But they have multiple execution modes for vertex and fragment shaders that they don't expose to the user. This blog post by a previous Intel driver developer has a lot of great low-level details.

Your statement seems to contradict the slides I linked, where Nvidia is described as using that kind of architecture, and that was over 10 years ago.
And that's why I'm not a hardware designer, and I should just shut up about hardware design :)
 
And that's why I'm not a hardware designer, and I should just shut up about hardware design :)

Or maybe I misunderstood the slides. I just looked again and the basic SIMT algorithm described there seems to use only one PC after all (with a PC/execution mask stack that stores the PCs of inactive threads). But then I don't understand why people draw a distinction between SIMT and SIMD. Everyone seems to speak about SIMT like it's somehow architecturally different, but when I look into the details it's just SIMD where execution and data are treated "vertically" rather than "horizontally", plus some special instructions to handle masking and track control flow.
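
To sketch what that PC/execution-mask stack buys you (a hypothetical model, not any particular GPU's scheme, and simplified to a single non-nested if/else):

```
// Hypothetical model of stack-based divergence handling: one PC for the whole
// wave, an active mask, and a stack of deferred (mask, pc) paths. Simplified;
// real schemes also record a reconvergence PC so that nested divergence
// restores the correct mask.
#include <cstdint>
#include <stack>

struct Deferred { uint32_t mask; uint64_t pc; };

struct Wave {
    uint64_t pc;                    // single PC shared by all lanes
    uint32_t active;                // bit i set => lane i executes
    std::stack<Deferred> stack;     // paths still waiting to run
};

// Divergent if/else: run the "else" lanes first, defer the "then" lanes.
void diverge(Wave& w, uint32_t cond_mask, uint64_t then_pc, uint64_t else_pc)
{
    uint32_t then_lanes = w.active & cond_mask;
    uint32_t else_lanes = w.active & ~cond_mask;
    if (then_lanes != 0 && else_lanes != 0) {          // genuine divergence
        w.stack.push({then_lanes, then_pc});           // deferred half
        w.active = else_lanes;
        w.pc = else_pc;
    } else {
        w.pc = (then_lanes != 0) ? then_pc : else_pc;  // uniform branch: no mask change
    }
}

// At the join point: if a half was deferred, switch to it now; otherwise both
// halves are done and the caller restores the pre-branch mask (reconvergence).
bool rejoin(Wave& w)
{
    if (w.stack.empty()) return false;                 // reconverged
    w.active = w.stack.top().mask;
    w.pc     = w.stack.top().pc;
    w.stack.pop();
    return true;                                       // deferred half still to run
}
```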
 
In case anyone is interested in this, at least AMD’s dual-issue on RDNA3 is not a mystery anymore. According to their ISA reference they use a limited form of VLIW which packs two operations into one. There are some limitations, however, on which operands can be used. So it’s much less exciting than I thought :) I wonder whether Nvidia does something similar or whether they have a more complex scheduler that can issue multiple instructions per clock from the same thread.
 
I wonder whether Nvidia does something similar or whether they have a more complex scheduler that can issue multiple instructions per clock from the same thread.
They don't need to. Their SIMDs are 16 wide and warps execute over 2 cycles. Each cycle an instruction can be issued to one of two SIMDs.
 
In case anyone is interested in this, at least AMD’s dual-issue on RDNA3 is not a mystery anymore. According to their ISA reference they use a limited form of VLIW which packs two operations into one. There are some limitations, however, on which operands can be used. So it’s much less exciting than I thought :) I wonder whether Nvidia does something similar or whether they have a more complex scheduler that can issue multiple instructions per clock from the same thread.
The restrictions are more datapath-related than solely a scheduling concern (irrespective of hardware- or software-based scheduling).

Evidently, AMD does not seem inclined to expand their CU SIMD datapaths beyond the long-standing core of a “4 banks of 1R1W” VGPR VRF, which is slightly more than enough to sustain one 3-operand FMA per clock. Under this theme, the dual-issue capability in RDNA 3 (both VOPD and certain opcodes in Wave64) is rather there to try to squeeze as much out as possible without expanding it.

It is a reasonable stance, since complexity generally trends exponentially the wider you go, including more complicated operand routing/caching and managing a higher probability of VRF bank conflicts. These all could work against your perf/area and perf/watt goals, while not being guaranteed to match or beat scaling out (i.e. more CUs).
 
They don't need to. Their SIMDs are 16 wide and warps execute over 2 cycles. Each cycle an instruction can be issued to one of two SIMDs.

I have some difficulty wrapping my head around it. Just to see if I got it correctly: when you say that 32-wide warps execute over 2 cycles, this still means that an instruction has to be issued only once, right? So if I have two instructions without dependencies, the timeline will look a bit like this?

cycle | scheduler           | SIMD1                           | SIMD2
1     | Issue instruction A | Start executing lower half of A | -
2     | Issue instruction B | Start executing upper half of A | Start executing lower half of B
3     |                     |                                 | Start executing upper half of B
...   | (pipelined execution over n cycles)
n     |                     | A is completed                  |
n+1   |                     |                                 | B is completed

Is this more or less how it works?
 
Yes, but with an asterisk: there are more than just two SIMDs in an SM group which need to be scheduled.

But they will all be executing different warps, so it’s taken care of by other schedulers, right? What I want to understand is how some GPUs achieve within-warp ILP without the expensive out-of-order machinery used in CPUs. Anyway, these are really cool tricks!
 
But they will all be executing different warps, so it’s taken care of by other schedulers, right? What I want to understand is how some GPUs achieve within-warp ILP without the expensive out-of-order machinery used in CPUs. Anyway, these are really cool tricks!
There isn’t much variance, since all modern GPUs still stick with an in-order pipeline.

It is either a hardware dependency checker (as part of the instruction buffer/issue stages, scanning ahead), compile-time scheduling that embeds hints into the instruction stream for the hardware to pick up, or VLIW.
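
A toy model of the first option, just to show how little machinery it needs compared to CPU-style out-of-order (all names made up): an in-order issue stage consults a per-register scoreboard and stalls the warp instead of reordering anything.

```
// Toy in-order issue with a per-register scoreboard: no reorder buffer, no
// renaming. An instruction issues only when none of its registers are still
// being produced by an in-flight instruction; otherwise the warp simply
// stalls and the scheduler picks a different warp that cycle.
#include <array>

struct Instr { int dst; int src0; int src1; int latency; };

struct Scoreboard {
    std::array<int, 256> ready_at{};        // cycle at which each register is ready

    bool can_issue(const Instr& in, int now) const {
        return ready_at[in.src0] <= now &&
               ready_at[in.src1] <= now &&
               ready_at[in.dst]  <= now;    // also wait for the destination (WAW)
    }
    void issue(const Instr& in, int now) {
        ready_at[in.dst] = now + in.latency;   // result available after latency
    }
};
```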
 
I recently found these very interesting slides on GPU threading by Andy Glew dated 2009. He describes various hardware models that can be used to implement hardware threading. One particularly interesting takeaway for me was what he calls the vector lane threading (SIMT) model, where different threads (lanes) have their own (potentially different) program counter, but the ALU can only execute one data-parallel instruction at a time. If I understand this model correctly, the scheduler will select a bunch of lanes with equal PC values and execute the corresponding instruction in a data-parallel fashion.
Nvidia refers to this programming model as "independent thread scheduling", where each lane in a warp features its own program counter ...
On the other hand, modern GPUs expose their SIMD nature more explicitly by offering instructions that can operate across threads/lanes (e.g. warp/group vote, broadcast, shift etc. instructions). I find it difficult to reconcile the existence of such instructions with the possibility that each thread has its own PC, as they simply wouldn't make any sense if hardware lanes can be in a different state of execution.
It's true that you can't assume the property of implicit intra-warp synchronization anymore for architectures such as Nvidia Volta and upwards which is why they're now selling programmers the idea of explicitly synchronized warp-level primitives as opposed to their legacy/implicitly synchronized warp-level primitives ...
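
In CUDA terms that shift looks like this (illustrative kernel, my own example): the _sync primitives take an explicit participation mask, and you ask for reconvergence with __syncwarp() instead of assuming it happens implicitly after a branch.

```
// Volta-and-later flavour: lanes of a warp are no longer guaranteed to sit at
// the same instruction, so warp-level primitives carry an explicit mask and
// reconvergence has to be requested rather than assumed.
__global__ void explicit_sync_demo(int* data, int* lanes_together)
{
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    if (lane % 2 == 0)
        data[gid] += 1;       // even lanes may be scheduled independently here

    __syncwarp();             // explicit reconvergence point for the warp

    // Which lanes are currently executing together? Under independent thread
    // scheduling this is not automatically all 32 after a branch, which is
    // exactly why the explicit __syncwarp() above exists.
    unsigned converged = __activemask();
    if (lane == 0)
        lanes_together[gid >> 5] = __popc(converged);
}
```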
I had a quick look at the reverse-engineered Apple G13 documentation, which seems to me like a very straightforward in-order machine that uses traditional wide SIMD to implement scalar threading. Divergence appears to be handled via an execution mask, and the mask is controlled via a per-thread counter that stores how often a thread was "suspended" (e.g. failing an if condition increases the counter; a thread is masked if the counter is not zero). This doesn't seem at all like the SIMT that Andy Glew describes — there is only one PC for all the threads, and only the active set of threads is executed. It's just masked SIMD with some additional tricks for mask generation and control flow tracking. Then again, Apple GPUs are very simple compared to what Nvidia and AMD ship.
It's as you described: they use a counter to track control-flow nesting to handle divergence, so the obvious takeaway is that not all GPU designs feature a Volta-style SIMT programming model. Let's revisit an earlier statement ...
Again, if I understood this correctly, the advantage is that you can save the expensive control logic (area+power), while still being able to execute divergent programs somewhat efficiently (especially if your ALUs are pipelined). Furthermore, this model can be extended to an (N)IMT model where ALUs can execute up to N different instructions (grouped by masks) — but here my understanding is a bit more hazy (does this ability only kick in if we have divergence? or is there something else going on?)
The clear benefit of independent thread scheduling is that it makes control flow divergence easy to implement, but this programming model isn't highly conducive to things like cross-lane operations, maximal reconvergence, or any SIMD programming abstraction thereof ...

This programming model can even come into play when there is no control flow divergence at all!
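
Going back to that counter scheme for a second, it's easy to model (hypothetical sketch, not the actual G13 encoding): a lane is active only while its counter is zero, failing a condition bumps the counter, the matching end-of-block decrements it, and nesting falls out for free without any per-lane PC.

```
// Hypothetical model of counter-per-lane divergence tracking (not the actual
// G13 encoding): a lane is active iff its counter is zero. Failing a condition
// "suspends" the lane by setting the counter to 1; lanes that were already
// suspended sink one level deeper, so nested ifs need no per-lane PC at all.
#include <cstdint>

struct CounterWave {
    uint8_t suspend[32] = {};                       // one counter per lane

    bool active(int lane) const { return suspend[lane] == 0; }

    // Entering an if-block with per-lane condition results.
    void push_if(const bool cond[32]) {
        for (int l = 0; l < 32; ++l) {
            if (suspend[l] > 0)  suspend[l]++;      // already inactive: nest deeper
            else if (!cond[l])   suspend[l] = 1;    // fails the test: suspend
        }
    }

    // Leaving the if-block: every suspended lane comes back up one level.
    void pop_if() {
        for (int l = 0; l < 32; ++l)
            if (suspend[l] > 0) suspend[l]--;
    }
};
```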
 
TLDR: NVIDIA conditionals are more expensive than AMD conditionals partly as a result of Volta SIMT which is awesome but overkill for graphics/AI, but they could probably make it ~free with a bit more effort.


This is a bump of a very old thread (pre-forum-closing-and-reopening) but a very interesting topic so not a bad thing
:) Sigh, I really want to go into a lot more detail about different alternatives here, but given my past employers and some of the things discussed, I really shouldn't.

Let me just say I think NVIDIA's scheme is brilliant but overkill for both graphics and AI (very likely more useful in some HPC workloads). You can also search for "OPTIMIZED SCHEDULING AND RESOURCE ALLOCATION FOR THREAD PARALLEL ARCHITECTURES" if you're curious what crazy optimisations Volta SIMT allows in theory (this is a PhD thesis by someone who is now at NVIDIA, and everything described is just software tricks on top of the existing Volta HW as far as I can tell - unclear whether NVIDIA actually does any of this in their compiler today or not).

There's also the story of the Vega "Arbitrary Divergent Control Flow" instructions (Section 4.6 of the ISA) which was apparently removed in RDNA but kept in CDNA so that's one of the many unintuitive differences between the two. It's a clever trick but NVIDIA's solution is clearly more elegant. And AMD's RDNA control flow support is less powerful than that, therefore less powerful than NVIDIA's, but modern GPUs are flexible enough you can emulate anything at terrible performance if you try hard enough, and more importantly: it doesn't matter whatsoever for any graphics or AI workload I've ever heard of.

NVIDIA's control flow instructions are quite expensive in terms of the number of instructions on the critical path: SETP to set the predicate, BSSY+BRA for the branch, and BSYNC after the branch to reconverge. On Volta that didn't really matter much because you had dual-issue (or rather: half-issue ALUs with 1 instruction/clk), so you could do an FMA and any other instruction for free. But starting with GA102, where we have 2 FMA pipelines, every instruction is (at least) a cycle, so you've wasted 3-4 FMAs on your conditional (unless you just predicate the ALU instructions and pay their cost no matter what, which only makes sense for extremely small conditional blocks). It doesn't matter for Hopper AI workloads, though, where the Hopper tensor core is a single instruction taking many cycles.
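
To illustrate the predication trade-off at the source level (the actual SASS the compiler emits will of course vary):

```
// Two ways to express the same small conditional. For a body this tiny the
// compiler will typically turn either form into predicated instructions or a
// select (no branch), so the SETP/BSSY/BRA/BSYNC overhead discussed above
// never appears; for a large body, the branchy form wins because inactive
// lanes skip the work entirely.
__global__ void tiny_conditional(const float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];

    // Branchy form: may compile to predicate + branch + reconverge for big bodies.
    float a;
    if (x > 0.0f) a = x * 2.0f;
    else          a = x * 0.5f;

    // Select form: both multiplies execute on every lane, result chosen afterwards.
    float b = (x > 0.0f) ? x * 2.0f : x * 0.5f;

    out[i] = a + b;
}
```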

The BRA instruction will inevitably take a cycle *if* the threads don't all take the same path because that requires reading and writing 2 registers (64-bit program counter). But in case everything goes through the same path, then in theory BSSY/BRA/BSYNC should only use resources "off the critical path" and NVIDIA could make it completely free if their instruction fetch/decode/issue hardware wasn't such an obvious weak point for them... it'll be interesting to see if they managed to improve that in Blackwell.
 
But starting with GA102, where we have 2 FMA pipelines, every instruction is (at least) a cycle, so you've wasted 3-4 FMAs on your conditional (unless you just predicate the ALU instructions and pay their cost no matter what, which only makes sense for extremely small conditional blocks). It doesn't matter for Hopper AI workloads, though, where the Hopper tensor core is a single instruction taking many cycles.
Is this not true pre-wgmma? mma/wmma still take several cycles to complete. Or is the difference here that wgmma is explicitly async?
 
Is this not true pre-wgmma? mma/wmma still take several cycles to complete. Or is the difference here that wgmma is explicitly async?
You're right, I think on A100 they got up to 1 MMA/WMMA instruction every 2 clocks, maybe that depends on the size/precision variant though... However, AI code has quite a lot of integer multiply-adds for address generation and load instructions, so the instruction decoder might still sometimes be extremely busy, but it's probably not a significant bottleneck (and the loops are typically unrolled so much in these kernels that the few branch instructions are negligible compared to the integer and memory instructions anyway).
 
There's also the story of the Vega "Arbitrary Divergent Control Flow" instructions (Section 4.6 of the ISA) which was apparently removed in RDNA but kept in CDNA so that's one of the many unintuitive differences between the two. It's a clever trick but NVIDIA's solution is clearly more elegant. And AMD's RDNA control flow support is less powerful than that, therefore less powerful than NVIDIA's, but modern GPUs are flexible enough you can emulate anything at terrible performance if you try hard enough, and more importantly: it doesn't matter whatsoever for any graphics or AI workload I've ever heard of.
I think they removed native HW support for unstructured control flow for reasons of both performance pragmatism and hardware implementation complexity. AMD gfx HW is natively restricted to structured control flow, since unstructured control flow can lead to reduced performance caused by register spilling around indirect function calls, and broad compatibility with the programming models of compute stacks (OpenCL/CUDA/etc.) isn't a part of their gfx HW design goals ...

The SIMT programming abstractions of CUDA/HIP are suboptimal for SIMD architectures like AMD's, since they leave performance on the table. Implicit warp-synchronous programming (SIMDs running in lockstep) and cross-lane operations (inter-lane communication & data exchange) are more natural expressions for those architectures. Modern AMD gfx HW design goes much further in terms of rich SIMD features like ds_permute/ds_bpermute/DPP/v_permlane than the warp-level primitives found in CUDA, and LDS memory is available in EVERY shader stage (not just in compute) in case DPP instructions don't cover some scenarios ...
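
For reference, this is the kind of cross-lane data exchange being described, written with CUDA's warp shuffles (my own example); on AMD the same pattern is what DPP and ds_permute/ds_bpermute style instructions implement at the ISA level.

```
// Warp-wide sum via butterfly shuffles: each step exchanges values between
// lanes whose indices differ by `offset`, with no trip through memory.
__global__ void warp_reduce_sum(const float* in, float* out)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[gid];

    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_xor_sync(0xffffffffu, v, offset);   // cross-lane exchange + add

    if ((threadIdx.x & 31) == 0)                        // one result per warp
        out[gid >> 5] = v;
}
```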
 