Hardware implementation of threading models in contemporary GPUs

dr_ribit

Newcomer
I recently found these very interesting slides on GPU threading by Andy Glew, dated 2009. He describes various hardware models that can be used to implement hardware threading. One particularly interesting takeaway for me was what he calls the vector lane threading (SIMT) model, where different threads (lanes) have their own (potentially different) program counter, but the ALU can only execute one data-parallel instruction at a time. If I understand this model correctly, the scheduler will select a bunch of lanes with equal PC values and execute the corresponding instruction in a data-parallel fashion. Again, if I understood this correctly, the advantage is that you can save on expensive control logic (area + power) while still being able to execute divergent programs somewhat efficiently (especially if your ALUs are pipelined). Furthermore, this model can be extended to an (N)IMT model where the ALUs can execute up to N different instructions (grouped by masks), but here my understanding is a bit more hazy (does this ability only kick in if we have divergence, or is there something else going on?)
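Just to check that I'm reading the scheduling idea right, here is a toy Python sketch of how I imagine that per-lane-PC selection loop working. Everything in it (the lane/program representation, the "most common PC wins" policy) is my own guess for illustration, not something taken from the slides.

```python
# Toy model of "per-lane PC" SIMT scheduling: every lane keeps its own PC, but
# each cycle the ALU executes a single instruction, applied only to the lanes
# that currently share the chosen PC. Policy and encoding are made up.
from collections import Counter

def step(lane_pcs, program):
    """Pick the most common PC and execute that instruction for those lanes."""
    pc, _ = Counter(p for p in lane_pcs if p is not None).most_common(1)[0]
    active = [i for i, p in enumerate(lane_pcs) if p == pc]
    instr = program[pc]
    for lane in active:                      # one data-parallel instruction,
        lane_pcs[lane] = instr(lane, pc)     # applied to the selected lanes only
    return pc, active

# Each "instruction" just returns the lane's next PC (None = lane finished).
program = [
    lambda lane, pc: pc + 1 if lane % 2 == 0 else pc + 2,  # divergent branch
    lambda lane, pc: 3,                                    # taken path
    lambda lane, pc: 3,                                    # not-taken path
    lambda lane, pc: None,                                 # reconverged exit
]

pcs = [0] * 8                                # 8 lanes, all starting at PC 0
while any(p is not None for p in pcs):
    pc, active = step(pcs, program)
    print(f"executed PC {pc} for lanes {active}")
```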

Now, more than 10 years have passed since those slides were written and I was curious how things are done today. On one hand, it seems like modern GPUs (NVIDIA for a while, AMD since RDNA 3) are capable of limited superscalar execution under certain conditions. From what I gather these GPUs have two sets of data-parallel ALUs and can issue up to two instructions per cycle for each thread/lane, but how does this work in practice? E.g. what about dependency tracking and the like (I don't imagine GPUs use CPU-like reorder buffers, right?).

On the other hand, modern GPUs expose their SIMD nature more explicitly by offering instructions that operate across threads/lanes (e.g. warp/group vote, broadcast, shuffle etc.). I find it difficult to reconcile the existence of such instructions with the possibility that each thread has its own PC, as they simply wouldn't make any sense if hardware lanes can be in a different state of execution.

I had a quick look at the reverse-engineered Apple G13 documentation, which seems to me like a very straightforward in-order machine that uses traditional wide SIMD to implement scalar threading. Divergence appears to be handled via an execution mask, and the mask is controlled via a per-thread counter that stores how often a thread was "suspended" (e.g. failing an if condition increments the counter; a thread is masked if the counter is not zero). This doesn't seem at all like the SIMT that Andy Glew describes: there is only one PC for all the threads, and only the active set of threads is executed. It's just masked SIMD with some additional tricks for mask generation and control-flow tracking.

Then again, Apple GPUs are very simple compared to what Nvidia and AMD ship. Is there more detailed information, at a technical level (but still understandable to an amateur like myself), about their threading model? I read the white papers etc., but I can't shake the feeling that they are mostly marketing material that doesn't explain how things actually work. Like, yeah, you can dual-issue, I got that, but how do you do that exactly? Is your hardware capable of detecting the data dependencies, or is this some sort of VLIW where dependencies are tracked by the compiler, or some other scheme entirely?
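For reference, this is roughly how I picture the counter scheme from the G13 docs working. It's a deliberately non-nested toy model in Python; the "opcode" names and the else handling are mine, and real nested control flow would need extra bookkeeping that I've left out.

```python
# Toy model of a counter-based execution mask: one shared PC, and each thread
# keeps a small counter; the thread is active only while its counter is zero.
# Nested control flow is not handled here.

N_THREADS = 8
counters = [0] * N_THREADS                       # one per thread
active = lambda: [c == 0 for c in counters]      # active iff counter == 0

def if_cond(preds):
    # Active threads that fail the condition are "suspended" (counter 0 -> 1).
    for t, p in enumerate(preds):
        if counters[t] == 0 and not p:
            counters[t] = 1

def else_():
    # Swap sides: threads suspended by the if wake up, the others get suspended.
    for t, c in enumerate(counters):
        counters[t] = 0 if c == 1 else (1 if c == 0 else c)

def end_if():
    # Reconverge: pop one level of suspension.
    for t, c in enumerate(counters):
        if c > 0:
            counters[t] -= 1

if_cond([t % 2 == 0 for t in range(N_THREADS)])  # odd threads fail the test
print(active())   # [True, False, True, False, ...]
else_()
print(active())   # [False, True, False, True, ...]
end_if()
print(active())   # all threads active again
```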
 
I find it difficult to reconcile the existence of such instructions with the possibility that each thread has its own PC, as they simply wouldn't make any sense if hardware lanes can be in a different state of execution.
Modern GPUs execute instructions across a number of threads simultaneously in groups called waves/thread groups/subgroups (every API calls them something different.) Each GPU arch has different size(s), but typically these are in groups of 8-64, and yes each thread in that group has the same PC. It keeps the thread scheduler fairly simple, and opens up the possibility of subgroup operations like vote/broadcast/etc. that you mentioned.
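To make that last point concrete: once the whole wave shares a single PC, those subgroup operations are just horizontal operations over the lane registers. Here is a toy Python model of two of them, loosely following what the APIs call ballot and broadcast; it isn't meant to match any real ISA's exact semantics.

```python
# Cross-lane ("subgroup") ops as horizontal operations over a wave that shares
# one PC, under an execution mask. Semantics are simplified for illustration.

WAVE = 32

def ballot(exec_mask, preds):
    """One bit per active lane whose predicate is true."""
    bits = 0
    for lane in range(WAVE):
        if exec_mask[lane] and preds[lane]:
            bits |= 1 << lane
    return bits

def broadcast(exec_mask, values, src_lane):
    """Every active lane reads the value held by src_lane."""
    return [values[src_lane] if exec_mask[lane] else None for lane in range(WAVE)]

exec_mask = [lane < 16 for lane in range(WAVE)]      # half the wave is masked off
values = [lane * lane for lane in range(WAVE)]
print(hex(ballot(exec_mask, [v % 2 == 0 for v in values])))   # 0x5555
print(broadcast(exec_mask, values, src_lane=3)[:4])           # [9, 9, 9, 9]
```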

The idea of running all threads with independent PCs and detecting shared PCs to execute all together is interesting. It would indeed be optimal for divergent control flow, but would probably require a massively powerful (and power hungry) scheduler, likely why no one that I know of has gone with such an implementation.
 
Modern GPUs execute instructions across a number of threads simultaneously in groups called waves/thread groups/subgroups (every API calls them something different.) Each GPU arch has different size(s), but typically these are in groups of 8-64, and yes each thread in that group has the same PC. It keeps the thread scheduler fairly simple, and opens up the possibility of subgroup operations like vote/broadcast/etc. that you mentioned.

That's the basic knowledge, yes, but I was hoping to hear some more detail. For example, it doesn't explain how some GPUs can issue multiple instructions for the same thread simultaneously. Also, isn't Intel using a different type of architecture?

The idea of running all threads with independent PCs and detecting shared PCs to execute all together is interesting. It would indeed be optimal for divergent control flow, but would probably require a massively powerful (and power hungry) scheduler, likely why no one that I know of has gone with such an implementation.

Your statement seems to contradict the slides I linked, where Nvidia is described as using that kind of architecture, and that was over 10 years ago.

BTW, I saw that a similar question was posted on the Nvidia support forums and the Nvidia employee was basically like "yeah, it's more complicated but we don't talk about that…"
 
it doesn't explain how some GPUs can issue multiple instructions for the same thread simultaneously.
From my understanding, at least on AMD with VOPD, it's encoded into the instruction itself. It's like the compiler saying "these two multiplies don't depend on each other, so I'll write a single instruction that performs two independent multiplies".

If you go really far down into the details, GCN (pre-RDNA) had a pipelined "waterfall" approach where, even though from a software point of view there were 64 threads all working on the same instruction at the same time, at the silicon level it was only executing 16 threads on a given clock cycle. But after 4 clock cycles all 64 threads would be done, and there was no way to really tell what was going on without some architectural white papers. I'm sure NV and Intel have similar voodoo going on underneath the simplistic wavefront/subgroup model, but they've got a lot fewer white papers documenting this stuff than AMD has.
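Purely to illustrate the idea (this is a behavioural model, not how the silicon is actually organised), the waterfall amounts to something like this:

```python
# Toy timeline of the GCN-style "waterfall": software sees a 64-wide wave
# executing one instruction, the hardware runs it as four passes over a
# 16-lane SIMD on consecutive cycles.

WAVE_SIZE, SIMD_WIDTH = 64, 16

def execute_wave(instr, wave_values):
    for beat in range(WAVE_SIZE // SIMD_WIDTH):              # 4 beats
        lanes = range(beat * SIMD_WIDTH, (beat + 1) * SIMD_WIDTH)
        for lane in lanes:                                    # one 16-lane pass
            wave_values[lane] = instr(wave_values[lane])
        print(f"cycle {beat}: lanes {lanes.start}-{lanes.stop - 1} done")
    return wave_values

execute_wave(lambda x: x + 1, list(range(WAVE_SIZE)))
```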

Also, isn’t Intel using a different type of architecture?
Intel uses the same basic "all these threads are operating in lockstep on this PC" model as everyone else, but they have a much more flexible system for choosing wavefront size than AMD or NV. For example, their Vulkan driver reports a subgroup size of 32 for all graphics stages, and a user-configurable 8/16/32 size for compute shaders. But they have multiple execution modes for vertex and fragment shaders that they don't expose to the user. This blog post by a former Intel driver developer has a lot of great low-level details.

Your statement seems to contradict the slides I linked, where Nvidia is described as using that kind of architecture, and that was over 10 years ago.
And that's why I'm not a hardware designer, and I should just shut up about hardware design :)
 
And that's why I'm not a hardware designer, and I should just shut up about hardware design :)

Or maybe I misunderstood the slides. I just looked again and the basic SIMT algorithm described there seems to use only one PC after all (with a PC/execution-mask stack that stores the PCs of inactive threads). But then I don't understand why a distinction is made between SIMT and SIMD. Everyone seems to speak about SIMT like it's somehow architecturally different, but when I look into the details it's just SIMD where execution and data are treated "vertically" rather than "horizontally", plus some special instructions to handle masking and track control flow.
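For what it's worth, here is a toy Python model of that single-PC-plus-stack scheme as I now understand it from the slides. The branch/reconverge handling is heavily simplified and the representation is my own.

```python
# On a divergent branch, push the reconvergence point and the deferred path's
# mask, run one path under a reduced mask, and pop at reconvergence.

def branch(stack, exec_mask, taken, pc_taken, pc_not_taken, pc_reconv):
    taken_mask = [e and t for e, t in zip(exec_mask, taken)]
    fall_mask  = [e and not t for e, t in zip(exec_mask, taken)]
    if not any(taken_mask):                  # uniform: nobody took the branch
        return exec_mask, pc_not_taken
    if not any(fall_mask):                   # uniform: everybody took it
        return exec_mask, pc_taken
    stack.append((pc_reconv, exec_mask))     # where and with whom to reconverge
    stack.append((pc_not_taken, fall_mask))  # the deferred path
    return taken_mask, pc_taken              # run the taken path first

def reconverge(stack):
    pc, mask = stack.pop()
    return mask, pc

stack = []
mask, pc = branch(stack, [True] * 8,
                  taken=[i % 2 == 0 for i in range(8)],
                  pc_taken=10, pc_not_taken=20, pc_reconv=30)
print(mask, pc)            # even lanes active, PC 10
mask, pc = reconverge(stack)
print(mask, pc)            # odd lanes active, PC 20
mask, pc = reconverge(stack)
print(mask, pc)            # all lanes active again, PC 30
```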
 
In case anyone is interested in this, at least AMD's dual-issue on RDNA3 is not a mystery anymore. According to their ISA reference they use a limited form of VLIW which packs two operations into one instruction. There are some limitations, however, on which operands can be used. So it's much less exciting than I thought :) I wonder whether Nvidia does something similar or whether they have a more complex scheduler that can issue multiple instructions per clock from the same thread.
 
I wonder whether Nvidia does something similar or whether they have a more complex scheduler that can issue multiple instructions per clock from the same thread.
They don't need to. Their SIMDs are 16 wide and warps execute over 2 cycles. Each cycle an instruction can be issued to one of two SIMDs.
 
In case anyone is interested in this, at least AMD's dual-issue on RDNA3 is not a mystery anymore. According to their ISA reference they use a limited form of VLIW which packs two operations into one instruction. There are some limitations, however, on which operands can be used. So it's much less exciting than I thought :) I wonder whether Nvidia does something similar or whether they have a more complex scheduler that can issue multiple instructions per clock from the same thread.
The restrictions are more datapath-related than purely a scheduling concern (irrespective of hardware- or software-based scheduling).

Evidently, AMD does not seem inclined to expand their CU SIMD datapaths beyond the long-standing core of a "4 banks of 1R1W" VGPR VRF, which is slightly more than what is needed to sustain one 3-operand FMA per clock. Under this theme, the dual-issue capability in RDNA 3 (both VOPD and certain opcodes in Wave64) is rather there to squeeze as much as possible out of that datapath without expanding it.

It is a reasonable stance, since complexity generally grows exponentially the wider you go: more complicated operand routing/caching, and a higher probability of VRF bank conflicts to manage. All of these could work against your perf/area and perf/watt goals, while not being guaranteed to match or beat scaling out (i.e. more CUs).
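To make the datapath angle concrete, here is a heavily simplified Python sketch of why pairing is constrained by register banks. The bank function and the per-operand pairing check are assumptions for illustration; the actual RDNA 3 rules in the ISA reference are more detailed (they also constrain destination registers, for instance).

```python
# With VGPRs striped across 4 banks that each have a single read port, the two
# packed ops can only be co-issued if their corresponding reads don't land in
# the same bank on the same cycle. Simplified; not the exact RDNA 3 rules.

N_BANKS = 4
bank = lambda vgpr: vgpr % N_BANKS

def can_dual_issue(op_x_srcs, op_y_srcs):
    """Each bank can feed only one of the paired reads per cycle (simplified)."""
    for src_x, src_y in zip(op_x_srcs, op_y_srcs):
        if bank(src_x) == bank(src_y):
            return False                 # paired operands collide in one bank
    return True

print(can_dual_issue(op_x_srcs=[0, 5], op_y_srcs=[1, 6]))  # True: banks differ
print(can_dual_issue(op_x_srcs=[0, 5], op_y_srcs=[4, 6]))  # False: v0 and v4 share a bank
```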
 
They don't need to. Their SIMDs are 16 wide and warps execute over 2 cycles. Each cycle an instruction can be issued to one of two SIMDs.

I have some difficulty wrapping my head around it. Just to check that I got it right: when you say that 32-wide warps execute over 2 cycles, this still means that an instruction has to be issued only once, right? So if I have two instructions without dependencies, the timeline will look a bit like this?

Cycle | Scheduler           | SIMD1                           | SIMD2
1     | Issue instruction A | Start executing lower half of A | -
2     | Issue instruction B | Start executing upper half of A | Start executing lower half of B
3     | -                   | -                               | Start executing upper half of B
...   | (pipelined execution over n cycles)
n     | -                   | A is completed                  | -
n+1   | -                   | -                               | B is completed

Is this more or less how it works?
 
Yes, but with an asterisk: there are more than just two SIMDs in an SM that need to be scheduled.

But they will all be executing different warps, so it's taken care of by other schedulers, right? What I want to understand is how some GPUs achieve within-warp ILP without the expensive out-of-order machinery used in CPUs. Anyway, these are really cool tricks!
 
But they will all be executing different warps, so it's taken care of by other schedulers, right? What I want to understand is how some GPUs achieve within-warp ILP without the expensive out-of-order machinery used in CPUs. Anyway, these are really cool tricks!
There isn't much variance, since all modern GPUs still stick with an in-order pipeline.

It is either a hardware dependency checker (as part of the instruction buffer/issue stages, scanning ahead), compile-time scheduling that embeds hints into the instruction stream for the hardware to pick up, or VLIW.
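As a toy illustration of the first option, an in-order issue stage only needs something like the check below: a scoreboard of registers still being written by older instructions, plus an independence test for the candidate pair. This is only meant to show the kind of logic involved, not any specific GPU's implementation.

```python
# In-order dual-issue check: no out-of-order machinery, just a scoreboard of
# busy destination registers and a pairwise hazard test at issue time.

def independent(a, b):
    """No RAW/WAW/WAR hazard between the candidate pair."""
    return (a["dst"] not in b["srcs"] and      # RAW: b reads a's result
            b["dst"] != a["dst"] and           # WAW
            b["dst"] not in a["srcs"])         # WAR

def issue(pair, busy):
    a, b = pair
    if any(r in busy for r in a["srcs"] + [a["dst"]]):
        return []                              # stall: waiting on older results
    if independent(a, b) and not any(r in busy for r in b["srcs"] + [b["dst"]]):
        return [a, b]                          # dual-issue
    return [a]                                 # issue only the first, in order

busy = {"v7"}                                  # v7 still being written by an older op
i0 = {"dst": "v0", "srcs": ["v1", "v2"]}
i1 = {"dst": "v3", "srcs": ["v0", "v4"]}       # reads v0, so it depends on i0
i2 = {"dst": "v5", "srcs": ["v1", "v6"]}       # independent of i0
print(len(issue((i0, i1), busy)))              # 1: RAW hazard, single issue
print(len(issue((i0, i2), busy)))              # 2: dual issue
```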
 