I recently found these very interesting slides on GPU threading by Andy Glew, dated 2009. He describes various hardware models that can be used to implement hardware threading. One particularly interesting takeaway for me was what he calls the vector lane threading (SIMT) model, where different threads (lanes) have their own (potentially different) program counter, but the ALU can only execute one data-parallel instruction at a time. If I understand this model correctly, the scheduler will select a bunch of lanes with equal PC values and execute the corresponding instruction in a data-parallel fashion. Again, if I understood this correctly, the advantage is that you save on expensive control logic (area + power) while still being able to execute divergent programs somewhat efficiently (especially if your ALUs are pipelined). Furthermore, this model can be extended to an (N)IMT model where the ALUs can execute up to N different instructions (grouped by masks), but here my understanding is hazier (does this ability only kick in if we have divergence, or is there something else going on?).
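To check that I'm reading the model right, here is a toy sketch of what I imagine the scheduler doing (Python, purely my own illustration; the lane-selection policy and all the names are made up, not anything from Glew's slides): every lane keeps its own PC, and each cycle the scheduler picks one PC value, say the one shared by the most lanes, and issues that single instruction for exactly those lanes.

```python
# Toy sketch of per-lane-PC "vector lane threading" as I understand it.
# Purely illustrative; the policy (pick the most popular PC) is my assumption.
from collections import Counter

def step(lane_pcs, program):
    """One scheduling cycle: issue a single instruction for all lanes
    that happen to share the chosen PC value."""
    # Group lanes by their current PC and pick the most common one.
    popular_pc, _ = Counter(lane_pcs).most_common(1)[0]
    active = [i for i, pc in enumerate(lane_pcs) if pc == popular_pc]

    # One data-parallel instruction is executed, but only for the active lanes.
    op = program[popular_pc]
    print(f"issue {op!r} @ pc={popular_pc} for lanes {active}")

    # Each active lane advances its own PC (a branch could send it elsewhere).
    for i in active:
        lane_pcs[i] += 1
    return lane_pcs

# Example: 8 lanes, two of which have diverged to a different PC.
program = ["mul", "add", "store", "ret"]
pcs = [0, 0, 0, 2, 0, 0, 2, 0]
for _ in range(3):
    pcs = step(pcs, program)
```

If that is roughly right, the front end only ever has to fetch and decode one instruction per cycle (which is where the area/power saving would come from), and diverged lanes simply wait their turn.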
Now, more than 10 years have passed since those slides were written, and I was curious how things are done today. On one hand, it seems like modern GPUs (NVIDIA for a while, AMD since RDNA2) are capable of limited superscalar execution under certain conditions. From what I gather, these GPUs have two sets of data-parallel ALUs and can issue up to two instructions per cycle for each thread/lane, but how does this work in practice? For example, what about dependency tracking and the like (I don't imagine GPUs use CPU-like reorder buffers, right)?

On the other hand, modern GPUs expose their SIMD nature more explicitly by offering instructions that operate across threads/lanes (e.g. warp/group vote, broadcast, shuffle/shift, etc.). I find it difficult to reconcile the existence of such instructions with the possibility that each thread has its own PC, as they simply wouldn't make any sense if hardware lanes could be at different points of execution (the first sketch below shows what I mean).

I also had a quick look at the reverse-engineered Apple G13 documentation, which seems to me to describe a very straightforward in-order machine that uses traditional wide SIMD to implement scalar threading. Divergence appears to be handled via an execution mask, and the mask is controlled via a per-thread counter that stores how often a thread was "suspended" (e.g. failing an if condition increments the counter; a thread is masked if the counter is non-zero; see the second sketch below). This doesn't seem at all like the SIMT model Andy Glew describes: there is only one PC for all the threads, and only the active set of threads executes. It's just masked SIMD with some additional tricks for mask generation and control-flow tracking. Then again, Apple GPUs are very simple compared to what NVIDIA and AMD ship.

Is there more detailed information, on a technical level (but still understandable to an amateur like myself), about their threading model? I've read the white papers etc., but I can't shake the feeling that they are mostly marketing material that doesn't explain how things actually work. Like, yeah, you can dual-issue, I got that, but how exactly do you do that? Is the hardware capable of detecting data dependencies, or is this some sort of VLIW where dependencies are tracked by the compiler, or some other scheme entirely?
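Here is the first sketch: a toy model of how I picture warp-wide vote/broadcast behaving under a single shared PC plus an execution mask. Again this is Python and entirely my own illustration (warp size, names, and exact semantics are assumptions, not any vendor's actual ISA); the point is just that the result is only well defined because all participating lanes hit the instruction in the same cycle.

```python
# Toy sketch of warp-level "ballot"/broadcast as masked SIMD would do it.
# Purely illustrative; not modeled on any real instruction set.

WARP_SIZE = 8

def ballot(exec_mask, predicates):
    """Return a bitmask with one bit per lane, set where the lane is both
    active (not masked off) and its predicate is true."""
    result = 0
    for lane in range(WARP_SIZE):
        if exec_mask[lane] and predicates[lane]:
            result |= 1 << lane
    return result

def broadcast(exec_mask, values, src_lane):
    """Every active lane reads the value currently held by src_lane."""
    return [values[src_lane] if exec_mask[lane] else None
            for lane in range(WARP_SIZE)]

exec_mask = [True, True, False, True, True, True, False, True]  # 2 lanes masked off
preds = [x % 2 == 0 for x in range(WARP_SIZE)]
print(f"ballot = {ballot(exec_mask, preds):08b}")
print("broadcast from lane 3:", broadcast(exec_mask, list(range(WARP_SIZE)), 3))
```

With truly independent per-lane PCs it is unclear to me what the "current" predicate of a lane that is three instructions behind would even be, which is why these instructions make me suspect the active lanes must be marched in lockstep.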
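And the second sketch: a toy simulation of the counter-based execution mask as I understood it from the reverse-engineered G13 notes. This is my own reconstruction in Python and almost certainly simplified (in particular, the else/reconvergence handling is a guess on my part); it's only meant to show why this reads to me like plain masked SIMD with one shared PC rather than per-lane PCs.

```python
# Toy simulation of the counter-based execution mask described above.
# My reading of the reverse-engineered Apple G13 notes; simplified and
# probably wrong in the details.

WARP_SIZE = 4

class Warp:
    def __init__(self):
        # One counter per thread: 0 means active, >0 means "suspended"
        # that many nesting levels deep.
        self.counters = [0] * WARP_SIZE

    def active(self, lane):
        return self.counters[lane] == 0

    def branch_if(self, conditions):
        """An 'if': threads failing the condition get suspended (counter = 1),
        and every already-suspended thread sinks one level deeper."""
        for lane in range(WARP_SIZE):
            if self.counters[lane] > 0:
                self.counters[lane] += 1
            elif not conditions[lane]:
                self.counters[lane] = 1

    def branch_else(self):
        """An 'else': flip which threads of the innermost 'if' are suspended.
        (This part is my guess, not something I saw spelled out.)"""
        for lane in range(WARP_SIZE):
            if self.counters[lane] == 0:
                self.counters[lane] = 1
            elif self.counters[lane] == 1:
                self.counters[lane] = 0

    def end_if(self):
        """Reconvergence: pop one nesting level."""
        for lane in range(WARP_SIZE):
            if self.counters[lane] > 0:
                self.counters[lane] -= 1

w = Warp()
w.branch_if([True, False, True, False])
print("then-block active lanes: ", [l for l in range(WARP_SIZE) if w.active(l)])
w.branch_else()
print("else-block active lanes: ", [l for l in range(WARP_SIZE) if w.active(l)])
w.end_if()
print("after endif active lanes:", [l for l in range(WARP_SIZE) if w.active(l)])
```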