Really wide SIMT in GPUs

I was just going to ask about the future of really wide SIMT in GPUs, but I figure one thread on this design decision would be nice, so:
1. Why was really wide SIMT the direction GPUs took in the first place?
2. What advantages and disadvantages does really wide SIMT have nowadays?
3. Is really wide SIMT the future of GPUs? Has the industry stabilized on this basic architectural cornerstone, or is something different coming in the future?

Thanks for your input.
 
Wider SIMT benefits:
- Less hardware for scheduling/spawning waves (fewer waves to schedule)
- Less hardware for instruction fetch/decode/retire/etc (instructions are wave wide)
- Instruction cache BW is reduced (fewer N wide instructions to fetch)
- Instruction cache thrashing is reduced (in shaders that are long and branchy and don't fit the caches)
- Fewer (wider) memory load/store requests (to L1, L2 and memory)
- Fewer requests/less traffic between CU <-> texture sampling units (one request gets more samples)
- Coarser resource (register, LDS, thread) allocation granularity (simpler hardware)
- If a per wave scalar unit is present (to offload wave invariant math), fewer copies of that unit are needed.

Wider SIMT downsides:
- Branching occurs at wave granularity. More useless (execution masked) work is done in code that has non-coherent branches (see the CUDA sketch after this list).
- Coarser resource allocation/execution more often results in worse HW utilization (resource count divided by resource need leaves a nonzero remainder).
- Execution of a wave cannot proceed until all threads in it have data available. Wider waves have a higher probability of a cache miss (a single thread is enough to stall the whole wave).
- Stalled waves keep more resources (registers) allocated, making the GPU more often register bound. This in turn reduces the latency hiding capability of the GPU (causing more stalls).
- Shader stages (frequently HS/GS, often VS/DS, possibly PS) might have a limited amount of threads to run (or bursts of work). Waves that are not fully filled are sent to execution.
- Fully filling waves (PS/VS/GS/DS/HS) incurs extra pipeline latency, as a wave needs to wait for a higher count of pixels/vertices/patches from the fixed function units before it is 100% filled.
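To illustrate the first downside, here is a minimal CUDA sketch of wave-granularity branching (a hypothetical kernel, not from any post here; NVIDIA's "wave" is a 32-wide warp, but the masking behaviour is the same idea at 64 wide):

Code:
// Hypothetical kernel demonstrating divergence within a warp.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Data-dependent branch: if lanes of one warp disagree here, the
    // hardware runs BOTH paths, masking off the inactive lanes each time.
    // Worst case the warp pays for the expensive path even if only one
    // of its 32 lanes takes it.
    if (in[i] > 0.5f)
        out[i] = sqrtf(in[i]) * 2.0f;   // "expensive" path
    else
        out[i] = in[i] + 1.0f;          // cheap path
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (i % 3) / 3.0f; // incoherent pattern
    divergent<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}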

Not directly related, but an implementation detail of the widest (64 wide) GPU available: AMD GCN uses 16 wide SIMD hardware to execute 64 wide waves. Each instruction takes 4 cycles to execute. Round robin scheduling alternates between 4 waves, meaning that more simultaneous waves are needed to ensure that the execution units are 100% occupied.
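A quick back-of-the-envelope check of those numbers, as host-side C++ (the constants are just the ones stated above, not vendor documentation):

Code:
#include <cstdio>

int main()
{
    const int wave_width = 64;        // threads per wave (GCN wavefront)
    const int simd_width = 16;        // physical lanes per SIMD unit
    const int waves_in_rotation = 4;  // round robin between 4, per the post

    // A 64-wide wave on 16-wide hardware needs 4 cycles per instruction.
    int cycles_per_instr = wave_width / simd_width;  // = 4

    // With the 4-way round robin, keeping the execution units 100%
    // occupied needs at least this many simultaneously resident waves.
    printf("cycles per instruction: %d\n", cycles_per_instr);
    printf("minimum resident waves: %d\n", waves_in_rotation);
    return 0;
}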

I most likely forgot some things, especially low level hardware details that do not affect programmers directly.
 
Every pixel in a polygon runs the same pixel shader AFAIK... :p

How many pixels does a polygon have, on average?
I'm sure draw calls have more than one polygon on average. If it's only about the unique number of pixel shaders... why not go as wide as the average number of pixels per draw call instead?
 
I most likely forgot some things, especially low level hardware details that do not affect programmers directly.
If you have the time, could you try to list the things you forgot? Besides being interested myself, I wanted to make a thread that could be used as a reference by beginners and experts alike.

Not directly related, but an implementation detail of the widest (64 wide) GPU available: AMD GCN uses 16 wide SIMD hardware to execute 64 wide waves. Each instruction takes 4 cycles to execute. Round robin scheduling alternates between 4 waves, meaning that more simultaneous waves are needed to ensure that the execution units are 100% occupied.
So one wave takes a minimum of 16 cycles to complete, with a maximum of 4 waves completing in those 16 cycles, right?

- Shader stages (frequently HS/GS, often VS/DS, possibly PS) might have a limited amount of threads to run (or bursts of work). Waves that are not fully filled are sent to execution.
Can't this be a good/alright thing? Although you lose out on ALU utilization, the amount of resources that need to be fetched is smaller, so a wave will take less time to execute and a new wave can therefore start sooner.

- Fully filling waves (PS/VS/GS/DS/HS) incurs extra pipeline latency, as a wave needs to wait for a higher count of pixels/vertices/patches from the fixed function units before it is 100% filled.
Could this be OK as well, since it would reduce contention for ALU resources in a single CU? (Although I don't really know how work is mixed between different CUs; I think it's called a CU.)
 
Can't this be a good/alright thing? Although you lose out on ALU utilization, the amount of resources that need to be fetched is smaller, so a wave will take less time to execute and a new wave can therefore start sooner.
You probably don't want bits of your GPU running idle regularly; that would mean you built a bad, poorly engineered chip... :p
 
So one wave takes a minimum of 16 cycles to complete, with a maximum of 4 waves completing in those 16 cycles, right?

Isn't it 64 cycles per wave (16*4), or 1 cycle of virtual throughput per thread? There are a few instructions with longer latencies, transcendental ones for example.
 
Isn't it 64 cycles per wave (16*4), or 1 cycle of virtual throughput per thread? There are a few instructions with longer latencies, transcendental ones for example.
Well, I was going for the minimum, so I assumed 1 cycle per instruction, and from sebbbi's post I got 4 * (64/16 = 4) = 16.
In other words: wave1phase1, wave2phase1, wave3phase1, wave4phase1, wave1phase2, wave2phase2, ... with 4 phases total.
 
You probably don't want bits of your GPU running idle regularly; that would mean you built a bad, poorly engineered chip... :p
That's kind of the reason for the topic in the first place: why the really wide SIMT? But in regards to what I said, I was weighing an underutilized wave that takes fewer cycles against a fully utilized wave that takes more cycles to complete due to the gathering of memory resident resources. Memory resident resources take longer (multiple cycles) to gather, so utilization might wind up being equal or lower for a fully utilized wave versus a partially utilized one, although given the same amount of work, greater wave utilization seems to be the right answer (it is the common sense answer). But if you combine the previous with the following quote:
- Stalled waves keep more resources (registers) allocated, making the GPU more often register bound. This in turn reduces the latency hiding capability of the GPU (causing more stalls).
The interplay between competing waves comes into play. I wonder if there are any "use cases" which make the interplay more important?
 
Well, I was going for the minimum, so I assumed 1 cycle per instruction, and from sebbbi's post I got 4 * (64/16 = 4) = 16.
In other words: wave1phase1, wave2phase1, wave3phase1, wave4phase1, wave1phase2, wave2phase2, ... with 4 phases total.

I'm confused, is it now 1 or 4 cycles per instruction? This post suggests that the four phases run pipelined, which means it's not even 16 but 8 cycles for a wavefront to complete, or 4 cycles if you run longer programs and the pipeline keeps being fed.
 
I'm confused, is it now 1 or 4 cycles per instruction? This post suggests that the four phases run pipelined, which means it's not even 16 but 8 cycles for a wavefront to complete, or 4 cycles if you run longer programs and the pipeline keeps being fed.
For AMD hardware... On a SIMD, wave 1 runs all 4 phases before wave 2 runs. If there's another ALU op, instruction 2 starts 4 cycles after instruction 1 for wave 1. This is true for most instructions; some have a longer latency.
 
...I was weighing an underutilized wave that takes fewer cycles against a fully utilized wave that takes more cycles to complete due to the gathering of memory resident resources. Memory resident resources take longer (multiple cycles) to gather, so utilization might wind up being equal or lower for a fully utilized wave versus a partially utilized one, although given the same amount of work, greater wave utilization seems to be the right answer (it is the common sense answer).
The less utilized wave will use the same amount of register space as the fully utilized one. Hardware does not execute threads, it executes waves (of N threads, let's assume 64). A single register is wave wide (64x 32 bits). If the PS/VS packs fewer than 64 pixels/vertices into a single wave, the rest of the lanes (in the wave) will be masked out (similarly to those lanes that fail a branch test). A single VGPR register still has 64 lanes (it always does), meaning that some register lanes are unused. Register file space usage per useful work increases as the packing utilization decreases. A wave with a single pixel/vertex uses 64x register file space per actual processed pixel/vertex compared to a fully packed wave (that processes 64 vertices/pixels). So it is not only a loss of ALU to pack waves poorly, it also wastes register file equally badly.
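To put rough numbers on that register waste, a tiny host-side C++ sketch (the 24-VGPR shader is a made-up figure; the 64 lanes and 32-bit registers are as described above):

Code:
#include <cstdio>

int main()
{
    const int lanes          = 64;  // lanes per wave
    const int bytes_per_lane = 4;   // one 32-bit VGPR lane
    const int vgprs          = 24;  // hypothetical register count per shader

    // The full wave-wide allocation is made regardless of packing.
    int bytes_per_wave = lanes * bytes_per_lane * vgprs;

    for (int active = 64; active >= 1; active /= 4) {
        // Register file cost per *useful* lane grows as packing drops.
        printf("%2d active lanes -> %d bytes of VGPR per useful lane\n",
               active, bytes_per_wave / active);
    }
    return 0;
}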

As you notice, in some corner cases, packing fewer threads into a single wave might be a benefit. Example: you have a PS wave with 60 pixels already packed from a single (big) triangle. These pixels are local in screen space and UV space (reads & writes likely share cache lines). You have 4 other pixels (with the same shader) from four separate single pixel triangles that are all far away (from each other and from the 60 pixels) in both UV and screen coordinates. Packing these into the same 60 pixel wave slows down that wave a lot. It would likely be better to run that wave at 60/64 capacity and gather some more random pixels to form a single slow wave (that is practically guaranteed to miss the cache for every read & write). It usually does not cause a longer wait to miss the cache for N lanes (compared to just one) in a single wave, as the GPU can simultaneously serve all the cache misses from a single wave. So it is a good idea to pack together work items that have random memory accesses (and, likewise, items that share cache lines). Of course, predicting memory access patterns before running the shader code is often impossible for more complex shaders.

Old GPUs (before DX9) were optimized to prefetch data by interpolated UVs (from the VS). With no PS you could not modify the UV per pixel (it was fully known). Also, in PS1.4 (DX8.1) you had to split the shader into two parts if you had dependent texture reads (UV was calculated). See http://www2.ati.com/developer/ATIMeltdown01Pixel.PDF pages 14+. DX 8.0 (PS 1.0-1.3) didn't allow arbitrary math on UVs. Some DX9 GPUs (and mobile GPUs) continued to optimize for non-dependent texture reads. Modern GPUs no longer do that.
 
I'm confused, is it now 1 or 4 cycles per instruction? This post suggests that the four phases run pipelined, which means it's not even 16 but 8 cycles for a wavefront to complete, or 4 cycles if you run longer programs and the pipeline keeps being fed.
I really don't know; I was just making assumptions out of thin air. I guess I have some research to add to the queue, unless someone else knows better.

For AMD hardware... On a SIMD, wave 1 runs all 4 phases before wave 2 runs. If there's another ALU op, instruction 2 starts 4 cycles after instruction 1 for wave 1. This is true for most instructions; some have a longer latency.
Oh, thank you for correcting me (and I suppose it makes sense now that I think about it). Do you know the minimum number of cycles for 4 waves?
edit - wait, my memory blanked... Didn't sebbbi state that the waves are scheduled alternately in a round robin fashion?
 
The less utilized wave will use the same amount of register space as the fully utilized one.
Yes, but if the less utilized wave takes fewer cycles, the same amount of registers isn't allocated for the same amount of time, thus freeing them to be utilized by another wave, right? Or is allocation handled differently?
A wave with a single pixel/vertex uses 64x register file space per actual processed pixel/vertex compared to a fully packed wave (that processes 64 vertices/pixels). So it is not only a loss of ALU to pack waves poorly, it also wastes register file equally badly.
Of course, no argument here. Just wondering about utilization and the interplay of different waves in regards to register allocation over time.
It usually does not cause a longer wait to miss the cache for N lanes (compared to just one) in a single wave, as the GPU can simultaneously serve all the cache misses from a single wave.
You mean because of multiple memory channels, right?
 
Yes, but if the less utilized wave takes fewer cycles, the same amount of registers isn't allocated for the same amount of time, thus freeing them to be utilized by another wave, right? Or is allocation handled differently?
Yes, the wave will finish as soon (or sooner) and the registers will be freed sooner, because the execute masked lanes skip all memory reads & writes and thus reduce the likelihood of a stall on every memory instruction. As an interesting tidbit, AVX2 gather also has support for execution masking (a branchless way to skip loads of lanes, saving BW); see the sketch below.
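For the curious, here is what that masked AVX2 gather looks like in C++ intrinsics (the table, indices and mask are made up for the example; compile with -mavx2):

Code:
#include <immintrin.h>
#include <cstdio>

int main()
{
    alignas(32) int table[16];
    for (int i = 0; i < 16; ++i) table[i] = i * 10;

    __m256i idx  = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14);
    __m256i src  = _mm256_set1_epi32(-1);  // value kept for masked-off lanes
    // Lane mask: only lanes whose top bit is set actually load; the rest
    // skip the memory access entirely (branchless lane disable).
    __m256i mask = _mm256_setr_epi32(-1, -1, 0, 0, -1, 0, -1, 0);

    __m256i r = _mm256_mask_i32gather_epi32(src, table, idx, mask, 4);

    alignas(32) int out[8];
    _mm256_store_si256((__m256i*)out, r);
    for (int i = 0; i < 8; ++i) printf("%d ", out[i]);
    printf("\n");
    return 0;
}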

In the worst case a 64 wide wave with a single pixel shader invocation (63 lanes unused) can stall as many times as a 64 wide wave packed full of PS invocations. In this case the full 64 wide wave has done 64 times as much work in the same time. In the best case, the single PS invocation (63 lanes unused) wave would finish much sooner with no memory stalls (if some other wave touched the same cache lines just before), however it would still require 64x more registers per performed work (albeit for a shorter time) and do 64x less work (but still use the execution units and burn power).

Partially filled waves are only a good idea in some corner cases (as described above) and sometimes with VS/GS/HS/DS as you can be primitive setup bound waiting for the waves to be filled. With small triangles, it is best to execute the vertex waves as soon as possible, because the GPU utilization is bursty (not fully utilized all the time). You want to start the PS waves as soon as possible to give the GPU enough meaningful work to fill all the CUs.

If the GPU waves are frequently underutilized, a narrower wave width would be the best choice to improve performance and to reduce power usage. Currently it seems that the optimum is somewhere around 16-64 threads per wave. Different GPU architectures have made different technical choices that affect the optimum. For example, AMD's scalar unit and resource descriptor based sampling model make wider waves better for their hardware. NVIDIA is stuck at 32 wide waves, as CUDA has exposed the wave width from the beginning, and the majority of optimized CUDA code would break if the wave width changed (see the sketch below). Luckily, 32 seems to be a very good wave width.
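A minimal CUDA sketch of why the 32-wide warp is baked into optimized code: this classic shuffle reduction hard-codes the warp width (the kernel is illustrative, not from the discussion above):

Code:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warpSum(const int* in, int* out)
{
    int v = in[threadIdx.x];
    // Butterfly reduction across one warp. The offsets 16..1 and the full
    // 32-bit lane mask silently assume a 32-wide warp; on a different
    // width this code would return wrong results.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    if (threadIdx.x == 0) *out = v;
}

int main()
{
    int *in, *out;
    cudaMallocManaged(&in, 32 * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < 32; ++i) in[i] = i;
    warpSum<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();
    printf("sum = %d (expect 496)\n", *out);
    cudaFree(in); cudaFree(out);
    return 0;
}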
You mean because of multiple memory channels, right?
It takes an awfully long time for a load instruction to reach the physical DDR/GDDR memory. First the L1 cache is checked (before that, the store forwarding buffer is checked on CPUs); if the line is not found, a message is sent to L2 (coherency protocol). If the L2 fails to find the line (and it is not in the L1 cache of another core), a memory read request is sent to the memory controller. The memory controller also buffers the requests. Each core (or CU in GPUs) can issue some maximum number of memory requests per clock. These numbers are well documented for CPUs, but the GPU numbers are not. A controlled microbenchmark would of course give us exact memory request concurrency numbers for each GPU brand; a sketch of such a benchmark follows.
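A minimal sketch of such a microbenchmark in CUDA, assuming a dependent pointer chase timed with clock64() (this is a hypothetical starting point, not a finished concurrency benchmark):

Code:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chase(const int* next, int steps, long long* cycles, int* sink)
{
    int p = 0;
    long long t0 = clock64();
    // Each load depends on the previous one, so latency is not hidden.
    for (int i = 0; i < steps; ++i)
        p = next[p];
    long long t1 = clock64();
    *cycles = t1 - t0;
    *sink = p;  // keep the chase from being optimized away
}

int main()
{
    const int n = 1 << 22;    // large enough to spill past L2
    const int stride = 1021;  // odd stride to defeat prefetching
    int* next;
    long long* cycles;
    int* sink;
    cudaMallocManaged(&next, n * sizeof(int));
    cudaMallocManaged(&cycles, sizeof(long long));
    cudaMallocManaged(&sink, sizeof(int));
    for (int i = 0; i < n; ++i) next[i] = (i + stride) % n;

    const int steps = 100000;
    chase<<<1, 1>>>(next, steps, cycles, sink);
    cudaDeviceSynchronize();
    printf("~%.1f cycles per dependent load\n", (double)*cycles / steps);
    // Measuring *concurrency* would repeat this with k independent chains
    // per thread/warp and watch where throughput stops scaling with k.
    cudaFree(next); cudaFree(cycles); cudaFree(sink);
    return 0;
}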
 
Oh, thank you for correcting me (and I suppose it makes sense now that I think about it). Do you know the minimum number of cycles for 4 waves?
edit - wait, my memory blanked... Didn't sebbbi state that the waves are scheduled alternately in a round robin fashion?
If those four waves are on different SIMDs in the same CU, the first wave will finish its instruction on clock 4, while the other waves' instructions finish on clocks 5, 6 and 7. That's for one instruction. They schedule round robin between SIMDs.

The absolute minimum is longer than this.
 
thank you for correcting me (and I suppose it makes sense now that I think about it). Do you know the minimum number of cycles for 4 waves?
edit - wait, my memory blanked... Didn't sebbbi state that the waves are scheduled alternately in a round robin fashion?
The instruction arbitration issues only one vector instruction per clock in a round-robin manner across the four 16-wide SIMDs. The wavefront pools of the four SIMDs are apparently independent of each other, so there is no cross-SIMD issuing.

The SIMD hardware generally executes a vector instruction in lockstep through its 4-cycle instruction pipeline. Say, when the first batch of 16 lanes is at cycle 2 of the pipeline, the second batch would be at cycle 1. By the time the next instruction can issue, the last batch has not yet completed its trip through the pipeline, but the first batch always has. This means back-to-back issue is safe unless your operation is cross-lane, or is not pipelined in the same way as the SIMD (a small timeline sketch follows).
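A tiny host-side C++ timeline of the issue pattern described in the last two posts (four 16-wide SIMDs, one vector issue per clock, 4-cycle pipeline, as stated above):

Code:
#include <cstdio>

int main()
{
    const int simds = 4, pipe_depth = 4;
    // Wave w (one per SIMD) gets its instruction issued on clock w and,
    // with a 4-deep lockstep pipeline, completes it on clock w + 4.
    for (int w = 0; w < simds; ++w)
        printf("SIMD %d: issue @ clock %d, complete @ clock %d\n",
               w, w, w + pipe_depth);
    // Back-to-back issue to the same SIMD then lands every 4 clocks, so a
    // long ALU-only program sustains one instruction per 4 cycles per wave
    // while the four SIMDs together retire one instruction per clock.
    return 0;
}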
 