GCN and mixed wavefronts

ieldra

Hi guys, I'll get right to the point. I'm wondering if you can have compute and graphics wavefronts in execution at the same time on different SIMDs within a CU.

This diagram from AnandTech really threw me off because I thought the waves/warps were the columns. So I thought you had one wave completing per cycle, when instead you have 4 quarter-waves completing per cycle (one per SIMD), i.e. each wave finishes in four cycles.

[image: AnandTech GCN SIMD scheduling diagram]


I'm assuming instr 1..4 are actually all the same, no idea who made this damned thing.

Anyway, AMD is pretty clear in their documentation that their 'async shaders' scheme is enabled by fast context switching (a dedicated cache within each ACE, AFAIK), but some people are telling me that you don't need a context switch at all and that each SIMD can be working on a wavefront from a different kernel.

Can someone shed some light on this and tell me where they get their info?

thanks in advance :)
 
Is a zero-cost/zero-overhead context switch still a context switch as per the classical definition? IOW, does it make sense to talk of context switches when all that's changing is some sort of flag indicating whether a wavefront originates from an ACE or from the Graphics Command Processor?
 
Why can't anything be nice and simple :p

I understand what you're getting at, but the wavefront must be cached (L2) if there's no context swap, and if it's cached then there was wasted cache, right? I hope you get what I mean. That's what I'm trying to wrap my head around. Having read the async shaders whitepaper a few times, I was pretty confident there is a context swap and the whole CU is assigned to a compute kernel from the ACEs. If this isn't the case, and you have mixed compute/graphics wavefronts on different SIMDs within the CU, then there wouldn't be context swaps unless the wavefronts in question weren't previously dispatched to the CU.

A context swap is a context swap if there's data from one kernel being dumped and data from another kernel being retrieved, at least by my humble standards.
 
Yes, the wavefront's data needs to be stored somewhere. But it would need storage anyway when it's ready for execution - no matter if it's seamlessly interleaved with other wavefronts or not. Maybe, just maybe, contention could be higher when intermixing additional wavefronts, leaving less wiggle room for others. But I'm guessing that the ACEs would have some kind of knowledge of the machine's filled-upness (is that a legit word?) and stall their issue when the rest of the machine has no free resources anyway.
 
One GCN CU can have up to 40 waves running concurrently (10 per SIMD). It doesn't matter where each wave originated. There can be any mix of pixel/vertex/geometry/hull/domain/compute shader waves in flight at the same time (from any number of queues). Instructions from these 40 waves are scheduled to the CU SIMDs in a round-robin manner. If one of these 40 waves is waiting for memory, the GPU simply jumps over it in the round-robin scheduling.

There is no need to store a wave's data in off-chip memory. Each CU has enough on-chip storage for the metadata of these waves.

Waves are grouped as thread groups. A single thread group needs to execute on a single CU (this is true for all GPUs, including Intel and Nvidia). This is because threads in the same thread group can use barrier synchronization and share data through LDS (a 64 KB on-chip buffer on each CU).

All GPU architectures use static register allocation. The maximum count of registers used during a shader's lifetime (even if some branch is never taken) needs to be allocated for the wave. Simplified: the GPU scheduler keeps track of available resources (free registers, free LDS, free waves) on each CU. When there are enough registers on some CU for a new thread group (thread group = 1 to 16 waves), the scheduler spawns a thread group on that CU. Each GCN CU has 256 KB of registers, 40 wave slots and 64 KB of LDS.

There is no need to context swap kernels (*). Each thread group is guaranteed to finish execution once started. The GPU programming model doesn't support thread groups waiting for other thread groups (atomics are supported, but the programmer is not allowed to write spinning locks). Nvidia GPUs work similarly. Intel has per-wave (they call them threads) register files; their register allocation works completely differently (the shader compiler outputs different SIMD widths based on register count).

(*) Exceptional cases might need context switch (= store GPU state to memory and restore later). Normal flow of execution doesn't.
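Here's a minimal sketch of that resource-tracking admission check, using the per-CU limits quoted above (40 wave slots, 256 KB of VGPRs, 64 KB of LDS). The structure and names are purely illustrative, not AMD's actual scheduler logic:

```cpp
#include <cstdio>

// Per-CU limits quoted above (GCN): 40 wave slots (10 per SIMD x 4 SIMDs),
// 256 KB of vector registers, 64 KB of LDS.
constexpr int kWaveSlots = 40;
constexpr int kVgprBytes = 256 * 1024;
constexpr int kLdsBytes  = 64 * 1024;

struct CuState {              // free resources on one CU
    int waves     = kWaveSlots;
    int vgprBytes = kVgprBytes;
    int ldsBytes  = kLdsBytes;
};

struct ThreadGroup {          // static allocation per group
    int waves;                // 1..16 (64..1024 threads)
    int vgprsPerThread;       // worst case over all branches
    int ldsBytes;
};

// Spawn the group on this CU only if *all* of its resources fit;
// once started, a group is guaranteed to run to completion.
bool trySpawn(CuState& cu, const ThreadGroup& g) {
    int vgprBytes = g.waves * 64 * g.vgprsPerThread * 4; // 64 threads/wave, 4 B/reg
    if (g.waves > cu.waves || vgprBytes > cu.vgprBytes || g.ldsBytes > cu.ldsBytes)
        return false;
    cu.waves     -= g.waves;
    cu.vgprBytes -= vgprBytes;
    cu.ldsBytes  -= g.ldsBytes;
    return true;
}

int main() {
    CuState cu;
    ThreadGroup group{1, 24, 0};   // hypothetical single-wave group, 24 VGPRs, no LDS
    int resident = 0;
    while (trySpawn(cu, group)) ++resident;
    std::printf("groups resident: %d\n", resident); // 40: wave slots run out first
}
```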
 

So let me get this straight: you can have wavefronts from different origins in flight on a single CU. A threadblock/threadgroup runs on only one unit due to LDS and sync, and dispatch is at the thread group level; the part about register allocation etc. is fine. You specifically mentioned there's no need to context swap *kernels*, and I possibly used the wrong term there - it doesn't need to be a kernel, just a context swap of sets of waves. Generally speaking, will you have graphics and compute waves executing in parallel on a single CU?
Threadblocks are guaranteed to finish once started, but can more than one be in execution at any given time? Obviously if the wavefronts in question are already in flight there's no context swapping involved (at least not when they're dispatched for execution), but this still doesn't satisfy my curiosity about the context switching being fast due to the ACEs. If the "context swap cache" I mentioned is actually LDS, then what do the ACEs have to do with it?

Thanks for your replies guys, appreciate it :)
 
Sebbbi, in what context does GCN support 16 wavefronts in a thread group (work group)?
16 waves * 64 threads/wave = 1024 threads. That's the biggest thread group size supported (required) by DirectX. 256 KB of registers / 1024 threads / 4 bytes per register = 64 registers/thread. If you need more than 64 registers per thread, the compiler is going to spill to memory (or LDS). A CU can run two 1024-thread (16-wave) thread groups at the same time if your shader needs 32 or fewer VGPRs.

The image printed on the wall of every console graphics programmer:
[table: GCN waves per SIMD vs. VGPRs per thread]
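That table reduces to one formula: waves per SIMD = min(10, 256 / VGPRs per thread), with VGPRs allocated in granules of 4 (the 64 KB per-SIMD register file gives 256 registers per lane). Here's a sketch that reproduces the well-known numbers - my own reconstruction, not an official tool:

```cpp
#include <algorithm>
#include <cstdio>
#include <initializer_list>

// Waves per SIMD as a function of VGPRs per thread (GCN).
// 64 KB VGPR file per SIMD = 256 registers per lane; VGPRs are
// allocated in granules of 4; hard cap of 10 waves per SIMD.
int wavesPerSimd(int vgprs) {
    int allocated = (vgprs + 3) / 4 * 4;   // round up to the granule
    return std::min(10, 256 / allocated);
}

int main() {
    for (int v : {24, 25, 32, 36, 40, 48, 64, 84, 128, 256})
        std::printf("%3d VGPRs -> %2d waves\n", v, wavesPerSimd(v));
    // Prints the familiar steps: 24->10, 25->9, 32->8, 36->7, 40->6,
    // 48->5, 64->4, 84->3, 128->2, 256->1.
}
```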
 
1024 threads (16 waves) is not an optimal thread group size for GCN. It doesn't evenly divide the CU's maximum concurrent thread count of 2560 (40 waves), so a 1024-thread group cannot achieve max occupancy (max latency hiding).

Good GCN thread group sizes (max occupancy): 64, 128, 256, 320, 512, 640.

But maximum occupancy is only achievable in shaders with no more than 256 KB / (2560 threads * 4 bytes/register) = 25 registers/thread.

Sometimes thread groups larger than 640 are the best choice, as they allow sharing data among bigger sets of threads. Max occupancy is not always equal to best performance.
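In code, the divisibility argument looks like this (a sketch assuming the shader isn't register- or LDS-limited):

```cpp
#include <cstdio>
#include <initializer_list>

// Maximum resident waves on a CU for a given thread group size,
// assuming the 40 wave slots are the only limit. Thread groups are
// indivisible, so a partial group can never be resident.
// (Full occupancy additionally needs <= 25 VGPRs/thread:
//  256 KB / (2560 threads * 4 bytes/register) = 25.6, rounded down.)
int residentWaves(int groupThreads) {
    int groupWaves = (groupThreads + 63) / 64;  // round up to whole waves
    int groups     = 40 / groupWaves;           // whole groups only
    return groups * groupWaves;
}

int main() {
    for (int t : {64, 256, 640, 1024})
        std::printf("%4d threads/group -> %2d of 40 waves\n", t, residentWaves(t));
    // 64, 256 and 640 all reach 40 waves; 1024 tops out at 32
    // (two 16-wave groups), so it can't hit max occupancy.
}
```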
 
Note that those "16"s are not related to each other at all since if you had a workgroup with 16 wavefronts, you couldn't store 16 workgroups in a CU :)

Mistaken numeric association on my part. The GCN architectural slides are more explicit in describing 16 work group barriers, assuming those are the same as described in the whitepaper.
 
Hmm, I think AMD has always limited work group size to 256 for OpenCL. I forgot that D3D is more generous.

The picture isn't really on topic though, since work group size isn't a factor there.

This might help, ieldra:

https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

I've looked at the whitepaper before, but it didn't answer my question. I'll take another look anyway.

I already knew each SIMD can have up to 10 waves in flight; now it's clear each SIMD executes a quarter-wave per cycle, rather than one wave per cycle spread over 4 SIMDs, and I know it's *possible* for them to be mixed compute/graphics.

But... as Sebbbi said, each thread group is guaranteed to execute on a single CU, each thread group is guaranteed to finish once started, and the scheduler assigns thread groups one at a time to CUs in round-robin. Presumably this applies to the ACEs just as well as to the GCP (graphics command processor).

Will the CU be executing waves from two different threadblocks simultaneously at any point, on different SIMDs?

As for the context switching, it obviously wouldn't be necessary if the scheduler identified beforehand a dependency causing latency for some threadblocks in a task and dispatched groups from a compute shader instead.

In the case it hadn't been accounted for beforehand, context switching is fast.

ACEs are microcode programmable, right? AMD has preprogrammed scheduling modes/routines, so I'm assuming the context switching is a sort of last resort in general cases, otherwise used when preempting for time-critical kernels?
 
Ieldra? One quarter-wave per cycle? Occupancy? GCN is a heavily multithreaded architecture, perfect for parallel computing (with split kernels, if we're speaking about OpenCL)... but of course this has required some optimizations on the graphics side (in the engines that were, and still are, used).

I don't know, but it seems most of the answers are already in your question. With GCN, never consider just one approach; GCN is a multi-approach resource...
 
If AMD hardware is anything like Nvidia (and I rather think it is), you can absolutely have multiple blocks resident on the same CU. I use this to allow wave sized blocks which avoid the need for syncthreads in some kernels.

Allowing blocks from different kernels on one CU at the same time is a much stickier issue. One has to remember that if this happens, you'll end up with a lot of contention over the instruction cache, which makes scheduling very tricky. Do you favor homogeneous loads on each CU, thereby saving instruction cache space (remembering that the compiler may be looking at the size of that cache to determine how far it can unroll loops and so forth)? Or do you favor heterogeneous loads, which can help when one task's footprint is conveniently sized to "fill in the gaps", or when you can offset a kernel that bottlenecks on some per-CU resource, like local memory bandwidth or texture cache?

Dynamic profiling, anyone?:devilish:
 
If AMD hardware is anything like Nvidia (and I rather think it is), you can absolutely have multiple blocks resident on the same CU. I use this to allow wave sized blocks which avoid the need for syncthreads in some kernels.
This is definitely true. You can have as many thread groups executing simultaneously on a single CU as resources allow. Resources in this case being waves (40 -> 2560 threads), registers (256 KB) and LDS (64 KB). Scalar register count can also be a limitation, but in practice never is. If a group's thread count is not divisible by the wave size (64), it is rounded up. This gives a maximum of 40 thread groups per CU (assuming single-wave groups).

DirectX programmers can't count on a fixed wave/warp size, since Nvidia = 32, AMD = 64 and Intel = {8,16,32}. A good compiler should remove syncthreads calls in kernels whose thread group size is no larger than the GPU's native wave/warp size.
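A sketch of those two points - rounding a group's thread count up to whole waves, and eliding barriers when the group fits in a single wave - using the vendor wave sizes listed above (the function name is mine):

```cpp
#include <cstdio>

// Wave sizes differ per vendor: Nvidia = 32, AMD = 64, Intel = {8,16,32}.
// A group whose thread count fits in a single wave runs in lockstep, so
// its barriers are elidable; thread counts that aren't a multiple of the
// wave size are rounded up to whole waves.
void analyze(const char* gpu, int waveSize, int groupThreads) {
    int waves = (groupThreads + waveSize - 1) / waveSize;
    std::printf("%-6s wave=%2d group=%4d -> %d wave(s), barrier %s\n",
                gpu, waveSize, groupThreads, waves,
                waves > 1 ? "needed" : "elidable");
}

int main() {
    analyze("AMD",    64,  64);  // 1 wave: barrier elidable
    analyze("Nvidia", 32,  64);  // 2 warps: barrier needed
    analyze("AMD",    64, 100);  // rounded up to 2 waves (128 threads)
}
```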
Allowing blocks from different kernels on one CU at the same time is a much stickier issue. One has to remember that if this happens, you'll end up with a lot of contention over the instruction cache. This means that scheduling can be very tricky. Do you favor homogeneous loads on each CU, thereby saving instruction cache space (remembering that the compiler may just be looking at the size of that cache to determine how far it can unroll loops and so forth), or do you favor heterogeneous loads, which can help when one task's footprint is conveniently sized to "fill in the gaps", or perhaps you can offset a kernel that bottlenecks on some per-CU resource, like local memory bandwidth, or perhaps texture cache.
Heterogeneous loads are a big performance win in many cases on GCN. There are many real world examples available (look at recent GDC/SIGGRAPH presentations).

Common examples are:
- One kernel is sampler bound and the other isn't. Example: parallax mapping (N trilinear/aniso taps for root finding), anisotropic filtering in general (multiple textures), bicubic kernels (3x3 bilinear taps per pixel), blur filters (N bilinear taps), etc. A pure compute task (no tex filter instructions) can fill the execution gaps nicely.
- One kernel uses lots of LDS (thread group shared memory) and this limits occupancy. Another kernel with no LDS usage raises it (a sketch with numbers follows below).
- One kernel is heavily memory (and L1 cache) bound, while the other is mostly math crunching (ALU instructions using registers and/or LDS).
- Tail of another kernel. Work finishes at wave granularity (after the last barrier). Waiting for the last wave to finish means that processing cycles are lost on each CU (every time the kernel changes).
- Resource bottlenecks (waves/registers/LDS). A kernel allocates some CU resources heavily, while other resources are left unused. It is better to schedule thread groups from multiple kernels (different shaders) to utilize the CU resources better.
- Uneven split of resources. There are cases where a single kernel doesn't evenly divide all CU resources (waves/registers/LDS) between thread groups (see my post above). The remainder is left unused. Better to schedule multiple kernels (with different resource counts) to use all the resources.

The most common heterogeneous task is executing vertex and pixel shaders on the same CU. Vertex + hull + domain shaders is another example. GCN is not limited to running multiple compute shaders concurrently on the same CU; you can have a mix of different graphics kernels and compute kernels.
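To put numbers on the LDS bullet above, a toy sketch (the 16 KB figure is made up purely for illustration):

```cpp
#include <cstdio>

// Toy model of the LDS example: kernel A's groups each take 16 KB of
// the 64 KB LDS (a made-up figure), capping A at 4 single-wave groups,
// far below the 40 wave slots. Kernel B uses no LDS, so its waves can
// fill the slots A leaves empty.
int main() {
    const int ldsPerGroupA   = 16 * 1024;
    const int wavesPerGroupA = 1;
    int groupsA = (64 * 1024) / ldsPerGroupA;   // LDS-bound: 4 groups
    int wavesA  = groupsA * wavesPerGroupA;     // 4 of 40 wave slots
    int wavesB  = 40 - wavesA;                  // kernel B fills the rest
    std::printf("kernel A: %d waves, kernel B: %d waves -> CU full\n",
                wavesA, wavesB);
}
```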
 
The most common heterogeneous task is executing vertex and pixel shaders on the same CU. Vertex + hull + domain shaders is another example. GCN is not limited to running multiple compute shaders concurrently on the same CU; you can have a mix of different graphics kernels and compute kernels.
Vertex + pixel is what I always consider when trying to wrap my head around how the scheduling would work. Under what conditions would the SIMDs likely be partitioned between different kernels executing in parallel? If a kernel is hitting some bottleneck that hinders utilization (be it register count, LDS, L1, or an FFH dependency), what determines whether another kernel will execute concurrently or in parallel? Is control over this exposed in any way?
Sorry if I sound like a broken record, I'm just really curious about this parallel mixed-wavefronts idea.
 
ACEs are microcode programmable, right? AMD has preprogrammed scheduling modes/routines, so I'm assuming the context switching is a sort of last resort in general cases, otherwise used when preempting for time-critical kernels?
ACEs won't really perform a context switch, to my understanding. They simply stall and a different ACE takes priority; that's what the HWS would handle. They should be able to switch, but you'd need more streams than ACEs before it would be worth considering.

Common examples are:
- One kernel is sampler bound and the other isn't. Example: parallax mapping (N trilinear/aniso taps for root finding), anisotropic filtering in general (multiple textures), bicubic kernels (3x3 bilinear taps per pixel), blur filters (N bilinear taps), etc. A pure compute task (no tex filter instructions) can fill the execution gaps nicely.
- One kernel uses lots of LDS (thread group shared memory) and this limits occupancy. Another kernel with no LDS usage raises it.
- One kernel is heavily memory (and L1 cache) bound, while the other is mostly math crunching (ALU instructions using registers and/or LDS).
- Tail of another kernel. Work finishes at wave granularity (after the last barrier). Waiting for the last wave to finish means that processing cycles are lost on each CU (every time the kernel changes).
- Resource bottlenecks (waves/registers/LDS). A kernel allocates some CU resources heavily, while other resources are left unused. It is better to schedule thread groups from multiple kernels (different shaders) to utilize the CU resources better.
- Uneven split of resources. There are cases where a single kernel doesn't evenly divide all CU resources (waves/registers/LDS) between thread groups (see my post above). The remainder is left unused. Better to schedule multiple kernels (with different resource counts) to use all the resources.
So here's a fun question: on GCN with an HWS, does the programmer explicitly have to handle complementary pairings? I'm assuming Pascal could do something similar in software, but relies on accurately predicting the results?

Vertex + pixel is what I always consider when trying to wrap my head around how the scheduling would work. Under what conditions would the SIMDs likely be partitioned between different kernels executing in parallel? If a kernel is hitting some bottleneck that hinders utilization (be it register count, LDS, L1, or an FFH dependency), what determines whether another kernel will execute concurrently or in parallel? Is control over this exposed in any way?
Sorry if I sound like a broken record, I'm just really curious about this parallel mixed-wavefronts idea.
SIMDs aren't partitioned. They simply have a number of waves assigned, which are processed round-robin, selecting the next "ready" wave. If one stalls it's simply skipped; if they all stall you have some downtime. I'm not 100% sure if they account for prioritization.
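A minimal sketch of that per-SIMD round-robin pick (illustrative only; the types and names are mine):

```cpp
#include <cstdio>
#include <vector>

// Round-robin wave selection on one SIMD, as described above: walk the
// wave slots in order starting after the last issue point and issue
// from the first wave that isn't stalled.
struct Wave { bool ready; };

// Returns the index of the next ready wave, or -1 if every wave is
// stalled (the SIMD simply idles that cycle).
int pickNext(const std::vector<Wave>& waves, int last) {
    int n = static_cast<int>(waves.size());
    for (int i = 1; i <= n; ++i) {
        int idx = (last + i) % n;
        if (waves[idx].ready) return idx;
    }
    return -1;
}

int main() {
    std::vector<Wave> simd = {{false}, {true}, {false}, {true}};
    int cur = 0;
    for (int step = 0; step < 4; ++step) {
        cur = pickNext(simd, cur);
        std::printf("issue from wave %d\n", cur);  // alternates 1, 3, 1, 3
    }
}
```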
 