GCN and mixed wavefronts

ACEs are microcode programmable, right? AMD has preprogrammed scheduling modes/routines, so I'm assuming context switching is a sort of last resort in general cases, otherwise used when preempting for time-critical kernels?
The ACEs are not, but the HWS are. The (original) ACEs used to be hardwired to fetch and decode up to 8 PM4 command streams each. (The predecessor of the ACEs used in the console APUs only have access to one stream each.)

The fetch, decoding and queue scheduling are now offloaded to the dual-threaded HWS processor, which is programmable. The 4 remaining ACEs themselves now only handle the actual dispatch (a single internal queue per ACE, and a dispatch rate of two threadgroups per cycle). How the HWS distributes the workload to the ACEs is also up to the firmware.

Depending on the generation of the HWS, the size of the microcode varies. I think the APUs had a rather large one to begin with, but the first generation on dedicated GPUs was lacking in this regard. This results in different capabilities: handling more buffer formats (it's no longer limited to PM4), or executing more complex scheduling algorithms.

So prior to the introduction of the HWS, you actually needed at least 8 command queues to achieve full dispatch rate by utilizing all 8 ACEs - since the introduction of the HWS you no longer do. The 4 ACEs together with the dual-threaded HWS processor can reach full dispatch rate even with only a single command queue.
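
To make the queue-count claim concrete, here is a toy throughput model (the per-ACE rate of the old design and the idle behaviour are my assumptions; only the two-threadgroups-per-cycle figure comes from the text above):

[code]
# Toy model: dispatch throughput of the old hardwired ACEs vs the HWS frontend.

def classic_aces(queues, aces=8, rate=1, cycles=1000):
    """Hardwired ACEs: an ACE with no queue assigned to it simply idles."""
    return min(queues, aces) * rate * cycles

def hws_frontend(queues, aces=4, rate=2, cycles=1000):
    """HWS fans out any number of queues (>= 1) to all dispatch-only ACEs."""
    return (aces if queues >= 1 else 0) * rate * cycles

print(classic_aces(queues=1))  # 1000 -> 7 of 8 ACEs idle
print(classic_aces(queues=8))  # 8000 -> full rate needs 8 queues
print(hws_frontend(queues=1))  # 8000 -> full rate from a single queue
[/code]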

How the ACEs themselves dispatch the wavefronts to the CUs is also configurable by software. AFAIK, it supports two different modes: either schedule round-robin to all vacant CUs first, or fill each single CU to the top before passing on. I don't know which mode is used by default on which platform, or which APIs even expose this toggle.
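
A minimal sketch of the two distribution policies described above (CU counts, slot counts and function names are invented for illustration; the real arbitration is in hardware/firmware):

[code]
# Sketch of the two (assumed) wavefront-distribution modes: spread vs pack.

def round_robin_fill(num_groups, num_cus):
    """Spread mode: hand each successive group to the next CU in turn."""
    loads = [0] * num_cus
    for g in range(num_groups):
        loads[g % num_cus] += 1
    return loads

def pack_fill(num_groups, num_cus, slots_per_cu):
    """Pack mode: fill one CU to its wave-slot limit before moving on."""
    loads = [0] * num_cus
    cu = 0
    for _ in range(num_groups):
        if loads[cu] == slots_per_cu:
            cu += 1                     # current CU full -> next one
        loads[cu] += 1
    return loads

print(round_robin_fill(10, num_cus=4))           # [3, 3, 2, 2]
print(pack_fill(10, num_cus=4, slots_per_cu=8))  # [8, 2, 0, 0]
[/code]

The spread mode maximizes per-group cache and bandwidth headroom; the pack mode leaves whole CUs vacant for other work.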
 
To my understanding, ACEs won't really perform a context switch. They simply stall and a different ACE takes priority. That's what the HWS would handle. They should be able to switch, but more streams than ACEs would be required for it to be worth considering.

So here's a fun question: on GCN with an HWS, does the programmer explicitly have to handle complementary pairings? I'm assuming Pascal could do something similar in software, but relies on accurately predicting the results?

The async shaders whitepaper is pretty insistent that this scheme is enabled by fast context switching thanks to the ACEs, and I remember reading about each ACE having a buffer for swaps, but generally speaking I completely understand where you're coming from when you say context switching isn't needed; effective scheduling will render this approach moot for the most part.

Fun question indeed, I'd like some more info on this as well :) HWS is new, right? As of Polaris?


Edit: when sebbbi said a group is guaranteed to finish once it starts to execute, I took that to mean all warps in the group must complete before it can move on to another group.

SIMDs aren't partitioned. They simply have a number of waves assigned, which are processed round-robin by selecting the next "ready" wave. If one stalls it's simply skipped; if they all stall you have some downtime. I'm not 100% sure if they account for prioritization.

Partitioned isn't the right word, I guess; I'm just talking about any situation in which SIMDs are executing groups from different streams in parallel. Say you have some threadgroups that will stall if you run using all 4 SIMDs (some dependency on FFH) but won't if you just use 2. On what basis will the scheduler choose between dispatching groups from another kernel to execute in parallel, thus partitioning the SIMDs between the two, or executing on all 4 SIMDs and switching to another group until the former can proceed?

The ACEs are not, but the HWS are. The (original) ACEs used to be hardwired to fetch and decode up to 8 PM4 command streams each. (The predecessor of the ACEs used in the console APUs only have access to one stream each.)

The fetch, decoding and queue scheduling are now offloaded to the dual-threaded HWS processor, which is programmable. The 4 remaining ACEs themselves now only handle the actual dispatch (a single internal queue per ACE, and a dispatch rate of two threadgroups per cycle). How the HWS distributes the workload to the ACEs is also up to the firmware.

Depending on the generation of the HWS, the size of the microcode varies. I think the APUs had a rather large one to begin with, but the first generation on dedicated GPUs was lacking in this regard. This results in different capabilities: handling more buffer formats (it's no longer limited to PM4), or executing more complex scheduling algorithms.

So prior to the introduction of the HWS, you actually needed at least 8 command queues to achieve full dispatch rate by utilizing all 8 ACEs - since the introduction of the HWS you no longer do. The 4 ACEs together with the dual-threaded HWS processor can reach full dispatch rate even with only a single command queue.

How the ACEs themselves dispatch the wavefronts to the CUs is also configurable by software. AFAIK, it supports two different modes: either schedule round-robin to all vacant CUs first, or fill each single CU to the top before passing on. I don't know which mode is used by default on which platform, or which APIs even expose this toggle.

Yes, someone was just telling me about these changes; now the ACEs are glorified dispatchers. ACEs (pre-GCN 1.3) were configurable (I remember programmable, but I'll have to double check) and came with predefined modes like the quick response queue (there were two others, I believe).
 
when sebbbi said a group is guaranteed to finish once it starts to execute, I took that to mean all warps in the group must complete before it can move on to another group.
I meant that resources for the whole thread group are allocated at once. Once the group's resources are allocated, each wave in the group is ready for execution. Round-robin scheduling guarantees that each wave on a CU advances. All stalls (memory, LDS, special instruction latency) are temporary and have upper bounds. There is no way to stop execution of a wave, except with barriers. And a barrier is also just a temporary block, since it only waits for other waves in the same group (and all waves of the same group started execution at the same time, so each one will eventually reach the barrier).
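
As a sketch of that round-robin selection (the wave representation and stall bookkeeping are invented; real hardware tracks this in scoreboards):

[code]
# Per-SIMD round-robin wave selection: stalled waves are skipped, so every
# resident wave keeps making forward progress and all stalls stay temporary.

def pick_next_wave(waves, last_issued):
    """waves: list of dicts like {'id': 0, 'stall_cycles': 3}."""
    n = len(waves)
    for step in range(1, n + 1):
        w = waves[(last_issued + step) % n]
        if w['stall_cycles'] == 0:   # ready -> issue from this wave
            return w
    return None                      # all waves stalled: the SIMD idles

waves = [{'id': 0, 'stall_cycles': 2},
         {'id': 1, 'stall_cycles': 0},
         {'id': 2, 'stall_cycles': 0}]
print(pick_next_wave(waves, last_issued=0)['id'])  # 1 (wave 0 is skipped)
[/code]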

Any amount of thread groups from any draw/dispatch can be executed concurrently with no fine-grained synchronization, since thread groups can't have any dependencies on each other. Any execution order is legal. There's one GCN-specific exception to this rule (ordered count atomics), but that instruction is not exposed in any PC API (DX12 SM6.0, however, seems to expose it soon).

If the scheduler wants to mix two kernels, it doesn't need to context switch (stop some work to start some other). It simply needs to schedule every other group from a different kernel, and the CUs will run two kernels 50/50. It doesn't need to be much more complex than that.
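
The 50/50 mix can be as dumb as alternating the source of the next thread group; a sketch (the function and its naming are mine, not AMD's):

[code]
# Interleave thread groups from two kernels: no context switch, the CUs just
# end up holding resident waves from both kernels at the same time.

from itertools import chain, zip_longest

def interleave_dispatch(groups_a, groups_b):
    """Yield thread groups alternately from kernel A and kernel B."""
    paired = zip_longest(groups_a, groups_b)
    return [g for g in chain.from_iterable(paired) if g is not None]

a = [('A', i) for i in range(3)]
b = [('B', i) for i in range(3)]
print(interleave_dispatch(a, b))
# [('A', 0), ('B', 0), ('A', 1), ('B', 1), ('A', 2), ('B', 2)]
[/code]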
 
Thanks for clearing that up sebbbi, lots of helpful replies from all of you in the thread , appreciate it :)
 
Will the CU be executing waves from two different threadblocks simultaneously at any point? On different SIMDs?
Potentially. It depends on priority of waves on the SIMD and wait status.
GCN can and often does execute up to 40 thread groups per CU simultaneously. One SIMD thus executes up to 10 thread groups concurrently. This happens when you use thread group size of 64 (single wave). I use thread groups of 64 lanes often. That's the most efficient thread group size for GCN if you don't need LDS. Bigger thread groups of course are beneficial if you can share lots of data over LDS. On Nvidia GPUs my profiling shows that thread group size of 64 is slightly slower than 128. 128 is also good for AMD. 256 also performs great on both.

GCN seems to release resources (LDS, wave slots and registers) by thread group granularity instead of wave granularity (resources are obviously allocated for the whole block at once to prevent deadlock). If some waves of a thread group exit early, the resources (wave slots and registers) of those waves seem not to be freed until the whole thread group is finished. This is clearly visible (in performance) when using large thread groups such as 1024.

Only two 1024-lane thread groups fit simultaneously on a single CU (max 40 waves = 2560 lanes). A third group cannot start until every single wave of either thread group has finished. This means that every time a thread group finishes, there is a short period where there is only a single thread group in execution. A 1024-lane thread group = 16 waves. If that is evenly split among the SIMDs, the occupancy is 4 (of 10) per SIMD. This is often not enough to hide memory latency. Occasionally both thread groups finish at the same time, meaning that the whole CU is empty (idling) for a short time. Thread groups of 640 lanes or smaller are much better for GCN: at 640 lanes, 4 groups fit on a CU at once; at 512, 5 groups fit on a CU at once. This radically reduces the problem described above.

Another problem related to the above are shaders with high register count. Register count can also be (and often is) the limiting factor of simultaneous thread groups per CU. Again, if you can't fit at least 4 groups per CU, you are going to see bad performance.
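
A back-of-the-envelope calculator for these limits, using the nominal GCN figures from the posts above (40 wave slots per CU, 64-lane waves, 256 VGPRs per lane per SIMD, 64 KB LDS per CU; allocation granularities are ignored):

[code]
# How many thread groups fit on one GCN CU, given the three limiters:
# wave slots, VGPR budget, and LDS. Whole groups only (all-or-nothing).

def groups_per_cu(group_lanes, vgprs_per_thread, lds_bytes=0):
    waves_per_group = -(-group_lanes // 64)          # ceil(lanes / 64)
    waves_by_slots = 40                              # 10 per SIMD x 4 SIMDs
    waves_by_vgprs = 4 * (256 // vgprs_per_thread)   # register file budget
    groups = min(waves_by_slots, waves_by_vgprs) // waves_per_group
    if lds_bytes:
        groups = min(groups, 65536 // lds_bytes)     # 64 KB LDS per CU
    return groups

for lanes in (1024, 640, 512):
    print(lanes, groups_per_cu(lanes, vgprs_per_thread=24))  # 2, 4, 5
print(groups_per_cu(256, vgprs_per_thread=128))  # 2 -> register-limited, bad
[/code]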
 
The ACEs are not, but the HWS are. The (original) ACEs used to be hardwired to fetch and decode up to 8 PM4 command streams each. (The predecessor of the ACEs used in the console APUs only have access to one stream each.)
Err, what predecessor of ACEs used in console APUs? Both console APUs have 2nd gen GCN equivalent ACEs, capable of 8 streams each. First gen GCNs had the one stream ACEs.
The definition of HWS is kinda sketchy too. Fiji for example has 8 ACEs or 4 ACEs and 2 HWSs depending on which diagram you want to look at - so we know that at least since Fiji, and possibly since Tonga (they are mostly the same generation), 2 ACEs can function as an HWS, or vice versa.
 
Err, what predecessor of ACEs used in console APUs? Both console APUs have 2nd gen GCN equivalent ACEs, capable of 8 streams each. First gen GCNs had the one stream ACEs.
Oh right, my fault. Mixed that one up. Even though it's not yet a full 2nd gen GCN in the PS4 either.
The definition of HWS is kinda sketchy too. Fiji for example has 8 ACEs or 4 ACEs and 2 HWSs depending on which diagram you want to look at - so we know that at least since Fiji, and possibly since Tonga (they are mostly the same generation), 2 ACEs can function as an HWS, or vice versa.
The charts showing 8 ACEs are misleading. There are no 8 ACEs in Fiji or any other Volcanic Islands card. It's not even the same "ACEs" as in the previous generations. And the HWS can't double as an "ACE" either. AMD stopped publishing any new slides showing 8 ACEs for a good reason.

The HWS processor is the frontend (decode, schedule) for the new "ACE"s, which are, as @ieldra put it quite fittingly, nothing more than dispatchers now.

The performance characteristics are comparable to having 4-16 "classic" ACEs on a single Volcanic Islands GPU (the upper bound is just an estimate, the lower bound is defined by ACE throughput), depending on the type of workload, but there really is no straight 1:1 relationship between ACE and HWS.
 
The async shaders whitepaper is pretty insistent that this scheme is enabled by fast context switching thanks to the ACEs, and I remember reading about each ACE having a buffer for swaps, but generally speaking I completely understand where you're coming from when you say context switching isn't needed; effective scheduling will render this approach moot for the most part.

Fun question indeed, I'd like some more info on this as well :) HWS is new, right? As of Polaris?
I'd be cautious of using the word context switching here. Context switching of queues is what the ACEs were designed for, and it is inherently easy, since queues generally consist of non-blocking dispatch commands and asynchronous waits. The HWS additionally allows oversubscription to take it further.

What the asynchronous shader paper (and Quick Response) is about is this context switching of queues AND the scheduling of which dispatch grids get a green traffic light (kind of a QoS mechanism).

However, what people usually mean by context switching is something else, called mid-wave preemption in GCN. I.e. you can bump something that is running out to memory, run something important, and resume the bumped task from memory at a later point.
 
I'd be cautious of using the word context switching here. Context switching of queues is what the ACEs were designed for, and it is inherently easy, since queues generally consist of non-blocking dispatch commands and asynchronous waits. The HWS additionally allows oversubscription to take it further.

What the asynchronous shader paper (and Quick Response) is about is this context switching of queues AND the scheduling of which dispatch grids get a green traffic light (kind of a QoS mechanism).

However, what people usually mean by context switching is something else, called mid-wave preemption in GCN. I.e. you can bump something that is running out to memory, run something important, and resume the bumped task from memory at a later point.

When I talk about context switching I mean halting a wave (or wavegroup) that is in execution, bumping it off somewhere, running something else, then resuming it later. I would not consider switching queues to be context switching, and conflating the two is rather confusing.

I thought there was a buffer within each ACE that enabled it to store/load context for preempting wavegroups from a different grid in a different queue.
 
When I talk about context switching I mean halting a wave (or wavegroup) that is in execution, bumping it off somewhere, running something else, then resuming it later. I would not consider switching queues to be context switching, and conflating the two is rather confusing.

I thought there was a buffer within each ACE that enabled it to store/load context for preempting wavegroups from a different grid in a different queue.
The implementation is obscured. But realistically the ACEs would not have such a buffer. Given the amount of state (256 KB + 8 KB per CU) and the sequentiality of the process, you might be better off just writing it directly out to memory. Carrizo apparently does this.
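
To put that number in perspective (CU counts are public specs; the 256 KB + 8 KB split is taken from the figure above at face value):

[code]
# Rough size of the register state a full-GPU preemption would have to move.
per_cu_kb = 256 + 8   # VGPRs plus the smaller per-CU state quoted above
for name, cus in (("Fiji", 64), ("Polaris 10", 36), ("PS4", 18)):
    print(f"{name}: {per_cu_kb * cus / 1024:.1f} MB")
# Fiji: 16.5 MB -- far too much to buffer in an ACE, so streaming it
# straight to memory (as Carrizo reportedly does) makes sense.
[/code]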

Moreover, the ACEs should not have knowledge about wavefronts being preempted either. Not only because preemption is expected to be transparent to the application, but also because the ACEs and the CUs live in two different realms of existence. It is kind of control path vs data path.
 
Even if there is such a write buffer to accelerate the process (or even reuse the ROP hierarchy), you should probably expect it to be tied to the memory hierarchy, instead of the control path.
 
When I talk about context switching I mean halting a wave (or wavegroup) that is in execution, bumping it off somewhere, running something else, then resuming it later. I would not consider switching queues to be context switching, and conflating the two is rather confusing.

I thought there was a buffer within each ACE that enabled it to store/load context for preempting wavegroups from a different grid in a different queue.
The implementation is obscured. But realistically the ACEs would not have such a buffer. Given the amount of state (256 KB + 8 KB per CU) and the sequentiality of the process, you might be better off just writing it directly out to memory. Carrizo apparently does this.

Moreover, the ACEs should not have knowledge about wavefronts being preempted either. Not only because preemption is expected to be transparent to the application, but also because the ACEs and the CUs live in two different realms of existence. It is kind of control path vs data path.
Context switch / pre-emption (actually writing all thread group VGPRs to memory and restoring them later) doesn't need to be on the fast path. This is not an operation that is used often. Basically you only need it to ensure quality of service. A long-running compute shader potentially blocks the whole GPU for a long time. A single thread group could have a lifetime of multiple seconds (or even minutes) if very long loops are used. If the GPU doesn't support context switch / pre-emption, the OS has no other choice than to kill + reboot the display driver (Windows does this if a shader takes longer than 2 seconds to run). With context switch / pre-emption, some groups can be periodically flushed to memory and restored, so that the display can update. GPU virtualization also needs similar functionality.

GPU context switch / pre-emption is basically a safety / QoS feature. It should not be used in performance-critical settings. Usually thread groups have significantly less than 1 ms running time. Ensuring that high-priority tasks (thread groups) get scheduled before low-priority thread groups is good enough. It doesn't waste memory bandwidth and gets the job done. I would also guess that waves on a CU originating from high-priority tasks have some extra bit set. The round-robin scheduling can give these waves more cycles than others (multiple waves are ready to run -> always take a high-priority one if possible).
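
That priority-bit guess could look like this (purely speculative; the wave fields below are invented for illustration):

[code]
# Round-robin wave selection biased by a per-wave priority bit: among the
# ready waves, a high-priority one is taken first if any exists.

def pick_wave(waves, last_issued):
    """waves: list of {'ready': bool, 'high_prio': bool}."""
    n = len(waves)
    order = [(last_issued + s) % n for s in range(1, n + 1)]
    ready = [i for i in order if waves[i]['ready']]
    high = [i for i in ready if waves[i]['high_prio']]
    return (high or ready or [None])[0]

waves = [{'ready': True,  'high_prio': False},
         {'ready': True,  'high_prio': True},
         {'ready': False, 'high_prio': True}]
print(pick_wave(waves, last_issued=0))  # 1 -> ready high-priority wave wins
[/code]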
 
AMD has mentioned one possible embodiment of preemption-capable graphics involving some amount of dedicated logic in existing units, or separate blocks that assist with preemption.
If done that way, it could make sense, since it avoids scaling much of the hardware investment for a preferably rare operation with either CU count or front-end size.
The command front-end processors are not flush with execution resources, and their own timeliness is something the whole GPU depends on, so keeping the more variable latency and mostly irrelevant details of the switch out of that part of the pipeline could be helpful. Some context about when resources become available, or whether tasks need to be relaunched, might need to filter back around; otherwise the upstream components would be unable to take advantage of the resources being freed up, or to make room for finishing what was interrupted.

As was noted, it's a QoS/system survivability measure rather than performance optimization, since there's now more context, hardware expenditure, execution time, and delay than in a system that purely ran uninterrupted from start to finish (at the cost of responsiveness and vulnerability to buggy/malicious code).
 
I wouldn't be surprised if the compilers already take trivial cases to the scalar path, say a constant pointer with a constant offset.
Scalar path yes, but it would be an interesting alternative for scalarization if the schedulers could determine that a wave was significantly small or masked off. If a wave starts with, or gets masked down to, single-digit active threads, it might be worthwhile to execute it on the scalar unit. This would have to occur transparently to the compiler. For example, the scheduler could take a vector ALU instruction and schedule it to the scalar unit, which would know how to unroll it. This would require the scalar unit to appear as a SIMD in regards to scheduling. A higher-clocked, largely asynchronous scalar unit would definitely help there if it could skip masked lanes.
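
As a thought experiment of what "unrolling" a nearly empty wave onto a scalar unit would buy (not a description of real GCN hardware; the exec-mask handling is just illustrative):

[code]
# A 64-wide vector add issues all lanes regardless of the exec mask; a scalar
# unit could instead loop over only the set bits, winning when few are set.

def vector_add_simd(exec_mask, a, b):
    """SIMD path: all 64 lanes issue, masked-off lanes are wasted work."""
    return [a[i] + b[i] if (exec_mask >> i) & 1 else None for i in range(64)]

def vector_add_scalar(exec_mask, a, b):
    """Scalar path: visit only active lanes, one per iteration."""
    out = [None] * 64
    m = exec_mask
    while m:
        lane = (m & -m).bit_length() - 1  # index of lowest set bit
        out[lane] = a[lane] + b[lane]
        m &= m - 1                        # clear that bit
    return out

mask = 0b101   # only lanes 0 and 2 alive -> 2 scalar iterations vs 64 lanes
a, b = list(range(64)), list(range(64))
assert vector_add_scalar(mask, a, b) == vector_add_simd(mask, a, b)
[/code]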

there really is no straight 1:1 relationship between ACE and HWS.
I think a lot of that 1:1 relationship comes from the hardware requirements. The hardware for 1 HWS could be 2 ACEs, for example. Capabilities are a completely different matter.

I thought there was a buffer within each ACE that enabled it to store/load context for preempting wavegroups from a different grid in a different queue.
Very small buffer. Most of the data they should be operating on will be pointers and various metrics for scheduling and progress related to the active grid. Possibly some queue depth for the next grid. The buffer size is likely measured in bytes, but nobody knows for certain what all is in there. A context switch would be possible, but the system would have to decide that 4 active dispatch queues were insufficient or needed to be preempted.
 