Ext3h
> ACEs are microcode programmable, right? AMD has preprogrammed scheduling modes/routines, so I'm assuming the context switching is a sort of last resort in general cases, otherwise used when preempting for time-critical kernels?

The ACEs are not, but the HWS is. The (original) ACEs used to be hardwired to fetch and decode up to 8 PM4 command streams each. (The predecessors of the ACEs used in the console APUs only have access to one stream each.)
Fetch, decode and queue scheduling are now offloaded to the dual-threaded, programmable HWS processor. The 4 remaining ACEs themselves now only handle the actual dispatch (a single internal queue per ACE, with a dispatch rate of two threadgroups per cycle). How the HWS distributes the workload to the ACEs is also subject to the firmware.
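The front-end split described above can be sketched as a toy model. This is purely conceptual: the class names, the round-robin distribution policy, and the queue representation are all made-up illustrations, not the actual firmware behavior (which, as noted, is firmware-defined).

```python
# Conceptual sketch of the post-HWS front end: the HWS fetches/decodes
# command streams and feeds the dispatch-only ACEs. All names and the
# distribution policy here are hypothetical illustrations.
from collections import deque

NUM_ACES = 4
DISPATCH_RATE = 2  # threadgroups per ACE per cycle (from the post)

class ACE:
    """Post-HWS ACE: a single internal queue, dispatch only."""
    def __init__(self):
        self.internal_queue = deque()

    def dispatch_cycle(self):
        # Up to DISPATCH_RATE threadgroups leave this ACE per cycle.
        n = min(DISPATCH_RATE, len(self.internal_queue))
        return [self.internal_queue.popleft() for _ in range(n)]

class HWS:
    """Dual-threaded, firmware-programmable stage: fetches and decodes
    command streams, then distributes work to the ACEs."""
    def __init__(self, command_queues):
        self.queues = command_queues
        self.next_ace = 0

    def schedule(self, aces):
        # Distribution policy is up to the firmware; plain round-robin here.
        for q in self.queues:
            while q:
                aces[self.next_ace].internal_queue.append(q.popleft())
                self.next_ace = (self.next_ace + 1) % len(aces)

# A single command queue still keeps all 4 ACEs busy:
queues = [deque(f"tg{i}" for i in range(8))]
aces = [ACE() for _ in range(NUM_ACES)]
HWS(queues).schedule(aces)
print([len(a.internal_queue) for a in aces])  # [2, 2, 2, 2]
```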
Depending on the generation of the HWS, the size of the microcode store varies. I think the APUs had a rather large one to begin with, while the first generation on dedicated GPUs was lacking in that regard. This results in different capabilities: either handling more different buffer formats (it's no longer limited to PM4), or executing more complex scheduling algorithms.
So prior to the introduction of the HWS, you actually needed at least 8 command queues to achieve full dispatch rate by utilizing all 8 ACEs; since the introduction of the HWS, you no longer do. The 4 ACEs together with the dual-threaded HWS processor can reach full dispatch rate even with only a single command queue.
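Back-of-the-envelope arithmetic for that claim. The only numbers taken from the post are the ACE counts, the one-stream-per-old-ACE utilization requirement, and the 2-threadgroups-per-cycle rate of the new ACEs; the per-ACE rate of the old front end is a simplifying assumption.

```python
# Dispatch-rate comparison, old vs. new front end (toy model).

def legacy_rate(num_queues, num_aces=8, rate_per_ace=1):
    # Old front end: each ACE fetches its own command stream(s), so an
    # ACE with no queue bound to it sits idle. Full rate needs >= 8 queues.
    # (rate_per_ace=1 is an assumption, not from the post.)
    return min(num_queues, num_aces) * rate_per_ace

def hws_rate(num_queues, num_aces=4, rate_per_ace=2):
    # New front end: the HWS fetches/decodes and can fan even a single
    # queue out to all 4 dispatch-only ACEs (2 threadgroups/cycle each).
    return num_aces * rate_per_ace if num_queues >= 1 else 0

print(legacy_rate(1))  # 1 -> only one ACE is kept busy
print(legacy_rate(8))  # 8 -> full rate only with 8 queues
print(hws_rate(1))     # 8 -> full rate from a single queue
```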
How the ACEs themselves dispatch the wavefronts to the CUs is also configurable by software. AFAIK, it supports two different modes: either schedule round-robin to all vacant CUs first, or fill each single CU to the top before passing on. I don't know which mode is used by default on which platform, or which APIs even expose this toggle.
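The two filling policies can be contrasted with a small toy model. The slot counts and function names are made up for illustration; only the two policies themselves come from the post.

```python
# Toy model of the two CU-filling modes (all parameters hypothetical).

def assign_round_robin(num_groups, num_cus, slots_per_cu):
    """Spread threadgroups across all vacant CUs first."""
    occupancy = [0] * num_cus
    cu = 0
    for _ in range(num_groups):
        # Find the next CU with a free slot, starting after the last used.
        for step in range(num_cus):
            candidate = (cu + step) % num_cus
            if occupancy[candidate] < slots_per_cu:
                occupancy[candidate] += 1
                cu = (candidate + 1) % num_cus
                break
    return occupancy

def assign_fill_first(num_groups, num_cus, slots_per_cu):
    """Fill each CU to capacity before passing on to the next."""
    occupancy = [0] * num_cus
    for cu in range(num_cus):
        take = min(slots_per_cu, num_groups)
        occupancy[cu] = take
        num_groups -= take
        if num_groups == 0:
            break
    return occupancy

print(assign_round_robin(6, 4, 4))  # [2, 2, 1, 1]
print(assign_fill_first(6, 4, 4))   # [4, 2, 0, 0]
```

The trade-off is the usual one: spreading work maximizes per-threadgroup resources (cache, registers, LDS headroom per CU), while filling CUs first leaves whole CUs free for other work.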