Ext3h
Shame on me. I messed up with the data on GCN 1.2.
It is not the HWS that does the balancing. It's the Graphics Command Processor, which doubled in width on GCN 1.2. That thing is able to hold 128 grids in flight, as opposed to 64 on GCN 1.1. Each ACE/HWS still manages only 64 grids, no more. That's why GCN 1.2 was much faster in "single commandlist" mode (= "everything in the 3D queue") than it was in pure compute or split mode.
That means I also misinterpreted the data on Maxwell. It actually has 32 shader slots, all of which are active in compute mode. In graphics mode, however, it has only a single(!) compute shader slot, while the other slots are reserved (hard-wired) for other shader types, which is why it failed so badly. And yes, it does need to switch between graphics and compute queue mode; it can't run both in parallel. This is unrelated to the Hyper-Q feature, which operates independently of these regular 32 slots, which is why dwm.exe and the like can cut ahead.
There are no parallel DX12-compatible compute and 3D paths in hardware. There is only one 3D queue, which can switch between compute and graphics mode.
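To put rough numbers on the slot counts above, here is a toy back-of-the-envelope sketch. It assumes every grid takes the same time and that a slot count acts as a plain concurrency limit, which real hardware certainly does not guarantee. The function name `rounds_needed` and the constant names are made up for illustration; only the slot counts themselves come from the figures quoted in this post.

```python
import math

# Slot counts as quoted in the post. Treating them as simple
# concurrency limits is an assumption of this toy model.
GCN_11_GRAPHICS_CP_GRIDS = 64    # Graphics Command Processor, GCN 1.1
GCN_12_GRAPHICS_CP_GRIDS = 128   # Graphics Command Processor, GCN 1.2
ACE_HWS_GRIDS = 64               # per ACE/HWS
MAXWELL_COMPUTE_MODE_SLOTS = 32  # all shader slots usable in compute mode
MAXWELL_GRAPHICS_MODE_COMPUTE_SLOTS = 1  # single compute slot in graphics mode


def rounds_needed(dispatches: int, slots: int) -> int:
    """Sequential rounds needed to drain `dispatches` equal-length grids
    when at most `slots` of them can be in flight at once (hypothetical model)."""
    if dispatches < 0:
        raise ValueError("dispatches must be non-negative")
    if slots <= 0:
        raise ValueError("slots must be positive")
    return math.ceil(dispatches / slots)


if __name__ == "__main__":
    n = 128  # arbitrary example workload size
    for label, slots in [
        ("GCN 1.2 graphics CP", GCN_12_GRAPHICS_CP_GRIDS),
        ("GCN 1.1 graphics CP", GCN_11_GRAPHICS_CP_GRIDS),
        ("one ACE/HWS", ACE_HWS_GRIDS),
        ("Maxwell, compute mode", MAXWELL_COMPUTE_MODE_SLOTS),
        ("Maxwell, graphics mode", MAXWELL_GRAPHICS_MODE_COMPUTE_SLOTS),
    ]:
        print(f"{label}: {rounds_needed(n, slots)} round(s) for {n} grids")
```

Under these simplifying assumptions, the single compute slot in graphics mode serializes the grids completely, which is at least consistent with the poor result described above. The model ignores mode-switch cost and everything else about real scheduling.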
I failed at interpreting "single commandlist" correctly, and never gave it a second thought.