The test case is whether graphics processing time can overlap with compute; is there any expectation that a 256-thread group would change the verdict for the GPUs running it?
Single-lane compute work requires the scheduler to spawn a single wave on all GPUs. On AMD the wave is 64 lanes wide, meaning the architecture is designed to run/manage fewer waves (as each does more work). If you spawn single-lane work, you are more likely to end up under-utilizing GCN compared to other GPUs. Work group sizes can also expose (academic) bottlenecks, since resource (GPR, LDS) acquisition/release is done at work-group granularity.
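To make the single-lane case concrete, here's a minimal D3D12-side sketch (function and parameter names are my own, not from the test in this thread):

```cpp
#include <d3d12.h>

// Sketch: dispatching "single lane" compute work in D3D12.
// Assumes 'commandList' already has a compute PSO bound whose HLSL entry
// point is declared [numthreads(1, 1, 1)].
void DispatchSingleLane(ID3D12GraphicsCommandList* commandList)
{
    // One thread group containing a single thread. On GCN this still
    // occupies a full 64-wide wave, so 63 of the 64 lanes sit idle --
    // the under-utilization described above. GPRs and LDS are also
    // acquired/released at this (tiny) work-group granularity.
    commandList->Dispatch(1, 1, 1);
}
```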
I am just saying that using a workload that is not realistic can cause various bottlenecks that will not matter in a realistic scenario. Some people seem to be drawing all kinds of conclusions based on the results of this thread, even though the results might not mean anything for most applications (especially games, the main purpose of the DX12 API).
There was puzzlement over why the latency was so disparate for GCN in the lowest cases; this is largely explained by the 4-cycle wavefront cadence having been left out of the analysis.
This is a downside of the GCN architecture, but it almost never matters in real software.
I am curious about the exact placement of the inflection points for the timings, since they don't necessarily line up with some of the most obvious resource limits.
PC DirectX 12 abstracts the barriers quite a bit, so we don't know whether we are timing end-of-pipe or something else. This matters, since GPUs have very long pipelines.
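For reference, here's roughly how GPU timestamps are recorded in D3D12 (a sketch with placeholder names). Note that the API never lets you pick the pipeline stage the timestamp is sampled at; that is driver/GPU defined, which is exactly the ambiguity described above:

```cpp
#include <d3d12.h>

// Sketch: GPU timestamps around a dispatch in D3D12 (placeholder names).
void TimeDispatch(ID3D12GraphicsCommandList* commandList,
                  ID3D12QueryHeap* timestampHeap,   // D3D12_QUERY_HEAP_TYPE_TIMESTAMP
                  ID3D12Resource* readbackBuffer,
                  UINT groupsX)
{
    commandList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);
    commandList->Dispatch(groupsX, 1, 1);
    commandList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);

    // Copy both timestamps into a readback buffer for the CPU to inspect.
    // Whether these samples represent top-of-pipe, end-of-pipe, or
    // something in between is not specified by the API.
    commandList->ResolveQueryData(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP,
                                  0, 2, readbackBuffer, 0);
}
```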
I believe SALU and VALU ops have to come from different hardware threads, so this specific kernel couldn't be sped up that way.
A single CU can run multiple kernels, and since the test is using async compute, we can assume that multiple queues produce work for the same CU. This means it can interleave SALU from one kernel and VALU from the other, letting both progress at twice the rate.
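A sketch of how you'd feed two kernels through independent compute queues (placeholder names; whether the GPU actually co-schedules the resulting waves on the same CU is entirely up to the hardware and driver):

```cpp
#include <windows.h>
#include <d3d12.h>

// Sketch: submitting two kernels via independent D3D12 compute queues.
// Error handling omitted for brevity.
void SubmitTwoComputeQueues(ID3D12Device* device,
                            ID3D12CommandList* listA,
                            ID3D12CommandList* listB)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    ID3D12CommandQueue* queueA = nullptr;
    ID3D12CommandQueue* queueB = nullptr;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queueA));
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queueB));

    // Each queue executes its own pre-recorded command list; the GPU is
    // free to run them concurrently, possibly interleaving SALU ops from
    // one kernel with VALU ops from the other on the same CU.
    queueA->ExecuteCommandLists(1, &listA);
    queueB->ExecuteCommandLists(1, &listB);
}
```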
I strongly disagree, as I have some code that runs fastest with 64, but I come from an OpenCL perspective (where, apart from anything else, you can't get more than 256 work items into a work-group)...
Of course there are kernels that are fastest with thread blocks of 64 or 1024 threads. But these are fast because of some other bottleneck: you are most likely trading some GPU utilization for other improvements (like reduced memory traffic). Also, if the GCN OpenCL compiler is very smart, it could compile thread groups of 64 threads differently (since a 64-thread group is exactly one wave, the scalar unit could be exploited more).
Fillrate tests aren't meaningful work either. This test does reveal serial versus async behaviours, so it's a success on those terms.
It measures many things, not just async compute. This makes the results hard to interpret, and people are drawing wrong conclusions.
It appears that even a single queue (the command list test) results in async compute on GCN.
All modern GPUs are capable of running multiple graphics tasks and multiple compute tasks in parallel, when these tasks originate from the same queue. This has been true for a long time already. However, the DirectX 11 API and DX11 drivers are quite defensive in their resource tracking, meaning that concurrent execution for compute doesn't happen often. Concurrent execution of multiple graphics draw calls, however, happens regularly (unless the graphics shaders use UAVs). How many graphics draws execute simultaneously depends on fixed-function resource limits (only a limited number of global state combinations can execute concurrently).
What forced async? There is no forced async; the single command list is not forced async.
Yes, a single command list by definition is not async. Shaders can still run concurrently even from a single command list (but not asynchronously). DirectX 12 exposes resource barriers to the programmer, giving the programmer more control over concurrent execution within a single command queue. Manual resource barriers allow greater and more controlled parallelism from a single queue, and this is also supported by more GPU vendors. If you don't strictly need async, this is a good way to maximize GPU utilization.
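For example, something like this (placeholder names; PSO and root binding changes omitted): two dispatches with no barrier between them are free to overlap, while a UAV barrier serializes the dependent one:

```cpp
#include <d3d12.h>

// Sketch: manual resource barriers on a single command list/queue.
// Dispatches A and B write to independent resources, so no barrier
// separates them and the GPU may execute them concurrently. Dispatch C
// reads B's output ('outputOfB', placeholder), so a UAV barrier is
// required first.
void RecordThreeDispatches(ID3D12GraphicsCommandList* commandList,
                           ID3D12Resource* outputOfB)
{
    commandList->Dispatch(64, 1, 1);  // A
    commandList->Dispatch(64, 1, 1);  // B (independent of A -> no barrier)

    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
    barrier.UAV.pResource = outputOfB;
    commandList->ResourceBarrier(1, &barrier);

    commandList->Dispatch(64, 1, 1);  // C (consumes B's output)
}
```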
This is latency, not end performance.
Exactly!
This is not a performance (maximum throughput) benchmark. However, it seems that less technically inclined people believe it is, because this thread is called "DX12 performance thread". This thread doesn't in any way imply that "asynchronous compute is broken in Maxwell 2", or that "Fiji (Fury X) is super slow compared to NVIDIA in DX12 compute". This benchmark is not directly relevant for DirectX 12 games. As some wise guy said at SIGGRAPH: graphics rendering is the killer app for compute shaders. DX12 async compute will mainly be used for graphics rendering, and for that use case the CPU->GPU->CPU latency has zero relevance. All that matters is the total throughput with realistic shaders. Like hyperthreading, async compute throughput gains are highly dependent on the shaders you use. Test shaders that are not ALU / TMU / BW bound are not a good way to measure performance (yes, I know, this is not even supposed to be a performance benchmark, but it seems that some people think it is).
This benchmark has relevance for mixed, tightly interleaved CPU<->GPU workloads. However, it is important to realize that the current benchmark does not just measure async compute; it measures the whole GPU pipeline latency. GPUs are good at hiding this latency internally, but are not designed to hide it from external observers (such as the CPU).
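The round trip being measured is essentially this fence-based pattern (a sketch with placeholder names, error handling omitted):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <chrono>

// Sketch: measuring CPU->GPU->CPU round-trip latency with a fence.
// This times the *entire* pipeline latency -- submission, GPU front-end,
// execution, end-of-pipe signal, and OS thread wakeup -- not just the
// kernel itself.
double MeasureRoundTripMs(ID3D12CommandQueue* queue,
                          ID3D12CommandList* list,
                          ID3D12Fence* fence,
                          UINT64& fenceValue,
                          HANDLE event)
{
    auto start = std::chrono::high_resolution_clock::now();

    queue->ExecuteCommandLists(1, &list);            // CPU -> GPU
    queue->Signal(fence, ++fenceValue);              // GPU signals completion
    fence->SetEventOnCompletion(fenceValue, event);
    WaitForSingleObject(event, INFINITE);            // GPU -> CPU

    auto stop = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```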