DX12 Performance Discussion And Analysis Thread

Doesn't work for me--it sees it and tries to connect but times out. I allowed it through the firewall, so I don't know why...
 
10 wavefronts per CU on GCN 1.0, but they can initiate work on a different CU (crossbar). That said, the scalar engine surely plays a role in this; on GCN 1.2 the scalar unit can be used to borrow the ALUs of another CU to balance the load. Remember that GCN likes to be overloaded, not the other way around.
Yes, GCN requires at least 4 wavefronts per CU to become fully utilized; up to 10 are possible, each wavefront executed in lockstep.

But what I'm actually wondering about is how they got scheduled in the first place. With only 2 single-queue ACEs, and the workload consisting of single threads only, I wouldn't have expected them to form a full wavefront of 64 threads in the first place.

And then again:
If it was capable of forming a full wavefront, why did it stop at a single one? It should at least have launched another one per CU, even if the shader program did hit the register limit per CU (which I don't think is what happened). Given the specs and the fact that it obviously DID assemble full wavefronts (otherwise it wouldn't have been able to execute 64 threads in parallel), even Cape Verde shouldn't have topped out at less than 2560-6400 threads per batch - and certainly not at precisely 64.
 
Yes, GCN requires at least 4 wavefronts per CU to become fully utilized; up to 10 are possible, each wavefront executed in lockstep.
FYI, it is up to 32/40 wavefronts per CU in fact, depending on required register allocation by the kernel.
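As a rough sketch of how that limit falls out of register allocation (assuming the commonly documented GCN figures of 256 VGPRs per lane per SIMD, a hard cap of 10 wavefronts per SIMD and 4 SIMDs per CU; allocation granularity and the 32-vs-40 difference between GCN revisions are ignored here):

Code:
#include <algorithm>
#include <cstdio>

// Hypothetical occupancy estimate for one GCN CU.
// Assumptions: 4 SIMDs per CU, 256 VGPRs per lane per SIMD,
// and a hardware cap of 10 wavefronts per SIMD.
int WavesPerCU(int vgprsPerWave)
{
    const int kSimdsPerCU      = 4;
    const int kMaxWavesPerSimd = 10;
    const int kVgprsPerSimd    = 256;

    int wavesPerSimd = std::min(kMaxWavesPerSimd, kVgprsPerSimd / vgprsPerWave);
    return wavesPerSimd * kSimdsPerCU;
}

int main()
{
    printf("24 VGPRs/wave -> %d waves per CU\n", WavesPerCU(24)); // hits the cap: 40
    printf("84 VGPRs/wave -> %d waves per CU\n", WavesPerCU(84)); // register limited: 12
}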

But what I'm actually wondering about is how they got scheduled in the first place. With only 2 single-queue ACEs, and the workload consisting of single threads only, I wouldn't have expected them to form a full wavefront of 64 threads in the first place.
Queues take kernel dispatches (each a single command) over a multi-dimensional grid, submitted by the CPU or other agents. The GPU hardware scheduler generally divides a large dispatch into workgroups, and then divides the workgroups into wavefronts of 64 lanes.

Say you have an OpenCL filter working on a 1080p image: you submit a kernel dispatch with dimensions of 1920*1080 to the GPU queue, and then signal the ACE about the enqueuing. The ACE will then take the packet and transform it into wavefronts internally - in this case 32,400 wavefronts. Since the GPU obviously has no such concurrency available, the wavefront scheduler will generate a new wavefront whenever there is an available slot, or otherwise wait for one.
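A minimal sketch of that division, assuming a hypothetical 8x8 workgroup shape (the workgroup size is chosen by the kernel, not fixed by the hardware) and 64-lane wavefronts:

Code:
#include <cstdio>

int main()
{
    // Hypothetical dispatch over a 1080p image, one work item per pixel.
    const long long width = 1920, height = 1080;
    const long long groupX = 8, groupY = 8;   // assumed workgroup shape (64 items)
    const long long waveSize = 64;            // GCN wavefront width

    long long groups = ((width + groupX - 1) / groupX) *
                       ((height + groupY - 1) / groupY);
    long long wavesPerGroup = (groupX * groupY + waveSize - 1) / waveSize;

    printf("workgroups: %lld\n", groups);                  // 240 * 135 = 32400
    printf("wavefronts: %lld\n", groups * wavesPerGroup);  // 32400, as above
}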
 
I wasn't trying to suggest that 64 kernel launches are actually assembled into a single work-group to run as a single hardware thread (though I did theorise that this is possible).

I think you guys misunderstand the nature of the ACEs. Each of them manages a queue. I would expect an ACE to be able to launch multiple kernels in parallel, which appears to be what we're seeing in each line where the timings in square brackets feature many identical results, e.g.:

Code:
64. 5.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39]
65. 10.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37]
66. 10.98ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37 10.78]

There can be only 10 kernels per SIMD, or 40 per CU, so multiple CUs must be running the set of kernels simultaneously in order to deliver these timings.

At 1GHz the theoretical time to run this kernel is 4.33ms (128 instructions per loop = 512 cycles; + 16 cycles to branch = 528 cycles; * 8192 iterations of the unrolled loop = 4,325,376 cycles; at 1GHz = 4.33ms). The timings observed are slightly slower, e.g. 4.33ms for 1025MHz HD7950, or 4ms for 1100MHz Fury X, which is 98.3% of the theoretical speed.
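For reference, here is that arithmetic spelled out, reusing only the figures from the paragraph above (128 instructions at 4 cycles each plus 16 cycles of branch overhead, over 8192 iterations) and scaling by clock speed; the 4-cycles-per-instruction assumption is the usual 64-lane-wave-on-16-wide-SIMD reasoning:

Code:
#include <cstdio>

int main()
{
    // Figures from the post: 128 VALU instructions per loop iteration,
    // assumed 4 cycles each (64-lane wavefront on a 16-wide SIMD),
    // plus 16 cycles for the branch, over 8192 iterations.
    const double cyclesPerIteration = 128.0 * 4.0 + 16.0;    // 528 cycles
    const double totalCycles = cyclesPerIteration * 8192.0;  // 4,325,376 cycles

    const double clocksMHz[] = { 1000.0, 1025.0, 1100.0 };
    for (double mhz : clocksMHz)
        printf("%4.0f MHz -> %.2f ms\n", mhz, totalCycles / (mhz * 1000.0));
    // 1000 MHz -> 4.33 ms, 1025 MHz -> 4.22 ms, 1100 MHz -> 3.93 ms
}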

So there is some overhead in assembling and launching kernels and it seems as if the overhead is shared by either approximately 64 (compute or async) or ~128 (single command list on newer GCN) hardware threads.

I guess that launching the same number of hardware threads (e.g. 64) from a single kernel, which is the normal style of enqueuing work to a GPU, would result in a lower overhead.
 
FYI, it is up to 32/40 wavefronts per CU in fact, depending on required register allocation by the kernel.

*snip*
Oops. I missed the earlier discussions. Sorry about that.

IMO there is a maximum number of kernels that each compute pipeline scheduler can handle. On top of that, it seems each shader engine (group of CUs) gets its own wavefront scheduler, while the ACEs could dispatch workgroups to any of them. That means there could be interlocks and queues in between that might cause increasing latencies at high loads.
 
R9 270X before and after unrolling. Mine is the only 270X so far I think, so if you don't feel like graphing the one before unrolling that's fine. I did both of these today, though, so there shouldn't be anything different other than the HLSL source.
Your results are like the other AMD version 6 GPUs (original GCN), with fairly erratic timings. I expect your card will behave like Benny-ua's card.

Still can't upload attachments--does it require flash, or is it a firefox thing?
I haven't done an upload, but Firefox allows me to press the Upload a File button when I'm in the "More Options" view for writing a reply.
 
FYI, it is up to 32/40 wavefronts per CU in fact, depending on required register allocation by the kernel.
Thanks, got the per CU / per SIMD limits mixed up.

IMO there is a maximum number of kernels that each compute pipeline scheduler can handle.
If that were the case, we should see different levels of parallelism depending on the number of CUs per card, but we actually don't. We don't even see any difference between GCN 1.0 and 1.2 for a true async workload, and that's really odd.

Also, an ACE shouldn't be able to schedule more than the entry at the top of each queue at a time. But perhaps it is capable of joining identical kernels into a single wavefront?

Is it possible that even on GCN 1.1 and 1.2 cards, only a SINGLE ACE and a SINGLE queue are being used so far? That would certainly explain why it looks as if there was only a single wavefront active.

That would mean this benchmark isn't even utilizing 1/64th of the scheduling capabilities of GCN 1.2....
 
We need a version with more queues. 64 queues should be perfectly fine.

I'm not sure what that would do to Maxwell v2 either. It is possible that the Nvidia driver already pushed each async task into a different hardware queue and achieved parallelism that way (with only a single thread per wavefront). But it's also possible that it did the same as GCN and actually dispatched a full wavefront from only a single queue, which would mean the platform is also underutilized, by a factor of 32x.
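If someone wants to try the "more queues" variant, a rough D3D12 sketch of the queue setup could look like this (queue creation only; `device` is assumed to be an already initialized ID3D12Device, and command lists/fences are left out). Whether the driver maps these to distinct hardware queues is exactly the open question:

Code:
#include <d3d12.h>
#include <wrl/client.h>
#include <vector>

using Microsoft::WRL::ComPtr;

// Create `count` independent compute queues on an existing device.
std::vector<ComPtr<ID3D12CommandQueue>> CreateComputeQueues(ID3D12Device* device, int count)
{
    std::vector<ComPtr<ID3D12CommandQueue>> queues(count);

    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type  = D3D12_COMMAND_LIST_TYPE_COMPUTE; // compute-only queue
    desc.Flags = D3D12_COMMAND_QUEUE_FLAG_NONE;

    for (int i = 0; i < count; ++i)
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queues[i]));

    return queues;
}

// Usage (hypothetical): auto queues = CreateComputeQueues(device, 64);
// then record one dispatch per queue and compare the timing pattern.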
 
Looks like Nvidia pretty much distributed the asynchronous dispatching at the multi-processor level in a similar manner to tessellation. :???:
I'm dubious that the intra-SMM issue/scheduling of hardware threads is relevant here as I reckon that's handling work that's already been distributed internally by the hardware to the SMM.
 
Request for a favour from the AMD testers: I would like to see what happens when the code runs faster with an unrolled loop.

I find this thread interesting, so I tested my HD 7970 (15.8 driver).
 

Attachments

  • 7970 perf unrolled.zip
    45.8 KB · Views: 7
Well, we can throw preemption and context switching out the window; both of those are used together, and they will actually increase latency, not reduce it. So what AMD's Hallock has alluded to when he mentioned context switching is that nV is using this method, which isn't right. We can see that with GPUView and the data from the small program; that would be easy to spot. The latency doesn't have enough spikes for that; it wouldn't be a step-like plot. It would be more erratic, almost like an EKG added to the step-like plot.
 
Well, we can throw preemption and context switching out the window; both of those are used together, and they will actually increase latency, not reduce it. So what AMD's Hallock has alluded to when he mentioned context switching is that nV is using this method, which isn't right. We can see that with GPUView and the data from the small program; that would be easy to spot. The latency doesn't have enough spikes for that; it wouldn't be a step-like plot. It would be more erratic, almost like an EKG added to the step-like plot.
Uhm, I think we saw that context switching actually IS involved, even though not in the way we expected it. The context switch didn't happen when switching between async compute and graphics shaders of the same program as originally expected, but it still did happen when switching between pure compute context and graphics context, as we could observe with the starving DWM process. It actually looked like the platform wasn't even capable of preemption in that case.

We could also observe that Nvidia apparently isn't capable (yet?) of executing shaders in a sequential order in any efficient manner, while AMD apparently could execute them in parallel to some extent by ensuring an ordered memory view - whether in hardware or in software, we don't know yet.

There is still a lot we don't know, though.
For instance, we don't know yet whether the Nvidia platform is grouping async shaders into full wavefronts, how software queues relate to hardware queues, how deep the queues can be in hardware, whether any grouping into wavefronts is performed in hardware or in software, or why there is a limit to grouping at all.

The only thing we know for sure so far is that on GCN 1.0, shaders are definitely being grouped, both in "async" and "sequential" execution mode.


The next step would be to test again with multiple queues, to see for sure whether the async shaders were actually being put into the same hardware queue or into different ones.

My guess is that so far it was only a single hardware queue for both vendors, and we are actually seeing a queue depth of 32 tasks for Nvidia and up to ~128 for GCN 1.2. Still unsure why only a single wavefront was dispatched on GCN 1.2 in actual "async" mode.

If my assumptions are correct, this still means that GCN is a lot more powerful: where Nvidia could schedule at most 992 (31*32) "draw calls" (well, "async compute tasks" actually), AMD would go up to 8192 (64*128) being executed in parallel.
 
The summary is that no one really knows yet lol, but it seems like Maxwell 2 can do async; we don't really know the limitations right now. If and when the drivers are ready, that will give us a better idea.
 
Could I have someone give me an ELI5 ("explain like I'm 5") summary of this thread so far? I've really gotten confused.
Original topic:
What performance gains are to be expected from the use of async shaders and other DX12 features. Async shaders are basically additional "tasks" you can launch on your GPU to increase utilization, and thereby efficiency as well as performance, even further. Unlike in the classic model, these tasks are not performed one after another, but truly in parallel.

It happened:
One prominent developer studio, currently working on a DX12 title, tried to make use of async compute shaders for the first time. It went horribly wrong on Nvidia GPUs, while it achieved astonishing gains on AMD GPUs.

The question: What went wrong?
One guy tried to construct a minimal test suite, trying to replicate the precise circumstances under which the Nvidia GPUs failed. Many assumptions were made, but only a few held true.

There were claims originally that Nvidia GPUs wouldn't even be able to execute async compute shaders in an async fashion at all; this myth was quickly debunked.

What became clear, however, is that Nvidia GPUs preferred a much lighter load than AMD cards. At small loads, Nvidia GPUs would run circles around AMD cards. At high loads, well, quite the opposite, up to the point where Nvidia GPUs took so long to process the workload that they triggered safeguards in Windows, which caused Windows to pull the trigger and kill the driver, assuming it had gotten stuck.

Final result (for now): AMD GPUs are capable of handling a much higher load, about 10x what Nvidia GPUs can handle. But they also need about 4x the pressure applied before they get to play out their capabilities.
 
Original topic:
What performance gains are to be expected from the use of async shaders and other DX12 features. Async shaders are basically additional "tasks" you can launch on your GPU to increase utilization, and thereby efficiency as well as performance, even further. Unlike in the classic model, these tasks are not performed one after another, but truly in parallel.

It happened:
One prominent developer studio, currently working on a DX12 title, tried to make use of async compute shaders for the first time. It went horribly wrong on Nvidia GPUs, while it achieved astonishing gains on AMD GPUs.

The question: What went wrong?
One guy tried to construct a minimal test suite, trying to replicate the precise circumstances under which the Nvidia GPUs failed. Many assumptions were made, but only a few held true.

There were claims originally that Nvidia GPUs wouldn't even be able to execute async compute shaders in an async fashion at all; this myth was quickly debunked.

What became clear, however, is that Nvidia GPUs preferred a much lighter load than AMD cards. At small loads, Nvidia GPUs would run circles around AMD cards. At high loads, well, quite the opposite, up to the point where Nvidia GPUs took so long to process the workload that they triggered safeguards in Windows, which caused Windows to pull the trigger and kill the driver, assuming it had gotten stuck.

Final result (for now): AMD GPUs are capable of handling a much higher load, about 10x what Nvidia GPUs can handle. But they also need about 4x the pressure applied before they get to play out their capabilities.
What about NVIDIA apparently using preemption and not truly parallel execution, having to wait for a draw call to finish to switch context? At least that's the case for "asynchronous timewarp", which is just another case of asynchronous shaders / compute? (It's in their VR documents, linked here too, I'm sure.)
 
Preemption is being used in VR, but that's because VR needs preemption. I don't think async needs this at all, as preemption is for completely switching kernels; kernels with async should be running concurrently, right? I'm thinking you shouldn't need to shut down one kernel and go to another........ not sure though.
 
What became clear, however, is that Nvidia GPUs preferred a much lighter load than AMD cards. At small loads, Nvidia GPUs would run circles around AMD cards. At high loads, well, quite the opposite, up to the point where Nvidia GPUs took so long to process the workload that they triggered safeguards in Windows, which caused Windows to pull the trigger and kill the driver, assuming it had gotten stuck.

Final result (for now): AMD GPUs are capable of handling a much higher load, about 10x what Nvidia GPUs can handle. But they also need about 4x the pressure applied before they get to play out their capabilities.

I thought there were anomalies with Fiji GPUs as well?
Seems these have a subtly different architecture to the other models; I'd need to check, but I thought the 290X outperforms it in Ashes as well.
Cheers
 