> Yes, GCN requires at least 4 wavefronts per CU to become fully utilized; up to 10 are possible, each wavefront executed in lockstep.

10 wavefronts per CU on GCN 1.0, but they can initiate the work on a different CU (crossbar). That said, the scalar engine surely plays a role in it; on GCN 1.2 the scalar unit can be used to borrow the ALUs of another CU to balance the load. Remember that GCN likes to be overloaded, not the other way around.
> Yes, GCN requires at least 4 wavefronts per CU to become fully utilized; up to 10 are possible, each wavefront executed in lockstep.

FYI, it is actually up to 32/40 wavefronts per CU, depending on the register allocation required by the kernel.
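To make those limits concrete, here is a rough occupancy estimate in C++. It is only a sketch under assumed numbers: 256 VGPRs per SIMD, 4 SIMDs per CU, a maximum of 10 waves per SIMD, and a VGPR allocation granularity of 4; SGPR and LDS limits are ignored entirely.

```cpp
#include <algorithm>
#include <cstdio>

// Rough GCN occupancy estimate (VGPR-limited only).
// Assumptions: 256 VGPRs per SIMD, 4 SIMDs per CU, max 10 waves per SIMD,
// VGPRs allocated in blocks of 4. SGPR and LDS limits are ignored here.
int wavesPerSimd(int vgprsPerWave) {
    const int kVgprsPerSimd = 256;
    const int kMaxWavesPerSimd = 10;
    const int kGranularity = 4;
    int alloc = ((vgprsPerWave + kGranularity - 1) / kGranularity) * kGranularity;
    return std::min(kMaxWavesPerSimd, kVgprsPerSimd / alloc);
}

int main() {
    for (int vgprs : {24, 32, 64, 84, 128}) {
        int perSimd = wavesPerSimd(vgprs);
        std::printf("%3d VGPRs/wave -> %2d waves per SIMD, %2d per CU\n",
                    vgprs, perSimd, perSimd * 4);  // 4 SIMDs per CU
    }
    return 0;
}
```

Under these assumptions, kernels using 24 or fewer VGPRs per wave hit the 10-per-SIMD (40-per-CU) cap, and heavier kernels fall off quickly from there.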
> But what I'm actually wondering about is how they got scheduled in the first place. With only 2 single-queue ACEs, and the workload consisting of single threads only, I wouldn't have expected them to form a full wavefront of 64 threads in the first place.

Queues take kernel dispatches (each a single command) over a multi-dimensional grid, submitted by the CPU or other agents. The GPU hardware scheduler then divides a large dispatch into workgroups, and divides the workgroups into wavefronts of 64 lanes.
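As a sketch of that division (the grid and workgroup sizes below are made up for illustration; the hardware scheduler does this internally):

```cpp
#include <cstdio>

// Sketch: how a grid dispatch breaks down into workgroups and 64-lane
// wavefronts. Example sizes are arbitrary.
int main() {
    const unsigned long long wavefrontSize = 64;
    unsigned long long grid[3]  = {1024, 1024, 1};  // threads per axis
    unsigned long long group[3] = {16, 16, 1};      // workgroup size

    unsigned long long workgroups = 1, threadsPerGroup = 1;
    for (int i = 0; i < 3; ++i) {
        workgroups *= (grid[i] + group[i] - 1) / group[i];  // ceil division
        threadsPerGroup *= group[i];
    }
    unsigned long long wavesPerGroup =
        (threadsPerGroup + wavefrontSize - 1) / wavefrontSize;

    std::printf("%llu workgroups, %llu wavefronts each, %llu wavefronts total\n",
                workgroups, wavesPerGroup, workgroups * wavesPerGroup);
    return 0;
}
```

A dispatch of a single thread still occupies one full wavefront, with the remaining lanes idle.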
64. 5.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39]
65. 10.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37]
66. 10.98ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37 10.78]
> FYI, it is actually up to 32/40 wavefronts per CU, depending on the register allocation required by the kernel.

Oops, I missed the earlier discussions. Sorry about that.
*snip*
Your results are like the other AMD version 6 GPUs (original GCN), with fairly erratic timings. I expect your card will behave like Benny-ua's card.
> Still can't upload attachments -- does it require Flash, or is it a Firefox thing?

I haven't done an upload, but Firefox lets me press the "Upload a File" button when I'm in the "More Options" view for writing a reply.
> FYI, it is actually up to 32/40 wavefronts per CU, depending on the register allocation required by the kernel.

Thanks, I got the per-CU / per-SIMD limits mixed up.
> IMO there is a maximum number of kernels that each compute pipeline scheduler can handle.

If that were the case, we should see different levels of parallelism depending on the number of CUs per card, but we actually don't. We don't even see any difference between GCN 1.0 and 1.2 for a true async workload, and that's really odd.
> Looks like Nvidia pretty much distributed the asynchronous dispatching at the multiprocessor level, in a similar manner to tessellation.

I'm dubious that the intra-SMM issue/scheduling of hardware threads is relevant here, as I reckon that's handling work that has already been distributed internally by the hardware to the SMM.
Request for a favour from the AMD testers: I would like to see what happens when the code runs faster with an unrolled loop.
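For reference, this is the kind of change meant by unrolling. The real test kernel is an HLSL compute shader, so this C++ version only illustrates the transformation, not the actual benchmark code:

```cpp
#include <cstddef>
#include <cstdio>

// Simple loop: one add per iteration.
float sumSimple(const float* v, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += v[i];
    return acc;
}

// Manually unrolled by 4: fewer branches, more independent work per iteration.
float sumUnrolled4(const float* v, std::size_t n) {
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i + 0];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    float acc = a0 + a1 + a2 + a3;
    for (; i < n; ++i)   // handle the remainder
        acc += v[i];
    return acc;
}

int main() {
    float data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::printf("%f %f\n", sumSimple(data, 10), sumUnrolled4(data, 10));
    return 0;
}
```

In HLSL the same effect can usually be requested with the [unroll] attribute on the loop.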
> Well, we can throw preemption and context switching out the window; both are used together, and they will actually increase latency, not reduce it. So what AMD's Hallock alluded to when he mentioned context switching is that nV is using this method, which isn't right. We can see that with GPUView and the data from the small program; it would be easy to spot. The latency doesn't have enough spikes for that, and it wouldn't be a step-like plot; it would be more erratic, almost like an EKG added to the step-like plot.

Uhm, I think we saw that context switching actually IS involved, even though not in the way we expected it. The context switch didn't happen when switching between async compute and graphics shaders of the same program, as originally expected, but it still happened when switching between a pure compute context and the graphics context, as we could observe with the starving DWM process. It actually looked like the platform wasn't even capable of preemption in that case.
> Could I have someone give me an ELI5 ("explain like I'm 5") summary of this thread so far? I've really gotten confused.

I'm hoping this silence is a precursor to the ELI5s that are being made ready as I type this.

Original topic:
What performance gains are to be expected from the use of async shaders and other DX12 features? Async shaders are basically additional "tasks" you can launch on your GPU to increase utilization, and thereby efficiency as well as performance, even further. Unlike in the classic model, these tasks are not performed one after another, but truly in parallel (see the queue sketch after this summary).
IT happened:
One prominent developer studio, currently working on a DX12 title, tried to make use of async compute shaders for the first time. It went horribly wrong on Nvidia GPUs, while it achieved astonishing gains on AMD GPUs.
The question: What went wrong?
One guy tried to construct a minimal test suite, trying to replicate the precise circumstances under which the Nvidia GPUs failed. Many assumptions were made, but only a few held true.
There were claims originally that Nvidia GPUs wouldn't even be able to execute async compute shaders in an async fashion at all; this myth was quickly debunked.
What became clear, however, is that Nvidia GPUs prefer a much lighter load than AMD cards. At small loads, Nvidia GPUs would run circles around AMD cards. At high load, well, quite the opposite, up to the point where Nvidia GPUs took such a long time to process the workload that they triggered safeguards in Windows, which then pulled the trigger and killed the driver, assuming it had got stuck.
Final result (for now): AMD GPUs are capable of handling a much higher load, about 10x what Nvidia GPUs can handle. But they also need about 4x the pressure applied before they get to play out their capabilities.
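To make "additional tasks launched in parallel" concrete: in D3D12 the application creates separate command queues, typically a direct (graphics) queue plus a compute queue, and work submitted to the compute queue may overlap with graphics work. Whether it actually runs concurrently is up to the driver and hardware, which is exactly what this thread is probing. A minimal Windows/C++ sketch of just the queue setup (error handling omitted, no shaders dispatched, not the benchmark code from this thread):

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
#include <cstdio>
#pragma comment(lib, "d3d12.lib")

using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    if (FAILED(D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0,
                                 IID_PPV_ARGS(&device)))) {
        std::printf("No D3D12 device available.\n");
        return 1;
    }

    // A "direct" queue accepts graphics, compute and copy work.
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> gfxQueue;
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    // A separate compute queue: command lists submitted here are the
    // "async compute" work that may overlap with the graphics queue.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    std::printf("Created one graphics queue and one compute queue.\n");
    // Real code would record command lists, call ExecuteCommandLists on each
    // queue, and synchronize them with ID3D12Fence where dependencies exist.
    return 0;
}
```

The AMD vs. Nvidia differences discussed above are about what the hardware does with work on those two queues once it arrives, not about this API surface.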
> Final result (for now): AMD GPUs are capable of handling a much higher load, about 10x what Nvidia GPUs can handle. But they also need about 4x the pressure applied before they get to play out their capabilities.

What about NVIDIA apparently using preemption and not truly parallel execution, and having to wait for the draw call to finish to switch contexts? At least that's the case for "asynchronous timewarp", which is just another case of asynchronous shaders / compute? (It's in their VR documents, linked here too, I'm sure.)