DX12 Performance Discussion And Analysis Thread

Why is the compute queue load so regular with unlocked frame-rate?
Why would it not be regular? Most "async" compute work is still submitted at fixed engine intervals these days. Only if it were being used for streaming or something similar (more likely on the copy queue) would I expect it to be less regular.
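For concreteness, a minimal C++/D3D12 sketch of that pattern: a dedicated compute queue the engine feeds once per fixed simulation tick rather than once per rendered frame. The device, the recorded compute list and all PSO/root-signature setup are assumed to exist elsewhere; this is an illustration, not a complete frame loop.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// A dedicated COMPUTE queue, separate from the DIRECT (graphics) queue.
ComPtr<ID3D12CommandQueue> CreateComputeQueue(ID3D12Device* device)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// Called once per fixed simulation tick, not once per rendered frame --
// which is why the load on the compute queue looks so regular even with
// an unlocked frame-rate.
void SubmitSimulationStep(ID3D12CommandQueue* computeQueue,
                          ID3D12GraphicsCommandList* recordedComputeList)
{
    ID3D12CommandList* lists[] = { recordedComputeList };
    computeQueue->ExecuteCommandLists(1, lists);
}
```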
 
Oh, I did not think about the fixed timestep. Well, this makes sense; a lot of job results can easily be interpolated across the other frames. ...Now I remember that AotS, as an example, does a lot of work at a fixed timestep.

Does anyone know of any tests involving copy queues? I bet they could show a lot more with some additional love...
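If nothing public exists, here's roughly the pattern I mean, as a hedged C++/D3D12 sketch: staging data through a dedicated COPY queue so a DMA engine moves it in the background, then fencing so the graphics queue waits on the GPU instead of the CPU stalling. All of the queues, lists, resources and the fence are assumed to be created elsewhere.

```cpp
#include <d3d12.h>

void StreamBufferAsync(ID3D12CommandQueue*        copyQueue,          // COPY-type queue
                       ID3D12CommandQueue*        graphicsQueue,      // DIRECT-type queue
                       ID3D12GraphicsCommandList* copyList,           // COPY-type command list
                       ID3D12Resource*            uploadHeapBuffer,   // CPU-visible staging buffer
                       ID3D12Resource*            defaultHeapBuffer,  // GPU-local destination
                       ID3D12Fence*               copyFence,
                       UINT64&                    copyFenceValue,
                       UINT64                     byteCount)
{
    // Record the transfer on the copy list and submit it to the copy queue.
    copyList->CopyBufferRegion(defaultHeapBuffer, 0, uploadHeapBuffer, 0, byteCount);
    copyList->Close();

    ID3D12CommandList* lists[] = { copyList };
    copyQueue->ExecuteCommandLists(1, lists);

    // GPU-side sync: the graphics queue waits on the fence, so the DMA engine
    // can overlap the transfer with whatever the 3D and compute queues are doing.
    copyQueue->Signal(copyFence, ++copyFenceValue);
    graphicsQueue->Wait(copyFence, copyFenceValue);
}
```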
 
Dynamic load balancing is a thing - yes, and it is a hardware feature. But it's nowhere near the same as, or even remotely comparable to, GCN's async execution via the independent command lists dispatched by the ACE units.

Dynamic load balancing is only for efficiently switching between compute and graphics workloads inside a single command list, i.e. for eliminating the need for a full command buffer flush every time the partition scheme changes.
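For clarity, a rough C++/D3D12 sketch of what "inside a single command list" refers to (the PSOs, root signatures and dispatch/draw sizes are placeholders): compute and graphics recorded back-to-back on one DIRECT list, which is exactly the transition dynamic load balancing is meant to make cheap -- not two independent queues running concurrently.

```cpp
#include <d3d12.h>

void RecordMixedWork(ID3D12GraphicsCommandList* cmdList,
                     ID3D12PipelineState*       computePso,
                     ID3D12PipelineState*       graphicsPso)
{
    // Compute pass (e.g. light culling)...
    cmdList->SetPipelineState(computePso);
    cmdList->Dispatch(64, 64, 1);

    // ...immediately followed by graphics work on the same list. The GPU has
    // to repartition itself between the two workload types at this boundary.
    cmdList->SetPipelineState(graphicsPso);
    cmdList->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    cmdList->DrawInstanced(3, 1, 0, 0);
}
```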

Pascal fully supports Async Compute unlike Maxwell.

Can you further explain what the difference between AMD's and nVidia's way of Async Compute is?
 
Maxwell can do async compute; concurrency is its problem after the initial SMX scheduling.

There is no difference in async compute on either card; the difference is concurrent kernel execution on the same SMX.
Actually, also on Pascal, scheduling is done on a per-SM basis. So no mixing of workloads intra-SM.
 
Yeah, they can dynamically change the number of SMs assigned to compute and graphics.
I'm assuming dynamically means they use that preemption technique to suspend and flush everything, reallocate, then resume? Not sure how dynamic that would really be in practice or if suspended jobs can be assigned to different SMs.
 
They aren't using preemption. For such a thing (even though Pascal's preemption is faster, i.e. lower latency, than Maxwell's and finer grained) the latency would be pretty bad, because you would have to re-partition the SMX scheduling more often than not.
 
They aren't using preemption. For such a thing (even though Pascal's preemption is faster than Maxwell's and finer grained) the latency would be pretty bad, because you would have to re-partition the SMX scheduling more often than not.
That's sort of what I'm getting at. The same capability that enables the preemption is likely what they would use to dynamically reallocate the SMs. Previously they'd have to flush the pipeline, now they can suspend and resume. Possibly reassign, but I'd need to read up on the specifics a bit more. Point being, I'm not sure just how "dynamic" the balancing would be. Definitely an improvement over Maxwell, probably enough to keep performance acceptable or positive, but likely not on par with what GCN does or really able to benefit from concurrency. If TMUs were decoupled from SMs and able to be reassigned similarly to how ROPs work, they should see a benefit.
 
Actually, also on Pascal, scheduling is done on a per-SM basis. So no mixing of workloads intra-SM.
An SM can concurrently compute multiple independent CUDA kernels, even on Kepler and Maxwell, as long as the resources (registers and shared memory) are available. But async compute with mixed graphics and compute tasks is what we're all really talking about and it's still unclear what the Pascal hardware can do, what the current drivers have exposed, and what they might be able to update in the future with software. And P100 may be different than GP104. One new feature is documented in the P100 whitepaper: P100 can assign priorities for preemptive compute and graphics tasks, evicting running kernels and restoring them after the high priority task is finished. It's unknown if this feature is in GP104. But again preemption support is a different feature than mixing graphics and compute kernels on one SM.
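On the CUDA side, the mechanism exposed for this is stream priorities; here's a hedged host-side sketch in plain C++ against the CUDA runtime API (no kernel code shown) of how the high- and low-priority streams would be set up. Whether the hardware honours the priority via SM-level scheduling or full preemption is exactly the open question.

```cpp
#include <cuda_runtime_api.h>

void CreatePrioritizedStreams(cudaStream_t* lowPrioStream, cudaStream_t* highPrioStream)
{
    // Note: in CUDA, a numerically smaller value means a higher priority.
    int leastPriority = 0, greatestPriority = 0;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStreamCreateWithPriority(lowPrioStream,  cudaStreamNonBlocking, leastPriority);
    cudaStreamCreateWithPriority(highPrioStream, cudaStreamNonBlocking, greatestPriority);

    // Independent kernels launched into these streams may run concurrently on
    // the same GPU as long as registers and shared memory allow; kernels in the
    // high-priority stream should displace work from the low-priority one.
}
```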

Preemption overhead is on the order of 10us, pretty similar to a kernel launch overhead. Likely slightly faster on P100 because of the higher bandwidth memory. So doing hundreds of preemptions per second won't be a performance problem, but 10,000 per second would be. But again, that's not the same as async, which is not quite the same as mixing compute and graphics on the same SM, and it may be different on P100 versus GP104. (It gets confusing to me since I'm a compute-only CUDA guy.)
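A quick back-of-the-envelope (my arithmetic, not a measurement) on why the 10,000-per-second case hurts:

```cpp
// ~10 us per preemption, applied to the two rates mentioned above.
constexpr double kPreemptOverheadSeconds = 10e-6;
constexpr double overheadAtHundreds = 300.0   * kPreemptOverheadSeconds; // ~3 ms lost per second: negligible
constexpr double overheadAtTenK     = 10000.0 * kPreemptOverheadSeconds; // ~100 ms lost per second: ~10% of GPU time
```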
 
It's unknown if this feature is in GP104. But again preemption support is a different feature than mixing graphics and compute kernels on one SM.

Preemption overhead is on the order of 10us, pretty similar to a kernel launch overhead. Likely slightly faster on P100 because of the higher bandwidth memory. So doing hundreds of preemptions per second won't be a performance problem, but 10,000 per second would be. But again, that's not the same as async, which is not quite the same as mixing compute and graphics on the same SM, and it may be different on P100 versus GP104. (It gets confusing to me since I'm a compute-only CUDA guy.)
I think we are seeing the same situation as with Kepler, just with the model numbers structured slightly differently, meaning this time the 1080 is a reduced Pascal and we will need a Titan/Ti model to get the functionality seen with P100.
Originally with Kepler, the whole 78x-Titan range also had the functionality of the Tesla model, such as dynamic parallelism and Hyper-Q, while lower models did not - it was originally reported that the 78x models did not have those functions enabled, but later reports clarified that they did.
So we will need to wait for the GP102, which IMO will be shared across Tesla/Quadro/consumer the same way GK110 was, while leaving the GP100 purely as a flagship FP64/mixed-compute accelerator.

While Nvidia may say it is not paying attention to AMD, if big Vega has good figures for FP32/FP16 then GP102 has to respond to that, not just from a consumer perspective but on the business/deep-learning/workstation side as well.
Which is why I see GP102 spanning all three of NVIDIA's business sectors and being more heavily focused on its FP32/FP16 CUDA cores (still with moderate DP).

Cheers
 
That's sort of what I'm getting at. The same capability that enables the preemption is likely what they would use to dynamically reallocate the SMs. Previously they'd have to flush the pipeline, now they can suspend and resume. Possibly reassign, but I'd need to read up on the specifics a bit more. Point being, I'm not sure just how "dynamic" the balancing would be. Definitely an improvement over Maxwell, probably enough to keep performance acceptable or positive, but likely not on par with what GCN does or really able to benefit from concurrency. If TMUs were decoupled from SMs and able to be reassigned similarly to how ROPs work, they should see a benefit.

Pascal shouldn't be preempting itself here; there aren't any interruptions visible in GPUView.

When I look at MDolenc's async compute test, Pascal behaves exactly like GCN, with a direct and a compute queue running concurrently.
 
Using preemption as the basic async compute mechanism makes no sense. What's there to preempt if half of the GPU is sitting idle waiting for some work to be scheduled on it? That's yet another myth repeated over and over again by the usual suspects who are desperate to prove Pascal doesn't support async compute. It does, deal with it.
 
Using preemption as the basic async compute mechanism makes no sense. What's there to preempt if half of the GPU is sitting idle waiting for some work to be scheduled on it? That's yet another myth repeated over and over again by the usual suspects who are desperate to prove Pascal doesn't support async compute. It does, deal with it.

I don't think the question in this forum is IF it supports it, but HOW it does it...

Seeing that it supports it is good; knowing how it does it and what to expect from it is better.
 
I'd like to believe that the discussion has moved from if to how...

Bah, with Pascal the question is not so much whether it can achieve it in practice, but how it does it... because it seems it can. As for the benefit, the way it achieves it casts some doubt on the method, and so some doubt on its performance.

Especially as Nvidia hasn't communicated much about this, at least on the technical side...
 