DX12 Performance Discussion And Analysis Thread

I'd like to believe that the discussion has moved from "if" to "how".
By a stretched definition even Maxwell supports async. Interleave compute and graphics and it's asynchronous. Results are ugly, but it's async.

Using preemption as the basic async compute mechanism makes no sense. What's there to preempt if half of the GPU is sitting idle, waiting for some work to be scheduled on it? That's yet another myth repeated over and over again by the usual suspects who are desperate to prove Pascal doesn't support async compute. It does, deal with it.
So you're saying a single SM can execute compute and graphics concurrently? Because all the evidence we've seen so far suggests otherwise. Schedule a single long running compute job to each SM and what happens when you attempt to run graphics? The current design segments the GPU into two sections, compute and graphics, and pushes tasks to their respective areas. When those ratios change, things get interesting. It's far more useful to know how a feature works and what its limitations are than to just assume it's a checkbox feature. As implemented, it should work well for a feature like ATW, where it mostly occurs once per frame. Get a physics engine dispatching jobs at a different framerate, asynchronously, and you'll likely have problems.
 
By a stretched definition even Maxwell supports async. Interleave compute and graphics and it's asynchronous. Results are ugly, but it's async.
And yet I still haven't been able to see these ugly results, or figure out how to actually get to them.

Schedule a single long running compute job to each SM and what happens when you attempt to run graphics?
That's been done and it works fine (concurrently). It also works fine if you schedule more than one long running compute job to each SM.
And even GCN is not all that powerful. It's actually quite easy to kick it in a way where you won't see any performance benefit from concurrent execution.
 
So you're saying a single SM can execute compute and graphics concurrently? Because all the evidence we've seen so far suggests otherwise.
I am not saying that, because we have no evidence.

I don't know how they schedule work but I do know you don't need preemption and you don't need an SM capable of running compute and graphics tasks at the same time in order to co-schedule graphics and compute work on a many-SM GPU.
 
That's been done and it works fine (concurrently). It also works fine if you schedule more than one long running compute job to each SM.
And even GCN is not all that powerful. It's actually quite easy to kick it in a way where you won't see any performance benefit from concurrent execution.
A quick question to MDolenc about compute queue latency on PC. People are mostly talking about concurrency (to increase GPU utilization) but high priority compute queues also have other use cases.

I am interested in this use case:
- The main rendering queue is filled with rendering tasks (buffering up to 2 frames). Might be either vsync locked or not (depending on settings). Basic game rendering (PS/VS) + some compute. Nothing special.
- There's a separate high priority compute queue (D3D12_COMMAND_QUEUE_PRIORITY_HIGH) for game logic tasks. It is driven from a different CPU thread than the rendering, and there is no frame lock between logic and rendering.
- Each game logic compute task is cheap (less than 1 millisecond).

I am interested to know whether this use case is practical on various PC hardware (NV/AMD/Intel). What kind of latency should I expect to see (dispatch + readback data to CPU) based on the results you have seen? Do the high priority compute queues actually work (in reducing CPU->GPU->CPU latency when the main render queue is filled with work)?
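For reference, a rough D3D12 sketch of the queue setup described above (illustrative only, not from any real codebase; error handling, device creation and the fence/readback plumbing are omitted):

Code:
// One default-priority direct queue for rendering, plus a high priority
// compute queue for the latency-sensitive game logic dispatches.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& renderQueue,
                  ComPtr<ID3D12CommandQueue>& hiPrioComputeQueue)
{
    D3D12_COMMAND_QUEUE_DESC renderDesc = {};
    renderDesc.Type     = D3D12_COMMAND_LIST_TYPE_DIRECT;
    renderDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    device->CreateCommandQueue(&renderDesc, IID_PPV_ARGS(&renderQueue));

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    computeDesc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&hiPrioComputeQueue));
}

// The high priority queue would then be fed from the game logic thread, with
// each <1 ms dispatch followed by a fence signal and a CPU readback.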
 
That's been done and it works fine (concurrently). It also works fine if you schedule more then one long running compute job to each SM.
And even GCN is not that all powerful. It's actually quite easy to kick it in a way that you won't see any performance benefit of concurrent execution.
Concurrently on an intra-SM level? I'm really curious to actually know, because my understanding was that they vary the ratio of graphics:compute SM assignments each frame. An SM still needs to be reconfigured, to my understanding, which would mean it has to flush and possibly affect other SMs when the switch occurs?

I don't know how they schedule work but I do know you don't need preemption and you don't need an SM capable of running compute and graphics tasks at the same time in order to co-schedule graphics and compute work on a many-SM GPU.
This wouldn't be preemption, but it's based on the same ability that enables it. Co-scheduling on a many-SM device isn't the issue here. It's whether the hardware can be reconfigured when the workload changes.

sebbbi's use case is the one I'm interested in as well. But I should add that some of the compute tasks will be arriving in irregular, unpredictable bursts. Say, evaluating bullet collisions when the action starts, coupled with an async timewarp. There are quite a few other possibilities as well. So how quickly could all of the SMs evaluating graphics switch to compute and return results?
 
Concurrently on an intra-SM level? I'm really curious to actually know, because my understanding was that they vary the ratio of graphics:compute SM assignments each frame. An SM still needs to be reconfigured, to my understanding, which would mean it has to flush and possibly affect other SMs when the switch occurs?

"Intra-SM" level scheduling doesnt make sense on nVidia. With Maxwell every SM is occupied by either graphics or compute workload. Every SM is binded to a geometry pipeline and four pixel pipelines. A full GM200 chip has 24 SM, 96 ROPs and 6 GPCs which results in 96 pixel per clock. The chip isnt able to schedule additional workload without reconfiguration of the partition.

nVidia changed the SM-per-GPC ratio with Pascal. Instead of four SMs, Pascal has five SMs per GPC. While a GPC can still output 16 pixels per clock, only four of the five SMs are used for graphics operations.
That leaves up to 25% additional compute capacity on Pascal for async compute, since Pascal has an uneven ratio between graphics and compute. With dynamic load balancing the chip is able to redirect the compute workload from four SMs to all SMs after the graphics workload has finished.
 
"With Maxwell every SM is occupied by either graphics or compute workload.
It's far more limited than that. Maxwell (and earlier generations) switch between graphics and compute loads on an even coarser, whole-device basis, not on an SM or even GPC basis. This is evidenced by the watchdog timeout kernel killer for CUDA. Start a 1-block CUDA kernel that takes multiple seconds to run, and your display will freeze during execution despite the fact that all but one SM have no allocated task. And after 3-5 seconds the kernel will be killed to allow display updates. If there is no display attached to the GPU, there is no timeout killer.
 
A quick question to MDolenc about compute queue latency on PC. People are mostly talking about concurrency (to increase GPU utilization) but high priority compute queues also have other use cases.

I am interested in this use case:
- The main rendering queue is filled with rendering tasks (buffering up to 2 frames). Might be either vsync locked or not (depending on settings). Basic game rendering (PS/VS) + some compute. Nothing special.
- There's a separate high priority compute queue (D3D12_COMMAND_QUEUE_PRIORITY_HIGH) for game logic tasks. It is driven from a different CPU thread than the rendering, and there is no frame lock between logic and rendering.
- Each game logic compute task is cheap (less than 1 millisecond).

I am interested to know whether this use case is practical on various PC hardware (NV/AMD/Intel). What kind of latency should I expect to see (dispatch + readback data to CPU) based on the results you have seen? Do the high priority compute queues actually work (in reducing CPU->GPU->CPU latency when the main render queue is filled with work)?
I checked this today real quick. Added a new case to the sample, so the scenario is a bit different:
- main queue renders to an offscreen target (no buffering of frames, trivial VS/PS) - 128 draws.
- there's a high priority queue that executes a compute kernel after a 10 ms delay.
Seems to work on GCN only. That is, on a 380X graphics finishes in 70 ms and compute in 1.5 ms. Reaction time (from issue to completion signal on the high priority queue, minus average kernel runtime) seems to be in the 0.2-0.5 ms range. Checked on Maxwell and HD 4600, and in both cases the high priority queue only kicked in after the graphics queue was done.
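A rough sketch of how that kind of reaction-time measurement can be structured (not the actual test source, just an illustration; the kernel's own runtime would be subtracted separately, e.g. from timestamp queries):

Code:
// Measure the time from submitting work on the high priority compute queue
// until its fence is signalled, while the graphics queue is busy drawing.
#include <windows.h>
#include <d3d12.h>

double MeasureReactionMs(ID3D12CommandQueue* hiPrioQueue,
                         ID3D12GraphicsCommandList* computeList,
                         ID3D12Fence* fence, UINT64 fenceValue, HANDLE event)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    ID3D12CommandList* lists[] = { computeList };
    QueryPerformanceCounter(&t0);
    hiPrioQueue->ExecuteCommandLists(1, lists);
    hiPrioQueue->Signal(fence, fenceValue);

    fence->SetEventOnCompletion(fenceValue, event);
    WaitForSingleObject(event, INFINITE);   // returns once the GPU signals
    QueryPerformanceCounter(&t1);

    // Subtract the kernel's own runtime from this to get the reaction time.
    return 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}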
 
Seems to work on GCN only. That is, on a 380X graphics finishes in 70 ms and compute in 1.5 ms. Reaction time (from issue to completion signal on the high priority queue, minus average kernel runtime) seems to be in the 0.2-0.5 ms range. Checked on Maxwell and HD 4600, and in both cases the high priority queue only kicked in after the graphics queue was done.
Big thanks for doing the test! Great results for GCN (as expected from a console dev). I am currently evaluating my options about implementing some gameplay GPGPU code on PC. I hoped that DX12 / Vulkan high priority compute queues would make this practical.

I hoped that the Nvidia GPUs would stop the rendering when either: A) the command processor fetches the next draw call (or packet of draws), or B) the next "wait for idle" event occurs (for example switching from RT/UAV write to read). There shouldn't be a good reason NOT to switch to high priority compute after a "wait for idle" (as the GPU is already fully idle)... except that Fermi/Kepler (Maxwell?) incur an additional penalty when switching the GPU mode between graphics<->compute. I guess it needs to flush all the caches in addition to the "wait for idle", as it re-purposes parts of the cache as scratchpad for graphics stuff. As far as I understood, Intel also re-purposes parts of their L3 cache in graphics/compute modes (LDS is reserved from the L3 cache).

Could you retest with the graphics queue workload consisting solely of compute kernels? Does Kepler perform the high priority compute queue task immediately if the render queue also happens to be running a compute kernel at the same time? My rendering code is mostly compute shaders. If the high priority queue works in this case, then it is good enough for my purposes.

Also, I would be interested to know whether the concurrent execution of compute queues is working on Nvidia (Kepler, Maxwell) and Intel when the graphics queue is only running compute shaders. As far as I know, there shouldn't be any technical issues blocking this use case. But the drivers might not take advantage of this yet (as it is a rather uncommon use case). Most developers are still rendering triangles after all...
 
Big thanks for doing the test! Great results for GCN (as expected from a console dev). I am currently evaluating my options about implementing some gameplay GPGPU code on PC. I hoped that DX12 / Vulkan high priority compute queues would make this practical.

I hoped that the Nvidia GPUs would stop the rendering when either: A) the command processor fetches the next draw call (or packet of draws), or B) the next "wait for idle" event occurs (for example switching from RT/UAV write to read). There shouldn't be a good reason NOT to switch to high priority compute after a "wait for idle" (as the GPU is already fully idle)... except that Fermi/Kepler (Maxwell?) incur an additional penalty when switching the GPU mode between graphics<->compute. I guess it needs to flush all the caches in addition to the "wait for idle", as it re-purposes parts of the cache as scratchpad for graphics stuff. As far as I understood, Intel also re-purposes parts of their L3 cache in graphics/compute modes (LDS is reserved from the L3 cache).

Could you retest with the graphics queue workload consisting solely of compute kernels? Does Kepler perform the high priority compute queue task immediately if the render queue also happens to be running a compute kernel at the same time? My rendering code is mostly compute shaders. If the high priority queue works in this case, then it is good enough for my purposes.

Also, I would be interested to know whether the concurrent execution of compute queues is working on Nvidia (Kepler, Maxwell) and Intel when the graphics queue is only running compute shaders. As far as I know, there shouldn't be any technical issues blocking this use case. But the drivers might not take advantage of this yet (as it is a rather uncommon use case). Most developers are still rendering triangles after all...
I'll have to revise my statement about Maxwell. It does interrupt the graphics queue with a high priority queue, but it's a bit more jumpy. First there's the need to finish the current draw call (and in my test they are quite long). Second there's some fighting with DWM, I guess. Simply having Firefox with YouTube open completely messed up my test (though I imagine fullscreen exclusive will make this a non-issue).

With regard to compute: I have checked on Kepler with two normal queues and compute-only kernels. They can freely overtake one another, so I don't think there would be any weirdness with high priority queues.
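A rough sketch of that kind of two-queue compute check (again illustrative only, not the actual test code): two normal-priority compute queues, each with its own fence, so the order in which completions land can be observed.

Code:
// Two normal-priority compute-only queues; polling each queue's fence shows
// whether dispatches from one queue overtake the other.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

struct ComputeQueue
{
    ComPtr<ID3D12CommandQueue> queue;
    ComPtr<ID3D12Fence>        fence;
    UINT64                     nextValue = 1;
};

ComputeQueue MakeComputeQueue(ID3D12Device* device)
{
    ComputeQueue q;
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q.queue));
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&q.fence));
    return q;
}

// Submit a pre-recorded compute command list and signal the queue's fence.
void Submit(ComputeQueue& q, ID3D12CommandList* list)
{
    q.queue->ExecuteCommandLists(1, &list);
    q.queue->Signal(q.fence.Get(), q.nextValue++);
}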

Off to DX12 Performance Discussion And Analysis Thread. :)
 
Updated AsyncCompute test. Changes:
- shorter compute shader
- thanks to Jawed's suggestion (loop within a loop), compute performance should be more comparable between GeForce and Radeon
- latency test where a high priority compute kernel jumps into the middle of the graphics queue

P.S.: Removed CUDA dependency.
 

Attachments

  • AsyncCompute.zip
Me, too. But Google helped to fix it.

However, the program crashes within the compute test at kernel 115 on Pascal...

/edit: After ten tries or so I was able to finish the run with a GTX 1080 @ 1924 MHz.
 

Attachments

  • perf_gtx1080.txt
I'll have to revise my statement about Maxwell. It does interrupt the graphics queue with a high priority queue, but it's a bit more jumpy. First there's the need to finish the current draw call (and in my test they are quite long). Second there's some fighting with DWM, I guess. Simply having Firefox with YouTube open completely messed up my test (though I imagine fullscreen exclusive will make this a non-issue).

With regard to compute: I have checked on Kepler with two normal queues and compute-only kernels. They can freely overtake one another, so I don't think there would be any weirdness with high priority queues.

Off to DX12 Performance Discussion And Analysis Thread. :)
Good to know. So graphics + high prio compute queue is working fine on Maxwell (as long as you use the compute queue to reduce GPGPU latency). Someone needs to test this on Intel (Haswell, Broadwell, Skylake). If Kepler is the only problem case (Fermi will likely never get DX12 drivers), it might be OK to just accept an extra frame of latency for those players.

MODS: Please move our posts to the DX12 performance analysis thread. This discussion is mostly OT in this thread.
 
In case anybody finds this of interest, I ran it on a GT2 Haswell (4210U).
 

Attachments

  • perf_4210U_HSWGT2.txt
Here are the results for a Titan X card (Maxwell Gen2), if anyone is interested...
Driver 368.39
 

Attachments

  • perf_titanx.txt
980Ti @ 1400MHz, 368.39 driver.

 

Attachments

  • 980ti_1400mhz.txt