DX12 Performance Discussion And Analysis Thread

Am I reading this wrong, because it looks like a lot of Maxwell cards save a lot of time using async? And the Furys spend a lot of time saving virtually nothing? And then the 280X saves more time than anything?
That doesn't seem to be the case at all.
Would you mind linking which chart(s) in particular you're looking at?

For Maxwell cards (here or here or here), the orange dots are mostly on 0, with a few dots managing to jump up, seemingly at fixed intervals. Never anywhere close to blue or red.

For GCN cards (here or here or here) the orange dots are mostly on either blue or red (whichever is lower).
They're on red at the beginning, where red is lower than blue, and stay on blue afterwards once red has increased above blue.

Indeed, Fury X (here or here) is looking very weird. The orange dots either stay on zero or jump all the way up to blue (or red, whichever's lower), but never anywhere in between.
Another run using the older version of the test shows the orange dots being either 0, 25%, 50%, 75%, or 100% of the blue dots.
 
Ok, so digging a little deeper, I wasn't sure what to make of the DWM thing. I did scroll back earlier and notice that when the "compute" task isn't running, the DWM usage is this tiny little sliver. So maybe when the compute task is active it just gets dragged out over time and looks like a much heavier operation than it is. Anyway, still not sure what the deal is with that. But I wanted to know what was going on with that compute queue in general, so I tried a few games out.

I tried AC Unity (DX11): no activity in that queue except DWM. Downloaded the Elemental demo with DX11/DX12 - I've no way to confirm whether or not it's running in DX12, or if it even supports async compute yet - but either way, nothing in the compute queue. Fired up Killing Floor 2 with PhysX Flex... finally the compute queue lit up like a Christmas tree with something other than DWM.exe. Very heavy usage. Tried another PhysX game, Batman: Arkham Origins... again, lots of compute activity coming from it.

So PhysX is clearly able to access that compute queue. And seeing that on the Fury, when the two operations were in two separate queues, there was some clean asynchronous behavior, I presume getting the work into the two queues is a prerequisite for async compute to function. But right now only PhysX is using it, at least from what I've tested. I can't imagine that the compute queue is *only* available for PhysX; it doesn't really make sense that DX12 wouldn't be able to access it. I've no idea what PhysX is doing to get into the compute queue that this asynccompute test isn't, but IMO that's the main thing that needs to be identified to get to the bottom of this. Then, when we've got a graphics task and a compute task that's actually in the compute queue in DX12 on Maxwell, maybe we'll be able to see if Maxwell is able to do async compute on some level.
 
Ok, so I also tested out some random OpenCL demo and the OpenCL benchmark LuxMark... both showed compute queue usage. Especially LuxMark, which was really heavy. So it's not just limited to PhysX or CUDA - if OpenCL can access it, anything should be able to. So what is this asynccompute test doing, or not doing, such that its compute work isn't getting sent to the compute queue?
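For reference, here's roughly what routing work to the compute engine looks like at the API level - a minimal sketch I put together, not taken from the actual asynccompute test (default adapter assumed, error handling omitted, link against d3d12.lib). The point is that dispatches presumably only show up on a separate queue in GPUView if they go through a queue created with D3D12_COMMAND_LIST_TYPE_COMPUTE instead of the direct/graphics queue:

```cpp
// Minimal sketch: creating and using a dedicated compute queue in D3D12.
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

int main()
{
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // The direct queue feeds the 3D/graphics engine; compute submitted here
    // shows up as graphics work, which is what GPUView seems to show on Maxwell.
    D3D12_COMMAND_QUEUE_DESC directDesc = {};
    directDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> directQueue;
    device->CreateCommandQueue(&directDesc, IID_PPV_ARGS(&directQueue));

    // A separate queue of type COMPUTE feeds the compute engine; dispatches
    // submitted here are what should light up the compute row in GPUView.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    // Compute work is recorded on a COMPUTE-type allocator/command list and
    // then executed on the compute queue.
    ComPtr<ID3D12CommandAllocator> computeAlloc;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE,
                                   IID_PPV_ARGS(&computeAlloc));
    ComPtr<ID3D12GraphicsCommandList> computeList;
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE,
                              computeAlloc.Get(), nullptr, IID_PPV_ARGS(&computeList));
    // ...SetComputeRootSignature / SetPipelineState / Dispatch would go here...
    computeList->Close();

    ID3D12CommandList* lists[] = { computeList.Get() };
    computeQueue->ExecuteCommandLists(1, lists);
    return 0;
}
```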
 
Both the Titan X and the Fury X are doing hardly any async, but when it does happen, the Fury X goes fully async (hiding all of one operation behind the other) while the Titan X only goes partially async (hiding about 1/3 of an op).

I'd be willing to run WPA with my R9-270x, if that would help, but I don't know how. :/
 
Ok, so digging a little deeper, I wasn't sure what to make of the DWM thing.

The DWM involvement in this on Maxwell 2 would certainly explain why the TDR timeout and then the TDR itself are triggered, showing up as the constant 100-0-100 GPU usage switching and, eventually, the driver crash.
All the Maxwell 2 logs I looked at have almost exactly this sequence of two lines:
208. 922.87ms
209. 1883.43ms

That huge jump, nearly doubling the time required, is the point where the 100-0-100 switching starts. I suspect that's when the TDR timeout starts kicking in, at ~900ms.
Adding the 2000ms timeout, it makes sense that the TDR would occur at ~900 + 2000 = 2900ms, which it does for everyone.
 
Fired up Killing Floor 2 with PhysX Flex... finally the compute queue lit up like a Christmas tree with something other than DWM.exe. Very heavy usage. Tried another PhysX game, Batman: Arkham Origins... again, lots of compute activity coming from it.

VERY interesting! Could you please send screenshots of GPUView for these cases?
 
VERY interesting! Could you please send screenshots of GPUView for these cases?
Running a benchmark with PhysX gives me this.
[attached GPUView screenshot]
 
Ok, so to summarize my thoughts:

In the asynccompute test, I believe the crashing behavior is just a driver bug or something it was never designed to handle. It's seeing a heavy graphics workload (made worse by the fact that the compute is also in the graphics queue) - but there's no graphics activity on the screen. And since it's running in a window, it has to compete with rendering the graphics for the Windows desktop. In that single-command-list test you can see the time spent processing the DWM command getting longer and longer as it goes on, and there's a corresponding increase in CPU usage in csrss.exe - both things that are a tiny sliver when they're not run alongside the benchmark get stretched out to extraordinary lengths. It's almost as if the driver isn't able to properly preempt the benchmark to run the DWM, and it just burns away CPU cycles as it switches between the DWM and an ever-increasing test load. To me it doesn't look like this is really revealing anything about whether or not Maxwell supports async compute; it's either just a bug or a normal reaction to an abnormal load, one that GCN happens to handle more gracefully.

The primary thing that GPUView revealed is that GCN considers the compute portion of the test to be compute, while Maxwell still considers it graphics. This either tells us that A) Maxwell has a smaller or different set of operations that it considers compute, B) Maxwell needs to be addressed in a certain/different way for it to be considered compute, or C) it's another corner case or driver bug. And it's possible that whatever was happening in the Ashes benchmark that was causing the performance issues is the same thing that's happening here. But we've got enough examples of stuff actually using the compute queue, from CUDA to OpenCL, so it's absolutely functional.

So first we need to find some way, in DX12, to submit a concurrent graphics workload alongside a compute workload that Maxwell recognizes as compute, and see if there's any async behavior in that context. Unless the async test can be modified to do this, I think its utility has run its course, and it's revealed a lot along the way.

And then we need to figure out why the compute queue isn't being used in the current version of the test, and whether it's not being used for a legitimate reason rather than a bug or programming error - i.e. that this is one of the many things that GCN can treat as compute but Maxwell can't. I can certainly believe GCN is more capable in this regard, but I still find it very difficult to believe that NVIDIA outright lied about Maxwell 2 and its async capabilities.
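If someone wants to try that, here's a rough skeleton of the measurement side of such an experiment - purely illustrative, and assuming you already have a recorded graphics command list for the direct queue and a recorded compute command list for a compute queue (the parameter names below are placeholders, not anything from the existing test). Submit both with no dependency between them, wait on both fences, and compare the wall-clock time against running them back to back: if the overlapped time is close to the longer of the two rather than their sum, the GPU is actually running them concurrently.

```cpp
// Sketch of timing two independent queue submissions (graphics + compute).
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Signal a fence on 'queue' and block the CPU until the GPU reaches it.
static void FlushQueue(ID3D12CommandQueue* queue, ID3D12Fence* fence, UINT64 value)
{
    HANDLE evt = CreateEventW(nullptr, FALSE, FALSE, nullptr);
    queue->Signal(fence, value);
    fence->SetEventOnCompletion(value, evt);
    WaitForSingleObject(evt, INFINITE);
    CloseHandle(evt);
}

// Returns elapsed milliseconds for running the two workloads concurrently.
double TimeOverlappedSubmission(ID3D12Device* device,
                                ID3D12CommandQueue* directQueue,
                                ID3D12CommandQueue* computeQueue,
                                ID3D12CommandList* gfxList,      // hypothetical, pre-recorded
                                ID3D12CommandList* computeList)  // hypothetical, pre-recorded
{
    ComPtr<ID3D12Fence> gfxFence, computeFence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&gfxFence));
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&computeFence));

    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);

    // Kick off both queues back to back with no dependency between them.
    directQueue->ExecuteCommandLists(1, &gfxList);
    computeQueue->ExecuteCommandLists(1, &computeList);

    // Wait for both to drain before stopping the clock.
    FlushQueue(directQueue, gfxFence.Get(), 1);
    FlushQueue(computeQueue, computeFence.Get(), 1);

    QueryPerformanceCounter(&end);
    return 1000.0 * double(end.QuadPart - start.QuadPart) / double(freq.QuadPart);
}
```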
 
Ok, so to summarize my thoughts:

In the asynccompute test, I believe the crashing behavior is just a driver bug or something it was never designed to handle. It's seeing a heavy graphics workload (made worse by the fact that the compute is also in the graphics queue) - but there's no graphics activity on the screen. And since it's running in a window, it has to compete with rendering the graphics for the Windows desktop. In that single-command-list test you can see the time spent processing the DWM command getting longer and longer as it goes on, and there's a corresponding increase in CPU usage in csrss.exe - both things that are a tiny sliver when they're not run alongside the benchmark get stretched out to extraordinary lengths. It's almost as if the driver isn't able to properly preempt the benchmark to run the DWM, and it just burns away CPU cycles as it switches between the DWM and an ever-increasing test load. To me it doesn't look like this is really revealing anything about whether or not Maxwell supports async compute; it's either just a bug or a normal reaction to an abnormal load, one that GCN happens to handle more gracefully.

The primary thing that GPUView revealed is that GCN considers the compute portion of the test to be compute, while Maxwell still considers it graphics. This either tells us that A) Maxwell has a smaller or different set of operations that it considers compute, B) Maxwell needs to be addressed in a certain/different way for it to be considered compute, or C) it's another corner case or driver bug. And it's possible that whatever was happening in the Ashes benchmark that was causing the performance issues is the same thing that's happening here. But we've got enough examples of stuff actually using the compute queue, from CUDA to OpenCL, so it's absolutely functional.

So first we need to find some way, in DX12, to submit a concurrent graphics workload alongside a compute workload that Maxwell recognizes as compute, and see if there's any async behavior in that context. Unless the async test can be modified to do this, I think its utility has run its course, and it's revealed a lot along the way.

And then we need to figure out why the compute queue isn't being used in the current version of the test, and whether it's not being used for a legitimate reason rather than a bug or programming error - i.e. that this is one of the many things that GCN can treat as compute but Maxwell can't. I can certainly believe GCN is more capable in this regard, but I still find it very difficult to believe that NVIDIA outright lied about Maxwell 2 and its async capabilities.

Crashing behaviour is due to Timeout Detection and Recovery (TDR). It crashes because it takes a long time, more than 2000 ms, to process a queue during the single-commandlist run. Windows specifies the number of seconds that the GPU can delay a preempt request from the GPU scheduler; this is effectively the timeout threshold, and the default value is 2 seconds (2000 ms).
I was able to complete the single-commandlist run without a crash by editing the registry and increasing the default delay from 2 to 10 seconds, or by turning TDR off completely.
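For anyone who wants to reproduce that without hand-editing regedit, the relevant values are the documented TDR registry keys under HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers: TdrDelay (the timeout in seconds, default 2) and TdrLevel (0 disables detection entirely). A small sketch that scripts the change (it has to run elevated, and the machine needs a reboot for it to take effect):

```cpp
// Sketch: raise the TDR preempt timeout from 2 s to 10 s via the documented
// registry keys. Requires admin rights; link against advapi32.lib.
#include <windows.h>

int main()
{
    HKEY key;
    // Display driver timeout settings live under GraphicsDrivers.
    if (RegCreateKeyExW(HKEY_LOCAL_MACHINE,
                        L"SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                        0, nullptr, 0, KEY_SET_VALUE, nullptr, &key, nullptr) != ERROR_SUCCESS)
        return 1;

    DWORD delaySeconds = 10;  // bump the 2-second default to 10 seconds
    RegSetValueExW(key, L"TdrDelay", 0, REG_DWORD,
                   reinterpret_cast<const BYTE*>(&delaySeconds), sizeof(delaySeconds));

    // Alternatively, TdrLevel = 0 (TdrLevelOff) turns timeout detection off completely.
    // DWORD level = 0;
    // RegSetValueExW(key, L"TdrLevel", 0, REG_DWORD,
    //                reinterpret_cast<const BYTE*>(&level), sizeof(level));

    RegCloseKey(key);
    return 0;
}
```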
 
Right, but it shouldn't have gotten into that precarious state to begin with. I've never seen a game crash DWM like that, so I don't know what it's revealing other than that if you throw a bizarre workload at a GPU, bizarre things may happen. It's great that at least GCN walks out unscathed, but there could also be an alternative situation where GCN would trip up and Maxwell would pass with flying colors.
 
Perhaps if the workload were ramped higher, the GCN solutions could hit the same threshold, or at least some of them could.

This is an artificial case, but long-running compute causing the OS to freak out is an observed problem, especially since the "fix" is mentioned as part of the process of debugging CUDA kernels. Even if it does not affect the end product, it does impact development.
On the consoles, there was a brief mention in a slide from Sucker Punch about long-running kernels remaining problematic, even with the GCN GPU there.

There may be a class of shader implementations out there for long-running compute that are being decided against because the operating system does not or cannot trust the hardware. AMD's hardware has done more to add context switching and preemption in the latest IP levels, which may in the future provide a way to give the OS the assurance that the GPU isn't frozen while still allowing the GPU to run programs that are not subject to an arbitrary rule that they always complete within 3 seconds.
 
Perhaps if the workload were ramped higher, the GCN solutions could hit the same threshold, or at least some of them could.

This is an artificial case, but long-running compute causing the OS to freak out is an observed problem, especially since the "fix" is mentioned as part of the process of debugging CUDA kernels. Even if it does not affect the end product, it does impact development.
On the consoles, there was a brief mention in a slide from Sucker Punch about long-running kernels remaining problematic, even with the GCN GPU there.

There may be a class of shader implementations out there for long-running compute that are being decided against because the operating system does not or cannot trust the hardware. AMD's hardware has done more to add context switching and preemption in the latest IP levels, which may in the future provide a way to give the OS the assurance that the GPU isn't frozen while still allowing the GPU to run programs that are not subject to an arbitrary rule that they always complete within 3 seconds.

I'm curious what relevance a 3-second task has to a game that generally needs to get things done in 30 ms at most, though? I can understand it being an issue outside of gaming, but isn't this test kind of extreme in this context?
 
Can you run the same test on older NVIDIA cards? Specifically on cards that we know can't do any asynchronous processing.
 
The primary thing that GPUView revealed is that GCN considers the compute portion of the test to be compute, while Maxwell still considers it graphics. This either tells us that A) Maxwell has a smaller or different set of operations that it considers compute, B) Maxwell needs to be addressed in a certain/different way for it to be considered compute, or C) it's another corner case or driver bug. And it's possible that whatever was happening in the Ashes benchmark that was causing the performance issues is the same thing that's happening here. But we've got enough examples of stuff actually using the compute queue, from CUDA to OpenCL, so it's absolutely functional.
How does this mesh with NVIDIA's commentary that async timewarp works via preemption at draw call boundaries only? E.g. slide 23 notes:
Since we’re relying on preemption, let’s talk about how preemption actually works on current GPUs. Fermi, Kepler, and Maxwell GPUs — basically GeForce GTX 500 series and forward — all manage multiple contexts by time-slicing, with draw-call preemption. This means the GPU can only switch contexts at draw call boundaries! Even with the high-priority context, it’s possible for that high-priority work to get stuck behind a long-running draw call on a normal context. If you have a single draw call that takes 5 ms, then asynctimewarp can get stuck behind it and be delayed for 5 ms, which will mean it misses vsync, causing stuttering or tearing.

On future GPUs, we’re working to enable finer-grained preemption, but that’s still a long way off.
 