DX12 Performance Discussion And Analysis Thread

Hi, I've created an account just because ToTTenTranz asked for more Maxwell results with CPU usage.

Here's my perf log on a 980 Ti. I got a TDR crash on the single command list test, but it shouldn't matter.

Also, here is overall CPU usage over the duration of the test on an i7-5960X:

[attached images: 866, 867, 868]


Interesting.

In your case, there doesn't seem to be any relevant CPU activity, but there's no "Async" happening at all either.
Your "Async" results are almost a carbon copy of what the "pure compute" results show if you just add the 16.15ms render time on top of them:
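That relationship can be sanity-checked numerically: if "Async" offers no overlap at all, each async timing should equal the corresponding pure-compute timing plus the fixed render time. A minimal sketch in Python (the 16.15 ms render time is from this post; the compute/async sample pairs are hypothetical placeholders, not the real log):

```python
# Check whether "async" timings are just serial compute + render.
RENDER_MS = 16.15  # fixed render time reported for this 980 Ti run

def looks_serial(compute_ms, async_ms, tolerance_ms=1.0):
    """True if the async time is ~compute + render, i.e. no overlap."""
    return abs(async_ms - (compute_ms + RENDER_MS)) <= tolerance_ms

# Hypothetical per-kernel (compute_ms, async_ms) pairs, not real log data:
samples = [(5.0, 21.2), (10.0, 26.1), (20.0, 36.3)]
print(all(looks_serial(c, a) for c, a in samples))  # True: every "async" sample is just compute + render
```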

[image: hQxdbfN.png]




EDIT: results with TDR disabled are practically the same. As I said, disabling TDR is simply preventing the graphics driver from interrupting the program due to a timeout in the third test:

[image: 7OFEwqk.png]
 
Sorry to ask but did you restart your computer after the change?
Seems we have a divergence from the result of PadyEOS, who also has a 980 Ti.
Cheers

Yes, but to be perfectly honest I didn't completely disable TDR. I just increased the delay to 10 seconds so the test wouldn't crash.
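For reference, TDR behaviour is controlled through documented registry values under the GraphicsDrivers key. A .reg fragment matching the 10-second delay described above would look like this (reboot required for it to take effect; setting `TdrLevel` to 0 instead disables TDR entirely):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000000a
```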
 
So with nVidia Maxwell 2, specifically a GM200 Geforce 980 Ti we have 2 different results so far:

1 - Async exists somehow but it causes a huge load on the CPU -> PadyEos' result using 355.82

2 - Async doesn't exist at all -> Devnant's result using 355.82 too...


I wonder what's causing this "CPU-assisted Async" to kick-in.
If it were pure CPU performance, then Devnant's machine should be the one activating Async, since his CPU has twice the cores/threads.
Perhaps CPU frequency?
 
Yes, but to be perfectly honest I didn't completely disable TDR. I just increased the delay to 10 seconds so the test wouldn't crash.
OK that possibly explains the divergence :)
Can you do the test with actually disabling it like PadyEOS did?
Cheers
 
970 on 355.82 OCed. Didn't get a TDR. Also attached is Afterburner log if anyone is interested.
 

Attachments

  • HardwareMonitoring.zip (44.3 KB)
  • perf.zip (187.4 KB)
970 on 355.82 OCed. Didn't get a TDR. Also attached is Afterburner log if anyone is interested.

Well, I can't spend the whole afternoon drawing graphs in Excel, but looking at your results at 5 different points (0, 128, 256, 384, 512), async time = compute time + render time, so it doesn't look like your test is doing any async compute either.
 
OK that possibly explains the divergence :)
Can you do the test with actually disabling it like PadyEOS did?
Cheers

Just did, but that didn't change a thing. Same results. Seems like I don't get any async benefits, maybe because we are using different CPUs? I just don't know.
 
More Maxwell 2 results. It seems the 980 Ti that showed some async compute happening with weird CPU spikes is the odd one out.

[image: POXJfpU.png]

[image: nHQyMcp.png]
 
So with nVidia Maxwell 2, specifically a GM200 Geforce 980 Ti we have 2 different results so far:

1 - Async exists somehow but it causes a huge load on the CPU -> PadyEos' result using 355.82

2 - Async doesn't exist at all -> Devnant's result using 355.82 too...


I wonder what's causing this "CPU-assisted Async" to kick-in.
If it were pure CPU performance, then Devnant's machine should be the one activating Async, since his CPU has twice the cores/threads.
Perhaps CPU frequency?
Are you basing PadyEOS's results on page 12?
He made the change and had a different behaviour afterwards.
Regarding Async behaviour, why do threads around 200 to 260 consistently show Graphics+compute being faster than the two run separately, looking at his results after the change?
Cheers
 
ToTTenTranz,
nothing is truly consistent so far, as one has to remove certain results to reach a conclusion either way (yes/no support); and unfortunately, as I mentioned, this is compounded by the fact that none of this is being done in a controlled PC environment.
Cheers
 
Again: TDR won't change any outcome; it will just prevent the third test from crashing (the crash stops Async from kicking in, so that test wouldn't tell us whether Async is working either way).
Here's PadyEos' latest result anyway.

[image: B4MUaBQ.png]



CSI PC, no one is removing results. In fact, I've been asking people to bring more results into the equation.
As far as consistency goes, apart from PadyEos's results (which we're still trying to find a way to replicate), almost all Maxwell 2 results are pretty damn consistent: Async Compute isn't working in this test at all.
 
Question. Wouldn't Nvidia technically be performing async computation/shaders exactly how the MS documentation and even AMD's own slide describe, going by the results we are seeing? I could have missed something more solid on the topic in the API reference, but this is what I was able to find.

Command queue overview

Direct3D 12 command queues replace hidden runtime and driver synchronization of immediate mode work submission with APIs for explicitly managing concurrency, parallelism and synchronization. Command queues provide the following improvements for developers:

  • Allows developers to avoid accidental inefficiencies caused by unexpected synchronization.
  • Allows developers to introduce synchronization at a higher level where the required synchronization can be determined more efficiently and accurately. This means the runtime and the graphics driver will spend less time reactively engineering parallelism.
  • Makes expensive operations more explicit.
These improvements enable or enhance the following scenarios:

  • Increased parallelism - Applications can use deeper queues for background workloads, such as video decoding, when they have separate queues for foreground work.
  • Asynchronous and low-priority GPU work - The command queue model enables concurrent execution of low-priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
  • High-priority compute work - This design enables scenarios that require interrupting 3D rendering to do a small amount of high-priority compute work so that the result can be obtained early for additional processing on the CPU.
That's the command queue; from the results we are seeing, Nvidia runs 31 of them in parallel, so it seems Nvidia is meeting that requirement (or at least that description).

And then there is the command list

Executing command Lists
After you have recorded a command list and either retrieved the default command queue or created a new one, you execute command lists by calling ID3D12CommandQueue::ExecuteCommandLists.

Applications can submit command lists to any command queue from multiple threads. The runtime will perform the work of serializing these requests in the order of submission.

The runtime will validate the submitted command list and will drop the call to ExecuteCommandLists if any of the restrictions are violated. Calls will be dropped for the following reasons:

Nvidia is executing command lists serially as well (graphics + compute) per command queue. Unless I'm missing something totally obvious, it seems like Nvidia is handling this exactly how they should. I haven't seen anywhere that says command lists on the same queue are supposed to be executed asynchronously. Please, if I am missing something here, let me know.
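The submission rules quoted above can be modeled in a few lines: command lists on one queue serialize in submission order, while separate direct and compute queues may overlap if the hardware allows it. A toy timing model in Python (the durations are made up, purely illustrative):

```python
def serial_time(durations_ms):
    """Lists submitted to a single command queue execute in order."""
    return sum(durations_ms)

def overlapped_time(graphics_ms, compute_ms):
    """Separate direct + compute queues, best case: full overlap."""
    return max(sum(graphics_ms), sum(compute_ms))

gfx = [16.15]           # one render pass (time taken from earlier in the thread)
comp = [4.0, 4.0, 4.0]  # three compute dispatches (made-up durations)
print(serial_time(gfx + comp))     # single-queue behaviour: render + compute
print(overlapped_time(gfx, comp))  # ideal async behaviour: max(render, compute)
```

Under this model, identical timings for the single-queue and two-queue cases, as seen in the Maxwell results above, would indicate no cross-queue overlap is actually happening.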
 
ToTTenTranz,
look at around thread 200 to, say, 265 (it happens in other places too).
You agree it shows Compute+Render time is consistently less than the individual runs of Compute and Render?
So how is this possible serially?
As you said earlier, if it were not async-compute capable, Compute+Render would need to take around the same time as both of those combined, but in places the improvements are clearly greater than marginal.
So this is what I mean about having to ignore certain variables to reach the conclusions being stated.
One cannot say for sure what is happening, apart from the fact that something is not right. Maybe it is the driver/scheduler, or indeed something more architectural, but nothing can be stated conclusively because it needs further testing; in fact I have seen no one even raise the question of what changed between Star Swarm and Ashes, both from the Nitrous engine perspective and from NVIDIA's drivers.
The caveat being: are the results from Star Swarm comparable to Ashes on low settings? So much has not been clarified and tested, including running all of this in a clean, controlled PC environment with no dual drivers, "boost" software, etc.
And yes, I agree something is strange when looking at the various results; I do think a couple of the other measurement tests posted showed sections of threads with consistent improvements for Compute+Render.
But the performance behaviour is far from consistent with what the trend should be for NVIDIA either supporting Async or not.

Cheers
 
ToTTenTranz,
look at around thread 200 to, say, 265 (it happens in other places too).
You agree it shows Compute+Render time is consistently less than the individual runs of Compute and Render?
So how is this possible serially?

If you're referring to PadyEos's results, those are the same ones that show about 50% usage of an 8-thread CPU when the "Async" test starts.
 
Again, trying to draw a parallel from CPU usage to async happening is hard to do unless you know exactly what the drivers, GPU and CPU are doing at that point; the purpose of that CPU usage has not been quantified to any degree.

If we don't know and still draw that parallel, it might be wrong, and that is no good because it changes the way we think about the situation; in essence it prejudices us and in turn forces us to make incorrect assumptions.
 