DX12 Performance Discussion And Analysis Thread

Discussion in 'Rendering Technology and APIs' started by A1xLLcqAgt0qc2RyMz0y, Jul 29, 2015.

  1. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Sorry to ask but did you restart your computer after the change?
Seems we have a divergence from the results of PadyEos, who also has a 980 Ti
    Cheers
     

  2. Interesting.

In your case, there doesn't seem to be any relevant CPU activity... but there's no "Async" happening at all either.
Your "Async" results are almost a carbon copy of the "pure compute" results if you just add the 16.15ms render time on top of them:

[image]



EDIT: results with TDR disabled are practically the same. As I said, disabling TDR simply prevents the graphics driver from interrupting the program due to a timeout in the third test:

[image]
     
    #422 Deleted member 13524, Sep 3, 2015
    Last edited by a moderator: Sep 3, 2015
  3. Rurouni

    Veteran

    Joined:
    Sep 30, 2008
    Messages:
    1,101
    Likes Received:
    432
    What Nvidia driver version did he use?
     
  4. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    Yes, but to be perfectly honest I didn't completely disable TDR. I just increased the delay to 10 seconds so the test wouldn't crash.
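
    For reference, the TDR knobs being discussed live in the registry under the documented GraphicsDrivers key; a sketch of the change described above (TdrDelay is the GPU timeout in seconds, TdrLevel 0 turns detection off entirely; run elevated, and a reboot is needed afterwards):

    ```bat
    :: Sketch of the TDR tweak described above (Windows, elevated prompt,
    :: reboot required). TdrDelay raises the GPU timeout in seconds;
    :: setting TdrLevel to 0 disables TDR detection entirely.
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 10 /f
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrLevel /t REG_DWORD /d 0 /f
    ```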
     
  5. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    355.82
     
So with NVIDIA Maxwell 2, specifically the GM200-based GeForce 980 Ti, we have two different results so far:

    1 - Async exists somehow but it causes a huge load on the CPU -> PadyEos' result using 355.82

    2 - Async doesn't exist at all -> Devnant's result using 355.82 too...


    I wonder what's causing this "CPU-assisted Async" to kick-in.
If it were pure CPU performance, then Devnant's machine should be the one to activate Async, since his CPU has twice the cores/threads.
    Perhaps CPU frequency?
     
  7. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
Too early to determine that's even the case. Going down the CPU-frequency road is probably not it.
     
  8. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
OK, that possibly explains the divergence :)
Can you run the test with TDR actually disabled, like PadyEos did?
    Cheers
     
  9. trandoanhung1991

    Joined:
    Sep 2, 2015
    Messages:
    6
    Likes Received:
    6
970 on 355.82, OCed. Didn't get a TDR. Also attached is the Afterburner log, if anyone is interested.
     

    Attached Files:

Well, I can't spend the whole afternoon drawing graphs in Excel, but looking at your results at 5 different points (0, 128, 256, 384, 512), async time = compute time + render time, so it doesn't look like your test is doing any async compute either.
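
    The arithmetic being applied to those data points can be written down as a quick check; the helper name and the example numbers below are illustrative placeholders, not anyone's measured results:

    ```python
    # Quick sanity check for the three timings the benchmark reports:
    # a pure render pass, a pure compute pass, and the combined
    # ("async") graphics+compute pass. If the combined time equals the
    # sum of the two individual times, the workloads ran back to back;
    # any time saved below that sum indicates real overlap.
    # All numbers here are illustrative placeholders.

    def overlap_saving_ms(render_ms, compute_ms, combined_ms):
        """Milliseconds saved versus purely serial execution."""
        return (render_ms + compute_ms) - combined_ms

    # Serial case: combined == render + compute -> no async benefit.
    assert overlap_saving_ms(16.0, 10.0, 26.0) == 0.0

    # Overlapping case: combined < render + compute -> async benefit.
    assert overlap_saving_ms(16.0, 10.0, 20.0) == 6.0
    ```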
     
  11. Devnant

    Newcomer

    Joined:
    Sep 3, 2015
    Messages:
    10
    Likes Received:
    7
    Just did, but that didn't change a thing. Same results. Seems like I don't get any async benefits, maybe because we are using different CPUs? I just don't know.
     
  12. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Thanks Devnant,
    at least you were both using the same drivers.
    Cheers
     
  13. ka_rf

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    12
    Likes Received:
    19
More Maxwell 2 results. It seems the 980 Ti that showed some async compute happening with weird CPU spikes is the odd one out.

[image]
[image]
     
  14. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Are you basing that on PadyEos' results from page 12?
He did the change and had a different behaviour afterwards.
Regarding Async behaviour, why do threads around 200 to 260 consistently show Graphics+compute running faster than running them separately, when looking at his results after the change?
    Cheers
     
  15. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    ToTTenTranz,
nothing is truly consistent so far, as one has to remove certain results to reach a conclusion one way or the other (yes/no support); and unfortunately, as I mentioned, this is compounded by none of it being done on a controlled-environment PC.
    Cheers
     
    Razor1 and pharma like this.
Again: the TDR setting won't change any outcome; it will just prevent the third test (which prevents Async from kicking in, so it wouldn't tell us whether Async is working either way) from crashing.
    Here's PadyEos' latest result anyway.

[image]


    CSI PC, no one is removing results. In fact I've been asking people to bring more results into the equation.
As far as consistency goes, except for PadyEos' results (which we're still trying to find a way to replicate), almost all Maxwell 2 results are pretty damn consistent: Async Compute isn't working in this test at all.
     
  17. comprodigy

    Joined:
    Sep 1, 2015
    Messages:
    3
    Likes Received:
    0
Question: wouldn't Nvidia technically be performing async computation/shaders exactly as the MS documentation (and even AMD's own slides) describe, given the results we are seeing? I could have missed something more solid on the topic in the API reference, but this is what I was able to find.

    Command queue overview

    Direct3D 12 command queues replace hidden runtime and driver synchronization of immediate mode work submission with APIs for explicitly managing concurrency, parallelism and synchronization. Command queues provide the following improvements for developers:

    • Allows developers to avoid accidental inefficiencies caused by unexpected synchronization.
    • Allows developers to introduce synchronization at a higher level where the required synchronization can be determined more efficiently and accurately. This means the runtime and the graphics driver will spend less time reactively engineering parallelism.
    • Makes expensive operations more explicit.
    These improvements enable or enhance the following scenarios:

    • Increased parallelism - Applications can use deeper queues for background workloads, such as video decoding, when they have separate queues for foreground work.
    • Asynchronous and low-priority GPU work - The command queue model enables concurrent execution of low-priority GPU work and atomic operations that enable one GPU thread to consume the results of another unsynchronized thread without blocking.
    • High-priority compute work - This design enables scenarios that require interrupting 3D rendering to do a small amount of high-priority compute work so that the result can be obtained early for additional processing on the CPU.
That's the command queue; from the results we are seeing, Nvidia runs 31 in parallel, so it seems Nvidia is meeting that requirement (or at least that description).

    And then there is the command list

    Executing command Lists
    After you have recorded a command list and either retrieved the default command queue or created a new one, you execute command lists by calling ID3D12CommandQueue::ExecuteCommandLists.

    Applications can submit command lists to any command queue from multiple threads. The runtime will perform the work of serializing these requests in the order of submission.

    The runtime will validate the submitted command list and will drop the call to ExecuteCommandLists if any of the restrictions are violated. Calls will be dropped for the following reasons:

Nvidia is executing command lists serially as well (graphics + compute) per command queue. Unless I'm missing something totally obvious, it seems like Nvidia is handling this exactly how they should. I haven't seen anywhere where it says that command lists on the same queue are supposed to be executed asynchronously. Please, if I am missing something here, let me know.
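
    As a toy illustration of the submission rules quoted above (a hypothetical Python model, not D3D12 code; queue/list names and durations are made up): command lists on the same queue execute in submission order, while separate queues have independent timelines and are therefore allowed to overlap.

    ```python
    # Toy model of D3D12-style submission: each queue serializes its own
    # command lists in submission order; different queues have independent
    # timelines, so their work may overlap. Names/durations are invented.

    def schedule(queues):
        """Return {list_name: (start_ms, end_ms)} under the model above."""
        timeline = {}
        for lists in queues.values():
            t = 0.0
            for name, duration_ms in lists:
                timeline[name] = (t, t + duration_ms)
                t += duration_ms
        return timeline

    queues = {
        "direct":  [("render_0", 16.0), ("render_1", 16.0)],   # graphics queue
        "compute": [("compute_0", 5.0), ("compute_1", 5.0)],   # compute queue
    }
    times = schedule(queues)

    # Same queue: strictly serial (render_1 starts when render_0 ends).
    assert times["render_1"][0] == times["render_0"][1]

    # Separate queues: both start at t=0, so overlap is permitted.
    assert times["compute_0"][0] == times["render_0"][0] == 0.0
    ```

    Whether the hardware actually overlaps the two queues in time is exactly what the benchmark is trying to measure; the spec text quoted above only requires serialization within a queue.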
     
  18. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    ToTTenTranz,
look at around thread 200 to, say, 265 (it happens in other places too).
You agree it shows Compute+Render time is consistently less than the individual runs of Compute and Render?
So how is this possible serially?
As you said earlier, if it is not async-compute capable, Compute+Render would need to take around the same time as both of those combined; but in places it is pretty clear the improvements are more than marginal.
So this is what I mean about having to ignore certain variables to reach the conclusions that are being stated.
One cannot say for sure what is happening, apart from that something is not right; maybe it is the driver/scheduler or indeed something more architectural, but nothing can be said conclusively because it needs further testing. In fact, I see no one has even raised the question of what changed between Star Swarm and Ashes, both from the Nitrous engine perspective and from NVIDIA's drivers.
The caveat there being: are the results from Star Swarm comparable to Ashes on low settings? So much has not been clarified and tested, including running all of these on a clean, controlled PC environment with no dual drivers/"boost" software/etc.
And yes, I agree something is strange when looking at the various results; I do think a couple of other measurement tests posted showed that sections of threads had consistent improvements for Compute+Render.
But the performance behaviour is far from consistent in terms of what the trend should be for NVIDIA either supporting Async or not.

    Cheers
     
If you're referring to PadyEos' results, those are the same ones that show about 50% usage of an 8-thread CPU when the "Async" test starts.
     
    digitalwanderer likes this.
  20. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
Again, trying to draw a parallel from CPU usage to async happening is hard to do unless you know exactly what the drivers, GPU, and CPU are doing at that point; the purpose of the CPU usage has not been quantified to any degree.

If we don't know and still draw that parallel, it might be wrong, and that is no good because it changes the way we think about the situation; in essence it prejudices us and in turn forces us to make incorrect assumptions.
     
    #440 Razor1, Sep 3, 2015
    Last edited: Sep 3, 2015
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.