DX12 Performance Discussion And Analysis Thread

Yes, there is only a single graphics queue per device context when going via the DX12 API. But even then, nothing stops you from acquiring a second device context, even if only by starting another 3D-accelerated application in parallel, in which case the driver starts to interleave the graphics command queues.

There is still only a single command processor, fetching from the merged queue.
Yes, you can have a second application with another ID3D12Device instance on the same adapter, but then the driver will serialize the graphics command buffers.
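For reference, here is a minimal sketch of the API side of this (error handling omitted, names invented for illustration): a single ID3D12Device can expose several command queues of different types, but the DIRECT (graphics) queues, whether from one process or several, ultimately feed the single graphics command processor described above.

[code]
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Illustrative sketch: one device, one graphics (DIRECT) queue and one
// COMPUTE queue. Additional DIRECT queues, or a second process with its own
// ID3D12Device on the same adapter, still end up serialized onto the same
// hardware graphics command processor, as discussed above.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfxDesc = {};
    gfxDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;      // graphics + compute + copy
    device->CreateCommandQueue(&gfxDesc, IID_PPV_ARGS(&gfxQueue));

    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE; // compute + copy only
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
}
[/code]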
 
Fable Legends Early Preview: DirectX 12 Benchmark Analysis
When we do a direct comparison for AMD’s Fury X and NVIDIA’s GTX 980 Ti in the render sub-category results for 4K using a Core i7, both AMD and NVIDIA have their strong points in this benchmark. NVIDIA favors illumination, compute shader work and GBuffer rendering where AMD favors post processing, transparency and dynamic lighting.


http://www.anandtech.com/show/9659/fable-legends-directx-12-benchmark-analysis
 
If we were to go by the ms timings themselves for Fury and the 980 Ti, they add up close enough to the frame time implied by the average frame rate that there's not much sign of concurrent execution for either Nvidia or AMD, although how much of the various categories would be covered by asynchronous compute is unclear.

There were signs of weaker timing improvements for Fiji in the test from this thread as well, however.
Fiji's lead over its predecessor is rather muted.

edit:
There's no clear difference that I could tell for the other chips, with all the caveats of guessing performance via timings applying and with generous error bars.
 
Agreed, something seems to be pulling Fiji back, judging by the short distance between it and Hawaii.
On the other hand, the 390 seems to be head and shoulders above the GTX 980 now.

ExtremeTech is getting different results than AnandTech regarding the comparison between the Fury X and 980 Ti. The only discernible difference I see is that they're using Haswell-E with faster DDR4 memory:

[ExtremeTech charts: Fury X vs. GTX 980 Ti results]
 
[Chart: Fable Legends 4K render pass timings]


The dark green line shows async compute shaders at work. Maxwell is fairly efficient at doing them, better than GCN.

Now if we look at Fiji and Hawaii, it looks like Dynamic GI is what is holding Fiji back.
 
Really? After 37 pages in this thread and you think "Compute Shaders" is the same as "Async Compute"?
 
Please read what the green line is: it's async compute for Fable Legends' foliage rendering and collision detection...

that green line is all compute shaders running asynchronously.
 
Within a strict set of limitations, as shown by Ext3h.

PS: can someone give him edit privileges?


Yes, that is true: if Maxwell is pushed in ways that go beyond its queue limits, it will have a disastrous effect on performance. How much of that can be mitigated by drivers, I don't know, but I wouldn't be surprised if it can be mitigated quite a bit.
 
Please read what the green line is: it's async compute for Fable Legends' foliage rendering and collision detection...

Where exactly do you see the word "async" in there?


that green line is all compute shaders running asynchronously.
We've had compute shaders since DX10. All you see is how long it's taking to run these tasks. This has nothing to do with the ability to run them asynchronously with the graphics pipeline.
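To make that distinction concrete, here is a hedged sketch (hypothetical names; root signatures, bindings and barriers omitted, and not taken from the benchmark's code): the same dispatch can be recorded on the DIRECT queue's timeline, where it runs in order with the draws, or on a separate COMPUTE queue, where it may overlap with graphics on hardware and drivers that support it. Timing the dispatch itself looks the same either way.

[code]
#include <d3d12.h>

// Hypothetical sketch, not Lionhead's code. 'cs' is some compute pipeline
// state and 'groups' its dispatch size; setup and barriers are omitted.
void RecordComputeWork(ID3D12GraphicsCommandList* directList,   // DIRECT-type list
                       ID3D12GraphicsCommandList* computeList,  // COMPUTE-type list
                       ID3D12CommandQueue* computeQueue,
                       ID3D12PipelineState* cs, UINT groups)
{
    // (a) Serial with graphics: the dispatch executes in order with the
    //     surrounding draw calls on the DIRECT queue's timeline.
    directList->SetPipelineState(cs);
    directList->Dispatch(groups, 1, 1);

    // (b) Asynchronous: the same dispatch recorded on a COMPUTE-type list and
    //     submitted to a COMPUTE queue may overlap with graphics work on
    //     hardware and drivers that support it.
    computeList->SetPipelineState(cs);
    computeList->Dispatch(groups, 1, 1);
    computeList->Close();
    ID3D12CommandList* lists[] = { computeList };
    computeQueue->ExecuteCommandLists(1, lists);
}
[/code]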
 
Dude, read the article; it's specific to async shaders on hardware that supports it...

You are being obtuse.

http://www.pcper.com/reviews/Graphi...-Benchmark-DX12-Performance-Testing-Continues

Fable Legends is a gorgeous-looking game based on the benchmark we have here in-house, thanks in some part to the modifications that the Lionhead Studios team has made to the UE4 DX12 implementation. The game takes advantage of Asynchronous Compute Shaders, manual resource barrier tracking and explicit memory management to help achieve maximum performance across a wide range of CPU and GPU hardware.
Compute shader simulation and culling is the cost of our foliage physics sim, collision and also per-instance culling, all of which run on the GPU. Again, this work runs asynchronously on supporting hardware.
 
"It runs asynchronously on supporting hardware"!

The fact that they're measuring how fast a certain code is running (compute shaders) does not tell you that it's running asynchronously with the graphics pipeline. You could put a Kepler chip running that benchmark and it would still tell you how long the compute shaders took to run.

It's even possible that the compute shaders are running faster on nVidia hardware because they're not running asynchronously. If all of the GPU's compute resources are dedicated to the compute tasks, there's a good chance they will take less time.
 
So you are saying NV hardware has less latency doing this specific compute shader work when not running it asynchronously than GCN has running the same compute shader work asynchronously?

That is highly unlikely, unless GCN hardware has a serious issue in the front end, or drivers that break down because of the way this compute shader work was written.
 
No, I'm saying:

If all of the GPU's compute resources are dedicated to the compute tasks, there's a good chance they will take less time.

Being able to run compute shaders asynchronously with the rendering pipeline does not make the chip run compute shaders faster.
It's a method to keep all compute resources occupied, increasing the efficiency as a whole.
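A minimal sketch of how that is typically expressed in D3D12, with hypothetical names and the per-frame bookkeeping left out: the compute queue runs alongside the graphics queue, and a fence makes the graphics queue wait only where it actually consumes the compute results. That is what keeps otherwise idle compute resources busy; it does not make any individual shader run faster.

[code]
#include <d3d12.h>

// Hypothetical sketch: overlap a compute pass with graphics and synchronize
// with a fence. 'fenceValue' would normally advance every frame.
void SubmitFrame(ID3D12CommandQueue* gfxQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList* gfxList,
                 ID3D12CommandList* computeList,
                 ID3D12Fence* fence, UINT64 fenceValue)
{
    // Kick off the async compute work (e.g. foliage physics and culling).
    computeQueue->ExecuteCommandLists(1, &computeList);
    computeQueue->Signal(fence, fenceValue);   // mark compute completion

    // Graphics work that does not depend on the compute results could be
    // submitted here and run concurrently with the compute queue.

    // GPU-side wait: the graphics queue only stalls where it consumes the
    // compute output, then executes the rest of the frame.
    gfxQueue->Wait(fence, fenceValue);
    gfxQueue->ExecuteCommandLists(1, &gfxList);
}
[/code]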
 
If they were working in serial on the Maxwell architecture, the resulting frame rates wouldn't be there. The reduction in time for the compute shaders, added to the graphics workload, shows that both architectures, Maxwell 2 and GCN, are either both doing async or both not doing it.

28.806 ms total for the 980 Ti
31.05 ms for the Fury X

38.12 ms for the GTX 980
34.98 ms for the R9 390X

This correlates with the frame rates we see as the end result.
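As a rough sanity check of that correlation (assuming the charted passes account for essentially the whole frame, which may not be the case), the summed pass times convert to frame rates like this:

[code]
#include <cstdio>

// Rough sanity check: if the charted passes covered the whole frame, the
// summed per-pass times would imply roughly these frame rates.
int main()
{
    const struct { const char* gpu; double totalMs; } cards[] = {
        { "GTX 980 Ti", 28.806 },
        { "Fury X",     31.05  },
        { "GTX 980",    38.12  },
        { "R9 390X",    34.98  },
    };
    for (const auto& c : cards)
        std::printf("%-10s %6.2f ms -> ~%.1f fps\n", c.gpu, c.totalMs, 1000.0 / c.totalMs);
    return 0;
}
[/code]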
 
So you are saying NV hardware has less latency doing this specific compute shader work when not running it asynchronously than GCN has running the same compute shader work asynchronously?

That is highly unlikely, unless GCN hardware has a serious issue in the front end, or drivers that break down because of the way this compute shader work was written.

Various iterations of the synthetic in this thread show that the straight-line performance of the shaders, and how optimal the code generation is, can really skew things.
The original version had Nvidia's times coming in under AMD until the problem size was sufficiently high.
It took an additional optimization round to get a large speedup above how the code was generated by default.

Also, if the timings are derived from the timestamps that the individual tasks log, they would report the duration as those tasks perceive it. This would be different from any improvement in overall frame time.

That doesn't mean that there isn't something going on, just that the data is not clear-cut.
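For context on where per-pass numbers like these usually come from, here is a hedged sketch of D3D12 timestamp queries (names invented; the benchmark may well measure things differently): the two timestamps bracket one pass and report the duration that pass perceives on its queue, which is not the same thing as the whole frame getting shorter.

[code]
#include <d3d12.h>

// Hypothetical sketch: bracket one pass with GPU timestamps. Creation of the
// query heap and readback buffer, and the fence wait before reading back,
// are omitted for brevity.
void TimePass(ID3D12GraphicsCommandList* list, ID3D12QueryHeap* heap,
              ID3D12Resource* readback)
{
    list->EndQuery(heap, D3D12_QUERY_TYPE_TIMESTAMP, 0);   // pass start
    // ... Draw/Dispatch calls for the pass being measured ...
    list->EndQuery(heap, D3D12_QUERY_TYPE_TIMESTAMP, 1);   // pass end
    list->ResolveQueryData(heap, D3D12_QUERY_TYPE_TIMESTAMP, 0, 2, readback, 0);
}

// After the GPU has finished: convert the two tick values to milliseconds.
double PassMilliseconds(ID3D12CommandQueue* queue, const UINT64 ticks[2])
{
    UINT64 freq = 1;
    queue->GetTimestampFrequency(&freq);                   // ticks per second
    return double(ticks[1] - ticks[0]) * 1000.0 / double(freq);
}
[/code]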
 
The dark green line shows async compute shaders at work. Maxwell is fairly efficient at doing them, better than GCN.

Now if we look at Fiji and Hawaii, it looks like Dynamic GI is what is holding Fiji back.

I just read "compute shaders", not "async compute + graphics". With async compute + graphics, the compute shaders should have a lower priority than the graphics operations; they are meant to fully utilize the hardware resources, not to lower the execution time of the compute shaders themselves. If anything, their individual times will increase, which means the summed time of the passes run serially (non-async) should be higher than the wall-clock frame time with async. For example, say 20 ms of graphics plus 8 ms of compute run serially is 28 ms, while overlapped the compute might stretch to 10 ms yet the frame still finishes in well under 28 ms. (And of course, if you look at a single type of operation only, i.e. graphics or compute, it would execute in less time than when it runs fully async alongside the other.)
 
The dark green line shows async compute shaders at work. Maxwell is fairly efficient at doing them, better than GCN.

Interesting, since Nvidia has not yet released drivers optimized for this feature in DX12. It should be interesting to see the differences once they are released.
 
What I do find interesting is that the timings for these types of compute shaders should not increase much with resolution, because resolution shouldn't affect culling (well, not much) or physics. That holds on NV hardware, but on AMD hardware resolution has a much more drastic effect on these compute shaders.

[Chart: Fable Legends 1080p render pass timings]
 
I just read "compute shaders", not "async compute + graphics". With async compute + graphics, the compute shaders should have a lower priority than the graphics operations; they are meant to fully utilize the hardware resources, not to lower the execution time of the compute shaders themselves. If anything, their individual times will increase, which means the summed time of the passes run serially (non-async) should be higher than the wall-clock frame time with async.

Perhaps this is the quote from the review...
The game takes advantage of Asynchronous Compute Shaders, manual resource barrier tracking and explicit memory management to help achieve maximum performance across a wide range of CPU and GPU hardware.
http://www.pcper.com/reviews/Graphi...-Benchmark-DX12-Performance-Testing-Continues
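On the "manual resource barrier tracking" part of that quote, here is a minimal sketch of what such a hand-placed barrier looks like in D3D12 (hypothetical resource name, not the game's code): the transition makes a buffer written by a compute pass visible to shaders that read it later.

[code]
#include <d3d12.h>

// Hypothetical sketch: transition a buffer that a compute pass wrote (as a
// UAV) so that later shaders can read it. In DX12 the application, not the
// driver, is responsible for placing barriers like this one.
void TransitionForRead(ID3D12GraphicsCommandList* list, ID3D12Resource* buffer)
{
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = buffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE;
    list->ResourceBarrier(1, &barrier);
}
[/code]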
 