DX12 Performance Discussion And Analysis Thread

This sounds like a big deal.
Yeah. I've been waiting for a tech journalist to investigate this for quite a while. I thought Scott Wasson would after all the hints in his TitanX review: http://techreport.com/review/27969/nvidia-geforce-gtx-titan-x-graphics-card-reviewed/3

the fact that we're exceeding the theoretical rate suggests perhaps the GPU clocks are ranging higher via GPU Boost. The "boost" clock on these GeForce cards, on which we've based the numbers in the table above, is more of a typical operating speed, not an absolute peak. Three, I really need to tile my bathroom floor, but we'll defer that for later.

But he never got around to it.
 
Asynchronous means not synchronized. If we say two things are asynchronous, we mean that they are decoupled. Asynchrony does not specify implementation - calls to an asynchronous API may very well be implemented sequentially.

The feature AMD calls asynchronous compute is actually about concurrency: that the hardware can execute graphics and compute at the same time. This concurrency is the source of performance gains.
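A tiny CPU-side illustration of that last point (purely illustrative, nothing GPU-specific): std::async is an asynchronous API, yet with the deferred launch policy the work is still executed sequentially on the calling thread.

Code:
#include <future>
#include <iostream>

static int expensive_work()
{
    return 42;
}

int main()
{
    // Asynchronous API: the call returns immediately with a future...
    std::future<int> result = std::async(std::launch::deferred, expensive_work);

    // ...but nothing runs concurrently. With the deferred policy the work is
    // executed sequentially, on this thread, only when the result is requested.
    std::cout << result.get() << "\n";
    return 0;
}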

You misunderstood; that's what I was saying. Asynchronicity is only about ordering (out-of-order execution) and scheduling. And that property is a requirement for auto-parallelizing the program stream in the most efficient manner, for example with concurrent program parallelism. Think of a fiber-to-thread auto-parallelizer: an OS chunks threads into fibers and auto-parallelizes them over the available cores, and it still works with one core. This is only possible because programs handle their synchronization 100% themselves; programs are asynchronous by default. Imagine if programs had to be executed in launch order. This is what the GPU needed; it's a technical debt the API owed to programmers.
 
DirectX queue API (asynchronous compute) is conceptually similar to the multithreading API on modern operating systems (Windows/Linux/etc). The OS gives you an API to start multiple threads. You can start more threads than your CPU core count. The OS will interleave threads if needed. Threads (generally) proceed asynchronously of each other. There are no guarantees of timing, and no guarantees that two threads run concurrently or sequentially. If ordering guarantees are needed, the programmer manually adds synchronization primitives between threads/queues.

AMD GCN functions a bit like hyperthreading in Intel CPUs. It can execute multiple queues concurrently by sharing the compute units between them. If the compute units are not fully occupied by a single queue, having work available in multiple queues can be used to fill the compute units better. Similarly to hyperthreading, the gains are highly application specific (but usually in the 10%-30% region). In some corner cases the performance can even be slightly lower (for example with heavy cache thrashing or starvation).

The DirectX queue API doesn't give any guarantees about concurrent execution. This has been a common design in threading APIs on the CPU side as well. Otherwise the developer would have to write special-case handling for hardware that doesn't have enough hardware threads/queues.
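Rough sketch of what this queue API looks like in D3D12 (illustrative names; the device and pre-recorded command lists are assumed to exist, and error handling is omitted). Note that nothing here forces the GPU to actually run the two queues concurrently:

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create a compute-only queue next to the usual direct (graphics) queue,
// submit work to both, and add explicit fence synchronization. Whether the
// two queues actually overlap on the GPU is up to the hardware/driver.
void SubmitAsyncCompute(ID3D12Device* device,
                        ID3D12CommandQueue* directQueue,          // graphics queue
                        ID3D12GraphicsCommandList* gfxList,       // recorded as TYPE_DIRECT
                        ID3D12GraphicsCommandList* computeList)   // recorded as TYPE_COMPUTE
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // Submit to both queues. Like CPU threads, they proceed asynchronously:
    // no timing guarantees, no guarantee of concurrent execution.
    ID3D12CommandList* gfx[] = { gfxList };
    ID3D12CommandList* cmp[] = { computeList };
    directQueue->ExecuteCommandLists(1, gfx);
    computeQueue->ExecuteCommandLists(1, cmp);

    // If ordering matters, the programmer adds synchronization explicitly:
    // anything submitted to the direct queue after this Wait will not start
    // until the compute queue has reached its Signal.
    computeQueue->Signal(fence.Get(), 1);
    directQueue->Wait(fence.Get(), 1);
}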
 
DirectX queue API (asynchronous compute) is conceptually similar to the multithreading API on modern operating systems (Windows/Linux/etc). The OS gives you an API to start multiple threads. You can start more threads than your CPU core count. The OS will interleave threads if needed. Threads (generally) proceed asynchronously of each other. There are no guarantees of timing, and no guarantees that two threads run concurrently or sequentially. If ordering guarantees are needed, the programmer manually adds synchronization primitives between threads/queues.

AMD GCN functions a bit like hyperthreading in Intel CPUs. It can execute multiple queues concurrently by sharing the compute units between them. If the compute units are not fully occupied by a single queue, having work available in multiple queues can be used to fill the compute units better. Similarly to hyperthreading, the gains are highly application specific (but usually in the 10%-30% region). In some corner cases the performance can even be slightly lower (for example with heavy cache thrashing or starvation).

The DirectX queue API doesn't give any guarantees about concurrent execution. This has been a common design in threading APIs on the CPU side as well. Otherwise the developer would have to write special-case handling for hardware that doesn't have enough hardware threads/queues.

Exactly. The thing that differentiates AMD from the others is not asynchrony. AMD's special ability is actually concurrency: the ability to run graphics and compute at the same time. It's true that an asynchronous API is necessary for the hardware to have the option to either run calls sequentially or concurrently. But this asynchronous API is not sufficient to provide an FPS increase. You can only get that if the workload happens to execute concurrently on the hardware.

As I said, it's very unfortunate that AMD has confused the entire internet about the word asynchronous. And it's unfortunate that journalists haven't pushed back with more correct terminology.
 
It's true that an asynchronous API is necessary for the hardware to have the option to either run calls sequentially or concurrently.
Not even that. Even a multi-threaded design can be synchronous. In fact, it becomes synchronous when you are not providing non-blocking semaphores and signals, but rely solely on the blocking equivalents. (E.g. if one thread sends a signal, it cannot continue until every expected recipient has reached the point where the signal is consumed.)
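A quick CPU-side illustration (illustrative only): two threads, but because every hand-off is a blocking rendezvous, they run in lockstep. Multi-threaded, yet effectively synchronous.

Code:
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool has_value = false;
int value = 0;

// The producer may not continue until the consumer has taken the value,
// so the two threads advance strictly in lockstep.
void producer()
{
    for (int i = 0; i < 3; ++i)
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !has_value; });  // blocked until consumed
        value = i;
        has_value = true;
        cv.notify_all();
    }
}

void consumer()
{
    for (int i = 0; i < 3; ++i)
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return has_value; });   // blocked until produced
        std::cout << value << "\n";
        has_value = false;
        cv.notify_all();
    }
}

int main()
{
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
    return 0;
}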
 
DirectX 12 actually provides two mechanisms that allow (but do not force) concurrent execution by the GPU.

1. Compute queues offer an asynchronous API for multitasking. Discussed above.

2. Automatic resource barriers are gone (DX11 assumed the worst case, resulting in unnecessary GPU stalls). The GPU can concurrently execute commands (draws and dispatches) submitted into a single queue (all IHVs can do this, assuming that sequential commands are either graphics or compute, but not both interleaved). DirectX 12 requires developers to explicitly define resource barriers. Barriers provide a mechanism to ensure correct ordering between dependent resource writes and reads. They are a way to limit concurrency and reordering.
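For reference, a minimal sketch of such an explicit barrier in D3D12 (illustrative names): a transition between a compute dispatch that writes a buffer through a UAV and later draws that read it as an SRV. Commands not separated by a barrier like this are fair game for the GPU to overlap.

Code:
#include <d3d12.h>

// A compute dispatch writes 'buffer' through a UAV; later draws read it as
// an SRV. The explicit transition barrier is what guarantees the ordering.
void DispatchThenRead(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* buffer)
{
    cmdList->Dispatch(64, 1, 1);  // writes 'buffer' via a UAV

    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type                   = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = buffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
    cmdList->ResourceBarrier(1, &barrier);

    // Draws recorded after this point may safely read 'buffer' as an SRV.
}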

In my opinion (2) is often better, since it allows the developer more control over timing (what is running concurrently). However, in some cases, such as rendering a shadow map (lots of tiny draw calls), it is better to execute a long-lasting compute shader concurrently from a separate queue. This is also sometimes a better way to fill stalls (reductions with dependency chains). However, most reductions (such as mip chain, occlusion pyramid, or average luminance generation) can be written as a single compute dispatch, reducing this problem.

There are also hacks available to work around the concurrent compute + graphics execution limitation. Nvidia's presentation, for example, recommends running simple compute work as a vertex shader (with no position output, writing directly to a UAV). This runs concurrently with graphics tasks (on Nvidia and Intel). However, a vertex shader has no LDS and no thread synchronization primitives, which limits the usability to simple tasks.
 
I thought Scott Wasson would after all the hints in his TitanX review: http://techreport.com/review/27969/nvidia-geforce-gtx-titan-x-graphics-card-reviewed/3
But he never got around to it.
Yeah several others and I even offered Scott a nice little utility to demonstrate it in practice but he was (and I imagine still is) a busy guy :)

It's some very cool hardware in any case and definitely part of the big efficiency jump in Maxwell. I suspect it'll continue to evolve into a sort of immediate mode/TBDR hybrid and we'll meet the mobile architectures somewhere in the middle.
 
I think AMD has put some thought into an intermediate stage that bins geometry while receiving primitives externally as an immediate mode device. The later raster and back-end resources are generally already tiled or agnostic to the change.

That would seemingly provide a more significant amount of culling effectiveness, although I don't know if it would produce results like those seen in the Maxwell testing. Whether such a hybrid scheme would "just work" all the time seems doubtful if there are state changes or the binned geometry might do something to modify the depth data the culling process depends on.
 
Whether such a hybrid scheme would "just work" all the time seems doubtful if there are state changes or the binned geometry might do something to modify the depth data the culling process depends on.
Maxwell does not, to my knowledge, try to "run ahead" and pre-cull/Z-test any geometry based on *later* occlusion like TBDRs do. It simply bins up non-trivial amounts of geometry and then rasterizes/runs them in a different order in screen space.

I imagine in the future they will start to expand into shading just position and running ahead with Z tests/occlusion, but for now it's not terribly crazy and shouldn't really have any failure cases (although it can of course be forced to fall back to more typical IMR under certain conditions). And certainly AMD might have been considering something even more complicated from the start, but I think NVIDIA's staging the transition between several architectural iterations is smart here.
 
You've got it all wrong. That's not how async works; in fact, it's the complete opposite of that. It's about load balancing, not reservation. We've evolved to a Unified Shader Architecture for a reason.
According to Nvidia's Pascal presentation material Maxwell used static reservation.

Yeah several others and I even offered Scott a nice little utility to demonstrate it in practice but he was (and I imagine still is) a busy guy :)

It's some very cool hardware in any case and definitely part of the big efficiency jump in Maxwell. I suspect it'll continue to evolve into a sort of immediate mode/TBDR hybrid and we'll meet the mobile architectures somewhere in the middle.
Have you tried enabling performance counters in your test? One of my co-workers found that doing so disabled binning in his test.
 
Have you tried enabling performance counters in your test? One of my co-workers found that doing so disabled binning in his test.
Haven't played with that, no. That's actually pretty funny... I can totally see NVIDIA being like "omg we need to keep it a sekrit so turn it off if they look too closely!!" :p Although I suspect the real reason is that the perf counters are incorrect with it on or something but that's less amusing.
 
They'll also not do concurrent compute dispatches in a graphics queue if you're looking too closely (i.e. if you have timestamps between dispatches).
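For the record, "timestamps between dispatches" means something like this in D3D12 (illustrative sketch; the query heap and readback buffer are assumed to be created elsewhere):

Code:
#include <d3d12.h>

// Bracketing dispatches with timestamp queries in a direct command list.
// The query heap (D3D12_QUERY_HEAP_TYPE_TIMESTAMP) and the readback buffer
// are assumed to be created elsewhere.
void TimedDispatches(ID3D12GraphicsCommandList* cmdList,
                     ID3D12QueryHeap* timestampHeap,
                     ID3D12Resource* readbackBuffer)
{
    cmdList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 0);
    cmdList->Dispatch(256, 1, 1);   // compute job A
    cmdList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 1);
    cmdList->Dispatch(256, 1, 1);   // compute job B
    cmdList->EndQuery(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP, 2);

    // Copy the raw ticks to a CPU-readable buffer; divide deltas by
    // ID3D12CommandQueue::GetTimestampFrequency() to convert to seconds.
    cmdList->ResolveQueryData(timestampHeap, D3D12_QUERY_TYPE_TIMESTAMP,
                              0, 3, readbackBuffer, 0);
}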

By the way: has Oxide ever clarified their scenario where Maxwell just sux when using "async compute", to the point that it's better to turn it off? Or has anyone else seen this happen, and in what circumstances?
 
AFAIR, in early testing, when AotS was at most still in beta, Nvidia cards would show a negative performance impact when switching from DX11 to DX12. Most people interpreted this as async compute being the culprit, since the Radeons showed massive gains with the same switch. Anandtech was one of the few exceptions to this rule, doing a rather thorough analysis right from the start:
http://www.anandtech.com/show/10067/ashes-of-the-singularity-revisited-beta/6
The confusion might have come from the shifted focus to AC in later builds:
http://www.anandtech.com/show/9740/directx-12-geforce-plus-radeon-mgpu-preview/4
And it was there, contrary to the first, Multi-Adapter-centered test, that the Fury X was actually faster than the 980 Ti, while in the Multi-Adapter test a single 980 Ti beat the Fury X. Maybe people read too much into this data.
 
I'm wondering about the comments from Oxide itself: that they had to disable AC on GeForce cards, which you would not need to do if there simply wasn't any performance gain. So in what use case is turning AC off that much better on hardware that "does not do AC"?
 
When the scheduler does not cause an additional stall :p

But I can tell you, the developer world was told that Maxwell supported compute and graphics concurrently, while Kepler (and Fermi, sigh!) should not add any noticeable overhead due to driver serialization... But it looks like this ancient story was just... a fantasy story!
 
It can. The problem is the static partitioning: it might be OK for one workload, but what about all the other workloads after it? If they don't align well with the first prediction, it's going to have problems.
 
I'm not that interested in theory. I want to test it. So how do I do that? That means one of two scenarios:
a) Static partitioning: say, render a shadow map (graphics) and do something with compute at the same time => observe a speedup.
b) Submit some draw calls to the graphics queue, submit some compute to the compute queue => observe a massive drop in performance compared to doing this in one queue.
Because I think Ryan mentioned during the Pascal launch event that a) was never enabled, at least not with the public drivers. Though if anyone would, Oxide would probably get special builds where this is turned on. And Oxide keeps complaining about b), to the point that they have supposedly disabled the async compute path on GeForce. So how can one test some of this BS?
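Roughly what I have in mind for b), as a sketch (D3D12; all objects are assumed to be created and recorded elsewhere, names are illustrative): submit the same workload once into a single queue and once split across two queues, then compare wall-clock times.

Code:
#include <windows.h>
#include <d3d12.h>
#include <chrono>

// Run the same workload with and without a separate compute queue and return
// the wall-clock time in milliseconds. 'computeWork' must be recorded with a
// command list type matching the queue it is submitted to (DIRECT for the
// one-queue case, COMPUTE for the two-queue case).
double MeasureSubmissionMs(ID3D12CommandQueue* directQueue,
                           ID3D12CommandQueue* computeQueue,  // nullptr => one-queue case
                           ID3D12CommandList* gfxWork,
                           ID3D12CommandList* computeWork,
                           ID3D12Fence* fence,
                           HANDLE fenceEvent,
                           UINT64 fenceValue)
{
    auto start = std::chrono::high_resolution_clock::now();

    ID3D12CommandList* gfx[] = { gfxWork };
    ID3D12CommandList* cmp[] = { computeWork };
    directQueue->ExecuteCommandLists(1, gfx);

    if (computeQueue)
    {
        computeQueue->ExecuteCommandLists(1, cmp);   // split across two queues
        computeQueue->Signal(fence, fenceValue);
        directQueue->Wait(fence, fenceValue);        // GPU-side wait, keeps the final signal honest
    }
    else
    {
        directQueue->ExecuteCommandLists(1, cmp);    // everything in one queue
    }

    // The direct queue signals last, so the event fires only when both
    // workloads have finished.
    directQueue->Signal(fence, fenceValue + 1);
    fence->SetEventOnCompletion(fenceValue + 1, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);

    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}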
 
a) Static partitioning: say, render a shadow map (graphics) and do something with compute at the same time => observe a speedup.
b) Submit some draw calls to the graphics queue, submit some compute to the compute queue => observe a massive drop in performance compared to doing this in one queue.
a) was enabled, b) wasn't.

Place your draw and dispatch calls into the same queue, and they go into different pipelines, subject to static partitioning, with all the related downsides such as a screwed-up estimation.
Still works, though (like, at all), as long as Nvidia devs add a profile to the driver, adjusting the partitioning scheme for you. Without support from NV engineers, the driver fucks up.

b) still doesn't work. At least not when monitoring GPU activity. There might be cases in which the driver attempts to re-assemble command buffers from multiple queues into a single one (once again, only if an NV engineer hacks the driver specifically for your game), but apart from that? No chance.

And no, you can't test that properly. As I said, the driver-side heuristics are a complete failure, and unless someone from NV patches in a profile for your benchmark, you are not even going to be able to reproduce a).
 