DX12 Performance Discussion And Analysis Thread

Yes, it does. It seems drivers have to be improved for Fiji too; it's not limited to Maxwell 2. With the 290X series Mantle was doing fine, and I think that driver experience translated over to DX12, while with Mantle we can see that Fiji doesn't perform as well as expected there either.
 
What about NVIDIA apparently using preemption rather than truly parallel execution, and having to wait for a draw call to finish before switching context? At least that's the case for "asynchronous timewarp", which is just another case of asynchronous shaders / compute? (It's in their VR documents, linked here too, I'm sure.)
Uh, that's an ugly topic.

Preemption is only used in one case, and that's when switching between "graphics context" (1x graphics + 31x compute mode) and "pure compute context" (32x mode). Or it SHOULD have been used in that case. Should. It's utterly broken. The tests indicate that the hardware gets stuck in graphics mode until the previous batch of draw calls has run through the queues in full.

In short: it doesn't work. Thanks to the utterly broken "preemption" (heck, that IS NOT EVEN PREEMPTION WHEN YOU ARE JUST WAITING!), a task in a different context can simply starve. I have no clue how they INTENDED preemption to work originally, but it's pretty clear that it doesn't. It was probably just a driver hack originally, not a hardware feature.
 
What about NVIDIA apparently using preemption rather than truly parallel execution
Oh, and no, Maxwell V2 is actually capable of "parallel" execution. The hardware doesn't profit from it much though, since it has only small "gaps" in shader utilization either way. So in the end it's still just sequential execution for most workloads, though if you did manage to stall the pipeline in some way by constructing an unfortunate workload, you could still profit from it. In general, you are only saving on idle time, as long as there is always at least one queue that contains any work at all. Unlike GCN, where you actually NEED that additional workload to get full utilization.

Only Fermi, Kepler and Maxwell V1 have that issue: they can't do ANY async compute shaders while in graphics context, as they don't have any compute queues in that mode. But they aren't even feature level 12_0, so I wouldn't count them as DX12 "capable" either way.
 
Original topic:
What performance gains are to be expected from the use of async shaders and other DX12 features? Async shaders are basically additional "tasks" you can launch on your GPU to increase utilization, and thereby efficiency as well as performance, even further. Unlike in the classic model, these tasks are not performed one after another, but truly in parallel.
Please don't use the word "asynchronous" to mean "concurrent". The vast majority of asynchronous interfaces are implemented sequentially. Asynchronous shaders specify the interface, they don't specify the implementation. If you are interested in measuring whether asynchronous shaders can be executed concurrently, in order to improve efficiency, please use "concurrent" rather than "asynchronous".
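To make that interface-versus-implementation distinction concrete: at the API level, all a D3D12 application does to expose "async" compute is create a second command queue of type COMPUTE next to the usual DIRECT (graphics) queue. Here is a minimal C++ sketch, assuming an already created ID3D12Device (the helper name is mine, not from any post here); submission to separate queues is asynchronous by definition, but whether the GPU executes them concurrently is entirely up to the hardware and driver.
Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical helper: create a command queue of the requested type on the given device.
ComPtr<ID3D12CommandQueue> CreateQueue(ID3D12Device* device, D3D12_COMMAND_LIST_TYPE type)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type     = type;  // DIRECT = graphics + compute, COMPUTE = compute only
    desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_NORMAL;
    desc.Flags    = D3D12_COMMAND_QUEUE_FLAG_NONE;

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

// Usage: one graphics queue plus one "async" compute queue on the same device.
// ComPtr<ID3D12CommandQueue> graphicsQueue = CreateQueue(device, D3D12_COMMAND_LIST_TYPE_DIRECT);
// ComPtr<ID3D12CommandQueue> computeQueue  = CreateQueue(device, D3D12_COMMAND_LIST_TYPE_COMPUTE);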
 
Uhm, hello guys, I'm coming here because I have a doubt about all of this, after reading some stuff at overclock.
What does async have to do with games going towards "compute"?
"Mahigan" stated in a post: "Game engines are heading towards being more compute oriented. That's not because of a bias towards AMD, that's because that's what DX12 and Vulkan are all about.

It just so happens that GCN hardware is more compute oriented compared to its NVIDIA hardware competition. This will change. Expect to see NVIDIA boost its compute performance going forward."

A couple of questions arise.

1- What does this "compute orientation" mean? (As far as I know, Maxwell 2 is capable of doing as many TFLOPS as a 390X.) I know context switching is a problem on Maxwell 2.0, but I'm not understanding what this means in the end for a general user.

2- Does this mean that Maxwell 2 will be obsolete before Fiji? (Or, in the next 5 years, be worse at games?)

3- Do async shaders mean the same thing as compute orientation in games?

Thanks in advance, I just want to learn a little bit more about this.
 
1) Compute shaders are used for complex calculations for lighting, physics and rendering. We don't know if it's a context-switching problem, because there are issues before that: the drivers. And again, context switching might not even be an issue in this case, because they probably shouldn't be using context switching when doing async.

2) In 5 years both cards will be obsolete; it doesn't matter if one can do compute better than the other in that time frame, because graphics are still going to be pushed more as more graphics resources become available.

3) Async shaders are the ability to interleave both graphics and compute instructions into the same pipeline, and by doing so reduce latency by using the ALUs more efficiently, without any waiting periods.
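To put 3) into code: the sketch below (my own illustration; all names are placeholders, and the command lists and fence are assumed to have been created and recorded elsewhere) shows how a D3D12 engine submits a compute pass on its own queue and makes the dependent graphics work wait for it on the GPU timeline rather than on the CPU, which is what lets the hardware overlap the two where it can.
Code:
#include <d3d12.h>

// Hypothetical helper: submit one compute command list and one dependent graphics
// command list so the graphics work waits for the compute results on the GPU.
void SubmitAsyncComputeThenGraphics(ID3D12CommandQueue* computeQueue,
                                    ID3D12CommandQueue* graphicsQueue,
                                    ID3D12CommandList*  computeList,
                                    ID3D12CommandList*  graphicsList,
                                    ID3D12Fence*        fence,
                                    UINT64              fenceValue)
{
    // Kick off the compute work on its own queue; this call returns immediately.
    computeQueue->ExecuteCommandLists(1, &computeList);
    // Mark the point on the compute timeline where its results are complete.
    computeQueue->Signal(fence, fenceValue);

    // The graphics queue waits for that point on the GPU, not on the CPU, so any
    // graphics work submitted earlier can still overlap with the compute pass.
    graphicsQueue->Wait(fence, fenceValue);
    graphicsQueue->ExecuteCommandLists(1, &graphicsList);
}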
 
I just wanted to interject a quick thank you for all the explanations! I'm finding this thread fascinating as hell, but at the same time very difficult...so again my thanks. :)
 
Basically, even though Maxwell 2 may be as strong as a 390x in compute, if it can't also do graphics rendering concurrently it's pretty useless (since that's one of the most important tasks your GPU has). Currently, it appears that Maxwell 2 and GCN 1.2 (Fury X) don't handle concurrent graphics+compute well, and we're trying to figure out why, but supposedly it'll be fixed in a driver update.
 
Abstract on Hyper-Q

Hyper-Q enables multiple CPU threads or processes to launch work on a single GPU simultaneously, thereby dramatically increasing GPU utilization and slashing CPU idle times. This feature increases the total number of "connections" between the host and GPU by allowing 32 simultaneous, hardware-managed connections, compared to the single connection available with GPUs without Hyper-Q (e.g. Fermi GPUs).
http://docs.nvidia.com/cuda/samples/6_Advanced/simpleHyperQ/doc/HyperQ.pdf
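The host-side pattern the abstract describes boils down to issuing independent work on many CUDA streams. Below is a rough C++ sketch of that (my own illustration, not taken from the linked sample, with a placeholder async copy standing in for a kernel launch): on Hyper-Q hardware, up to 32 such streams can map to separate hardware connections, while on Fermi they all funnel through a single one.
Code:
#include <cuda_runtime.h>
#include <vector>

int main()
{
    const int    kStreams = 32;       // matches the 32 connections quoted above
    const size_t kBytes   = 1 << 20;

    std::vector<cudaStream_t> streams(kStreams);
    std::vector<void*>        buffers(kStreams);
    std::vector<char>         host(kBytes, 0);

    for (int i = 0; i < kStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&buffers[i], kBytes);
    }

    // Issue independent work on every stream; in the real Hyper-Q sample these are
    // kernels, here a placeholder async copy stands in for "work on connection i".
    for (int i = 0; i < kStreams; ++i) {
        cudaMemcpyAsync(buffers[i], host.data(), kBytes,
                        cudaMemcpyHostToDevice, streams[i]);
    }

    cudaDeviceSynchronize();          // wait for all 32 streams to drain

    for (int i = 0; i < kStreams; ++i) {
        cudaFree(buffers[i]);
        cudaStreamDestroy(streams[i]);
    }
    return 0;
}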
 
I wasn't trying to suggest that 64 kernel launches are actually assembled into a single work-group to run as a single hardware thread (though I did theorise that this is possible).

I think you guys misunderstand the nature of the ACEs. Each of them manages a queue. I would expect an ACE to be able to launch multiple kernels in parallel, which appears to be what we're seeing in each line where the timings in square brackets feature many results which are identical, e.g.:

Code:
64. 5.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39]
65. 10.59ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37]
66. 10.98ms [5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 5.39 10.37 10.78]
For this test, with 1 thread per kernel, it would devolve to 1 kernel launch per cycle per ACE--if each launch were 1 cycle. At the speeds in question it would be difficult to see a difference.

One thing I was wondering about that I hoped would be teased out if we were able to vary the length of the inner loop, or modify the start and end size of the batches, was whether there was something to be derived by looking at the list rotated 90 degrees.
The unrolled loop's shortening of execution times does something similar.

There is a pattern to the stair steps that is not affected by execution time. It's not consistent across all GCN GPUs, but it seems somewhat stable among examples from the same family.
I'm picking through test runs, so it's not rigorous, but one interpretation is that there is a square of dispatches, horizontally within each batch and vertically between batches.

The recent Fiji test shows that there are almost 30x30 blocks of similar times before the times to the right step up to the next value.
Viewing the next rows, 30 through 60, that second set of times forms its own rough square. There are 1-2 times that show up past the threshold in a few rows, so there's an edge to the heuristic somewhere in there.

Tahiti-derived GPUs seem to have something of a rough 64x64.
There are breaks in the pattern in a few rows, which might depend on whether we're looking at a full or salvage die. The 7950 and non-X 280 have a row or two near the end of their stride that are slower than the next.
Sea Islands hovers around 34x34; both the 290 and 7790 show this with and without unrolling, although I am focusing primarily on the 30/60/90/130 range at present and haven't gone through the numbers outside of that range.

Addendum:
There are other interpretations of the data. Since this is being pipelined, and there aren't absolute timings, there are other ways to draw boundaries, and even different ways to make things fit within the same set of boundaries.

edit:
Interestingly, further down the unrolled Fury results, there is a range (91-122) of 32 batches that have roughly 30-dispatch strides sharing a similar time. It is followed by a run of 28 batches with a stride of 30.
 
Basically, even though Maxwell 2 may be as strong as a 390x in compute, if it can't also do graphics rendering concurrently it's pretty useless (since that's one of the most important tasks your GPU has).
Compute is not useless by any means. Many DirectX 11 games used compute shaders. Async is just a bonus. There are lots of games out there that use compute shaders for lighting (Battlefield 3 being the first big AAA title). Compute shader based lighting performs very well even on AMD TeraScale VLIW GPUs (such as the Radeon 5000 and 6000 series) and NVIDIA Fermi (GeForce GTX 400 series). Compute shaders are used in rendering because they allow writing less brute-force algorithms that save ALU and bandwidth compared to their pixel shader equivalents.

Full screen (~2 million threads) compute shader passes do not need any concurrent graphics tasks to fill the whole GPU. It is perfectly fine to run graphics first to rasterize the G-buffer and shadow maps, and then run compute shaders for lighting and post processing. Games have been doing it like this since DirectX 11 launched. Everybody has been happy. I don't understand the recent fuss.
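For context, the "~2 million threads" figure falls straight out of the dispatch arithmetic. A back-of-the-envelope sketch, assuming a 1920x1080 render target and the common 8x8 thread group size (both are assumptions on my part, not numbers from the post):
Code:
#include <cstdio>

int main()
{
    const unsigned width = 1920, height = 1080;
    const unsigned groupSizeX = 8, groupSizeY = 8;   // [numthreads(8, 8, 1)] in HLSL

    // Round up so partial groups at the screen edges are still covered.
    const unsigned groupsX = (width  + groupSizeX - 1) / groupSizeX;   // 240
    const unsigned groupsY = (height + groupSizeY - 1) / groupSizeY;   // 135
    const unsigned threads = groupsX * groupsY * groupSizeX * groupSizeY;

    // Prints: Dispatch(240, 135, 1) -> 2073600 threads, i.e. ~2 million.
    std::printf("Dispatch(%u, %u, 1) -> %u threads\n", groupsX, groupsY, threads);
    return 0;
}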
 
I wasn't saying compute is useless, I was saying having all that power is useless if you don't use it. Those cards can surely match and beat a 390x in certain workloads, I'm saying if they didn't have to context-switch all the time they'd do much better--right now they're like 1000 workers who work really well on one task, switch contexts and work really well on another task, then switch back, rinse, repeat. That can't be as efficient as having 850 of them working on one task and 150 working on another.
 
There are quite a few variables involved beyond context switching or not context switching that can change the answer.
The best-case scenario, as with multithreading, is workloads whose resource requirements do not overlap.
If the demands of those workloads overlap, such as higher than ideal register or shared memory allocations, or a reduction in cache hit rate due to contention, it can lead to underutilization or stalls being added piecemeal throughout the combined run time.

If the GPU is forced out of its ideal power curve, the higher global activity might cause it to downclock.

It's weighing a small number of explicit cost events against a generally higher cost of doing business.
You can make it go either way.

What the testing being done in this thread does not exercise is how readily different schemes can handle more complicated dependence and synchronization chains. That can also change the overall impact of either method, depending on how large the context switch is relative to the idle periods.
 
Basically, even though Maxwell 2 may be as strong as a 390x in compute, if it can't also do graphics rendering concurrently it's pretty useless (since that's one of the most important tasks your GPU has). Currently, it appears that Maxwell 2 and GCN 1.2 (Fury X) don't handle concurrent graphics+compute well, and we're trying to figure out why, but supposedly it'll be fixed in a driver update.
Actually, the Fury X is handling concurrent compute kernels quite well. I wouldn't know why someone would assume otherwise.

Yes, the latency is comparatively high in relation to what Nvidia has to offer. But the hardware still managed to dispatch up to 128 calls (kernels originally scheduled with only 1 thread each!) in parallel, using only a single queue of a single ACE. That is almost at the limit of what would have been possible based on raw instruction throughput.

The benchmark in this thread isn't fully utilizing Maxwell v2 either. Out of 31 queues, only a single one was used. And the hardware still managed to dispatch up to 32 calls in parallel, likewise from only a single queue.

So GCN 1.2 is possibly underutilized by this benchmark by a factor of ~64, and Maxwell v2 still by a factor of ~31. With the most naive approach, that is.


Once again: the benchmark in this thread only used a SINGLE queue for the compute calls, and with DX12, a single queue in software also maps to a single queue in hardware. It's also basically the worst-case example, as every single compute call only launched a single thread, whereas a realistic number would have been several hundred per compute call. It is actually amazing that both AMD's and Nvidia's architectures still performed so well. In a real game engine, the engine would need to batch multiples of 32 (Nvidia) / 64 (AMD) threads into a single compute call, vectorize the data, throw that into the queue as a single invocation of a compute shader, and then have the hardware start crunching numbers.


We still haven't tested at all what the UPPER limits are. The benchmark currently only yields the lower limits for an absolute worst-case scenario. We still need to test what happens when (a rough sketch of these variations follows the list):
  • The number of software queues equals the number of hardware queues
  • The number of software queues equals the number of ACEs (AMD only)
  • Each compute call dispatches at least a FULL wavefront
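As a rough illustration of the first and last bullets (the queue count is a placeholder, and whether N software queues actually end up on N hardware queues or ACEs is entirely up to the driver):
Code:
#include <d3d12.h>
#include <wrl/client.h>
#include <vector>
using Microsoft::WRL::ComPtr;

// Hypothetical helper: create N software compute queues instead of a single one.
std::vector<ComPtr<ID3D12CommandQueue>> CreateComputeQueues(ID3D12Device* device, unsigned count)
{
    std::vector<ComPtr<ID3D12CommandQueue>> queues(count);
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute-only software queues
    for (auto& q : queues)
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&q));
    return queues;
}

// The command lists recorded for these queues would use a compute shader declared
// with [numthreads(64, 1, 1)] (or 32 on Nvidia) so every dispatch fills at least
// one full wavefront/warp, per the last bullet above.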


Only Maxwell v2 really messed up, as it would require preemption for a context switch between the (pure) compute and graphics contexts, and that preemption isn't working at all currently, so the driver can crash.

Oh, and forced sequential execution of compute calls on Maxwell v2 resulted in horrible performance, even though that was expected. AMD somehow managed to still execute them in parallel, even though asked not to. (Probably because it could deduce from the lack of memory accesses that it was safe to do so.)
 
Oh, and what wasn't tested at all yet either: what happens when the compute kernels are actually causing pipeline stalls on purpose. This would require enqueueing DIFFERENT kernels, each limited by something else, and launching enough of them to actually saturate all shaders in the first place.
 
http://nubleh.github.io/async/#38
It's doing pretty well compared to a 980 Ti, but there are still large parts that aren't in the blue area. Whether that's a driver issue or the way the test was written, I don't know, but for now it's in at least a similar position to the 980 Ti (except in the "forced sequential" case, which I hope is not a normal use-case for compute or graphics in general. D: )
 
Yes, the latency is comparatively high in relation to what Nvidia has to offer. But the hardware still managed to dispatch up to 128 calls (kernels originally scheduled with only 1 thread each!) in parallel, using only a single queue of a single ACE. That is almost at the limit of what would have been possible based on raw instruction throughput.
Raw instruction throughput, or rather raw front-end dispatch capability, would exhaust the total number of dispatches in the test in microseconds.
We're really not measuring things with sufficient granularity to tease out launching kernels in parallel versus the launch process simply going very fast relative to measurements that wobble in the millisecond range.
Once things are launched, we are covering the effectiveness of the concurrency supported by the GPU's pipelined execution.
We also have not tested a multi-queue state, and we do not have ready visibility on what queues are actually being exercised relative to the API-visible queues. The API does not actually care how it happens, and if you have a massive number of independent kernels, a rapid serial launch process is not readily distinguishable from a parallel one.

  • The number of software queues equals the number of hardware queues
  • The number of software queues equals the number of ACEs (AMD only)
  • Each compute call dispatches at least a FULL wavefront
I do not think we can be sure that several of these cases aren't already being hit thanks to the driver and GPU's attempts to maximize utilization, particularly since so many of the dispatches have no dependences on anything else.
The full-wavefront case should in theory not be that different from the 1-thread case.

We'd have to start creating dependence chains and synchronization points to make sure the optimizations for a trivially parallel case are blocked.

A more clear asynchronous test is to insert controlled stalls into one or the other work type and see if the other side can make progress.
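One way to approximate that at the API level, as a sketch under my own assumptions (a stall inside the shader itself would be closer to what is described above; all objects are assumed to be created elsewhere, with both fences starting at 0):
Code:
#include <d3d12.h>
#include <windows.h>

// Returns true if the compute queue finished its work while the graphics queue
// was still deliberately stalled on a fence that nobody had signalled yet.
bool ComputeProgressesDuringGraphicsStall(ID3D12CommandQueue* graphicsQueue,
                                          ID3D12CommandQueue* computeQueue,
                                          ID3D12CommandList*  graphicsList,
                                          ID3D12CommandList*  computeList,
                                          ID3D12Fence*        stallFence,
                                          ID3D12Fence*        computeFence)
{
    // Graphics side: wait for a fence value that has not been signalled, then run.
    graphicsQueue->Wait(stallFence, 1);
    graphicsQueue->ExecuteCommandLists(1, &graphicsList);

    // Compute side: submit independent work and mark its completion.
    computeQueue->ExecuteCommandLists(1, &computeList);
    computeQueue->Signal(computeFence, 1);

    Sleep(100);  // give the GPU some time while the graphics queue is blocked

    const bool computeFinished = computeFence->GetCompletedValue() >= 1;

    stallFence->Signal(1);  // release the graphics queue (CPU-side fence signal)
    return computeFinished;
}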
 
We'd have to start creating dependence chains and synchronization points to make sure the optimizations for a trivially parallel case are blocked.

A more clear asynchronous test is to insert controlled stalls into one or the other work type and see if the other side can make progress.

Maybe Microsoft's nBody sample is helpful: https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/D3D12nBodyGravity
https://msdn.microsoft.com/en-us/library/windows/desktop/mt186620(v=vs.85).aspx
 