DX12 Performance Discussion And Analysis Thread

He could have also spoken more about the use of 12_1 features like Conservative Rasterization or Rasterizer Ordered Views in their benchmark engine.
 
Then if this were the case, why is Fiji performance right around where it should be, hanging around the 980 Ti? I would expect it to crush the 980 Ti if what that Oxide dev stated is correct. Not to mention he says he thinks, lol, he doesn't know, he's making a guess. If I had access to that alpha/beta game, I would run it through a profiler and find out exactly what the difference is. It wouldn't take more than 5 minutes, and you don't even need to be an Oxide dev to see that.
 
Hmm, everything from the CUDA developer toolkit says otherwise about Maxwell 2 not having async shaders :cry:

I think it would have been better if he hadn't said anything lol

The driver reports it as being there. They tried using it. Performance was abysmal, so they put in vendor-specific (Nvidia) code to disable it for the time being. Perhaps Nvidia will eventually make it usable in DX12; until then, Nvidia remains the only vendor with vendor-specific code (so that it doesn't tank performance on Nvidia hardware).

Then again, conspiracy theorists might say it's intentionally not working well because performance with it is worse than on AMD hardware: make it unusable, and the majority of developers who have Nvidia hardware in their development machines won't bother to implement it in their games. But it's probably more likely that it's just plain bad on Nvidia hardware, or that they just haven't gotten around to making a DX12 driver that handles it well.
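Since there is no async compute cap bit to query in D3D12, the kind of vendor-specific path described above can only key off the adapter itself. A minimal sketch of that idea (the helper name and the fallback policy are assumptions, not Oxide's actual code):

Code:
#include <dxgi1_4.h>

// Sketch only: decide whether to route compute work through a separate
// async compute queue or keep everything on the direct (3D) queue.
// PCI vendor IDs: 0x10DE = NVIDIA, 0x1002 = AMD, 0x8086 = Intel.
bool UseSeparateComputeQueue(IDXGIAdapter1* adapter)
{
    DXGI_ADAPTER_DESC1 desc = {};
    adapter->GetDesc1(&desc);
    // Hypothetical policy matching the thread: skip the async path on NVIDIA
    // for now because it tanked performance there, keep it everywhere else.
    return desc.VendorId != 0x10DE;
}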

From the link above.

Our use of Async Compute, however, pales with comparisons to some of the things which the console guys are starting to do. Most of those haven't made their way to the PC yet, but I've heard of developers getting 30% GPU performance by using Async Compute. Too early to tell, of course, but it could end being pretty disruptive in a year or so as these GCN built and optimized engines start coming to the PC. I don't think Unreal titles will show this very much though, so likely we'll have to wait to see. Has anyone profiled Ark yet?

That's fairly interesting, and it should poke Nvidia to get to work on that in the driver if they've been slacking on it.

Regards
SB
 
It doesn't make sense. If this were the case, Fiji would have the same performance advantage over competing cards as the 290X, and that doesn't happen at any resolution, even with higher or lower batch counts.
 
Curiously, their driver reported this feature was functional but attempting to use it was an unmitigated disaster in terms of performance and conformance so we shut it down on their hardware. As far as I know, Maxwell doesn't really have Async Compute so I don't know why their driver was trying to expose that.
How sure are we that this is an actual developer? There are NO "async compute" flags or cap bits.
 
Our use of Async Compute, however, pales with comparisons to some of the things which the console guys are starting to do. Most of those haven't made their way to the PC yet, but I've heard of developers getting 30% GPU performance by using Async Compute. Too early to tell, of course, but it could end being pretty disruptive in a year or so as these GCN built and optimized engines start coming to the PC.
With deferred texturing you can get significantly higher than 30% gains, since the G-buffer pixel shader doesn't do any texturing or use much BW. The exact number of course depends heavily on your lighting and post-processing algorithms and how much parallelism your pipeline offers.
 
This isn't painting a good picture for Nvidia, but the comments around not seeing similar performance gains for Fiji also have a very valid point. We clearly need more data.

It'll be interesting indeed if NV doesn't support async compute, but as AndyTX said in another thread, some architectures are built for maximum performance without having to rely on it, which is arguably better. Obviously he's talking about Intel architectures there, though, so whether that applies to Maxwell too is another question. It also doesn't get away from the fact (if all this is true) that, despite needing async compute to access it, AMD GPUs may have a reserve of "untapped potential" that NV GPUs lack, which could be a game changer for games that make use of it. If true, this is probably good for the PC GPU industry overall.

But does this also mean the consoles will be performing in line with upper-mid-range Kepler GPUs like the 680 when async compute is used heavily? That would be interesting indeed. It isn't happening yet though.
 
How sure are we that this is an actual developer? There are NO "async compute" flags or cap bits.
If NVidia's driver is saying it's FL12_0, doesn't that mean it's saying it has async compute? So they have to resort to hardware detection?
 
If NVidia's driver is saying it's FL12_0, doesn't that mean it's saying it has async compute? So they have to resort to hardware detection?
Some features, such as async compute and ExecuteIndirect, are new DirectX 12 API features. These features are supported by all GPUs.

Async compute is a little bit like hyperthreading. It helps some workloads a lot, while it (slightly) hurts some workloads. Because every GPU has slightly different bottlenecks, it is very hard to write async compute code that benefits them all. Console engines will of course be optimized first for GCN hardware. This might result in gains on the PC side as well, but it is too early to say really, since even AMD PC GPUs have different bottlenecks (the config is not identical to the consoles). Nobody has shipped a DX12 PC console port yet and the DX12 drivers are still quite immature.
 
Some features, such as async compute and ExecuteIndirect, are new DirectX 12 API features. These are supported by all GPUs.
Indeed. And an optional feature at that. You don't need async compute to be DX12 compliant, which is why even Fermi will (eventually) be DX12 capable.

The idea is that you call up jobs and execute them as you need them done (a la CPU threads); however, I suspect you're going to need separate paths for different async configurations to get the best performance, since you have 3D, compute, and DMA/copy to concern yourself with (welcome to low-level programming).
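To make the 3D/compute/DMA split concrete, here's a minimal sketch of creating the three D3D12 queue types you can submit to independently (assuming an already-created ID3D12Device; error handling omitted):

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: one queue per engine type. Work submitted to different queues
// may overlap on the GPU, subject to fences and the hardware/driver.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& gfxQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue,
                  ComPtr<ID3D12CommandQueue>& copyQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};

    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // 3D: graphics + compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&gfxQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COPY;     // DMA/copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&copyQueue));
}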
 
Async compute is like multithreading. Traditionally you had one thread (one command list). Each command could utilize any amount of parallel hardware resources with parallel_for_each (kernel invocations are practically parallel for loops). Now with async compute, you have multiple threads, each capable of issuing parallel_for_each commands, with fences in between them to synchronize execution when needed.

Having more queues (threads) is mostly helpful when the queues (threads) are issuing kernels (parallel_for_each invocations) that do not by themselves fill the GPU (CPU). Not all tasks can be split into hundreds of thousands of parallel work items.

For example, we have many single-lane kernels that set up the indirect draw/dispatch parameters for the next call (read an append counter, divide it by 256, write the thread block count). Running a single-lane shader like this alone on a fat GPU (Fury X) that could simultaneously execute 4096 lanes is highly wasteful. This is a perfect example of a case where multiple command queues help to utilize the GPU better.
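To illustrate, a rough sketch of overlapping that kind of single-lane setup work with graphics by using a second queue and a fence (the command list names are placeholders; this assumes a direct queue and a compute queue already exist):

Code:
// Sketch: run the tiny argument-setup dispatch on the compute queue while the
// direct queue keeps rendering independent work, then make the indirect draw
// wait on a fence. Both Signal and Wait are GPU-side; the CPU never stalls here.
Microsoft::WRL::ComPtr<ID3D12Fence> fence;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

ID3D12CommandList* setup[] = { argSetupList.Get() };          // single-lane kernel
computeQueue->ExecuteCommandLists(1, setup);
computeQueue->Signal(fence.Get(), 1);                         // args are ready

ID3D12CommandList* other[] = { independentGraphicsList.Get() };
gfxQueue->ExecuteCommandLists(1, other);                      // overlaps with setup

gfxQueue->Wait(fence.Get(), 1);                               // wait for the args
ID3D12CommandList* draw[] = { indirectDrawList.Get() };
gfxQueue->ExecuteCommandLists(1, draw);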
 
Having more queues (threads) is mostly helpful when the queues (threads) are issuing kernels (parallel_for_each invocations) that do not by themselves fill the GPU (CPU). Not all tasks can be split into hundreds of thousands of parallel work items.
I think it actually gets a bit more complicated than that. My initial understanding was the same: more command queues, more chances to keep the GPU busy. Though 128 queues in my benchmark probably went a bit overboard (I should have put some diagnostics in there, but maybe that's where GCN crashed?).
There are basically two commands: Dispatch and Draw. They can be put into a command list, and the command list gets executed on a command queue. What I'm seeing at the moment, though, is that adding a bunch of (single-lane) Dispatch calls to a single command list executed on a single queue will actually run the dispatches in parallel on Kepler. Two command lists, however, will not run in parallel (at least not on Kepler).
The interesting bit, of course, is compute + graphics; I'm extending in that direction at the moment.
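In code, that single-list case is roughly something like this (just a sketch with placeholder PSO/root-signature names, not the actual benchmark source):

Code:
// Sketch: record many single-thread-group Dispatch calls back to back in one
// compute command list, with no barriers between them, so the GPU is free to
// overlap them if the hardware/driver can.
computeList->SetComputeRootSignature(rootSig.Get());       // placeholder
computeList->SetPipelineState(tinyKernelPSO.Get());        // placeholder
for (UINT i = 0; i < 128; ++i)
{
    computeList->SetComputeRoot32BitConstant(0, i, 0);     // which output slot to write
    computeList->Dispatch(1, 1, 1);                        // one thread group each
}
computeList->Close();

ID3D12CommandList* lists[] = { computeList.Get() };
computeQueue->ExecuteCommandLists(1, lists);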
 
I think it actually gets a bit more complicated than that. My initial understanding was the same: more command queues, more chances to keep the GPU busy. Though 128 queues in my benchmark probably went a bit overboard (I should have put some diagnostics in there, but maybe that's where GCN crashed?).
There are basically two commands: Dispatch and Draw. They can be put into a command list, and the command list gets executed on a command queue. What I'm seeing at the moment, though, is that adding a bunch of (single-lane) Dispatch calls to a single command list executed on a single queue will actually run the dispatches in parallel on Kepler. Two command lists, however, will not run in parallel (at least not on Kepler).
The interesting bit, of course, is compute + graphics; I'm extending in that direction at the moment.
Yes. DX12 has manual hazard tracking. All modern GPUs can also fetch multiple commands from a single queue and run them concurrently, assuming the resource barriers you put in the command queue allow that. In my single-lane shader example, a barrier is required on both sides (before and after), preventing any parallelism within the same queue.
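For reference, that barrier-on-both-sides pattern looks roughly like this (a sketch; 'indirectArgsBuffer' and 'cmdList' are placeholders):

Code:
// Sketch: a UAV barrier before and after the single-lane dispatch forces
// everything in this queue to serialize around it, which is exactly what
// prevents parallelism within a single queue.
D3D12_RESOURCE_BARRIER uav = {};
uav.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
uav.UAV.pResource = indirectArgsBuffer.Get();   // placeholder resource

cmdList->ResourceBarrier(1, &uav);   // wait for the producing pass to finish
cmdList->Dispatch(1, 1, 1);          // single-lane argument setup
cmdList->ResourceBarrier(1, &uav);   // make the result visible to the consumer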
 
So async compute is optional in D3D12. I suppose over time NVidia will work out which games to say "nope" to when queried for this then.
You can't query for this in D3D12; it's just there. As is a bunch of FUD about this feature. :)
 
Oh, so when Ryan says these things are optional features, what he's saying is that they aren't necessarily implemented as hardware capabilities.
 
"Async compute" and execute indirect are not optional, are part of the API. You cannot query the GPU to know if they are supported or not. What is optional is if the hardware takes advantage or not of them. we can call them "implementation defined" by the hardware/driver.

And if the GPU hardware doesn't support it, the "threads" are simply executed serially?

If there isn't any hardware "dedicated support" (or whatever you want to call it), everything is serialized by the driver.

All this means that the same code using asynchronous compute workloads (along with graphics) and execute indirect will run across all D3D12-capable drivers/hardware.
 
It is practically the same as what happens on CPUs. You can create any number of threads (queues) even if your CPU (GPU) has just a single core (runs a single command stream). If more threads (queues) are active at the same time than the hardware supports, they will be periodically context switched.
 