DX12 Performance Discussion And Analysis Thread

It's running asynchronously where it's supported. "Async Compute" isn't a mandatory DX12 "flag".


With the 980 Ti performing around 30% faster than an R9 390X and 23% faster than a Fury X, that shows us that its performance is not bound by its async performance, or lack thereof, in this benchmark. So far the only cards that seem to get a substantial benefit from async compute are GCN 1.0 and 1.1 derivatives with AOS, and with this benchmark we can add Maxwell 2.


"Async Compute" is the ability to start rendering and compute tasks at the same time, throughout the ALUs. If it's not running concurrently, there's no "Async Compute" happening.


Async compute is the ability to interleave compute instructions into the graphics queue when resources are available, not just at the same time. Kernels have to be executed concurrently, but instructions don't need to be processed concurrently. That is what RecessionCone is talking about. You have to look at how the instructions are being dispatched and which queue they are being processed in.

What happens with Maxwell 2, or even GCN at a much higher tolerance, is that concurrent instructions begin to stall the GPU's ability to push compute instructions into the graphics queue even though both kernels are running. When this happens the GPU starts running things in serial, or whatever it does, it most likely just breaks down at some point.


What the hell does this even mean?! I was just plain and simple called "obtuse" a couple of posts ago and I'm the one needing to pay attention to my words?! Is dogpiling a thing now on B3D?!

Read above


You could almost say it's a DX12 implementation tailored for nVidia GPUs, then...
Not that I expected any less from Tim Sweeney, though. :(

Possibly, but Sweeney didn't write the code for that, so ...........
 
I don't think you can conclude anything about async compute performance from these results. The benchmark reports how long the compute task took, but it doesn't indicate if the execution overlapped with graphics. The way to measure async compute performance is to run compute and graphics serially and then again asynchronously. This benchmark only does the latter.

Regarding nomenclature... AFAIK AMD coined the term async compute and for AMD it means graphics and compute work comes from different queues and the work is launched and executed simultaneously on the hardware. DX12 async support can be claimed even if the hardware aspect isn't supported. Only software support is needed.
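
As a rough illustration of that measurement approach, here is a minimal CPU-side sketch (std::async and sleeps standing in for GPU queues and passes, so purely an analogy, not D3D12 code): time the two workloads back to back, then overlapped, and compare. If the overlapped time is no better than the serial time, the async submission bought nothing.

    // CPU-side analogy of the serial-vs-async measurement.
    #include <chrono>
    #include <future>
    #include <iostream>
    #include <thread>

    using namespace std::chrono;

    void graphics_work() { std::this_thread::sleep_for(milliseconds(20)); } // stand-in for the graphics pass
    void compute_work()  { std::this_thread::sleep_for(milliseconds(10)); } // stand-in for the compute pass

    int main() {
        // Serial reference: graphics, then compute.
        auto t0 = steady_clock::now();
        graphics_work();
        compute_work();
        auto serial = duration_cast<milliseconds>(steady_clock::now() - t0).count();

        // "Async" run: both submitted at once; overlap only happens if the system provides it.
        t0 = steady_clock::now();
        auto g = std::async(std::launch::async, graphics_work);
        auto c = std::async(std::launch::async, compute_work);
        g.wait();
        c.wait();
        auto overlapped = duration_cast<milliseconds>(steady_clock::now() - t0).count();

        std::cout << "serial: " << serial << " ms, overlapped: " << overlapped << " ms\n";
    }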
 
Guys if there is truly simultaneous 3D/compute work going on here (with "compute shaders" or otherwise) then the sub-timings here are not going to be meaningful. You need a proper timeline picture of what's going on on the entire machine if you want to understand. For instance, it's quite possible that "compute shaders" gets longer at higher res because there's less idle-ness on the machine from the 3D work, so it takes longer for it to fit the compute work into idle CUs or similar. As I said, fundamentally timings like this are not sufficient to understand anything particular about the efficiency of the various passes on the different architectures.

All we can really conclude is that it's clearly not giving AMD a huge advantage in this workload, at least not relative to the NVIDIA cards that are presumably not doing it.

[Edit] What 3dcgi said :)
 
I don't think you can conclude anything about async compute performance from these results. The benchmark reports how long the compute task took, but it doesn't indicate if the execution overlapped with graphics. The way to measure async compute performance is to run compute and graphics serially and then again asynchronously. This benchmark only does the latter.

Regarding nomenclature... AFAIK AMD coined the term async compute and for AMD it means graphics and compute work comes from different queues and the work is launched and executed simultaneously on the hardware. DX12 async support can be claimed even if the hardware aspect isn't supported. Only software support is needed.

In computer science, "Asynchronous" means decoupled in time. Asynchronous APIs (like OpenCL) return from function calls before the work has been completed. This is opposed to synchronous APIs, where the results of function calls are complete before they return.

In computer science, "Concurrent" means executing at the same time.

Asynchronous and concurrent are completely different concepts. In fact, in some ways they are opposite. Asynchronous explicitly means that you don't know when something executes - you can't say if two asynchronous processes are concurrent - all you know is that they are decoupled. In contrast, two concurrent processes must happen simultaneously, or else they are not concurrent. These are fundamentally different concepts, and in a forum like Beyond3D, where many people actually know things, we should use the established terminology.

All GPU compute APIs are asynchronous and have been at least since the invention of CUDA.

Most asynchronous interfaces are sequential. For example, if you've ever coded in Javascript, or Python, there are many asynchronous APIs that are implemented sequentially (like callbacks in event-driven systems, or green threads). You write code asynchronously, but the system decides how to execute it, and mostly does so sequentially.

In just the same way, DX12 asynchronous compute is an asynchronous interface. This says nothing about execution, it is perfectly legal for the system to execute asynchronous compute shaders sequentially.
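
To make the distinction concrete, here is a toy C++ sketch of an asynchronous interface with a strictly sequential backend (nothing DX12-specific, all names are made up): enqueue() returns before the work runs, yet a single worker executes everything one task at a time, so the interface is asynchronous while execution is never concurrent.

    #include <condition_variable>
    #include <functional>
    #include <future>
    #include <iostream>
    #include <memory>
    #include <mutex>
    #include <queue>
    #include <thread>

    class SequentialAsyncQueue {
    public:
        SequentialAsyncQueue() : worker_([this] { run(); }) {}
        ~SequentialAsyncQueue() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_one();
            worker_.join();
        }
        // Asynchronous: returns before the task has executed.
        std::future<void> enqueue(std::function<void()> task) {
            auto p = std::make_shared<std::promise<void>>();
            auto f = p->get_future();
            {
                std::lock_guard<std::mutex> lk(m_);
                q_.push([task, p] { task(); p->set_value(); });
            }
            cv_.notify_one();
            return f;
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                    if (q_.empty()) return;
                    job = std::move(q_.front());
                    q_.pop();
                }
                job();  // sequential execution of asynchronously submitted work
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> q_;
        bool done_ = false;
        std::thread worker_;
    };

    int main() {
        SequentialAsyncQueue queue;
        auto a = queue.enqueue([] { std::cout << "graphics-ish task\n"; });
        auto b = queue.enqueue([] { std::cout << "compute-ish task\n"; });
        a.wait();
        b.wait();  // both submitted asynchronously, yet executed one after the other
    }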

The thing that everyone is excited about is *CONCURRENT* graphics and compute shaders. With DX12 and at least AMD hardware, it is possible to execute graphics calls and compute shaders *CONCURRENTLY*. This CONCURRENCY is what can potentially provide a performance benefit. The asynchrony has always been there - DX12 does not introduce asynchronous compute shaders. It instead opens the possibility for concurrent graphics and compute.

The words asynchronous and concurrent have meaning for those of us in computer science. I don't know why AMD confused them (if they are in fact responsible for this erroneous terminology). But at Beyond3D, I hope people will use technically correct terms for technical discussions.
 
It's running asynchronously where it's supported. "Async Compute" isn't a mandatory DX12 "flag".
It is not a flag at all. It is an implementation-defined aspect of a base concept of the D3D12 API.

"Async Compute" is the ability to start rendering and compute tasks at the same time, throughout the ALUs. If it's not running concurrently, there's no "Async Compute" happening.
"Async compute" states that compute tasks (aka compute shaders) are executed without explicit synchronization along with other works. Not mandatory graphics. The big deal of course, with the actual hardware and software, is running compute task along with graphics to fully maximize throughput and GPU hardware resources. But it could be also running compute tasks asynchronously with other compute tasks involving different memory resources and different execution priority. Remember that Direct3D could be used not only for traditional graphics applications only. Unfortunately with the current release we didn't see tangible improvements on compute shaders semantics.

What the hell does this even mean?! I was just plain and simple called "obtuse" a couple of posts ago and I'm the one needing to pay attention to my words?! Is dogpiling a thing now on B3D?!
I would suggest not confusing IHV marketing with what the API really states.

You could almost say it's a DX12 implementation tailored for nVidia GPUs, then...
Not that I expected any less from Tim Sweeney, though. :(
Except for the FL 12_1 definition (which didn't cover all Maxwell 2.0 features and capabilities), I didn't see any features tailored to a particular IHV. On the contrary, I saw many different little limitations probably coming from the hardware capabilities of "minor" IHVs (i.e. not coming from AMD/Intel/NVIDIA): the 5-SRV descriptor table limit for tier 1 and tier 2 resource binding was one of those (fortunately it has been dropped, and hardware with such a limitation will use some kind of simple constant-offset driver trick, as far as I understood).
 
Looks like Nvidia got heavily CPU limited in the Fable benchmark this time. Even at 4k.

And no, the game doesn't really make proper use of async compute at all. Only about 5% (time wise) of the workload has been offloaded to a dedicated compute queue. I've seen the GPUView dumps of Nvidia and AMD runs. No draw call overload, backpressure only in the graphics queue, no more than a single compute command every few graphics batches, and only copy commands were ever issued asynchronously.

So it looks essentially the same as it would have with DX11, a perfectly safe, well optimized techdemo, where the only DX12 benefit left is the reduced driver overhead. And even that isn't true for Nvidia.
Worth noting on the last point (not the async compute): when Max did the DX12 demo using Fable, he showed around a 20% performance gain between DX11 and DX12 in terms of fps, and this was on NVIDIA hardware.
So reduced driver overhead still holds true to some extent on NVIDIA.
Cheers
 
Guys if there is truly simultaneous 3D/compute work going on here (with "compute shaders" or otherwise) then the sub-timings here are not going to be meaningful. You need a proper timeline picture of what's going on on the entire machine if you want to understand. For instance, it's quite possible that "compute shaders" gets longer at higher res because there's less idle-ness on the machine from the 3D work, so it takes longer for it to fit the compute work into idle CUs or similar. As I said, fundamentally timings like this are not sufficient to understand anything particular about the efficiency of the various passes on the different architectures.

All we can really conclude is that it's clearly not giving AMD a huge advantage in this workload, at least not relative to the NVIDIA cards that are presumably not doing it.

[Edit] What 3dcgi said :)
I wanna see skylake results @720p and @1080p :p
 
Most asynchronous interfaces are sequential. For example, if you've ever coded in Javascript, or Python, there are many asynchronous APIs that are implemented sequentially (like callbacks in event-driven systems, or green threads). You write code asynchronously, but the system decides how to execute it, and mostly does so sequentially.

Both of those languages can execute truly asynchronous code: Web Workers for Javascript and the multiprocessing module for Python.


In just the same way, DX12 asynchronous compute is an asynchronous interface. This says nothing about execution, it is perfectly legal for the system to execute asynchronous compute shaders sequentially.

So you think what is being discussed here is a capability that's been available on all GPUs for 8.5 years, since Xenos and G80?
You can use whatever semantics you want, but this 40-page thread has always been about the "Async Compute" depicted by AMD, which you can find in this article from anandtech.


With the 980 Ti performing around 30% faster than an R9 390X and 23% faster than a Fury X, that shows us that its performance is not bound by its async performance, or lack thereof, in this benchmark. So far the only cards that seem to get a substantial benefit from async compute are GCN 1.0 and 1.1 derivatives with AOS, and with this benchmark we can add Maxwell 2.
You can't add anything in this benchmark because there's no way to toggle async on and off and see the difference between the two modes.
Moreover, looking at Ext3h's post it doesn't look like disabling async would make any discernible difference.

Possibly, but Sweeney didn't write the code for that, so ...........
Tim Sweeney is the CEO of EPIC and he does participate in the development of Unreal Engine.

Except for the FL 12_1 definition (which didn't cover all Maxwell 2.0 features and capabilities), I didn't see any features tailored to a particular IHV.
I was talking about the very low weight that's been put on compute tasks (where AMD is said to excel over nVidia).
I've seen the GPUView dumps of Nvidia and AMD runs. No draw call overload, backpressure only in the graphics queue, no more than a single compute command every few graphics batches, and only copy commands were ever issued asynchronously.

So it looks essentially the same as it would have with DX11, a perfectly safe, well optimized techdemo, where the only DX12 benefit left is the reduced driver overhead.
 
One could run web workers, as well as multiple processes, on a single-core CPU in a perfectly asynchronous manner. No concurrency is required.

Thus to say that competitor cards do not really support DX12 -- due to a lack of asynchronous compute, using the non-observable performance benefit from pure concurrency as its basis -- is misleading at best. And especially rich, coming from AMD, whose DX11 "compliant" cards have never been able to demonstrate a performance benefit from their own DX11 multi-threaded rendering implementation.
 
One could run web workers, as well as multiple processes, on a single-core CPU in a perfectly asynchronous manner. No concurrency is required.

No one suggested concurrency is required with web workers. Only that the capability is there.
Implementation is limited because threads don't share global variables and you have to pass the data between threads through blobs with artificially created IP addresses (meaning there's a lot of delay when sharing data), but it's still the best way to take advantage of a smartphone's multi-core CPU.

Thus to say that competitor cards do not really support DX12 -- due to a lack of asynchronous compute, using the non-observable performance benefit from pure concurrency as its basis -- is misleading at best.
Asynchronous compute isn't mandatory for any feature level in DX12. Did anyone in this thread claim as much?

And especially rich, coming from AMD

Quotation needed.
 
Still under "theory": nV hardware can do async in hardware up to a certain point with less latency than AMD hardware, but when stressed past that point it's kinda like this

cliff.jpg

This is roughly the point where an AMD PowerPoint would invent a word like 'overasynching'.
 
So you think what is being discussed here is a capability that's been available on all GPUs for 8.5 years, since Xenos and G80?
You can use whatever semantics you want, but this 40-page thread has always been about the "Async Compute" depicted by AMD, which you can find in this article from anandtech.

When it gets confusing like this, yes, because using the wrong terminology just creates a fog over what we really want to talk about. And this is why I said "obtuse"; my apologies, it confused me.


You can't add anything in this benchmark because there's no way to toggle async on and off and see the difference between the two modes.
Moreover, looking at Ext3h's post it doesn't look like disabling async would make any discernible difference.

Since you don't really understand what async vs. concurrent is, you don't know what is happening, nor are you willing to learn more about it.

Think of it like a superhighway where you have multiple lanes, each lane for a different queue. When cars move from one lane to another, that is async; when cars keep going in their original lane, that's concurrent. But this does not stop one car from stopping in its original lane.

Tim Sweeney is the CEO of EPIC and he does participate in the development of Unreal Engine.


Sweeney has nothing to do with Fable Legends. Moreover, the version of the engine Fable Legends is using is not the same version that is available for us to download (same version, but modified to whatever they need it for). I say this because even on smaller projects we still have to modify engine code to get some of the things we want, so I would expect a bigger project like FL to do quite a bit more.

Quotation needed

Do you remember, right after the Oxide dev's statement, all those AMD Twitter tweets?
 
"Async compute" states that compute tasks (aka compute shaders) are executed without explicit synchronization along with other works.
No, explicit synchronization is actually required and even exposed in the API.

Implicit synchronization is what shouldn't happen. Back-pressure in the graphics command queue must not hinder compute commands from asynchronous queues from entering execution, and vice versa.

And that is what isn't working on Nvidia's hardware.
 
I was talking about the very low weight that's been put on compute tasks (where AMD is said to excel over nVidia).
My bad, I misunderstood.
No, explicit synchronization is actually required and even exposed in the API.

Implicit synchronization is what shouldn't happen. Back-pressure in the graphics command queue must not hinder compute commands from asynchronous queues from entering execution, and vice versa.

And that is what isn't working on Nvidia's hardware.
If you need write access to a resource (like a buffer) from different pieces of work, then yes, you need to synchronize it manually through fences. E.g.: you have two different 'works' (just to stay abstract) that need to access the same resource and at least one of them will write to it; then you will have to wait for a fence before writing. If you do not have any kind of write-access conflict, then yes, the GPU/driver scheduler should do all the job. But almost nothing in the runtime provides implicit resource synchronization, and this is what enables multi-threading.
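
A minimal sketch of that explicit, cross-queue case (function and parameter names are mine, error handling omitted, device and queues assumed to be created as in the earlier sketch): the compute queue signals a fence when its write is done, and the graphics queue does a GPU-side wait on that fence before consuming the result.

    #include <d3d12.h>
    #include <wrl/client.h>
    #pragma comment(lib, "d3d12.lib")

    using Microsoft::WRL::ComPtr;

    void SubmitWithCrossQueueFence(ID3D12Device* device,
                                   ID3D12CommandQueue* computeQueue,
                                   ID3D12CommandQueue* graphicsQueue,
                                   ID3D12CommandList* computeWork,   // writes the shared resource
                                   ID3D12CommandList* graphicsWork)  // reads the shared resource
    {
        ComPtr<ID3D12Fence> fence;
        device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

        // Kick off the compute work; this call returns immediately (asynchronous).
        computeQueue->ExecuteCommandLists(1, &computeWork);
        computeQueue->Signal(fence.Get(), 1);   // fence reaches 1 when the compute work completes

        // The graphics queue stalls only at this explicit wait, never implicitly.
        graphicsQueue->Wait(fence.Get(), 1);    // GPU-side wait; the CPU is not blocked here
        graphicsQueue->ExecuteCommandLists(1, &graphicsWork);
    }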
 
The Work Distributor in Kepler can communicate in both directions, but I don't know at which protocol level (in other words: there may be very limited backward communication). Furthermore, I'm pretty sure it's not an ARM core.
In other cases, command front ends have been composed of customized cores, and there are other providers of simple and somewhat esoteric architectures that can be licensed. Potentially, an ARM core could at once have too much capability while at the same time not providing the level of integration and customization options needed to serve as an inexpensive component in the GPU's domain.

Presentations about AMD's VLIW GPUs talked about the graphics command processor actually being a block of two customized microcontrollers, with specialized firmware, queues, and local storage. Their purpose of processing queued commands meant that they target processing the rather bulky command formats. The leaked Xbox One SDK seems to mention three such processors in the graphics front end.
The ACEs, or groups of them, might be at least one custom processor each, and there might be a microcode engine involved at various points.

I have not seen discussion about what Nvidia might be using.

The words asynchronous and concurrent have meaning for those of us in computer science. I don't know why AMD confused them (if they are in fact responsible for this erroneous terminology). But at Beyond3D, I hope people will use technically correct terms for technical discussions.
The original context was adding additional compute queue managers to a device that was initially synchronous, so I guess it's possible that, since GPUs were already heavily concurrent, concurrency was something of a given. The front ends would be hardware units whose job was to accelerate the asynchronous case, while the GPU remained as concurrent as before.

ACE also looks neater from a presentation standpoint, so that might be a contributing reason for the naming choice. The units themselves seem to be overbuilt for what DX12 considers asynchronous compute, which might point to what AMD thought it could use as the primary marketing point.
 
Just a quick "Digi is an idiot" confession, when I posted my summary request of this thread I thought it was the async compute thread...talk about getting confused on semantics!

I'm in love with both threads btw, even though I don't understand a lot of it. The speculation and analysis is just bloody fascinating to me, thanks for all the edumacation!

EDITED BITS: Yes, I'm an even bigger idiot than at the start of this post. Thanks go out to the mods for helping me retain my status as the greatest democratically elected village idiot Beyond3D has ever had by changing the thread title on me. :oops:
 
Just to add hilarious additional vectors to the terminology fun, there is previous precedent in this specific field for using "concurrent" vs "parallel" (see slide 11). That use gets even more subtle though as it is speaking to whether or not code that is currently executing on a processor can guarantee other code is running in parallel for coordination/synchronization reasons. Even AMD's implementation of "async compute" does not provide that guarantee.

I think we can all agree that "async" is the worst of these terms for describing anything about the GPU execution. Its use basically ends at "DX12 supports asynchronous queues", but beyond that how multiple queues get mapped to multiple GPU engines and pipelined through the hardware gets complicated and very device-specific. Thus - AGAIN - you're not going to be able to draw any wide-reaching conclusions, guys.
 