It's running asynchronously where it's supported. "Async Compute" isn't a mandatory DX12 "flag".
With the 980 Ti performing around 30% faster than an R9 390X and 23% faster than a Fury X, that shows its performance is not bound by its async performance, or lack thereof, in this benchmark. So far the only cards that seem to get a substantial benefit from async compute are GCN 1.0 and 1.1 derivatives in AotS, and with this benchmark we can add Maxwell 2 to the list.
"Async Compute" is the ability to start rendering and compute tasks at the same time, across the ALUs. If it's not running concurrently, there's no "Async Compute" happening.
Async compute is the ability to interleave compute instructions into the graphics queue when resources are available, not just to launch them at the same time. Kernels have to be executing concurrently, but individual instructions don't need to be processed concurrently. That is what RecessionCone is talking about. You have to look at how the instructions are being dispatched and which queue they are being processed in.
What happens is that on Maxwell 2, or even on GCN at a much higher tolerance, concurrent kernels will begin to stall the GPU's ability to push compute instructions into the graphics queue, even though both kernels are running. When that happens, the GPU starts running things serially, or whatever it does, it most likely just breaks down at some point.
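To make the interleaving-vs-serial distinction concrete, here's a minimal toy model (my own illustration, not actual driver or hardware behavior; the slot layout and labels are assumptions): a graphics workload occupies some cycles and leaves others idle, an async-capable scheduler fills the idle cycles with compute instructions, and whatever compute work doesn't fit falls back to running serially after the graphics work, which is the breakdown described above.

```python
# Toy model of async compute interleaving. A graphics "queue" is a list of
# cycles: 'G' means the ALUs are busy with graphics, None means an idle slot.
def schedule(graphics, compute_instrs):
    """Interleave compute instructions into idle graphics cycles.

    graphics: list of 'G' (busy) or None (idle ALU slot)
    compute_instrs: number of compute instructions to place
    Returns the resulting timeline as a list of cycle labels.
    """
    timeline = []
    remaining = compute_instrs
    for slot in graphics:
        if slot is None and remaining > 0:
            timeline.append('C')   # compute rides along in an idle cycle
            remaining -= 1
        else:
            timeline.append(slot or 'idle')
    # No idle slots left ("tolerance" exceeded): leftover compute
    # instructions run serially after the graphics work completes.
    timeline.extend(['C'] * remaining)
    return timeline

# A frame with two idle cycles, plus four compute instructions to schedule.
frame = ['G', None, 'G', 'G', None, 'G']
result = schedule(frame, 4)
print(result)       # ['G', 'C', 'G', 'G', 'C', 'G', 'C', 'C']
print(len(result))  # 8 cycles, instead of 6 + 4 = 10 fully serial
```

The point of the toy model is only that "both kernels are running" is compatible with very little actual overlap: once the idle slots are used up, everything else serializes.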
What the hell does this even mean?! I was just plain and simple called "obtuse" a couple of posts ago and I'm the one needing to pay attention to my words?! Is dogpiling a thing now on B3D?!
Read above
You could almost say it's a DX12 implementation tailored for nVidia GPUs, then...
Not that I expected any less from Tim Sweeney, though.
Possibly, but Sweeney didn't write the code for that, so...