No DX12 Software is Suitable for Benchmarking *spawn*

As you stated, async is disabled at the driver level for Maxwell, so yeah, there won't be a change in performance for their cards...
Not so fast with that statement. We have now seen multiple examples where toggling asynchronous scheduling on/off on Maxwell produced anything from a significant performance hit to gains that were not only measurable but also noticeable. The differences in the resulting schedule (compared to the hand-tuned sequential order), even if it was just the software scheduler filling in for parallel execution, can apparently make a significant difference.
 
We have now seen multiple examples where toggling asynchronous scheduling on/off on Maxwell produced anything from a significant performance hit to gains that were not only measurable but also noticeable.
I somehow must have missed those. Where have we seen that?
 
I somehow must have missed those. Where have we seen that?
Maybe it's just my memory playing tricks on me, but wasn't it the Fable Legends demo where both vendors claimed at least a measurable performance boost from using the compute queue? I'm unable to find/recall the source right now, and I'm pretty sure it couldn't be tested with the shipped preview version, as it no longer had a toggle option.

Well, better to ignore that edge case then. If it really were true, it would have indicated a lack of optimization in the single-queue schedule anyway.
 
I'm saying I haven't seen anything in either direction. Neither a hit nor a boost with regard to async compute on Maxwell.
 
Oh, taking a hit? Take the whole AotS thing for example.
It's mostly straightened out by now (only still replicable with older Windows 10 builds): the GTX 970, for example, doesn't take a hit at all any more, while the GTX 980 Ti, by contrast, still does.

And that did happen despite confirming that the command lists themselves were mostly identical, except for being submitted to different queues.
 
No, not really. AotS isn't an example. Taking a hit when switching from DX11 to DX12 in no way means that "async compute" is the culprit.
All we have from AotS:
- there were supposedly some performance problems on Maxwell, and it got disabled for NVIDIA at their request.
- the last post from Kollock indicates the async path is still forcibly disabled on NVIDIA, which leaves Pascal shit out of luck.
 
No, not really. AotS isn't an example. Taking a hit when switching from DX11 to DX12 in no way means that "async compute" is the culprit.
All we have from AotS:
- there were supposedly some performance problems on Maxwell, and it got disabled for NVIDIA at their request.
- the last post from Kollock indicates the async path is still forcibly disabled on NVIDIA, which leaves Pascal shit out of luck.
He posted that five months ago. As far as I know, it has been re-enabled since (long before the Pascal launch), meaning the application is using the compute queue again by default.
Whether the driver natively supports async compute is an entirely separate matter, but it does influence how the DX12 runtime behaves.

I'm actually talking solely about toggling the use of the compute queue in the application, not about switching between DX11 and DX12.
One person I'm in contact with recently tested specifically the 970 and the 980 Ti, toggling only the compute queue usage in AotS, and found that the 970 no longer showed any measurable performance penalty while the 980 Ti still did.

PS: Getting to the root cause of these performance problems is part of the exercise, because they should never have occurred when the only difference is using a software scheduler at runtime instead of submitting the command lists in a fixed order right away. It means either that the scheduler itself is too slow, or that it made a bad decision at some point, one the developer avoided when designing the static schedule.
 
As far as I know, it has been re-enabled since (long before the Pascal launch), meaning the application is using the compute queue again by default.

Not according to this:

DX12 Performance Discussion And Analysis Thread

Just checked with Dan Baker. Async is still functionally disabled on Ashes when it detects an NVIDIA card, including the GTX 1080 (since they don't have one to test against yet).

That's the last I heard about it. I don't know if it has changed since, but it certainly doesn't look like it changed long before Pascal, or even after.
 
Aren't rendering options usually only disabled if they result in a performance decrease versus the standard way of doing it? In both Doom and AotS, async compute was disabled on Nvidia hardware. With AotS, the developers mentioned that this was due to a performance regression on Nvidia hardware at the time (pre-Pascal). I'm guessing the situation was the same with Doom. I can only guess that async not working on Pascal prior to release was a failure of Nvidia's devrel. Either that or it is very difficult to get Doom's method of doing async compute in Vulkan working better than not having async compute enabled. After all, Nvidia obviously had access to Doom's Vulkan rendering path, as they demo'd Pascal on it over a month prior to the Doom Vulkan rendering path being released to the public.

Regards,
SB
 
Lots of speculation and guesswork being thrown around as fact these days. We really need better tools to provide transparency into the behavior of modern apps and hardware.
 
To be clear, my post above wasn't about Pascal, or even Maxwell 1/2, not being able to do async compute. So far it's been a performance regression or performance-neutral on Maxwell 1/2, but it should be a performance advantage for Pascal.

Regards,
SB
 
Lots of speculation and guesswork being thrown around as fact these days. We really need better tools to provide transparency into the behavior of modern apps and hardware.
I very much doubt that better tools would help. It's not guesswork that's being thrown around, it's basically religious convictions. Just look at what's going on around Timespy. Futuremark released a well-written explanation of what's going on there, and what do we have now? "Confirmed, it's not a proper DX12 benchmark" (but hey, it isn't, since it's not using any FL_12_1 features, right :devilish:). You can't win an argument with a toddler by using logic.

Either that or it is very difficult to get Doom's method of doing async compute in Vulkan working better than not having async compute enabled.
There are only two questions about async compute that anyone who's not a driver developer should ask:
1. can GPU X run graphics and compute concurrently?
2. how fast can the GPU react once it gets work of a higher priority?
With the 1st one:
For GCN the answer is obviously yes. For Pascal it's also yes. And in fact it's also yes for Maxwell, with one giant asterisk: it may be yes at the hardware level, but it's a no at the driver level, so it's a no for all practical purposes.
There is no NV way of doing async and AMD way of doing async. You have a queue that eats draw commands and a queue that eats dispatch commands. That's it. You could use 10 compute queues if you wanted to, but that won't increase performance, as the internet seems to be convinced these days; it will actually hurt performance even on GCN. If someone doesn't agree with that, send them off to code a small benchmark that proves whatever point they're trying to make.
With the 2nd one:
That's already a question that basically only developers will deal with. Questions around here along the lines of "why would you need async compute if it doesn't run graphics and compute at the same time?" prove that.
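
To make the first point concrete, here is a minimal D3D12 sketch of that queue model, assuming an already created ID3D12Device and omitting error handling. Whether work submitted to the two queues actually overlaps on the GPU is entirely up to the driver and hardware; the API only expresses that it may.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Create the "queue that eats draw commands" and the "queue that eats
// dispatch commands". Nothing here is vendor specific.
void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& directQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    // Direct queue: accepts graphics, compute and copy work.
    D3D12_COMMAND_QUEUE_DESC directDesc = {};
    directDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&directDesc, IID_PPV_ARGS(&directQueue));

    // Compute queue: accepts compute and copy work only. Whether it runs
    // concurrently with the direct queue is a driver/hardware decision.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));
}
```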

With regard to software:
-Ashes of the Singularity: as said above, the latest information from the developers says async is forcibly off for NV
-Rise of the Tomb Raider: async on and, I guess, no way to turn it off (so no way to check the gain)
-Hitman: async on for AMD, no way of telling if it's always on (so no way to check the gain)
-Timespy: async on/off option regardless of IHV
-Doom: async on/off option. Works for AMD, work in progress for NV
 
There are only two questions about async compute that anyone who's not a driver developer should ask:
1. can GPU X run graphics and compute concurrently?
2. how fast can the GPU react once it gets work of a higher priority?
Frankly, no one who isn't a driver developer should even ask those questions. I'll make an exception for #2 if you're a developer at Oculus or Valve writing a VR compositor. Beyond that, you have no control over it anyway.

The only question everyone else should be asking is: how fast is my workload? Scheduling details and decisions:
a) are much more complicated than this vague consumer notion of "async/concurrent compute supported!"
b) involve multiple layers of software and hardware
c) get tweaked and changed fairly frequently
d) are highly architecture dependent in terms of optimal strategies

Seriously guys, this whole freakout about async/concurrent compute and scheduling is getting completely out of hand, even from people who should know better. Just measure the overall performance of a given workload, declare a winner, and on the tech front let's move on to something more interesting already!
 
One rather academic exception, stemming from AMD's originally higher aspirations for GCN, would be HPC development (where multiple queues are a feature that spans vendors) and those implementing or using various system or middleware services on a console (a probable aspirational goal for Sony and its 64 queues, at the very least).

However, in those cases the question doesn't really get asked, because the answer is found in whether the developer is told the options are offered, whether it's in the SDK, or whether the hardware has been chosen or co-designed to support it. The middleware/console SDK case is itself fairly low-level as well.

Even then, the question of whether anything positive comes from trying to rely on those features still has to be answered by reviewing empirical evidence.
 
Frankly, no one who isn't a driver developer should even ask those questions. I'll make an exception for #2 if you're a developer at Oculus or Valve writing a VR compositor. Beyond that, you have no control over it anyway.

The only question everyone else should be asking is: how fast is my workload?
#2 is a highly important question for any developer who wants to do low-latency GPGPU. In this use case your game logic CPU thread would have its own high-priority compute queue.

Example: You want to offload (large scale) physics simulation to the GPU and need the results back during the same game logic frame (game logic frame != render frame). Currently this use case works adequately only on AMD GCN and Nvidia Pascal.

Some recent console games use GPGPU for game logic. Porting these games to PC would require properly working high-priority compute queue support. Of course you could rewrite the GPGPU code for the CPU (for example using ISPC), but that's a lot of extra work plus the need to maintain two code bases. And most consumer CPUs don't have more than 4 cores, and even AVX(1) is not a given. It would be a completely different discussion if everybody had an 8-core CPU with AVX-512 :)
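
A rough D3D12 sketch of that pattern, assuming the physics dispatches have already been recorded into a compute command list; the struct and function names are made up for illustration and error handling is omitted:

```cpp
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Dedicated high-priority compute queue owned by the game logic thread,
// plus a fence so the CPU can wait for the simulation results within the
// same game logic frame.
struct GpuPhysicsQueue {
    ComPtr<ID3D12CommandQueue> queue;
    ComPtr<ID3D12Fence>        fence;
    HANDLE                     fenceEvent = nullptr;
    UINT64                     fenceValue = 0;

    void Init(ID3D12Device* device) {
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type     = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        desc.Priority = D3D12_COMMAND_QUEUE_PRIORITY_HIGH; // ask for preferential scheduling
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
        device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
        fenceEvent = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    }

    // Submit the physics dispatches and block until the GPU is done, so the
    // results can be read back before the game logic frame ends.
    void RunPhysicsStep(ID3D12CommandList* physicsCmdList) {
        ID3D12CommandList* lists[] = { physicsCmdList };
        queue->ExecuteCommandLists(1, lists);
        queue->Signal(fence.Get(), ++fenceValue);
        if (fence->GetCompletedValue() < fenceValue) {
            fence->SetEventOnCompletion(fenceValue, fenceEvent);
            WaitForSingleObject(fenceEvent, INFINITE);
        }
    }
};
```

How quickly that queue actually gets serviced while the GPU is busy with graphics is exactly question #2 above, and it depends entirely on the architecture and driver.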
 
So, the Futuremark guys wrote two different paths: one with two queues (a direct queue and a compute queue) and one with only a single graphics queue, which is how they toggle "async compute" on/off (I still don't like that term at all).
It looks like, for GPUs without hardware support for parallel execution of the direct and compute queues, the difference is pretty small.
http://i.imgur.com/fVkMUxC.png

My humble opinion: the second path is completely useless. Drivers are improving, and the overhead caused by concurrent execution of the two queues (direct and compute) is becoming insignificant. In most "real" applications, where you can tune more settings and parameters, I guess a single path could be absolutely reasonable, with the time saved invested in better low-level hardware optimization.
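
For reference, that kind of on/off toggle presumably boils down to something like this at submission time. This is not Futuremark's actual code, just a sketch: the command lists are assumed to be recorded elsewhere, and a real renderer would also fence between the queues wherever the graphics work consumes the compute results.

```cpp
#include <d3d12.h>

// Submit one frame either across two queues ("async compute" on) or entirely
// on the direct queue ("async compute" off). Command list types must match
// the queue type, so the serialized path uses a second recording of the same
// compute passes into a DIRECT-type list (Dispatch is legal there too).
void SubmitFrame(ID3D12CommandQueue* directQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList*  gfxList,           // DIRECT type
                 ID3D12CommandList*  asyncComputeList,  // COMPUTE type
                 ID3D12CommandList*  inlineComputeList, // DIRECT type, same passes
                 bool asyncComputeEnabled)
{
    if (asyncComputeEnabled) {
        // Two queues: the driver/hardware may overlap this work with graphics.
        ID3D12CommandList* compute[] = { asyncComputeList };
        computeQueue->ExecuteCommandLists(1, compute);
        ID3D12CommandList* gfx[] = { gfxList };
        directQueue->ExecuteCommandLists(1, gfx);
    } else {
        // Single queue: the same passes, executed strictly in order.
        ID3D12CommandList* all[] = { inlineComputeList, gfxList };
        directQueue->ExecuteCommandLists(2, all);
    }
}
```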

Skylake sample (graphics: 367 vs 365):
http://www.3dmark.com/spy/106871
http://www.3dmark.com/spy/107115

Possibly because the press and consumers keep freaking out about everything related to it? :) If it's really just an API concept that people are free to implement however they want, all that should matter is the total delivered performance of a game or benchmark, which is almost getting lost in the noise at this point to be honest :S
As you said, everything gets lost in the noise :(
 
Really, there is almost no performance lost for "async compute" when executing on an "unsupported" driver/hardware combination. The cost of the driver serialization is finally pretty, pretty low now.
It would be interesting now to see how AotS behaves with a single render path on an "unsupported" driver/hardware configuration. But as far as I'm aware it still uses two rendering paths.
 