For Maxwell, on the other hand, the driver is unaware that a compute engine was "exposed"; it only provides access to a single instance of the 3D engine (and obviously the copy engine etc.). Nvidia isn't lying when it says it never enabled the compute engine for Maxwell and older generations.
So when an application asks the driver to allocate a queue, or to use one, the driver doesn't receive the request?
What you don't see is the compatibility/emulation layer that is part of the D3D12 runtime Microsoft patched in, which provides scheduling for hardware where over-allocation of the queues provided by the driver becomes necessary.
Then why can't Fermi support be finalized even in the absence of Nvidia's drivers?
For hardware exposing multiple queues / engine instances, this layer only acts as a message broker for synchronization via fences. For hardware that doesn't, it also performs arbitration.
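To make the arbitration part concrete, here's a toy sketch of the idea (all names are made up, and this is in no way Microsoft's actual scheduler): several application-visible queues get time-sliced onto a single engine, so submissions that look concurrent to the app end up serialized on the hardware.

```cpp
// Conceptual illustration only: a round-robin arbiter feeding several
// application-visible queues into one hardware engine. Hypothetical names,
// not the real D3D12 runtime code.
#include <deque>
#include <iostream>
#include <string>
#include <vector>

struct CommandBatch {             // stand-in for one ExecuteCommandLists call
    std::string label;
};

using SoftwareQueue = std::deque<CommandBatch>;

// Drain N app-visible queues onto a single engine in round-robin order.
void Arbitrate(std::vector<SoftwareQueue>& appQueues) {
    bool workLeft = true;
    while (workLeft) {
        workLeft = false;
        for (auto& q : appQueues) {
            if (q.empty()) continue;
            workLeft = true;
            std::cout << "engine executes: " << q.front().label << "\n";
            q.pop_front();        // the single engine runs batches serially
        }
    }
}

int main() {
    std::vector<SoftwareQueue> queues(2);
    queues[0] = { {"gfx A"}, {"gfx B"} };          // "direct" queue submissions
    queues[1] = { {"compute A"}, {"compute B"} };  // "compute" queue submissions
    Arbitrate(queues);   // everything ends up serialized on one engine
}
```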
Going by AMD's HSA work and other driver documentation, multiple hardware queues and engines are exposed and instantiated by the driver. Does it forget all of this information later and allow a compatibility layer that doesn't have any of that data to take over?
Why is that? If GCN has 8 ACEs, and each can handle 8 queues, we're talking about being able to schedule 64 queues to the compute units. Even the RX 480 is able to handle up to 36 queues.
The 64-queue figure is at least partly influenced by Sony's desire to trick out its console with room to grow for middleware and system services.
It's admitted to be overkill.
The CU count does not determine how many queues a GPU can handle. The minimum granularity would be one wavefront, and a single CU can support 40 of those.
Even then, with HWS and oversubscription directed by the front end and driver, the GPU can track far more than 64.
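Just to put numbers on that, using the figures quoted in this thread (so treat the constants as illustrative, not a spec dump):

```cpp
// Back-of-the-envelope numbers from this discussion, not an authoritative spec.
#include <iostream>

int main() {
    const int aces         = 8;   // ACEs on a big GCN part
    const int queuesPerAce = 8;   // hardware queue slots per ACE
    const int cus          = 36;  // CU count of an RX 480
    const int wavesPerCu   = 40;  // wavefront slots a single CU can track

    std::cout << "hardware compute queues: " << aces * queuesPerAce << "\n"; // 64
    std::cout << "wavefronts in flight:    " << cus * wavesPerCu    << "\n"; // 1440
}
```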
Currently, the workloads being demanded of compute don't lean on it heavily enough for developers to feel the need for more than one queue; Doom's developers said that, for the purposes of their engine, it didn't matter. The front-end processor's handling of a dispatch command is only one small part of the process: because a single command can spawn a large amount of back-end work, one queue can push through a lot of commands in a limited number of clock cycles, feeding shaders whose wavefronts can last for milliseconds from initial launch to final release.
What having more queues would be about is handling general mixes of compute and graphics where there are differing behaviors in terms of synchronization and burstiness. The current use cases within an application can generally be satisfied with a direct queue and one compute queue: the game or benchmark has a set of operations with a reasonable set of inputs and dependencies, and a pretty straightforward path from frame start to end.
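For reference, the "direct plus one compute" arrangement looks roughly like this in D3D12 terms. This is just a minimal sketch of queue creation and a single fence dependency; device creation, command-list recording and error handling are left out.

```cpp
// Minimal sketch: one direct (graphics) queue, one compute queue, and a fence
// so the direct queue only consumes the compute results once they're done.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueuesAndSync(ID3D12Device* device)
{
    ComPtr<ID3D12CommandQueue> directQueue, computeQueue;

    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&directQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    // ... ExecuteCommandLists on computeQueue for the async compute work ...

    // Compute queue signals when its work is finished; the direct queue waits
    // on that value before running anything that reads the compute output.
    computeQueue->Signal(fence.Get(), 1);
    directQueue->Wait(fence.Get(), 1);

    // ... ExecuteCommandLists on directQueue for the dependent graphics work ...
}
```

Whether the two queues actually run concurrently is then up to the driver and hardware underneath; on a single-engine part the runtime/driver can legally serialize them, which is the point of the discussion above.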
What Sony wanted was a front end that could handle multiple clients: an arbitrary amount of middleware, system services, virtualized/secured resources, and potentially compute types with very different scales of synchronization.
For the PC space, it might matter to a hypervisor trying to host multiple compute clients, but neither Time Spy nor Doom has 64 different clients or workloads with no relation to each other.