How is the appropriate context determined? The same code is being interpreted by GCN as compute and by Maxwell as graphics.
Is there a flag that needs to be set, marking the code as compute? Or is it something the driver/hardware determines on its own?
The context is defined when the queues the commands are sent to are created.
However, how the graphics system categorizes them internally shouldn't matter to the API, as long as they behave appropriately.
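In D3D12 terms, the "flag" is effectively the command list type the queue is created with; a minimal sketch (error handling omitted):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

int main() {
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // The API-visible context is fixed here: a COMPUTE queue only accepts
    // compute/copy command lists, a DIRECT queue accepts graphics work too.
    // How the driver maps either type onto hardware engines is its business.
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // explicit compute context
    // desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT; // graphics context instead

    ComPtr<ID3D12CommandQueue> queue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
}
```

Whether the driver then routes a DIRECT queue's dispatches through its graphics or compute front end is invisible at this level.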
Perhaps there's a wrinkle in that behavior which caused the timestamps not to work in Nvidia's compute queue?
https://forum.beyond3d.com/posts/1869354/
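For reference, this is roughly how those timestamps are taken on a compute queue (a sketch; `device`, `queue`, a compute `cmdList`, and a `readbackBuffer` are assumed to exist from earlier setup):

```cpp
// Bracket a dispatch with two GPU timestamps and resolve them to a buffer.
D3D12_QUERY_HEAP_DESC qhDesc = {};
qhDesc.Type  = D3D12_QUERY_HEAP_TYPE_TIMESTAMP;
qhDesc.Count = 2;
ComPtr<ID3D12QueryHeap> queryHeap;
device->CreateQueryHeap(&qhDesc, IID_PPV_ARGS(&queryHeap));

cmdList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 0);
cmdList->Dispatch(64, 1, 1);
cmdList->EndQuery(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 1);
cmdList->ResolveQueryData(queryHeap.Get(), D3D12_QUERY_TYPE_TIMESTAMP,
                          0, 2, readbackBuffer.Get(), 0);

// Ticks-to-seconds conversion; this call fails on a queue that doesn't
// support timestamps, which is one way such a wrinkle could surface.
UINT64 ticksPerSecond = 0;
queue->GetTimestampFrequency(&ticksPerSecond);
```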
That may actually be working properly, as long as the driver can assume that the work items in the queue are independent.
That assumption seems safe, since DX12's compute queue is explicitly asynchronous outside of programmer-defined synchronization points, and independent user contexts in virtualized graphics products would likewise be independent by definition.
With its latest IP, AMD has built significant hardware management for both scenarios, whereas Nvidia's implementations appear to lean more on software.
IMHO, the awful performance when enforcing serial execution points to a lack of dependency management in the hardware queue, which would force a roundtrip to the CPU between every single step.
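To illustrate, enforcing serial execution in the benchmark sense looks something like the following (a sketch; `device`, `queue`, and a per-step `cmdLists` array with `kNumSteps` entries are assumed; `CreateEvent`/`WaitForSingleObject` come from `<windows.h>`):

```cpp
// Strictly serialize N dispatch steps by fencing and waiting on the CPU
// after each one. If the hardware queue had its own dependency tracking,
// none of these CPU waits would be necessary.
ComPtr<ID3D12Fence> fence;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);

for (UINT64 step = 1; step <= kNumSteps; ++step) {
    ID3D12CommandList* lists[] = { cmdLists[step - 1].Get() };
    queue->ExecuteCommandLists(1, lists);
    queue->Signal(fence.Get(), step);
    if (fence->GetCompletedValue() < step) {
        fence->SetEventOnCompletion(step, done);
        WaitForSingleObject(done, INFINITE); // full CPU roundtrip per step
    }
}
```

Even when the wait is expressed GPU-side via `ID3D12CommandQueue::Wait`, a driver without hardware dependency management could plausibly fall back to something equivalent to this internally.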
It's possible that there's more driver-level management and construction of the queue.
The 32 "queues" may be software-defined slots holding calls the driver has determined to be independent, which the GPU can then issue in parallel, possibly through a single command front end.
When running purely compute, the timings do seem to stair-step as one would expect.
This is unlike GCN, where "serial execution" doesn't appear to actually mean serial. It's possible that the driver only enforces the memory order in that case, still pushing otherwise conflicting jobs to the ACEs and using the scheduling capabilities of the hardware. That could also explain the better performance when enforcing "serial" execution: the optimizer may now treat subsequent invocations as dependent, and may therefore even concatenate threads, which ultimately reduces register usage.
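"Only enforcing the memory order" would correspond roughly to a UAV barrier between dependent dispatches, rather than a full fence roundtrip (a sketch, assuming `cmdList` and a UAV resource `uavBuffer`):

```cpp
// A UAV barrier only guarantees that the first dispatch's writes are
// visible to the second; the scheduler remains free to overlap any work
// it can prove independent.
cmdList->Dispatch(64, 1, 1);                 // producer

D3D12_RESOURCE_BARRIER barrier = {};
barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_UAV;
barrier.UAV.pResource = uavBuffer.Get();     // nullptr would mean "all UAVs"
cmdList->ResourceBarrier(1, &barrier);

cmdList->Dispatch(64, 1, 1);                 // consumer sees producer's writes
```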
It's a long shot, but it might be that Nvidia's GPUs have no support for inter-shader semaphores while operating in a graphics context.
AMD's separate compute paths may provide a form of primitive tracking distinct from that of the geometry pipeline.
It does seem like the separate command-list cases can pipeline well enough. Perhaps there is a unified tracking system that does not readily handle the ordering of geometry primitives and compute primitives within the same context?