DX12 Performance Discussion And Analysis Thread

Shame on me. I messed up with the data on GCN 1.2. :(

It is not the HWS that is capable of doing the balancing. It's the Graphics Command Processor, which was doubled in width on GCN 1.2. That unit is able to hold 128 grids in flight, as opposed to 64 on GCN 1.1. The ACE/HWS still only manages 64 grids each, no more. That's why GCN 1.2 was much faster in "single commandlist" mode (= "everything in the 3D queue") than it was in pure compute or split mode.

That means I also misinterpreted the data on Maxwell. It actually has 32 shader slots, which are all active while in compute mode. But it has only a single(!) compute shader slot while in graphics mode, with the other slots reserved (hard-wired) for other shader types, which is why it failed so badly. And yes, it does need to switch between graphics and compute queue mode; it can't do both in parallel. This is unrelated to the Hyper-Q feature, which operates independently of these regular 32 slots, which is why dwm.exe and the like can cut ahead.
There are no parallel, DX12-compatible compute and 3D paths in hardware. Only one 3D queue, which can switch between compute and graphics mode.

I failed at interpreting "single commandlist" correctly, and never gave it a second thought.
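For reference, in D3D12 terms the "single commandlist" vs. split distinction boils down to which queue type the compute work is submitted to. A minimal sketch of the two submission paths (just an illustration with names of my own choosing; it assumes a valid device and already-recorded command lists):
Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Illustration only: where work ends up for the two submission styles.
// In a real app you would record either one DIRECT list holding graphics
// plus compute ("single commandlist" mode) or separate lists per queue.
void SubmitBothWays(ID3D12Device* device,
                    ID3D12CommandList* gfxList,      // recorded on a DIRECT list
                    ID3D12CommandList* computeList)  // recorded on a COMPUTE list
{
    // DIRECT (3D) queue - on GCN this is fed by the Graphics Command Processor.
    D3D12_COMMAND_QUEUE_DESC directDesc = {};
    directDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    ComPtr<ID3D12CommandQueue> directQueue;
    device->CreateCommandQueue(&directDesc, IID_PPV_ARGS(&directQueue));

    // COMPUTE queue - on GCN this path goes through the ACEs instead.
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    // "Single commandlist" mode: everything travels down the 3D queue.
    directQueue->ExecuteCommandLists(1, &gfxList);

    // Split mode: compute dispatches get their own queue/engine.
    computeQueue->ExecuteCommandLists(1, &computeList);
}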
 
TG should stand for Thread Group. So it's basically a resource descriptor for a linked group of wavefronts. "IA" is still unknown.
I think at this stage there is likely no concept of a wavefront. It is likely a descriptor of the size of a workgroup, with the program pointers, VMID, the resources it needs (LDS allocation size) and perhaps also the data it may need to inject (for pixel/vertex shaders?). The SPI will then "break" a workgroup into wavefronts after receiving it.
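Purely speculative, but to make that concrete, such a descriptor might carry something along these lines (every field name and size below is invented for illustration):
Code:
#include <cstdint>

// Speculation only: a rough sketch of what a "TG" (thread group) entry might
// hold before the SPI breaks it into wavefronts. All names/sizes are made up.
struct ThreadGroupDescriptor
{
    uint32_t workgroupSize[3];   // workgroup dimensions - no wavefront granularity yet
    uint64_t programPointer;     // shader program start address
    uint64_t userDataPointer;    // data to inject (e.g. for pixel/vertex shaders?)
    uint32_t vmid;               // virtual memory ID of the owning context
    uint32_t ldsAllocationSize;  // LDS bytes this group needs
};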

But it also means that (difficult to allocate) thread groups that are already scheduled can only be jumped by compute commands handled by a different ACE, assuming that GCN 1.1 and 1.2 still only have a single CS Dispatcher per ACE.
I think this is expected behaviour. Commands from the same queue should likely be committed in program order, since we already have concurrency in the form of multiple queues. HSA's platform specification also states that:
Code:
All preceding packets in the queue must have completed their launch phase.

Carrizo is going to make this more complicated, since you now have mid-wave preemption (for CUs) and potentially also mid-dispatch preemption (for ACEs).
 
Glad to see Lionhead's comments! It should take care of some of the guessing that's been going around in this thread.
 
I would like to know if they are going to use features like CR, RoV, PS stencil ref, or even TR (I am still not aware of a single game using D3D TR...). Anyway, did they use a private UE implementation, or did they write their own D3D12 backend? The current UE DX12 RHI implementation is quite primitive (but it looks like some performance optimizations will appear in UE 4.10...).
 
The way the licensing is set up for Unreal Engine 4, if they like something a dev did they can add it back into the main branch of the UE4 engine. So I wouldn't be surprised if many of the features of Fable Legends are introduced back into UE4 at a later date.
 
NVIDIA was confident that the alpha benchmark of Ashes of the Singularity wasn’t representative of general performance under DirectX 12, and now we can easily understand why. As pointed out by Lionhead, most modern games are far heavier on the GPU and reducing CPU overhead, while still undoubtedly a benefit, will be less important in these cases.

http://wccftech.com/fable-legends-dx12-benchmark-results/

I think we'll probably only get a good performance indication once the game is released and benchmarked.
 

From that link:

"Microsoft and Lionhead Studios have assured us that asynchronous compute is indeed activated and on for the test, across the board, and that it doesn’t turn off based on the presence of a card that doesn’t support it.

Read more: http://wccftech.com/asynchronous-compute-investigated-in-fable-legends-dx12-benchmark/#ixzz3nUuiuGAQ"

So that suggests the previous claim from the Anandtech forums, that there is a second build which allows async to be toggled on and off resulting in much higher performance on AMD cards, was false.

Also, it sounds like they aren't doing a huge amount concurrently, which is where AMD should/might have the advantage over Nvidia. i.e...

"Multi-engine is the official D3D12 term for that feature. We’re using it quite heavily across the scene, including dynamic GI compute shaders, GPU-based culling for instanced foliage, Forward Plus compute shaders (light gathering). In addition, all the foliage physics in the scene (bushes, grass, etc.) is simulated on the GPU that also runs concurrently with the graphics.

Read more: http://wccftech.com/lionhead-dx12-features-fable-legends/#ixzz3nUvFWICF"
 
"Multi-engine is the official D3D12 term for that feature. We’re using it quite heavily across the scene, including dynamic GI compute shaders, GPU-based culling for instanced foliage, Forward Plus compute shaders (light gathering). In addition, all the foliage physics in the scene (bushes, grass, etc.) is simulated on the GPU that also runs concurrently with the graphics.

Read more: http://wccftech.com/lionhead-dx12-features-fable-legends/#ixzz3nUvFWICF"
This ain't correct. It's not running concurrently on Nvidia's hardware. That just isn't possible, at all.

What is working, though, is that compute commands which are committed in a batch to the compute engine are in fact running concurrently with each other. While compute commands in the 3D engine are only executed in sequential order, up to 32 grids (depending on how large the chip is) can be in flight while the scheduler is running in compute mode. Which naturally increases utilization of the GPU, even when not running concurrently with a 3D workload.
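To put "committed in a batch to the compute engine" in API terms, here's a rough sketch (assuming already-recorded compute command lists; the function name is mine):
Code:
#include <d3d12.h>

// Sketch: several compute command lists handed to the compute queue in one
// submission. The grids from these lists can be in flight together even
// without any graphics work running alongside them.
void SubmitComputeBatch(ID3D12CommandQueue* computeQueue,
                        ID3D12CommandList* a,
                        ID3D12CommandList* b,
                        ID3D12CommandList* c)
{
    ID3D12CommandList* batch[] = { a, b, c };
    computeQueue->ExecuteCommandLists(3, batch);  // one batch, multiple lists
}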

GPUView only shows which queue the workload was submitted to, but not which engine mode was used for a specific command list.

And no matter what, no Nvidia GPU is able to execute 3D and compute shaders concurrently, and attempting to do so will result in various stalls.

"Microsoft and Lionhead Studios have assured us that asynchronous compute is indeed activated and on for the test, across the board, and that it doesn’t turn off based on the presence of a card that doesn’t support it.

Read more: http://wccftech.com/asynchronous-compute-investigated-in-fable-legends-dx12-benchmark/#ixzz3nUuiuGAQ"

So that suggests the previous claim from the Anandtech forums, that there is a second build which allows async to be toggled on and off resulting in much higher performance on AMD cards, was false.
Nope. The numbers were from AMD's press deck, not from a build Anandtech had access to. And you can be sure that AMD was able to turn off async compute by forcefully rerouting any commands to the 3D engine, which is more than just rerouting everything to the 3D queue, which Nvidia's driver does anyway.

And it also makes sense. Even though GCN is capable of executing compute shaders from the 3D engine concurrently (yeah, surprise!), doing so still complicates scheduling on the GPU, partially because it means that all barriers used for the 3D path are also applied to the compute commands.
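For contrast, when the compute work stays on its own queue it isn't subject to the 3D path's barriers; any ordering against the 3D queue has to be expressed explicitly with fences. A minimal sketch (it assumes the fence was created with ID3D12Device::CreateFence and that the 3D work was already submitted):
Code:
#include <d3d12.h>

// Sketch: explicit cross-queue synchronization via a fence instead of
// inheriting the barriers recorded on the 3D path.
void RunComputeAfterGraphics(ID3D12CommandQueue* directQueue,
                             ID3D12CommandQueue* computeQueue,
                             ID3D12Fence* fence,
                             UINT64 fenceValue,
                             ID3D12CommandList* computeList)
{
    directQueue->Signal(fence, fenceValue);   // 3D queue signals when its work is done
    computeQueue->Wait(fence, fenceValue);    // compute queue waits on the GPU timeline
    computeQueue->ExecuteCommandLists(1, &computeList);
}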

And as for Nvidia: See above.
 
Thank you.
Finally someone has started to put up some meaningful graphs. I hope to see deeper analysis in the future, because it's unclear what is really going on in the Titan X profiling session; it looks like all the compute work is executed from the default queue (i.e. it is serialized directly by the application).
This also looks strange: "but is there for another background task not associated to the benchmark at all."... and it is also quite irregular, while the GCN gen 3 session is the opposite, generating a pretty regular graph.
 
Since the AOS benchmark came out, new GPU reviews at Guru3D have included it as part of their testing regimen. It's interesting to note that performance on GCN hardware is basically the same regardless of which model is used. You do see more variation with Maxwell 2, especially with factory OC'd models, which is one of the advantages of having an OC-friendly architecture. I can only assume Maxwell 2 performance will improve once Nvidia releases a driver optimized for async compute, but at this point it seems to be holding its own.

For the graphs below, batches are defined as follows:

“Normal” batches contain a relatively light number of draw calls, heavy batches are those frames that include a huge number of draw calls increasing scene complexity with loads of extra objects to render and move around. The idea here is that with a proper GPU you can fire off heavy batches with massive draw calls.



[Guru3D graphs: "normal batches" and "heavy batches" benchmark results]


http://www.guru3d.com/articles-categories/videocards.html
 
This ain't correct. It's not running concurrently on Nvidia's hardware. That just isn't possible, at all.

I read the quote to be more a statement that that is how the game is configured to be used by GPUs that support concurrent compute/graphics, not necessarily that all GPUs were behaving that way. So in effect, the advantage that AMD has over Nvidia in this case is specific to vegetation physics.

Nope. The numbers were from AMD's press deck, not from a build Anandtech had access to.

This was in reference to a user on the Anandtech forums who claimed he had access to a different build of the benchmark from the one given to the press, with a toggle to enable or disable async compute, and he was apparently seeing a massive boost in performance on AMD hardware as a result, i.e. Fury X easily beating Titan X.

A dodgy claim to be sure but some claimed he was a reliable source.
 
This also looks strange: "but is there for another background task not associated to the benchmark at all."... and it is also quite irregular, while the GCN gen 3 session is the opposite, generating a pretty regular graph.

Something to do with GeForce Experience? Shadow Play perhaps? Just a (very) wild guess.
 
"Multi-engine is the official D3D12 term for that feature. We’re using it quite heavily across the scene, including dynamic GI compute shaders, GPU-based culling for instanced foliage, Forward Plus compute shaders (light gathering). In addition, all the foliage physics in the scene (bushes, grass, etc.) is simulated on the GPU that also runs concurrently with the graphics.

Read more: http://wccftech.com/lionhead-dx12-features-fable-legends/#ixzz3nUvFWICF"

Some of these things might not be as taxing in a canned benchmark. Foliage physics is the clearest example; one might not see a difference between going async and doing what Nvidia does.
 
Some of these things might not be as taxing in a canned benchmark. Foliage physics is the clearest example; one might not see a difference between going async and doing what Nvidia does.


What’s interesting is that there is actually very little happening in the compute queue. We’re looking at a total of 18.91%. This is in stark contrast to what Lionhead has told us. But keep in mind that while we aren’t seeing much in this pre-defined demo, that doesn’t mean that the game itself will act in the same way.

Read more: http://wccftech.com/asynchronous-co...fable-legends-dx12-benchmark/2/#ixzz3nXHwnsWA
 
I'm not familiar with GPUView, but doesn't that 18.91% represent a percentage of the frametime? To figure out how much is actually going on there relative to the entire scene, you would have to take into account how fast the hardware completes certain amounts of work, as well as the type of work.

Can they conclude much from 18.91%? There could be a crap ton of compute going on but being done really quickly. Or very little being done relatively slowly.

What about the 5% claim attributed to Lionhead?

I wish Lionhead would be as transparent as Oxide were.

For the Nano it looks like there are some compute tasks in the graphics queue - maybe nothing. The tasks in the compute queue also seem to occur at predictable intervals, just the same thing happening at regular intervals. (Not familiar with GPUView)

http://cdn.wccftech.com/wp-content/uploads/2015/10/ViewNano.png
 