DX12 Performance Discussion And Analysis Thread

It's occurred to me that NVAPI stuff has been wired into UE4 for the Fable Legends test, which is why some passes are dramatically faster on NVidia.
None of those extensions are required on DX12, and thus by Fable.

Looks like Nvidia's driver is really a picky eater.
Not really - the vast majority of that stuff is good advice on everyone's driver. Compare it to AMD's and Intel's recommendations at GDC and you'll see that most of it is common. Example of shameless self-promotion:
https://software.intel.com/sites/de...ndering-with-DirectX-12-on-Intel-Graphics.pdf
http://intelstudios.edgesuite.net/idf/2015/sf/aep/GVCS004/GVCS004.html

The divergence on NVIDIA vs. AMD/Intel is more around stuff like them packing depth+stencil together (and apparently not being overly confident in the performance of their ROV stuff :)). Overall fairly minor details that can easily be handled by engines if they want optimal performance everywhere.

Personally, I'm more worried about this one:
  • Don’t create too many threads or too many command lists
    • Too many threads will oversubscribe your CPU resources, whilst too many command lists may accumulate too much overhead
There's nothing to worry about there either - same advice for everyone (see my GDC slides above). This goes along with the regular advice of scaling out to the size of the machine but no further. Parallelism has a cost and there's no point in paying it further once you've filled the machine. This is true on both the CPU and GPU - go as wide as you have to with your algorithm and no wider; run the serial algorithm in the leaves.
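As a rough illustration of "as wide as the machine and no wider", here's a minimal sketch (not from the slides; DrawChunk and RecordChunk are hypothetical placeholders) that sizes a worker pool to the hardware and lets each worker run the serial recording loop over its slice:

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct DrawChunk { /* per-chunk draw data (hypothetical) */ };
    void RecordChunk(DrawChunk&);   // hypothetical: records one chunk into that worker's command list

    void RecordFrame(std::vector<DrawChunk>& chunks)
    {
        // At most one worker per hardware thread, and never more workers than chunks.
        const unsigned workers = std::min<unsigned>(
            std::max(1u, std::thread::hardware_concurrency()),
            static_cast<unsigned>(chunks.size()));

        std::vector<std::thread> pool;
        for (unsigned i = 0; i < workers; ++i)
            pool.emplace_back([&chunks, i, workers] {
                // Serial algorithm in the leaves: each worker strides over its share of chunks.
                for (std::size_t c = i; c < chunks.size(); c += workers)
                    RecordChunk(chunks[c]);
            });
        for (auto& t : pool)
            t.join();
    }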

So, if you can, you should avoid an algorithm that requires task-level barriers.

Alternatively, you create multiple parallel queues, each of which has its own sequence of tasks separated by barriers. But now you need an algorithm that can be chopped up into pieces small enough to be spread over multiple queues.

If you can chop up an algorithm like this, then you can also stripe the tasks in a single queue: A1, B1, C1, A2, B2, C2.
Right, you do not require separate queues to avoid stalls at resource barriers - this is precisely what the "split barrier" stuff in DX12 is. As long as you ensure enough overlap/time between the "begin/end" of your split barrier the GPU may not incur any real cost at all, as many barrier operations can be farmed out to simultaneous hardware engines (blit, etc).
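To make the mechanism concrete, here's a minimal sketch of how a split barrier is expressed in D3D12 (the resource, the states and RecordUnrelatedWork are made-up placeholders): begin the transition early, record work that doesn't touch the resource, and end the transition just before the resource is consumed.

    #include <d3d12.h>

    void RecordUnrelatedWork(ID3D12GraphicsCommandList*);   // hypothetical: work that doesn't touch shadowMap

    void RecordWithSplitBarrier(ID3D12GraphicsCommandList* cmdList, ID3D12Resource* shadowMap)
    {
        D3D12_RESOURCE_BARRIER barrier = {};
        barrier.Type  = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
        barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY;
        barrier.Transition.pResource   = shadowMap;
        barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
        barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_DEPTH_WRITE;
        barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE;
        cmdList->ResourceBarrier(1, &barrier);   // begin-only: the transition may start in the background

        RecordUnrelatedWork(cmdList);            // the more work here, the more of the barrier cost is hidden

        barrier.Flags = D3D12_RESOURCE_BARRIER_FLAG_END_ONLY;
        cmdList->ResourceBarrier(1, &barrier);   // end-only: must complete before shadowMap is sampled
    }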
 
None of those extensions are required on DX12, and thus by Fable.
It's my hypothesis that there is NVAPI code in UE4, allowing alternative, non-D3D12, algorithms that interact with NVidia's hardware for enhanced performance.

I don't know if there ever has been or will be NVAPI stuff in UE, so it's pure speculation.
 
Actually the master branch of UE includes the non-NDA version of NVAPI, which supports feature level 10.0+ (the old, non-FL_10_1-compliant feature set supported by old NV GeForce cards, pre-GTX 300 series IIRC) and the depth-bounds test for DX11...
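For reference, the depth-bounds test is exposed through a single NVAPI entry point on DX11, roughly along these lines (treat this as an illustrative sketch; exact usage per the public nvapi.h):

    #include <d3d11.h>
    #include <nvapi.h>

    // Discard pixels whose depth falls outside [minDepth, maxDepth] (illustrative sketch).
    void EnableDepthBounds(ID3D11Device* device, float minDepth, float maxDepth)
    {
        if (NvAPI_Initialize() != NVAPI_OK)
            return;   // NVAPI unavailable (e.g. non-NVIDIA GPU)
        NvAPI_D3D11_SetDepthBoundsTest(device, 1 /*enable*/, minDepth, maxDepth);
    }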
 
I'm starting to think Nvidia is addressing some things not done in the AOS pre-alpha demo but time will tell.

The firm explains that these easy-to-digest bullet-point guidelines will be useful to programmers as: "The DX12 API places more responsibilities on the programmer than any former DirectX API".
...
Nvidia says that developers will find themselves responsible for resource state barriers and the use of fences to synchronize command queues. Meanwhile, illegal API usage won't be caught or corrected by the DX runtime or the driver, so developers will have to be stringent with their code and should "strongly leverage the debug runtime and pay close attention to any errors that get reported." Finally, a good familiarity with the DX12 feature specifications is recommended.

Nvidia's do's and don'ts list includes plenty of recommendations regarding the parallel nature of DX12 and how to make the best of it. The list is largely vendor-agnostic, but the last set of tips does concern Maxwell GPU features.
http://hexus.net/tech/news/graphics/86783-nvidia-publishes-dx12-dos-donts-checklist-developers/
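On the "strongly leverage the debug runtime" point, enabling the debug layer is a one-liner before device creation; a minimal sketch:

    #include <windows.h>
    #include <d3d12.h>
    #include <wrl/client.h>

    // Must be called before D3D12CreateDevice; the debug layer then validates API usage
    // and reports errors that the runtime/driver would otherwise silently let through.
    void EnableDebugLayer()
    {
        Microsoft::WRL::ComPtr<ID3D12Debug> debug;
        if (SUCCEEDED(D3D12GetDebugInterface(IID_PPV_ARGS(&debug))))
            debug->EnableDebugLayer();
    }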
 
Another bench comparing async on and off. This benchmark version might not be the same as the one reviewers got.

http://forums.anandtech.com/showthread.php?p=37724231#post37724231

Async ON result @ 1080 is far above anything reviewers were able to get. Async OFF, on the other hand, is in the ballpark.
In theory he might be right. AMD provided some numbers in the Gaming Evolved Update press deck, and it contains async on and off performance. If we calculate the overall performance manually, we get nearly the same performance difference.
 

Attachments

  • amdasync.jpg
you do not require separate queues to avoid stalls at resource barriers - this is precisely what the "split barrier" stuff in DX12 is. As long as you ensure enough overlap/time between the "begin/end" of your split barrier the GPU may not incur any real cost at all
I was going to say exactly the same. Split barriers are the correct way to hide GPU stalls, assuming you have some other work to perform while the barrier is transitioning from one state to another.

There are some expensive resource transition cases that force the GPU to do various operations such as color/depth decompression, fast clear resolve or ROP/L2 cache flushes. Some of these operations can be performed in the background if there is enough work inside the split barrier. A barrier might also need to perform cache flushes. Depending on the GPU's memory hierarchy, a cache flush might invalidate more data than needed, causing other work to stall. These kinds of stalls cannot be avoided by split barriers.
 
Those sites should really learn to read and use GPUView, because what they are actually reporting doesn't tell anything about D3D12 feature implementation and efficiency.
 
It would be important to know what CPU was used for his test, because the Fury X is much faster on an i7-5960X than on any 4-core Core series CPU.

I think the Fury X in that test was using Catalyst 15.9, which wasn't available to reviewers at the time the demo was released.
 
They are free to roam. Don't know yet if it has undesired side-effects.
They probably have. Nvidia warned that barriers and fences are only committed externally once per command list, at the end. Internal barriers behave correctly. I can't figure out right now whether this just imposes a performance penalty or whether it can cause deadlocks.

I think it CAN cause deadlocks on Nvidia's hardware, if two compute lists each start a barrier, each waiting for the other one to finish. If these two are in different queues, each with commands enqueued before and after its barrier, it will ONLY succeed on hardware capable of actual concurrent execution with true async message-passing capabilities.
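For illustration only, this is the kind of cross-wait pattern being discussed, expressed with queue-level fences (all objects are placeholders). With the documented D3D12 semantics both signals are processed before the waits, so it completes; under the speculated behaviour, where signals only become visible once all of a queue's enqueued work has finished, the two waits would starve each other:

    #include <d3d12.h>

    void SubmitPingPong(ID3D12CommandQueue* queueA, ID3D12CommandQueue* queueB,
                        ID3D12Fence* fenceA, ID3D12Fence* fenceB,
                        ID3D12CommandList* workA1, ID3D12CommandList* workA2,
                        ID3D12CommandList* workB1, ID3D12CommandList* workB2)
    {
        queueA->ExecuteCommandLists(1, &workA1);
        queueA->Signal(fenceA, 1);   // A: "phase 1 done"
        queueA->Wait(fenceB, 1);     // A: wait for B's phase 1
        queueA->ExecuteCommandLists(1, &workA2);

        queueB->ExecuteCommandLists(1, &workB1);
        queueB->Signal(fenceB, 1);   // B: "phase 1 done"
        queueB->Wait(fenceA, 1);     // B: wait for A's phase 1
        queueB->ExecuteCommandLists(1, &workB2);
    }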
 
They probably have. Nvidia warned that barriers and fences are only committed externally once per command list, at the end. Internal barriers behave correctly. I can't figure out right now whether this just imposes a performance penalty or whether it can cause deadlocks.

If it deadlocks it's your own fault because your algorithm is buggy. Barriers are less sensitive to this than fences though.

I see no reason why it should misbehave. Either the barrier is forced at the beginning of the command list (the end must come anyway), or wherever it fits, or at the end of the command list. The question is only whether cases 1 and 3 perform worse than the desired case 2, and I expect the answer to be: it depends.

Besides, it's cumbersome to keep split barriers in one command list if you use a copy queue, because you want the begin as early as possible and the end just before the copy, but there aren't many commands before the copy in the copy queue, and you want to execute the copy ASAP. And if you begin 20 barriers, immediately end 20 barriers, then copy 20 resources, then begin the barriers and end them again, you don't need split barriers. Or if you do it interleaved, you have huge command-submission overhead.
You could sidestep this and sync the resource barriers with fences, but that is curing a bleeding cut with an axe.
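For what it's worth, the fence "axe" for the copy-queue case boils down to something like this sketch (all objects are placeholders): the direct queue signals once the resource's last use has been submitted, and the copy queue waits GPU-side before running its copy list.

    #include <d3d12.h>

    void CopyAfterFence(ID3D12CommandQueue* directQueue, ID3D12CommandQueue* copyQueue,
                        ID3D12Fence* fence, UINT64 value, ID3D12CommandList* copyList)
    {
        directQueue->Signal(fence, value);        // after the direct queue's last use of the resource
        copyQueue->Wait(fence, value);            // GPU-side wait, no CPU stall
        copyQueue->ExecuteCommandLists(1, &copyList);
    }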
 
GCN1_2.png

Another try on figuring out how AMDs hardware is actually working.
  • Refined with some new insights
  • Distinction between Queues, Command Lists, Grids
  • Added Work Distributor to GCP, not enough data on that one though
  • Added Command List slot balancing
  • Synchronisation via Global Datashare still isn't annotated

Few new assumptions:
  • HWS on GCN 2.0 is balancing occupation of two ACEs each, but only at a Command List level. Backpressure from one queue can bleed into the adjacent ACE for a total of 16 Command Lists, or 128 Grids, in flight. Took me quite a while to figure that out, but it actually makes sense. It's possibly doing even more than that, but I don't have conclusive data yet.
  • Command List slots are only occupied until the last Grid / command has been scheduled. This is unlike Maxwell, where the single(!) Command List slot remains occupied until the Grid / command has FINISHED execution. -> Undersized Command Lists achieve MUCH better utilization on GCN as they won't block!
  • It's not actually 64 compute command queues as spread throughout the media, but 64 compute command list slots. Each ACE is only addressable as a single queue, but will start execution of up to 8 subsequent command lists in parallel.
  • Each ACE has a 64-Grid-wide Work Distributor integrated, which can be filled from any active Command List managed by that ACE.
  • I was told that each ACE is supposed to be able to dispatch 1 Wavefront per cycle, from any of the Grids in flight.

Disclaimer: As usual, data isn't fully confirmed yet. Also still no redistribution please. Feedback and corrections are welcome.
 
View attachment 965

Another try on figuring out how AMDs hardware is actually working.
  • Refined with some new insights
  • Distinction between Queues, Command Lists, Grids
  • Added Work Distributor to GCP, not enough data on that one though
  • Added Command List slot balancing
  • Synchronisation via Global Datashare still isn't annotated
*snip*
There should be something called the Shader Processor Interpolator (SPI) between the front-ends and the compute units, per the documentation. This is the unit that actually schedules, configures ahead of time, and finally launches wavefronts to the Compute Units. So it is very likely that the front-end dispatches thread groups instead of wavefronts to the SPI, since a thread group must be bound to just one CU.

Moreover, it is also possible for each "shader engine" (group of CUs) to have its own SPI, but I am not quite sure about this. By the way, VGLeaks once posted details with a diagram of the front-end architecture of the PS4 GPU.
 
There should be something called the Shader Processor Interpolator (SPI) between the front-ends and the compute units, per the documentation. This is the unit that actually schedules, configures ahead of time, and finally launches wavefronts to the Compute Units.
I think you are referring to this? http://www.vgleaks.com/orbis-gpu-compute-queues-and-pipelines
That's only GCN 1.0 for the bottom chart, but the PS4 GPU looks ... weird? That's neither GCN 1.0 nor 1.1. Also more than a single SPI block. And the compute pipeline looks ... different.

That would mean the SPI is performing the actual dispatch to the CUs. There can (or cannot) be multiple SPI units. These are not embedded into the CUs. The SPI appears to be responsible for filling in registers and buffers for each CU.

Each GCN 1.0 CS Dispatcher appears to be managing actual compute commands, and transforming grids into workgroups. I wonder what "TG" and "IA" stand for.

It's weird. It's as if the Liverpool GPU in the PS4 doesn't even have a CS Dispatcher. I don't think the chart is actually accurate. The missing CS Dispatchers in Liverpool would be placed at the bottom of the CS Pipes which feed into the SPI. But that still doesn't look right. There's more missing.

It's actually 8 ring buffers per compute pipe, so also 8 ring buffers per ACE in 1.1 and 1.2. I think the heaps are no longer split between ACEs in 1.2.

"Draw Engine" in the compute shaders looks like it is responsible for fences and alike, or in short: Command lists. That thing was most likely redesigned in 1.1 to support accept multiple active command lists, which can be filled from any of the ring buffers, and once again in 1.2 in order to support balancing.

The missing Draw Engine on Liverpool appears reasonable; that would only mean it lacks fence and constant update support.

I think the CS Dispatcher is the one responsible for keeping track of grids in flight, with a hard limit of 64 grids per Dispatcher. The CS block down in the SPI is only for resource allocation and is fed sequentially, whereby the actual dispatch rate can be scaled by adding more SPI blocks.


Well, looks like I will have to update the graph once again.
 
More insight :)

TG should stand for Thread Group. So it's basically a resource descriptor for a linked group of wavefronts. "IA" is still unknown.

And I'm no longer sure whether it's actually the CS Dispatcher that keeps track of commands in flight; it may instead be the "Draw Engine". So the CS Dispatcher is not keeping track of anything; it's only the Draw Engine which ensures that no grid is scheduled twice. There appears to be a queue in between the CS Dispatcher and the CS resource allocator in the SPI block. But it also means that (difficult to allocate) thread groups already scheduled can only be jumped by compute commands handled by a different ACE, assuming that GCN 1.1 and 1.2 still only have a single CS Dispatcher per ACE.


The following is just a guess, but I assume the Draw Engine works as follows in GCN 1.1:

It holds 8 unique pointers, one per ring buffer. If one of these can be advanced, it knows that it can fetch another command. In addition, it has another 64 execution slots for compute commands in flight.

  • When the next command is a barrier end, it will check whether the condition is met, and skip this queue otherwise.
  • When the next command is a fence, it will check whether the reference counter is zero and the condition is met, and skip this queue otherwise.
  • When it is a regular compute command, it will check for a free execution slot, assign the command, and probably increase a reference counter for that queue, and skip this queue otherwise (if no slot is free).
  • When it is a barrier start, it will check whether the reference counter is zero, then trigger the transition, and skip this queue otherwise.

Whenever the attached CS Distributor is idle, it will evict the previously active compute command and reduce the reference counter on the corresponding queue. It will then continue to flag the next occupied slot as active and to commit it to the CS distributor.
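To make the guess above easier to follow, here's a purely illustrative toy model of that scheduling loop in code. It is NOT based on any documentation, and every name in it is made up:

    #include <array>
    #include <deque>

    enum class CmdType { Compute, BarrierBegin, BarrierEnd, Fence };

    struct Command {
        CmdType type;
        bool conditionMet = true;   // barrier/fence condition, modelled as a simple flag
    };

    struct Queue {
        std::deque<Command> ring;   // one of the 8 ring buffers
        int refCount = 0;           // compute commands from this queue still in flight
    };

    struct DrawEngineModel {
        std::array<Queue, 8> queues;
        int freeSlots = 64;         // execution slots for compute commands in flight

        // One pass over the queues: consume the head command where eligible, otherwise skip the queue.
        void Tick() {
            for (auto& q : queues) {
                if (q.ring.empty()) continue;
                Command& c = q.ring.front();
                switch (c.type) {
                case CmdType::BarrierEnd:
                    if (!c.conditionMet) continue;            // transition not finished yet
                    break;
                case CmdType::Fence:
                    if (q.refCount != 0 || !c.conditionMet) continue;
                    break;
                case CmdType::Compute:
                    if (freeSlots == 0) continue;             // no execution slot free
                    --freeSlots; ++q.refCount;                // assign the command to a slot
                    break;
                case CmdType::BarrierBegin:
                    if (q.refCount != 0) continue;            // prior work from this queue still in flight
                    break;                                    // "trigger the transition"
                }
                q.ring.pop_front();                           // command consumed
            }
        }

        // Called when the CS Distributor finishes the command it was fed from queue i.
        void Retire(int i) { ++freeSlots; --queues[i].refCount; }
    };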

So much for 1.1 behavior at least. I don't even think any more that it is actually command list aware. It's just doing a good job at pipelining. And it's simple. So simple that it might actually be full ASIC.

The HWS on 1.2 is slightly more complex. It is command list aware.
The question would be: how?
That's actually not looking like ring buffers anymore, as skipping or offloading simply wouldn't be possible with ring buffers, at least not without copying. So it's probably one additional level of indirection, with the ring buffers only containing references to the actual command lists. And a single side port per Draw Engine to hand over command lists to the other, paired Draw Engine in case all 8 handles of the 1.1-style Draw Engine are already occupied.

Not a big change actually, just one tiny indirection. But it unlocks even more potential when dealing with undersized command lists, with up to 16 command lists from a single queue in flight concurrently, if I'm not mistaken.
GCN 1.1 only achieved a single active command list per queue, and the rest was just efficient pipelining.



Now, I'm wondering once again: where is Nvidia failing?
Implicit fences between command lists, possibly? Sounds like there is some batched cleanup performed at the end of each command list. That would of course block pipelining commands from the subsequent command list, and it would explain why GCN is running circles around Maxwell when the command lists are too short. I wonder if Maxwell wouldn't actually profit from using multiple software queues as well; if there really is just an implicit barrier, it shouldn't have any impact on concurrent queues.

For once, I'm going to assume that they did not lie about the changes to async compute capabilities on Maxwell v2, that they actually do have multiple, concurrent compute queues available and that they are working.
Such an implicit barrier inside each single queue would still explain all the anomalies we have seen so far. And AFAIK all "real" benchmarks so far only used a single compute queue as well, so they would naturally all run into this tiny catch when the workload is degenerate.
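The "multiple software queues" idea is cheap to try, by the way; roughly something like this (illustrative sketch, placeholder names; it assumes the command lists are independent of each other):

    #include <windows.h>
    #include <d3d12.h>
    #include <wrl/client.h>

    using Microsoft::WRL::ComPtr;

    ComPtr<ID3D12CommandQueue> CreateComputeQueue(ID3D12Device* device)
    {
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
        ComPtr<ID3D12CommandQueue> queue;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
        return queue;
    }

    // Round-robin independent compute command lists across two queues, so a per-queue
    // end-of-command-list barrier (if it exists) can't stall the other queue's work.
    void SubmitSplit(ID3D12Device* device, ID3D12CommandList* const* lists, UINT count)
    {
        ComPtr<ID3D12CommandQueue> queueA = CreateComputeQueue(device);
        ComPtr<ID3D12CommandQueue> queueB = CreateComputeQueue(device);
        for (UINT i = 0; i < count; ++i)
            ((i & 1) ? queueB : queueA)->ExecuteCommandLists(1, &lists[i]);
    }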
 