DX12 Performance Discussion And Analysis Thread

Sorry for going slightly off topic, but this is relevant to my previous post and to Silent_Buddha mentioning it should be ignored from a scientific perspective.
Science does have a recent example:
A few years back a group of scientists at CERN believed they had observed something travelling faster than light, which theory said was impossible and which all evidence to date had suggested was impossible in the real world.
The reason this was further investigated was that their results and data were consistent and repeatable, even though they went against both established theory and tests by others.
Yes, eventually it was shown that they had messed up their test environment, but it was only further studied and analysed by other scientists (who do not like wasting time and resources) because the original scientists had managed to create a situation that was consistent and repeatable for them.

Cheers
 
Guys I hate to throw a wet blanket on this (I'm sort of amused that we have people messing around with GPUView voluntarily in their spare time :)), but I think it's time to ask what the point in this exercise is? As a few people have sort of argued on the previous page, none of the results from synthetic workloads are going to generalize to "real" workloads here, as by its very nature this is all completely workload and architecture dependent. Trying to drill into specifics of the implementation on one piece of hardware (reverse engineering) is interesting from a curiosity point of view, but don't be under the illusion that these tests are going to be predictive of real workload performance, or even that one "real" async compute workload is going to be predictive of another. It's roughly like saying that one architecture is "good at compute" or some similarly general/meaningless statement.

Of course I don't realistically expect people to stop digging - as it is sort of fun - but just keep the limitations of this data in perspective and remember that fanboys all across the internet like to grab bits of data out of context from here to support whatever preconceived notions or brand loyalty they have :) Let's not be those folks here at least.
 
Yeah, we should stop discussing stuff in a bid to learn how GPUs work because some idiots on a forum far away are going to misinterpret it.
That's not what I said; in fact I explicitly said it's fun to reverse engineer :) I'm just cautioning against drawing any broader conclusions or predictions from these sorts of tests. They fundamentally cannot be predictive of how various implementations might behave in other workloads (or on other dates/drivers :)). That's my only real point.
 
Yeah, we should stop discussing stuff in a bid to learn how GPUs work because some idiots on a forum far away are going to misinterpret it.

There have been a lot of misunderstandings and bad interpretations of the work that was done here... We have all seen the articles that were based on this test and the results posted here. So it would not be ridiculous to be a little bit careful about this. Not everyone has the knowledge of the people who post here...
 
I don't think it's the fault of anyone here if those outside of this forum misinterpret the data. However, it is evident that the testing thus far has been too simple and has controlled for very few variables, and doesn't have much predictive value for real-world scenarios.
 
Guys I hate to throw a wet blanket on this
Well some conclusions based on this playing around deserve a full blown fire hose, not just a wet blanket. :)

What was it "forced async"? Seriously guys (& girls? :))? We came up with there here, it wasn't copied from somewhere else?

Yeah, we should stop discussing stuff in a bid to learn how GPUs work because some idiots on a forum far away are going to misinterpret it.
Of course not. :) But the spillover is a bit depressing.

You are right, somewhat, I oversimplified that statement. Wavefront size is 32 for Nvidia and is therefore also the optimal workgroup size; however, with 2048 shaders and only 32 concurrent kernels, you can only achieve full shader saturation with at least 64 threads per kernel on average.
How is that full saturation? So you issue 64 warps. Then what? Next clock you can't issue new instructions from any of the threads that are being worked on until those warps are done (ILP aside). So SMM won't issue anything useful for another 5 clocks at best (warp is executing a MAD) and 100s of clocks at worst (warp is executing memory access). The same goes for GCN.

Now, one reason why this test might fail on NVidia is that GM2xx re-uses rasterisation hardware within each graphics processing cluster (GPC) to generate hardware threads for compute. If that hardware is fully loaded with the fill-rate portion of this test, then it's doomed.
I don't think that's the case.
There are multiple types of queues in D3D12. Of interest here are direct and compute queues. Direct can execute graphics and compute; compute can only execute compute. The app currently uses two queues, one compute and one direct, and there's a scenario where both go through the direct queue.
Now dispatching a number of independent kernels from a single compute queue (single command list) will run them concurrently on NV GPUs. What's interesting here is that when using multiple compute queues those dispatches don't run concurrently anymore (the first version of the app). Multiple queues just seem to get serialized. What happens in the "super slow" scenario where everything is run on one direct queue is that all compute dispatches and draw calls run serially, meaning it won't even run 32 compute dispatches concurrently for some reason. This same scenario will also happen on a direct queue if you only dispatch compute kernels without draws.
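For reference, the two-queue setup being described maps onto the D3D12 API roughly like this. This is a minimal sketch with illustrative names, not the benchmark's actual code; pipeline state and root signature creation are assumed to happen elsewhere.

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Illustrative only: "computePso" and "computeRs" are assumed to be created elsewhere.
void SubmitIndependentDispatches(ID3D12Device* device,
                                 ID3D12PipelineState* computePso,
                                 ID3D12RootSignature* computeRs,
                                 UINT dispatchCount)
{
    // One direct queue (graphics + compute + copy) and one compute queue (compute + copy).
    D3D12_COMMAND_QUEUE_DESC directDesc = {};
    directDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
    D3D12_COMMAND_QUEUE_DESC computeDesc = {};
    computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;

    ComPtr<ID3D12CommandQueue> directQueue, computeQueue;
    device->CreateCommandQueue(&directDesc, IID_PPV_ARGS(&directQueue));
    device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&computeQueue));

    // Record a number of independent dispatches into one compute command list,
    // with no barriers between them.
    ComPtr<ID3D12CommandAllocator> allocator;
    ComPtr<ID3D12GraphicsCommandList> list;
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE, IID_PPV_ARGS(&allocator));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE,
                              allocator.Get(), computePso, IID_PPV_ARGS(&list));
    list->SetComputeRootSignature(computeRs);
    for (UINT i = 0; i < dispatchCount; ++i)
        list->Dispatch(1, 1, 1);            // independent kernels
    list->Close();

    // Whether these dispatches actually execute concurrently is up to the
    // driver/hardware, which is exactly what the thread is probing.
    ID3D12CommandList* lists[] = { list.Get() };
    computeQueue->ExecuteCommandLists(1, lists);
}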
 
How is that full saturation? So you issue 64 warps. Then what? Next clock you can't issue new instructions from any of the threads that are being worked on until those warps are done (ILP aside). So SMM won't issue anything useful for another 5 clocks at best (warp is executing a MAD) and 100s of clocks at worst (warp is executing memory access). The same goes for GCN.
My fault, I forgot about instruction latency on Maxwell. Well, then the lowest reasonable limit for the warp count is even higher. Even better, more parameters to play with.

Guess you were referring to the numbers from this paper? http://lpgpu.org/wp/wp-content/uploads/2013/05/poster_andresch_acaces2014.pdf

GCN has only 4 cycles of latency for simple SP instructions, which I already accounted for, at least for "pure" SP loads. This is also the minimum latency for all instructions on GCN, and GCN also features a much larger register file.

The 6 cycle SP latency for Maxwell however is ... weird. I actually thought that Maxwell had a LOWER latency for primitive SP instructions than GCN, but the opposite appears to be the case???

The app currently uses two queues, one compute and one direct, and there's a scenario where both go through the direct queue.
No, in both scenarios the application is always using a dedicated direct queue and a dedicated compute queue (at least from the software point of view). What changes are the flags set on the compute queue, whether there is a CPU-side barrier between submitting to the direct and the compute queue, and whether there are any draw calls at all.
What happens in the "super slow" scenario where everything is run on one direct queue is that all compute dispatches and draw calls run serially, meaning it won't even run 32 compute dispatches concurrently for some reason. This same scenario will also happen on a direct queue if you only dispatch compute kernels without draws.
Uhm, no, that's not the case. It's just refusing to run the compute dispatches concurrently because it has been told not to: the software queue was explicitly flagged to be executed sequentially.
For that test, it's actually GCN which behaved oddly, as it chose to ignore that flag, probably because the driver assumed it was safe to do so.
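For context, the generic D3D12 mechanism behind such a CPU-side barrier between the two queues looks roughly like this. A minimal sketch with illustrative names, not the benchmark's actual code: the host blocks on a fence after submitting to the direct queue and only then submits the compute work, so no overlap is possible.

#include <d3d12.h>
#include <windows.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Illustrative only: the command lists are assumed to be recorded elsewhere.
void SubmitWithCpuBarrier(ID3D12Device* device,
                          ID3D12CommandQueue* directQueue,
                          ID3D12CommandQueue* computeQueue,
                          ID3D12CommandList* graphicsWork,
                          ID3D12CommandList* computeWork)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);

    directQueue->ExecuteCommandLists(1, &graphicsWork);
    directQueue->Signal(fence.Get(), 1);

    // CPU-side barrier: the host blocks until the direct queue has finished
    // before it even submits the compute work, so nothing can overlap.
    fence->SetEventOnCompletion(1, done);
    WaitForSingleObject(done, INFINITE);

    computeQueue->ExecuteCommandLists(1, &computeWork);
    CloseHandle(done);
}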


Don't try to make too much sense of the results; so far the benchmark isn't limited by anything except concurrency at the call level, and it doesn't even measure the impact of concurrency at the SMM/CU level.

The current benchmark has only yielded a few usable results so far, namely the number of calls the schedulers of Maxwell, GCN 1.0/1.1 and GCN 1.2 can each dispatch, with the biggest surprise being that the HWS (two ACEs "fused") in GCN 1.2 apparently differs in behavior from the plain ACEs used in GCN 1.1 and 1.0.
It also provoked the driver crash on Nvidia from starving calls, and revealed that driver optimization on GCN where it could ignore the sequential flag.


Apart from that, you can rather clearly see that power management screwed up the numbers on both Maxwell and GCN. Maxwell went into boost whenever it had no draw calls at all (or at least received a constant speedup for some other unknown reason), and GCN didn't leave 2D clocks fast enough when confronted with only a single "batch" of compute calls.

Which made these two graphs in particular about as inconclusive as they could possibly get:
http://www.extremetech.com/wp-content/uploads/2015/09/3qX42h4.png
http://www.extremetech.com/wp-content/uploads/2015/09/vevF50L.png
Just to pick two graphs which got abducted from this thread and ... got misinterpreted, because they mostly measured a lot of random noise. A+ for scientific method.
 
[snip]
Apart from that, you can rather clearly see that power management screwed up the numbers on both Maxwell and GCN. Maxwell went into boost whenever it had no draw calls at all (or at least received a constant speedup for some other unknown reason), and GCN didn't leave 2D clocks fast enough when confronted with only a single "batch" of compute calls.
[snip]

Disabling power management features on the cards could provide constant clocks and control for that variable. I know MSI Afterburner should allow you to completely disable power management functions on AMD cards (PowerPlay, ULPS, etc.), at the least keeping the cards at 3D clocks at all times. You may also have to raise the power limit % on some AMD cards as well, though I'm not sure if that's controlled by PowerPlay. I'm not sure about the nVidia side, however.
 
How is that full saturation? So you issue 64 warps. Then what? Next clock you can't issue new instructions from any of the threads that are being worked on until those warps are done (ILP aside). So SMM won't issue anything useful for another 5 clocks at best (warp is executing a MAD) and 100s of clocks at worst (warp is executing memory access). The same goes for GCN.
My fault, I forgot about instruction latency on Maxwell. Well, then the lowest reasonable limit for the warp count is even higher. Even better, more parameters to play with.
[...]
GCN has only 4 cycles of latency for simple SP instructions, which I already accounted for, at least for "pure" SP loads. This is also the minimum latency for all instructions on GCN, and GCN also features a much larger register file.

The 6 cycle SP latency for Maxwell however is ... weird. I actually thought that Maxwell had a LOWER latency for primitive SP instructions than GCN, but the opposite appears to be the case???
You had me fooled for a moment. Yes, there is latency involved on Maxwell, and no, it isn't exactly low. But you forgot something: it's a pipeline. A single thread can issue another 5 SP instructions before the first one has finished; it just must not access the result register before the latency has passed. This is fundamentally different from GCN, where at least the primitive SP instructions behave as if they were true 1T instructions, but on quarter-speed hardware. So a single warp can actually achieve full SP saturation on Maxwell hardware; not all instructions with seemingly high latency actually cause pipeline stalls in the first place.
 
Are current win10/dx12 drivers for amd/nvidia WHQL certified yet? Any chance all the results up to this point will be forfeit with a new driver release? (at least from nvidia)
 
Are current win10/dx12 drivers for amd/nvidia WHQL certified yet? Any chance all the results up to this point will be forfeit with a new driver release? (at least from nvidia)
Yes and almost certainly. WHQL only really tests for basic functionality - there is no performance testing or anything and even for functionality there isn't 100% coverage. These are still early days for DX12 drivers and will continue to be until a good number of games are in the market.
 
Nothing, at least nothing you couldn't read from the specs.
When issued concurrently: The actual possible gains from async scheduling.
The source of a delay is, to a significant degree, independent of its effect on queue management.
A stall due to shared memory contention or texturing is not going to faze the queue processor differently than a stall due to a wait on a timer.
Even in the single-queue case, hardware is capable of a large amount of overlap due to pipelining, so the shaders can readily become bandwidth or register-constrained in the absence of asynchronous queue processing.
It's when there are barriers on the queue that there is a defined difference for this functionality.

Concurrent issue from the queues is also not in itself a useful goal. GCN's ACEs can individually at most launch one wavefront per cycle, and I'm blanking on whether there is a global limit based on the number of shader engines that can initiate a wavefront per cycle. The single compute queue test may very well be a serial process for significant amounts of time.
Concurrent processing within the back end can still happen even if the queues have commands pulled serially. It would fall below our ability to measure readily if we want to pick out the nanoseconds lost to a serial read versus parallel.
The noise floor we have is a million times too high to tell the difference, and the kernels themselves most probably have far too many other sources of overhead to reveal it.

The entire idea behind async compute is to increase utilization by interleaving different workloads, which allows you to achieve full utilization of the entire GPU despite not having optimized your individual shaders for the pipeline of each GPU architecture.
The primary idea behind async compute is to explicitly define non-dependent workloads that can move past each other in the absence of an explicit synchronization point.
Items beyond that are gravy, and with the optimizing graphics subsystems we have, not every benefit attributed to asynchronous compute is missing from the synchronous variety.

You are right, somewhat, I oversimplified that statement. Wavefront size is 32 for Nvidia and is therefore also the optimal workgroup size; however, with 2048 shaders and only 32 concurrent kernels, you can only achieve full shader saturation with at least 64 threads per kernel on average.
Later references to this point showed that this is not the case, since we have not stipulated single-instruction kernels. That aside, I think it would be worthwhile to keep batch sizes that are sub-optimal for a specific architecture as a data point.

Well, out-of-order launches, that is, the scheduling of full wavefronts, are only one half of async compute. The actual concurrent execution of multiple wavefronts per SIMD is the other half.
This might be an eventual optimum, but it is an implementation detail, similar to how software does not need to care about whether a CPU is out of order.
They wouldn't go through the effort of adding the caveat about concurrent execution not being guaranteed at every opportunity if it were that critical to them.
 
The primary idea behind async compute is to explicitly define non-dependent workloads that can move past each other in the absence of an explicit synchronization point.
Items beyond that are gravy, and with the optimizing graphics subsystems we have, not every benefit attributed to asynchronous compute is missing from the synchronous variety.
Right, and it's worth again pointing out that you don't even require the separate queues model in DX12 to do this. As long as the command processor itself is not the bottleneck (it rarely is), lots of stuff can be fed into the front of the pipeline and run in parallel via simple pipelining. There would be no point in the ubiquitous "UAV overlap" DX11 extensions (and DX12 spec change in this area - unrelated to multiple queues) otherwise.
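As a hedged illustration of that point (illustrative names, not tied to this particular benchmark): within a single command list on a single queue, back-to-back dispatches are free to overlap via pipelining as long as no UAV barrier is recorded between them.

#include <d3d12.h>

// Illustrative only: "sharedUav" is a hypothetical UAV resource.
void RecordTwoDispatches(ID3D12GraphicsCommandList* list,
                         ID3D12Resource* sharedUav,
                         bool kernelBReadsKernelAOutput)
{
    list->Dispatch(64, 1, 1);   // kernel A

    if (kernelBReadsKernelAOutput) {
        // A UAV barrier forces B to wait until A's writes are visible.
        D3D12_RESOURCE_BARRIER barrier = {};
        barrier.Type          = D3D12_RESOURCE_BARRIER_TYPE_UAV;
        barrier.UAV.pResource = sharedUav;
        list->ResourceBarrier(1, &barrier);
    }

    // Without the barrier, kernel B may overlap with kernel A even though
    // both were recorded into the same command list on a single queue.
    list->Dispatch(64, 1, 1);   // kernel B
}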

That's why I'm highly cautioning about reading too much into micro-benchmarks like this. I'm pretty sure I could even write a shim that would make this test run fast on all DX12 hardware regardless of the underlying architecture. Thus as far as this example is concerned, it is just testing how the driver handles an unrealistic workload as much as anything else.
 
OK, one last batch of graphics, partially extracted from the data collected in this thread, from various whitepapers, other benchmarks, and who knows what other sources.

TL;DR
AMD's and Nvidia's architectures are fundamentally different when it comes to command handling.

What do they have in common?
Both AMD and Nvidia have supported concurrent execution of compute shaders alongside workloads from the graphics queue at least since Fermi/GCN 1.0.

All architectures from Nvidia have a "Work Distributor" unit for this purpose, which can handle a number of concurrent (outgoing) command streams. The level of concurrency varies between 6 and 32 command streams, depending on the specific chip.

With AMD, the actual distribution is handled by the ACE (Asynchronous Compute Engine) units. Each ACE unit can handle 64 command streams, while each HWS (Hardware Scheduler) can even handle up to 128 command streams.

Where do they differ?
Since GCN 1.0, AMD's cards have had dedicated command queues for compute and graphics commands. Both the GCP (Graphics Command Processor) and the ACEs can communicate to handle global barriers and to synchronize queues. This feature is present on every single GCN card. All compute queues exposed by the driver are mapped 1:1 onto hardware queues and can be executed asynchronously, that is, without any hardwired order, excluding explicit synchronization points.

With Nvidia, newer cards (GK110, GK208, GMXXX) have a (probably programmable) "Grid Management Unit" in front of the "Work Distributor" which can aggregate commands from multiple queues, including dynamic scheduling and the like, prior to dispatching the workload onto the "Work Distributor".

The "Work Distributor" can only handle a single source of commands, and therefore requires that all queues are mangled into a single queue first, either by the "Grid Management Unit" or by the driver.

This "Grid Management Unit" is currently used to provide a feature called "Hyper-Q" in addition to handling the primary command queue. This feature is similar to the compute queues handled by the ACEs in AMD GPUs in that it can handle a number of truly asynchronous queues, but it lacks certain features such as the support for synchronization points.

What is the result?
So the current state with Nvidia hardware is that all queues used in DX12 are mangled into a single queue in software, whereby the driver has to be rather smart at interleaving commands to maximize the occupation of the "Work Distributor". Ultimately this means that concurrent shaders can finish out of order and new compute shaders can fill the gaps, but execution can only start in the order originally decided by the driver.

Overloading the queues or complex dependency graphs can cause a quite significant CPU load with Nvidia hardware, up to the point where the use of Async shaders even decreases utilization.

AMD's hardware works as expected: all compute queues are executed fully independently, except for explicit synchronization points.
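For reference, an explicit synchronization point between two D3D12 queues looks roughly like the following. This is a hedged sketch with illustrative names, not taken from any of the tests here: the compute queue waits on the GPU for a fence value the direct queue signals, without blocking the CPU.

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Illustrative only: command lists are assumed to be recorded elsewhere.
void SubmitWithGpuSyncPoint(ID3D12Device* device,
                            ID3D12CommandQueue* directQueue,
                            ID3D12CommandQueue* computeQueue,
                            ID3D12CommandList* graphicsWork,
                            ID3D12CommandList* dependentComputeWork)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    directQueue->ExecuteCommandLists(1, &graphicsWork);
    directQueue->Signal(fence.Get(), 1);     // reached once the graphics work completes

    // GPU-side wait: the compute queue stalls until the fence hits 1,
    // but the CPU is free to keep submitting other work.
    computeQueue->Wait(fence.Get(), 1);
    computeQueue->ExecuteCommandLists(1, &dependentComputeWork);
}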


Disclaimer: I don't claim correctness for any of the graphics. Most of the data shown is backed by evidence, but numbers may still be wrong. Please do not distribute these graphics without explicit permission, and refrain from linking directly to this post. I won't be able to correct any errors in this post due to the restrictive edit rules in this forum.
 

Attachments: GCN1_0.png, GCN1_1.png, GCN1_2.png, GF1XX.png, GK10X.png, GK11X.png, GK208.png, GM10X.png, GM20X.png
Just a hint: D3D12 does not allow multiple graphics queues (there is only one graphics queue per adapter node). And if you look at the AMD GPU diagrams, you will notice only one "graphics command processor" (and multiple ACEs).
 
Just a hint: D3D12 does not allow multiple graphics queues (there is only one graphics queue per adapter node). And if you look at the AMD GPU diagrams, you will notice only one "graphics command processor" (and multiple ACEs).
Yes, there is only a single graphics queue per device context when going via the DX12 API. But even then, nothing stops you from acquiring a second device context, even if only by starting another 3D-accelerated application in parallel, in which case the driver starts to interleave the graphics command queues.

There is still only a single command processor, fetching from the merged queue.
 
Ext3h, I think you're spot-on, especially with regard to complex dependencies. When the workloads become more stochastic I think the nVidia hardware may have more difficulty as the latency involved when making reactive decisions to changes in workload may be too high between the GPU and CPU. I wonder if they'd try to stick with their software-based approach with Pascal and beyond, perhaps using an ARM chip on the card itself to reduce that latency.
 
Oh, I noticed that you're showing the grid management unit and work distributor on the hardware itself in nVidia's case. Is this true? I couldn't find any information to confirm or deny whether this was software or hardware side.
 
Oh, I noticed that you're showing the grid management unit and work distributor on the hardware itself in nVidia's case. Is this true? I couldn't find any information to confirm or deny whether this was software or hardware side.
The work distributor is pure hardware, I'm entirely sure about that. This is backed by the greatly varying width of this unit on a per-chip basis. It's also supposed to dispatch wavefronts at a comparably high rate, and it needs to access the SMM/SMX states directly to find under-occupied ones.

The grid management unit is on the GPU as well; that thing also handles MPI on Kepler Tesla cards. But I don't know exactly what that thing actually looks like. It might be a custom microcontroller with a lot of ASIC logic (similar to the ACEs on GCN), but it might also be "just" a regular ARM core. It is very unlikely that it is fully hardwired - implementing MPI in hardware would be a vast waste of resources. Nvidia has carefully avoided disclosing any information on that unit, except for "it exists".

I certainly hope for Nvidia that it already is a regular ARM core or something of similar flexibility, because that would mean they could actually "fix" the cards via firmware update.
Absolute worst case would be if it actually is hardwired.
Second worst case would be if the work distributor had no way of communicating its occupation back to the grid management unit - which isn't unlikely either; there's a chance that it really just accepts "grids" and blocks when fully occupied. In that case, the grid management unit would still be incapable of balancing graphics and compute command queues. This is a possibility, as I haven't found any charts indicating that there is any additional communication channel between these two units, apart from dispatching compute grids / draw calls.

When the workloads become more stochastic I think the nVidia hardware may have more difficulty as the latency involved when making reactive decisions to changes in workload may be too high between the GPU and CPU.
It's actually even worse than that. A lot worse.

The hardware is basically incapable of handling any type of back-pressure at all, as it is. With more than one graphics command queued (including the one in execution), you will possibly stall all compute commands from entering the work distributor. Queuing more than 31 compute commands will, vice versa, block graphics commands from reaching the work distributor.

Within these limits you actually have some type of hardware accelerated, truly asynchronous and concurrent execution.

If you exceed these limits just slightly, you are surrendering yourself to the driver and the embedded heuristics. Which may either result in a success, sequential execution, or a catastrophic failure. Clear tendency towards the last one with increasing complexity. With a bonus chance of triggering fatal race conditions. That was not a joke.
 