DX12 Performance Discussion And Analysis Thread

I'm struggling to see how Nvidia is failing by any sensible metric when graphics + compute completes in 92ms on the GTX 980 Ti and 444ms on the Fury X, or compute only, which is 76ms versus 468ms. AMD, whatever it's doing, is just broken.

Or maybe Fiji is just spending 25.9ms sleeping, then waking up momentarily to execute a kernel that should take about 8 microseconds.
The discrepancy seems too significant to not have been noticed in some other context. The PS4 at least uses asynchronous compute, and if this scaling behavior were universal I think it would have been remarked upon. The latest version actually cut the compute load per kernel, and I am pretty sure they can generate more dispatches in a frame than what is being done here.
I wonder what is being missed here.
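For what it's worth, generating a pile of tiny dispatches and timing them from the CPU is cheap to try yourself. Below is a minimal sketch, assuming you already have a device, a compute PSO and its root signature; the function and parameter names are mine, not the benchmark's. It times a batch of Dispatch(1,1,1) calls on a dedicated compute queue with a fence, which is roughly the kind of number being compared above.

```cpp
// Minimal sketch, not MDolenc's actual test: time how long a batch of tiny
// compute dispatches takes to complete on a dedicated compute queue. Assumes
// the caller already created a device, a compute PSO and its root signature;
// error checking and root parameter setup are omitted for brevity.
#include <windows.h>
#include <d3d12.h>

double TimeDispatchBatch(ID3D12Device* device, ID3D12PipelineState* pso,
                         ID3D12RootSignature* rootSig, UINT dispatchCount)
{
    D3D12_COMMAND_QUEUE_DESC qd = {};
    qd.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;   // compute-only queue (the path the ACEs service on GCN)

    ID3D12CommandQueue* queue = nullptr;
    ID3D12CommandAllocator* alloc = nullptr;
    ID3D12GraphicsCommandList* list = nullptr;
    ID3D12Fence* fence = nullptr;
    device->CreateCommandQueue(&qd, IID_PPV_ARGS(&queue));
    device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE, IID_PPV_ARGS(&alloc));
    device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE, alloc, pso, IID_PPV_ARGS(&list));
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    list->SetComputeRootSignature(rootSig);
    for (UINT i = 0; i < dispatchCount; ++i)
        list->Dispatch(1, 1, 1);                 // lots of very small kernels, like the test
    list->Close();

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    ID3D12CommandList* lists[] = { list };
    queue->ExecuteCommandLists(1, lists);
    queue->Signal(fence, 1);

    HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(1, evt);
    WaitForSingleObject(evt, INFINITE);          // CPU wall clock, so driver/TDR overhead is included
    QueryPerformanceCounter(&t1);

    CloseHandle(evt);
    fence->Release(); list->Release(); alloc->Release(); queue->Release();
    return 1000.0 * double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);  // milliseconds
}
```

If the per-dispatch cost comes out in the tens of microseconds, the hundreds-of-milliseconds results above look even stranger.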

Some of the patches for HSA include discussion of generating the run lists that the schedulers use to pick kernels to launch; does DX12 need a similar intermediate layer of analysis for the ACEs?
The AMD GPU with the simplest front end seems to have the least difficulty with the increasing kernel count; is this because it takes less work to add to queues when there's a smaller problem space to analyze?
Are the vendors consistent in how they are assigning dispatches to queues?

Oddly enough, the forced-synchronous execution mode has a stair-step pattern for GCN that actually matches the known number of simultaneously presented queues for the 8-ACE GPUs, although Tahiti also shows the same stride for some reason.
That case would seem to need the least amount of prior analysis, given how little choice is left to the driver or GPU.

3dilettante: wouldn't it be interesting if active TDR is slowing down these tests...
It does seem like the trend line for Nvidia really flattened in the one case where we know it was off.
I think this needs to be tested both ways for both vendors to see if there is a pattern to this.

Would someone please explain what the differences between the first and the second test are?
Asynchronous compute versus forced-synchronous.
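To illustrate the distinction, here is my own sketch of what the two modes amount to at the API level, not the benchmark's actual code. In the asynchronous case the compute queue's submissions are free to overlap with the graphics queue; forced-synchronous inserts a fence round trip after every submission so each kernel must drain before the next is issued. It assumes an existing device, compute queue, and pre-recorded per-dispatch command lists:

```cpp
// Sketch of "forced-synchronous": a fence round trip after every submission,
// so each kernel must fully drain before the next one is issued. The async
// case simply omits the per-dispatch Signal/Wait and lets the graphics and
// compute queues run their submissions concurrently.
#include <windows.h>
#include <d3d12.h>

void RunForcedSync(ID3D12Device* device, ID3D12CommandQueue* computeQueue,
                   ID3D12CommandList* const* perDispatchLists, UINT count)
{
    ID3D12Fence* fence = nullptr;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    HANDLE evt = CreateEvent(nullptr, FALSE, FALSE, nullptr);

    for (UINT i = 0; i < count; ++i) {
        computeQueue->ExecuteCommandLists(1, &perDispatchLists[i]);
        computeQueue->Signal(fence, i + 1);
        fence->SetEventOnCompletion(i + 1, evt);  // block until this dispatch finishes,
        WaitForSingleObject(evt, INFINITE);       // so nothing is allowed to overlap
    }
    CloseHandle(evt);
    fence->Release();
}
```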
 
GPU usage on the Fury X is odd.

Compute: ~10% the whole time
Graphics only: 40%
Graphics + compute: ~10% the whole time
Graphics + compute, single command list: 80-90% (usage seems stable in Afterburner, though it does spike like in many of the screenshots; the highest usage Afterburner recorded was 91%)

Dunno how to capture them all in screenshots, so I just gave a summary.
 
How many compute queues are being set up with the latest test version?
I recall that it was a large number initially, but is it now one graphics and one compute for the mixed case?
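For reference, this is what the "one graphics queue plus one compute queue" configuration looks like at the API level; I don't know what the latest test version actually creates, so treat this purely as a sketch:

```cpp
// Minimal sketch of a one-graphics + one-compute queue setup in D3D12.
// Error checking omitted; this is not the benchmark's own code.
#include <windows.h>
#include <d3d12.h>

void CreateQueues(ID3D12Device* device,
                  ID3D12CommandQueue** graphicsQueue,
                  ID3D12CommandQueue** computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC gfx = {};
    gfx.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;    // graphics + compute + copy capable
    device->CreateCommandQueue(&gfx, IID_PPV_ARGS(graphicsQueue));

    D3D12_COMMAND_QUEUE_DESC comp = {};
    comp.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute/copy only; the "async" path
    device->CreateCommandQueue(&comp, IID_PPV_ARGS(computeQueue));
}
```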
 
So, I bought a GTX 980 Ti on Monday; it's still boxed since I'm awaiting other parts. What do all these readings and graphs say? Is this async stuff just overblown?
I can still return my card and get an AMD one instead... hell, I'm still running a 7970 Matrix P at the moment.
 
GPU usage on the Fury X is odd.

Compute: ~10% the whole time
Graphics only: 40%
Graphics + compute: ~10% the whole time
Graphics + compute, single command list: 80-90% (usage seems stable in Afterburner, though it does spike like in many of the screenshots; the highest usage Afterburner recorded was 91%)
"Usage" is hard to define, because it's likely to be a number that measures multiple parts of the GPU to determine load. It's also likely to be measuring multiple points across some parts, e.g. the load on each shader engine (there's 4).

For example, the shader engines might literally be busy 1/4 of the time in the compute run, and perhaps core usage only counts for 40% of the entire "usage" metric.
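A toy calculation with made-up weights (this is not how Afterburner or the driver actually blends the number) shows how the reported figure can land at ~10% even while the shader engines are doing real work:

```cpp
// Hypothetical blended "usage" metric: shader engines busy 25% of the time,
// weighted at 40% of the total, with the other blocks (front end, memory,
// etc.) assumed idle. All numbers here are invented for illustration.
#include <cstdio>

int main() {
    const double shaderBusy = 0.25, shaderWeight = 0.40;  // made-up values
    const double otherBusy  = 0.0,  otherWeight  = 0.60;  // other blocks idle
    const double usage = shaderBusy * shaderWeight + otherBusy * otherWeight;
    std::printf("blended usage: %.0f%%\n", usage * 100.0); // prints 10%
    return 0;
}
```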
 
It's the same workload, and not one that is trying something too exotic.
The first test and the latest test have different parameters - they are two different scheduler/simultaneous-command benchmarks. The point isn't that they are technically pushing the same type of work through; it's that they differ in how that work is handled.
 
Thanks, did this. I was then able to finish without a driver crash.

I should note that you should re-enable TDR when you're finished benchmarking, as leaving it disabled greatly increases the chance of system instability. And yes, enabling/disabling it requires a restart. Pro tip: running an infinite loop on your display GPU with TDR disabled is a bad idea.
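For anyone else who needs to do this: the switch is the documented TdrLevel registry value under GraphicsDrivers (TdrDelay, in seconds, is the companion value if you'd rather extend the timeout than disable detection entirely). A rough sketch of flipping it programmatically; regedit works just as well, and either way it needs admin rights and a reboot:

```cpp
// Sketch: set the documented TdrLevel value (0 = timeout detection off,
// 3 = default "detect and recover"). Needs admin rights and a reboot.
#include <windows.h>
#include <cstring>
#include <cstdio>

int main(int argc, char** argv) {
    const bool off = (argc > 1 && std::strcmp(argv[1], "off") == 0);
    DWORD tdrLevel = off ? 0 : 3;

    HKEY key = nullptr;
    if (RegOpenKeyExW(HKEY_LOCAL_MACHINE,
                      L"SYSTEM\\CurrentControlSet\\Control\\GraphicsDrivers",
                      0, KEY_SET_VALUE, &key) != ERROR_SUCCESS) {
        std::fprintf(stderr, "Couldn't open the GraphicsDrivers key (run elevated?)\n");
        return 1;
    }
    const LONG rc = RegSetValueExW(key, L"TdrLevel", 0, REG_DWORD,
                                   reinterpret_cast<const BYTE*>(&tdrLevel),
                                   sizeof(tdrLevel));
    RegCloseKey(key);

    std::printf(rc == ERROR_SUCCESS ? "TdrLevel = %lu, reboot to apply.\n"
                                    : "Failed to set TdrLevel (%lu).\n", tdrLevel);
    return rc == ERROR_SUCCESS ? 0 : 1;
}
```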

MDolenc, did you use the disable timeout flag? I wonder if it completely disables it or just extends the timeout.
 
The discrepancy seems too significant to not have been noticed in some other context. The PS4 at least uses asynchronous compute, and if this scaling behavior were universal I think it would have been remarked upon. The latest version actually cut the compute load per kernel, and I am pretty sure they can generate more dispatches in a frame than what is being done here.
I wonder what is being missed here.
Multiple threads issuing work to the GPU?

The command list run is favourable to AMD. But it isn't exactly pretty, either.
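If it is multi-threaded submission, that's easy enough for a test (or a game) to do: command list recording in D3D12 is free-threaded as long as each thread uses its own allocator and list, and the queue itself is free-threaded too. A rough sketch, assuming an existing device, compute PSO, root signature, and compute queue (all names here are placeholders, not the benchmark's code):

```cpp
// Sketch: record many small dispatch lists on worker threads, then hand them
// to the compute queue in one ExecuteCommandLists call. Root bindings, fence
// wait, and Release() calls are omitted for brevity.
#include <windows.h>
#include <d3d12.h>
#include <thread>
#include <vector>

void RecordInParallel(ID3D12Device* dev, ID3D12PipelineState* pso,
                      ID3D12RootSignature* rootSig, ID3D12CommandQueue* computeQueue,
                      unsigned workers, unsigned dispatchesPerList)
{
    std::vector<ID3D12CommandAllocator*> allocs(workers);
    std::vector<ID3D12GraphicsCommandList*> lists(workers);
    std::vector<std::thread> threads;

    for (unsigned w = 0; w < workers; ++w) {
        dev->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_COMPUTE, IID_PPV_ARGS(&allocs[w]));
        dev->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_COMPUTE, allocs[w], pso, IID_PPV_ARGS(&lists[w]));
        threads.emplace_back([&, w] {
            lists[w]->SetComputeRootSignature(rootSig);   // root parameters omitted
            for (unsigned i = 0; i < dispatchesPerList; ++i)
                lists[w]->Dispatch(1, 1, 1);
            lists[w]->Close();
        });
    }
    for (auto& t : threads) t.join();

    std::vector<ID3D12CommandList*> submit(lists.begin(), lists.end());
    computeQueue->ExecuteCommandLists((UINT)submit.size(), submit.data());
}
```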

Some of the patches for HSA include discussion of generating the run lists that the schedulers use to pick kernels to launch; does DX12 need a similar intermediate layer of analysis for the ACEs?
It may well do, but I don't see how this test is illuminating in that regard. The numbers we're seeing are orders of magnitude wrong.

For example I can complete 128 kernel launches of SGEMM using 32x32 square matrices in 4.5ms with some OpenCL code I have lying around, on Tahiti. 512 kernel launches take 16.7ms.
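Not the SGEMM code being referred to, but for anyone who wants a comparable ballpark, a self-contained launch-overhead microbenchmark looks something like this (first GPU platform/device, trivial kernel, error checking omitted):

```cpp
// Enqueue N trivial kernels and time how long the queue takes to drain.
// OpenCL 1.2 style; clCreateCommandQueue is deprecated in 2.0 but fine here.
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

static const char* kSrc =
    "__kernel void nop(__global float* x) { x[get_global_id(0)] += 1.0f; }";

int main() {
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr);

    cl_int err = CL_SUCCESS;
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, &err);
    clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "nop", &err);

    const size_t n = 1024;                         // tiny per-launch work, like the test
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), nullptr, &err);
    clSetKernelArg(k, 0, sizeof(buf), &buf);

    const int launches = 512;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < launches; ++i)
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clFinish(q);                                   // wait for everything to complete
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%d launches in %.2f ms (%.1f us per launch)\n",
                launches, ms, 1000.0 * ms / launches);

    clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```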

The AMD GPU with the simplest front end seems to have the least difficulty with the increasing kernel count; is this because it takes less work to add to queues when there's a smaller problem space to analyze?
If there are 8 ACEs and all the work in this test is going to a single ACE, maybe there's a problem getting work from a single ACE to the entire GPU? It doesn't seem logical to me, though, to have an architecture with that limitation, since surely this is the most common case.

I doubt there's analysis going on here. I think it's more likely there's a choice of allocation heuristics in the hardware and the driver tries to choose what it thinks is best.
 
Benchmarked: Ashes of the Singularity
Sept. 1, 2015

Right now, we can see that DX12 definitely makes a difference in performance, giving the game developers a lot more power. But with great power comes great responsibility, and some developers may not be able to handle DX12, at least not without more time and effort.

The next fight is shaping up to be Lionhead’s Fable Legends, and that will perhaps be a more neutral battleground as it’s neither an AMD nor an Nvidia title. In fact, it appears Microsoft (who owns Lionhead) is determined to put forth a message that DX12 is unified. Microsoft doesn’t want DX12 to appear as a fractured landscape, one where AMD or Nvidia rules, a place where processor graphics gets left in the dust. In that sense, Fable should be the most likely vendor-agnostic approach to DX12 we’re going to see in the near term. We’re certainly looking forward to testing it, though it may be a few months.

Ultimately, no matter what AMD, Microsoft, or Nvidia might say, there’s another important fact to consider. DX11 (and DX10/DX9) are not going away; the big developers have the resources to do low-level programming with DX12 to improve performance. Independent developers and smaller outfits are not going to be as enamored with putting in more work on the engine if it just takes time away from making a great game.
http://tools69.com/benchmarked-ashes-of-the-singularity/
 
Big developers have the capability to make their own engines; small developers use engines already made. Since most of those engines are moving to DX12 support, it's a matter of how optimized they are for either architecture and which effects they support, no?
Yes, the article says DX12 may require manual optimization to get the highest GPU performance possible, and that's something only big developers will have the resources for. Smaller developers won't have the resources to spend a lot of time optimizing; they may focus more on getting the product to market and spend less time writing GPU optimizations. Maybe an engine like Unreal would be ideal for a small developer, and there are a few others.
 
And generalizing:
Maxwell cards are now also crashing out of the benchmark as they spend >3000ms trying to compute one of the workloads.
 
We are missing Telemundo actors/actresses.

!No hay shaders asíncronos en mi casa! ("There are no asynchronous shaders in my house!")

Damn it, no inverted exclamation mark on my keyboard, lol.
 