I just ran the new AsyncCompute benchmark, results are kinda different from the last one. Could anyone kindly explain my results? If ms are going up, does this mean it's not doing async?
R9 280 OC
Can you point to areas of particular interest?
Uniform results with so many different rigs are hard to come by, and there may be confounding factors since these GPUs are trying to balance requests from elsewhere while running the test.
Some of the long-running tests, like the single command list graphics and compute tests that crash, are operating at a threshold where Windows starts to notice that operations are taking a long time to complete, and it may try to preempt things at inconvenient times. If the GPU takes too long to respond, the driver gets reset.
Without a full restart, some parts of the system might not fully go back to normal after that.
As for the timings, those can vary, and I think this test can be ambiguous.
The evidence used to infer that asynchronous compute is happening is that the time for a batch of compute coupled with a fixed graphics workload is less than the sum of their separate times.
One assumption baked into this is that the graphics workload time, which is listed once up front, is still the right baseline in the later asynchronous tests, where it isn't explicitly reported. Nothing else we're seeing is that consistent, so maybe it isn't always the value we need.
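Roughly speaking, the comparison the whole inference rests on boils down to something like this. Just a sketch of the reasoning, not the benchmark's code, and the numbers are made-up placeholders:

```cpp
#include <cstdio>

// Minimal sketch of the "overlap" inference. The input values are
// placeholders, not anyone's actual results.
int main() {
    double graphicsAlone = 10.0;  // ms, graphics-only time listed once up front
    double computeAlone  = 25.0;  // ms, compute-only time for the same batch size
    double combined      = 28.0;  // ms, graphics + compute submitted together

    double serialEstimate = graphicsAlone + computeAlone;
    if (combined < serialEstimate) {
        printf("overlap: combined %.1f ms < serial %.1f ms (saved %.1f ms)\n",
               combined, serialEstimate, serialEstimate - combined);
    } else {
        printf("no overlap visible: combined %.1f ms >= serial %.1f ms\n",
               combined, serialEstimate);
    }
    // The catch: graphicsAlone is measured once, early, and the async portion
    // of the test never re-measures it, so the serial estimate may not be
    // built on the right baseline.
    return 0;
}
```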
There are a few places in the async test where the reported time is that very long result that seems to reflect the graphics load finishing last, rather than the ms time for a compute dispatch.
That this seems to waver between batches of similar size strongly hints that the compute and graphics portions can move more freely relative to each other, even if the times themselves aren't that great.
There are also a few interesting times listed throughout where the compute kernels do not come in at a near-multiple of the base execution time: instead of something like ~24.7 or ~49.4, there are 30s and 39s, which points to kernels being pushed aside for a while by something else.
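A quick way to see how far off a multiple a given time is. Again just a sketch, with ~24.7 standing in as the base kernel time and the sample values made up to match the pattern above:

```cpp
#include <cmath>
#include <cstdio>

// Distance (in ms) from 'measured' to the nearest integer multiple of 'base'.
double distanceToNearestMultiple(double measured, double base) {
    double nearest = std::round(measured / base) * base;
    return std::fabs(measured - nearest);
}

int main() {
    const double base = 24.7;                       // assumed single-kernel time
    const double samples[] = { 24.7, 49.4, 30.2, 39.1 };
    for (double t : samples) {
        printf("%6.1f ms -> %.1f ms off a multiple of %.1f\n",
               t, distanceToNearestMultiple(t, base), base);
    }
    // The 30s and 39s land well off the multiples, which is what suggests the
    // kernels sat waiting on something else for part of that time.
    return 0;
}
```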
That measures how much simultaneous or concurrent execution is going on, but that is also not entirely the same as being asynchronous, if the GPU is able to pipeline things aggressively. Commands of each type could be kicked off in the same order, and since that process should be very fast compared to the lifetime of the compute and graphics loads, there might be overlap even though their being plucked from the queue is technically being done in a fixed order. Or it could be. I'm not sure if this is permitted in this case, but CPUs do this all the time. I'm not sure about the exact rule for pulling these commands off the queue.
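In D3D12 terms, what I mean is something like this. A fragment that assumes the queues and command lists already exist; the names are made up:

```cpp
#include <d3d12.h>

// Sketch only: submission happens in a fixed order, but with no cross-queue
// fence the GPU is free to overlap the two workloads once both are picked up.
void SubmitBothQueues(ID3D12CommandQueue* directQueue,
                      ID3D12CommandQueue* computeQueue,
                      ID3D12CommandList*  graphicsList,
                      ID3D12CommandList*  computeList)
{
    directQueue->ExecuteCommandLists(1, &graphicsList);    // kicked off first
    computeQueue->ExecuteCommandLists(1, &computeList);    // kicked off second

    // Only an explicit fence wait would pin the execution order, e.g.:
    // computeQueue->Wait(fence, valueSignaledAfterGraphics);
}
```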
So, usually, the more performance margin we have, the less likely it is that random stalls and overlap will hide the pattern. I do not think this test is complex enough to defeat all the ways the GPU is going to try to maintain utilization, and since the kernels do not depend on each other's results, there may be optimizations going on in the background that differ from what an engine would do when the results of the compute are needed for something else.
One of the things we don't get right now with this method is how asynchronous compute can work around stalls in one queue or the other, since nothing is waiting on anything.
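For contrast, a dependency that would actually make one queue wait looks roughly like this. Same caveats as before: a sketch with made-up names, not anything the benchmark does:

```cpp
#include <d3d12.h>

// Sketch only: the graphics queue stalls until the compute queue signals the
// fence. That is the situation where keeping the other queue busy actually
// matters, and the benchmark has no edge like this in it.
void SubmitWithDependency(ID3D12CommandQueue* computeQueue,
                          ID3D12CommandQueue* directQueue,
                          ID3D12CommandList*  computeList,
                          ID3D12CommandList*  graphicsList,
                          ID3D12Fence*        fence,
                          UINT64              fenceValue)
{
    computeQueue->ExecuteCommandLists(1, &computeList);
    computeQueue->Signal(fence, fenceValue);       // compute result ready

    directQueue->Wait(fence, fenceValue);          // graphics stalls here
    directQueue->ExecuteCommandLists(1, &graphicsList);
}
```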
If there were a way to get absolute timestamps for when specific commands came off their queues, it would be a less ambiguous scenario.
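D3D12 does expose per-queue timestamp queries plus a way to line the GPU clock up against the CPU clock, so something along these lines could in principle give that. Sketch only, with all the query-heap/readback plumbing assumed to exist:

```cpp
#include <d3d12.h>

// Map GPU timestamps from each queue onto one shared (CPU) timeline so the
// graphics and compute results become directly comparable.
struct QueueClock {
    UINT64 gpuTicks;   // GPU counter at calibration time
    UINT64 cpuTicks;   // QPC counter at the same instant
    UINT64 frequency;  // GPU ticks per second for this queue
};

QueueClock CalibrateQueue(ID3D12CommandQueue* queue)
{
    QueueClock c{};
    queue->GetTimestampFrequency(&c.frequency);
    queue->GetClockCalibration(&c.gpuTicks, &c.cpuTicks);
    return c;
}

// Record a timestamp into slot 'index' of an existing query heap; resolve and
// read back the heap elsewhere.
void StampTimestamp(ID3D12GraphicsCommandList* list,
                    ID3D12QueryHeap* heap, UINT index)
{
    list->EndQuery(heap, D3D12_QUERY_TYPE_TIMESTAMP, index);
}

// Convert a GPU timestamp read back from the heap into seconds on the CPU
// timeline. qpcFrequency comes from QueryPerformanceFrequency.
double ToCpuSeconds(UINT64 gpuStamp, const QueueClock& c, UINT64 qpcFrequency)
{
    double gpuDelta = double(gpuStamp) - double(c.gpuTicks);
    return double(c.cpuTicks) / double(qpcFrequency) + gpuDelta / double(c.frequency);
}
```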
GCN's behavior and the descriptions of how it is being used are strong enough to say it can do all this, so in a way this is more of a sanity check that the test results are consistent with that.