DX12 Performance Discussion And Analysis Thread

still under "theory", nV hardware can do async in hardware up to a certain point with less latency than AMD hardware, but when stressed after that point its kinda like this

cliff.jpg

:LOL:
 
Looks like Nvidia got heavily CPU limited in the Fable benchmark this time. Even at 4k.

And no, the game doesn't really make proper use of async compute at all. Only about 5% (time-wise) of the workload has been offloaded to a dedicated compute queue. I've seen the GPUView dumps of Nvidia and AMD runs. No draw call overload, backpressure only in the graphics queue, no more than a single compute command every few graphics batches, and only copy commands were ever issued asynchronously.

So it looks essentially the same as it would have with DX11: a perfectly safe, well-optimized tech demo, where the only DX12 benefit left is the reduced driver overhead. And even that isn't true for Nvidia.
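For anyone trying to picture what those queues look like from the application side, here is a minimal D3D12 sketch, purely illustrative and not taken from Fable (all names are made up): the compute and copy queues are exactly what show up as separate lanes in GPUView, and anything submitted to them is asynchronous by API contract.

Code:
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

int main()
{
    // Default adapter, feature level 11_0 is the D3D12 minimum.
    ComPtr<ID3D12Device> device;
    D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_11_0, IID_PPV_ARGS(&device));

    // Helper: create a queue of the given type (names are illustrative).
    auto makeQueue = [&](D3D12_COMMAND_LIST_TYPE type) {
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type = type;
        ComPtr<ID3D12CommandQueue> queue;
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
        return queue;
    };

    // One lane each in GPUView: graphics ("3D"), compute and copy.
    auto graphicsQueue = makeQueue(D3D12_COMMAND_LIST_TYPE_DIRECT);
    auto computeQueue  = makeQueue(D3D12_COMMAND_LIST_TYPE_COMPUTE);
    auto copyQueue     = makeQueue(D3D12_COMMAND_LIST_TYPE_COPY);

    // Work submitted to computeQueue/copyQueue is asynchronous by API
    // contract; whether it overlaps with the graphics queue on the GPU
    // is up to the hardware and driver.
    return 0;
}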
 
Anyone feel like generously giving me an update on what's going on in this thread, in baby language? So I can understand it and all?
A great start would be this post from Ext3h.


It is absolutely running asynchronously.
It's running asynchronously where it's supported. "Async Compute" isn't a mandatory DX12 "flag".


The question is whether it is running CONCURRENTLY.
"Async Compute" is the ability to start rendering and compute tasks at the same time, throughout the ALUs. If it's not running concurrently, there's no "Async Compute" happening.


Please, Please, PLEASE pay attention to the words you use in technical discussion.

What the hell does this even mean?! I was just plain and simple called "obtuse" a couple of posts ago and I'm the one needing to pay attention to my words?! Is dogpiling a thing now on B3D?!


No draw call overload, backpressure only in the graphics queue, no more than a single compute command every few graphics batches, and only copy commands were ever issued asynchronously.
You could almost say it's a DX12 implementation tailored for nVidia GPUs, then...
Not that I expected any less from Tim Sweeney, though. :(
 
Looks like Nvidia got heavily CPU limited in the Fable benchmark this time. Even at 4k.
Which reviews show Nvidia being CPU-limited at 4K? There seems to be evidence to the contrary, since Anandtech's results generally show no sensitivity to CPU choice until 720p, and Techreport's factory-overclocked 980 Ti is demonstrably faster relative to Fury than other reviews with stock cards.
 
Which reviews show Nvidia being CPU-limited at 4K? There seems to be evidence to the contrary, since Anandtech's results generally show no sensitivity to CPU choice until 720p, and Techreport's factory-overclocked 980 Ti is demonstrably faster relative to Fury than other reviews with stock cards.
Extremetech accidentally managed to throttle the CPU to 1.7GHz by choosing the wrong power profile, and that resulted in the Fury X outranking the 980 Ti even at 4K and 1080p. At 720p, the Fury X took only a 2% performance hit from the reduced clock speed; the 980 Ti lost about 30%.

Not fair, I know. And not intended either. But still surprising.

Bear in mind that Extremetech was also using a Haswell-E CPU with 20MB of L3 cache, so that thing is a beast when it comes to hiding CPU-related latencies, as it suffers from virtually no L3 cache misses at all. That was probably also the reason why, once they restored the clock speed, they had the only 720p run where the 980 Ti could actually beat the Fury X.
 
I wouldn't put too much stock in the ExtremeTech review ....
 
Me neither. I know it's faulty.

But it yielded some nice evidence in the 720p run. They got the 980 Ti to perform both worse than everyone else and better than everyone else. The almost 160 FPS at 720p with a stock-clocked 980 Ti is just as surprising.


But also bear in mind that MS demanded Fable only be tested in three profiles: 1080p and 4K at full detail, and 720p at minimum detail. So the 720p run may not be representative at all; nobody knows what got changed in that run.
 
Extremetech accidentally managed to throttle the CPU to 1.7GHz by choosing the wrong power profile, and that resulted in the Fury X outranking the 980 Ti even at 4K and 1080p. At 720p, the Fury X took only a 2% performance hit from the reduced clock speed; the 980 Ti lost about 30%.
So the 980 Ti is CPU-limited when the CPU is massively downclocked and the resolution is at 720p.

Where should I be looking for the rankings changing at 4K between the 980 Ti and Fury X?
 
Those results are AMD's PR-provided results, not results compiled by ExtremeTech.
I must say I am not a fan of comparing stock NVIDIA to stock AMD, as their sales models/channels seem to be a bit different: NVIDIA gives its partners greater flexibility to differentiate from the reference design in terms of noise, heat design and, importantly, clocking headroom. ExtremeTech used a stock reference 980/980 Ti, and let's be honest, only very early adopters should have those, as they are not as good as the slightly later AIB cards.
Maybe reviewers should use one or two board partners that build both AMD and NVIDIA cards, say ASUS and MSI. Still not ideal, but at least it would be closer to an optimum design for both without going to the extreme.

I am shocked they reported performance from AMD PR directly for the 390 and 380; ironically, that performance would put the 390 around the Nano at pcgameshardware.de: http://www.pcgameshardware.de/DirectX-12-Software-255525/Specials/Spiele-Benchmark-1172196/
Still, the 390X is looking good in all tests so far from various sites.

Cheers
 
So the 980 Ti is CPU-limited when the CPU is massively downclocked and the resolution is at 720p.

Where should I be looking for the rankings changing at 4K between the 980 Ti and Fury X?
I don't know. They are not online any more. Maybe they were just a fluke. Now both graphs show them ranked evenly, and oddly enough, both seem to have received a performance boost at 1080p, which points to some common CPU limit. Perhaps particle physics.


But the CPU limit isn't only there when downclocked; it just became obvious then. It's even there at regular clocks on an i7-4960X (Anandtech). Still only at 720p, sure, but it is there. Only an entirely oversized i7-5960X (costing twice as much as the GPU) could push the CPU limit far enough out to let the 980 Ti outperform the Fury X.

While AMD, for once, did not have a CPU limit at all at that resolution.

Draw your own conclusions.
 
I must say I am not a fan of comparing stock NVIDIA to stock AMD, as their sales models/channels seem to be a bit different: NVIDIA gives its partners greater flexibility to differentiate from the reference design in terms of noise, heat design and, importantly, clocking headroom. ExtremeTech used a stock reference 980/980 Ti, and let's be honest, only very early adopters should have those, as they are not as good as the slightly later AIB cards.

Agreed. Especially for the GTX 980, stock-clocked models are incredibly hard to find nowadays.
 
But the CPU limit isn't only there when downclocked; it just became obvious then.
If it's not obvious, there's little justification in saying Nvidia is limited by it. Nobody is claiming that something whose influence is a second-order effect next to a more dominant bottleneck cannot have some impact.
If an artificial case of a downclock to a specific and non-representative speed is sufficient to indict one vendor, I have bad news for both when I require a downclock to 1 MHz.

It's even there at regular clocks on an i7-4960X (Anandtech). Still only at 720p, sure, but it is there.
That makes it applicable to a claim of being CPU-limited at that resolution, although given the vast gulf in capability between an i7 and an i3, saying it is CPU-limited may not be fully accurate without more elaboration.

Only an entirely oversized i7-5960X (costing twice as much as the GPU) could push the CPU limit far enough out to let the 980 Ti outperform the Fury X.
While AMD, for once, did not have a CPU limit at all at that resolution.
AMD's performance was sensitive to changes in CPU choice, just not in a manner that was intuitive.

Draw your own conclusions.
One vendor has a higher CPU dependency, although in absolute terms it requires a significant drop in CPU performance to make it clear.
 
Looks like Nvidia got heavily CPU limited in the Fable benchmark this time. Even at 4k.

And no, the game doesn't really make proper use of async compute at all. Only about 5% (time-wise) of the workload has been offloaded to a dedicated compute queue. I've seen the GPUView dumps of Nvidia and AMD runs. No draw call overload, backpressure only in the graphics queue, no more than a single compute command every few graphics batches, and only copy commands were ever issued asynchronously.

So it looks essentially the same as it would have with DX11: a perfectly safe, well-optimized tech demo, where the only DX12 benefit left is the reduced driver overhead. And even that isn't true for Nvidia.

Is this true? That's bad if so. It would mean they left the real benefits for the Xbox One and took it down a notch for PC.

Nvidia....

What are the chances AMD can have their driver force compute shaders to be run asynchronously concurrently...
 
Someone mentioned that the Extremetech results were provided by AMD. I want to provide the full context of that quote for clarity. It doesn't make much sense for a 390 to beat a stock 980 without good use of async compute. Not what we would expect, but it seems to be the case based on the quote below.

Why include AMD results?
In our initial coverage for this article, we included a set of AMD-provided test results. This was mostly done for practical reasons — I don’t actually have an R9 390X, 390, or R9 380, and therefore couldn’t compare performance in the midrange graphics stack. Our decision to include this information “shocked” Nvidia’s PR team, which pointed out that no other reviewer had found the R9 390 beating the GTX 980.

Implications of impropriety deserve to be taken seriously, as do charges that test results have misrepresented performance. So what’s the situation here? While we may have shown you chart data before, AMD’s reviewer guide contains the raw data values themselves. According to AMD, the GTX 980 scored 65.36 FPS in the 1080p Ultra benchmark using Nvidia’s 355.98 driver (the same driver we tested). Our own results actually point to the GTX 980 being slightly slower — when we put the card through its paces for this section of our coverage, it landed at 63.51 FPS. Still, that’s just a 3% difference.

It’s absolutely true that Tech Report’s excellent coverage shows the GTX 980 beating the R9 390 (TR was the only website to test an R9 390 in the first place). But that doesn’t mean AMD’s data is non-representative. Tech Report notes that it used a Gigabyte GTX 980, with a base clock of 1228MHz and a boost clock of 1329MHz. That’s 9% faster than the clocks on my own reference GTX 980 (1127MHz and 1216MHz respectively).

Multiply our 63.51 FPS by 1.09x, and you end up with 69 FPS — exactly what Tech Report reported for the GTX 980. And if you have an NV GTX 980 clocked at this speed, yes, you will outperform a stock-clocked R9 390. That, however, doesn’t mean that AMD lied in its test results. A quick trip to Newegg reveals that GTX 980s ship in a variety of clocks, from a low of 1126MHz to a high of 1304MHz. That, in turn, means that the highest-end GTX 980 is as much as 15% faster than the stock model. Buyers who tend to buy on price are much more likely to end up with cards at the base frequency; the cheapest EVGA GTX 980 is $459, compared to $484 for the 1266MHz version.

This highlights that benchmark results need more information than just the name of the GPU; at the very least, clock frequencies should be mentioned. It's one of those really annoying things about pulling data from benchmarks.
 
Is this true? That's bad if so. It would mean they left the real benefits for the Xbox One and took it down a notch for PC.
I don't recall seeing a DX11 vs DX12 comparison, so saying that there is no reduction in driver overhead for Nvidia is a dubious assertion.
There's no requirement that implementations become magically equal with DX12.

What are the chances AMD can have their driver force compute shaders to be run asynchronously concurrently...
If, and this is an if, the explicitly listed compute category is asynchronous compute, then we can see the overall contribution it makes to frame time. It could go to zero ms and the overall picture would only change a little.
That's only a small slice of the ~33ms frame time.
Even Ashes of the Singularity was noted, after the kerfuffle started, to not seriously push the envelope there either.
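As a rough back-of-the-envelope bound, taking the ~5% (time-wise) figure from the GPUView analysis earlier in the thread and the ~33 ms frame time: 0.05 × 33 ms ≈ 1.7 ms, so even if all of that compute overlapped perfectly and cost nothing, the frame would only shrink from ~33 ms to ~31.3 ms, i.e. from roughly 30 FPS to roughly 32 FPS. Around a 5% gain at best.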


Someone mentioned that the Extremetech results were provided by AMD. I want to provide the full context of that quote for clarity. It doesn't make much sense for a 390 to beat a stock 980 without good use of async compute.
There are possibly hundreds of reasons why things could go one way or the other.
For one thing, numbers provided by AMD purporting a lead for a card, a lead that is not reflected by reviews, actually do make sense in view of what has already happened.
 
There are possibly hundreds of reasons why things could go one way or the other.
For one thing, numbers provided by AMD purporting a lead for a card, a lead that is not reflected by reviews, actually do make sense in view of what has already happened.

I think the ET article explains it well. The figures provided by AMD are supported by a review (apparently just one review actually looked at the 390). AMD's result for the 980 was actually more favorable than ET's own result. Taking the clock differences into account, you get your explanation. If the question is which version of the 980 should be used... who knows. Use the reference? Use the fastest? OC it to 2GHz?

From what I have seen, it doesn't seem like the game is making much use of asynchronous compute, but I'll have to read more. I am seeing claims that Lionhead has not ported it over from the Xbox One yet. They never did demonstrate it on PC, even though they had a demonstration on a 980 in the past that showed other DX12 features.

This benchmark might not belong here.
 
The Work Distributor in Kepler can communicate in both directions, but I don't know at which protocol level (in other words, there may be only very limited backward communication). Furthermore, I'm pretty sure it's not an ARM core.

Edit: Fermi's was not bidirectional. So there is a change since Kepler for all the series above (e.g. GTX 680, 750, 960 and so on). Maybe that's what is described by CC 3.0.
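If anyone wants to check where a given card falls on that CC 3.0 dividing line, here's a quick sketch using the CUDA runtime API (illustrative only; it obviously says nothing about the Work Distributor's protocol):

Code:
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // Fermi reports CC 2.x; Kepler and later report CC >= 3.0.
        bool keplerOrLater = prop.major >= 3;
        std::printf("%s: CC %d.%d (%s)\n", prop.name, prop.major, prop.minor,
                    keplerOrLater ? "Kepler or later" : "Fermi or earlier");
    }
    return 0;
}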
 