DX12 Performance Discussion And Analysis Thread

What do consoles have to do with that? ... Do you know how many "console" games run like sh... on AMD PC hardware? What conclusion can we draw from games that run well on consoles (with AMD GPUs and processors) and run like sh... on AMD PC GPUs?
Because most games are developed with the console market in mind first, and because consoles have greater limitations in hardware capability (albeit with a better integrated memory system), a lot of effort goes into squeezing as much performance out of them as possible.
Are you aware of how many games have been "downgraded" from the initial build shown and developed on PC by the time the final game was released for all platforms, including PC?
Are you aware of how The Witcher 3 changed its render engine (it looked great originally on PC)? That is why there were complaints about it being downgraded - again, this ties in with consoles.
The trend and list of how game performance has shifted from NVIDIA to AMD is not speculation, nor is it speculation how the different revisions from Kepler to Maxwell are affected by this; note it is not that NVIDIA has stopped supporting Kepler performance, but how games are implemented these days, with the primary focus on consoles and AMD's criteria.
And it is interesting how some very recent AAA games work better with the older Hawaii GCN architecture than with Fiji, and in others Hawaii still comes very close to Fiji's performance - yes, I appreciate there are multiple reasons for this.

Anyway, the case in point I mentioned is the alpha test of Doom, which has not had any work done to improve performance for specific GPU architectures on PC, and again AMD came out comfortably on top - Bethesda has mentioned that this is currently more of a console-optimised state on PC.

This is not bashing AMD; in fact they made a great call, and took a real risk with the limited cash they had, to capture the console market. I feel the 390/390X (with their better dynamic power regulation and management compared to the 290/290X equivalents) are probably the best mid-tier cards around at the moment.

More of a soft win initially, as margins are tighter with consoles, but it has a longer-term return when you consider game development.

Cheers
 
Well, the problem with that is that the past Xbox was also AMD-based (CPU and graphics), and that didn't make a difference where PC gaming was concerned. And of course the API of choice for the PS4 is different, so people making a PC version by porting from the PS4 will have to switch to DX.
 
More consistent frame pacing in DX12 RotTR on my system. Average frames are about the same, but it feels a lot better to play (higher minimum frames). Some flickering issues, and nothing extraordinary compared to DX11, but certainly a welcome addition. If Fraps worked with DX12 I would have made a frame-time comparison between the two, but I can't get it to work.
 
Well, the problem with that is that the past Xbox was also AMD-based (CPU and graphics), and that didn't make a difference where PC gaming was concerned. And of course the API of choice for the PS4 is different, so people making a PC version by porting from the PS4 will have to switch to DX.
Although some Sony staff have mentioned there are strong similarities between DX12 and their own low-level API, the point is that developing for DX12 also means developers need to understand the hardware much better for optimisation, and that again works in AMD's favour in the context I'm presenting.
TBH, many cross-platform AAA games do focus on both the PS4 and Xbox as the priority; you can see this in how many studios outsource the PC port, which has been a disaster more recently.
Cheers
 
More consistent frame pacing in DX12 RotTR on my system. Average frames are about the same, but it feels a lot better to play (higher minimum frames). Some flickering issues, and nothing extraordinary compared to DX11, but certainly a welcome addition. If Fraps worked with DX12 I would have made a frame-time comparison between the two, but I can't get it to work.
Ryan Shrout wrote a great article about work by Intel and a colleague of Andrew Lauritzen on accurately monitoring performance in DX12/UWP games:
http://www.pcper.com/reviews/Graphics-Cards/PresentMon-Frame-Time-Performance-Data-DX12-UWP-Games
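If anyone wants to turn the CSVs that PresentMon writes into a quick average/worst frame-time summary, something along these lines works. It is only a sketch, and the MsBetweenPresents column name is an assumption based on how I remember the tool's CSV output, so double-check it against your own files:

/* sketch: summarize frame times from a PresentMon CSV (column name assumed) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <presentmon.csv>\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "r");
    if (!f) { perror("fopen"); return 1; }

    char line[4096];
    int col = -1;
    double sum = 0.0, worst = 0.0;
    long frames = 0;

    /* header row: locate the frame-time column (assumes no empty fields,
       since strtok collapses consecutive commas) */
    if (fgets(line, sizeof line, f)) {
        int i = 0;
        for (char *tok = strtok(line, ",\r\n"); tok; tok = strtok(NULL, ",\r\n"), ++i)
            if (strcmp(tok, "MsBetweenPresents") == 0) col = i;
    }
    if (col < 0) { fprintf(stderr, "MsBetweenPresents column not found\n"); return 1; }

    /* data rows: accumulate average and worst-case frame time */
    while (fgets(line, sizeof line, f)) {
        int i = 0;
        for (char *tok = strtok(line, ",\r\n"); tok; tok = strtok(NULL, ",\r\n"), ++i) {
            if (i == col) {
                double ms = atof(tok);
                sum += ms;
                if (ms > worst) worst = ms;
                ++frames;
                break;
            }
        }
    }
    fclose(f);

    if (frames > 0)
        printf("%ld frames, avg %.2f ms, worst %.2f ms\n", frames, sum / frames, worst);
    return 0;
}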

Cheers
 
Finally a good tool that everyone can use (GPUView is not so mainstream, after all). All we need now is for all IHVs to publish drivers conforming to the correct behaviour in the different presentation modes (I am especially looking at you, discrete graphics card OEMs).
 
Some time ago I read that the 660M would have full DX12 support. So I was happy, until I tested DX12 mode in Rise of the Tomb Raider on my Lenovo Y580 laptop with the latest driver. No difference between DX11 and DX12. Not a single fps.

What is it? Bad drivers, or is my 660M missing something at the hardware level?
 
Some time ago I read that the 660M would have full DX12 support. So I was happy, until I tested DX12 mode in Rise of the Tomb Raider on my Lenovo Y580 laptop with the latest driver. No difference between DX11 and DX12. Not a single fps.

What is it? Bad drivers, or is my 660M missing something at the hardware level?

From the benchmarks I've seen, the only actual benefit was maybe a slightly higher minimum FPS, and only for a few people.
The average was mostly down. Not really the best DX12 example out there.
 
I can confirm after playing DX12 RotTR for about two hours: GPU-bound cases see a drop in performance, CPU-bound cases see increased performance. Overall, higher minimum frames on my system and a more consistent framerate/frametime, although I'm only guessing about that last part; I haven't tried PresentMon yet.

CPU: i5 3570K OC'd to 4.4 GHz. GPU: GTX 970 OC'd to 1493 MHz core / 7900 MHz memory.
 
Sorry for the self-quote, but to follow up on that old sub-argument in the thread relating to memory performance on Fiji vs. Hawaii:
What I saw in earlier OpenCL tests was that Fiji only pulls away from Hawaii in large-block copy. Read and write, OTOH, were even a bit slower comparing the Fury X to the R9 390X. This was with very early drivers in July last year, so it could have improved since then. I need to re-test at some point. Or it could be some peculiarity of OpenCL or its driver itself, I don't know.

I've now gotten around to re-doing the tests with recent drivers and compiling the results. The program used may or may not be representative, but it is ONE data point. So don't over-interpret it; just take it for what it's worth - another small piece of the puzzle which might fit somewhere.
edit: FWIW, similar comparisons between GM204 and GM200 see up to ~50% higher performance for the latter in read/write-only modes as well, not only copy.

[chart: OpenCL read/write/copy throughput over transfer size, R9 390X vs. Fury X]


There are way too many games where a full Fiji can hardly put any distance between itself and a full Hawaii, which suggests that Fiji's substantially higher CU count is mostly standing idle and the chip may be sitting on a geometry bottleneck.
Maybe (with a big MAY) it's also because HBM (or its implementation/servicing memory controllers) does not deliver as universally as the raw bandwidth numbers indicate.


The strange thing about the PCGH graphs is that the red line does not drop to the stated min fps value.
Computerbase's numbers show a solid gain on the FX-3870; maybe Nvidia does not yet know how to put an i7 to good use under DX12.
What do you mean by the red line? Maybe I can clarify.
 
The results for the Copy operations are odd. I'd guess that to explain them, we would have to look at the instructions the compiler emitted. I suspect the copy didn't even pass through the CU/L1, but was handled straight away by the memory controller.

Which would mean the effective bandwidth available to the shaders is significantly lower than the specified peak bandwidth.

There is one line missing from that chart which would allow us to dismiss that, and that is "copy" with simple integer arithmetic / a bit shift applied, as that should pretty much reflect the bandwidth effectively available to shaders. Just applying a bitmask is possibly not sufficient to force a transfer through L1.
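To illustrate what I mean by that missing line - purely a sketch, not the benchmark's actual code, and the names and launch sizes are made up:

/* copy variant with trivial integer work per element, so the data has to be
   pulled into the CU (vector registers / L1) rather than being serviceable as
   a plain memory-to-memory move */
__kernel void copy_with_alu(__global const uint *src, __global uint *dst)
{
    size_t gid = get_global_id(0);
    uint v = src[gid];      /* load lands in the CU */
    v ^= (v >> 1);          /* cheap shift+xor so the load can't be elided */
    dst[gid] = v;           /* store the transformed value back out */
}

The throughput of that, counted the same way as the existing copy line, is what I would call the bandwidth effectively available to shaders.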
 
@CarstenS Is that COPY number actually the number of bytes copied, or is it the theoretical memory bandwidth, computed as the sum of the equivalent read and write ops?

Given that it goes beyond 400 GB/s, it looks like the latter, as that is pretty close to the theoretical peak of 512 GB/s.

But it could still be the effective throughput, assuming a few tricks.
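To make the two interpretations concrete, a toy calculation with made-up numbers (not read off the chart):

#include <stdio.h>

int main(void)
{
    double bytes_copied = 8.0e9;   /* illustrative payload: 8 GB moved src -> dst */
    double seconds      = 0.033;   /* illustrative kernel time                    */

    /* interpretation 1: count the payload once */
    double payload_rate = bytes_copied / seconds / 1e9;

    /* interpretation 2: count read + write traffic, i.e. 2x the payload -
       only this convention can push the score up towards the 512 GB/s peak */
    double traffic_rate = 2.0 * bytes_copied / seconds / 1e9;

    printf("payload: %.0f GB/s, read+write traffic: %.0f GB/s\n",
           payload_rate, traffic_rate);
    return 0;
}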
 
Maybe (with a big MAY) it's also because HBM (or its implementation/servicing memory controllers) does not deliver as universally as the raw bandwidth numbers indicate.
Those results are strange because, according to Anandtech's @Ryan Smith, the ROPs in Fiji are much more efficient than Hawaii's because of color compression and substantially higher bandwidth.
With the same ROP count and similar clocks, the Fury X gets a 48% higher texel fillrate and a 240% higher pixel fillrate in 3DMark Vantage. It's a huge difference, and I doubt it comes from color compression alone.
 
Maybe (with a big MAY) it's also because HBM (or its implementation/servicing memory controllers) does not deliver as universally as the raw bandwidth numbers indicate.
While the transfer size still appears to fit in on-die storage, the throughput keeps the 390X and Fury X roughly equivalent, with a consistent small shortfall for Fury.
Fury doesn't start to pull away in the copy test until after 2 MB, i.e. the size of the L2.
The unidirectional tests don't give Fury that outcome.

The copy test might be where HBM's higher raw channel count and bandwidth can help, in terms of having more capacity to queue transactions and possibly more room to handle read/write turnaround.
At the smaller strides it looks as though there is more overhead with Fury, and the 1 MB of extra L2 doesn't seem to do much to help (or there's a correlation where Fury matches the 390 until the L2 is wholly exhausted).


The results for the Copy operations are odd. I'd guess that to explain them, we would have to look at the instructions the compiler emitted. I suspect the copy didn't even pass through the CU/L1, but was handled straight away by the memory controller.
There's no GCN instruction that does that, as far as I can recall. And if the controller could do that, wouldn't there be more capacity with 2x the channels, and shouldn't we see a gain from it?
 
Ext3h: Sadly, the program used does not allow for any variables. I doubt, though, that Copy is simply the sum of read and write. Reason: the R9 390X is faster in read and write, yet slower in copy.

ToTTenTranz: Texel fill obviously benefits from the higher (+45%) number of TMUs in Fiji, so I'll dismiss that number. Pixel fillrate, OTOH, is remarkably strong (it's 140% higher, btw, i.e. 240% as high), albeit you can see from the difference between the HD 7970 and R9 285 that framebuffer compression makes quite a bit of a difference here. Assuming we are limited by ROP throughput in Vantage's pixel fill test, that's ~43% clock for clock due to better ROPs. Also, Fiji's doubled L2 cache (in total and per MC) should help a bit. You might want to compare Damien's numbers on Fiji (pixels with blending): http://www.hardware.fr/articles/937-4/performances-theoriques-pixels.html
FWIW, at least 3DMark, and I think Damien's test as well, use the graphics pipeline for memory access. Might it just be a not fully optimized compute path? Or the more random nature of compute kernels? I don't know for sure yet.

MDolenc: Yes, it uses compute via OpenCL. And probably not a very optimized access pattern, to boot. It's the ancient OpenCL-Benchmark v1.1.

--
edit: FWIW, similar comparisons between GM204 and GM200 see up to ~50% higher performance for the latter in read/write-only modes as well, not only copy.
 
Ext3h: Sadly, the program used does not allow for any variables. I doubt, though, that Copy is simply the sum of read and write. Reason: the R9 390X is faster in read and write, yet slower in copy.
What I meant is that it is either the actual throughput (bytes copied), or that number x2 to signify that a copy operation (theoretically) means double the bandwidth.

It should be the latter, as it is "faster" than the individual write and read speeds, but I don't know for sure.

We know neither how the read and write ops were prevented from being optimized away, nor whether the data was compressible, nor how many kernel instances were invoked in parallel, nor how the buffers were traversed. That's already a lot of unknowns for such a small benchmark.

Can you post the kernel source as a reference for that chart?
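In the meantime, here is roughly the shape such kernels usually take - definitely not the benchmark's actual source, just a sketch to make the unknowns above concrete (the names, the vector width and the reduction trick are all assumptions):

/* write-only: pure streaming stores */
__kernel void write_bw(__global uint *dst, uint value)
{
    dst[get_global_id(0)] = value;
}

/* read-only: each work-item reads several elements and must consume them,
   otherwise the compiler is free to drop the loads entirely; the single
   small store per work-item keeps the write traffic negligible */
__kernel void read_bw(__global const uint4 *src, __global uint *sink,
                      uint elems_per_item)
{
    size_t gid = get_global_id(0);
    uint acc = 0;
    for (uint i = 0; i < elems_per_item; ++i) {
        uint4 v = src[gid * elems_per_item + i];
        acc += v.x + v.y + v.z + v.w;
    }
    sink[gid] = acc;
}

/* copy: one read plus one write per element - whether the chart counts those
   bytes once or twice is exactly the open question above */
__kernel void copy_bw(__global const uint *src, __global uint *dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = src[gid];
}

How the global size, the strides and the data pattern were chosen then decides how much of the theoretical bandwidth any of these can actually reach.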
 