DX12 Performance Discussion And Analysis Thread

Time for a little summary of the entire topic.

To say it up front:
The benchmark provided by @MDolenc and the plots provided by @Nub were a great help. Even though they did not prove what they were originally intended to, they still provided insight into other, previously unpublished implementation details of both GCN and Nvidia's architectures. This helped a lot in gaining insight into the actual capabilities of Kepler and Maxwell, far beyond anything Nvidia has officially published on these architectures.

It has also revealed something different and less fortunate: just about every tech review site reporting on this topic made mistakes, and not exactly small ones. This ranges from pulling strange figures from anonymous quotes without any attribution, to quoting out of context from public discussions (like this thread), to picking random, poorly understood specs from either vendor and putting them up against each other just because the names sound similar. There was also a lot of incorrect quoting from other tech review sites, without checking their sources.

In some cases even vendor-published papers were plain wrong or contradictory, or referred to chip versions that were never released in that form, or to incomplete API drafts which would have been compatible with the hardware while the final API version wasn't. And there were a lot of "equivalent models" which omitted some less fortunate implementation details.

Or to put it in different words: You can't trust anyone. Most of the information spread by online and print magazines on this topic has proven to be wrong or without foundation.

I can't even guarantee the correctness of my own analysis. It is based on an extrapolated hardware model (rather than just an aggregation of random numbers) which was capable of predicting non-synthetic results quite well, but it might still be wrong.

Text part first: http://ext3h.makegames.de/DX12_Compute.html

Updated graphs on Nvidia's and AMD's architectures may follow later. I still have to get confirmation on some details first.
 
What about GK208?
Same as GK110?
GK208 is gimped beyond reason. It supposedly has a GMU, but no unlocked execution slots yet. The work distributor is only 6 grids wide, which means sequential execution for BOTH draw calls and compute grids while in 3D mode. It is arguably one of Nvidia's cards that would profit the most from using the compute engine, just to get a chance at parallelism at all, even more so given the horrid latencies of the Kepler shader architecture, which requires concurrency.

GM107 is the other oddity. The GMU is present, but the work distributor is only 16 wide. It's the only Maxwell v1 chip in existence, and it also lacks v2's unlocked mode for the GMU.
 
GK208 is gimped beyond reason. It supposedly has a GMU, but no unlocked execution slots yet. The work distributor is only 6 grids wide, which means sequential execution for BOTH draw calls and compute grids while in 3D mode. It is arguably one of Nvidia's cards that would profit the most from using the compute engine, just to get a chance at parallelism at all, even more so given the horrid latencies of the Kepler shader architecture, which requires concurrency.

GM107 is the other oddity. The GMU is present, but the work distributor is only 16 wide. It's the only Maxwell v1 chip in existence, and it also lacks v2's unlocked mode for the GMU.
There are actually two Maxwell v1 chips - the other is GM108.
 
I own a GK208.
I would be glad to clarify uncertainties.
How can I help?
I already have the data for the GK208. I'm rather confident that it will scale better in compute-heavy DX12 applications such as Fable than you would expect from its performance in DX11 applications.

But I just realized something different. Apparently, AMD is also scaling grid-level concurrency with chip size, and they only introduced that scaling after GCN 1.0; all GCN 1.0 cards appear to have an identical frontend.
 
But I just realized something different. Apparently, AMD is also scaling grid-level concurrency with chip size, and they only introduced that scaling after GCN 1.0; all GCN 1.0 cards appear to have an identical frontend.
Where do you get your info from? Articles or whitepapers...?
 
Where do you get your info from? Articles or whitepapers...?
This specific piece? From the benchmark in this thread. Look closely and you'll see that there are different step sizes for the GCN 1.1 and 1.2 architectures, but only a uniform one (64) for GCN 1.0, despite some of the actual chips being smaller and not exactly requiring that width. I still haven't figured out which functional unit this limit is actually attributable to on AMD's hardware, but it's clear there is some scalable part in the architecture.
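For anyone who wants to double check: spotting those step sizes doesn't take more than a scan over the benchmark's output. A rough sketch only; the Sample struct and how the data gets loaded are my own placeholders, not part of the benchmark:

Code:
#include <cstdio>
#include <vector>

struct Sample { int kernels; double ms; };  // one row of the benchmark output

// Print the kernel counts at which the total time jumps, i.e. where another
// batch of grids had to be serialized behind the previous one. The distance
// between two jumps is the grid-level concurrency width of the frontend.
void printStepWidths(const std::vector<Sample>& samples, double jumpThresholdMs)
{
    int lastStep = 0;
    for (size_t i = 1; i < samples.size(); ++i) {
        if (samples[i].ms - samples[i - 1].ms > jumpThresholdMs) {
            std::printf("step at %d kernels (width %d)\n",
                        samples[i].kernels, samples[i].kernels - lastStep);
            lastStep = samples[i].kernels;
        }
    }
}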

I wish there were papers on NV's and AMD's architectures which would give detailed insight, but there are none.

There's only that one architectural overview chart for GCN (and the leaked PS4 block chart), and both are incomplete. At least it is somewhat possible to extrapolate the latter, knowing which capabilities have been omitted. The former is mostly useless, as it only tells you how scalable GCN is supposed to be, while you don't see that scaling in any of the released chips, and the diagram also hides several layers of hardware.

And as for Nvidia, there's not even that. The largest diagrams you will ever see cover only 2-3 components in isolation each, even in their own papers.
The only things properly researched by third parties are the various assembler instruction latencies for Fermi, Kepler and Maxwell. Have fun puzzling the rest together from conclusions drawn from their own best practice guides (mostly for CUDA; there's at least *some* public documentation for that), from anomalies, but mostly from reverse engineering the hardware, always assuming the most minimalist circuit design required to provide a given observed functionality.

To be honest, Nvidia's hardware is actually easier to reverse engineer than AMD's. For NV, you can observe how the hardware evolved, how function units got patched in or received minor overhauls, and you can see the loose ends from historic design decisions, resulting in abnormal performance in edge cases which are, luckily, consistent across multiple generations. Apply some basic knowledge of hardware design, and you can tell where they re-purposed function units, pipes and caches.

AMD's hardware is the one giving me a headache. It's rather sophisticated, which makes it really difficult to attribute measured bottlenecks to a specific function unit. I wouldn't trust my own numbers on AMD's hardware; there's a good chance something entirely different was the limit. Yet it appears plausible, in the sense that the numbers do roughly match up.

And the articles ... it's incredible how many mistakes and misinterpretations most articles contain.
Seriously, try checking their sources for once. None of the magazines I checked were reliable at all. It's all fine as long as they are only using raw specs pulled from Wikipedia, or benchmarking third-party software. But as soon as they start interpreting statements made at public conferences, mixing up marketing statements and random rants with actual internal details, it gets messy. It's even worse when they start quoting each other (or just not quoting properly), and figures appear out of nowhere, put together in a rush in the attempt to publish another article by the evening. It gets worse still when you know that both vendors like to provide their own NDA-protected "analyses" of their competitor's capabilities as part of their press kits, which many magazines then quote improperly, presenting them as their own findings and always discarding the disclaimers.


The DX12 API with its minimalistic driver is literally the first time you can observe the hardware yourself in a controlled environment, regarding both 3D and compute capabilities. (Well, it would have been possible with Mantle as well, had it become public.) For NV, at least, as long as they are not hacking their drivers to hide these details.
So I can only encourage others to take the chance and pry open every single detail about the hardware while the drivers aren't messing with the results yet. And if you can get into contact with developer studios, ask what specific edge cases they encountered with DX12. (And you don't want their filtered "conclusions". You want the raw picture.)
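Getting started costs almost nothing. A minimal sketch, assuming a D3D12 device has already been created, of the two queues you need before you can probe how 3D and compute work get scheduled side by side:

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void createQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& directQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // 3D engine
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&directQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute engine
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}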
 
So basically all conjecture at this point with little substance other than your own personal conclusions/bias. Hopefully it will be taken in that context and not plastered on the net as a "statement of truth" by someone claiming to be well versed in the innards of all GPU architectures.

(And you don't want their filtered "conclusions". You want the raw picture.)
That's one point I agree with, and at this point I'd say we are not even close.

Edit: It's interesting that all your posts are focused on this one thread; I think back then it was AOS benchmark related. Any reason for the absence of posts prior to joining 1.5 months ago?
 
It's interesting that all your posts are focused on this one thread; I think back then it was AOS benchmark related. Any reason for the absence of posts prior to joining 1.5 months ago?
The AotS topic caught my attention because the explanations to be found on tech review sites were inconclusive, outright didn't make any sense, or contradicted each other. I had no interest in this field before, only some minor OpenGL and OpenCL experiments.

So I focused on comparing two specific architecture families in terms of one specific feature: high-level scheduling.

All I knew before were the presentations which Nvidia and AMD each held when they introduced their architectures. And those explained neither the capabilities nor the performance penalties observed. The benchmark in this thread (at least the later versions) yielded results which didn't match any known hardware specs.

By chance, I got in contact with one of the Oxide devs, but I really can't disclose the details; that's not up to me. They chose not to publish them, so neither can I. And I cannot disclose any possible other sources either.

Believe me, I hate the situation as much as you do. I would prefer it if my claims could be properly reviewed as well, and if I didn't have to stay so vague. But the information policy regarding graphics hardware is, unlike in the software domain, toxic. Both dGPU vendors are trying to protect their IP, and they are doing so quite aggressively.

And yes, I recognize that I should have flagged speculation in this thread more clearly as such. I did in fact make a few false claims, misinterpret data and the like, which I had to revise later on. I'm by no means an expert on these architectures in general, let alone other architectures, but I got good insight into the one feature I focused on.
 
Interesting discussion from a theoretical standpoint but after 900 posts we're still missing a scenario where any of this actually matters. That will probably change as engines evolve but it seems speed trumps scheduling efficiency in today's titles.
 
Interesting discussion from a theoretical standpoint but after 900 posts we're still missing a scenario where any of this actually matters.

Not for long.
Ashes of the Singularity will be out in a week publicly in Alpha with its benchmark mode publicly available.
 
Not for long.
Ashes of the Singularity will be out in a week publicly in Alpha with its benchmark mode publicly available.
A primary consideration will be how relevant that benchmark is to the actual gameplay mechanics, and whether it is designed to create a more stress-specific test scenario.
I am a bit leery of built-in game benchmarks these days after Dragon Age: Inquisition, where one brand of GPU performed worst in the benchmark and yet was better in the actual game (especially on the more intensive maps). One of the professional reviewers looked at this and found it repeatable. This is going back a while, but it still calls into question how reliable those benchmarks can be, especially if a product is aligned somewhat with either the red or the green team.

I am interested to know whether the efficient DX12 mechanics associated with Star Swarm, which worked well for NVIDIA, are also integral to AotS. I appreciate they are not truly like-for-like game engines, and maybe the greater complexity of AotS (including, I assume, more advanced rendering) means that what worked really well back then for NVIDIA cannot be implemented in the current, more advanced, real game.

Cheers
 
But the information policy regarding graphics hardware is, unlike in the software domain, toxic. Both dGPU vendors are trying to protect their IP, and they are doing so quite aggressively.
AMD has shared lots of low level technical information about GCN in their latest ISA documents. CodeXL also shows shader microcode for various GCN architectures.

http://developer.amd.com/wordpress/...hern_Islands_Instruction_Set_Architecture.pdf
http://amd-dev.wpengine.netdna-cdn..../07/AMD_GCN3_Instruction_Set_Architecture.pdf

What kind of information would you need that is missing?

Intel's open-source driver documents are also very good and provide a huge amount of low-level hardware detail. Nvidia seems to be the only one with a very limited amount of low-level hardware documentation available.
 
GCN's ISA documentation appears to be too low-level and confined to what happens within the GPU domain for this thread's context.

One difference that I have mulled discussing in the Intel Gen architecture thread is that Intel's method for interacting with special-function and shared hardware more explicitly acknowledges the existence of the sub-processors that go into creating the illusion of a unified device called a GPU, since messages are composed that can address elements like the processor that handles thread launches.

The CUs are more heavily abstracted from the details of the queues and the policies of those processors, since the shader ISA code does not come into play until all those other queues and cores have done their work. The 3D driver documents for Linux have more discussion of the queues and microcode engines, whose outputs the CUs generally rely on getting spoon-fed to them at the end.

With HSA, if that matters, to some extent the CUs arguably should not have that visibility, since queue processing is not supposed to take up ALU resources.
 
AMD has shared lots of low level technical information about GCN in their latest ISA documents. CodeXL also shows shader microcode for various GCN architectures.

http://developer.amd.com/wordpress/...hern_Islands_Instruction_Set_Architecture.pdf
http://amd-dev.wpengine.netdna-cdn..../07/AMD_GCN3_Instruction_Set_Architecture.pdf

What kind of information would you need that is missing?
Information on the capabilities of two specific units: the "Command Processor" and the "Dispatch Processor".

The papers you linked focus only on how to get a kernel to execution. They specify the data format, how to get your kernel downloaded to the GPU, how to issue execution, and the next thing you know, your kernel is already running on a CU. A lot of hidden stuff happens before that last step.

What they never disclosed at all is what limits or capabilities these two units have, especially which command types or specific workloads block pipelining/concurrent execution inside these units.
Neither did Nvidia publish the limits of their corresponding units. (NV calls the "Dispatch Processor" the "Work Distributor" instead.)

And it appears to be the differences in this function group which have caused Oxide (and are causing other teams) so much trouble.
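Which leaves probing it yourself. A hedged sketch of the kind of experiment the benchmark in this thread is effectively running: submit the same pre-recorded, dispatch-only command lists once to the direct queue and once to the compute queue and compare wall-clock times. The timeSubmission helper and the recording of the lists are placeholders of mine, not a vendor tool:

Code:
#include <d3d12.h>
#include <windows.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Measure how long a batch of command lists takes from submission to fence
// completion on the given queue. If the frontend can pipeline the dispatches,
// the compute-queue path should not scale linearly with the list count.
double timeSubmission(ID3D12Device* device, ID3D12CommandQueue* queue,
                      ID3D12CommandList* const* lists, UINT count)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    queue->ExecuteCommandLists(count, lists);
    queue->Signal(fence.Get(), 1);
    fence->SetEventOnCompletion(1, done);
    WaitForSingleObject(done, INFINITE);

    QueryPerformanceCounter(&t1);
    CloseHandle(done);
    return double(t1.QuadPart - t0.QuadPart) / double(freq.QuadPart);
}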
 
Not for long.
Ashes of the Singularity will be out in a week publicly in Alpha with its benchmark mode publicly available.
Any word on the Star Swarm benchmark? I think Anand's had a DX12 preview with Star Swarm... but I don't think the DX12 portion was made public at that time.
 
The papers you linked focus only on how to get a kernel to execution. They specify the data format, how to get your kernel downloaded to the GPU, how to issue execution, and the next thing you know, your kernel is already running on a CU. A lot of hidden stuff happens before that last step.
It would be great if AMD released PC profiling tools that show a per-CU timeline of wave occupation (how many waves of each shader are running on each CU in every time slice) plus markers for various stalls. This would give the programmer enough information to properly optimize their DX12 code with regard to concurrent execution (barriers & async compute). Current PC tools are not good enough for this.

It would also be nice if PC programmers got a document about the GCN concurrent execution details, and a best practices guide to help them write fast code. As you have said, many things are currently not explained well, and this can cause performance issues as developers do not fully understand the hardware bottlenecks.
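Until such tools exist, GPU timestamp queries are about the only portable way on PC to see when the work on a queue actually started and finished. A sketch only; creating the readback buffer and recording the rest of the command list are assumed to happen elsewhere:

Code:
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void recordTimestamps(ID3D12Device* device, ID3D12GraphicsCommandList* cmdList,
                      ID3D12Resource* readbackBuffer,  // 2 * sizeof(UINT64), READBACK heap
                      ComPtr<ID3D12QueryHeap>& heap)
{
    D3D12_QUERY_HEAP_DESC desc = {};
    desc.Type = D3D12_QUERY_HEAP_TYPE_TIMESTAMP;
    desc.Count = 2;
    device->CreateQueryHeap(&desc, IID_PPV_ARGS(&heap));

    cmdList->EndQuery(heap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 0);
    // ... the Dispatch()/Draw() calls whose span we want to measure ...
    cmdList->EndQuery(heap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 1);
    cmdList->ResolveQueryData(heap.Get(), D3D12_QUERY_TYPE_TIMESTAMP, 0, 2,
                              readbackBuffer, 0);
    // Convert ticks to time with ID3D12CommandQueue::GetTimestampFrequency().
}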
 
I just realized that neither AMD nor Nvidia gives PC developers a proper chart of all the graphics state change costs: how some state changes limit concurrency, and what other costly operations the state changes perform. I have been writing and optimizing/analyzing my code on consoles first for such a long time that I had forgotten how much guesswork it takes to optimize things on PC.

And I am not talking about DX12 async compute here... Basic graphics pipe concurrency (including DX9/10/11) doesn't seem to be well documented either.
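Just as an example of the kind of guidance that is missing: the general advice is that barriers can stall the pipe, so transitions should be batched into a single ResourceBarrier() call rather than issued one by one, but neither vendor documents how much a given barrier actually costs. A sketch, with placeholder resources:

Code:
#include <d3d12.h>

void transitionBoth(ID3D12GraphicsCommandList* cmdList,
                    ID3D12Resource* bufferA, ID3D12Resource* bufferB)
{
    D3D12_RESOURCE_BARRIER barriers[2] = {};
    for (int i = 0; i < 2; ++i) {
        barriers[i].Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
        barriers[i].Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
        barriers[i].Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
        barriers[i].Transition.StateAfter  = D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE;
    }
    barriers[0].Transition.pResource = bufferA;
    barriers[1].Transition.pResource = bufferB;

    cmdList->ResourceBarrier(2, barriers);  // one call instead of two
}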
 