DX12 Performance Discussion And Analysis Thread

No, it seems the front end of Maxwell 2 will have issues in heavy compute-plus-graphics situations. This is because it doesn't have the register space or cache to keep the queues filled. Whether we will see that within its lifetime is up in the air; we need more DX12 games to test for that. (That is all assuming that what we have seen so far is all the driver's fault.)

Talking about gaming performance, how COULD this affect Maxwell 2 in the future compared to, say, a Fury X or a 390X?
I'm a little lost with all these low-level definitions. I've only coded a little bit of MIPS assembly, and that's mainly it, lol.
 
lol, yeah, that's a good question, and it's hard to say. nV can do tricks where it offloads things to the CPU and shuffles them back to the GPU, so...

But if Maxwell 2's queues are overloaded, expect a performance hit, and depending on how big the overload is, the hit can be large. It all depends on the game, though. Will soon-to-be-released games push it hard enough to create a bottleneck that just shuts Maxwell 2 down? I don't think they will, and by the time games like that come out, I think other parts of all the GPUs will be struggling.
 
lol, yeah, that's a good question, and it's hard to say. nV can do tricks where it offloads things to the CPU and shuffles them back to the GPU, so...

But if Maxwell 2's queues are overloaded, expect a performance hit, and depending on how big the overload is, the hit can be large. It all depends on the game, though. Will soon-to-be-released games push it hard enough to create a bottleneck that just shuts Maxwell 2 down? I don't think they will, and by the time games like that come out, I think other parts of all the GPUs will be struggling.

Then wouldn't it be better for them to just leave async shaders off and work as they used to with DX11?
 
Maybe I'm being naive, but no hardware is perfect and everything has limitations to work around. I'm hearing that the real-world best case is a 30% boost; if NVIDIA's implementation can capture even half of that, I think they'll be in a good spot, and either way it doesn't sound make-or-break. If the NVIDIA solution starts to fall apart under heavy load, I wonder if devs won't go the extra mile beyond that point, given that NVIDIA has the majority. Although TBH I don't know whether this is something that requires careful optimization and hard work, or a simple variable (like the number of threads in HandBrake) that you turn up until it breaks.
 
Then wouldn't it be better for them to just leave async shaders off and work as they used to with DX11?
Well, no, because there is performance to be gained, and that performance can be significant, similar to what we see on GCN now or more, as long as it's not pushed too hard.

From a software project management point of view, the driver team will first sit down and go over what needs to be done first and define the critical path (the critical path is the set of tasks that must be completed before anything else can be). Async will be a low priority because games that are going to use it won't be out any time soon, so it gets pushed to the end of the backlog.
 
Well, no, because there is performance to be gained, and that performance can be significant, similar to what we see on GCN now or more, as long as it's not pushed too hard.

From a software project management point of view, the driver team will first sit down and go over what needs to be done first and define the critical path (the critical path is the set of tasks that must be completed before anything else can be). Async will be a low priority because games that are going to use it won't be out any time soon, so it gets pushed to the end of the backlog.

Alright, alright, now we are in my field (systems/software engineering), and I definitely agree with you: this likely won't be an issue for Maxwell 2 during the next few years, and by the time it is, just lowering some settings will be fair enough; all in all, other parts of the GPU will struggle by then, as you said.
Therefore, this puts Maxwell 2 in a similar position to Fiji (given its 4 GB of VRAM and other specs that limit performance)...

Also, I think that given the priority async shaders have and the phase Nvidia is at in its development cycle, it's natural to see immature async support in their drivers. I'm pretty sure their marketing department wasn't expecting this at all...
 
Regarding async compute, a couple of points on this. First, though we are the first D3D12 title, I wouldn't hold us up as the prime example of this feature. There are probably better demonstrations of it. This is a pretty complex topic and to fully understand it will require significant understanding of the particular GPU in question that only an IHV can provide. I certainly wouldn't hold Ashes up as the premier example of this feature.
Interesting comments from the Oxide dev ... guess creating a DX12 game isn't going to be as straightforward as they may have initially thought.
 
Not sure why "games using async shaders won't be coming anytime soon", given that developers have had exposure to this on the consoles for a number of years now and it sounds like it has already been implemented. DX12 means those capabilities are available on more platforms.


I agree with that, but when creating a new driver set they still have to focus on current games and games just about to come out, so depending on how deep their driver team is, they might have prioritized those first.

We have seen this before with Windows 7 and DX11 too.

Actually, from a business point of view, getting drivers ready for current games and games just about to come out is more important too. Although that means putting off developers who want to use new features, in the longer term they still have time to get to them. There are always both aspects when development is happening: business and engineering versus time and cost.
 
Not sure why "games using async shaders won't be coming anytime soon", given that developers have had exposure to this on the consoles for a number of years now and it sounds like it has already been implemented. DX12 means those capabilities are available on more platforms.

Well, Unreal Engine 4 just implemented it... Also, the PS4 and Mantle have been able to use this for a long time, but there hasn't been an important focus on this subject yet. I'm pretty sure all the fuss about async compute grew exponentially after this year's GDC (of course not for devs, but for the general public).

Also, AMD in their demo got around a 10% perf boost using this, and some devs say it can reach a maximum of 30%... (Oxide states they used a moderate amount of it)... this makes me wonder...

How much will this really improve performance in actual games...

And (this is a question for you guys who actually know a LOT about this)... when and what are these shaders used for? (I would love an easy example of the kind of game that would use them the most. Will FPSes or games like The Witcher use a lot of them?)
 
How much will this really improve performance in actual games...

And (this is a question for you guys who actually know a LOT about this)... when and what are these shaders used for? (I would love an easy example of the kind of game that would use them the most. Will FPSes or games like The Witcher use a lot of them?)

Lighting algorithms, deferred rendering, physics, all can use heavy compute.
 
No, it is blending them together just fine. One queue is filled with commands from the render pipeline. The other 31 queues are either filled with commands from up to 31 dedicated (software) queues or with async tasks.

That's almost working as it is supposed to. There appear to be several distinct problems though:
  • If a task is flagged as async, it is never batched with other tasks - so the corresponding queue underruns as soon as the assigned task finishes.
  • If a queue is flagged as async AND serial, apart from the lack of batching, the other 31 queues are not filled either, so the GPU runs out of jobs after executing just a single task each.
  • As soon as dependencies become involved, the entire scheduling is performed on the CPU side, as opposed to offloading at least parts of it to the GPU.
  • Mixed mode (31+1) seems to have different hardware capabilities than pure compute mode (32). Via DX12, only mixed mode is accessible. (Not sure if this is true or if the driver just isn't making use of any hardware features.)
  • The graphics part of the benchmark in this thread appears to keep the GPU completely busy on Nvidia hardware, so none of the compute tasks is actually running in "parallel".
  • "Real" compute tasks require the GPU to be completely idle before switching context. This became prominent because Windows dispatches at least one pure compute task on a regular basis, which starved and caused Windows to kill the driver.
Well, at least the first two are apparently driver bugs:
  • Nvidia COULD just merge multiple async tasks into batches in order to avoid underruns in the individual queues. This would reduce the number of roundtrips to the CPU enormously.
  • Same goes for sequential tasks: they can be merged into batches as well. AMD already does that, and even performs additional optimizations on the merged code (notice how the performance went up in sequential mode?).
So AMD's driver is just far more mature, while Nvidia is currently using a rather naive approach.
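
To make the terminology concrete, here is a rough, untested C++ sketch of what "async compute" looks like from the application side of D3D12: one DIRECT queue for graphics, a separate COMPUTE queue for the async tasks, and command lists submitted to both. Whether the GPU actually overlaps the two is entirely up to the hardware and driver, which is exactly what the analysis above is about. The helper names and the assumption that the command lists are recorded elsewhere are mine, and error handling is omitted.

```cpp
// Minimal sketch: a graphics (DIRECT) queue plus a dedicated COMPUTE queue,
// with independent work submitted to each. Assumes 'device' and the command
// lists already exist; error handling is omitted for brevity.
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void CreateQueues(ID3D12Device* device,
                  ComPtr<ID3D12CommandQueue>& graphicsQueue,
                  ComPtr<ID3D12CommandQueue>& computeQueue)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;   // graphics + compute + copy
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;  // compute + copy only
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
}

// Per frame: hand each queue its own recorded command list. The API only
// expresses that the two streams are independent; concurrent execution is
// left to the hardware/driver.
void SubmitFrame(ID3D12CommandQueue* graphicsQueue,
                 ID3D12CommandQueue* computeQueue,
                 ID3D12GraphicsCommandList* renderCmdList,  // recorded draw calls
                 ID3D12GraphicsCommandList* asyncCmdList)   // recorded Dispatch() calls
{
    ID3D12CommandList* gfx[] = { renderCmdList };
    graphicsQueue->ExecuteCommandLists(1, gfx);

    ID3D12CommandList* cmp[] = { asyncCmdList };
    computeQueue->ExecuteCommandLists(1, cmp);
}
```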



Well, and there's the confirmation.

I wouldn't expect the Nvidia cards to perform THAT badly in the future, given that there are still possible gains to be made in the driver. I wouldn't exactly overestimate them either, though. AMD simply has a far more scalable hardware design in this domain, and the necessity of switching between compute and graphics context, in combination with the starvation issue, will continue to haunt Nvidia, as that isn't a software fault but a hardware design fault.

As nVidia said: heavy post-processing will slow down the GPU with async shading. Ashes of the Singularity uses heavy post-processing with a lot of units and lights, which is such a rare case in practice that this won't happen often with other games out there. But I don't disagree with your statement, quite the contrary. Thanks for sharing your knowledge.
 
Also, AMD in their demo got around a 10% perf boost using this, and some devs say it can reach a maximum of 30%... (Oxide states they used a moderate amount of it)... this makes me wonder...

There is no 30% maximum. There's already one developer on this forum who has mentioned that certain tasks can see greater than 50% improvement from async compute. It's all going to depend on whether a particular job is well suited to it or not. If it isn't, you aren't likely to see any performance gains. If it's particularly well suited, you may see large gains.

Regards,
SB
 
Thanks to everyone who has provided us with numbers ... even if it has made the topic a bit hard to follow, as it's gotten a bit noisy.

Still, I don't see how they will do the scheduling efficiently at the driver level... the only way is to have the devs do the pre-emption scheduling (fixed tasks per queue)... but maybe the result will not be bad when well optimized.
 
Not sure why "games using async shaders won't be coming anytime soon", given that developers have had exposure to this on the consoles for a number of years now and it sounds like it has already been implemented. DX12 means those capabilities are available on more platforms.

People forget changing a rendering engine is a long road. Some console games have some compute shaders but aren't compute heavy... All tiled-rendering games use it, like BF4 and BF Hardline, particle physics in Infamous, force fields in KZ: SF, some compute shaders too in The Order: 1886... Same thing for Forward+ games...

Some 2015 titles will use compute heavily, like The Tomorrow Children, and maybe Rise of the Tomb Raider, Battlefront, Need for Speed...

I think 2016 will probably be the year of compute-shader-heavy console titles... Dreams is 100% compute shaders, maybe the RedLynx game, UC4, Quantum Break, Deus Ex: Mankind Divided, and so on...

Frostbite developers have said their games will have a DX12 mode on PC in fall 2016...
 
If I understand what you mean by batching, that's what I meant by blending. To give an example that also illustrates my skill level here, let's say the math units in the GPU know how to add, multiply, subtract, or divide. If the rendering queue only needs use of the add and multiply units on this particular cycle, then we should be able to hang compute jobs on the subtraction and/or division units, right? And for whatever reason, that's not happening on the Nvidia cards, right? Is that even what we're talking about here? :p
What I meant by batching is actually stapling multiple tasks onto each other in sequential order, in order to keep the queues filled for longer. What you call "blending" is what the hardware is supposed to do on its own with the 32 queues. There are two cases you want to avoid:
  • Underutilizing special function units while the GPU is active
  • Idle GPU
Blending instructions from multiple tasks helps with the 1st point. Enqueuing additional tasks right behind already queued ones helps with the 2nd one, and that's what Nvidia appears to be missing so far, even though that is just a driver, not a hardware feature.
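
For what it's worth, here is a rough, untested C++ sketch of the same "keep the queue fed" idea at the application level: many compute dispatches recorded back to back and handed to the queue in one submission, instead of one submission per task, so the queue doesn't go idle after every single task. The post above is about the driver doing the equivalent merging internally; the ComputeTask struct and the omission of resource bindings are simplifications of mine.

```cpp
// Minimal sketch: staple several compute tasks into one command list and
// submit them together. PSO/root-signature setup and resource bindings are
// assumed to happen elsewhere; thread-group counts are placeholders.
#include <d3d12.h>
#include <vector>

struct ComputeTask {
    ID3D12PipelineState* pso;
    UINT groupsX, groupsY, groupsZ;
};

void SubmitBatched(ID3D12CommandQueue* computeQueue,
                   ID3D12CommandAllocator* allocator,
                   ID3D12GraphicsCommandList* cmdList,
                   ID3D12RootSignature* rootSig,
                   const std::vector<ComputeTask>& tasks)
{
    allocator->Reset();
    cmdList->Reset(allocator, nullptr);
    cmdList->SetComputeRootSignature(rootSig);

    // All tasks recorded back to back: the queue stays occupied until the
    // whole batch has drained instead of underrunning after every task.
    for (const ComputeTask& t : tasks) {
        cmdList->SetPipelineState(t.pso);
        cmdList->Dispatch(t.groupsX, t.groupsY, t.groupsZ);
    }
    cmdList->Close();

    ID3D12CommandList* lists[] = { cmdList };
    computeQueue->ExecuteCommandLists(1, lists);
}
```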
 
No, it seems the front end of Maxwell 2 will have issues in heavy compute-plus-graphics situations. This is because it doesn't have the register space or cache to keep the queues filled.
Are you sure? I mean, the register files being too small for actual parallel computation is a well-known issue, so that is a likely problem as parallelism rises.

But the caches shouldn't limit the size of each queue. If you insist on pushing longer programs into each queue, the hardware should cope with that quite well. More cache misses, yes. Possibly even running into the memory bandwidth limit. But I don't see how this would affect the refilling of the queues. Currently, it only looks as if the queues are simply underrunning far too often, because the available queue depth isn't being used.
 
Lighting algorithms, deferred rendering, physics, all can use heavy compute.
Also compute is used for: skinning, global illumination, occlusion culling, particle animation, depth of field, fast blur kernels (ex. bloom), reductions (ex. average luminance for eye adaptation), screen space reflections, distance field ray tracing (ex. soft shadows and AO in UE4), etc, etc.

Media Molecule's new game "Dreams" has a GPU pipeline that is fully compute shader based (no rasterization at all).

Q-Games (The Tomorrow Children) had quite nice performance improvements from asynchronous compute in their global illumination implementation. DICE (Frostbite) is using asynchronous compute for character skinning. Their presentation described skinning as being almost free this way, as the async skinning fills holes in GPU execution. Recent presentations from game/technology developers have shown lots of use cases for asynchronous compute (on consoles). As DX12 supports multiple compute queues, we will certainly see similar optimizations on PC.
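
As a rough, untested C++ sketch of the skinning case (the command lists, fence and pass names are placeholders of mine, not DICE's actual code): the skinning dispatches go to a compute queue, the graphics queue meanwhile works on something independent, and a GPU-side fence wait is placed only where the skinned vertices are actually consumed, so the compute work fills the holes instead of serializing the frame.

```cpp
// Minimal sketch: overlap async skinning with independent graphics work and
// synchronize with a GPU-side fence wait. Queues, command lists and the fence
// are assumed to be created elsewhere.
#include <d3d12.h>

void FrameWithAsyncSkinning(ID3D12CommandQueue* graphicsQueue,
                            ID3D12CommandQueue* computeQueue,
                            ID3D12CommandList*  skinningCmdList,  // compute: skinning dispatches
                            ID3D12CommandList*  shadowCmdList,    // graphics: independent work
                            ID3D12CommandList*  mainPassCmdList,  // graphics: uses skinned verts
                            ID3D12Fence*        skinningFence,
                            UINT64              fenceValue)
{
    // Kick off skinning on the compute queue; the GPU may overlap it with
    // whatever the graphics queue is doing.
    computeQueue->ExecuteCommandLists(1, &skinningCmdList);
    computeQueue->Signal(skinningFence, fenceValue);       // "skinning done" marker

    graphicsQueue->ExecuteCommandLists(1, &shadowCmdList);

    // GPU-side wait: the graphics queue (not the CPU) stalls here until the
    // compute queue has signaled, then renders with the skinned vertices.
    graphicsQueue->Wait(skinningFence, fenceValue);
    graphicsQueue->ExecuteCommandLists(1, &mainPassCmdList);
}
```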
 
Also compute is used for: skinning, global illumination, occlusion culling, particle animation, depth of field, fast blur kernels (ex. bloom), reductions (ex. average luminance for eye adaptation), screen space reflections, distance field ray tracing (ex. soft shadows and AO in UE4), etc, etc.

Media Molecule's new game "Dreams" has a GPU pipeline that is fully compute shader based (no rasterization at all).

Q-Games (The Tomorrow Children) had quite nice performance improvements from asynchronous compute in their global illumination implementation. DICE (Frostbite) is using asynchronous compute for character skinning. Their presentation described skinning as being almost free this way, as the async skinning fills holes in GPU execution. Recent presentations from game/technology developers have shown lots of use cases for asynchronous compute (on consoles). As DX12 supports multiple compute queues, we will certainly see similar optimizations on PC.

Particle scatter/gather like in the 2014 AMD presentation on particles without overdraw... no implementation in any released games yet.
 
Particle scatter/gather like in the 2014 AMD presentation on particles without overdraw... no implementation in any released games yet.
Media Molecule must be using something similar in Dreams since they don't use the rasterizer at all, so they will likely be the first to ship a fully compute-based particle engine (that I know of). I am sure many others are using/evaluating a similar tiled particle system as described in the AMD paper. It provides impressive gains in heavy-overdraw cases and greatly reduces bandwidth usage.
 