DX12 Performance Discussion And Analysis Thread

A compute shader IS a drawcall. It is equivalent to a full-screen triangle with a passthrough vertex shader and a regular pixel shader which reads data from memory, does a bunch of computations and writes the result back out to memory. Compute shaders just have additional constraints on which warps can run together on a SIMD unit, since warps within the same group need to share the same shared memory. No other fixed-function (FF) units are used in a compute shader, but you can always be limited by memory bandwidth, as regularly happens with post-processing compute shaders.

And you can do EVERYTHING in a compute shader - even make a complete game out of it, drawing pretty pictures on the screen. Look at Shadertoy, for example.
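To make the equivalence concrete, here's a minimal sketch (not from the post; the PSO and root-signature names are placeholders) of how both passes look when recorded on a D3D12 command list. At the API level a compute dispatch is just another command, same as the full-screen draw:

```cpp
// Minimal sketch: a "full-screen pass" and a compute pass are both just commands
// recorded on an ID3D12GraphicsCommandList. PSOs, root signatures and thread-group
// sizes here are placeholders, not from the post.
#include <windows.h>
#include <d3d12.h>

void RecordFullscreenPass(ID3D12GraphicsCommandList* cl,
                          ID3D12PipelineState* fullscreenPso,
                          ID3D12RootSignature* gfxRs)
{
    cl->SetPipelineState(fullscreenPso);          // passthrough VS + "compute-like" PS
    cl->SetGraphicsRootSignature(gfxRs);
    cl->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    cl->DrawInstanced(3, 1, 0, 0);                // one full-screen triangle
}

void RecordComputePass(ID3D12GraphicsCommandList* cl,
                       ID3D12PipelineState* computePso,
                       ID3D12RootSignature* compRs,
                       UINT width, UINT height)
{
    cl->SetPipelineState(computePso);             // same math, no FF units involved
    cl->SetComputeRootSignature(compRs);
    cl->Dispatch((width + 7) / 8, (height + 7) / 8, 1);  // e.g. 8x8 thread groups
}
```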
 
Devs have gotten as much as a 6-7 ms perf improvement using async compute, which is HUGE! And no, AMD didn't do any marketing dump on async - those are real gains achieved in real games. It paves the way to doing more stuff on the GPU, like culling, particles etc.
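For reference, a rough sketch of what "using async compute" means at the D3D12 API level, assuming a device, a graphics (DIRECT) queue, recorded command lists and a fence already exist; whether the two queues actually overlap on the GPU is entirely up to the hardware and driver:

```cpp
// Sketch only: a second queue of type COMPUTE next to the usual DIRECT (graphics)
// queue, with a fence where ordering actually matters. Everything passed in here
// is a placeholder.
#include <windows.h>
#include <d3d12.h>

ID3D12CommandQueue* CreateComputeQueue(ID3D12Device* device)
{
    // In real code this is created once at startup, not per frame.
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ID3D12CommandQueue* queue = nullptr;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&queue));
    return queue;
}

void SubmitFrame(ID3D12CommandQueue* gfxQueue, ID3D12CommandQueue* computeQueue,
                 ID3D12CommandList* const* gfxLists, UINT numGfx,
                 ID3D12CommandList* const* compLists, UINT numComp,
                 ID3D12Fence* fence, UINT64 fenceValue)
{
    // Both submissions are in flight at once; the GPU may (or may not) overlap them.
    gfxQueue->ExecuteCommandLists(numGfx, gfxLists);
    computeQueue->ExecuteCommandLists(numComp, compLists);

    // Only graphics work submitted after this Wait depends on the compute results.
    computeQueue->Signal(fence, fenceValue);
    gfxQueue->Wait(fence, fenceValue);
}
```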

But it's important to note that a lot of these techniques are for GCN only. For example, Frostbite recently had a paper about doing triangle culling with compute (http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf). They note they got a pretty good performance boost using this technique. However from my understanding when that same technique was applied to Intel's and Nvidia's hardware, it wasn't any faster. And the reason had nothing to do with inferior async/concurrent compute "support". The original "performance problem" (idle units) didn't exist on those architectures because they weren't nearly as bottlenecked by triangle throughput as GCN. Even if Intel and Nvidia had the best async compute support (whatever that means), they would not have seen (meaningful) performance gains with this technique on those architectures.

So what does this mean? You could argue that GCN shouldn't have had those bottlenecks in the first place. You could also argue that AMD designed GCN specifically to be flexible enough to work around those bottlenecks. Both arguments are correct. :D The point is that all async compute techniques are unique and are highly sensitive to the underlying architectures. Perhaps async compute technique X might give better performance gains on architecture A, but technique Y might be better for architecture B. It all depends on the bottlenecks of the architecture.

Ultimately (imo) the performance gain of a given async technique means nothing. Perhaps the architecture already wasn't idle. Perhaps the architecture was way more idle than it should have been. :p Regardless, hopefully people will at least understand why having a "does support async compute" flag is nonsensical. And why "supports async/concurrent compute" is a loaded term. ;)
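For context, the culling approach referenced above boils down to a compute pass that writes out the draw arguments of surviving clusters, which the graphics pipeline then consumes via ExecuteIndirect. A rough D3D12 sketch of that general pattern (placeholder PSOs, root signatures and buffers; not the actual Frostbite/GeometryFX code):

```cpp
// Rough sketch of "cull with compute, then draw what survived". 'cullPso' is
// assumed to append D3D12_DRAW_INDEXED_ARGUMENTS for surviving clusters into
// 'argsBuffer' and bump a counter in 'countBuffer'. All objects are placeholders.
#include <windows.h>
#include <d3d12.h>

void CullThenDrawIndirect(ID3D12GraphicsCommandList* cl,
                          ID3D12PipelineState* cullPso, ID3D12RootSignature* cullRs,
                          ID3D12PipelineState* drawPso, ID3D12RootSignature* drawRs,
                          ID3D12CommandSignature* drawSig,
                          ID3D12Resource* argsBuffer, ID3D12Resource* countBuffer,
                          UINT clusterCount, UINT maxDraws)
{
    // Compute pass: one thread (or wavefront) per cluster/triangle batch.
    cl->SetPipelineState(cullPso);
    cl->SetComputeRootSignature(cullRs);
    cl->Dispatch((clusterCount + 63) / 64, 1, 1);

    // Make the UAV writes visible before they are read as indirect arguments.
    // (The count buffer needs the same treatment; omitted for brevity.)
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource   = argsBuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_UNORDERED_ACCESS;
    barrier.Transition.StateAfter  = D3D12_RESOURCE_STATE_INDIRECT_ARGUMENT;
    cl->ResourceBarrier(1, &barrier);

    // Graphics pass: draw only the arguments the culling pass kept.
    cl->SetPipelineState(drawPso);
    cl->SetGraphicsRootSignature(drawRs);
    cl->ExecuteIndirect(drawSig, maxDraws, argsBuffer, 0, countBuffer, 0);
}
```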
 

Which could be linked (in a broader sense) with this: http://gpuopen.com/fast-compaction-with-mbcnt/ and this: http://gpuopen.com/geometryfx-1-2-cluster-culling/

And while we're at it, streaming and memory management in DX12: http://gpuopen.com/performance-series-streaming-memory-management/

It's funny that all those implementations, features and optimization paths could have been used by developers way before, and only appear now with Vulkan and DX12. And there are tons of features on GCN that have never been exposed or made accessible to developers.
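For anyone curious what the first link is about: the mbcnt trick is wave-level stream compaction, where each surviving lane's output slot is the count of lower-numbered surviving lanes in the wave's ballot mask (plus one global atomic per wave for the base offset, omitted here). A plain-C++ illustration of the idea, not the article's GPU code:

```cpp
// CPU-side illustration of the compaction idea behind the first link: within one
// 64-wide wavefront, a lane's output slot is the number of lower-numbered lanes
// that also survived, computed from a ballot mask. On GCN that per-lane count is
// what mbcnt provides. Assumes lanes.size() <= 64.
#include <bitset>
#include <cstdint>
#include <vector>

std::vector<int> CompactWave(const std::vector<int>& lanes, bool (*keep)(int))
{
    // "Ballot": bit i is set if lane i keeps its element.
    uint64_t ballot = 0;
    for (size_t i = 0; i < lanes.size(); ++i)
        if (keep(lanes[i])) ballot |= (uint64_t(1) << i);

    std::vector<int> out(std::bitset<64>(ballot).count());
    for (size_t i = 0; i < lanes.size(); ++i) {
        if (!(ballot & (uint64_t(1) << i))) continue;
        // Slot = popcount of surviving lanes strictly below this one (the mbcnt idea).
        uint64_t lower = ballot & ((uint64_t(1) << i) - 1);
        out[std::bitset<64>(lower).count()] = lanes[i];
    }
    return out;
}
```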
 
However from my understanding when that same technique was applied to Intel's and Nvidia's hardware, it wasn't any faster. And the reason had nothing to do with inferior async/concurrent compute "support". The original "performance problem" (idle units) didn't exist on those architectures because they weren't nearly as bottlenecked by triangle throughput as GCN. Even if Intel and Nvidia had the best async compute support (whatever that means), they would not have seen (meaningful) performance gains with this technique on those architectures.

You got a source for this? In the presentation you linked they only talk about optimizations on GCN: "The optimizations and algorithms presented are specific to AMD GCN hardware, as this first version was aimed at 'getting it right' on consoles and high end AMD PCs."

Of course you will need to tune differently for Nvidia and Intel, but I find the claim that doing GPU-level culling on that hardware wouldn't produce any speedup, even if it fully supported async compute, hard to believe. Here is another presentation where they claim less geometry work on all GPUs: http://advances.realtimerendering.c...siggraph2015_combined_final_footer_220dpi.pdf

Every architecture will have bottlenecks at different units, and async compute will help utilize the "other" units. This is an optimization that applies across all HW, and hence it's in DX12 and Vulkan. It's not a loaded term and should be supported by all GPU vendors. Although the burden of tuning for all configurations is now on game devs, who need to batch their work carefully for each GPU configuration.
 
Doing ports is one thing, but doing bad ports is another. Will devs that choose to do a PC port stick with no optimizations for nV hardware, when nV has so much market share they can't be ignored? I don't buy the idea that console wins will drive PC market share through games optimized for said hardware. It never worked in the past, and it won't really work now. You might see slight fluctuations, but nothing major. nV didn't build their market share based on consoles, did they?
Although this time round the consoles are the closest they have ever been in terms of architecture and similarity to the PC.
The PS3 was the pain-in-the-backside Cell design with the RSX GPU, while the Xbox 360 was the easier Xenon/Xenos.
I do think Nvidia screwed up by not having Sony this time round; they should have sucked it up again with the tight margins - if they even had a chance to tender a bid, though I wonder if they didn't really bother trying due to those small margins.

It does seem that more of the recent AAA console games perform better with DX11 on AMD PC hardware these days than they did in the past, which I put down to a mix of drivers and better engagement with developers on the console side.
Cheers
 

Architecture A has 60% of its "hardware units" idle in workload X while architecture B only has 15% of its units idle. After combining workload X with workload Y using the magical power of async/concurrent/whatever compute, architecture A now only has 20% of units idle while architecture B still has 12% of its units idle. Does architecture B have "worse" async compute support than architecture A?
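Working the hypothetical numbers through (toy arithmetic only, no real hardware data): architecture A roughly doubles its busy units, while architecture B gains only a few percent, even though B was the better-utilized architecture for the workload all along:

```cpp
// The "gain" from overlapping work depends almost entirely on how idle you were
// to begin with, not on how good the "async support" label is. Toy numbers only.
#include <cstdio>

int main()
{
    struct Arch { const char* name; double idleBefore, idleAfter; };
    const Arch archs[] = {
        { "A", 0.60, 0.20 },   // 40% busy -> 80% busy
        { "B", 0.15, 0.12 },   // 85% busy -> 88% busy
    };
    for (const Arch& a : archs) {
        double busyBefore = 1.0 - a.idleBefore;
        double busyAfter  = 1.0 - a.idleAfter;
        // If throughput scales with busy units, this is the relative speedup.
        std::printf("Arch %s: %.0f%% -> %.0f%% busy, ~%.0f%% more work per unit time\n",
                    a.name, busyBefore * 100, busyAfter * 100,
                    (busyAfter / busyBefore - 1.0) * 100);
    }
    return 0;
}
```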
 

We can't answer that until you tell us which architecture we're a fan of.
 
Architecture A has 60% of its "hardware units" idle in workload X while architecture B only has 15% of its units idle. After combining workload X with workload Y using the magical power of async/concurrent/whatever compute, architecture A now only has 20% of units idle while architecture B still has 12% of its units idle. Does architecture B have "worse" async compute support than architecture A?

You're asking the wrong question. For workload X, arch B is simply better (there'll be a workload Z where arch A will be better), but async compute DOES improve its efficiency. So having proper HW support for async is a win-win for all architectures.
 
I've said it on another forum and I'll do it here as well:
Now that NV has disclosed a bit how they handle this on Maxwell we need to petition them to release a driver that will always reserve 50% of SMs to graphics and the other 50% of SMs to compute. Sure it will suck, but OMG the async compute gains!!11!!oneone o_O
:D
 
Which has a funny side effect that cards like GTX 970 or GTX 980 Ti can't reach their peak fillrate even with the simplest of pixel shaders!
Ah but it's much more subtle on Maxwell. Maxwell implements a sort of immediate-mode tiling that reorders shading and ROP work into a TBDR-style pattern where possible, over some window of geometry (up to ~1k triangles-ish IIRC, although it's SKU dependent). Within a tile, it doesn't need to go to the "external" ROPs, similar to TBDRs which don't even have ROPs. Write a ROP benchmark on a very small texture that fits entirely in one tile (say 64x64 or so) and see it far exceed its theoretical ROP throughput :) Most ROP benchmarks use large triangles and thus don't see this effect, but it definitely offsets some of the disparity in more realistic geometry configurations.
 
If you do shadow mapping then basically every chip of the last decade is severely underutilized; sometimes you use only 10% of the chip. And because there is only one rasterizer state, it in itself can't be sped up (through concurrency).



You don't understand where the naming originated from. Under DX11 everything you do is executed in order, and it finishes in order. Thus it's entirely synchronous from a command-stream perspective. The hardware engineers transitioned the hardware to extract CSP (command-stream parallelism, similar to instruction-level parallelism, ILP) from the serial command stream. That only brings you so far, so they added software support to allow us developers to explicitly specify when we don't consider in-order execution important, in effect handling the synchronization ourselves instead of relying on an implicitly specified synchronization. ACEs allow compute command streams to be executed in any order, at any time, thus asynchronously.

And if you two don't grasp what exactly it is - not a Swiss Army knife but a tool, and as much a natural evolutionary step as super-scalar, out-of-order, SMT-able multi-core CPUs - and if you cannot contextualize when, why and how it is or isn't used in the various engines, then you shouldn't make so much unfounded noise.
Asynchronous means not synchronized. If we say two things are asynchronous, we mean that they are decoupled. Asynchrony does not specify implementation - calls to an asynchronous API may very well be implemented sequentially.

The feature AMD calls asynchronous compute is actually about concurrency: that the hardware can execute graphics and compute at the same time. This concurrency is the source of performance gains.

Concurrency and asynchrony are not the same thing, no matter how many times AMD marketing (or people in Beyond3d) conflate them.
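The same distinction exists in plain CPU code. Both calls below are asynchronous from the caller's point of view (you get a future back, not a result), but only one of them is actually concurrent; ExpensiveCompute is just a stand-in:

```cpp
// Plain C++ analogy for the terminology above; not GPU code.
#include <future>
#include <cstdio>

int ExpensiveCompute(int x) { return x * x; }

int main()
{
    // Asynchronous but sequential: the work runs on THIS thread, only when .get()
    // forces it. A perfectly legal implementation of an "async" API.
    std::future<int> deferred = std::async(std::launch::deferred, ExpensiveCompute, 7);

    // Asynchronous and concurrent: the work runs on another thread right away.
    std::future<int> concurrent = std::async(std::launch::async, ExpensiveCompute, 9);

    std::printf("%d %d\n", deferred.get(), concurrent.get());
    return 0;
}
```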
 
release a driver that will always reserve 50% of SMs to graphics and the other 50% of SMs to compute. Sure it will suck, but OMG the async compute gains!!11!!oneone o_O
:D

You've got it all wrong. That's not how async works; in fact it's the complete opposite of that - it's about load balancing, not reservation. We've evolved to a Unified Shader Architecture for a reason.
 
Doing ports is one thing, but doing bad ports is another. Will devs that choose to do a PC port stick with no optimizations for nV hardware, when nV has so much market share they can't be ignored? I don't buy the idea that console wins will drive PC market share through games optimized for said hardware. It never worked in the past, and it won't really work now. You might see slight fluctuations, but nothing major. nV didn't build their market share based on consoles, did they?

Or a developer keeps their PC version as close to the console version as possible, and Nvidia relies on its famous devrel to help them get an Nvidia-specific rendering path. Isn't this what everyone was saying when AMD performed poorly, in part due to more time being spent coding on Nvidia hardware? That AMD needs to get its devrel into gear?

Yes, on PC Nvidia holds by far the larger share of the gaming market. But in a day and age where console development is coming closer and closer to PC development, and almost all developers who release PC games also release console games, you have to take console market share into account as well, since the graphics hardware is basically the same as PC hardware with some differences (a unified memory pool with memory contention for the PS4, some of that plus the ESRAM scratchpad for the XBO, for instance).

And considering that for AAA developers their first priority is console gaming, there are inevitably going to be efforts to make rendering as efficient as possible on GCN. If those are then also applicable to PC, should that just be thrown away? In the past, when DX9/10/11 was vastly different from anything used on the PS3/X360, it meant the PC version had to be radically different from the console versions. DX12 means it's easier than ever before for games to share some of the techniques, and possibly even code, between the PC and console versions.

So for a developer who may not have a lot of resources (time, money, manpower) for a PC port, which is more important: an Nvidia PC rendering path that is significantly different from the console rendering path, or a console + GCN PC rendering path (sharing much of the work that went into the console version)?

Obviously developers will at least put in the effort to make sure GCN-optimized code won't totally trash performance on Nvidia cards (like how Oxide just disabled their async compute optimizations for Nvidia cards; granted, that's a PC-only title, but it's basically the same thing). If they want a path optimized for Nvidia cards, then that's where Nvidia devrel comes in.

And some developers/publishers that have far more binding ties to Nvidia may be persuaded to drop any GCN optimizations in favor of a unified rendering path focused on Nvidia-based optimizations. The same could probably go for games with ties to AMD, but in those cases I'd imagine it's more likely that the devs just don't have the time to optimize for Nvidia hardware and will reuse as much console code as possible. AMD doesn't have the money or resources to get developers as tied into their ecosystem as Nvidia does.

Regards,
SB
 
You've got it all wrong. That's not how async works; in fact it's the complete opposite of that - it's about load balancing, not reservation. We've evolved to a Unified Shader Architecture for a reason.


What? No no no, lol. Yeah, at this point it seems like you are getting many things confused.
 
It's about making money for the developer. If they are making a PC port, they still want to make money, because they will be spending money; "ignoring" 80% of the PC market will not make them money, no matter how you want to talk about it.

Oxide is a poor example of this. First off, they didn't make money from online sales for their game (Steam numbers show this right off the bat); they ripped AMD off, though, through bundling, lol. And don't try to convince me Oxide isn't/wasn't up AMD's butt, lol; they were the ones that started off AMD's marketing spew about async compute. If you go back and look at the timing of events, it read like a movie script, back and forth: two companies that didn't really know what was going on, one getting publicity for their game with false marketing and the other using that false marketing to pimp their cards.
 
There's another point that has been known for years and yet ignored in the discussion about how much performance there is to be gained...
GTX 980 has 64 ROPs and 16 SMs. A single SM can supply at most 4 pixels per clock on average. Which has a funny side effect that cards like GTX 970 or GTX 980 Ti can't reach their peak fillrate even with the simplest of pixel shaders! What is there to gain by squeezing some totally unrelated compute stuff in there? GP104 is basically the first one that breaks this.

Sorry to be so late to the party, but after Fermi (which indeed was imbalanced), raster, SM and ROP throughput were actually matched at the ASIC level, if not necessarily at the product level.

GM204 (GM200 is ×1.5 the ROPs and GPCs, so same ratios):
Raster: 16 ppc per GPC
SM*: 4 ppc, so 4 per GPC was a perfect match
ROP: 64 (which is # of GPCs × GPC's ppc)
*at least shader export rate to ROP crossbar.

GP104:
Raster: 16 ppc per GPC
SM*: 4 ppc, so 5 per GPC is limited by # of GPCs
ROP: 64 (matches GPC, but not SM rate)
So you theoretically could pre-assign one SM per GPC to compute and not impair any of the fill-rate bound operations.

Or did I miss something important here?
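Plugging commonly quoted unit counts into those ratios (taken as assumptions here; check the exact SKU) shows the same picture: the cut-down GM2xx parts are SM-export limited below their ROP count, while GP104 flips back to being ROP/raster limited:

```cpp
// Peak pixels/clock is limited by the minimum of raster, SM shader-export and ROP
// rates. GPC/SM/ROP counts below are assumptions from commonly quoted specs, not
// taken from the posts above.
#include <algorithm>
#include <cstdio>

int main()
{
    struct Sku { const char* name; int gpcs, sms, rops; };
    const Sku skus[] = {
        { "GTX 980 (GM204)",    4, 16, 64 },
        { "GTX 970 (GM204)",    4, 13, 56 },
        { "GTX 980 Ti (GM200)", 6, 22, 96 },
        { "GTX 1080 (GP104)",   4, 20, 64 },
    };
    for (const Sku& s : skus) {
        int raster = s.gpcs * 16;  // 16 pixels/clock per GPC rasterizer
        int smExp  = s.sms  * 4;   // ~4 pixels/clock shader export per SM
        int peak   = std::min({ raster, smExp, s.rops });
        std::printf("%-20s raster %3d, SM export %3d, ROP %3d -> limit %3d px/clk (%s-bound)\n",
                    s.name, raster, smExp, s.rops, peak,
                    peak == s.rops ? "ROP" : (peak == smExp ? "SM" : "raster"));
    }
    return 0;
}
```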
 
It's about making money for the developer. If they are making a PC port, they still want to make money, because they will be spending money; "ignoring" 80% of the PC market will not make them money, no matter how you want to talk about it.

Oxide is a poor example of this. First off, they didn't make money from online sales for their game (Steam numbers show this right off the bat); they ripped AMD off, though, through bundling, lol. And don't try to convince me Oxide isn't/wasn't up AMD's butt, lol; they were the ones that started off AMD's marketing spew about async compute. If you go back and look at the timing of events, it read like a movie script, back and forth: two companies that didn't really know what was going on, one getting publicity for their game with false marketing and the other using that false marketing to pimp their cards.
Well, another example would be Quantum Break - yeah, the patch improved things a bit for Nvidia, more so for the 980 Ti than any other pre-Pascal card.
Let me know how the post-processing volumetric lighting/AO works on Nvidia :)
The subtle changes with the 1070/1080 have changed that performance trend for the better when it comes to the general optimisation/engine design applied to such games, but how this holds up against Polaris will be interesting.
Cheers
 
Ah but it's much more subtle on Maxwell. Maxwell implements a sort of immediate-mode tiling that reorders shading and ROP work into a TBDR-style pattern where possible, over some window of geometry (up to ~1k triangles-ish IIRC, although it's SKU dependent). Within a tile, it doesn't need to go to the "external" ROPs, similar to TBDRs which don't even have ROPs. Write a ROP benchmark on a very small texture that fits entirely in one tile (say 64x64 or so) and see it far exceed its theoretical ROP throughput :) Most ROP benchmarks use large triangles and thus don't see this effect, but it definitely offsets some of the disparity in more realistic geometry configurations.
This sounds like a big deal.
 