Comparative consideration of DirectX 12 in games *spawn

I still somewhat disagree with the idea of giving graphics programmers the ability to do generalized indirect shader dispatches, since that will just encourage spilling.
Sure, but as with many such discussions in the past (we don't want to do dynamic branching on GPUs as that just encourages wasted lanes; we don't want to add caches to GPUs because it's better to prefetch or handle it with occupancy; we don't want to do raytracing on GPUs because it is fundamentally divergent, bindless, random memory access; I can keep going...) I think we as an industry are increasingly in denial about how this hardware is and will be used to produce the next generation of graphics.

Regular computation with predictable memory access patterns is great... until it isn't. Much of graphics is finding the right balance between regularity and efficient scaling of course, but as we move into the next phase of global queries and tons of user-space acceleration structures the architectures simply have to evolve or they will get crushed by inefficiency. There's obviously some push and pull, but ultimately if you want stuff like GI, you need a significant amount of what we would typically call "inefficient", irregular and often not-particularly-data-parallel computation. I hesitate to invoke the wrath of the "we just want texture mapped polygons at NATIVE 8k" crowd, but I think most would agree that a better use of the transistors at this point is to do stuff like RT, Nanite, Lumen, etc. But of course that stuff is not usually primarily limited by raw SIMD flops... to make it faster you need more robust hardware that doesn't fall on its face when you hit that one shader in the scene that needs a bazillion registers, or need to spawn some additional work on the GPU and so on.

If anyone else attempted to pull off a similar move in a competitive environment such as the desktop or mobile graphics space, they'd either sink (hardware complexity/unsatisfactory performance) or swim (apps start taking advantage of the feature). Even Apple's dynamic register caching solution has limits: there's a *specific threshold* past which just enough spilling will start cratering their performance.
Possibly, but before they did it, all the IHVs had been saying it was impossible for over a decade (and I say this from personal experience at an IHV), just as they did a decade before that with raytracing. I don't expect version 1 of either to be amazing, but we have an existence proof now and the needle can continue to shift.

I think the pull towards more generally robust performant hardware is undeniable at this point, at least on the high end. We're getting nowhere near the peak flops of these machines in complex workloads, so the solution is clearly not to just keep laying down more ALUs and shipping the next SKU. Some decent chunk of the transistors and logic needs to go into making the current hardware run more efficiently.

The only real pull back in the other direction at this point is ML to be honest. And hell maybe we end up in a world where some significant chunk of rendering is just a giant, regular ML matrix multiply and it's enough more power efficient that it doesn't matter if it's using way more theoretical FLOPS to do it. That said, there's also some indication that the ML stuff needs to get pulled back a bit in the other direction, exposing the ability to do finer-grained small kernels in-line with more general compute rather than monolithic stop-the-world dispatches.

As always, it'll be interesting to see where this all goes, but it's clearly not sustainable to keep shipping literally gigabytes of shader permutations due to fear about GPU performance if they are asked to do something so complicated as a function call... It's bad all the way from multi-hour/day developer game cook times to end users dealing with large downloads and last-second JITs.
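To make the scale of the permutation problem concrete, here's a toy back-of-the-envelope sketch (the feature and material counts below are made up, not from any particular engine): every boolean that gets baked in at compile time, instead of branched on or called at runtime, multiplies the number of pipeline states that have to be compiled, shipped and cached.

```cpp
// Toy illustration of compile-time permutation explosion (hypothetical counts).
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t compileTimeToggles = 12; // e.g. shadows, fog, skinning, ...
    const uint32_t materialVariants   = 64; // artist-authored material templates
    const uint32_t vertexFactories    = 8;  // mesh types feeding the same shaders

    // Each boolean toggle doubles the count; the other axes multiply it further.
    const uint64_t permutations =
        (1ull << compileTimeToggles) * materialVariants * vertexFactories;

    std::printf("Shader permutations to compile/ship: %llu\n",
                static_cast<unsigned long long>(permutations));
    return 0;
}
```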

I don't think supporting GPUs that were not released as DX12 GPUs helped things.
I think most of the IHVs would agree on this point, but for good or for bad I think it was important for Microsoft to be able to tell people at the time that they should just upgrade to Windows 10 without worrying about buying a new PC to get all these benefits. The fact that existing hardware like Haswell IGP had to be supported (gotta be able to claim some high % of systems can upgrade to DX12) is certainly constraining to what you'd really want to do with APIs. On the minor upside, prototyping on hardware that exists did help avoid some of the other pitfalls of previous API versions.

Microsoft avoided some of the worst of the legacy issues with things like the feature levels, heap tiers and now "DX12 Ultimate", but there's certainly a few areas of the API that could be simpler if it wasn't for that initial hardware. That said, I will say that once you start digging into the details of state management for PSOs it becomes pretty clear that there's a lot more divergence between hardware than you might think, especially if you ever want to include mobile hardware. Vulkan has certainly fallen into its own additional pitfalls on that front.
 
Doesn't really matter since your incorrect example wasn't all that helpful for your justification of hard resets and compatibility breaks.
Never harmed PC before.
Developers could very much use D3D11 without all the new features on DX9 hardware.
And yet you couldn't run a DX11 path on a DX9-only GPU, so it's a moot point.

If you wanted to run Crysis 2 in DX11 mode, you needed a DX11 GPU.

Just like you couldn't run the DX10 path on a DX9-class GPU: even if the game didn't use any of the DX10 'benefits' it still wouldn't run (again, try to get Crysis 2007 to run in DX10 mode on a DX9 GPU).
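For reference, this is roughly how the downlevel story works in code. A hedged sketch, not production code: the D3D11 API will happily create a device on DX9-class hardware at a 9_x feature level, but anything gated on 11_0 (the actual "DX11 path") is still off the table there.

```cpp
// Hedged sketch, not production code: the D3D11 API will create a device on
// DX9-class hardware at a 9_x feature level, but the 11_0-only features (and
// thus a game's "DX11 path") remain unavailable there.
#include <windows.h>
#include <d3d11.h>
#pragma comment(lib, "d3d11.lib")

bool CreateDeviceOnWhateverIsInstalled(ID3D11Device** outDevice,
                                       ID3D11DeviceContext** outContext,
                                       D3D_FEATURE_LEVEL* outLevel) {
    // Ask for the best level the adapter supports, falling back to 9_1.
    const D3D_FEATURE_LEVEL requested[] = {
        D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0,
        D3D_FEATURE_LEVEL_9_3,  D3D_FEATURE_LEVEL_9_2,  D3D_FEATURE_LEVEL_9_1,
    };
    HRESULT hr = D3D11CreateDevice(
        nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
        requested, ARRAYSIZE(requested), D3D11_SDK_VERSION,
        outDevice, outLevel, outContext);
    // On a DX9-class GPU this succeeds, but only at 9_x, so the app has to
    // skip its 11_0 render path entirely, which is why Crysis 2's DX11 mode
    // still required a DX11 GPU.
    return SUCCEEDED(hr);
}
```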
Hardware tessellation is a bit ironic to bring up, since it's irrelevant these days and the older ways aged better. That would serve as a possible case against obsoleting perfectly functional older hardware, especially when nobody knows whether new feature xyz will stand the test of time and remain relevant...?
 
I think the pull towards more generally robust performant hardware is undeniable at this point, at least on the high end. We're getting nowhere near the peak flops of these machines in complex workloads, so the solution is clearly not to just keep laying down more ALUs and shipping the next SKU. Some decent chunk of the transistors and logic needs to go into making the current hardware run more efficiently.
The split register/shared memory/etc. design Apple has going is a great solution, and shows the limits of adding yet more caches. SRAM doesn't scale anymore, and the more caches you have the longer you spend chasing data and dealing with tags, the higher your latency and lower your efficiency.

Caches have already blown up to about as big as they can get, with both Nvidia and AMD using 50 MB+ shared caches at this point. I think we'll see Apple's solution here as well, which is putting DRAM on package instead of soldering it. This solves the problem by just moving main memory closer to the silicon and lowering latency altogether. I wouldn't be surprised if we see RDNA4 turn up with it, for example.

The only real pull back in the other direction at this point is ML to be honest. And hell maybe we end up in a world where some significant chunk of rendering is just a giant, regular ML matrix multiply and it's enough more power efficient that it doesn't matter if it's using way more theoretical FLOPS to do it. That said, there's also some indication that the ML stuff needs to get pulled back a bit in the other direction, exposing the ability to do finer-grained small kernels in-line with more general compute rather than monolithic stop-the-world dispatches.

As always, it'll be interesting to see where this all goes, but it's clearly not sustainable to keep shipping literally gigabytes of shader permutations due to fear about GPU performance if they are asked to do something so complicated as a function call... It's bad all the way from multi-hour/day developer game cook times to end users dealing with large downloads and last-second JITs.

I just can't see ML for rendering itself. If you want to statistically guess something, like upscaling/denoising/similar, which is inherently a guess and thus inherently statistics, then it's great. Or other gamedev tasks: I'm looking forward to ML animation controllers becoming usable, because the experiments there are great and could bring the long-held dream of highly procedural/adaptive animation in games to life; I still remember a Fable 2 test demo of a character swinging down from a chandelier during a sword fight, and it seems like we're finally almost there. But as we've seen with NeRFs, they're just vanishing into Gaussian splatting, and the same is going to be true with all the other direct rendering. Why bother with multiple layers of matrices when you can just get the answer you want far more directly?
 
I just can't see ML for rendering itself. If you want to statistically guess something, like upscaling/denoising/similar, which is inherently a guess and thus inherently statistics, then it's great. Or other gamedev tasks: I'm looking forward to ML animation controllers becoming usable, because the experiments there are great and could bring the long-held dream of highly procedural/adaptive animation in games to life; I still remember a Fable 2 test demo of a character swinging down from a chandelier during a sword fight, and it seems like we're finally almost there. But as we've seen with NeRFs, they're just vanishing into Gaussian splatting, and the same is going to be true with all the other direct rendering. Why bother with multiple layers of matrices when you can just get the answer you want far more directly?
It's possible, but I'm not willing to place my bet yet :D Stuff like the Corridor Crew "use Unreal to generate a low-fi version of your environment, then ML it into something stylized, then ML/"deflicker" that back into something temporally stable-ish" could evolve into something usable, and has some advantages in terms of art retargeting of course. Then again it possibly hits some walls that just can't be resolved, but as I said, I'm just not yet willing to place a personal bet given how fast that field is moving currently.
 
@Andrew Lauritzen Eventually there will come a point where hardware evolution will plateau and stagnate for long periods of time, so we can't just expect IHVs to freely experiment with whatever seemingly desirable features the industry may want, as they did in the past. Hardware vendors are justifiably becoming more conservative in design as time goes on, since they can't expect to simply keep riding a trend that's slowing down. Progression in hardware could come to a halt as soon as the next generation arrives, and any hopes one may have had for "convergence" in hardware design will be cruelly shattered in an instant. We have to sympathize that IHVs are on borrowed time.
 
@Andrew Lauritzen Eventually there will come a point where hardware evolution will plateau and stagnate for long periods of time, so we can't just expect IHVs to freely experiment with whatever seemingly desirable features the industry may want, as they did in the past. Hardware vendors are justifiably becoming more conservative in design as time goes on, since they can't expect to simply keep riding a trend that's slowing down. Progression in hardware could come to a halt as soon as the next generation arrives, and any hopes one may have had for "convergence" in hardware design will be cruelly shattered in an instant. We have to sympathize that IHVs are on borrowed time.
Agreed, but I don't think that implies that the tradeoffs will change in a way that means we'll revert to baked lighting and texture maps or anything. If anything the burden will be even more on software to do clever-er things and the hardware that is more robust to those uses is likely to win out in the long run there. There's just only so much you can do with flat FLOPS in terms of making an image better without needing to invoke better data structures and queries.

Maybe we'll get even more dedicated hardware for even more tasks too, but at some level it's hard to undo programmability once you have had it without regressions. That one is an even harder "sell your next SKU" ask than a bit of a performance regression, I think.
 
@Andrew Lauritzen Eventually there will come a point where hardware evolution will plateau and stagnate for long periods of time, so we can't just expect IHVs to freely experiment with whatever seemingly desirable features the industry may want, as they did in the past. Hardware vendors are justifiably becoming more conservative in design as time goes on, since they can't expect to simply keep riding a trend that's slowing down. Progression in hardware could come to a halt as soon as the next generation arrives, and any hopes one may have had for "convergence" in hardware design will be cruelly shattered in an instant. We have to sympathize that IHVs are on borrowed time.

But hardware changes need to happen even more now. Moore's Law has been gone for a while, re-written from the previously accepted idiom to mean "silicon gets better at all" by IHVs and foundries in a desperate bid to seem optimistic. And it's not like there's any dramatic slowdown in competition; sure, phones are stable, so now we need "edge" (on-device) computed AI magic and AR headsets and so on. So all that's left is hardware optimization: if they can't just buy more performance per watt/dollar from silicon, then they have to work all the harder on hardware design.

This is already pretty much dogma across IHVs: can't compete on transistor shrinks, so compete on hardware design. And convergence will definitely happen; it's already happened. Everyone hires each other's engineers every chance they get to see what the others do and how they do it. That's why CPUs and GPUs etc. look so similar instead of being some crazy out-there architecture; remember Intel's wacky x86/AVX attempt at a GPU? We don't see that at all anymore, except in the absolute newest fields like AI where best practices haven't settled down yet.
 
Agreed, but I don't think that implies that the tradeoffs will change in a way that means we'll revert to baked lighting and texture maps or anything. If anything the burden will be even more on software to do clever-er things and the hardware that is more robust to those uses is likely to win out in the long run there. There's just only so much you can do with flat FLOPS in terms of making an image better without needing to invoke better data structures and queries.

Maybe we'll get even more dedicated hardware for even more tasks too, but at some level it's hard to undo programmability once you have had it without regressions. That one is an even harder "sell your next SKU" ask than a bit of a performance regression, I think.
When looking at the disconnect between the APIs and hardware capability, where in the chain is/are the failure point(s) for this? What are the non-obvious reasons why the IHVs have been reluctant to change the status quo for so long?
 
I think the pull towards more generally robust performant hardware is undeniable at this point, at least on the high end. We're getting nowhere near the peak flops of these machines in complex workloads, so the solution is clearly not to just keep laying down more ALUs and shipping the next SKU. Some decent chunk of the transistors and logic needs to go into making the current hardware run more efficiently.

Amen. ALU utilization actually seems to be getting better in the modern, more compute-heavy engines that I've profiled, but it's nowhere near ideal. Would love to see more transistors spent on feeding the flops we already have.

When looking at the disconnect between the APIs and hardware capability, where in the chain is/are the failure point(s) for this? What are the non-obvious reasons why the IHVs have been reluctant to change the status quo for so long?

It's a chicken and egg situation with hardware and software. Which APIs and games are going to support your radical new hardware architecture on day one? Look at what happened with Turing and that wasn't very radical and also had API support. It didn't stop people from clamoring for the good old days at 1000 fps.
 
Would love to see more transistors spent on feeding the flops we already have.
Feeding flops isn’t cheap. A crude way to think about it is that you basically need more MUXes to route stuff to where it needs to go, and more control to tell those MUXes what to do. That has area and more importantly power cost. Swinging MUXes around is power.

On the flip side adding flops of course costs area and leakage, and actually increases the distance wires have to move, increasing dynamic power as well. But it may be somewhat better for thermals.

It’s a non-trivial tradeoff but my point is that having some dark-silicon flops isn’t the worst thing in the world — to an extent. Striving for 100% utilization may be counterproductive.
 
It's a chicken and egg situation with hardware and software. Which APIs and games are going to support your radical new hardware architecture on day one? Look at what happened with Turing and that wasn't very radical and also had API support. It didn't stop people from clamoring for the good old days at 1000 fps.
I assumed, perhaps incorrectly, that these changes would be invisible to developers and would just allow better hardware utilization. As an example, I think registers are often a common bottleneck. Are these power heavy? Die area heavy? Something else which limits the feasibility of adding more?
 
But hardware changes need to happen even more now. Moore's Law has been gone for a while, re-written from the previously accepted idiom to mean "silicon gets better at all" by IHVs and foundries in a desperate bid to seem optimistic. And it's not like there's any dramatic slowdown in competition; sure, phones are stable, so now we need "edge" (on-device) computed AI magic and AR headsets and so on. So all that's left is hardware optimization: if they can't just buy more performance per watt/dollar from silicon, then they have to work all the harder on hardware design.
If some hardware vendor wants to stake their entire livelihood on a theoretical future that may or may not ever materialize, at a great performance/hardware-complexity cost on contemporary applications or APIs, then they can feel free to do so, but they shouldn't have regrets if the competition clearly wins out by not following them.

I think that exposing true function calls may not be the way forward.
This is already pretty much dogma across IHVs: can't compete on transistor shrinks, so compete on hardware design. And convergence will definitely happen; it's already happened. Everyone hires each other's engineers every chance they get to see what the others do and how they do it. That's why CPUs and GPUs etc. look so similar instead of being some crazy out-there architecture; remember Intel's wacky x86/AVX attempt at a GPU? We don't see that at all anymore, except in the absolute newest fields like AI where best practices haven't settled down yet.
Convergence will only happen as much as is necessary for compatibility and competitive reasons. I can very much see us going back to "exotic" hardware design once the tap runs out, because what we define as an "improvement" is subjective for the most part and having one API to rule them all would be unsuitable at that point.
 
I think that exposing true function calls may not be the way forward.
That is a perfect example of people just missing the point and pretending the code we're writing is still streaming vector ops. And I say this having still spent the majority of my career at an IHV. (I will say that in this case that reply is in the context of an issue talking mostly about compile-time constructs, so I'm not picking on the particular reply or person, but it is a good example of a general statement that IHVs might make, so I'm going to treat it from that point of view.)

We need more folks at IHVs that understand the high level algorithms people are running as well as the lower level details, otherwise we keep getting into these situations where the default response is "don't do that, it's worse than DX9 rendering". But the question isn't whether it's worse than DX9 rendering, it's what is the most efficient way to accomplish the more complicated algorithms and data structures that we need to use to do higher quality rendering.

I'll tell you that doing 64-bit atomics for pixel writes is way slower than ROPs but weirdly Nanite does a much better job overall than the hardware currently. And it's not like the IHVs weren't aware that people would have liked a more efficient geometry pipeline that can handle more and smaller triangles since... forever. None of the IHVs would have recommended doing what Nanite did and many actively recommended against it but here we are.
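(For anyone unfamiliar with that trick, the rough idea is sketched below in plain C++ rather than HLSL, purely as an illustration and not as Nanite's actual code: pack depth into the high bits and a payload such as a visibility-buffer ID into the low bits, so that a single 64-bit atomic max performs the depth test and the write together. On the GPU this would be an InterlockedMax on a 64-bit UAV.)

```cpp
// Illustrative sketch only: a "software ROP" via 64-bit atomic max.
#include <atomic>
#include <cstdint>
#include <cstring>

// For non-negative floats the IEEE-754 bit pattern is monotonic, so packing the
// depth bits above the payload lets one atomic max do depth test + write.
// Assumes reversed-Z, so the larger packed depth is the closer surface.
inline uint64_t packPixel(float depth, uint32_t payloadId) {
    uint32_t depthBits;
    std::memcpy(&depthBits, &depth, sizeof(depthBits));
    return (static_cast<uint64_t>(depthBits) << 32) | payloadId;
}

// CPU stand-in for InterlockedMax on a 64-bit UAV texel.
inline void writePixel(std::atomic<uint64_t>& pixel, float depth, uint32_t payloadId) {
    uint64_t candidate = packPixel(depth, payloadId);
    uint64_t current = pixel.load(std::memory_order_relaxed);
    while (candidate > current &&
           !pixel.compare_exchange_weak(current, candidate,
                                        std::memory_order_relaxed)) {
        // compare_exchange_weak reloads 'current' on failure; loop until we
        // either win or a closer surface (larger packed value) already has.
    }
}
```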

Circling back, the question is not "are we going to do function calls and indirect dispatches". We already are. The question is can we do it at a finer grain and more efficiently than stalling literally the entire machine and draining the whole pipeline, then spinning it all back up. I guarantee that we can, and we can even do better than the current raytracing/callable shaders mechanism.
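To ground what "we already are" means: even vanilla D3D12 lets the GPU generate its own dispatch arguments via ExecuteIndirect. The sketch below (setup and error handling elided, and the buffer names are mine) shows that path, and also why it's coarse: the hand-off still happens at whole-dispatch granularity on the command list timeline, not as a call from inside a running shader.

```cpp
// Illustrative sketch only: GPU-generated dispatches via plain D3D12
// ExecuteIndirect. The compute pass that fills argsBuffer/countBuffer is
// assumed to have run earlier on the same queue.
#include <windows.h>
#include <d3d12.h>

void DispatchGpuGeneratedWork(ID3D12Device* device,
                              ID3D12GraphicsCommandList* cmdList,
                              ID3D12Resource* argsBuffer,   // D3D12_DISPATCH_ARGUMENTS records, GPU-written
                              ID3D12Resource* countBuffer,  // actual command count, GPU-written
                              UINT maxCommands) {
    // A command signature describing one indirect dispatch per record.
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DISPATCH;

    D3D12_COMMAND_SIGNATURE_DESC sigDesc = {};
    sigDesc.ByteStride       = sizeof(D3D12_DISPATCH_ARGUMENTS);
    sigDesc.NumArgumentDescs = 1;
    sigDesc.pArgumentDescs   = &arg;

    ID3D12CommandSignature* signature = nullptr;
    if (FAILED(device->CreateCommandSignature(&sigDesc, nullptr,
                                              IID_PPV_ARGS(&signature)))) {
        return;
    }

    // The GPU decides how many dispatches actually run (up to maxCommands) via
    // countBuffer, but each one is still a full, coarse-grained dispatch.
    cmdList->ExecuteIndirect(signature, maxCommands, argsBuffer, 0, countBuffer, 0);
    signature->Release();
}
```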
 
I assumed, perhaps incorrectly, that these changes would be invisible to developers and would just allow better hardware utilization.
I don't think we need huge API changes for a lot of improvements to happen. Step 1 is just to soften the cliffs in the occupancy curves. This will benefit existing applications as well, so it's not as if it's purely an investment in the future at the expense of the present.
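(A rough picture of where those cliffs come from, with purely hypothetical numbers rather than any specific GPU: resident wave count is gated by how many waves' worth of registers fit in the register file, so a handful of extra registers per thread can knock out a whole wave of latency hiding.)

```cpp
// Back-of-the-envelope occupancy-cliff illustration (hypothetical numbers).
#include <algorithm>
#include <cstdio>

int main() {
    const int registerFile   = 64 * 1024; // 32-bit registers per SIMD (assumed)
    const int threadsPerWave = 32;
    const int maxWaves       = 16;        // scheduler limit (assumed)

    for (int regsPerThread : {64, 96, 128, 132, 168, 256}) {
        int wavesThatFit = registerFile / (regsPerThread * threadsPerWave);
        int resident = std::min(wavesThatFit, maxWaves);
        std::printf("%3d regs/thread -> %2d resident waves (max %d)\n",
                    regsPerThread, resident, maxWaves);
    }
    return 0;
}
```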

API stuff is already in the pipe too that will really need/benefit from the hardware becoming more generic here. Raytracing is already here. Work graphs are pretty much here too. Both of these move us further towards a world where hardware that is better able to schedule a mix of regular and irregular work onto their execution units will tend to come out ahead. I certainly hope most IHVs have been working on this for the past while as playing catch-up in these sorts of areas is rarely fun.
 
That is a perfect example of people just missing the point and pretending the code we're writing is still streaming vector ops. And I say this having still spent the majority of my career at an IHV. (I will say that in this case that reply is in the context of an issue talking mostly about compile-time constructs, so I'm not picking on the particular reply or person, but it is a good example of a general statement that IHVs might make, so I'm going to treat it from that point of view.)
FWIW the statement came from a Microsoft representative. Here's what employees from an IHV had to say on the subject matter below in a mastodon thread ...
We need more folks at IHVs that understand the high level algorithms people are running as well as the lower level details, otherwise we keep getting into these situations where the default response is "don't do that, it's worse than DX9 rendering". But the question isn't whether it's worse than DX9 rendering, it's what is the most efficient way to accomplish the more complicated algorithms and data structures that we need to use to do higher quality rendering.
If you can demonstrate to IHVs how function calls can be a performance improvement in a relevant scenario (i.e. Lumen), as opposed to the alternative methods on current hardware, then they might consider your case. If you have to use console APIs or CUDA for the above purpose to construct some compelling data for your argument, then do that as well!
I'll tell you that doing 64-bit atomics for pixel writes is way slower than ROPs but weirdly Nanite does a much better job overall than the hardware currently. And it's not like the IHVs weren't aware that people would have liked a more efficient geometry pipeline that can handle more and smaller triangles since... forever. None of the IHVs would have recommended doing what Nanite did and many actively recommended against it but here we are.
Nanite effectively *skips* aspects of the hardware pipeline like quad occupancy issues and respecting primitive order, which is prohibitively expensive with highly dense geometry, hence why we see it doing less work overall for the same result. Even if you can *skip* work with function calls, is doing less work somehow faster in any pertinent example?
Circling back, the question is not "are we going to do function calls and indirect dispatches". We already are. The question is can we do it at a finer grain and more efficiently than stalling literally the entire machine and draining the whole pipeline, then spinning it all back up. I guarantee that we can, and we can even do better than the current raytracing/callable shaders mechanism.
I wouldn't necessarily describe RT pipelines/callable shaders as "real" function calls since all entries in the shader binding tables can be potentially inlined by the driver.
 
FWIW the statement came from a Microsoft representative. Here's what employees from an IHV had to say on the subject matter below in a mastodon thread ...
Those are responding to a post that is indeed missing some of the fundamental points. And yet they also sort of concede the general point too...

If you can demonstrate to IHVs how function calls can be a performance improvement in a relevant scenario (i.e. Lumen), as opposed to the alternative methods on current hardware, then they might consider your case.
Oh let me be clear here: I'm having this discussion here for the sake of the fun soapbox and community involvement given the - frankly otherwise boring - direction this thread was headed. There's no shortage of direct communication between the involved parties for the past decade or more on these issues and more general agreement than disagreement.

Nanite effectively *skips* aspects of the hardware pipeline like quad occupancy issues and respecting primitive order, which is prohibitively expensive with highly dense geometry, hence why we see it doing less work overall for the same result.
The point of the example was it's a pretty clear case where IHVs defaulted to "this is not the way to do it, just do the simple thing the hardware is designed for" and missed the forest for the trees.
 
Those are responding to a post that is indeed missing some of the fundamental points. And yet they also sort of concede the general point too...


Oh let me be clear here: I'm having this discussion here for the sake of the fun soapbox and community involvement given the - frankly otherwise boring - direction this thread was headed. There's no shortage of direct communication between the involved parties for the past decade or more on these issues and more general agreement than disagreement.


The point of the example was it's a pretty clear case where IHVs defaulted to "this is not the way to do it, just do the simple thing the hardware is designed for" and missed the forest for the trees.
In the Nanite example, what was the approach the IHVs would have recommended that suits the hardware design better? Or did they consider current geometry levels sufficient?
 
In the Nanite example, what was the approach the IHVs would have recommended that suits the hardware design better? Or did they consider current geometry levels sufficient?
For the most part various shades of "just use bigger triangles", as with the other past examples. As to the current case, I haven't heard any IHVs coming up with other solutions that hit at the roots of the PSO/permutation explosion, occupancy and inlining issues... they are mostly just picking at the edges but otherwise arguing to maintain the status quo because changing that is hard. And it is, but they don't really see the full extent of the fallout and how unworkable it is becoming.
 
That's very sad to read. I was expecting more advancement since the introduction of DX12 and the ray tracing mantra, and with two generations of process nodes since, but reading that all IHVs are just turning a blind eye to those issues blows a crater in my expectations for the next generation of GPUs. Guess I'll skip another generation.
 
That's very sad to read. I was expecting more advancement since the introduction of DX12 and the ray tracing mantra, and with two generations of process nodes since, but reading that all IHVs are just turning a blind eye to those issues blows a crater in my expectations for the next generation of GPUs. Guess I'll skip another generation.
That is perhaps overstating things; don't take my soapbox airing of some frustrations in too cynical a light here. I do think too little focus is being put on performance robustness and static resource allocation and occupancy and permutations and all that mess, but I wouldn't say there's *no* work there. Some amount of progress is necessary to get raytracing working better, and I'm happy that we have that as a forcing function. As long as folks build the hardware that makes raytracing-driven executions go faster in a relatively generic way, the benefits will go far beyond raytracing itself. RT at least gives IHVs an immediate motivation around current benchmarks which is important to get things prioritized.
 