Shader Compilation on PC: About to become a bigger bottleneck?

Obsolete features meaning geometry and tessellation shaders? GPL supports those shader stages too. The only real difference seems to be that AMD hardware is optimized for rolling vertex/geometry/tessellation shaders into one combined “primitive shader” stage while Nvidia retained support for dynamically linking those discrete stages at runtime.
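(For the curious: this is roughly what that runtime linking looks like through Vulkan's VK_EXT_graphics_pipeline_library. A minimal sketch, assuming the four part-pipelines were built elsewhere with VK_PIPELINE_CREATE_LIBRARY_BIT_KHR; error handling omitted.)

```cpp
#include <vulkan/vulkan.h>

// Sketch: link four previously built pipeline "libraries" into a usable
// pipeline at runtime. Assumes VK_EXT_graphics_pipeline_library is enabled.
VkPipeline linkPipeline(VkDevice device,
                        VkPipeline vertexInputLib,
                        VkPipeline preRasterLib,   // vertex/tess/geometry part
                        VkPipeline fragmentLib,
                        VkPipeline fragmentOutputLib)
{
    const VkPipeline libs[] = { vertexInputLib, preRasterLib,
                                fragmentLib, fragmentOutputLib };

    VkPipelineLibraryCreateInfoKHR link{};
    link.sType        = VK_STRUCTURE_TYPE_PIPELINE_LIBRARY_CREATE_INFO_KHR;
    link.libraryCount = 4;
    link.pLibraries   = libs;

    // A "fast link" (no link-time-optimization flag) is cheap enough to do
    // near draw time, which is what enables SSO-like mixing and matching.
    VkGraphicsPipelineCreateInfo info{};
    info.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
    info.pNext = &link;

    VkPipeline pipeline = VK_NULL_HANDLE;
    vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &info, nullptr, &pipeline);
    return pipeline;
}
```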
Obsolete features as in SSOs for everything except maybe fragment shaders. Possible future hardware designs may take even more advantage of PSOs, so maybe not even SSOs for fragment shaders either ...

In spirit, tessellation and geometry shaders are heading towards obsolescence as the industry learns more. It may be that in the not-so-far-off future they'll be obsolete in principle too, since they're in competition with mesh shading. Mesh shading isn't a replacement for the old geometry pipeline since it isn't compatible with transform feedback. AMD's implementation of mesh shading, though, can potentially be a true replacement for the current geometry pipeline, since primitive shaders are compatible with transform feedback. A more monolithic approach also translates to a more flexible/general-purpose/stateless hardware design. Not including ray tracing, Nvidia has at least 8 different HW shader stages (vertex/hull/domain/geometry/amplification/mesh/pixel/compute) while AMD only needs 4 HW shader stages (hull/geometry/pixel/compute), and ray tracing ends up being compute shaders over there for good measure ...
GPL isn’t any more forward looking than SSO. It simply maps better to AMD hardware.

Nvidia has a beta driver out with GPL support on day one, so it seems they're at least trying to fix the problem. Has any IHV implemented official support for GPL since it launched a year ago? It will be interesting to see where Microsoft takes DX13. Maybe they will drop support altogether for the legacy geometry pipeline.
It could be that we may still stick with PSOs in the end, but there is nearly no industry momentum to have SSOs in their purest form. Both GPL and SSOs are a lot of work to implement so it'll take some time to see them, if at all, but I wouldn't count on converging to the latter since there's still quite a bit of inertia against it (everyone NOT Nvidia) ...
 
It is not "opposite for Nvidia", they do both equally well (they actually do PSOs better than AMD atm if we take the issues with TLOU on PC as a sign of who's doing what in that regard right now). It is the deficit of non-Nv h/w which should be fixed in h/w - finally, as it's been more than 10 years of IHVs transferring this issue to s/w vendors instead.

And no, this is not "implementing D3D11 in h/w" because it has nothing to do with D3D11 beyond the fact that D3D11 s/w runs better on h/w which has fast state management. If you expose this advantage in a modern API like VK or D3D12 then it suddenly becomes "implementing the new feature of D3D12 in h/w". Sounds quite a bit different, doesn't it?
Still doesn't change the fact that the idea of Intel endorsing SSOs is tone deaf since that's the absolute worst model for them ...

"Fast state management" is just the end result of hardware closely implementing the software model which is pretty much just Nvidia implementing D3D11 in their hardware no matter how you word it or slice it. I can't see how what I'm describing here isn't apt since it's more precise and accurate than your description ...

My statement is even more true when you consider how Nvidia HW supports the D3D11 resource binding model in hardware, with D3D11-style bound textures and constant buffers as well ...
What gives you the idea that Nvidia doesn't work with Microsoft on evolving D3D too? RT was added in D3D12 as an Nvidia-exclusive feature. A bunch of other changes to stuff like RS and such were implemented because the original spec missed some things in Nv h/w, making it run worse than it could. How are PSOs any different?
The general rule of thumb is that 2 vendors must support a feature in order for Microsoft to expose an API for it. Effectively this means that the tie-breaker (either AMD or Intel) holds a filibuster. A Microsoft representative disclosed (timestamps 18:54 and 19:06) that they're mostly on the sidelines watching what happens and how it develops, so they have no intention of immediately forcing anyone to implement anything yet ...

It's unlikely that AMD or Intel will ever want to implement SSOs, so unless Nvidia can convince one or the other, D3D12 won't be getting the feature anytime soon ...
 
"Fast state management" is just the end result of hardware closely implementing the software model
Even if that is true (which I very much doubt, considering how the h/w is usually made and why APIs are made after the h/w and not vice versa) you could apply the same logic to other h/w - that AMD's h/w is "closely implementing the D3D12 model" - which means that all the issues we have at present with games running through that model are issues of the h/w which should be solved in h/w.

As I've said it's not like AMD h/w is doing any better with PSOs.

Also what Faith wrote makes me think that GPLs will require about the same h/w improvements as SSOs while still being very complex to implement for API users in contrast to SSOs.
 
Obsolete features as in SSOs for everything except maybe fragment shaders. Possible future hardware designs may take even more advantage of PSOs, so maybe not even SSOs for fragment shaders either ...

The immutable precompiled PSO model has been given a fair shake already and been proven not to work well because of the dynamic nature of runtime game workloads. What’s going to change to make PSOs more viable?

A more monolithic approach also translates to a more flexible/general-purpose/stateless hardware design. Not including ray tracing, Nvidia has at least 8 different HW shader stages (vertex/hull/domain/geometry/amplification/mesh/pixel/compute) while AMD only needs 4 HW shader stages (hull/geometry/pixel/compute), and ray tracing ends up being compute shaders over there for good measure ...

If AMD’s hardware is more flexible then it presumably wouldn’t have any problem breaking down the pre-raster pipeline into more discrete steps. It seems that it’s actually less flexible, i.e. you “must” treat geometry processing as a single compiled stage to get the most out of the hardware, while it seems Nvidia gives you the option to define geometry processing as either a monolithic stage or multiple decoupled stages that can be mixed and matched at runtime. So how is Nvidia’s approach not the more flexible option here?

Essentially on hardware that supports SSO implementing GPL and PSO is trivial. It doesn’t work the other way around. The narrative that SSO is bad but GPL is good doesn’t really make sense. Ideally we would blow all of this up and force everyone to use mesh shaders or do rasterization in compute ala Nanite. It’s currently a mess. Nobody is even using mesh shaders yet.

It makes sense that Microsoft doesn’t want to push api features that IHVs don’t want to support. The end result though is that #stutterstruggle rolls on.
 
The immutable precompiled PSO model has been given a fair shake already and been proven not to work well because of the dynamic nature of runtime game workloads. What’s going to change to make PSOs more viable?
PSOs allow hardware designers to move more state from hardware to programs/software. Hardware features essentially become software features ...
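A minimal sketch of what that means in practice, assuming a Vulkan-style PSO (the surrounding setup of shader modules, layout, render pass etc. is hypothetical and omitted): every field below is fixed before compilation, so a vendor can choose to implement any of it as shader code instead of hardware state.

```cpp
#include <vulkan/vulkan.h>

// Sketch: the kind of state a PSO bakes in at creation time. Because all of
// this is known before vkCreateGraphicsPipelines() runs, a driver is free to
// turn any of it into compiled shader code instead of hardware registers.
void describeBakedState()
{
    VkPipelineInputAssemblyStateCreateInfo ia{};
    ia.sType    = VK_STRUCTURE_TYPE_PIPELINE_INPUT_ASSEMBLY_STATE_CREATE_INFO;
    ia.topology = VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST;

    VkPipelineDepthStencilStateCreateInfo ds{};
    ds.sType           = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO;
    ds.depthTestEnable = VK_TRUE;
    ds.depthCompareOp  = VK_COMPARE_OP_LESS_OR_EQUAL;

    VkPipelineColorBlendAttachmentState blend{};
    blend.blendEnable         = VK_TRUE;
    blend.srcColorBlendFactor = VK_BLEND_FACTOR_SRC_ALPHA;
    blend.dstColorBlendFactor = VK_BLEND_FACTOR_ONE_MINUS_SRC_ALPHA;
    blend.colorBlendOp        = VK_BLEND_OP_ADD;

    // All of the above gets chained into one VkGraphicsPipelineCreateInfo and
    // compiled up front; change any field and you need (and compile) a new PSO.
    (void)ia; (void)ds; (void)blend;
}
```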
If AMD’s hardware is more flexible then it presumably wouldn’t have any problem breaking down the pre-raster pipeline into more discrete steps. It seems that it’s actually less flexible, i.e. you “must” treat geometry processing as a single compiled stage to get the most out of the hardware, while it seems Nvidia gives you the option to define geometry processing as either a monolithic stage or multiple decoupled stages that can be mixed and matched at runtime. So how is Nvidia’s approach not the more flexible option here?

Essentially on hardware that supports SSO implementing GPL and PSO is trivial. It doesn’t work the other way around. The narrative that SSO is bad but GPL is good doesn’t really make sense. Ideally we would blow all of this up and force everyone to use mesh shaders or do rasterization in compute ala Nanite. It’s currently a mess. Nobody is even using mesh shaders yet.

It makes sense that Microsoft doesn’t want to push api features that IHVs don’t want to support. The end result though is that #stutterstruggle rolls on.
You're conflating performance with flexibility ...

Special graphics state exists to speed up the graphics pipeline, but the limitation imposed by specialized graphics state is that its usage is restricted to a specific context ...

Mobile graphics may not have the blending units that we're used to seeing on desktop graphics hardware, so they implement blending equations by embedding the blend state into the program itself, where desktop hardware would normally set specific registers to configure the fixed-function blending unit for the right equation. The big implication in my example is that mobile hardware is running a *program* while desktop hardware is running a *finite-state machine*. This ultimately means that mobile hardware can make blending a programmable operation while on desktop hardware it stays a fixed operation, since the former doesn't have a distinct/separate stage for that process in its hardware pipeline and blending instead becomes part of the fragment shader. I don't see how any graphics programmer would argue that the hardware in that case is less flexible when they've gained another programmable stage of the graphics pipeline, even though more PSOs were needed because of the permutations caused by the many different states that have to be embedded into the binaries ...
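A purely conceptual sketch of that split, with made-up names (BlendState, writeBlendRegisters and emitBlendEpilogue are not from any real driver):

```cpp
#include <string>

// Conceptual only: at PSO compile time a driver can either program a
// fixed-function unit or specialize the shader itself. Everything here
// is hypothetical; no real driver API is being shown.
struct BlendState { bool enable; int srcFactor; int dstFactor; };

void writeBlendRegisters(const BlendState&);       // desktop-style path
std::string emitBlendEpilogue(const BlendState&);  // mobile-style path

void compileBlend(const BlendState& bs, bool hasFixedFunctionBlender,
                  std::string& fragmentShaderCode)
{
    if (hasFixedFunctionBlender) {
        // Finite-state machine: blending stays a fixed operation driven by registers.
        writeBlendRegisters(bs);
    } else {
        // Program: the blend equation becomes fragment shader code - programmable,
        // but every distinct blend state now means another shader permutation (PSO).
        fragmentShaderCode += emitBlendEpilogue(bs);
    }
}
```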

AMD HW specifically can probably implement other unique geometry pipelines with more features/fewer restrictions in the shaders as well, besides the ones we already know like mesh shading or the current traditional geometry pipeline ...
 
Mobile graphics may not have the blending units that we're used to seeing on desktop graphics hardware, so they implement blending equations by embedding the blend state into the program itself, where desktop hardware would normally set specific registers to configure the fixed-function blending unit for the right equation. The big implication in my example is that mobile hardware is running a *program* while desktop hardware is running a *finite-state machine*. This ultimately means that mobile hardware can make blending a programmable operation while on desktop hardware it stays a fixed operation, since the former doesn't have a distinct/separate stage for that process in its hardware pipeline and blending instead becomes part of the fragment shader. I don't see how any graphics programmer would argue that the hardware in that case is less flexible when they've gained another programmable stage of the graphics pipeline, even though more PSOs were needed because of the permutations caused by the many different states that have to be embedded into the binaries ...
You can do both on h/w which provides "fixed functions" (again I'd call them "accelerators" since they are advantageous in a world where perf/watt matters more than anything else) but can only do these in s/w on h/w which doesn't. Thus the former is more flexible than the latter, not vice versa.
 
PSOs allow hardware designers to move more state from hardware to programs/software. Hardware features essentially become software features ...

Yes but the point is we’ve already tried that and it’s not working.

AMD HW specifically can probably implement other unique geometry pipelines with more features/fewer restrictions in the shaders as well, besides the ones we already know like mesh shading or the current traditional geometry pipeline ...

Ok but you haven’t addressed the question of why this greater hardware flexibility doesn't accommodate the SSO api (which you seem to think is less flexible). You’re making points from two conflicting perspectives. Either the hardware is flexible enough to support PSO/GPL/SSO or it isn’t.
 
The Nvidia design to me looks like it maintains legacy stages because there’s still a lot of legacy software. People are even still writing DX11 and OpenGL based software. In terms of the mesh and amplification stages, I guess on AMD they’re just Hull/Geometry stages?
 


The Nvidia design to me looks like it maintains legacy stages because there’s still a lot of legacy software. People are even still writing DX11 and OpenGL based software. In terms of the mesh and amplification stages, I guess on AMD they’re just Hull/Geometry stages?
That's assuming that these stages are any different from whatever units the h/w is using under the hood in modern APIs.
 
I found this document which basically outlines which software stages run on which hardware stages for AMD GPUs.


Now to see if I can find something similar for Nvidia.

Edit: Looks like the Linux drivers for Nvidia are still not fully open-sourced. I think we'd need the firmware component, which is still closed.
 

I've been saying Digital Foundry should be testing CPUs and how quickly they compile shaders in these games with pre-compilation processes for a while now, because it will be an important metric to know for people who are considering specific CPU purchases. It looks as if it will be more applicable in the future than ever before. Though in truth, as newer Sony games will likely take PC in mind from the very beginning, it's likely that these processes can be cut down a fair amount from what they are when PC isn't taken into consideration at all during development. The Detroit Become Human guys touched upon this when porting that game over to PC.. stating that their future game will be much more efficient in how they author shaders, vastly reducing the amount of permutations.

@Dictator a future video comparing low/mid/high end CPUs with a few of these games with pre-comp processes would be enlightening for users and very entertaining. Please consider it!
 
I've been saying Digital Foundry should be testing CPUs and how quickly they compile shaders in these games with pre-compilation processes for a while now, because it will be an important metric to know for people who are considering specific CPU purchases. It looks as if it will be more applicable in the future than ever before. Though in truth, as newer Sony games will likely take PC in mind from the very beginning, it's likely that these processes can be cut down a fair amount from what they are when PC isn't taken into consideration at all during development. The Detroit Become Human guys touched upon this when porting that game over to PC.. stating that their future game will be much more efficient in how they author shaders, vastly reducing the amount of permutations.
Well, consoles don't care about runtime compilation and on their fixed h/w the original PSO model works fine. You could say that it's a "console model" even, since they are the best fit for such an approach.

On PC though the model obviously doesn't work well, and thus there are already various "enhancements" in place which aim at limiting the number of PSOs generated by a game as much as possible - this was the topic of the last several pages, and games will pretty much be forced to use them - or eventually face hours of shader (re)compilation time.
 
Well, consoles don't care about runtime compilation and on their fixed h/w the original PSO model works fine. You could say that it's a "console model" even, since they are the best fit for such an approach.

On PC though the model obviously doesn't work well, and thus there are already various "enhancements" in place which aim at limiting the number of PSOs generated by a game as much as possible - this was the topic of the last several pages, and games will pretty much be forced to use them - or eventually face hours of shader (re)compilation time.
Most games do the compilation process in less than 5 minutes. Most Unreal Engine 4 games which pre-compile the PSOs take like 1-2min.

The issue of extremely long shader compilation processes is overblown by a few outliers, like this game and Detroit, for example, and usually by people on mid to low end CPUs. 16/24/32 thread CPUs chew through this stuff quite quickly. And I disagree with the last point, because I think long before it ever gets THAT bad, there will be infrastructure in place to augment the shader pre-compilation process by connecting to a server which will throw tons of CPU cores at the issue, compiling shaders many times quicker.
 
Most games do the compilation process in less than 5 minutes. Most Unreal Engine 4 games which pre-compile the PSOs take like 1-2min.
That's because they either don't generate many PSOs, don't compile them all during the precompilation step or use the options available to limit PSO numbers on PCs - or a mix of all three.

The issue of extremely long shader compilation processes is overblown by a few outliers, like this game and Detroit, for example, and usually by people on mid to low end CPUs. 16/24/32 thread CPUs chew through this stuff quite quickly.
Most people don't even have 16T CPUs, and as an owner of a 24T CPU I can't really agree that it's "quite quickly" in cases where it is actually visible.

And I disagree with the last point, because I think long before it ever gets THAT bad, there will be infrastructure in place to augment the shader pre-compilation process by connecting to a server which will throw tons of CPU cores at the issue, compiling shaders many times quicker.
This is likely to never happen.
 
Ok but you haven’t addressed the question of why this greater hardware flexibility doesn't accommodate the SSO api (which you seem to think is less flexible).
Even according to the link he posted, NVIDIA hardware is more flexible and faster in all modes.

18:59 karolherbst: well, unless I missed anything, at least on Nvidia hardware it should be all the same in the end, just that pipelines objects might use more memory?
18:59 gfxstrand: karolherbst: Yes, this is ideal for NVIDIA
....
19:03 gfxstrand: karolherbst: Yeah, NVIDIA really is the only hardware where all this is easy.

 
That's because they either don't generate many PSOs, don't compile them all during the precompilation step or use the options available to limit PSO numbers on PCs - or a mix of all three.


Most people don't even have 16T CPUs, and as an owner of a 24T CPU I can't really agree that it's "quite quickly" in cases where it is actually visible.


This is likely to never happen.
Right... which is what developers can do when they build a game with PC in mind from the start..

On Steam the amount of people with 6 physical cores or more is > 60%. So we're looking at at least 12+ threads for the majority of gamers on Steam at this point.

[image: Steam Hardware Survey chart of physical CPU core counts]


You can think what you want... but they aren't going to have people waiting literal hours to compile shaders... the industry will figure out how to augment any of these compilation steps and reduce the friction to gamers. Just look at what Valve is doing with Fossilize... essentially downloading the shaders first, and compiling them in the background as the game proper is downloaded.
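The plumbing for that already exists at the API level. A minimal sketch of persisting a Vulkan pipeline cache between runs, which is roughly the layer such background/precompile schemes feed into (error checks omitted):

```cpp
#include <vulkan/vulkan.h>
#include <fstream>
#include <iterator>
#include <vector>

// Load a previously saved cache so pipelines created this run can
// reuse earlier compilation work instead of compiling from scratch.
VkPipelineCache loadCache(VkDevice device, const char* path)
{
    std::ifstream file(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    VkPipelineCacheCreateInfo info{};
    info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = blob.size();
    info.pInitialData    = blob.empty() ? nullptr : blob.data();

    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache;
}

// After (pre)compiling PSOs, serialize the warmed cache back to disk.
void saveCache(VkDevice device, VkPipelineCache cache, const char* path)
{
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    std::ofstream(path, std::ios::binary).write(blob.data(), size);
}
```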
 
Yes but the point is we’ve already tried that and it’s not working.
Just because it's not working on the application side doesn't mean it's not working from a hardware/driver design perspective. PSOs are a little more conservative in their design about the assumptions they make about how hardware works than developers would like, but the benefit is performance and safety on more configurations, since drivers can resolve state explicitly ahead of time ...

It's a problem for developers to figure out how to be more compatible with that model, not the driver/hardware, since the latter's only job is to run as fast as possible, and PSOs allow them to do that more easily ...
Ok but you haven’t addressed the question of why this greater hardware flexibility doesn't accommodate the SSO api (which you seem to think is less flexible). You’re making points from two conflicting perspectives. Either the hardware is flexible enough to support PSO/GPL/SSO or it isn’t.
SSOs just represent one of the many ways to apply state changes. Straight up, there's no flexibility gained or lost with SSOs in comparison to PSOs or GPL in regards to state changes. You can do the same state changes in all three models. Each design carries its own implications as to how fast/slow those state changes are for the hardware ...
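For comparison, this is all a "state change" amounts to under SSOs in OpenGL (a sketch; assumes a loaded GL 4.1+ context and that the separable programs were already built):

```cpp
#include <GL/glcorearb.h> // assumes a function loader and a GL 4.1+ context

// Sketch: with separate shader objects the "state change" is just rebinding
// one stage; nothing is recompiled, so the driver/hardware absorbs whatever
// cost the new combination implies at draw time.
void swapFragmentStage(GLuint pipeline, GLuint newFragProgram)
{
    glUseProgramStages(pipeline, GL_FRAGMENT_SHADER_BIT, newFragProgram);
    glBindProgramPipeline(pipeline);
    // Under a PSO model the same change means binding (or first compiling)
    // a different precompiled pipeline object instead.
}
```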

With SSOs, hardware designers are incentivized not to remove state from hardware, which quite often means that legacy features are left in there. With PSOs, hardware designers can freely decide whether or not to remove state from their hardware and implement the feature in software (which amounts to compilation) without implicitly paying the cost in state changes ...

SSOs can see faster state changes if there's hardware for it. PSOs, by letting hardware designers move state off the hardware, mean that there is more potential to expose more flexible features/pipelines since the hardware is capable of running customized state. With SSOs, you're pretty much stuck with the features already in there unless a vendor decides to support more states. In a stateless approach like PSOs, you can potentially program custom features yourself if the hardware is capable of it ...
 