Shader Compilation on PC: About to become a bigger bottleneck?

Is that sorting not possible CPU side today?
Grouping draws by state on the CPU only lets you amortize the cost of the API calls themselves. Look back at the early days of explicit APIs, when tech demos showed off pushing over a million draws to the GPU ...
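
To make the CPU-side case concrete, here is a minimal Python sketch. The `Draw` type, `pso_id` handles and mesh names are all made up for illustration. It only models how many times the bound PSO changes during submission, which is the part CPU sorting helps with. It says nothing about the cost of compiling each PSO in the first place.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Draw:
    pso_id: int  # hypothetical handle for a pipeline state object
    mesh: str    # hypothetical mesh name


def count_pso_binds(draws):
    """Count how many times the bound PSO changes when draws are submitted in order."""
    return sum(1 for _ in groupby(draws, key=lambda d: d.pso_id))


def sort_by_state(draws):
    """Stable sort so that draws sharing a PSO end up adjacent."""
    return sorted(draws, key=lambda d: d.pso_id)


draws = [
    Draw(2, "rock"),
    Draw(1, "tree"),
    Draw(2, "wall"),
    Draw(1, "bush"),
    Draw(3, "water"),
]

unsorted_binds = count_pso_binds(draws)
sorted_binds = count_pso_binds(sort_by_state(draws))
print(unsorted_binds, sorted_binds)
```

Sorting cuts the number of binds, but every distinct PSO still has to exist before its first draw, so the compile cost is untouched.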

Grouping draws according to render states on the GPU could potentially let the driver perform runtime shader linking optimizations hence making "PSO switching" cheap. Instead of having to recompile the exact same set of shaders from previous PSOs, newly formed PSOs with similar render states as before can 'stitch' together shaders from different PSOs that were compiled before. Think of it as having an "implicit form" of Vulkan's graphics pipeline library extension functionality without the ability to explicitly define what render states to change ...
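
A rough way to picture the 'stitching' idea is a toy Python model that counts shader compiles. The class names, shader names and render-state strings are all hypothetical, and it assumes render state never changes the compiled stage, which is not always true on real hardware. It is a sketch of the concept, not of how any driver actually works.

```python
class MonolithicCache:
    """Every new (vs, ps, render_state) tuple recompiles both shaders."""

    def __init__(self):
        self.pipelines = {}
        self.shader_compiles = 0

    def get(self, vs, ps, render_state):
        key = (vs, ps, render_state)
        if key not in self.pipelines:
            self.shader_compiles += 2  # vertex + pixel shader compiled again
            self.pipelines[key] = f"pso({vs},{ps},{render_state})"
        return self.pipelines[key]


class StitchingCache:
    """Compiled stages are cached on their own and reused across pipelines."""

    def __init__(self):
        self.stages = {}
        self.pipelines = {}
        self.shader_compiles = 0
        self.links = 0

    def _stage(self, name):
        if name not in self.stages:
            self.shader_compiles += 1
            self.stages[name] = f"bin({name})"
        return self.stages[name]

    def get(self, vs, ps, render_state):
        key = (vs, ps, render_state)
        if key not in self.pipelines:
            self.links += 1  # assumed-cheap link step instead of a full compile
            self.pipelines[key] = (self._stage(vs), self._stage(ps), render_state)
        return self.pipelines[key]


requests = [
    ("vs_a", "ps_a", "opaque"),
    ("vs_a", "ps_a", "alpha_blend"),  # same shaders, different render state
    ("vs_a", "ps_b", "opaque"),       # one new shader
]

mono, stitch = MonolithicCache(), StitchingCache()
for request in requests:
    mono.get(*request)
    stitch.get(*request)

print(mono.shader_compiles, stitch.shader_compiles, stitch.links)
```

In this toy model the monolithic cache recompiles both shaders for every new combination, while the stitching cache compiles each distinct shader once and only pays for a link per new pipeline.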

On Xbox, PSOs perfectly match the description of the hardware's render states and its native shader bytecode, which makes it trivial for them to implement GPU-driven render state changes (an indirect command for PSO switching) via ExecuteIndirect ...
 
And lastly, some of those small teams might lack the knowledge to build a good PSO gathering step. In UE5 this should be easier, but in UE4 you had to roll your own system and only a handful of developers actually managed that.

Is the built-in PSO gathering system for UE4 that useless? What I've read is it's particularly inadequate for things like ray tracing and some particle effects, and certainly with user skins as you mention that's also a big problem.

With games made by inexperienced teams like Kena and The Ascent though, they relatively quickly added a precompiling step that while not perfect, was a massive improvement to those games. Did they actually roll their own?
 
Did Kena actually add a precompilation step?
 
UE4 added PSO gathering really late. I don't remember the exact version but it was really really late and not well documented back then. Also it took some time for the PSO gathering in UE to be bug free and not miss stuff. Even in 5.0 there were still a lot of bugs. They even said themselves that they made good strides in 5.1 and 5.2. So yeah it took some time before it became useful. I have not kept up with the news around Kena after release so I don't know if they added a precompilation step.
 
Yeah, well, a form of one at least. DX12 only: when you first load it up after a fresh driver install, the opening logo will freeze and sputter, and the CPU will be pegged for 20 secs while it's compiling.

Just tried it. There were a few stutters when opening the world (it's installed on an HDD though, so that may have some effect), but running around, opening a chest, getting into a fight with 6+ enemies, the only shader stutter I saw was the reward chest opening after the fight. So I can't say how extensive their PSO gathering process was, as I haven't played that far. DX11 mode though has no precompilation whatsoever, and it shows in comparison. Lots of little stutters throughout that same encounter.

Also tried it on Linux, and that was pretty much perfect. So it looks like I'll be dual booting (my "Unreal 4 Partition" as I call it) whenever I decide to sit down and really play this.
 
One of the contributors to the open-source Godot Engine has previewed a working implementation of ubershaders with promising results:


Technical details are on their GitHub.

One of the more interesting quote tweets was from Cory Petkovsek, who describes the 'hacky' method they were using before to reduce shader stuttering:


Cory said:
@tccoxon turned me on to a hacky method to force shader pre-compilation in #GodotEngine. Here's what we do now:

1. Load the scene.
2. Rotate through all materials on 4 quads attached to the camera, showing each material set for one frame.
3. Rotate the camera 360 degrees around the world in 1 second.
4. Move the camera to a bird's eye view of the map for 1 frame to avoid culling.
5. Do all of this behind a loading screen. The 3D scene won't be culled, and the only indication of stutters will be your progress bar halting, as expected.

In Godot 3, shader caching was added but wasn't sufficient by itself. However, combined with these methods we eliminated stutters in Out of the Ashes, making the game commercially viable from that perspective.

In Godot 4, we've maintained the methods and still don't have a problem with stutters in OOTA. However, a pre-compilation pipeline built into the engine is always appreciated. We will use both just to ensure shader everything is always cached for the user beginning with the most important first playthrough.
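
The core of the warm-up trick in Cory's post (steps 2 and 5) can be sketched in a few lines of Python. This is not Godot code: `FakeRenderer`, `warm_up` and the material names are hypothetical stand-ins, and the camera sweep and bird's eye steps are not modeled. The only assumption is that the engine compiles a material's shader the first time that material is drawn.

```python
class FakeRenderer:
    """Stand-in for an engine that compiles a shader on a material's first draw."""

    def __init__(self):
        self.compiled = set()
        self.hitches = []  # (phase, material) for every first-use compile

    def draw(self, material, phase):
        if material not in self.compiled:
            self.compiled.add(material)
            self.hitches.append((phase, material))


def warm_up(renderer, materials, quads=4):
    """Show every material once, `quads` materials per frame, behind a loading screen."""
    frames = 0
    for start in range(0, len(materials), quads):
        for material in materials[start:start + quads]:
            renderer.draw(material, phase="loading")
        frames += 1
    return frames


materials = [f"mat_{i}" for i in range(10)]
renderer = FakeRenderer()
frames_used = warm_up(renderer, materials)

# Later gameplay draws reuse materials that were already shown during loading.
for material in ["mat_3", "mat_7", "mat_9"]:
    renderer.draw(material, phase="gameplay")

gameplay_hitches = [h for h in renderer.hitches if h[0] == "gameplay"]
print(frames_used, len(renderer.hitches), gameplay_hitches)
```

All the first-use compiles land in the "loading" phase, so the only visible symptom is a progress bar that pauses. Anything not in the `materials` list (user skins, for example) would still hitch during gameplay.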

First - this is what you can accomplish even with engines and imperfect PSO gathering methods if, you know, you really care about it.

I've often wondered why this approach isn't used more often with UE4 titles though: load the level assets in the background, quickly fly around the environment and fire off some effects, then finish loading when done. It will increase level load times, at least initially, but it also doesn't depend on a QA team playing through the entire game. It's...something.

It's not a solution to games with a ton of user skins and materials the engine doesn't ship with, and probably not applicable to large open world games natch - but there are plenty of single-player focused games where just rendering the assets beforehand would help tremendously.
 
So what they've introduced is some form of a temporary ubershader pipeline which is going to see lower perf until the specialized pipelines are compiled ...
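
As a rough Python sketch of that fallback behaviour: `PipelineManager` and its method names are hypothetical, and the background compile is simulated with an explicit `finish_one_compile` call rather than a real worker thread. It is not taken from the Godot implementation.

```python
class PipelineManager:
    """Serve a generic ubershader until the specialized pipeline is ready."""

    def __init__(self):
        self.ready = {}    # material -> specialized pipeline
        self.pending = []  # compile queue (a background thread in a real engine)

    def pipeline_for(self, material):
        if material in self.ready:
            return self.ready[material]
        if material not in self.pending:
            self.pending.append(material)  # request the specialized version once
        return "ubershader"                # slower, but available immediately

    def finish_one_compile(self):
        """Simulate the background compiler finishing its oldest job."""
        if self.pending:
            material = self.pending.pop(0)
            self.ready[material] = f"specialized({material})"


manager = PipelineManager()
first_frame = manager.pipeline_for("lava")
second_frame = manager.pipeline_for("lava")  # still compiling, no duplicate request
manager.finish_one_compile()
later_frame = manager.pipeline_for("lava")
print(first_frame, second_frame, later_frame)
```

The trade is a few frames of lower GPU throughput instead of a CPU-side hitch while the draw waits for a compile.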
 
I'm always amazed that games have so many programs when we are supposed to have uniform shading with very capable B(SS)RDF...
Only id Software controls the number of PSOs in their games AFAICT.
In Unreal Engine 5(.0-.3), all console permutations are computed, but on Windows you have to ask your QA to go through the game and record them all, then integrate them back into your build. (Oh, and Epic's system is broken: it uses an abysmal hashing function on a non-unique key to see whether it has already compiled that permutation...)
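
For anyone wondering why a non-unique key matters, here is a generic Python illustration of that failure mode. It is not Epic's code and makes no claim about how Unreal actually hashes PSOs. `weak_key`, `full_key` and the state fields are hypothetical.

```python
def weak_key(pso):
    # Hypothetical bad key: it ignores the blend mode, so distinct permutations collide.
    return hash((pso["vs"], pso["ps"])) % 1024


def full_key(pso):
    # Key built from every field that distinguishes one permutation from another.
    return (pso["vs"], pso["ps"], pso["blend"])


def record(psos, key_fn):
    """Return the permutations a recorder would keep, given its 'already seen?' key."""
    seen, recorded = set(), []
    for pso in psos:
        key = key_fn(pso)
        if key not in seen:
            seen.add(key)
            recorded.append(pso)
    return recorded


psos = [
    {"vs": "vs_a", "ps": "ps_a", "blend": "opaque"},
    {"vs": "vs_a", "ps": "ps_a", "blend": "alpha"},  # differs only in blend mode
]

kept_with_weak_key = record(psos, weak_key)
kept_with_full_key = record(psos, full_key)
print(len(kept_with_weak_key), len(kept_with_full_key))
```

With the weak key the second permutation looks "already compiled" and never gets recorded, which is the kind of miss that would show up later as a stutter.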
 

It may very well be needed in the overall picture of needless complexity, and some kind of hybrid solution like the one he proposes here could be ideal. But speaking solely in the context of this thread, DX11 was hardly a panacea for shader stuttering. Hell, the driver intercepting shaders and deciding to 'optimize' them was something that DX12 was actually made to address!

Lots of shader stuttering occurred in DX11 games, we just didn't know how to identify it then. It still has to be managed by the developer/engine.
 
Anyone with an actual clue about this stuff want to add their thoughts on this? Huge/Interesting/Kind of Neat/Meh?

DirectX®12 single shader compilation with Radeon™ GPU Analyzer (RGA) v2.9.1


Background

DirectX®12 requires complete pipeline state definition to compile a pipeline. This involves locating all the pipeline’s shaders, defining a root signature, and, for graphics, defining a subset of the graphics pipeline state. The need to prepare the entire graphics or compute pipeline elements upfront made the offline compilation process of DirectX12 shaders somewhat tedious. This approach could be cumbersome, particularly in scenarios where users want to compile a single shader in isolation.


RGA v2.9.1 to the rescue

RGA v2.9.1 streamlines the shader compilation experience by allowing you to compile a single D3D12 shader. When an incomplete DirectX®12 pipeline is given, RGA v2.9.1 will autogenerate the missing elements of the pipeline for you. These elements can be the root signature, the graphics pipeline state subset or even shaders in the pipeline. This feature essentially makes any input beyond the single shader that you would like to compile optional.
 
Since it's RGA, it has nothing to do with in-game pipeline compilation. RGA is a tool used to inspect the ISA disassembly generated by the compiler from the source language files. The blog states that RGA now supports offline separate shader compilation for vertex/pixel/compute shaders without requiring all of the information in a pipeline state object ...
 
Yes, I know it's offline. I guess I was just curious whether this could have an impact on the accuracy of creating precompiled caches, or what benefits (if any) it brings to DX12 shader programming in general.
 
There's no programming improvement since there's no new API. It is a tooling/instrumentation improvement ...
 