Shader Compilation on PC: About to become a bigger bottleneck?

One of the developers speaks about the messy state of Vulkan/DX12.

There is only one problem, which is that with all this fine-grained complexity, Vulkan winds up being basically impossible for humans to write. Actually, that's not really fair. DX12 and Metal offer more or less the same degree of fine-grained complexity, and by all accounts they're not so bad to write. The actual problem is that Vulkan is not designed for humans to write.

Literally. Khronos does not want you to write Vulkan, or rather, they don't want you to write it directly.

I was in the room when Vulkan was announced, across the street from GDC in 2015, and what they explained to our faces was that game developers were increasingly not actually targeting the gaming API itself, but rather targeting high-level middleware, Unity or Unreal or whatever, and so Vulkan was an API designed for writing middleware. The middleware developers were also in the room at the time, the Unity and Epic and Valve guys. They were beaming as the Khronos guy explained this. Their lives were about to get much, much easier.

Vulkan is weird— but it's weird in a way that makes a certain sort of horrifying machine sense. Every Vulkan call involves passing in one or two huge structures which are themselves a forest of other huge structures, and every structure and sub-structure begins with a little protocol header explaining what it is and how big it is. Before you allocate memory you have to fill out a structure to get back a structure that tells you what structure you're supposed to structure your memory allocation request in. None of it makes any sense!
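As a toy illustration of the pattern being described, here is a self-contained mimic of Vulkan's convention: every struct opens with a little tag/chain header, and you fill out one struct just to receive another struct that tells you how to fill out the real request. All names here are invented for illustration; real Vulkan uses `VkStructureType sType`, `const void *pNext`, `vkGetBufferMemoryRequirements`, and `VkMemoryAllocateInfo`.

```c
#include <stddef.h>

/* Mimic of Vulkan's protocol header: what the struct is, plus a
   chain pointer for extension structs. */
typedef enum { TAG_MEM_REQUIREMENTS, TAG_ALLOC_INFO } StructTag;

typedef struct {
    StructTag tag;      /* what this struct is            */
    const void *next;   /* chain of extension structs     */
} Header;

/* Step 1: a struct you receive, telling you the requirements... */
typedef struct {
    Header hdr;
    size_t size;        /* bytes the resource needs       */
    size_t alignment;   /* required alignment             */
} MemRequirements;

/* Step 2: ...which dictates how to fill the actual request. */
typedef struct {
    Header hdr;
    size_t allocationSize;
} AllocInfo;

/* Driver-side stand-in: reports requirements for a resource,
   rounding the size up to a 256-byte alignment. */
static void get_mem_requirements(size_t resource_bytes, MemRequirements *out) {
    out->hdr.tag = TAG_MEM_REQUIREMENTS;
    out->hdr.next = NULL;
    out->alignment = 256;
    out->size = (resource_bytes + 255) & ~(size_t)255;
}

/* The two-step dance: query a struct to learn how to build the
   struct for the real allocation request. */
static size_t plan_allocation(size_t resource_bytes) {
    MemRequirements req;
    get_mem_requirements(resource_bytes, &req);
    AllocInfo info = { { TAG_ALLOC_INFO, NULL }, req.size };
    return info.allocationSize;
}
```

Verbose for a human, but trivially parseable and extensible for middleware generating these calls by the thousand.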

In short, Vulkan is not for you. It is a byzantine contract between hardware manufacturers and middleware providers, and people like… well, me, are just not part of the transaction.

Khronos did not forget about you and me. They just made a judgement, and this actually does make a sort of sense, that they were never going to design the perfectly ergonomic developer API anyway, so it would be better to not even try and instead make it as easy as possible for the perfectly ergonomic API to be written on top, as a library.

Khronos thought that within a few years of Vulkan being released there would be a bunch of high-quality open source wrapper libraries that people would use instead of Vulkan directly. These libraries basically did not materialize.

It turns out writing software is work and open source projects do not materialize just because people would like them to.

 
So, what's the problem here? Why does unloading and loading data freeze everything? Drivers? API? Engine?

Even PCIe 4.0 x16 has only 32GB/s of bandwidth, so loading 2GB of data takes at least 1/16 of a second; if that lands in a single frame, the frame rate drops below 16 FPS. That'd certainly cause a noticeable stutter.
It'd be even slower if you have to load the data from storage.
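The arithmetic above can be sketched directly (numbers are illustrative; real transfers add protocol and driver overhead on top):

```c
/* Time to move a payload over a link, and the frame-rate ceiling
   if the whole transfer blocks a single frame. */
static double transfer_seconds(double gigabytes, double link_gb_per_s) {
    return gigabytes / link_gb_per_s;
}

static double fps_ceiling(double gigabytes, double link_gb_per_s) {
    /* If one frame waits for the entire transfer, FPS <= 1 / stall. */
    return 1.0 / transfer_seconds(gigabytes, link_gb_per_s);
}
```

For 2GB over a 32GB/s link: `transfer_seconds(2.0, 32.0)` is 0.0625s, so `fps_ceiling` is 16 — hence the sub-16 FPS figure.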
 
@pcchen I think the issue is probably that they're not loading the data before it's needed, and it impacts their critical path. Basically you get a stall because the data isn't ready: they're loading it as needed, or trying to load it just in time.

Yes, it could be a pacing issue. For example, it'd be impossible to load all the assets of an open world game at once, so some sort of streaming will be required. Generally you'll want to keep distant assets at lower LOD and near assets at higher LOD, and stream according to the current player location. However, it can be difficult to schedule the loading, and I can imagine in some (maybe rare) cases it's scheduled too late, so the game has to wait for the assets to load before continuing.
It can be difficult to maintain a good schedule, especially in an open world game where players can move freely (and especially if players can fly or move quickly). I don't know what the best way to deal with this is, but I hope something like DirectStorage and the recent NVIDIA paper on AI-assisted texture compression might help. Also, that's why I think a video card with more memory can be a good insurance policy :)
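The distance-driven LOD scheme described above can be sketched roughly like this (thresholds and the lookahead heuristic are made up for illustration):

```c
/* Nearer assets get higher detail; LOD 0 is full detail. */
static int lod_for_distance(double meters) {
    if (meters < 50.0)  return 0;  /* full detail      */
    if (meters < 200.0) return 1;
    if (meters < 800.0) return 2;
    return 3;                      /* distant imposter */
}

/* Prefetch hint: which LOD will be needed if the player keeps
   closing on the asset at this speed for `lookahead_s` seconds?
   A fast-moving (or flying) player makes `closing_speed` large,
   which forces the streamer to fetch high LODs much earlier. */
static int lod_needed_soon(double meters, double closing_speed,
                           double lookahead_s) {
    double predicted = meters - closing_speed * lookahead_s;
    return lod_for_distance(predicted < 0.0 ? 0.0 : predicted);
}
```

The scheduling difficulty is visible in the second function: the faster the player can move, the further ahead the predicted distance jumps, and the less time the streamer has to get the higher LOD resident.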
 

The Spider-man GDC presentation had really good information about how they handled streaming and that was with an ultra-slow HDD :)

 

Well, I get that if you're loading the assets just in time, but if you stream ahead, it shouldn't affect the FPS. Or are GPUs inefficient at loading assets into VRAM? Like, can they work on the current frame while loading/unloading at the same time? I suppose they have DMA engines to manage this.
 

There's no problem with that. The problem is that the game tries to load too much data within one frame (probably because that particular frame needs that data; see the discussion above).
 
It could be that the game/developer knows that "these" assets will be used soon, so they want to start filling VRAM with them. However, they don't have proper code in place to limit how quickly those assets are loaded, so they're loaded in as fast as the host system can transfer them. If this involves lots of small files (lots of IO overhead), that will starve the game of system resources that might be critical (for example, transferring many small files, or sustained high-speed transfers, can eat significant CPU time).
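The missing throttle being described here can be sketched as a per-frame byte budget: queue everything you want resident, but only issue a bounded amount each frame so a streaming burst can't starve the rest of the frame. The structure and numbers are illustrative, not any engine's actual API.

```c
#include <stddef.h>

/* Bytes still waiting to be streamed into VRAM. */
typedef struct {
    size_t remaining;
} StreamQueue;

/* Called once per frame: issues at most `frame_budget` bytes of IO,
   returning how many were actually issued. Spreading a large load
   across frames trades one giant stall for many tiny ones. */
static size_t issue_this_frame(StreamQueue *q, size_t frame_budget) {
    size_t n = q->remaining < frame_budget ? q->remaining : frame_budget;
    q->remaining -= n;
    return n;
}
```

With, say, 2GB queued and a 64MB/frame budget, the transfer takes 32 frames instead of landing as one uncontrolled burst in a single frame.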

I could certainly see this happening with developers who haven't implemented a robust streaming system in the past, for a variety of reasons. They may figure that the host system is best suited to handling any and all IO tasks, and not specifically code around making certain that their own IO demands aren't causing a bottleneck.

Host IO systems aren't there to predict how file transfers will impact your app code and execution; they are there generally just to deliver data as quickly as possible, as safely as possible, or some combination of both. PS5, XBS, DirectStorage, the Windows file system, etc. aren't going to ensure that the developer doesn't screw things up by initiating an uncontrolled large burst of IO traffic that might impact system resources their currently running code needs.

Regards,
SB
 
DX12 PSO Precaching

A new PSO precaching mechanism was introduced as experimental in 5.1 to improve PSO hitching in DX12 titles. Improvements to this system in 5.2 include:

  • We improved the performance and stability of the system. There were various corner cases which we needed to address.
  • We now skip drawing objects if their PSOs aren't ready yet. The system aims to have the PSO ready in time for drawing, but it will never be able to guarantee this. When it's late, it is now possible to skip drawing the object instead of waiting for compilation to finish (and hitching).
  • We reduced the number of PSOs to precache due to improved logic that omits ones which will never be used.
  • We improved the old (manual) PSO cache system so that it can be used alongside precaching.
 
Looks like some really good improvements there. Hopefully UE5.2 will spell the end of shader compilation stutter in the engine once and for all. And if it no longer exists in UE5, games built on other engines will hopefully put a lot more effort into resolving their own problems too.
It won't. But less is always better.
 
This seems similar to the stages the Dolphin emulator went through. The GameCube had ridiculous pipeline permutations before it was cool, so Dolphin was a stuttery mess. They also added skipping rendering in lieu of stuttering.

Their eventual solution was to add a shader interpreter as a fallback while the shader compiled in the background.
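That fallback pattern can be sketched as a per-frame path selection: render with the always-available generic/interpreter path until the background compile finishes, then swap to the specialized shader. The frame-count stand-in for compile latency is purely illustrative.

```c
#include <stdbool.h>

typedef struct {
    bool specialized_ready;
    int  frames_until_ready;   /* stand-in for async compile latency */
} ShaderSlot;

/* Called once per frame. Returns 1 when the fast specialized shader
   is usable, 0 while the slow-but-always-available fallback must
   carry the rendering. Either way, the frame never blocks. */
static int select_path(ShaderSlot *s) {
    if (!s->specialized_ready && --s->frames_until_ready <= 0)
        s->specialized_ready = true;   /* background compile finished */
    return s->specialized_ready ? 1 : 0;
}
```

The key property is that `select_path` always returns immediately: the cost of a not-yet-compiled shader is a few frames at lower shading performance rather than a stall.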
 