Speculation and Rumors: Nvidia Blackwell ...

I disagree that there's no need for it though (or at least no benefit to be had from it), as it's clearly going to reduce CPU load in games that use LZ4 on the CPU for streaming. I guess it could also potentially be faster than the CPU for the big initial loads.

This was obviously the hope for GDeflate, but so far it's not working out as intended, it seems.
Do we actually want to load the CPU less? Generally in games the GPU is loaded 100% while the CPU has idle cores. And decompressing assets is easy to offload, so it shouldn't be something contributing to being single-thread bound. Even if GPU decompression is considerably faster than the CPU, the fact that it's additional load is still going to cause a performance hit. I don't necessarily think GDeflate isn't working as expected, so much as expectations were unrealistic. Of course shifting load from a partially idle CPU to a fully loaded GPU will affect performance.

If there's a time and place for it, it's loading screens. Streaming during gameplay should probably stay the domain of the CPU, unless the CPU either can't keep the GPU fully loaded or can't decompress assets fast enough.
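As a concrete illustration of the "easy to offload" point above: a minimal sketch, assuming liblz4 and illustrative buffer names and sizes, that pushes the LZ4 decode of a streamed asset onto a background thread so it never competes with the render thread for time.

```cpp
// Decompress an LZ4-compressed asset on a worker thread while the
// main/render thread keeps going. Names and sizes are illustrative.
#include <lz4.h>

#include <future>
#include <stdexcept>
#include <vector>

std::vector<char> DecompressLZ4(const std::vector<char>& compressed,
                                int uncompressedSize) {
    std::vector<char> out(uncompressedSize);
    const int written = LZ4_decompress_safe(
        compressed.data(), out.data(),
        static_cast<int>(compressed.size()), uncompressedSize);
    if (written < 0) throw std::runtime_error("corrupt LZ4 stream");
    out.resize(written);
    return out;
}

// Kick the work off asynchronously; the render thread only touches the
// result once the future is ready, so decompression never blocks a frame.
std::future<std::vector<char>> StreamAssetAsync(std::vector<char> compressed,
                                                int uncompressedSize) {
    return std::async(std::launch::async, DecompressLZ4,
                      std::move(compressed), uncompressedSize);
}
```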
 
On the desktop, with >100 W CPUs, this may be logical, but in notebooks these CPUs are limited to 45 W or less. There it can make sense to let the GPU do the decompression, because the CPU is what's limiting performance.
 
Do we actually want to load the CPU less? Generally in games the GPU is loaded 100% while the CPU has idle cores. And decompressing assets is easy to offload, so it shouldn't be something contributing to being single-thread bound. Even if GPU decompression is considerably faster than the CPU, the fact that it's additional load is still going to cause a performance hit. I don't necessarily think GDeflate isn't working as expected, so much as expectations were unrealistic. Of course shifting load from a partially idle CPU to a fully loaded GPU will affect performance.

If there's a time and place for it, it's loading screens. Streaming during gameplay should probably stay the domain of the CPU, unless the CPU either can't keep the GPU fully loaded or can't decompress assets fast enough.

The context of the post was in relation to a hardware decompression unit for LZ4, so in that case the offload from the CPU would be free, which even in terms of energy draw would be a win. I do agree GPU decompression has a place in the current scheme, but if GPUs are going to include the unit anyway, then it would be advantageous to use it.
 
Dedicated hardware sounds wasteful. There are so many idle GPU cycles during any given frame I don’t see why they couldn’t be filled with GDeflate async compute work.
 
So you're saying that if a game uses LZ4 via DirectStorage for CPU decompression as standard (like Forbidden West does), then the HW could automatically move the decompression off the CPU onto the hardware-based decoder?
Depends on how the API in question would implement such support. DS will require an update to support other GPU-decompressible formats.
What I'm saying though is that the GPU decompression itself could happen either on a dedicated unit or on the shader cores as it does now; that would be transparent to the application. It's basically a driver decision.

I disagree that there's no need for it though (or at least no benefit to be had from it), as it's clearly going to reduce CPU load in games that use LZ4 on the CPU for streaming. I guess it could also potentially be faster than the CPU for the big initial loads.
There is no need for a dedicated d/c h/w unit in GPUs. The task isn't so big that it can't be handled on the general FP units. Better scheduling may be needed to avoid deadlocks, but that seems easy enough to add.

This was obviously the hope for GDeflate, but so far it's not working out as intended, it seems.
The problem Nixxes highlights is that GDeflate isn't very CPU friendly, meaning that on h/w which doesn't support GPU d/c, decompression performance is too slow.

This can be approached from several directions:
  1. Improve said performance by making a faster CPU d/c for GDeflate.
  2. Increase your CPU system requirements.
  3. Make a GPU which supports DS GPU d/c mandatory for your game.
The latter isn't that restrictive really, as "GPU decompression is supported on all DirectX 12 + Shader Model 6.0 GPUs", which basically means all GPUs which support DX12.
With HFW specifically being a DX12 game, I don't see how a requirement for a DX12 GPU to perform GPU d/c is suddenly too much, honestly.
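For what it's worth, a game doesn't have to guess which of those three situations applies on a given machine: DirectStorage 1.1+ can be queried for whether a format gets a vendor-optimized GPU decoder, the generic GPU fallback, or only the CPU fallback. A sketch, assuming the IDStorageQueue2 interface and the DSTORAGE_COMPRESSION_SUPPORT flags from the 1.1 SDK:

```cpp
#include <dstorage.h>

#include <cstdint>

// Returns true if GDeflate requests on this queue will be decoded on the GPU
// (either the vendor-optimized path or the generic GPU fallback).
bool GpuPathAvailable(IDStorageQueue* baseQueue) {
    IDStorageQueue2* queue2 = nullptr;
    if (FAILED(baseQueue->QueryInterface(IID_PPV_ARGS(&queue2))))
        return false;  // pre-1.1 runtime: assume CPU decompression only

    const auto support =
        queue2->GetCompressionSupport(DSTORAGE_COMPRESSION_FORMAT_GDEFLATE);
    queue2->Release();

    const uint32_t bits = static_cast<uint32_t>(support);
    const bool gpuOptimized =
        bits & static_cast<uint32_t>(DSTORAGE_COMPRESSION_SUPPORT_GPU_OPTIMIZED);
    const bool gpuFallback =
        bits & static_cast<uint32_t>(DSTORAGE_COMPRESSION_SUPPORT_GPU_FALLBACK);
    return gpuOptimized || gpuFallback;  // otherwise only the CPU fallback remains
}
```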
 
The consoles can definitely run GDeflate. Maybe they’ve super optimized the GPU pipeline such that there aren’t any bubbles that can be filled with GDeflate work. Would love to see that trace.
 
Do we actually want to load the CPU less? Generally in games the GPU is loaded 100% while the CPU has idle cores.
That's another point which is worth discussing.
All DS implementations so far ignore the suggestion of the API itself, which says that a user-facing option for where to perform d/c is a good idea.
It seems like developers try to approach this as if the PC were a console, where the configuration is known and no options are needed.
But in the case of DS, the optimal choice of what to use for d/c - the CPU or the GPU - is essentially a user choice, not the software's.
A PC running a game which does DS d/c can be better suited to doing it either on the CPU or on the GPU, depending on what those are.
Just presuming that the GPU in a system is pegged at 100%, or that the CPU is, is bad design for a PC application.
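A sketch of what such a user-facing choice could look like. Everything here (DecompressionDevice, StreamingSettings, AssetRecord and the two submit helpers) is hypothetical naming; the helpers stand in for the GPU (DirectStorage GDeflate) and CPU (worker-thread LZ4) paths sketched elsewhere in the thread. The only point is that the decision comes from a setting instead of being hard-coded.

```cpp
#include <string>

enum class DecompressionDevice { Cpu, Gpu };

struct StreamingSettings {
    // Read from the game's options menu / config file, not hard-coded.
    DecompressionDevice device = DecompressionDevice::Gpu;
};

struct AssetRecord {  // illustrative asset descriptor
    std::string path;
    unsigned compressedSize = 0;
    unsigned uncompressedSize = 0;
};

// Hypothetical helpers; real implementations would look like the other sketches.
void SubmitGpuGDeflateRequest(const AssetRecord& /*asset*/) {
    // enqueue a DirectStorage request tagged DSTORAGE_COMPRESSION_FORMAT_GDEFLATE
}
void SubmitCpuLZ4Decompression(const AssetRecord& /*asset*/) {
    // read the file and hand it to a worker thread running LZ4_decompress_safe
}

void StreamAsset(const StreamingSettings& settings, const AssetRecord& asset) {
    if (settings.device == DecompressionDevice::Gpu)
        SubmitGpuGDeflateRequest(asset);   // GPU-decoded DirectStorage request
    else
        SubmitCpuLZ4Decompression(asset);  // CPU worker-thread path
}
```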

The consoles can definitely run GDeflate.
Can they? I thought it was a PC-only format designed specifically for DS. XS uses BCPack and PS5 uses Kraken.
Anyway, if R&C is any indication, AMD GPUs don't have any issues running GDeflate on PC; it's Nv h/w which seems to have deadlocks for whatever reason.
 

Who is David Blackwell?

While such an impressive innovation has been widely covered throughout the tech world, the name “Blackwell” hasn’t gotten as much attention. This GPU is named after David Blackwell, an American mathematician and statistician whose work has had a lasting impact in mathematics as well as the specific domain of AI.

Blackwell’s work was revolutionary, especially at a time that put steep racial barriers in front of African American scientists.

Blackwell’s story is one of a great intellect triumphing in the face of adversity, and as such his recognition by NVIDIA is well earned. To learn more about this exceptional mind, let’s explore the life and work of Blackwell and what he has contributed to AI.
 
Can they? I thought it was a PC-only format designed specifically for DS. XS uses BCPack and PS5 uses Kraken.
Anyway, if R&C is any indication, AMD GPUs don't have any issues running GDeflate on PC; it's Nv h/w which seems to have deadlocks for whatever reason.

What's PC-specific about it? Nvidia describes the format as 64 KB chunks of data optimized for parallel processing on GPUs. Just another compute shader.
 
What's PC-specific about it? Nvidia describes the format as 64 KB chunks of data optimized for parallel processing on GPUs. Just another compute shader.
You can run it on console GPUs of course, but that goes against their design, where the decompression h/w was added specifically so as not to burden their GPUs or CPUs with it.
 
Dedicated hardware sounds wasteful. There are so many idle GPU cycles during any given frame I don’t see why they couldn’t be filled with GDeflate async compute work.

On-GPU decompression isn't a good idea: bandwidth is often a bottleneck, cache lines are often a bottleneck, and using them just to shuffle compressed data to the GPU and decompress it into VRAM isn't a good use of them.

This operation on the GPU isn't free and never will be. The CPU, on the other hand, often has enough bandwidth and also has some spare pipes not doing work; full multi-core utilization is exceedingly hard as it is. That's a better place to decompress than the GPU.
 
On-GPU decompression isn't a good idea: bandwidth is often a bottleneck, cache lines are often a bottleneck, and using them just to shuffle compressed data to the GPU and decompress it into VRAM isn't a good use of them.

This operation on the GPU isn't free and never will be. The CPU, on the other hand, often has enough bandwidth and also has some spare pipes not doing work; full multi-core utilization is exceedingly hard as it is. That's a better place to decompress than the GPU.
What about PCIe bandwidth and latency?
 
What about PCIe bandwidth and latency?

Not often as saturated as GPU bandwidth and cache. Take, say, a Series X level of decompression: about 2.5 GB/s into the decompressor, so roughly 5 GB/s over PCIe afterwards. That's roughly 15% of an x16 4.0 link and under 8% of a 5.0 one. Not a huge amount; a 4090 isn't hampered that much even by PCIe 3.0, assuming it has all 16 lanes.

On-GPU decompression here is being built for AI, because training wants to use as many petabytes or whatever ludicrous data monster they can get their hands on, and there's no way a GPU can have that much RAM at the moment. So there's been a bunch of recent work on getting training, and inference (which can be big anyway), to run as efficiently as possible off SSDs instead. There you could easily max out interlink bandwidth if you're, say, running from a RAID array of SSDs in their own x16 slot, so on-GPU decompression makes sense there: you get a (compression ratio) multiplier on the interlink bandwidth you're limited by just by decompressing on the GPU.

Gaming on the other hand reuses a lot of data. A stream of new data is great, but 80% of it is probably the same frame to frame, and most of the work is done on intermediate data just for that frame / the last frame with TAA. A few GB/s of new texture/model data decompressed on the CPU and sent over PCIe is fine.
 
On-GPU decompression here is being built for AI, because training wants to use as many petabytes or whatever ludicrous data monster they can get their hands on, and there's no way a GPU can have that much RAM at the moment.
Worth mentioning that the CPU typically can't handle the data rates necessary, due to being too constrained in cache size / memory bandwidth to run decompression with large dictionaries on all cores simultaneously. Even spilling to L2 cache doesn't scale well on the CPU; LZ4 is actually optimized to use only a 64 kB sliding-window working set to account for those limitations.
On-GPU decompression isn't a good idea: bandwidth is often a bottleneck, cache lines are often a bottleneck
... which applies to both CPU and GPU for the entire deflate family. A decompression engine with even just a tiny dedicated cache actually doesn't cost you memory bandwidth, or even L2 cache bandwidth when handling LZ4 though.

Same goes for GDeflate. Except it's less efficient than LZ4 due to using 64 kB chunks instead of a 64 kB sliding window, so half of the time you have to work with a half-empty dictionary, and then you end up having to reload a full 64 kB burst...

It's the sliding window that makes LZ4 both more efficient in compression ratio and in bandwidth cost, and gives it decent pipeline saturation on a CPU, but it also renders LZ4 entirely unsuitable for distribution onto an ultra-wide processor architecture.
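To make the sliding-window point concrete: a rough double-buffer decode using liblz4's streaming API (block framing is illustrative, not the LZ4 frame format). Each block may copy matches from the previous 64 KB of already-decompressed output, which is exactly the serial dependency that keeps a CPU core's working set tiny, and exactly what GDeflate drops by making its 64 KB chunks independent so thousands of GPU threads can decode them in parallel.

```cpp
// Decodes a sequence of LZ4 blocks compressed with a streaming (sliding-window)
// compressor: block i may reference the previous 64 KB of decompressed output,
// so that window has to stay resident between calls.
// Assumes liblz4; each decompressed block must fit into 64 KB here.
#include <lz4.h>

#include <vector>

constexpr int kWindow = 64 * 1024;

std::vector<char> DecodeStreamedBlocks(
        const std::vector<std::vector<char>>& compressedBlocks) {
    LZ4_streamDecode_t* stream = LZ4_createStreamDecode();

    // Two alternating 64 KB slots: the slot written last round is the
    // dictionary/window for the next block, so it must not be overwritten yet.
    std::vector<char> slots(2 * kWindow);
    int current = 0;

    std::vector<char> out;
    for (const auto& block : compressedBlocks) {
        char* dst = slots.data() + current * kWindow;
        const int decoded = LZ4_decompress_safe_continue(
            stream, block.data(), dst, static_cast<int>(block.size()), kWindow);
        if (decoded < 0) break;                     // corrupt input
        out.insert(out.end(), dst, dst + decoded);  // publish the block
        current ^= 1;                               // keep the previous slot alive
    }
    LZ4_freeStreamDecode(stream);
    return out;
}
```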

The statements about GPU decompression burning bandwidth were last true when taking a look at "classic" image and video codecs. The cost is not (and never was) in the deflate family decompression phase (or whatever compression / encoding scheme was used), but in the way those codecs decomposed the image into multiple image planes, and cross-referenced data from frames or planes without spatial constraints in video codecs.

Coincidentally, the GTX 1630 is actually the most recent GPU I can recall where the classic video decode engine alone would already demand more than 130% of the available memory bandwidth when trying to use H.265 with the worst possible / highest-compressing feature set.

But that doesn't happen with the LZ4 / GDeflate decompression engine either, the way it's set up to be used for decompressing flat buffers only, not as part of a more sophisticated image compression scheme or the like.

What you also have to understand: that decompression engine in Blackwell is only built to be fast enough to match the PCIe uplink on consumer cards. Cards with other form factors / connectors will likewise simply have more instances of that engine, but still restricted to plausible input rates per decompression stream. It works under the assumption that you will use it for asset streaming only and saturate the uplink first, not that you will hold compressed assets resident in video memory!

Running the decompression on the shaders instead would give you much, much higher peak decompression rates, after all. Except that is actually burning your valuable L1 cache, trashing your L2, and all that while not even remotely utilizing the shader arrays or the memory interface. It only starts scaling when you go so wide that the L1 cache no longer hits at all, everything chokes on memory, and efficiency tanks. Kind of funny how badly GDeflate actually matches ANY processor architecture. 🤡 At least LZ4 is a good match for scalar processors.



While we're at it, spilling the beans: where did they put it? For the consumer silicon they simply beefed up one of the two copy engines with a 128-256 kB directly addressed scratchpad (with a 512 B-2 kB fully associative L0 in front), a Huffman decoder and a deflate unit. Remember, those are the copy engines most developers fundamentally misunderstood: when to use them, when not to, and why you even have more than one if they are so slow...

How can I be so certain of those details? It's the only place to put this function without introducing another scheduling round trip. LZ4 doesn't scale indefinitely either, not even with an ASIC, and the throughput of this component already matches what I expect you can achieve. Meanwhile, the bandwidth amplification introduced by decompression means it would be too expensive to beef up more than one instance.

NVLink-equipped datacenter GPUs got more than just two of the copy engines, and my best guess is NVidia simply beefed up all of them. Possibly they even just forked some of the existing ones, so no longer just 4, but more likely 6-8 copy engines on those in total.

For Vulkan users: rejoice, it means you will get an extension soon that lets you distinguish the two engines. And that will implicitly, at last, also let you figure out which one is dedicated to upload and which one to download...
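Today the closest a Vulkan application gets is spotting the transfer-only queue families, which is roughly where the copy engines surface; which physical engine backs which family, or anything about decompression capability, would indeed need a new extension. A small sketch using the standard queue-family query:

```cpp
#include <vulkan/vulkan.h>

#include <cstdint>
#include <vector>

// Returns the queue family indices that expose TRANSFER but neither GRAPHICS
// nor COMPUTE; on current hardware these typically map to the copy/DMA engines.
std::vector<uint32_t> FindDedicatedTransferFamilies(VkPhysicalDevice gpu) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

    std::vector<uint32_t> result;
    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags flags = families[i].queueFlags;
        const bool transferOnly =
            (flags & VK_QUEUE_TRANSFER_BIT) &&
            !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT));
        if (transferOnly) result.push_back(i);
    }
    return result;
}
```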

For DirectX users, I expect Microsoft will keep this hardware detail hidden from you, and you will merely be discouraged from trying to address the copy queues yourself at all from now on. Well, to be specific, NVidia will discourage you in their optimization guide.



Finally, one PSA: GDeflate was a poison pill. Whoever used it in their asset pipeline introduced a major inefficiency for platforms prior to Blackwell which neither the CPU nor the GPU can cope with. And now, with LZ4 support enabled in hardware, NVidia has immediately declared it obsolete again.
 
... which applies to both CPU and GPU for the entire deflate family. A decompression engine with even just a tiny dedicated cache actually doesn't cost you memory bandwidth, or even L2 cache bandwidth when handling LZ4 though...
... (etc)

I mean, it was just in response to "on-GPU decompression without a dedicated decompressor should be free, right?" As you pointed out, the answer is "no". And it's all just under the assumption that things are GPU-bound by default.

Besides, LZ4 decompression runs at over 4 GB/s on, like, a mid-level Zen 2 processor, on one core (see the sketch after this post). For the next decade that's likely "good enough", so I still don't see why any consumer should care about a dedicated on-GPU decompressor. For a console APU it makes sense: that's an entire CPU core freed up for devs, at a relatively low ASIC cost for Sony/MS. But for consumers a 6-core Zen 3 or above is cheap; they've probably got cores to spare.

Actually, I'll make an allowance: for mobile it makes sense. No doubt an ASIC is much more power efficient. If I were Valve designing the Steam Deck 2, I'd have AMD put in an OS-exposed decompressor that can be redirected to from the DirectStorage API (one that handles Oodle compression as well; that's popular and more efficient on the CPU anyway). So does anyone building a desktop for gaming need to care much about this dedicated decompressor? I don't see it.
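For anyone who wants to sanity-check the "several GB/s of LZ4 on one core" figure on their own machine, a crude single-threaded timing loop is enough. The synthetic all-'A' buffer used here compresses unrealistically well, so treat the result as an upper bound on your hardware rather than a benchmark.

```cpp
#include <lz4.h>

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int rawSize = 64 * 1024 * 1024;      // 64 MiB of test data
    std::vector<char> raw(rawSize, 'A');       // trivially compressible, synthetic
    std::vector<char> packed(LZ4_compressBound(rawSize));
    const int packedSize = LZ4_compress_default(
        raw.data(), packed.data(), rawSize, static_cast<int>(packed.size()));

    std::vector<char> out(rawSize);
    const int iterations = 20;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        LZ4_decompress_safe(packed.data(), out.data(), packedSize, rawSize);
    const auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    const double gbPerSec = static_cast<double>(rawSize) * iterations / seconds / 1e9;
    std::printf("single-core LZ4 decompression: %.2f GB/s (synthetic data)\n",
                gbPerSec);
    return 0;
}
```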
 
Running the decompression on the shaders instead would give you much, much higher peak decompression rates, after all. Except that is actually burning your valuable L1 cache, trashing your L2, and all that while not even remotely utilizing the shader arrays or the memory interface. It only starts scaling when you go so wide that the L1 cache no longer hits at all, everything chokes on memory, and efficiency tanks. Kind of funny how badly GDeflate actually matches ANY processor architecture. 🤡 At least LZ4 is a good match for scalar processors.

Those guys aren't infallible by any means, but it's hard to believe Nvidia proposed (and Microsoft endorsed) a decompression scheme fit for GPUs that actually isn't fit for GPUs. Why would they make such a fundamental mistake as not realizing the memory and cache subsystem is inadequate?
 
Those guys aren't infallible by any means, but it's hard to believe Nvidia proposed (and Microsoft endorsed) a decompression scheme fit for GPUs that actually isn't fit for GPUs. Why would they make such a fundamental mistake as not realizing the memory and cache subsystem is inadequate?
What do you mean by mistake? Take a look at the timeline. By the time NVIDIA sold/licensed GDeflate to Microsoft, they already had the hardware decompression unit ready. "Hardware-accelerated JPEG decompression" in Hopper ring a bell? They had deflate support two years ago.

GDeflate fits nicely into the role of something you can do in software, which is a bad fit for any architecture, yet still performs distinctly better than the CPU-based alternative by brute force, and which also met the requirement of being trivially hardware-accelerated by a unit they already knew they were going to include in their next consumer hardware iteration.

It didn't ring a bell for me either, until I saw some dev studios rejecting GDeflate for its poor CPU performance in favor of LZ4, and now realized that NVIDIA had already planned for LZ4 support as the preferred alternative three years back. And they still pushed/promoted GDeflate at every chance they could find, even though it's obviously a poor choice for their current generation as well. And also a poor choice in objective terms, as it under-performs in terms of compression ratio.

We will find that this performance uplift will also be in the benchmark guides handed out to tech reviewers, further pushing GDeflate into the public mindset despite it being inferior.

Microsoft simply took the bait. There was an odd change in the direction of DirectStorage when the NVMe-to-GPU push disappeared from the agenda (despite BypassIO permitting just that) and compression support went in instead. With NVidia taking the lead and selflessly open-sourcing their great new asset compression tech...


Sure, it may all be coincidence. But it's also exactly the type of strategy NVIDIA has become so infamous for: promoting tech which will cripple the competition while they already have the next hardware iteration in the pipeline, one which contains a solution to the problem they created and is properly safeguarded with patents to keep competitors from following the hardware-solution path.
 
Sure, it may all be coincidence. But it's also exactly the type of strategy NVIDIA has become so infamous for: promoting tech which will cripple the competition while they already have the next hardware iteration in the pipeline, one which contains a solution to the problem they created and is properly safeguarded with patents to keep competitors from following the hardware-solution path.
A. The tech doesn't seem to "cripple the competition" at all.
B. The competition has had h/w decompression units for years now. What do you think both consoles are using?
 