On-GPU decompression here is being built for AI, because training wants to chew through as many petabytes of whatever ludicrous data monster it can get its hands on, and there's no way a GPU can hold that much in RAM at the moment.
Worth mentioning that the CPU typically can't handle the data rates necessary, being too constrained in cache size / memory bandwidth to run decompression with large dictionaries on all cores simultaneously. Even spilling to L2 cache doesn't scale well on the CPU; LZ4 is actually optimized to use only a 64kB sliding-window working set to account for those limitations.
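To make that working-set point concrete, here is a minimal sketch of LZ4 block decoding (a toy, not the real lz4 library: no bounds checks, no error handling). The key detail is that match offsets are 16-bit, so back-references never reach more than 64kB behind the write cursor, and that 64kB of recent output is the entire dictionary the hot loop ever touches.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Toy LZ4 block decoder sketch (no bounds checks, no error handling).
// The point: match offsets are 16-bit, so the "dictionary" is just the
// last 64 kB of already-decoded output -- a working set small enough to
// stay in L1/L2 on a scalar core.
size_t lz4_block_decode_sketch(const uint8_t* src, size_t src_len, uint8_t* dst) {
    const uint8_t* ip = src;
    const uint8_t* const iend = src + src_len;
    uint8_t* op = dst;

    while (ip < iend) {
        const uint8_t token = *ip++;

        // Literal run: high nibble of the token, 255-extension bytes follow.
        size_t lit_len = token >> 4;
        if (lit_len == 15) { uint8_t b; do { b = *ip++; lit_len += b; } while (b == 255); }
        std::memcpy(op, ip, lit_len);
        ip += lit_len; op += lit_len;
        if (ip >= iend) break;                  // last sequence carries literals only

        // Match: 16-bit little-endian offset, i.e. at most 65535 bytes back.
        const size_t offset = ip[0] | (ip[1] << 8);
        ip += 2;
        size_t match_len = (token & 0x0F) + 4;  // minimum match length is 4
        if ((token & 0x0F) == 15) { uint8_t b; do { b = *ip++; match_len += b; } while (b == 255); }

        // Byte-wise copy so overlapping matches (offset < match_len) work.
        const uint8_t* match = op - offset;
        for (size_t i = 0; i < match_len; ++i) op[i] = match[i];
        op += match_len;
    }
    return static_cast<size_t>(op - dst);
}
```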
"On-GPU decompression isn't a good idea, as bandwidth is often a bottleneck and cache lines are often a bottleneck"
... which applies to both the CPU and the GPU for the entire deflate family. A decompression engine with even just a tiny dedicated cache doesn't cost you memory bandwidth, or even L2 cache bandwidth, when handling LZ4 though.
Same goes for GDeflate. Except it's less efficient than LZ4, due to using 64kB chunks instead of a 64kB sliding window: half of the time you have to work with a half-empty dictionary, and then you end up having to reload a full 64kB burst...
It's the sliding window that makes LZ4 more efficient in both compression ratio and bandwidth cost, and gives you decent pipeline saturation on a CPU, but it also makes it entirely unsuitable for distribution onto an ultra-wide processor architecture.
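To illustrate why the trade-off falls that way, here is a rough dependency sketch; decode_chunk / decode_block are hypothetical placeholders, only the calling pattern matters. Independent 64kB chunks can be fanned out to as many workers as you like, while a sliding-window stream chains every block to the output produced before it.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <vector>
#include <thread>

struct Chunk { const uint8_t* data; size_t size; uint8_t* out; };

// Placeholder decoders -- real ones would run deflate / LZ4 sequence decoding.
// Only the calling pattern matters for the dependency argument below.
size_t decode_chunk(const uint8_t*, size_t n, uint8_t* out) { std::memset(out, 0, n); return n; }
size_t decode_block(const uint8_t*, size_t n, uint8_t* out) { std::memset(out, 0, n); return n; } // may read up to 64 kB behind `out`

// GDeflate-style: every 64 kB chunk is self-contained, so each one can be
// handed to its own thread / warp / workgroup -- at the price of starting
// each chunk with an empty (on average half-filled) dictionary.
void decompress_chunked(const std::vector<Chunk>& chunks) {
    std::vector<std::thread> pool;
    for (const auto& c : chunks)
        pool.emplace_back([c] { decode_chunk(c.data, c.size, c.out); });
    for (auto& t : pool) t.join();
}

// LZ4-style stream: block N may copy from the 64 kB of output produced by the
// blocks before it, so the loop is inherently serial -- great for one scalar
// core with a warm window, hopeless to spread across thousands of shader lanes.
void decompress_stream(const std::vector<Chunk>& blocks, uint8_t* out) {
    for (const auto& b : blocks)
        out += decode_block(b.data, b.size, out);
}
```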
The statements about GPU decompression burning bandwidth were last true when looking at "classic" image and video codecs. The cost is not (and never was) in the deflate-family decompression phase (or whatever compression / encoding scheme was used), but in the way those codecs decomposed the image into multiple image planes, and, in the case of video codecs, cross-referenced data between frames or planes without spatial constraints.
Coincidentally, the GTX 1630 is actually the most recent GPU I can recall where the classic video decompression engine alone would already demand more than 130% of the available memory bandwidth when trying to use H.265 with the worst possible / highest-compressing feature set.
But that doesn't happen with the LZ4 / GDeflate decompression engine either; the way it's set up, it only ever decompresses flat buffers, not something embedded in a more sophisticated image compression scheme or the like.
What you also have to understand: the decompression engine in Blackwell is only built to be fast enough to match the PCIe uplink on consumer cards. Cards with other form factors / connectors will likewise simply have more instances of that engine, but still restricted to plausible input rates per decompression stream. It works under the assumption that you will use it for asset streaming only and that you will saturate the uplink first, not that you will hold compressed assets resident in video memory!
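Back-of-the-envelope for that sizing (the compression ratio and VRAM bandwidth below are assumptions for illustration, not published specs): a PCIe 5.0 x16 link moves roughly 63 GB/s per direction, so an engine that merely keeps the uplink saturated only has to emit on the order of a hundred-odd GB/s of decompressed writes. One instance of that is a small slice of the memory interface; several beefed-up instances quickly wouldn't be.

```cpp
#include <cstdio>

int main() {
    // PCIe 5.0 x16: 32 GT/s * 16 lanes, 128b/130b encoding ~= 63 GB/s per direction.
    const double pcie_uplink_gbs   = 32.0 * 16 / 8 * (128.0 / 130.0);

    // Assumed average compression ratio for asset data -- illustrative only.
    const double compression_ratio = 2.0;

    // Assumed consumer-card VRAM bandwidth, ballpark figure for illustration.
    const double vram_bw_gbs       = 1000.0;

    const double decompressed_out  = pcie_uplink_gbs * compression_ratio;  // what the engine must write
    const double bus_share_one     = decompressed_out / vram_bw_gbs;

    std::printf("uplink: %.0f GB/s in -> %.0f GB/s of decompressed writes (%.0f%% of VRAM bandwidth)\n",
                pcie_uplink_gbs, decompressed_out, bus_share_one * 100.0);
    std::printf("four such engines would already claim ~%.0f%% of the bus for writes alone\n",
                4 * bus_share_one * 100.0);
    return 0;
}
```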
Running the decompression on the shaders instead would give you much, much higher peak decompression rates, after all. Except that actually burns your valuable L1 cache capacity, thrashes your L2, and all that while not even remotely utilizing the shader arrays or the memory interface. It only starts scaling when you go so wide that the L1 cache no longer hits at all, everything chokes on memory, and efficiency tanks. Kind of funny how badly GDeflate actually matches ANY processor architecture.
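Rough numbers behind that claim, with the per-SM cache size and SM count as assumed ballpark figures rather than vendor specs: each in-flight GDeflate chunk wants its 64kB output window close to the decoder, so an SM can keep roughly one stream's working set resident before everything spills to L2 and DRAM, which is exactly where the wide machine stops being wide.

```cpp
#include <cstdio>

int main() {
    // Assumed figures, not vendor specs: ~128 kB of combined L1/shared memory
    // per SM on recent NVIDIA parts, and on the order of ~128 SMs per GPU.
    const int l1_per_sm_kb   = 128;
    const int sm_count       = 128;

    // Each decompression stream needs its 64 kB window / chunk of output
    // history resident, plus a little input staging and Huffman state.
    const int window_kb      = 64;
    const int staging_kb     = 8;
    const int per_stream_kb  = window_kb + staging_kb;

    const int streams_per_sm = l1_per_sm_kb / per_stream_kb;   // 1 with these numbers
    const int total_streams  = streams_per_sm * sm_count;

    std::printf("%d stream(s) fit per SM, %d across the GPU -- a handful of warps busy,\n",
                streams_per_sm, total_streams);
    std::printf("the rest of the shader array idle; going wider just turns L1 hits into L2/DRAM traffic\n");
    return 0;
}
```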
At least LZ4 is a good match for scalar processors.
While we are at it, spilling the beans: where did they put it? For the consumer silicon they simply beefed up one of the two copy engines with a 128-256kB directly addressed scratchpad (with a 512B-2kB fully associative L0 in front), a Huffman decoder and a deflate unit. Remember, those are the copy engines most developers fundamentally misunderstood: when to use them, when not to, and why you even have more than one if they are so slow...
How can I be so certain of those details? It's the only place to put this function without introducing another scheduling round trip. LZ4 doesn't scale indefinitely either, not even with an ASIC, and the throughput of this component already matches what I expect you can achieve. Meanwhile the bandwidth amplification introduced by decompression means it would be too expensive to beef up more than one instance.
NVLink-equipped datacenter GPUs have more than just two copy engines, and my best guess is NVidia simply beefed up all of them. Possibly they even just forked some of the existing ones, so no longer just 4, but more likely 6-8 copy engines in total on those.
For Vulkan users: rejoice, it means you will get an extension soon, and that will implicitly, at last, give you the ability to distinguish the two engines and figure out which one is dedicated to upload and which one to download...
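Until such an extension shows up, the closest you can get is counting the dedicated transfer queue families through the standard Vulkan 1.x API; a minimal sketch (nothing vendor-specific, and note it still can't tell an upload-optimized engine from a download-optimized one):

```cpp
#include <vulkan/vulkan.h>
#include <vector>
#include <cstdio>

// List the dedicated transfer queue families (the copy engines) on a device.
// Today this only tells you how many there are, not which one the driver
// intends for upload vs. download.
void list_copy_engine_queues(VkPhysicalDevice dev) {
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(dev, &count, nullptr);
    std::vector<VkQueueFamilyProperties> families(count);
    vkGetPhysicalDeviceQueueFamilyProperties(dev, &count, families.data());

    for (uint32_t i = 0; i < count; ++i) {
        const VkQueueFlags flags = families[i].queueFlags;
        const bool transfer_only =
            (flags & VK_QUEUE_TRANSFER_BIT) &&
            !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT));
        if (transfer_only)
            std::printf("queue family %u: %u dedicated transfer queue(s)\n",
                        i, families[i].queueCount);
    }
}

int main() {
    VkInstanceCreateInfo ci{};
    ci.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    VkInstance instance = VK_NULL_HANDLE;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) return 1;

    uint32_t dev_count = 0;
    vkEnumeratePhysicalDevices(instance, &dev_count, nullptr);
    std::vector<VkPhysicalDevice> devices(dev_count);
    vkEnumeratePhysicalDevices(instance, &dev_count, devices.data());

    for (VkPhysicalDevice dev : devices) list_copy_engine_queues(dev);

    vkDestroyInstance(instance, nullptr);
    return 0;
}
```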
For DirectX users, I expect Microsoft will keep this hardware detail hidden from you, and you will merely be discouraged from trying to address the copy queues yourself at all from now on. Well, to be specific, NVidia will discourage you in their optimization guide.
Finally, one PSA: GDeflate was a poison pill. Whoever used it in their asset pipeline introduced a major inefficiency on platforms prior to Blackwell, one which neither the CPU nor the GPU can cope with. And now, by enabling LZ4 support, NVidia has immediately declared it obsolete again.