I thought that on PS5 you didn't have to explicitly "use" the decompressor, and that it was handled by default by the API / hardware if the format / compressor was right (so Oodle / Kraken)?
Most ports and cross-generation titles (which is almost everything currently released on Xbox Series X and PS5) don't use the HW decoders at all and do SW decompression on the CPU cores. We're still selling lots of Oodle Data licenses for PS5 and Xbox Series X titles, despite both having free HW decompression. Just because it's available doesn't mean software actually uses it.
Either device has 16 GB of RAM and easily reads above 3.2 GB/s with decompression, and a good chunk of that RAM goes into purely transient memory like GPU render targets etc., so in practice there's a lot less data that actually needs to be resident. That's especially true since many of the most memory-intensive things, like high-detail mip map levels, don't need to be present for the first frame to render and can continue loading in the background. Any load times significantly above 5 seconds, on either device, have nothing to do with either the SSD or codec speeds and are bottlenecked elsewhere, usually on the CPU.
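To put a number on that 5-second figure (my arithmetic, not from the original post): even in the worst case of refilling every byte of memory from disk,

16 GB / 3.2 GB/s = 5 s

and since a good chunk of that 16 GB is transient (render targets and the like), the amount of data that actually has to come off the SSD for a load is considerably smaller.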
"You can clearly see that DirectStorage, whether utilizing GPU decompression or not, is simply much quicker than without. Just look at the CPU load times... massive savings there. As the size of the asset goes up, the scaling improves. The CPU without DirectStorage can't even load the SpaceShuttle asset in the time that they travel through the bounding box and thus never fully loads in time. You can also see that DirectStorage with GPU decompression is over 2x as fast at loading the asset as CPU decompression."

I think there's a bit of misreading here. It isn't a CPU limitation, it's a code limitation. The CPU isn't 'busy' in the slow examples at all; the big difference here (and the reason why you're seeing large assets gaining more) is because we're finally stuffing the disk I/O queue full of parallel things to do.
"I think there's a bit of misreading here. It isn't a CPU limitation, it's a code limitation. The CPU isn't 'busy' in the slow examples at all; the big difference here (and the reason why you're seeing large assets gaining more) is because we're finally stuffing the disk I/O queue full of parallel things to do."

I didn't think I was insinuating that it was a CPU limitation. I thought the takeaway from my post was that it's more of a code limitation of the old API than anything else, considering the improvement when using the CPU with DirectStorage.
Really, this issue dates back to games being written for storage based on ATA standards. Until NVMe showed up, every disk I/O request was a serial request -- issue request, wait for it to complete, repeat. NVMe gives us the ability to issue parallel commands (most commodity drives expose four simultaneous I/O channels), which allows a significant increase in I/O capability. And here's the trick: many high-performance Windows applications exist which have figured out how to properly queue and pipeline disk I/O to really drive the maximum performance from the I/O stack.
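The kind of queueing and pipelining being described might look something like this with plain Win32 overlapped I/O (my sketch, not from the post; the file name, chunk size, and queue depth are made up): several reads are issued back to back so the drive always has a full queue, and only then does the code wait for completions.

```cpp
// Sketch: keeping several reads in flight with classic Win32 overlapped I/O,
// instead of the old "issue request, wait to complete, repeat" pattern.
// "assets.pak", the chunk size, and the queue depth are placeholders.
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    HANDLE file = CreateFileW(L"assets.pak", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    constexpr DWORD kChunk = 1 << 20;   // 1 MiB per request
    constexpr int   kDepth = 8;         // requests kept in flight at once

    std::vector<std::vector<char>> buffers(kDepth, std::vector<char>(kChunk));
    std::vector<OVERLAPPED> ov(kDepth);

    // Issue all the reads back to back -- nothing waits on the previous request,
    // so the drive sees a full queue instead of one command at a time.
    for (int i = 0; i < kDepth; ++i) {
        ov[i] = {};
        ov[i].Offset = static_cast<DWORD>(i) * kChunk;
        ov[i].hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);
        if (!ReadFile(file, buffers[i].data(), kChunk, nullptr, &ov[i]) &&
            GetLastError() != ERROR_IO_PENDING) {
            std::printf("read %d failed: %lu\n", i, GetLastError());
        }
    }

    // Only now do we wait for completions.
    for (int i = 0; i < kDepth; ++i) {
        DWORD bytes = 0;
        GetOverlappedResult(file, &ov[i], &bytes, TRUE);
        std::printf("request %d completed: %lu bytes\n", i, bytes);
        CloseHandle(ov[i].hEvent);
    }
    CloseHandle(file);
    return 0;
}
```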
The really big winner here is the IORing work; it allows a developer to skip having to write the hard part of pipelining and queuing, and instead use a kernel API call to load up the assets in the fastest way possible. It's not about CPU bypass, it's about finally pushing disks to their full ability without an app developer having to think about making it work.
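For comparison, a rough sketch of the IORing path being described (again my illustration, based on the public ioringapi.h surface on Windows 11; exact signatures and flags may differ slightly, and the file name and sizes are placeholders): the application just builds a batch of read operations and hands the whole thing to the kernel in one submit.

```cpp
// Sketch: the same idea through the Windows IORing API (ioringapi.h, Windows 11+):
// build a batch of reads, submit once, then drain completions.
// "assets.pak" and the sizes are placeholders; error handling is omitted.
#include <windows.h>
#include <ioringapi.h>
#include <cstdio>
#include <vector>

int main() {
    IORING_CREATE_FLAGS flags{ IORING_CREATE_REQUIRED_FLAGS_NONE,
                               IORING_CREATE_ADVISORY_FLAGS_NONE };
    HIORING ring = nullptr;
    if (FAILED(CreateIoRing(IORING_VERSION_1, flags, 64, 128, &ring))) return 1;

    HANDLE file = CreateFileW(L"assets.pak", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    constexpr UINT32 kChunk = 1 << 20;  // 1 MiB per request
    constexpr UINT32 kCount = 8;        // requests queued before a single submit
    std::vector<std::vector<char>> buffers(kCount, std::vector<char>(kChunk));

    // Queue up the whole batch; the kernel sees every request at once.
    for (UINT32 i = 0; i < kCount; ++i) {
        BuildIoRingReadFile(ring,
                            IoRingHandleRefFromHandle(file),
                            IoRingBufferRefFromPointer(buffers[i].data()),
                            kChunk,
                            static_cast<UINT64>(i) * kChunk,  // file offset
                            /*userData*/ i,
                            IOSQE_FLAGS_NONE);
    }

    // One submit call, waiting until all kCount operations have completed.
    UINT32 submitted = 0;
    SubmitIoRing(ring, kCount, INFINITE, &submitted);

    for (UINT32 i = 0; i < kCount; ++i) {
        IORING_CQE cqe{};
        if (PopIoRingCompletion(ring, &cqe) != S_OK) break;
        std::printf("request %llu: hr=0x%08lx, %llu bytes\n",
                    static_cast<unsigned long long>(cqe.UserData),
                    static_cast<unsigned long>(cqe.ResultCode),
                    static_cast<unsigned long long>(cqe.Information));
    }

    CloseHandle(file);
    CloseIoRing(ring);
    return 0;
}
```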
The main bottleneck being solved is a code problem, not a CPU or disk problem.
Devs need to use the hardware API; after that, the decompression is automatic. But if you call the software Oodle API it will not use the hardware part, from what Fabian Giesen has said.
What would be the point? If decompression is automated and only requires being presented with the proper format, why would a dev ever go the software route?
Devs are just paying extra to unnecessarily consume CPU cycles and PCIe bandwidth. There has to be a motivating factor.
"I think there's a bit of misreading here. It isn't a CPU limitation, it's a code limitation. The CPU isn't 'busy' in the slow examples at all; the big difference here (and the reason why you're seeing large assets gaining more) is because we're finally stuffing the disk I/O queue full of parallel things to do. ..."

You've been on this 'the CPU processing savings with DirectStorage doesn't matter' trick for quite a while now, haven't you? Wasn't it you who kept trying to suggest that CPU decompression was still more than enough for modern, high-bandwidth applications?
"You've been on this 'the CPU processing savings with DirectStorage doesn't matter' trick for quite a while now, haven't you?"

As a matter of fact, I have. Turns out I was right. It was never a CPU bottleneck or an I/O bottleneck; it was a code bottleneck the whole time -- exactly as I described. This is surprising to literally nobody who actually does this for a living.
"Wasn't it you who kept trying to suggest that CPU decompression was still more than enough for modern, high bandwidth applications?"

Actually no, very specifically I said GPU decompression might be the only really significant new thing DirectStorage may provide. Here's me saying exactly as much in the first link I posted above:
"I also buy into the GPU decompression conversation being more of the 'meat and potatoes' of a newfangled feature being added."
"As I have stated previously multiple times, I absolutely buy decompression needs can be far more highly optimized. <snip> Now, sustaining that 'bandwidth' while decompressing textures needing extra CPU, as specifically linked to the decompression function itself? Yup, I buy that all day."

Looks like your memory is crap.
"I think this is properly debunked by now,"

LOL, what are you even talking about? Everything I said would come to fruition, has.
I suspect what we're really facing here is an API which does all the heavy lifting for game designers who don't want to put in the code effort, which is truly fine. Making it easier for a dev is a rational and reasonable argument, far more so than claiming the kernel is now servicing I/O in some remarkable, game-changing (ha!) faster way. I'm sure the kernel can use more tweaking, as all code can; it isn't bottlenecking disk I/O today on the crap storage we find in commodity-grade consumer devices like typical NVMe drives.
According to Nvidia, GDeflate is a candidate for fixed-function hardware. It would be interesting to see if any PC IHVs spend transistors on it in the future.
"Now this isn't necessarily representative of the 'hit' you can expect in a game. Even with how fast it is, decompressing that asset took around 250 ms, so if your rendering engine is waiting on those materials, you're going to get a very large stutter. So obviously this has to be managed in smaller chunks, and the resulting GPU hit will be lower as a result."

I don't think that's how it works. If that was the case... the CPU would take longer... and rendering would still have to halt until it's loaded...
"I don't think that's how it works. If that was the case... the CPU would take longer... and rendering would still have to halt until it's loaded..."

Those metrics are simply showing the difference in time.
Was thinking about this since I saw those GDC slides from AMD on DirectStorage. This is the hit a 7900 XT takes (the Space Shuttle is a 1 GB compressed file):

[attachment: capture of the decompression cost on the 7900 XT]

Now this isn't necessarily representative of the 'hit' you can expect in a game. Even with how fast it is, decompressing that asset took around 250 ms, so if your rendering engine is waiting on those materials you're going to get a very large stutter; albeit, of course, that depends on how soon/late you're calling it in. So there are ways to manage this in smaller chunks, of course, and the resulting GPU hit will be lower as a result.

Still... 1 GB compressed is not that large, and that's a very powerful GPU.

[attachment]

But yeah, I'm thinking: why even 'burden' the GPU with this at all, rather than just have GPUs with fixed-function hardware to do it going forward? AFAIK the silicon area actually dedicated to this in the PS5 is quite small.
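On the point about managing this in smaller chunks: here is a rough sketch of what that could look like through the public DirectStorage API (dstorage.h). This is my illustration rather than anything from the posts or the AMD talk; the file name, chunk sizes, and the assumption that the asset is split into independently compressed GDeflate chunks are all placeholders, and exact struct fields may vary between SDK versions.

```cpp
// Sketch: splitting one large compressed asset into several smaller DirectStorage
// requests so the GPU decompression cost is spread out instead of landing all at once.
// Assumes a working ID3D12Device and destination buffer already exist; "shuttle.gdeflate",
// the chunk layout, and the sizes are illustrative only. Error handling omitted.
#include <dstorage.h>
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void StreamInChunks(ID3D12Device* device, ID3D12Resource* destBuffer,
                    ID3D12Fence* fence, UINT64 fenceValue)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"shuttle.gdeflate", IID_PPV_ARGS(&file));

    DSTORAGE_QUEUE_DESC queueDesc{};
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // Illustrative layout: the asset split into 8 independently compressed chunks.
    constexpr UINT32 kChunks = 8;
    constexpr UINT32 kCompressedChunkSize   = 64 * 1024 * 1024;   // on-disk size (example)
    constexpr UINT32 kUncompressedChunkSize = 128 * 1024 * 1024;  // size after GDeflate

    for (UINT32 i = 0; i < kChunks; ++i) {
        DSTORAGE_REQUEST request{};
        request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
        request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
        request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;

        request.Source.File.Source = file.Get();
        request.Source.File.Offset = UINT64(i) * kCompressedChunkSize;
        request.Source.File.Size   = kCompressedChunkSize;

        request.Destination.Buffer.Resource = destBuffer;
        request.Destination.Buffer.Offset   = UINT64(i) * kUncompressedChunkSize;
        request.Destination.Buffer.Size     = kUncompressedChunkSize;

        request.UncompressedSize = kUncompressedChunkSize;

        queue->EnqueueRequest(&request);
    }

    // Signal a fence when the batch has landed; a renderer could also submit fewer
    // chunks per frame to bound the per-frame decompression cost.
    queue->EnqueueSignal(fence, fenceValue);
    queue->Submit();
}
```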
For those who watched the AMD presentation on DirectStorage, it's worth noting the accelerated GPU decompression comes at a notable cost: significant VRAM consumption. Compare these two slides from the presentation, at the 15:30 mark and the 15:55 mark, and pay attention to the little flow blocks which indicate copies of the asset data.

In the earlier slide, he calls out up to five copies of the asset data in memory: four in main memory, and the final asset data in GPU memory. Now look at the later slide: we still have just as many potential copies of the data, but now three of them live in VRAM. It's also worth considering that when the resources are decompressed in main memory, we can elect which subresources are transferred over the PCIe bus for inclusion in the next frame; the whole asset doesn't actually HAVE to be in VRAM. With DirectStorage, the entire parent resource is loaded with all subresources into VRAM in their three buffered / staged states.

So it isn't just a tripling of VRAM consumption from today, because today we can selectively decide which subresources need to move over. No, the entire resource chain is a preallocated blob now, so everything about the resource is potentially moved into VRAM. This means our VRAM consumption (on a per-asset basis) is possibly a whole lot more than 3x what we face today.

GPU decompression is definitely faster; it also comes at a cost, and it isn't just GPU cycles.

"For those who watched the AMD presentation on DirectStorage, it's worth noting the accelerated GPU decompression comes at a notable cost: significant VRAM consumption. ..."

That's completely ridiculous... There's the staging buffer size, typically 128 MB - 256 MB, and the size of the asset. The asset is compressed in VRAM and decompresses to the size the asset would have been when transferred over the PCIe bus the old way.
"That's completely ridiculous... There's the staging buffer size, typically 128 MB - 256 MB, and the size of the asset. The asset is compressed in VRAM and decompresses to the size the asset would have been when transferred over the PCIe bus the old way."

I understand your statement, but also consider the differences. In a prior life, we could pick and choose which very specific pieces of an asset needed to come out of system RAM into the VRAM pool. If you want a hand-wavey example, think of miplevels on a megatexture: no need to move the entire texture and all miplevels as a single pool when you could move just the specific miplevel, and even at block granularity. Yeah, they came across the PCIe bus, but only those specific parts you needed. Is it still going to be faster to use GPU decompression? Absolutely. Is it going to come at a VRAM cost? Absolutely.
So essentially you're adding the staging buffer size onto the VRAM, so it will go up, likely by around 1 GB at most, and yet allow you to stream in data much faster.
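For reference, the staging buffer size being discussed is something the application sets explicitly on the DirectStorage factory. A minimal sketch (my illustration; the 256 MB value is just an example):

```cpp
// Sketch: configuring the DirectStorage staging buffer, the main fixed memory
// overhead being discussed here. 256 MB is an illustrative choice.
#include <dstorage.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void ConfigureStagingBuffer()
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    // The SDK default is DSTORAGE_STAGING_BUFFER_SIZE_32MB; larger values let more
    // requests be in flight at once at the cost of extra memory.
    factory->SetStagingBufferSize(256 * 1024 * 1024);
}
```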
"What's the definition of a subresource? DirectStorage supposedly supports tiled resources with a tile size of 64 KB, and you can choose to stream / decompress a single tile."

Start watching around here:
"I understand your statement, but also consider the differences. In a prior life, we could pick and choose which very specific pieces of an asset needed to come out of system RAM into the VRAM pool. ... Is it still going to be faster to use GPU decompression? Absolutely. Is it going to come at a VRAM cost? Absolutely."

Let's say there's a 1 GB overhead required for decompression. Surely games which utilize DirectStorage, and more specifically GPU decompression, can lean more heavily on the fact that data coming over the PCIe bus is ~half the size it would otherwise be, and surely those games will be designed around streaming in assets with that guarantee in mind.
And consider how much moaning we've all endured about how modern games are so VRAM hungry. 1 GB of extra VRAM is meaningful when the overwhelming majority of video cards on the market have 8 GB of VRAM or less, versus system RAM which, bluntly, is a whole lot cheaper and typically larger (or at least easier to make larger in a pinch). For a VRAM-constrained system, GPU decompression has a good probability of being worse.
Here's the great news: GPU decompression is just one aspect of DirectStorage; the really BIG part (IMO, of course) is the IORing API, which is what gets all the fantastic loading-speed increases. As such, we may see GPU decompression as an optional feature, reserved for those cards with enough VRAM to really make use of it. Nothing really wrong with that, IMO.