DirectStorage GPU Decompression, RTX IO, Smart Access Storage

Yeah, so a desktop PC with eight cores at a minimum is going to be fine for level loading at the maximum rate of a commodity SSD. Remember, Zen 2 is a prior-generation architecture -- and I've been panned for insinuating consoles are at a CPU deficit. Guess what? Consoles are at a CPU deficit compared to desktops.

And what of gamers without "commodity SSDs"? According to Microsoft's own numbers, with an 8-core Zen 2 you would top out at around 3.8GB/s, which counts out the majority of PCIe 4 SSDs. Granted, most PC CPUs at Zen 2 level or above have a per-core performance advantage over the console CPUs, but not enough to change the overall conclusion. And there are still plenty of PC gamers rocking 6- or even 4-core systems, or older 8-core systems. And again, this assumes perfect scaling across all CPU cores, which you'll likely never get.

You're conflating terms here. CPU consumption isn't directly a function of I/O bandwidth; in fact, you can achieve maximum disk bandwidth with a relatively tiny amount of I/O and related CPU load.

Not in the game loading scenarios that Microsoft is talking about, which utilise many small IO requests. To saturate higher bandwidth under that scenario it stands to reason you'd have to make many more IO requests, and this would increase the CPU overhead accordingly.

Part of this goes back to file management of the game itself, which should be obvious now.

Yes, there is no argument there. My original post on this matter stated that the two issues that needed to be overcome were the bottlenecks in the PC IO stack and the way the game code itself handles IO. Fixing one without the other will have limited or no success. Hence why lots of new-gen console back-compat games don't have crazy fast load times.

Honestly? All of them. It's a level load! We aren't actively playing the game here, we're waiting to play the game. Want to make an argument about GPU decompression? Great, but you can't buy a "gaming" desktop today with fewer than eight CPU threads; that gives you six more threads for doing... something else.

I think your maths has gone wrong somewhere there. I (extrapolating Microsoft's own figures) said 15 cores, not threads. That's 30 threads, so I'm not sure where the argument about 8-thread CPUs fits in, and even less so the point about having 6 threads left over. Even the decompression alone would saturate 8.75 cores on a 7GB/s drive, which puts it out of reach for the majority of gamers even if you ignore IO overhead completely.

Writes are far less painful than reads, especially asynchronous writes (which would be indicative of recording your streamed video game to a video file.) Reads are pathological because they're blocking operations; writes are not blocking unless you have a specific need for 100% data integrity, which isn't what any of the commodity streaming/recording software is doing. Writes are cached and coalesced by the OS and then written to disk as a few large, contiguous blocks rather than scattered in zillions of individual I/Os (aka random I/O.)
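To make that concrete, here's a minimal Win32 sketch (invented file name, error handling omitted): the buffered writes below complete against the OS cache, and only the explicit flush actually waits on the disk.

```cpp
// cached_write.cpp -- buffered writes return once the OS cache has the data;
// the disk cost is only paid when durability is explicitly requested.
// A sketch with an invented file name; error handling omitted.
#include <windows.h>
#include <vector>

int main() {
    std::vector<char> frame(8 * 1024 * 1024, 'x');          // stand-in for captured video data

    HANDLE file = CreateFileW(L"capture.bin", GENERIC_WRITE, 0, nullptr,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // These complete against the OS write cache; the kernel coalesces them and
    // flushes large contiguous runs to the drive in the background.
    for (int i = 0; i < 100; ++i) {
        DWORD written = 0;
        WriteFile(file, frame.data(), (DWORD)frame.size(), &written, nullptr);
    }

    // Only needed if you truly require the data on stable media right now --
    // this is the call that actually blocks on the disk.
    FlushFileBuffers(file);

    CloseHandle(file);
    return 0;
}
```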

I was talking about in-game asset streaming, not live streaming a video feed -- i.e. loading data from the SSD at the same time as you need your CPU to actually run the game. Granted, streaming will be at a much lower rate than load times, but it's also far more important to keep the CPU load low when doing it. You also still have the potential for bursts of high-bandwidth streaming, which you don't want tanking your framerate because half your CPU is busy dealing with the IO and decompression.

Explanation is simple: they aren't waiting on disk.

Yes that's the point.

The level load time shows literally nothing about describing a bottleneck, pro or con.

What those benchmarks show is precisely that there is a bottleneck in the system that is not the drive itself. So where does that bottleneck sit, and why? Generally the answer is the CPU, once you pass a certain threshold of drive speed:

https://www.techspot.com/review/2117-intel-vs-amd-game-loading-time/

The CPU bottlenecks will obviously be down to different factors, not all IO related, but IO and decompression will certainly be a factor there. So once they're largely removed by DirectStorage, the remaining bottleneck will primarily be the game code itself, and ensuring that's written correctly to maximise the IO system's capabilities.
 
And what of gamers without "commodity SSDs"? According to Microsoft's own numbers, with an 8-core Zen 2 you would top out at around 3.8GB/s, which counts out the majority of PCIe 4 SSDs.
Sigh. Your inexperience with storage is showing, and it's coloring your thinking. Let's make sure you actually understand what you're talking about here:

First: A commodity drive is something you and I would buy for our desktops, not something the Fortune 200 company I work for would buy for a high performance compute cluster. They are different things; the enterprise drives we use are substantially different from the NVMe stuff (yes, even the high-end ones) that even a top-end enthusiast would drop into a desktop.

Second: you again continually conflate bandwidth with IOPs. Maximum bandwidth on a modern (commodity and enterprise both) storage device is very far from maximum IOP rate. Here's why: when reading in large (64KB to 256KB depending on storage technology) blocks, you only need maybe a hundred thousand IOPs to completely saturate a disk interface -- even a PCIe gen4 one. At 128KB block size, 61,000 IOPs is enough to fill a PCIe gen4 NVMe interface (roughly 7.5GBytes/sec.) Disk bandwidth (which is what you're talking about when you say anything in "bytes per second") hasn't been CPU limited since the Pentium 3 on Windows XP. Seriously. It isn't CPU intensive; stop conflating those two, they aren't the same thing.
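For anyone who wants the relationship spelled out, here's a trivial arithmetic sketch using the figures from this post (the ~7.5GB/s target and block sizes are the examples above, nothing measured):

```cpp
// iop_math.cpp -- the only relationship that matters: bandwidth = IOPs x request size.
// The figures are the examples quoted in this post, nothing measured.
#include <cstdio>

int main() {
    // Forward: how much bandwidth do N requests/sec of a given size move?
    std::printf("61,000 IOPs x 128KB = %.1f GB/s\n", 61000.0  * 128 * 1024 / 1e9);
    std::printf("100,000 IOPs x 64KB = %.1f GB/s\n", 100000.0 *  64 * 1024 / 1e9);

    // Backward: how many requests/sec to saturate a gen4 link at a given request size?
    const double gen4 = 7.5e9;   // roughly a PCIe gen4 x4 NVMe link
    std::printf("%.1f GB/s at 128KB = %.0f IOPs\n", gen4 / 1e9, gen4 / (128 * 1024));
    std::printf("%.1f GB/s at 4KB   = %.0f IOPs\n", gen4 / 1e9, gen4 / (4 * 1024));
    return 0;
}
```

Note how the 4KB case needs roughly thirty times the request rate of the 128KB case for the same bandwidth; that request rate, not the GB/s figure, is what costs CPU.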

As I stated twice now, total IOP rate is the causal link to CPU consumption. Let's say it another way so it sticks in your head: maximum IOP rate is never linked to maximal bandwidth -- this is true in spinning and flash storage alike. Want to look at storage like an enterprise datacenter owner? Look at low queue depth 4K random reads the next time you read a storage review article. That's the pathological case, and that's where enterprise storage outclasses commodity storage, which is also why it's so much more expensive. The consumer / commodity world panned Optane for being expensive and slow, because people who have no clue how storage truly works didn't understand the value Optane truly delivers in random 4K access patterns.

Your continual focus on "bandwidth numbers" betrays your naivety in storage technology, full stop.

Not in the game loading scenarios that Microsoft is talking about, which utilise many small IO requests. To saturate higher bandwidth under that scenario it stands to reason you'd have to make many more IO requests, and this would increase the CPU overhead accordingly.
For background streaming? Sure. This still depends on the file layout and the read-ahead caching structure of the application binary. Turns out, games aren't the only software out there with this problem. It's still a challenge even with enterprise applications. The reality is, the Windows I/O stack is still entirely capable of delivering literally a million IOPs on a single storage device using the Win10 kernel.

You also previously focused on "how many cores though?" That's the important distinction of a single storage device serving the IO: there can be no more active I/O threads than there are queues in the storage target, and right now the Optane device is still exposing only four hardware access queues. That means, in the absolute highest-end case, the I/O stack can only have four initiators running, meaning a maximum of four CPUs for the task. Also consider that server CPUs are far lower in MHz (the "typical" CPUs we run in our datacenter are the 6226R, which is 16-core, 2.9GHz max, and is the "clock optimized" Xeon Gold class.) They are at a substantial performance deficit to the far higher clocked desktop CPUs available to gamers.

In the StorageReview bench of Optane, they were using dual-socket 16-core parts at 2.1GHz on a single Optane P5600 and netted more than 1.3MM IOPs at full-bore, and more than 600,000 IOPs running SQL. If four 2.1GHz CPUs can pull that off on the Windows I/O stack, then there isn't much to defend a position of "Well Windows I/O can't deal..."
 
What those benchmarks show is precisely that there is a bottleneck in the system that is not the drive itself. So where does that bottleneck sit, and why? Generally the answer is the CPU, once you pass a certain threshold of drive speed:

https://www.techspot.com/review/2117-intel-vs-amd-game-loading-time/
Look at the changes between CPU here:
[Attached image: 15-p.webp (CrystalDiskMark results by CPU and drive, from the TechSpot article)]


Now look at the changes between CPU here:
[Attached image: 23-p.webp (game load time results by CPU and drive, from the TechSpot article)]


Synthetically, faster clockspeeds and more threads resulted in substantially more disk I/O wherever more queues were used to pull that I/O. Yet in games, moving to a faster CPU with more cores is doing nothing for game load times. Faster and more CPUs show direct improvement in I/O, yet not in games? How do you defend an argument that we're CPU bottlenecked in game load times then?

And you ask where the bottleneck is? Go back to the CrystalDisk image, look at random 4K reads with Queues:1. Did you guess where the bottleneck is?

The game code is single threading I/O, even on devices with multiple hardware queues available. Even the SATA interface performed better with overlapped I/O (queues:32.) The API changes don't have to be about radically changing the I/O stack; it's about doing the work for developers which they aren't doing themselves.
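For the non-storage folks: "overlapped I/O" just means keeping several requests in flight from one thread instead of read-wait-read-wait. A bare-bones Win32 sketch of the idea (invented file name, no error handling, depth and block size arbitrary):

```cpp
// overlapped_reads.cpp -- keep many reads in flight from a single thread.
// Bare-bones sketch: invented file name, no error handling, arbitrary depth/size.
#include <windows.h>
#include <cstdio>

int main() {
    const DWORD kBlock = 128 * 1024;   // 128KB requests
    const int   kDepth = 32;           // 32 requests in flight, like the Q32 benches

    HANDLE file = CreateFileW(L"assets.pak", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING,
                              FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    void*      buffers[32];
    OVERLAPPED ov[32] = {};

    // Issue all kDepth reads up front; none of these calls waits for the disk.
    for (int i = 0; i < kDepth; ++i) {
        buffers[i] = VirtualAlloc(nullptr, kBlock, MEM_COMMIT | MEM_RESERVE,
                                  PAGE_READWRITE);             // page-aligned buffer
        ov[i].Offset = i * kBlock;                             // aligned file offsets
        ov[i].hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);
        ReadFile(file, buffers[i], kBlock, nullptr, &ov[i]);   // typically completes async
    }

    // Reap completions; the drive has had a full queue the entire time.
    for (int i = 0; i < kDepth; ++i) {
        DWORD got = 0;
        GetOverlappedResult(file, &ov[i], &got, TRUE);         // TRUE = wait for this one
        std::printf("request %2d done, %lu bytes\n", i, (unsigned long)got);
        CloseHandle(ov[i].hEvent);
        VirtualFree(buffers[i], 0, MEM_RELEASE);
    }
    CloseHandle(file);
    return 0;
}
```

That pattern, packaged up so every engine doesn't have to hand-roll it, is most of what the new API is offering as far as the I/O side goes.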

This is a solved problem in the enterprise software world.

This is also the bane of my professional existence, when some douche-canoe in the software dev world rolls up in our ticketing queue saying our resources are slow -- either disk or network or CPU or whatever, all because they have no idea how to write their garbage-ass code.

Software idiot: Network connections to the database are slow as balls, I'm only moving like 1000tps!

One of my leads: Yeah, so what are your connection pool settings?

Software idiot: my what?

One of my leads: *facepalm*
 
Sigh. Your inexperience with storage is showing, and it's coloring your thinking. Let's make sure you actually understand what you're talking about here:

First: A commodity drive is something you and I would buy for our desktops, not something the Fortune 200 company I work for would buy for a high performance compute cluster. They are different things; the enterprise drives we use are substantially different from the NVMe stuff (yes, even the high-end ones) that even a top-end enthusiast would drop into a desktop.

Hang on, you're the one who defined an arbitrary delineation between commodity and non-commodity drives as something any 8-core CPU would handle the full throughput of. And yet Microsoft themselves have said a 2.4GB/s SSD would utilise 5 full cores. Even if I accept your argument that the IO overhead doesn't scale linearly with throughput for a given workload (we'll get to that later), decompression overhead absolutely does. So according to Microsoft's own figures a 2.4GB/s drive requires 3 Zen 2 cores for full-speed decompression, which means a 7GB/s drive would require 8.75 CPU cores for that alone. So straight away, even ignoring IO overhead and any other system overhead entirely, you are either wrong in saying that an 8-core CPU could handle the full data rate of a 7GB/s drive, or you're defining a commodity drive as something slower than 7GB/s. Your recent clarification reveals it's the former of those two options.

Second: you again continually conflate bandwidth with IOPs.

That's because for a given workload IOPS absolutely do go up as more bandwidth is consumed. There's a very good reason why higher bandwidth drives are also capable of more IOPS.

Maximum bandwidth on a modern (commodity and enterprise both) storage device is very far from maximum IOP rate. Here's why: when reading in large (64KB to 256KB depending on storage technology) blocks, you only need maybe a hundred thousand IOPs to completely saturate a disk interface -- even a PCIe gen4 one. At 128KB block size, 61,000 IOPs is enough to fill a PCIe gen4 NVMe interface (roughly 7.5GBytes/sec.)

Agreed, and so tell me, if 61K IOPs is enough to saturate a 7.5GB/s drive on a 128KB block-size workload, how many IOPs would be required to saturate a hypothetical 15GB/s drive on the same 128KB block-size workload? More, correct? And more IOPS = more CPU overhead. Hence why higher bandwidth can be associated with more CPU overhead if comparing apples-to-apples workloads.

Disk bandwidth (which is what you're talking about when you say anything in "bytes per second") hasn't been CPU limited since the Pentium 3 on Windows XP. Seriously. It isn't CPU intensive; stop conflating those two, they aren't the same thing.

Here's where we disagree. Or rather, here's where you disagree with Microsoft because Microsoft very specifically state that the IO overhead to saturate a 2.4GB/s SSD in a typical game workload in Windows using current IO protocols would require 2 Zen 2 cores. Are you saying that Microsoft are lying? Or simply mistaken? Here's another Microsoft statement on the same subject:

Microsoft said:
Taking the Series X’s 2.4GB/s capable drive and the same 64k block sizes as an example, that amounts to >35,000 IO requests per second to saturate it.

Existing APIs require the application to manage and handle each of these requests one at a time first by submitting the request, waiting for it to complete, and then handling its completion. The overhead of each request is not very large and wasn’t a choke point for older games running on slower hard drives, but multiplied tens of thousands of times per second, IO overhead can quickly become too expensive preventing games from being able to take advantage of the increased NVMe drive bandwidths.

https://devblogs.microsoft.com/directx/directstorage-is-coming-to-pc/

In the StorageReview bench of Optane, they were using dual-socket 16-core parts at 2.1GHz on a single Optane P5600 and netted more than 1.3MM IOPs at full-bore, and more than 600,000 IOPs running SQL. If four 2.1GHz CPUs can pull that off on the Windows I/O stack, then there isn't much to defend a position of "Well Windows I/O can't deal..."

These are entirely different workloads, potentially using entirely different APIs or programming models. The bottom line here is that Microsoft are saying one thing about their own IO protocols' overhead in games and you're saying something entirely different based on what you see in completely different server workloads. I'm afraid it doesn't make for a convincing argument.

Faster and more CPUs show direct improvement in I/O, yet not in games? How do you defend an argument that we're CPU bottlenecked in game load times then?

I'm not quite sure what table you're looking at, because there is an obvious, almost direct scaling of single-threaded performance to load times. This is clear evidence that the CPU is the bottleneck here. I completely agree that there's a problem with this not being multithreaded, but hang on, if it were, these load times would at best be 4x faster due to the queue limit you mentioned earlier, right? So you could essentially quarter all of those times on the 3 SSDs, while the HDD times would remain pretty similar as clearly the bottleneck there is the drive rather than the CPU. So the result would be a 5GB/s NVMe drive performing the same as (or at best 4x faster than) a 500MB/s SATA SSD and 10x better than a 100MB/s HDD. So multithreading doesn't change the general picture here. The CPU is still limiting the maximum drive throughput.

Of course it's not necessarily the IO that's limiting load times here at all; it could be a completely different aspect of the setup. But the point remains: you won't find a single game loading benchmark anywhere that demonstrates linear scaling (or even quarter-linear scaling, to account for the lack of multithreading) with drive speed differences. And the simplest and most realistic explanation for that is that Microsoft are actually telling us the truth when they say modern Windows IO protocols have a significant overhead in games with high speed drives, to the point where in at least some circumstances those drives can be bottlenecked by the IO overhead on the CPU - especially in combination with CPU decompression.

And that's why they created DirectStorage.
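For context on what the new model looks like to a developer, here's a rough sketch of batching reads through a DirectStorage queue. It's based on my reading of the public DirectStorage documentation, so treat the type and field names as approximate; the file name and chunk layout are invented and error handling is omitted:

```cpp
// dstorage_sketch.cpp -- rough sketch of batching reads through DirectStorage.
// Based on my reading of the public DirectStorage docs; names/fields may differ
// from the shipping SDK. Invented file name, no error handling.
#include <dstorage.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void LoadLevelChunks(ID3D12Device* device, ID3D12Resource* destBuffer,
                     ID3D12Fence* fence, UINT64 fenceValue)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(L"level01.pak", IID_PPV_ARGS(&file));   // hypothetical asset file

    DSTORAGE_QUEUE_DESC queueDesc = {};
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.Device     = device;
    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // Enqueue many small requests; the runtime/driver batches and submits them,
    // which is exactly the per-request work the old model left to the application.
    const UINT32 kChunk = 64 * 1024;                          // 64KB, per the MS blog example
    for (UINT32 i = 0; i < 1024; ++i) {
        DSTORAGE_REQUEST r = {};
        r.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
        r.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
        r.Source.File.Source      = file.Get();
        r.Source.File.Offset      = UINT64(i) * kChunk;
        r.Source.File.Size        = kChunk;
        r.Destination.Buffer.Resource = destBuffer;           // GPU-visible destination
        r.Destination.Buffer.Offset   = UINT64(i) * kChunk;
        r.Destination.Buffer.Size     = kChunk;
        queue->EnqueueRequest(&r);
    }

    // One fence signal for the whole batch instead of waiting per request.
    queue->EnqueueSignal(fence, fenceValue);
    queue->Submit();
}
```

The point of the sketch: the app enqueues thousands of small requests and waits on a single fence, rather than issuing, waiting on, and completing each request itself.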
 
And yet Microsoft themselves have said a 2.4GB/s SSD would utilise 5 full cores.
Sigh. Stop with the bandwidth. They dumbed the message down for people who don't understand, please stop trying to tie bandwidth to CPU consumption -- it doesn't work the way you think it does, and I've tried explaining multiple times.

A disk IO request (an IOP) can be of any power-of-two size, from 512 bytes up to (depending on a number of variables) 16 megabytes. Therefore, both of the following statements are potentially true:
  • One million IOPS are needed to transfer ~500 megabytes per second in bandwidth. (IO request size = 512 bytes)
  • Five hundred and twelve IOPS are needed to transfer ~8 gigabytes per second in bandwidth. (IO request size = 16 MBytes)
The CPU consumption necessary to push a million IOPS is indeed quite high. The CPU consumption to push 512 IOPs is so small as to be nearly non-existent. I need you to understand this part: you can't keep assigning a CPU consumption value to a bandwidth number; they aren't linked in the way you think they are. Stop pointing to a marketing document that was intended for the unwashed, non-storage-educated masses. Yes, it's marketing, because of how Awesome(TM) the SexBox is about to get...

So tell me, if 61K IOPs is enough to saturate a 7.5GB/s drive on a 128KB block-size workload, how many IOPs would be required to saturate a hypothetical 15GB/s drive on the same 128KB block-size workload?
Go back to the CrystalDisk benchmarks above: the random, overlapped-IO 4K reads on the slowest CPU hit 2396MB/sec (which equals 2,453,504KB/sec) using 4K operations. It's simple math: 2,453,504 / 4 = 613,376 IOPs from the slowest CPU. The fastest CPU pounded out 825,600 IOPS. Using my contrived 128KB IO request example, that's enough for a 3400G to shove more than 76 gigabytes/sec.

This is why you need to stop conflating bandwidth with CPU consumption. They aren't causally linked; IOPs are. And without IOPs information, we have no actual data to make claims.

There is an obvious, almost direct scaling of single-threaded performance to load times.
Look at the CrystalDisk benches, not the game load times. Right here:
[Attached image: 15-p.webp (CrystalDiskMark results by CPU and drive, from the TechSpot article)]

Look at the single threaded (Q1:T1) throughput from the SATA SSD all the way to the PCIe-4 SSD. Let's choose the clearly worst case 3400G benchmarks for now.

On the SATA SSD, one thread at a queue depth of one transaction yields 34MB/sec, or 8704 IOPs.
On the fastest PCIe SSD, one thread at a queue depth of one transaction yields 49MB/sec, or 12544 IOPs.

That's a roughly 45% increase in IOP rate (and equivalent bandwidth) for the same CPU "cost" on a single thread.

Did the game load times change by 45% moving between those two examples?
[Attached image: 23-p.webp (game load time results by CPU and drive, from the TechSpot article)]

No, they didn't. They moved by 3.3%.

These apps aren't CPU bottlenecked and aren't I/O bottlenecked. They're application code bottlenecked.
 
Sigh. Stop with the bandwidth. They dumbed the message down for people who don't understand, please stop trying to tie bandwidth to CPU consumption -- it doesn't work the way you think it does, and I've tried explaining multiple times.

A disk IO request (an IOP) can be of any power-of-two size, from 512 bytes up to (depending on a number of variables) 16 megabytes. Therefore, both of the following statements are potentially true:
  • One million IOPS are needed to transfer ~500 megabytes per second in bandwidth. (IO request size = 512 bytes)
  • Five hundred and twelve IOPS are needed to transfer ~8 gigabytes per second in bandwidth. (IO request size = 16 MBytes)
The CPU consumption necessary to push a million IOPS is indeed quite high. The CPU consumption to push 512 IOPs is so small as to be nearly non-existent. I need you to understand this part: you can't keep assigning a CPU consumption value to a bandwidth number; they aren't linked in the way you think they are. Stop pointing to a marketing document that was intended for the unwashed, non-storage-educated masses. Yes, it's marketing, because of how Awesome(TM) the SexBox is about to get...

I already addressed all of this in my previous post so I'm not sure why you're re-hashing it. The examples above are pointless as we're not talking about 512-byte or 16MB block sizes. We're talking about typical gaming workloads which don't go much lower than 64KB on average. When Microsoft say that they're using up 2 cores for IO overhead it's almost certainly on a typical gaming workload with a typical average block size. So using a faster drive with the same typical average block size will obviously require more operations to saturate the bandwidth, hence more IO overhead.

Go back to the CrystalDisk benchmarks above: the random, overlapped-IO 4K reads on the slowest CPU hit 2396MB/sec (which equals 2,453,504KB/sec) using 4K operations. It's simple math: 2,453,504 / 4 = 613,376 IOPs from the slowest CPU. The fastest CPU pounded out 825,600 IOPS. Using my contrived 128KB IO request example, that's enough for a 3400G to shove more than 76 gigabytes/sec.

Again, this is an apples-to-oranges comparison. The topic is real-world gaming workloads, not a synthetic benchmark designed specifically to minimize IO overhead and maximise throughput. Show me any game benchmark that demonstrates a linear speedup in load times from a faster drive. Surely not every single game ever written can be coded so badly as to barely scale even with no CPU limitations (as you claim) and orders-of-magnitude faster drives?

These apps aren't CPU bottlenecked and aren't I/O bottlenecked. They're application code bottlenecked.

Is the application code running on thin air?
 
Not trying to intervene and say who's wrong, but why did Sony implement this decoder block if modern CPUs are more than capable enough?
Just because you can doesn't mean you want to. But the decoder isn't relevant to the specific discussion you quoted; it's about bandwidth, not what decodes it or where.
 
Just because you can doesn't mean you want to. But the decoder isn't relevant to the specific discussion you quoted; it's about bandwidth, not what decodes it or where.

More specifically, it's about the CPU overhead of high-throughput IO, which encompasses both IO overhead and decompression. The discussion about bandwidth, and specifically whether the CPU load increases with increased data throughput, is just a subcomponent of that discussion.
 
I already addressed all of this in my previous post so I'm not sure why you're re-hashing it.
I rehashed it because you incorrectly brought it up again, right here:

And yet Microsoft themselves have said a 2.4GB/s SSD would utilise 5 full cores.
You decided to start a conversation about some marketing bullshit which doesn't properly or accurately describe the supposed problem -- bandwidth (2.4GB/s) does not equate to CPU (OMG FIVE CORES) consumption. I'm left to assume you brought it up because you still don't understand. I'm left to now continue to try to educate you: those two things are not linked in the way you and that bullshit marketing speech try to presume.

The examples above are pointless as we're not talking about 512-byte or 16MB block sizes. We're talking about typical gaming workloads which don't go much lower than 64KB on average.
Here's a question: what data are you basing this on? If you're going to get that specific, quote your source.

When Microsoft say that they're using up 2 cores for IO overhead it's almost certainly on a typical gaming workload with a typical average block size.
There is no "typical gaming workload" and there is no "typical average block size." The workload is unique to the game; you've mentioned the reasons so in prior posts. Some of it is bitwise streaming in the background, some of it is bulk-rate loads during an "intermission" of sorts. Some of those data blocks might be mapped texture regions that are 4k block-aligned so they perfectly fit into the physical 4k memory block sizes of both your main memory and GPU memory.

So using a faster drive with the same typical average block size will obviously require more operations to saturate the bandwidth, hence more IO overhead.
I have literally provided data, right above your post, where a faster disk resulted in a roughly 45% increase in I/O rate with literally the exact same single-core CPU consumption -- and the load time of the games changed by a rounding error. You told me the Windows I/O stack was basically limited to SATA speeds, and yet that's demonstrably false. You then told me the faster disk is being held back by CPU constraints, which is why load times haven't changed, and yet the empirical data above shows the I/O improved by ~45% while CPU load stayed completely constant and -- tada, a surprise to nobody who does high performance storage and compute for a living -- game load times changed by essentially zero with that extra disk performance.

None of the available, empirical, and repeatable data supports your arguments.

Find better data, and present your better data, or simply stop tilting at windmills.
 
Not all games load using the same method (memory-mapped, OS-specific, asynchronous or synchronous), the same compression algorithm, or the same way of doing things (load by chunks, marshall a stream, or load-in-place with pointer fix-up), so it's rather difficult to tell why games don't seem to benefit much from hardware performance increases.
One sure thing is that keeping your archive files open at all times helps performance.
It's not really clear to me how best to use an NVMe drive; most standard benchmarks use settings that don't really match any game I've worked on.

[Also, compression algorithms vary a lot in performance vs compression ratio; with massive bandwidth you may want to pick one that compresses a little less but runs much faster...]
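For what it's worth, that trade-off is easy to measure per codec. A rough single-core throughput sketch along these lines (zlib here purely as a stand-in, with made-up input data) gives you the MB/s-per-core figure to weigh against drive bandwidth and compression ratio:

```cpp
// decomp_bench.cpp -- rough single-core decompression throughput measurement.
// A sketch using zlib as a stand-in; swap in whatever codec the engine actually uses.
// Build: g++ -std=c++17 -O2 decomp_bench.cpp -lz -o decomp_bench
#include <zlib.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    // Made-up test data: 64MB of mildly repetitive bytes standing in for assets.
    std::vector<unsigned char> raw(64 * 1024 * 1024);
    for (size_t i = 0; i < raw.size(); ++i) raw[i] = (unsigned char)((i * 31) & 0xFF);

    uLongf compSize = compressBound(raw.size());
    std::vector<unsigned char> comp(compSize);
    compress2(comp.data(), &compSize, raw.data(), raw.size(), Z_BEST_SPEED);

    // Decompress repeatedly on one thread and time it.
    std::vector<unsigned char> out(raw.size());
    const int iters = 10;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        uLongf outSize = out.size();
        uncompress(out.data(), &outSize, comp.data(), compSize);
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double mbps = (double)raw.size() * iters / (1024.0 * 1024.0) / secs;
    std::printf("ratio %.2f : 1, single-core decompress ~%.0f MB/s\n",
                (double)raw.size() / compSize, mbps);
    return 0;
}
```

Swap in the actual codec and representative asset data to get numbers that mean anything for a given engine.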
 
Didn't they say somewhere that an important part was also that the hardware decompression can use a new format, BCPack, for textures in addition to zlib?
 
I rehashed it because you incorrectly brought it up again, right here:


You decided to start a conversation about some marketing bullshit which doesn't properly or accurately describe the supposed problem -- bandwidth (2.4GB/s) does not equate to CPU (OMG FIVE CORES) consumption. I'm left to assume you brought it up because you still don't understand. I'm left to now continue to try to educate you: those two things are not linked in the way you and that bullshit marketing speech try to presume.

Why do you keep ignoring the fact that we're talking about a specific workload being scaled up here? We're not swapping workloads between CrystalMark, giant media files or tiny SQL queries. We're talking about a typical gaming workload - or at least what Microsoft has defined as a typical gaming workload for the purposes of its comparison - and simply scaling it up. And note both Sony and Nvidia have done the same. Granted, you may arrive at a different figure for a typical gaming workload depending on the games you base it upon, but using that specific workload and scaling it up in terms of data delivered is obviously going to require more IO ops.

In the typical gaming workload that Microsoft is using, it takes 2 cores to service the IO overhead at 2.4GB/s. If you take that same workload and widen it to 7GB/s, the IO is going to scale up. You've literally proven that already in your CrystalMark example above. See how the data transferred goes up as the drive speed goes up while the workload itself remains consistent at 4K blocks... 2396MB/s (the 3400G on the PCIe4 SSD) at 4K block size is ~600k IOPS. 211MB/s (the 3400G on the SATA SSD) at 4K block size is ~53k IOPS. Same workload, higher bandwidth, more IOPS.

Here's a question: what data are you basing this on? If you're going to get that specific, quote your source.

The DirectStorage blog says the following:

"In either case, previous gen games had an asset streaming budget on the order of 50MB/s which even at smaller 64k block sizes (ie. one texture tile)"

However doing some more digging on this, the following appears to be a much better source for a "typical gaming workload", at least during streaming:

https://www.gamersnexus.net/guides/...o-games-load-ssd-4k-random-relevant?showall=1

[Attached image: ssd-file-size-1.jpg (GamersNexus chart of file/block sizes read by several games)]


So it seems across these 5 games the blocks sit mainly in the 4k - 32k range, but it would be possible to work out an average from this data which I assume is similar to what Microsoft (and Nvidia, and Sony) have done for the purpose of these claims.

There is no "typical gaming workload" and there is no "typical average block size." The workload is unique to the game; you've mentioned the reasons so in prior posts. Some of it is bitwise streaming in the background, some of it is bulk-rate loads during an "intermission" of sorts. Some of those data blocks might be mapped texture regions that are 4k block-aligned so they perfectly fit into the physical 4k memory block sizes of both your main memory and GPU memory.

Of course the workload varies by game (as shown above). That's why I keep referring to a typical average. You take a cross-section of games, look at their behaviour and calculate the average block size (or, more directly, IOPS usage at a given bandwidth) at load time across them all. There's your typical average that can be used to talk about a specific CPU requirement for a specific SSD bandwidth. Yes, it will vary from game to game, but the average will give a good indication of what kind of CPU requirement you'll need, at least for that cross-section of games. The wider the cross-section, the more accurate it will be. I can't say how accurate Microsoft's example is.
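To illustrate the kind of calculation I mean (the block-size mix below is invented for the example, not taken from the GamersNexus data), you turn the distribution into an average request size and then into an IOPS figure for whatever drive bandwidth you care about:

```cpp
// avg_block_iops.cpp -- turn a block-size distribution into IOPS at a target bandwidth.
// The distribution below is invented for illustration, not measured data.
#include <cstdio>

int main() {
    struct Bucket { double sizeBytes; double share; };        // share = fraction of requests
    const Bucket mix[] = {
        {  4 * 1024.0, 0.30 }, { 16 * 1024.0, 0.30 },
        { 32 * 1024.0, 0.25 }, { 64 * 1024.0, 0.15 },
    };

    double avg = 0.0;
    for (const Bucket& b : mix) avg += b.sizeBytes * b.share;  // weighted average request size

    const double targets[] = { 2.4e9, 7.0e9 };                 // the 2.4GB/s and 7GB/s cases
    for (double bw : targets) {
        std::printf("avg request %.1f KB -> %.0f IOPS to sustain %.1f GB/s\n",
                    avg / 1024.0, bw / avg, bw / 1e9);
    }
    return 0;
}
```

Feed in a measured distribution instead of the made-up mix and you've got the IOPS figure to which a per-request CPU cost can then be applied.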

One thing I will note is that Nvidia are claiming 2 full cores to cover the IO overhead on a 7GB/s drive here, so there's clearly a large difference depending on what cross-section of games you base your analysis on. But on the other hand they claim another 22 cores are required for the decompression (presumably a different routine than the one Microsoft are basing their comparison on, hence the higher overhead). So the bottom line here is still that, between IO and decompression, you are going to be very much CPU bottlenecked with a high-speed SSD without DirectStorage (if utilising the drive's full speed at, for example, load times).

I have literally provided data, right above your post, where a faster disk resulted in a roughly 45% increase in I/O rate with literally the exact same single-core CPU consumption.

There is no evidence in that post that the CPU consumption stayed the same. The CPU itself stays the same, but there is no suggestion that it's not performing more operations to achieve the higher throughput in that workload. Serious question: is it your argument that with a fixed block size like the CrystalMark benchmark (4K), there are not more IOPS required (i.e. more work from the CPU) to saturate a fast disk vs a slow disk? Because honestly that reads to me like what you're trying to say. Maybe I've just misunderstood you.
 
"between disk IO and compression, you are going to be very much CPU bottlenecked."

As I have stated previously multiple times, I absolutely buy that decompression can be far more highly optimized. Every data point we have regarding disk I/O says the kernel and the CPUs are both capable of delivering the needed I/O without the massive overhead which keeps being bandied about. Yes, there can be gains from an API where the proper threading and queuing of the disk I/O is "solved" for those games which haven't done deep I/O stack work before.

Having been directly responsible for high performance enterprise storage and the related servers AND the related operating systems for two decades at three Fortune 500 companies with a LOT of compute and storage on their datacenter floors, the marketing around disk I/O is precisely that -- marketing. Now, sustaining that "bandwidth" while decompressing textures needing extra CPU, as specifically linked to the decompression function itself? Yup, I buy that all day.
 
Great news. Quite unexpected too. Can't wait to see this in action. I wonder if we'll hear anything more about RTX IO now or whether that will just quietly disappear.
 
I'm curious to know if DirectStorage works with Vulkan. Not clear to me that it would.
On Windows? It likely will, as it's just an OS-level I/O API which may or may not plug into IHVs' drivers for GPU decompression.
But tbh my expectations for DS are rather low. I doubt that we'll see much benefit from it in the coming years.
 