Blazing Fast NVMe SSDs and DirectStorage API for PCs *spawn*

Sorry



I remember the days of having to upgrade constantly on the PC side. Not only that, but there was much more you had to upgrade. CPU/mobo/RAM of course, that's as it always is. There were co-processors at one point, which gave way to graphics cards and sound cards, and then optical storage, and heck, you'd have to upgrade your modem! Damn, I remember paying big bucks for a hardware 56k modem. Remember that? There were some 56k modems that offloaded a lot of work to the CPU, so you'd often get worse performance, but the more expensive modems would use less of the CPU and you'd get better connections. I remember having to buy a proper CD-ROM drive because it used two lasers and got much faster read speeds. The good old expensive days.

RAM is cheap enough that most people running a 16 GB system can simply buy another 16 GB for $100-$200. I don't think that if you have 32 GB of RAM with a 16 GB graphics card you really even need an 8 GB/s NVMe drive.

That's also barring there being a different solution on PCs.

The game has to get the data into RAM before RAM can be used for caching. That inevitably leads to either load times, lower-detail assets, or pop-in when an asset is accessed for the first time. It can also mean developers/artists need more time to implement smarter caching and to minimize load times and pop-in. Game sizes will grow, and something like 32 GB is not enough to hold a full game like the next GTA in RAM, i.e. streaming is needed. I for one would like to get away from the elevator rides and whatnot used to mask load times (Mass Effect).
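To make the streaming point concrete, here is a minimal sketch (not from any particular engine) of a background loader thread that prefetches assets into a RAM cache while the game thread only does non-blocking lookups; load_asset_from_disk and the cache policy are purely illustrative:

```cpp
#include <condition_variable>
#include <fstream>
#include <iterator>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Illustrative stand-in for the real read (plus any decompression).
std::vector<char> load_asset_from_disk(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(f), {});
}

class StreamingCache {
public:
    void start() { worker_ = std::thread([this] { run(); }); }
    void stop()  { { std::lock_guard<std::mutex> l(m_); quit_ = true; } cv_.notify_one(); worker_.join(); }

    // Game thread: request an asset ahead of when it will be needed.
    void prefetch(const std::string& path) {
        std::lock_guard<std::mutex> lock(m_);
        pending_.push(path);
        cv_.notify_one();
    }

    // Game thread: non-blocking lookup; returns nullptr if not resident yet,
    // so the caller falls back to a lower-detail asset instead of stalling.
    const std::vector<char>* try_get(const std::string& path) {
        std::lock_guard<std::mutex> lock(m_);
        auto it = cache_.find(path);
        return it != cache_.end() ? &it->second : nullptr;
    }

private:
    void run() {
        for (;;) {
            std::string path;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return quit_ || !pending_.empty(); });
                if (quit_) return;
                path = std::move(pending_.front());
                pending_.pop();
            }
            auto data = load_asset_from_disk(path);   // I/O stays off the game thread
            std::lock_guard<std::mutex> lock(m_);
            cache_[path] = std::move(data);
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> pending_;
    std::unordered_map<std::string, std::vector<char>> cache_;
    std::thread worker_;
    bool quit_ = false;
};
```

The faster the drive, the later a prefetch() can be issued and still arrive in time, which is exactly where the pop-in and the load-masking tricks come from on slow storage.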

Having a fast SSD and utilizing it is a Good Thing no matter what platform is being used. Platform exclusives will likely drive the SSD hard; cross-platform titles will take time to catch up due to the lowest common denominator. DirectStorage will help the PC catch up. If anything, a faster SSD is even more important on a high-end PC, which can use even higher-end assets than a console. Hopefully the bottleneck will be install size, and there HW decompression will help some. Hopefully Unreal 5-like technology will go a long way towards making the engine/art-creation side more economical, allowing the bottleneck to shift to install size/streaming speed.

Hopefully we will never again see an atrocity like the GTA V load times or the Mass Effect elevators. At worst, SSDs would lead to multithreaded IO and faster load times. Don't use a single CPU core to crunch data for minutes like GTA V's loading does.

Edit: The latest game that drove me nuts with loading times was Half-Life: Alyx. God dang, it's distracting to hit a load screen in VR. Also, some textures in there were not so high quality; VR would really benefit from something like Unreal 5, i.e. allow hopping around, seeing the large scale, and then sticking your face into amazingly detailed objects, all while working without load times using streaming.
 
I for one would like to get away from the elevator rides and whatnot used to mask load times (Mass Effect).
It's not only loading textures and such; many of these transitions are used for dialogue with characters, which will have to be dealt with somehow in multiplatform titles.
 
There were co-processors at one point, which gave way to graphics cards and sound cards, and then optical storage, and heck, you'd have to upgrade your modem! Damn, I remember paying big bucks for a hardware 56k modem. Remember that?
The first HDD I bought was 400 MB for 400 bucks, so yeah. Could never get the ATI Mach32 card working with Borland Turbo C though, no matter how many drivers I downloaded at 9600 bits per second :rolleyes:
Of course, without the next-gen consoles doing what they are doing, I wonder whether we would see 32 GB minimum requirements, 16+ core CPUs, or NVMe drives anytime in the next 5 years.
 
The game has to get the data into RAM before RAM can be used for caching. That inevitably leads to either load times, lower-detail assets, or pop-in when an asset is accessed for the first time. It can also mean developers/artists need more time to implement smarter caching and to minimize load times and pop-in. Game sizes will grow, and something like 32 GB is not enough to hold a full game like the next GTA in RAM, i.e. streaming is needed. I for one would like to get away from the elevator rides and whatnot used to mask load times (Mass Effect).

Having a fast SSD and utilizing it is a Good Thing no matter what platform is being used. Platform exclusives will likely drive the SSD hard; cross-platform titles will take time to catch up due to the lowest common denominator. DirectStorage will help the PC catch up. If anything, a faster SSD is even more important on a high-end PC, which can use even higher-end assets than a console. Hopefully the bottleneck will be install size, and there HW decompression will help some. Hopefully Unreal 5-like technology will go a long way towards making the engine/art-creation side more economical, allowing the bottleneck to shift to install size/streaming speed.

Hopefully we will never again see an atrocity like the GTA V load times or the Mass Effect elevators. At worst, SSDs would lead to multithreaded IO and faster load times. Don't use a single CPU core to crunch data for minutes like GTA V's loading does.

Edit: The latest game that drove me nuts with loading times was Half-Life: Alyx. God dang, it's distracting to hit a load screen in VR. Also, some textures in there were not so high quality; VR would really benefit from something like Unreal 5, i.e. allow hopping around, seeing the large scale, and then sticking your face into amazingly detailed objects, all while working without load times using streaming.

You'll have fast SSDs, loads of system RAM, and loads of graphics RAM. The beauty of the PC.
 
I/O request packets (IRPs) are not copied or moved - they are passed down the driver stack by reference, i.e. by a pointer in IoCallDriver(). I/O stack parameters are indeed copied - the I/O stack is how each driver keeps track of its own actions on that particular IRP - but these only take 36 bytes.
But you realise these happen thousands and thousands of times, right?

I feel as though you're dismissing the I/O issue. Most people recall their switch from an HDD to an SSD because it was a revolutionary change in performance, but moving from one SSD to another that's 4x to 5x faster nets you a marginal increase in performance. What is your explanation for why huge leaps in SSD NAND performance, controller improvements, and PCIe 4.0 improvements fail to materialize as meaningful increases in actual performance?
 
But you realise these happen thousands and thousands of times, right?
What is your explanation for why huge leaps in SSD NAND performance, controller improvements, and PCIe 4.0 improvements fail to materialize as meaningful increases in actual performance?
It's not 'thousands'; most file I/O requests are served from the cache through the filesystem driver's fast I/O path. It's just that real-world applications and services request and process the data in small chunks, even those that perform bulk data copying.
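As an illustration of that access pattern (not taken from any real application; the 64 KB buffer is an assumption), this is roughly what an ordinary file copy looks like and the kind of event stream Process Monitor will show for it:

```cpp
#include <windows.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Copies a file the way many ordinary apps do: a loop of small, buffered,
// synchronous ReadFile/WriteFile calls. Each iteration is one request through
// the filesystem stack (usually satisfied from the cache), so what you measure
// is per-request overhead and cache behaviour, not the drive's peak bandwidth.
bool copy_in_small_chunks(const wchar_t* src, const wchar_t* dst)
{
    HANDLE in  = CreateFileW(src, GENERIC_READ, FILE_SHARE_READ, nullptr,
                             OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    HANDLE out = CreateFileW(dst, GENERIC_WRITE, 0, nullptr,
                             CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (in == INVALID_HANDLE_VALUE || out == INVALID_HANDLE_VALUE)
        return false;

    std::vector<char> buffer(64 * 1024);        // typical small application buffer
    DWORD readBytes = 0, writtenBytes = 0;
    uint64_t requests = 0;

    while (ReadFile(in, buffer.data(), static_cast<DWORD>(buffer.size()), &readBytes, nullptr)
           && readBytes > 0)
    {
        if (!WriteFile(out, buffer.data(), readBytes, &writtenBytes, nullptr))
            break;
        ++requests;                             // one more event Process Monitor will list
    }
    printf("issued %llu reads of %zu bytes each\n",
           static_cast<unsigned long long>(requests), buffer.size());

    CloseHandle(in);
    CloseHandle(out);
    return true;
}
```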


You can see it for yourself with Sysinternals Process Monitor
https://docs.microsoft.com/en-us/sysinternals/downloads/procmon

It can track all ReadFile/WriteFile/CreateFileMapping events and filter the results by process and file path; it can even track I/O request packets and fast I/O requests.
You need to enable the Show File System Activity toolbar button, then check File > Capture Events, capture a couple of dozen seconds, and then right-click on an event to filter similar results.

Start with some benchmarks, like CrystalDiskMark and ATTO, then run some of your favourite apps. The pattern hasn't changed from what it was 20 years ago, when I first ran this tool on a 300 MHz Pentium II with 128 MB of PC-133 SDRAM rated at 1 GB/s, and a 5 GB UDMA33 HDD rated at 10 MB/s.
 
Most people recall their switch from an HDD to an SSD because it was a revolutionary change in performance, but moving from one SSD to another that's 4x to 5x faster nets you a marginal increase in performance. What is your explanation for why huge leaps in SSD NAND performance, controller improvements, and PCIe 4.0 improvements fail to materialize as meaningful increases in actual performance?

If you're talking about games, that's because games aren't optimized yet. In SC and Doom, going from SATA to NVMe does improve things quite a bit. Just like PS4 games running on PS5 won't see a meaningful increase in streaming performance either.
Also, a PC with NVMe running Windows feels snappier than one with an aging SATA drive.
 
Start with some benchmarks, like CrystalDiskMark and ATTO, then run some of your favourite apps. The pattern hasn't changed from what it was 20 years ago, when I first ran this tool on a 300 MHz Pentium II with 128 MB of PC-133 SDRAM rated at 1 GB/s, and a 5 GB UDMA33 HDD rated at 10 MB/s.

I'm just going to ask the same question again: What is your explanation for why huge leaps in SSD NAND performance, controller improvements, and PCIe 4.0 improvements fail to materialize as meaningful increases in actual performance? How do you explain the disparity in I/O throughput using the same hardware running Windows versus Linux? Did you watch the LTT video @Davros linked in this post, and LTT's apology to Tim Sweeney?

There is an issue with Windows I/O. Perhaps it's not quite as bad in the kernel stack as I perceived, but fast driver I/O isn't always suitable; likewise, you can go faster with unbuffered data, but that's also not always a desirable choice. Benchmarks are only as good as their ability to mirror real-world scenarios: not just reading data off an SSD, but unpacking and decompressing it, putting some in RAM, sending some to the GPU. Once all parts of the system start working with the data, the Windows I/O stack becomes a lot busier than when just reading data.
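To make the unbuffered point concrete, here is a minimal Win32 sketch of that path (file name and chunk size are placeholders): FILE_FLAG_NO_BUFFERING plus an asynchronous OVERLAPPED read skips the file cache, which is faster for bulk streaming but forces sector alignment and pushes caching back onto the application, so it isn't always the right choice.

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // Unbuffered + asynchronous: skip the file cache entirely and let the
    // read complete in the background while the CPU does other work.
    HANDLE file = CreateFileW(L"asset.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE)
        return 1;

    // With NO_BUFFERING the offset, size and buffer must all be sector-aligned;
    // VirtualAlloc returns page-aligned memory, which satisfies that.
    const DWORD chunk = 1 << 20;                 // 1 MiB
    void* buffer = VirtualAlloc(nullptr, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    OVERLAPPED ov = {};
    ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);

    // Returns immediately with ERROR_IO_PENDING; completion is signalled later.
    if (!ReadFile(file, buffer, chunk, nullptr, &ov) && GetLastError() != ERROR_IO_PENDING)
        return 1;

    // ... the game/app can do useful work here instead of blocking ...

    DWORD bytesRead = 0;
    GetOverlappedResult(file, &ov, &bytesRead, TRUE);    // wait for the read to finish
    printf("read %lu bytes without touching the file cache\n", bytesRead);

    CloseHandle(ov.hEvent);
    VirtualFree(buffer, 0, MEM_RELEASE);
    CloseHandle(file);
    return 0;
}
```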
 
Just following along on the convo; I can't really add anything. But @DSoup, you asked what DirectStorage was and didn't get an answer:

This became the foundation for the Xbox Velocity Architecture, which comprises our custom-designed NVME SSD, a custom dedicated hardware decompression block, our new DirectStorage API which provides developers with direct low-level access to the NVME controller,
the bearded guy in Xbox videos.
https://www.windowscentral.com/xbox-series-x-what-do-game-devs-think
 
I'm just going to ask the same question again: What is your explanation for why huge leaps in SSD NAND performance, controller improvements, and PCIe 4.0 improvements fail to materialize as meaningful increases in actual performance? How do you explain the disparity in I/O throughput using the same hardware running Windows versus Linux? Did you watch the LTT video @Davros linked in this post, and LTT's apology to Tim Sweeney?

There is an issue with Windows I/O. Perhaps it's not quite as bad in the kernel stack as I perceived, but fast driver I/O isn't always suitable; likewise, you can go faster with unbuffered data, but that's also not always a desirable choice. Benchmarks are only as good as their ability to mirror real-world scenarios: not just reading data off an SSD, but unpacking and decompressing it, putting some in RAM, sending some to the GPU. Once all parts of the system start working with the data, the Windows I/O stack becomes a lot busier than when just reading data.

There is probably inefficiency outside the Microsoft kernel/driver/IO-stack domain as well. Even a simple thing like decompression is often implemented naively on a single core, taking forever, versus using something that would parallelize across all cores. Decompression is one of those things that in reality can be embarrassingly parallel, but when it's implemented by people who don't know that, it becomes a single-threaded load that takes forever.
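As a sketch of that difference, assuming a package format where every chunk was compressed independently (plain zlib stands in here for whatever codec a game actually uses), the serial path is the single-core "takes forever" case and the parallel one simply hands each chunk to its own task:

```cpp
#include <functional>
#include <future>
#include <vector>
#include <zlib.h>   // assumed available; any chunked codec works the same way

// One independently compressed chunk of a larger asset/package file.
struct Chunk {
    std::vector<unsigned char> compressed;
    uLongf uncompressed_size;    // stored alongside the chunk in the package
};

// Decompress one chunk on its own; no shared state, so chunks are independent.
std::vector<unsigned char> decompress_chunk(const Chunk& c) {
    std::vector<unsigned char> out(c.uncompressed_size);
    uLongf out_size = c.uncompressed_size;
    uncompress(out.data(), &out_size, c.compressed.data(),
               static_cast<uLong>(c.compressed.size()));
    out.resize(out_size);
    return out;
}

// Naive version: one core, chunks processed back to back.
std::vector<std::vector<unsigned char>> decompress_serial(const std::vector<Chunk>& chunks) {
    std::vector<std::vector<unsigned char>> result;
    for (const Chunk& c : chunks) result.push_back(decompress_chunk(c));
    return result;
}

// Parallel version: because each chunk was compressed independently, every
// core can work on its own chunk - the embarrassingly parallel case.
std::vector<std::vector<unsigned char>> decompress_parallel(const std::vector<Chunk>& chunks) {
    std::vector<std::future<std::vector<unsigned char>>> jobs;
    for (const Chunk& c : chunks)
        jobs.push_back(std::async(std::launch::async, decompress_chunk, std::cref(c)));
    std::vector<std::vector<unsigned char>> result;
    for (auto& j : jobs) result.push_back(j.get());
    return result;
}
```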

-------------------------------------

I for one will wait for DirectStorage to appear before upgrading my old PC. I'm very curious to see whether DirectStorage is a pure SW API or whether there is the possibility of HW improvements that make decompression/IO run without consuming the CPU. Both Sony and Microsoft mentioned how many Zen cores decompression can consume, and those are real numbers. I would rather have a HW solution do that than have to buy a 16-core CPU just so a game doesn't hitch due to decompression. Another reason I like to wait on my PC upgrade is that I would rather update it once there is content that the HW can use. For now my age-old lowest-common-denominator PC is just fine.
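For reference, a minimal sketch of a single read through the public DirectStorage for Windows SDK, pieced together from Microsoft's published samples; the helper name and file name are made up, the D3D12 device and destination buffer are assumed to exist already, and exact struct fields may differ between SDK versions:

```cpp
#include <windows.h>
#include <d3d12.h>
#include <dstorage.h>      // DirectStorage for Windows SDK
#include <wrl/client.h>
#include <cstdint>
using Microsoft::WRL::ComPtr;

// Sketch: enqueue a request that pulls `size` bytes from a file straight into
// an existing D3D12 buffer, then signal a fence when it completes.
// Error handling is omitted for brevity.
HRESULT LoadAssetWithDirectStorage(ID3D12Device* device,
                                   ID3D12Resource* destBuffer,
                                   const wchar_t* path,
                                   uint32_t size)
{
    ComPtr<IDStorageFactory> factory;
    DStorageGetFactory(IID_PPV_ARGS(&factory));

    ComPtr<IDStorageFile> file;
    factory->OpenFile(path, IID_PPV_ARGS(&file));

    // One queue per priority/source type; requests are batched and submitted together.
    DSTORAGE_QUEUE_DESC queueDesc = {};
    queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
    queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
    queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
    queueDesc.Device     = device;

    ComPtr<IDStorageQueue> queue;
    factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

    // Describe where the bytes come from and where they land.
    DSTORAGE_REQUEST request = {};
    request.Options.SourceType      = DSTORAGE_REQUEST_SOURCE_FILE;
    request.Options.DestinationType = DSTORAGE_REQUEST_DESTINATION_BUFFER;
    request.Source.File.Source      = file.Get();
    request.Source.File.Offset      = 0;
    request.Source.File.Size        = size;
    request.UncompressedSize        = size;          // no compression in this sketch
    request.Destination.Buffer.Resource = destBuffer;
    request.Destination.Buffer.Offset   = 0;
    request.Destination.Buffer.Size     = size;
    queue->EnqueueRequest(&request);

    // Fence signalled by the DirectStorage runtime once the data is in the buffer.
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    HANDLE done = CreateEventW(nullptr, FALSE, FALSE, nullptr);
    fence->SetEventOnCompletion(1, done);
    queue->EnqueueSignal(fence.Get(), 1);

    queue->Submit();                                 // kick off the whole batch at once
    WaitForSingleObject(done, INFINITE);             // real code would overlap this instead
    CloseHandle(done);
    return S_OK;
}
```

The point of the design is less the single request and more that many small requests are batched onto the queue and submitted together, with completion signalled by a fence instead of the CPU blocking on each read.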

The beauty of the PC is awesome HW. The ugliness of the PC is the lowest common denominator. Consoles are great at pushing the bar higher and providing a giant platform that pushes games further. We have had SSDs in PCs for such a long time and barely any games use the SSD well. That's about to change in a big way, including a whole new API to take advantage of the SSD.
 
The beauty of the PC is awesome HW. The ugliness of the PC is the lowest common denominator. Consoles are great at pushing the bar higher and providing a giant platform that pushes games further. We have had SSDs in PCs for such a long time and barely any games use the SSD well. That's about to change in a big way, including a whole new API to take advantage of the SSD.

Exactly; the disadvantage for consoles is also that you're basically stuck with that same hardware for over 7 years. You can't have both on one platform. Like ray tracing, the PC got that 2-3 years before the consoles, same for higher settings and faster loading. One thing that has improved quite a lot is scaling; high-end games run on the Switch all the way up to 2080 Ti-level hardware and everything in between. Yet the games didn't suffer that much; they're up there with the best-looking games this gen (Doom, Wolfenstein, etc.).
With consoles offering cross-gen games and the same architecture this gen, scaling becomes all the more important.
 
I'm just going to ask the same question again: What is your explanation for why huge leaps in SSD NAND performance, controller improvements, and PCIe 4.0 improvements fail to materialize as meaningful increases in actual performance? How do you explain the disparity in I/O throughput using the same hardware running Windows versus Linux? Did you watch the LTT video @Davros linked in this post, and LTT's apology to Tim Sweeney?

There is an issue with Windows I/O. Perhaps it's not quite as bad in the kernel stack as I perceived, but fast driver I/O isn't always suitable; likewise, you can go faster with unbuffered data, but that's also not always a desirable choice. Benchmarks are only as good as their ability to mirror real-world scenarios: not just reading data off an SSD, but unpacking and decompressing it, putting some in RAM, sending some to the GPU. Once all parts of the system start working with the data, the Windows I/O stack becomes a lot busier than when just reading data.
The actual video WRT the server issue is quite fascinating. I don't think it's applicable to what's happening at the console level.

 
The actual video WRT the server issue is quite fascinating. I don't think it's applicable to what's happening at the console level.

Indeed, this is about the Windows issue. There is a general theme in LTT videos when referring to drive or I/O performance: you get repeated comments like "and that's on Windows!" because their experience tells them Windows I/O delivers anywhere between 20-25% of what the hardware delivers on other *IX operating systems, particularly their preferred Linux build. The crazy 29 GB/sec drive they test in the video @Davros posted peaks, about once, just above 5 GB/sec in Windows. In Linux, a solid 29 GB/sec.

I will check out the video though! They're as obsessive about I/O as I am, so maybe a new YT sub for me! :yes:
 
Indeed, this is about the Windows issue. There is a general theme in LTT videos when referring to drive or I/O performance: you get repeated comments like "and that's on Windows!" because their experience tells them Windows I/O delivers anywhere between 20-25% of what the hardware delivers on other *IX operating systems, particularly their preferred Linux build. The crazy 29 GB/sec drive they test in the video @Davros posted peaks, about once, just above 5 GB/sec in Windows. In Linux, a solid 29 GB/sec.

I will check out the video though! They're as obsessive about I/O as I am, so maybe a new YT sub for me! :yes:
Yeah, well, the same issue shows up on Linux as well. Seems to be a CPU issue lol; compression didn't help. I think a lot of things would have to be redone for his server build to work. They should have used something like Kraken ;) and had a smaller CPU footprint with that decompression. Memory was blowing up, etc.

I'm not familiar with the one they used, so perhaps I'm out to lunch here. Lots of little bottlenecks he was running into. I think a whole solution would need to be architected for this specific use case.
 
I don't know what it is about LTT, but the videos all sound too whiny, and I've seen them on other topics, both years ago and in this current one. Maybe I just find his voice too irritating to be able to listen to the message.
 
Yeah, well, the same issue shows up on Linux as well. Seems to be a CPU issue lol; compression didn't help. I think a lot of things would have to be redone for his server build to work.
So yes, the drives were outperforming the CPU's ability to respond to CPU-driven I/O management. PS5 (and presumably XSX) sidestep this by having the I/O decoupled from the CPU, which is a double win: the CPU is not only freed from slower I/O, but crazy-fast I/O no longer consumes significant CPU time. I guess this is why Mark Cerny said a few times that the design "saves X amount of Zen cores" for certain PS5 hardware functions. But CPU-driven I/O is how Windows does things. And the more you load on, the more load you're putting on the CPU and the less CPU time you have left.
 
So yes, the drives were outperforming the CPU's ability to respond to CPU-driven I/O management. PS5 (and presumably XSX) sidestep this by having the I/O decoupled from the CPU, which is a double win: the CPU is not only freed from slower I/O, but crazy-fast I/O no longer consumes significant CPU time. I guess this is why Mark Cerny said a few times that the design "saves X amount of Zen cores" for certain PS5 hardware functions. But CPU-driven I/O is how Windows does things. And the more you load on, the more load you're putting on the CPU and the less CPU time you have left.
He mentioned it was fine if all the CPU was doing was waiting around for I/O requests. The issue is that when the CPU works on everything, it will switch away from the NVMe thread and won't be able to come back in time before the NVMe request times out. So if you're working on compression on the CPU, it will leave the NVMe thread to work on the compression and not make it back in time. That's my understanding.

Sounds like a signalling issue that requires resolution.
 
I don't know what it is about LTT, but the videos all sound too whiny, and I've seen them on other topics, both years ago and in this current one.

It's definitely a first-world problem when your 29 GB/sec SSD only delivers 5 GB/sec in one OS. :runaway:

He mentioned it was fine if all the CPU was doing was waiting around for I/O requests. The issue is that when the CPU works on everything, it will switch away from the NVMe thread and won't be able to come back in time before the NVMe request times out. So if you're working on compression on the CPU, it will leave the NVMe thread to work on the compression and not make it back in time. That's my understanding.
It depends on what the CPU is doing, and what other I/O the CPU is managing.

Moving away from the server issue, because that's fairly niche, and returning to the video @Davros posted, which is a scenario more akin to what we're talking about here - realising SSD/controller read performance as real-world PC performance - we have a 16-channel, four-SSD PCIe 4.0 card that can sustain 29 GB/sec on Linux but couldn't sustain 5 GB/sec in Windows. That's a bottleneck. It's probably not a single issue but a combination of related issues hampering I/O. The million-dollar question is: can Microsoft remedy this without breaking everything? Linux is cool if you want to tune the I/O scheduling subsystem, or just change it altogether (interrupt to real-time polling). Windows has a legacy of, umm... well, software legacy. :yep2:
 