This became the foundation for the Xbox Velocity Architecture, which comprises our custom-designed NVME SSD, a custom dedicated hardware decompression block, our new DirectStorage API which provides developers with direct low-level access to the NVME controller
the bearded guy in Xbox videos.
https://www.windowscentral.com/xbox-series-x-what-do-game-devs-think
'Direct access to the NVMe controller' is sure an interesting point. On Windows, this could be implemented with a new NVMe storage port driver designed around the NVMe command interface and controller hints for optimal I/O block size - instead of the StorPort port driver / StorNVMe miniport driver model, which is based on a generalization of the SCSI command set.
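For what it's worth, the current generalized stack already exposes some NVMe-specific plumbing: user-mode code can pull the raw NVMe Identify Controller data through StorNVMe with a protocol-specific property query. A rough sketch (it assumes the NVMe disk is PhysicalDrive0 and the process is elevated; error handling is trimmed):

```cpp
#include <windows.h>
#include <winioctl.h>
#include <nvme.h>      // NVME_IDENTIFY_CNS_CONTROLLER, NVME_IDENTIFY_CONTROLLER_DATA
#include <cstdio>
#include <vector>

int main()
{
    // Assumes the NVMe disk is PhysicalDrive0 and the process is elevated.
    HANDLE disk = CreateFileW(L"\\\\.\\PhysicalDrive0", GENERIC_READ | GENERIC_WRITE,
                              FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                              OPEN_EXISTING, 0, nullptr);
    if (disk == INVALID_HANDLE_VALUE) return 1;

    // Buffer layout: STORAGE_PROPERTY_QUERY header, then the protocol-specific
    // descriptor, then 4 KiB for the NVMe Identify Controller payload.
    const DWORD payload = 4096;
    std::vector<BYTE> buf(FIELD_OFFSET(STORAGE_PROPERTY_QUERY, AdditionalParameters)
                          + sizeof(STORAGE_PROTOCOL_SPECIFIC_DATA) + payload);

    auto* query = reinterpret_cast<STORAGE_PROPERTY_QUERY*>(buf.data());
    query->PropertyId = StorageAdapterProtocolSpecificProperty;
    query->QueryType  = PropertyStandardQuery;

    auto* protocolData =
        reinterpret_cast<STORAGE_PROTOCOL_SPECIFIC_DATA*>(query->AdditionalParameters);
    protocolData->ProtocolType             = ProtocolTypeNvme;
    protocolData->DataType                 = NVMeDataTypeIdentify;
    protocolData->ProtocolDataRequestValue = NVME_IDENTIFY_CNS_CONTROLLER;
    protocolData->ProtocolDataOffset       = sizeof(STORAGE_PROTOCOL_SPECIFIC_DATA);
    protocolData->ProtocolDataLength       = payload;

    DWORD returned = 0;
    if (!DeviceIoControl(disk, IOCTL_STORAGE_QUERY_PROPERTY,
                         buf.data(), static_cast<DWORD>(buf.size()),
                         buf.data(), static_cast<DWORD>(buf.size()),
                         &returned, nullptr)) {
        std::printf("query failed: %lu\n", GetLastError());
        CloseHandle(disk);
        return 1;
    }

    // The identify payload sits after the returned protocol-specific descriptor.
    auto* descriptor = reinterpret_cast<STORAGE_PROTOCOL_DATA_DESCRIPTOR*>(buf.data());
    auto* identify = reinterpret_cast<NVME_IDENTIFY_CONTROLLER_DATA*>(
        reinterpret_cast<BYTE*>(&descriptor->ProtocolSpecificData)
        + descriptor->ProtocolSpecificData.ProtocolDataOffset);

    std::printf("model: %.40s\n", reinterpret_cast<const char*>(identify->MN));
    CloseHandle(disk);
    return 0;
}
```

A dedicated NVMe port driver could presumably make this kind of information - optimal block sizes, queue counts, and so on - a first-class input to the I/O path instead of a side-channel query.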
I still think the DirectStorage API would be a user-mode layer designed to issue large file I/O requests with deeper queues, which should be far more efficient on NVMe storage. It would still be based on the Windows I/O Manager driver stack, the virtual Memory Manager and the file Cache Manager, as well as existing Installable File System drivers and filters.
This way they can tweak the I/O subsystem to reliably support large block sizes and use new or updated internal structures to reflect NVMe control flow, while also remaining compatible with the StorPort driver model for legacy SATA devices. They could also intercept ReadFile/WriteFile requests from legacy applications and rearrange them into similar deep-queue, large-block transactions when the new storage drivers are installed.
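Just to illustrate the deep-queue, large-block pattern I mean, here is a rough user-mode sketch using plain Win32 overlapped reads - the file name, block size and queue depth are made-up values, and error handling plus the unaligned tail of the file are skipped:

```cpp
#include <windows.h>
#include <vector>

int main()
{
    const DWORD blockSize  = 1 << 20;  // 1 MiB per request (illustrative)
    const int   queueDepth = 32;       // requests kept in flight (illustrative)

    // FILE_FLAG_NO_BUFFERING bypasses the file cache; FILE_FLAG_OVERLAPPED lets
    // many requests stay outstanding on one handle.
    HANDLE file = CreateFileW(L"assets.pak", GENERIC_READ, FILE_SHARE_READ, nullptr,
                              OPEN_EXISTING,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_OVERLAPPED, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size{};
    GetFileSizeEx(file, &size);
    const ULONGLONG total = static_cast<ULONGLONG>(size.QuadPart);

    std::vector<OVERLAPPED> ov(queueDepth);
    std::vector<void*>      buffers(queueDepth);
    std::vector<bool>       inFlight(queueDepth, false);
    for (int i = 0; i < queueDepth; ++i) {
        // Unbuffered I/O needs sector-aligned buffers; VirtualAlloc gives page alignment.
        buffers[i] = VirtualAlloc(nullptr, blockSize, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        ov[i].hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);
    }

    // Prime the queue: issue up to queueDepth reads before waiting on any of them.
    ULONGLONG offset = 0;
    for (int i = 0; i < queueDepth && offset < total; ++i) {
        ov[i].Offset     = static_cast<DWORD>(offset & 0xFFFFFFFFull);
        ov[i].OffsetHigh = static_cast<DWORD>(offset >> 32);
        ReadFile(file, buffers[i], blockSize, nullptr, &ov[i]);  // ERROR_IO_PENDING expected
        inFlight[i] = true;
        offset += blockSize;
    }

    // Completion loop: wait on each slot in round-robin order and immediately
    // reissue it at the next offset, so the drive always sees a deep queue
    // instead of one request at a time.
    bool anyInFlight = true;
    while (anyInFlight) {
        anyInFlight = false;
        for (int i = 0; i < queueDepth; ++i) {
            if (!inFlight[i]) continue;
            DWORD bytes = 0;
            GetOverlappedResult(file, &ov[i], &bytes, TRUE);  // block until this slot completes
            // ...hand buffers[i] (bytes valid) to the decompression/streaming stage here...
            if (offset < total) {
                ov[i].Offset     = static_cast<DWORD>(offset & 0xFFFFFFFFull);
                ov[i].OffsetHigh = static_cast<DWORD>(offset >> 32);
                ReadFile(file, buffers[i], blockSize, nullptr, &ov[i]);
                offset += blockSize;
                anyInFlight = true;
            } else {
                inFlight[i] = false;
            }
        }
    }

    // Buffer and event cleanup omitted for brevity.
    CloseHandle(file);
    return 0;
}
```

Something like DirectStorage would presumably wrap this kind of pattern (plus the decompression hand-off) behind a cleaner queue-based API instead of every engine hand-rolling it.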
I'm just going to ask the same question again: What is your explanation for why huge leaps in SSD NAND performance, controller improvements, and PCIe 4.0 improvements fail to materialize as meaningful increases in actual performance?
It's because applications are not designed to efficiently utilize this enormous bandwidth. Did you really expect to get a different answer for the same question?
Decompression is one of those things that in reality can be stupidly parallel, but it's often implemented by people who don't know that, and then it becomes a single-threaded load that takes forever.
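To make that concrete, here's a minimal sketch of chunked parallel decompression, assuming the archive stores independently compressed blocks with known output sizes; decompress_chunk is a placeholder for whatever codec is actually used:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Stand-in for a real codec call (zlib, LZ4, zstd, ...): here it just copies,
// a real implementation would decode src into dst.
void decompress_chunk(const uint8_t* src, size_t srcLen, uint8_t* dst, size_t dstLen)
{
    std::memcpy(dst, src, srcLen < dstLen ? srcLen : dstLen);
}

struct Chunk {
    const uint8_t* src;     // start of this independently compressed block
    size_t         srcLen;
    uint8_t*       dst;     // preallocated output region for this block
    size_t         dstLen;
};

// If every block was compressed independently and its output offset is known up
// front, each one can be decoded on its own core - the work really is
// "stupidly parallel".
void decompress_parallel(const std::vector<Chunk>& chunks)
{
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    std::atomic<size_t> next{0};

    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            // Each worker grabs the next unclaimed chunk until none remain.
            for (size_t i = next++; i < chunks.size(); i = next++) {
                const Chunk& c = chunks[i];
                decompress_chunk(c.src, c.srcLen, c.dst, c.dstLen);
            }
        });
    }
    for (auto& t : pool) t.join();
}
```

The catch is the format: a single monolithic compressed stream forces the serial path complained about above, while independently compressed blocks make the fan-out trivial.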
It's not just decompression or other processing overhead; it's also the overall program flow and the data set.
Imagine you have a 1970s-era computer system with a tape archive application that reads 80-character lines from punch cards and writes them to text files on magnetic tape, and a TTY application that sends text files over a 300 bit/s modem line.
If you port these applications and OS interfaces to a modern computer with SATA disks and Gigabit Ethernet and run them on the same set of text data from the 1970s - do you really expect to max out network and disk bandwidth?
Processing would only take a fraction of a second on modern hardware, so your theoretical bandwidth is hundreds of megabytes per second. Unfortunately, you only have several hundred kilobytes of text to transfer and then your program stops - so your real-life bandwidth is even less than a megabyte per second.
That's the difference between maxing out at 3 GBytes/s in synthetic disk benchmarks and averaging 30 MBytes/s in real-world applications.
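The arithmetic behind that gap is simple enough to sketch; with made-up but plausible numbers (64 KB requests, ~2 ms of application work per request, a 3 GB/s drive), the per-request overhead dominates:

```cpp
#include <cstdio>

int main()
{
    // Illustrative numbers, not measurements.
    const double deviceBandwidth = 3.0e9;    // 3 GB/s sequential device throughput
    const double requestSize     = 64.0e3;   // 64 KB per read, a typical small-block pattern
    const double appOverhead     = 2.0e-3;   // ~2 ms of per-request parsing/allocation/etc.

    const double transferTime = requestSize / deviceBandwidth;        // ~21 microseconds
    const double effective    = requestSize / (appOverhead + transferTime);

    // Prints roughly 32 MB/s: the drive sits idle most of the time, so the
    // benchmark number never shows up in the application.
    std::printf("effective throughput: %.1f MB/s\n", effective / 1.0e6);
    return 0;
}
```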
if there is a possibility for some HW improvements to happen to make decompression/IO run without consuming the CPU
It should be possible to plug hardware processing into a filesystem minifilter driver.
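As a very rough sketch of that idea - DecompressWithHardware is a made-up hook for the decompression engine, and all the real work (identifying compressed files, buffer sizing, DMA) is waved away in comments - a minifilter would register a post-read callback with the Filter Manager roughly like this:

```cpp
#include <fltKernel.h>

// Made-up hook into the hardware decompression engine; a real driver would reach
// the decompression block through its own device interface / DMA path.
NTSTATUS DecompressWithHardware(PVOID Compressed, ULONG CompressedLength,
                                PVOID Destination, ULONG DestinationLength);

static PFLT_FILTER gFilterHandle = nullptr;

// Runs after the storage stack completes a read: the buffer now holds the raw
// (still compressed) file data, which is where a hardware decode could hook in.
static FLT_POSTOP_CALLBACK_STATUS FLTAPI
PostReadCallback(PFLT_CALLBACK_DATA Data, PCFLT_RELATED_OBJECTS FltObjects,
                 PVOID CompletionContext, FLT_POST_OPERATION_FLAGS Flags)
{
    UNREFERENCED_PARAMETER(FltObjects);
    UNREFERENCED_PARAMETER(CompletionContext);
    UNREFERENCED_PARAMETER(Flags);

    if (NT_SUCCESS(Data->IoStatus.Status)) {
        // Sketch only: recognize compressed files (e.g. via a per-stream context),
        // hand Data->Iopb->Parameters.Read.ReadBuffer to DecompressWithHardware,
        // and fix up Data->IoStatus.Information with the decompressed length.
    }
    return FLT_POSTOP_FINISHED_PROCESSING;
}

static NTSTATUS FLTAPI
FilterUnload(FLT_FILTER_UNLOAD_FLAGS Flags)
{
    UNREFERENCED_PARAMETER(Flags);
    FltUnregisterFilter(gFilterHandle);
    return STATUS_SUCCESS;
}

static const FLT_OPERATION_REGISTRATION Callbacks[] = {
    { IRP_MJ_READ, 0, nullptr, PostReadCallback },
    { IRP_MJ_OPERATION_END }
};

static const FLT_REGISTRATION FilterRegistration = {
    sizeof(FLT_REGISTRATION),   // Size
    FLT_REGISTRATION_VERSION,   // Version
    0,                          // Flags
    nullptr,                    // ContextRegistration
    Callbacks,                  // OperationRegistration
    FilterUnload,               // FilterUnloadCallback
    // remaining optional callbacks left as null
};

extern "C" NTSTATUS
DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    UNREFERENCED_PARAMETER(RegistryPath);

    // Register with the Filter Manager, then start receiving I/O callbacks.
    NTSTATUS status = FltRegisterFilter(DriverObject, &FilterRegistration, &gFilterHandle);
    if (NT_SUCCESS(status)) {
        status = FltStartFiltering(gFilterHandle);
        if (!NT_SUCCESS(status)) {
            FltUnregisterFilter(gFilterHandle);
        }
    }
    return status;
}
```

In practice the interesting part is everything the comment waves away - deciding which files are compressed, where the decompressed data lands, and how the hardware engine is scheduled - but the minifilter layer is a plausible place to hang it without touching applications.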