Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

Discussion in 'PC Hardware, Software and Displays' started by DavidGraham, May 18, 2020.

  1. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    967
    Likes Received:
    1,223
    Location:
    55°38′33″ N, 37°28′37″ E
    DirectStorage for Windows is going to be introduced at Game Stack Live (April 20-21, 2021).
    https://developer.microsoft.com/en-us/games/events/game-stack-live/

     
    Kej, Silent_Buddha, manux and 5 others like this.
  2. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,236
    Likes Received:
    4,259
    Location:
    Guess...
    Awesome, I was just thinking about this last night, wondering if it was going to turn out to be one of those much-talked-about techs that never see the light of day. This is definitely a date for my diary.
     
    PSman1700 and Remij like this.
  3. Remij

    Regular

    Joined:
    May 3, 2008
    Messages:
    678
    Likes Received:
    1,258
    Yes! I can't wait to find out more about this. Hopefully we get details on the requirements and how DirectStorage incorporates RTX I/O and AMD's support.
     
    PSman1700 and pjbliverpool like this.
  4. Dampf

    Regular

    Joined:
    Nov 21, 2020
    Messages:
    284
    Likes Received:
    474
    Well, a Microsoft engineer on the dev Discord said DirectStorage works in conjunction with Sampler Feedback Streaming.

    So chances are you just need a DX12U GPU and a regular NVMe SSD. But I too wonder if it's really that easy... April can't come soon enough.
     
    Remij likes this.
  5. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,679
    I'm wondering if it's going to require resizable BAR support, or something like that.
     
  6. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
    Microsoft will probably provide an API. From the API point of view it's likely easy to adopt, but it will require rewriting the engine to support streaming optimally. Behind the API is a layer of hardware that could require specific implementations in both the driver and the hardware to be optimal. "Works" isn't necessarily the same as "optimal".

    The worst case is engines like GTA V's, which use a single thread for a very long loading period. That wouldn't magically become faster, as there is an insane CPU bottleneck that would need to be worked around. Similarly, other games/engines could do CPU-side processing of loaded data that would need to be changed for optimal performance.
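    The point about serial loaders can be sketched in Python: the I/O layer isn't the problem, the single-threaded CPU-side processing is, and only restructuring the engine (here, a thread pool; the workload and names are made up for illustration) removes that bottleneck.

    ```python
    import concurrent.futures

    def process_asset(blob):
        # Stand-in for CPU-side work done on loaded data
        # (parsing, relocation, etc.); here it just sums the bytes.
        return sum(blob)

    blobs = [bytes(range(256)) * 1000 for _ in range(8)]

    # Serial "GTA V style" loop: one thread touches every asset.
    serial = [process_asset(b) for b in blobs]

    # The same work spread over a pool: the I/O API didn't change,
    # but the engine had to be restructured to issue work in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        parallel = list(pool.map(process_asset, blobs))

    assert serial == parallel
    ```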
     
  7. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    967
    Likes Received:
    1,223
    Location:
    55°38′33″ N, 37°28′37″ E
    They are both important parts of what they call the Xbox Velocity Architecture (which also includes hardware LZ-family decompression).
    On Xbox Series X (and DirectX 12 Ultimate GPUs), Sampler Feedback augments Tiled Resources to help determine which missing tiles and MIP levels are to be streamed into video memory, while DirectStorage would help improve loading times on NVMe disks.

    https://devblogs.microsoft.com/dire...edback-some-useful-once-hidden-data-unlocked/

    https://news.xbox.com/en-us/2020/07/14/a-closer-look-at-xbox-velocity-architecture/

    https://devblogs.microsoft.com/directx/directx-12-ultimate-for-holiday-2020/
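    As a rough illustration of the Sampler Feedback + Tiled Resources idea, here is a hypothetical Python sketch (tile IDs, mip numbers, and the residency map are all invented, not from any actual API) of deciding which tiles to stream based on per-tile feedback:

    ```python
    # Hypothetical residency update: sampler feedback tells us, per tile,
    # the most detailed mip the shader actually sampled. Tiles whose
    # requested mip is finer than what is resident get queued for streaming.

    resident_mip = {  # tile id -> finest mip currently in VRAM
        (0, 0): 3, (0, 1): 2, (1, 0): 5, (1, 1): 0,
    }
    feedback = {      # tile id -> finest mip requested this frame
        (0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 4,
    }

    to_stream = [
        (tile, mip) for tile, mip in feedback.items()
        if mip < resident_mip[tile]   # lower index = more detail
    ]
    # Tiles (0, 0) and (1, 0) need finer mips; (1, 1) already holds more
    # detail than requested, so nothing is loaded for it.
    ```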
     
    #487 DmitryKo, Feb 26, 2021
    Last edited: Feb 26, 2021
    iroboto, Kej, Jawed and 2 others like this.
  8. Dampf

    Regular

    Joined:
    Nov 21, 2020
    Messages:
    284
    Likes Received:
    474
    Well, hopefully not. That would exclude too many users from DirectStorage. And frankly, I don't see a reason why it should depend on rBAR.
     
    PSman1700, Rootax and pjbliverpool like this.
  9. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    967
    Likes Received:
    1,223
    Location:
    55°38′33″ N, 37°28′37″ E
    I looked through the pre-recorded video and the session slides, and it seems like the design is still in a preliminary stage, though this is expected given the ambitious goals.


    First of all, DirectStorage is indeed a user-mode layer on top of the existing I/O stack, primarily designed to issue multiple parallel I/O requests. To make it work, they are redesigning the actual Windows I/O Manager subsystem described above to handle batch processing of I/O packets - so for example you would schedule loading (paging) of a thousand new MIP textures (as hinted by Tiled Resource/Sampler Feedback shaders) and only track the status of the entire I/O batch, not each individual I/O packet.
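    A rough analogy for that batched model, sketched in Python with a thread pool standing in for the I/O stack (nothing here is the actual DirectStorage API): enqueue the whole batch, then wait on a single completion for all of it rather than polling each request.

    ```python
    import concurrent.futures

    def read_mip(tile_id):
        # Stand-in for one paging request; returns fake texture bytes.
        return tile_id, b"\x00" * 64 * 1024  # one 64KB tile

    # Enqueue the whole batch, then track ONE completion object for all
    # thousand requests instead of tracking each packet individually.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(read_mip, t) for t in range(1000)]
        done, not_done = concurrent.futures.wait(futures)

    assert len(done) == 1000 and not not_done
    ```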

    It wasn't made clear whether this involves a redesign of the StorPort/StorNVMe driver or the use of block size / alignment / granularity hints from NVMe 1.3.

    He did mention they can bypass the filesystem driver and volume manager, but only in passing. I guess this would introduce another 'fast path' which could involve continuous pre-allocation of clusters/sectors on file write operations, similar to what CompactOS file compression is doing, to enable reading back with large continuous I/O batches and without the need to track complex LBA sector chains in the filesystem driver and Cache Manager.


    Second, the data from the NVMe drive is initially loaded into system memory, then moved to GPU video memory (though without any CPU processing). So it's not using peer-to-peer DMA this time, and no specific requirements for the NVMe / PCIe drive were raised.

    Texture compression is assumed to be a DEFLATE (i.e. LZ77/LZSS+Huffman, the ZIP format) pass over standard BCn (i.e. S3TC/DXTC). DEFLATE decompression is performed in GPU memory by either compute shaders or dedicated hardware blocks. They are still working on the compressor / decompressor toolset, so I guess this will be different from (and improved over) the Xbox BCPACK toolset. They said it can keep up with typical NVMe bandwidth, which didn't sound like multi-GB/s performance to me.
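    Python's zlib implements this same DEFLATE scheme, so a toy round-trip shows what a lossless pass over already block-compressed data looks like (the "BCn" payload below is fabricated filler, chosen only to have some redundancy):

    ```python
    import zlib

    # Fake "BCn" payload: block-compressed texture data is already fixed
    # rate, but still has redundancy a DEFLATE pass can squeeze out.
    bc_blocks = (b"\x49\x92\x24\x49\x92\x24" + bytes(10)) * 4096

    packed = zlib.compress(bc_blocks, 9)         # LZ77/LZSS + Huffman
    ratio = len(bc_blocks) / len(packed)

    assert zlib.decompress(packed) == bc_blocks  # lossless round-trip
    assert ratio > 1.0                           # some space was saved
    ```

    Real BCn data is far less regular than this filler, so real-world ratios are much more modest.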

    When specifically asked in the chat about the Nvidia RTX I/O slides (where the data flows directly from the NVMe drive to onboard GPU memory), the presenter said that peer-to-peer transfers could be implemented in future releases, but referred to Nvidia for RTX I/O details.


    Developer preview is expected this Summer, so the tentative release date would be end of 2021.
     
    #489 DmitryKo, Apr 21, 2021
    Last edited: Apr 26, 2021
    Kej, Silent_Buddha, Newguy and 5 others like this.
  10. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,236
    Likes Received:
    4,259
    Location:
    Guess...
    So the most important confirmation from this is that GPU-based decompression is a standard feature of DirectStorage, which essentially solves the CPU decompression bottleneck and will be available on all DX12U-class GPUs (at least). Hopefully this puts to rest any lingering arguments that GPU-based decompression is just an Nvidia marketing ploy and not viable without hardware-based units.

    Interesting also that there is no P2P transfer as part of the standard but not that surprising. So it seems as though that may well be a unique feature of RTX-IO over and above the standard DirectStorage. Although I'm not sure how much benefit that brings given the decompression element is already covered by DirectStorage.
     
    PSman1700 likes this.
  11. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,236
    Likes Received:
    4,259
    Location:
    Guess...
    Just wanted to add, now that I've had a chance to watch the full video, that he specifically says (23:30) that their GPU-based decompressor easily saturates gaming NVMe SSD bandwidths, so I'd take that to mean it can handle the 7GB/s rate of current SSDs without too much trouble. That also aligns with Nvidia's claims for RTX-IO. He uses the specific example earlier in the video of this being used on a 2.5GB/s drive (for obvious reasons) at full rate.

    I found the statements about eventually moving this decompression into dedicated hardware (on PC) particularly interesting though. Presumably this is on the roadmap for GPU vendors as a dedicated hardware unit in future GPU's.

    Also to note, and closely linked to the above, is that this is a new and specific compression/decompression solution, not just using the GPU to decompress existing compression formats. That means (1) developers will have to use this specific compression tech for their games, similar to how PS5 devs use Kraken and XSX devs use BCPACK, and (2) creating a dedicated hardware block for it in the future is more straightforward, as we can rely on all DirectStorage-based games using the same compression format.
     
    PSman1700 and BRiT like this.
  12. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    random thought:
    is it possible to use hardware decoder/encoders that they would use for video codecs to serve a similar purpose?
    ie
    Nvidia RTX I/O routing data from system memory directly to the hardware decode/encode block (driven by NVDEC and CUDA, with NVENC in the reverse direction if necessary), then dumping directly into VRAM for this process? I suspect those hardware accelerators can handle a great many codecs, and I wonder if it's just firmware to support this.
     
  13. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    967
    Likes Received:
    1,223
    Location:
    55°38′33″ N, 37°28′37″ E
    Since he talks about sustaining 2.5 Gbyte/s earlier (at 13:40-15:15), it's safe to assume it's the same speed he talks about at a later point (23:30).

    Then again, there were no specifics about algorithm used and compression ratio achieved, and lossless dictionary-based compression is known to resist attempts at parallel processing implementations, as discussed above. So it made a lot of sense to me when he said that Microsoft works on a 'GPU-friendly' compressor/decompressor, and called it 'a new class of compression tech' (at 22:10-23:30). I'd assume this new toolset would require some fine-tuning before it could saturate PCIe 4.0/5.0 bandwidth figures.
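    One common way to make dictionary-based compression parallel-friendly, which may or may not resemble what Microsoft's 'GPU-friendly' toolset actually does, is to reset the dictionary at fixed chunk boundaries so each chunk decompresses independently. A toy Python sketch using zlib:

    ```python
    import zlib

    data = bytes(range(256)) * 4096  # 1 MiB of sample data
    CHUNK = 64 * 1024

    # Compress each chunk independently: each one carries its own
    # dictionary state, so chunks can be decompressed out of order
    # (or in parallel across GPU thread groups), at some cost in ratio.
    chunks = [zlib.compress(data[i:i + CHUNK])
              for i in range(0, len(data), CHUNK)]

    # Decompress in reverse order to show there is no cross-chunk
    # dependency, then reassemble in the right order.
    restored = [zlib.decompress(c) for c in reversed(chunks)]
    assert b"".join(reversed(restored)) == data
    ```

    The trade-off is ratio: every dictionary reset throws away cross-chunk matches, which is presumably part of what the new toolset has to tune.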
     
    #493 DmitryKo, Apr 21, 2021
    Last edited: Apr 21, 2021
    BRiT likes this.
  14. Rikimaru

    Veteran

    Joined:
    Mar 18, 2015
    Messages:
    1,060
    Likes Received:
    426
    The question now is how much GPU resource 2.5 GB/s data decompression takes.
     
  15. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,236
    Likes Received:
    4,259
    Location:
    Guess...
    I'm not sure I'd draw that conclusion. His specific wording is that "the GPU can support a very consistent, constant maxed IO rate. So let's say you have drive capable of say 2.5GB/s, your GPU is capable of maintaining that".

    I'd assume he's just using the 2.5GB/s example because it's the speed of the XSX SSD which most of this work to date has been based around.

    He later goes on to say "There's an initial prototype [of the GPU decompression algorithm] that we have, it easily saturates gaming SSD bandwidths"

    That to me isn't setting a fairly low limit of 2.5GB/s, but rather saying that there are no SSDs out there that the algorithm can't keep up with. And that would align perfectly with Nvidia's own statement of:

    "GeForce RTX GPUs [I take this to mean anything from an RTX 2060 upwards] are capable of decompression performance beyond the limits of even Gen4 SSDs, offloading dozens of CPU cores’ worth of work to deliver maximum overall system performance for next generation games."

    Here's the video in case anyone wants to view it:



    If you're transferring at 2.5GB/s it probably doesn't matter, as you're likely at a load screen rather than background streaming, which likely wouldn't be transferring at anywhere near that speed (or else you'd have streamed the entire game content of most games in about 20 seconds). Background streaming is likely to be just a small fraction of that transfer rate.
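    The "about 20 seconds" back-of-the-envelope works out like this (the 50 GB install size is my assumption, not a figure from the video):

    ```python
    # Rough streaming-budget arithmetic: at load-screen peak rate,
    # a whole typical install is read in tens of seconds.
    drive_rate_gb_s = 2.5   # sustained sequential read (XSX-class drive)
    game_size_gb = 50       # assumed typical install size

    full_read_s = game_size_gb / drive_rate_gb_s
    assert full_read_s == 20.0

    # Steady-state background streaming only has to keep up with the
    # camera, so it needs a small fraction of that peak rate.
    ```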
     
    pharma likes this.
  16. Remij

    Regular

    Joined:
    May 3, 2008
    Messages:
    678
    Likes Received:
    1,258
    Nvidia stated that with RTX I/O the performance hit was "negligible", and that was assuming full Gen 4 7GB/s saturation, likely due to what pjbliverpool said directly above me. Load screens are when the GPU would actually saturate the bus at full Gen3 and Gen4 speeds; the streaming requirements during gameplay would likely be far less. Even still, I don't think the performance impact is going to be very large. Games will be designed with it in mind, and will perform as well as they can.

    Think about it this way... the performance hit to the GPU is guaranteed to be far less than it would be to the CPU :)

    This will easily hold us over until dedicated decompression blocks can be added in hardware to the GPU in the future.
     
    PSman1700 likes this.
  17. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,401
    Likes Received:
    1,845
    Location:
    France
    For the existing situation, one of the diagrams shows the data going nvme drive => ram => cpu => ram => gpu.

    I thought that data could go cpu => gpu directly now?
     
  18. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
    Why not nvme => gpu?
     
  19. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,018
    Likes Received:
    582
    Location:
    Taiwan
    Of course, it would be nice if it were possible to bypass main memory. But I guess there are just too many compatibility hurdles, and the performance gain is probably not big enough to be worth it.
    Even resizable BAR, which has been in the standard for quite some time and is supposed to be a relatively simple feature, is still treated quite cautiously by the vendors.
     
  20. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Isn't this exactly what AMD did with the Radeon SSG? To my understanding, the GPU communicated directly with the pair of SSDs via a PCIe bridge chip, without the round trip through system memory.
    Whether it would be doable as "universal solution" which would work with every vendor is of course another matter.
     
    #500 Kaotik, Apr 25, 2021
    Last edited: Apr 25, 2021