Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

Discussion in 'PC Hardware, Software and Displays' started by DavidGraham, May 18, 2020.

  1. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    12,568
    Likes Received:
    3,507
Damn, that's sexy. I don't need it, but damn.
     
  2. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,605
    Likes Received:
    2,993
    Location:
    Guess...
    I wonder if DirectStorage and RTXIO are able to scale up that far? 56GB/s is faster than your average RAM throughput!
     
  3. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    12,568
    Likes Received:
    3,507
Will be interesting. I would still rather just see an SSD added to the graphics card itself.

Imagine this card, but attached directly to your graphics card, bypassing everything else.
     
    milk and BRiT like this.
  4. LordVulkan

    Joined:
    Mar 31, 2015
    Messages:
    8
    Likes Received:
    12
    RTX IO detailed in Ampere Whitepaper

    https://www.nvidia.com/content/dam/...pere-GA102-GPU-Architecture-Whitepaper-V1.pdf

    Looks promising.
     
  5. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    904
    Likes Received:
    1,081
    Location:
    55°38′33″ N, 37°28′37″ E
The AORUS Gen4 AIC card goes for $150 at NewEgg, and a 4TB NVMe disk would set you back ~$800.

Initial Phison E18 SSDs seem to be limited to 2TB though, and the Samsung 980 Pro only goes up to 1TB according to the leaked specs.


    56 Gbyte/s is quite possible with dual-channel DDR4-3600 (PC4-28800), and PCIe 4.0 x16 goes up to 32 GByte/s (in each direction).
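For what it's worth, both peak figures check out with the standard formulas (a quick sketch; the function names are mine):

```python
# Peak-bandwidth sanity check for the figures above (decimal GB/s).

def ddr4_bandwidth_gbs(mt_per_s, channels=2, bus_bytes=8):
    # DRAM peak = transfer rate x bus width per channel x channel count
    return mt_per_s * bus_bytes * channels / 1000

def pcie4_bandwidth_gbs(lanes):
    # PCIe 4.0: 16 GT/s per lane with 128b/130b encoding, per direction
    return 16 * lanes * (128 / 130) / 8

print(ddr4_bandwidth_gbs(3600))  # 57.6 - dual-channel DDR4-3600
print(pcie4_bandwidth_gbs(16))   # ~31.5 - a x16 Gen4 link
```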

    But I don't think upcoming DirectStorage games could make use of simultaneous reads/writes on the scale of 30 Gbyte/s, because any GPU will be unable to keep up with decompression at this data rate. Not until the year 2028 - and I would rather spend on a trip to the Los Angeles Summer Olympics than on a $1500 NVMe RAID, a $2000 HEDT platform, and a $2500 Titan video card. :nope:

    It would make no difference if the PCIe Switch was located directly on the add-on card and not in the CPU Root Complex. PCIe is a point-to-point protocol, unlike conventional PCI.
     
    #385 DmitryKo, Sep 16, 2020
    Last edited: Dec 12, 2020
    pjbliverpool and BRiT like this.
  6. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    18,938
    Likes Received:
    21,387
    So, 4 * $800 + $150 = a cool $3350, with tax around $3618. No problem. :lol:
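Spelled out (the ~8% tax rate is inferred from the quoted $3618, not stated in the post):

```python
drives = 4 * 800          # four 4TB NVMe drives at ~$800 each
card = 150                # AORUS Gen4 AIC carrier card
subtotal = drives + card
total = subtotal * 1.08   # assumed ~8% sales tax
print(subtotal, round(total))  # 3350 3618
```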
     
  7. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    12,568
    Likes Received:
    3,507
Doesn't it currently go HDD/SSD to CPU, then CPU to GPU? So if you could just go SSD to GPU, you'd have an easier time, and you'd be able to iterate faster on storage because you wouldn't be reliant on motherboards and CPUs supporting it.
     
  8. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    904
    Likes Received:
    1,081
    Location:
    55°38′33″ N, 37°28′37″ E
PCIe devices use physical point-to-point links - it does not matter if transfers go through the built-in PCIe Switch in the CPU's PCIe Root Complex, or through a dedicated PCIe Switch chip on the add-on board or a multi-function ASIC. Either way the links are physically switched to connect different endpoints.

    Only if you connect the NVMe disks to the GPU memory controller with dedicated PCIe x4 links - however for the SSD to be visible to the host CPU and accessible by the OS disk/file management, such dedicated GPU links would still have to go through a PCIe Switch.

    So if you can simply use the SSD connected to the CPU Root Port to the same effect, then why bother with dedicated GPU links and either proprietary driver code or another radical redesign of the Windows Display Driver Model?

A dedicated 16-lane, 2-port PCIe Switch chip costs extra, and Gen4 switches are not available on the market yet - whereas a built-in PCIe Switch comes with your Zen* processor for free.
     
    #388 DmitryKo, Sep 17, 2020
    Last edited: Sep 17, 2020
    BRiT likes this.
  9. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,605
    Likes Received:
    2,993
    Location:
    Guess...
Agreed with your whole post apart from this. Nvidia are claiming the performance cost is trivial at 7GB/s, so it doesn't sound like 4x that throughput would be out of reach. Pointless, probably. But out of reach?
     
    PSman1700, tinokun and BRiT like this.
  10. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    904
    Likes Received:
    1,081
    Location:
    55°38′33″ N, 37°28′37″ E
    The GA102 whitepaper above talks about memory bandwidth, not decompression performance (and uses the same slide where the SSD goes through the NIC and decompressed data have double the bandwidth).

Suppose RTX IO / DirectStorage could sustain 7 GByte/s reads from the SSD at all times, and their lossless algorithm has an average 2:1 compression ratio (50%) - it does not really follow that decompression takes zero time with no significant delay, or that the assigned GPU cores can always keep up with any dataset to double the bandwidth at any input rate.
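The arithmetic behind the "double the bandwidth" claim, just to make the assumption explicit (a sketch; the helper name is mine) - the doubled output rate only materializes if the decompressor sustains the full input rate:

```python
def effective_output_gbs(ssd_read_gbs, compression_ratio):
    # Decompressed output rate, assuming the decompressor never stalls
    return ssd_read_gbs * compression_ratio

# 7 GB/s raw reads at 2:1 lossless compression -> 14 GB/s of usable data,
# but only in the best case where decompression keeps pace with the SSD.
print(effective_output_gbs(7, 2.0))  # 14.0
```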
     
  11. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,605
    Likes Received:
    2,993
    Location:
    Guess...
    There have been various quotes from Nvidia about how small the performance impact is. Here's one but I've seen at least a couple of others along the same lines:

    https://www.back2gaming.com/guides/nvidia-rtx-io-in-detail/

    "When asked about the performance hit of RTX IO on the GPU itself, an NVIDIA representative responded that RTX IO utilizes only a tiny fraction of the GPU, “probably not measurable”."
     
    PSman1700, pharma and tinokun like this.
  12. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    904
    Likes Received:
    1,081
    Location:
    55°38′33″ N, 37°28′37″ E
7 GByte/s is indeed a fraction of the total video memory bandwidth, no matter whether it's 200 GByte/s in a mid-range card or 1 TByte/s in a high-end card. However compression overhead is another thing - traditional lossless algorithms are based on dictionary coding, which is not easily parallelized even with large blocks. LZ78/LZX/LZW-based algorithms surely can't sustain ~30 GByte/s output even on top-tier HEDT CPUs with 32 or more threads, and GPU implementations have not shown any significant speed-up compared to CPUs.
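To illustrate the serial dependency: here is a toy LZ77-style decoder (a minimal sketch, not any shipping codec). Each back-reference copies bytes the decoder itself just produced, so token N cannot be expanded until tokens 0..N-1 have been - which is exactly what makes dictionary decoding hard to spread across thousands of GPU threads.

```python
def lz77_decode(tokens):
    """Expand (offset, length, literal) tokens into bytes.
    Back-references read from the output produced so far, so the
    loop carries a data dependency from each token to the next."""
    out = bytearray()
    for offset, length, literal in tokens:
        for _ in range(length):
            out.append(out[-offset])  # copy from earlier output
        if literal is not None:
            out.append(literal)
    return bytes(out)

# Three literals, then an overlapping back-reference that repeats them
tokens = [(0, 0, ord("a")), (0, 0, ord("b")), (0, 0, ord("c")),
          (3, 6, ord("d"))]
print(lz77_decode(tokens))  # b'abcabcabcd'
```

Real codecs split the input into independent blocks to regain some parallelism, but within a block the dependency chain remains.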
     
    #392 DmitryKo, Sep 18, 2020
    Last edited: Sep 18, 2020
  13. LordVulkan

    Joined:
    Mar 31, 2015
    Messages:
    8
    Likes Received:
    12
    Did you just ignore the whitepaper?

     
    PSman1700 likes this.
  14. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    904
    Likes Received:
    1,081
    Location:
    55°38′33″ N, 37°28′37″ E
    Did I?
     
  15. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    16,826
    Likes Received:
    4,129
    Let uncle Davros try and straighten everything out
You say "Nvidia can't keep up with the demands of decompression"
Nvidia say "yes we can"
     
  16. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    4,692
    Likes Received:
    2,131
Compression on the GPU is probably faster, more efficient, and more flexible than what's in the consoles.
     
  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,819
    Likes Received:
    3,976
    Location:
    Finland
If it was all that on the GPU, they wouldn't have bothered creating custom hardware blocks to do it and would have just invested in a beefier GPU. And you probably mean decompression.
     
    DmitryKo likes this.
  18. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,605
    Likes Received:
    2,993
    Location:
    Guess...
    More flexible is a given. Faster is already confirmed, at least in relation to XSX, but very likely in relation to PS5 as well.

    Efficiency almost certainly goes to the consoles though in terms of silicon budget and power draw given that fixed function hardware almost always beats similarly performing general purpose hardware in that regard.

    And also the advantage of not having your limited GPU resources pulling double duty.
     
    PSman1700 likes this.
  19. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    18,938
    Likes Received:
    21,387
    GPU decompression won't take much to exceed that of even the PS5, maybe 2TF or so. I posted it before, but on an early version 1.0 of GPU-based decompression they were getting 60-120 GB/s on a PS5. There's still possibilities of improvements, but even at first pass that's 6-12 GB/s per GPU TF used.

    Here's a repost of the info from the Nvidia Ampere thread.

    ----------
RAD Game Tools comes up quite a few times in the Console Tech section; here are some posts referring to some nice write-ups about it. Be sure to read the full Twitter threads about it.

Mostly that you could get 60-120 GB/s of textures decompressed if you used the entire PS5 GPU (10.28 TF). Ampere has nearly that much TF to spare over and above the PS5.

    Naturally, you wouldn't need to use that much, but it gives you an idea on how powerful the GPUs are when it comes to decompression.
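The per-TF figure is simple division (a sketch using only the numbers quoted in this post):

```python
ps5_tflops = 10.28
decode_gbs = (60, 120)  # quoted BC7Prep decode range using the whole PS5 GPU

per_tf = [g / ps5_tflops for g in decode_gbs]  # ~5.8 to ~11.7 GB/s per TF
needed_tf = [7 / rate for rate in per_tf]      # TF to keep up with a 7 GB/s SSD
print([round(x, 1) for x in per_tf])
print([round(x, 2) for x in needed_tf])
```

So on these numbers, saturating a 7 GB/s Gen4 SSD would take on the order of 0.6-1.2 TF of compute.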

    https://forum.beyond3d.com/posts/2134570/

    https://forum.beyond3d.com/posts/2151140/
    https://forum.beyond3d.com/posts/2134405/


    External references --

    http://www.radgametools.com/oodlecompressors.htm
    http://www.radgametools.com/oodletexture.htm
    https://cbloomrants.blogspot.com/






    GPU benchmark info thread unrolled: https://threadreaderapp.com/thread/1274120303249985536

    A few people have asked the last few days, and I hadn't benchmarked it before, so FWIW: BC7Prep GPU decode on PS5 (the only platform it currently ships on) is around 60-120GB/s throughput for large enough jobs (preferably, you want to decode >=256k at a time).

    That's 60-120GB/s output BC7 data written; you also pay ~the same in read BW. MANY caveats here, primarily that peak decode BW if you get the entire GPU to do it is kind of the opposite of the way this is intended to be used.

    These are quite lightweight async compute jobs meant to be running in the background. Also the shaders are very much not final, this is the initial version, there's already several improvements in the pipe. (...so many TODO items, so little time...)

    Also, the GPU is not busy the entire time. There are several sync points in between so utilization isn't awesome (but hey that's why it's async compute - do it along other stuff). This is all likely to improve in the future; we're still at v1.0 of everything. :)
     
  20. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    904
    Likes Received:
    1,081
    Location:
    55°38′33″ N, 37°28′37″ E
No. Nvidia did not specify the algorithm or the maximum bandwidth, but if they're using LZ-family block compression, CUDA-based libraries typically reach a few GByte/s according to academic papers, so 28 GByte/s would be too high even if you consumed the entire GPU, and not just 'a fraction of the GPU'.

The entire idea of DirectStorage is to free the CPU from loading and decompression tasks by streaming the data directly to video memory as fast as possible and using a dedicated hardware chip (on the Xbox) or compute units (on the PC) - which means their block compression algorithm has to be designed for simplicity and low decompression overhead, not for the best possible compression efficiency or processing bandwidth. Not sure why it is so hard to understand.


    Interesting, but BC7Prep is not a general-purpose compression algorithm - it's a more efficient version of BC7 texture compression from 2009, which has to be decoded to the baseline BC7 format before the actual TMUs can consume it.
     
    #400 DmitryKo, Sep 19, 2020
    Last edited: Sep 21, 2020