Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

Discussion in 'Console Technology' started by Shortbread, Sep 18, 2020.

  1. t0mb3rt

    Newcomer

    Joined:
    Jun 8, 2020
    Messages:
    59
    Likes Received:
    121
    Yes, but I don't think the decompression side of things is an issue. We already have Oodle saying their compression formats work really well on existing GPUs, and it would probably cost nVidia or AMD pennies to add decompression ASICs to their cards.
     
  2. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,089
    I think using the existing hardware available on RTX GPUs was NV's hardware solution: it's faster, more flexible, and compatible with a wider range of cards. I assume AMD GPUs will get the same support somehow. It's in NV's, MS's, and AMD's interest to address this universally, since they all have huge stakes in this market.
     
    t0mb3rt likes this.
  3. dobwal

    Legend

    Joined:
    Oct 26, 2005
    Messages:
    5,955
    Likes Received:
    2,324
    Hindsight is 20/20. But during the Nvidia 2000/3000 design phase, HDDs were probably the dominant drives in terms of market share in the gaming space, and DirectStorage wasn't anywhere in sight. There was no point in offering decompression ASICs capable of decoding tens of GB of compressed data per second (ignoring DCC) on their consumer hardware.

    HDDs were incapable of pushing that much data to the GPU, and the data had to traverse the CPU anyway, which was capable of handling decompression at bandwidths measured in MB/s.

    Consoles have the tech because SSDs are standard on new-gen hardware, and console manufacturers have to be more forward-thinking since their hardware has to last 7-8 years into the future.
     
    Lalaland and B-Nice like this.
  4. dobwal

    Legend

    Joined:
    Oct 26, 2005
    Messages:
    5,955
    Likes Received:
    2,324
    I imagine it will be a shader-based solution. It's not ideal unless it offers random access, because of all the data that has to be shuttled back and forth across the GPU memory bus. It also takes up more VRAM. It's better to have ASICs sitting between video memory and the SSD.
     
    #144 dobwal, Sep 30, 2020
    Last edited: Sep 30, 2020
    B-Nice likes this.
  5. Johnny Awesome

    Veteran

    Joined:
    Feb 18, 2002
    Messages:
    2,806
    Likes Received:
    737
    Location:
    Windsor, ON
    So what you're saying is that future PC solutions will be better than consoles coming out a month from now? WOW!

    In other news: The earth orbits the sun.
     
    B-Nice likes this.
  6. Dictator

    Regular

    Joined:
    Feb 11, 2011
    Messages:
    681
    Likes Received:
    3,969
    We do not know the exact specifics, but I think it runs into conspiracy-theory territory to assume all those negative things about it being shady, and to mix that in with the fp32/int32 points you mentioned. Why talk about possible negatives if there is no evidence of them?
     
    DSoup, BRiT, Allandor and 3 others like this.
  7. Rikimaru

    Veteran

    Joined:
    Mar 18, 2015
    Messages:
    1,060
    Likes Received:
    426
    Has nVidia said which compression formats it supports? I assume zlib?
     
  8. Strange

    Veteran

    Joined:
    May 16, 2007
    Messages:
    1,698
    Likes Received:
    428
    Location:
    Somewhere out there
    Technically both orbit around the centre of mass of the Solar System, and that point moves around, even moving outside of the Sun, so no.
     
  9. Lalaland

    Regular

    Joined:
    Feb 24, 2013
    Messages:
    864
    Likes Received:
    693
    I suspect that you were at least partially addressing ToTTenTranz with your comment, but I'm going to expand on my point as to why I think this is not going to be a simple win for any vendor trying to do direct DMA across the PCI-E bus today. One of the advantages of PCI Express over the older PCI standard is that it moved the bus itself to a switched design, allowing CPU manufacturers to add arbitrary numbers of lanes in a simple hierarchy by adding multiple PCI roots and bridging them with PCI-to-PCI bridges. Internally the CPU has switches on the root bus to handle swapping between these in a fashion that is basically transparent to the user, so they perceive themselves as having 48 PCIe lanes when internally they have 3 x 16, for example.

    DF themselves ran into the complexities this generates during the Horizon Zero Dawn benchmarking, when it was discovered that, due to a misconfiguration of expansion cards, Alex had inadvertently halved the bandwidth available to his PCIe x16 slot; a quick juggling of expansion cards and his GPU got back an additional 8 lanes of PCIe. What HZD was doing should have been a bog-standard use of PCIe bus transactions, but most games steer well clear of doing things like that because of the support risks Alex ran into: why bother dealing with how customers have configured their boards when you can just use a technique that doesn't require as many bus transactions? (It also reinforces that HZD was a late port; no allowance was made at the design stage for PC issues, because it was designed for a fixed system where this was not a concern.)

    This remains an ongoing concern with PCIe in general, as can be seen in these notes from the Linux kernel (https://www.kernel.org/doc/html/latest/driver-api/pci/p2pdma.html). In that context they are mostly discussing the advanced ultra-low-latency NICs used by the likes of algo traders, which attempt to bypass any CPU involvement in NIC transactions at all. In their implementation they have been limited to only allowing P2P transactions within a given root complex, of which modern CPUs have multiple instances. For example, Kaby Lake Intel CPUs have an x16 link controlled by the CPU, typically dedicated to the GPU, while the other PCIe lanes are controlled by the root complex in the PCH southbridge; does that restrict transactions between the two in a P2P example? How do the multiple complexes in a Ryzen CPU deal with this? Do I need my NVMe drives to be on the same root complex as my GPU? Is any of this even relevant for Windows?

    P2P transactions across the PCI-E bus are a fairly new area for motherboard manufacturers to consider, and while long term this will all work out very well, right now Sony and MS have an obvious advantage in doing this with their total control over bus and board topology, versus the PC market where some board-layout choices defy explanation. If you are going to announce significant changes in how memory and bus transactions work in my PC, I am going to be sceptical that it can be simply and easily deployed if you don't come with detailed explanations of how it all works. When it comes from a section of the PC market already notorious for launching features early and aggressively marketing them as if they were widely adopted, I'm going to be even more sceptical.

    Additional context (nice deep dive on PCIe, PCH and bus layout here): https://forums.tomshardware.com/thr...-root-complex-pcie-lanes-and-the-pch.2115479/
     
    egoless, Pete, pjbliverpool and 2 others like this.
  10. You think I'm in the conspiracy theory level territory, I think you're in the drinking the Kool-Aid too soon territory. Perhaps the real territory is somewhere in the middle.
    I did my best to explain my position, though, and I've yet to see a reasonable counter-argument so far. Saying we should take everything from nvidia at face value because they wouldn't lie doesn't seem reasonable to me, especially at B3D.

    If everything in a videogame could be done with low-clocked highly parallel processors, the new consoles would have tens of Jaguar/Atom-class cores at 1.5GHz instead of "only" 8 Zen2 cores at 3.4GHz+.
    What would be your opinion if nvidia came out saying their GPUs are now 20x faster than a Zen2 core at running Javascript code?



    You.. do know @Dictator is Alex, right?
     
    Lalaland likes this.
  11. Lalaland

    Regular

    Joined:
    Feb 24, 2013
    Messages:
    864
    Likes Received:
    693
    Nope I thought that was John Linneman, sorry Alex!
     
  12. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    Nor was I suggesting they were. I'm fully aware they use a custom hardware unit.

    I assume you're well aware that modern GPUs have huge INT performance as well?

    Which is great apart from a couple of things:
    1. zlib does scale with CPU core count: here it shows clear scaling up to 18 cores (the highest number tested).
    2. No one said RTX IO would be using zlib, or indeed any LZ-family routine. BCPACK, a block-compression algorithm, seems more likely, but it could easily be something else entirely.
    3. Nvidia, the world's foremost GPU maker, say they can do decompression on the GPU at >14 GB/s output with minimal performance impact. They probably know a thing or two about this, so I'm inclined to believe them. Oh, and they've demonstrated it working to the press.

    And he didn't. He spent, what, a couple of minutes? And unless you know the economic trade-off between adding additional silicon to the main APU versus a custom hardware ASIC, it seems premature to dismiss the GPU-based option merely because it wasn't implemented in the consoles.

    As mentioned above, modern GPUs have massive INT throughput too. It may even be the case that RTX IO uses the Tensor cores, which would explain why it's limited to RTX-class GPUs and is described as having a tiny performance hit. You're looking at hundreds of TOPS on offer there, which are largely unused.

    Or they're constrained by NDA due to links into DirectStorage.

    Okay, so Nvidia aren't lying, they're just being dishonest. And presumably the demonstration they showed to the press was faked. Got it.
     
  13. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    Apparently all AMD CPUs since Zen support P2P DMA between root ports on the same complex (which would mean any two capable devices in a typical desktop system):

    https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.2-AMD-Zen-P2P-DMA

    I'm not sure about Intel though. It looks like their support may be more sketchy.

    I think the more fundamental question at this stage though is whether we're actually talking about P2P DMA here or in fact the data still goes via the CPU/system memory, but has much less interaction with the CPU. e.g. if we're looking at unbuffered data transfers with no CPU decompression then cutting the CPU out of the data flow in the diagrams might be warranted for illustration purposes to demonstrate the significantly reduced interaction.
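    To frame the root-complex question concretely, here is a toy model of the conservative rule in the kernel p2pdma notes linked earlier: allow peer-to-peer DMA only between devices under the same root complex. This is purely illustrative; the device names and topology dict are made up, and real code would discover the topology from the OS (e.g. by walking /sys/bus/pci on Linux) rather than hard-coding it.

```python
# Hypothetical topology: device -> root complex it sits under, loosely
# modelled on the Kaby Lake example above (GPU on the CPU's x16 complex,
# NVMe and NIC hanging off the PCH's complex).
TOPOLOGY = {
    "gpu":   "cpu_rc0",
    "nvme0": "cpu_rc0",
    "nvme1": "pch_rc1",
    "nic":   "pch_rc1",
}

def p2p_eligible(dev_a: str, dev_b: str, topo: dict[str, str]) -> bool:
    """Conservative rule from the kernel p2pdma docs: permit P2P DMA
    only when both devices sit under the same root complex."""
    return topo[dev_a] == topo[dev_b]

print(p2p_eligible("gpu", "nvme0", TOPOLOGY))  # True: same complex
print(p2p_eligible("gpu", "nvme1", TOPOLOGY))  # False: would cross complexes
```

    Under this model, whether an NVMe-to-GPU transfer qualifies depends entirely on which slots the board designer wired to which complex, which is exactly why consoles with fixed topologies have it easier.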
     
    Pete, Lalaland, BRiT and 1 other person like this.
  14. Jay

    Jay
    Veteran

    Joined:
    Aug 3, 2013
    Messages:
    4,029
    Likes Received:
    3,428
    I wouldn't be surprised if it has to support both methods.

    DirectStorage would need to; it wouldn't cut out that amount of hardware.
     
  15. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,632
    Location:
    The North
    That's a pretty intense take on his words, and I don't think Alex said anything offensive either, at least not enough to warrant a direct attack on another forum member. He said that to assume all of those negative things was to be conspiratorial. Now, I haven't followed this thread, so I'm just jumping in here, but there is a big difference between looking at the data points and suggesting an explanation versus choosing an explanation and then looking for data points to support it. The latter is what's actually considered conspiratorial; most of the time this happens, people concede their point when the evidence mounts against them. You'll need to decide whether you have been working from grounded data points towards an explanation, or choosing an endpoint (i.e. Nvidia is lying) and finding data to prove it.

    I think that if a journalist who has access to materials and is bound by embargo dates says to just wait around for the real news, it is a far reach to assume he means you should drink the Kool-Aid and take everything Nvidia says at face value. He may just know things you don't.
     
  16. Rikimaru

    Veteran

    Joined:
    Mar 18, 2015
    Messages:
    1,060
    Likes Received:
    426
    It does not scale if it is a continuous archive.
    The PS5, for example, uses 256KB chunks. Each of them can be decompressed simultaneously.
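    A minimal sketch of the difference, using zlib as a stand-in (the PS5's actual Kraken container format is not public): a single continuous stream cannot be entered mid-way, but independently compressed 256 KiB chunks can be decompressed in parallel.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK = 256 * 1024  # 256 KiB, matching the chunk size discussed above

def compress_chunked(data: bytes) -> list[bytes]:
    """Compress each 256 KiB slice as an independent zlib stream."""
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def decompress_chunked(chunks: list[bytes]) -> bytes:
    """Decompress all chunks concurrently. CPython's zlib releases the
    GIL on large buffers, so even threads get real parallelism here."""
    with ThreadPoolExecutor() as pool:
        return b"".join(pool.map(zlib.decompress, chunks))

payload = b"texture-ish, highly repetitive payload " * 100_000
chunks = compress_chunked(payload)
assert decompress_chunked(chunks) == payload

# A single continuous archive, by contrast, cannot be picked up mid-stream:
whole = zlib.compress(payload)
hit_error = False
try:
    zlib.decompress(whole[len(whole) // 2:])
except zlib.error:
    hit_error = True  # no valid stream header at an arbitrary offset
```

    The trade-off is that each chunk carries its own header and resets the compressor's history, so per-chunk compression ratios are slightly worse than one long stream.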
     
    Pete, PSman1700 and pjbliverpool like this.
  17. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    Yup, you're going to need a heck of a lot of those being transferred and decompressed simultaneously to get anywhere near saturating a 5.5GB/s SSD.
     
    DSoup, BRiT and PSman1700 like this.
  18. DSoup

    DSoup Series Soup
    Legend Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    16,775
    Likes Received:
    12,690
    Location:
    London, UK
    You're assuming Sony won't resurrect Studio Liverpool and have them reprise Wipeout at 120Hz with more and more bespoke geometry and textures until the PS5 explodes. :runaway: After that they can remaster G-Police and I will be happy. :yes:
     
  19. dobwal

    Legend

    Joined:
    Oct 26, 2005
    Messages:
    5,955
    Likes Received:
    2,324
    gwzip. gw stands for GameWorks. LOL
     
  20. It's just using decompression of several files in parallel.

    That ZIP decompression doesn't scale across CPU cores isn't hard to observe. Anyone with a PC can do it.
    Just try to compress a ~3GB folder using 7zip and then decompress it, preferably from one SSD to another (or from an SSD to e.g. a RAM drive) so that storage doesn't become the bottleneck.
    Then check in the Windows Task Manager how many threads are being pushed to high utilisation during decompression.


    I recommend reading this paper on the subject, in which the authors from IBM and Columbia U. comment on the limitations of Zlib for parallel computing performance and propose a new compression format that is parallel-friendly:
    Massively-Parallel Lossless Data Decompression



    Note: they're compressing HTML text and sparse matrices, which compress a lot more than textures; that's why zlib reaches 3:1 and 5:1 compression ratios there, whereas with textures it's usually ~1.8:1 or less.


    In the end, they came up with a compressor that is indeed much faster at decompressing, but it also has a much lower compression ratio (meaning effective throughput is very far from nvidia's "as fast as 24 cores" claim). With their method they spend a bit less energy than CPU zlib on the decompression operation, though at the cost of significantly more disk space, and they depend on a very high-throughput storage source. There's no free lunch here. Kraken is probably much better here, and so should BCPack be for textures.
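    The ratio-versus-speed trade-off is easy to quantify: delivered (decompressed) bandwidth is the raw read bandwidth times the compression ratio, so a codec that decompresses faster but compresses worse can still hand the GPU fewer usable bytes per second. A tiny sketch with illustrative numbers (the ratios are this thread's own estimates, not vendor figures):

```python
def effective_output_gbps(read_gbps: float, ratio: float) -> float:
    """Decompressed data delivered per second, given a raw SSD read
    rate and the codec's compression ratio."""
    return read_gbps * ratio

# A 5.5 GB/s drive with zlib's ~1.8:1 texture ratio...
print(round(effective_output_gbps(5.5, 1.8), 2))   # 9.9
# ...versus a codec with half the ratio on the same drive: the SSD
# would have to read twice as fast to deliver the same output.
print(round(effective_output_gbps(5.5, 0.9), 2))   # 4.95
```

    This is why a "much faster" decompressor with a weak ratio can lose overall: the bottleneck just moves back to the storage link.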


    They couldn't do anything remotely close to the performance of dedicated ASICs or hardware blocks.





    Or they're just using massive numbers of parallel decompression threads on different texture files, which in the end makes the 14GB/s throughput an unrealistic load for any real-life scenario. And although the aggregated throughput is high, the time it takes to decompress one large texture makes it unusable for actual texture streaming in games.
    Only future GPUs that actually have dedicated hardware blocks for decompression will ever make DirectStorage with GPU decompression usable.




    It's the same nvidia-the-world's-foremost-GPU-maker who presented on stage a graphics card for (paper-)launch day that photos later showed was held together by wood screws.
    Lack of information is usually suspicious, and in this case they're omitting a ton of it.




    He starts talking about storage at the 5 minute mark. He starts talking about Kraken and the Custom IO Unit at ~17m. He moves on from the storage talk at ~24m. In a 53 minute presentation.



    It's large throughput with very low single-threaded performance. Still not a good match for decompression.



    They didn't show DirectStorage with CPU decompression; they only showed DirectStorage on the GPU versus the current path on the CPU. It's apples vs. oranges.
    Why didn't they show CPU vs. GPU, both on DirectStorage? Ask yourself why they would hide that if their GPU is so much faster than the CPU at decompression. For all I know, the IO-overhead reduction alone is responsible for that speedup.
    They also didn't say whether those 24 Threadripper cores were concurrently taxed or not. For all you know, a 6-core Ryzen 3600 using DirectStorage could have achieved faster loading times than the RTX IO result.




    Few textures are going to be 256KB. A 32-bit-colour 4K*4K decompressed texture is ~67 MB ([4096 * 4096 * 32 bits] / 8). With lossless delta colour compression I think we're looking at about half of that, so ~33 MB, and Kraken compression should put it around the 20 MB mark.
    Using 256KB as the block size is just a means to guarantee maximum throughput from the SSD controller, given its limit on I/O operations. It doesn't mean you can do anything with an isolated block on its own. The PS5's custom IO controller probably needs to gather several 256KB blocks and join them into a larger compressed texture file (probably inside the ESRAM they mentioned).
    Texture decompression can only happen after the large compressed file has been put together in one place. In the case of a 20MB compressed 4K texture, that's 80 x 256KB blocks to assemble before you can start decompressing the texture.
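    The arithmetic above can be checked directly (the 2:1 DCC and ~20 MB Kraken figures are the post's own estimates, not published numbers):

```python
def raw_texture_bytes(width: int, height: int, bpp: int) -> int:
    """Uncompressed size of a texture at bpp bits per pixel."""
    return width * height * bpp // 8

def blocks_needed(compressed_bytes: int, block: int = 256 * 1024) -> int:
    """Ceiling division: a partial block still costs a full 256 KiB read."""
    return -(-compressed_bytes // block)

raw = raw_texture_bytes(4096, 4096, 32)      # 67,108,864 bytes (~67 MB)
after_dcc = raw // 2                         # ~33.5 MB assuming 2:1 DCC
print(raw, blocks_needed(20 * 1024 * 1024))  # 80 blocks for a 20 MiB texture
```

    So even under optimistic compression estimates, a single 4K texture spans dozens of blocks, which is why per-block parallelism alone doesn't shorten the latency of getting one complete texture ready.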





    Well, now I'm just trying to figure out what you interpreted as a pretty intense take on whose words, and what exactly you're implying was a direct attack.
    I certainly didn't mean any of what I wrote as an attack, only as presenting a diverging opinion.
     
    Unknown Soldier, Lalaland and Pete like this.