DirectStorage GPU Decompression / RTX IO

Discussion in 'Rendering Technology and APIs' started by DavidGraham, Apr 21, 2021.

  1. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    The "others" literally being Microsoft themselves.

     
    PSman1700 likes this.
  2. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,102
    Location:
    35.1415,-90.056
    And again, you fail to recognize the context which I've twice now tried to educate you with.

    Issuing tens or hundreds of thousands of I/O operations requires a lot of power to service those requests. A reduction in CPU time absolutely helps a system which is power constrained or otherwise CPU burdened -- both of which directly apply to consoles with "reserved" CPU capacity and shared iGPU power draw to contend with.

    On a desktop PC with dedicated power budget to just the CPU, the cycles aren't a limiter.

    We can directly prove this by showing Windows boxes crushing the IOP rate of any bullshit game when executing real, high performance enterprise workloads without any problem whatsoever. The paltry I/O needs of any video game pale in comparison to a SQL cluster managing centralized store replenishment for six thousand five hundred stores cross-linked with DMV registration data for every car in every county, cross-linked again with weather statistics for the past 18 months, cross-linked again with prior sales data for the same 18 months, cross-linked once more with the topology of a spoke-and-hub supply chain.

    I could probably light a block of concrete on fire with the power consumption of that SQL cluster and the NVMeOF frame it's connected to. Your assertion of IOP limits in Windows being anything related to a SATA interface is laughable, at best.
     
    #62 Albuquerque, Jul 7, 2021
    Last edited: Jul 7, 2021
  3. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    9,044
    Likes Received:
    1,116
    Location:
    WI, USA
    Yeah you can get by pretty well with your games on a lowly hard drive even today. Load times are longer of course but it doesn't cripple the experience.

    I guess we'll see if upcoming games do something drastically different.
     
    PSman1700 likes this.
  4. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    This makes no sense. The consoles have fixed frequency CPUs (yes the PS5 is slightly variable, but not massively so like a mobile CPU).

    But taking the XSX CPU for example, regardless of the console's power constraints you have a guaranteed frequency of 3.5 GHz. That's on an 8 core Zen 2 based CPU. There are plenty of PC CPUs that don't have those resources, so suggesting PCs don't need DirectStorage because they don't have the same resource constraints as consoles makes no sense. And that's before we even consider the hardware decompressors on the consoles.
     
    PSman1700 likes this.
  5. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,102
    Location:
    35.1415,-90.056
    Despite it not making sense to you, the reality is a modern Windows kernel can push multiple millions of IOPS with service times in the dozens of microseconds.

    Game load times are not bottlenecked by a Windows I/O stack failure on modern, capable hardware.
     
  6. Remij

    Regular

    Joined:
    May 3, 2008
    Messages:
    677
    Likes Received:
    1,256
    The reality is that MS are creating DirectStorage for a reason... Sub 1 second.. or close to 1 second load times... ie instantaneous.. and to reduce as much strain on the cpu at run-time as possible.
     
  7. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,102
    Location:
    35.1415,-90.056
    Yes, we agree: they're optimizing an I/O path for devices which are CPU constrained. It seems to make a lot of sense on consoles.

    There are multitudes of datapoints and reviews showing desktop gaming PC's are not experiencing I/O wait for loading game assets. Your statement does not invalidate anything I've said, nor does it refute millions of Windows servers (running the exact same kernel as Windows 10) being able to perform disk I/O at a level far beyond anything a video game would need.
     
  8. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    For whatever* reason, decompression of assets seems to be single-threaded or at least very shyly multithreaded on gaming PCs. DirectStorage removes that and gives a kind-of-guaranteed decompression performance that does not tax your PC's CPU.
    upload_2021-7-8_18-50-44.png
    upload_2021-7-8_18-52-32.png
    *my guess is: You can never know beforehand how many cores there are in a PC, and most devs cannot be bothered to write a dynamic decompression.
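For illustration only, here is a minimal Python sketch of what "dynamic decompression" could look like: the worker count is read from the machine at runtime instead of being fixed at build time. It assumes assets are stored as independently compressed chunks and uses zlib as a stand-in codec; the helper names, the 64 KiB chunk size and the "leave one core free" policy are all hypothetical, not anything a real engine or DirectStorage is known to do.

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def compress_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Split data into fixed-size chunks and compress each one independently."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def decompress_parallel(chunks: List[bytes], workers: Optional[int] = None) -> bytes:
    """Decompress independent chunks on a pool sized from the detected core count."""
    if workers is None:
        # Hypothetical policy: leave one core for the rest of the game,
        # but never go below a single worker.
        workers = max(1, (os.cpu_count() or 1) - 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the results can be concatenated directly.
        return b"".join(pool.map(zlib.decompress, chunks))


asset = bytes(range(256)) * 4096            # 1 MiB of sample data
chunks = compress_chunks(asset, 64 * 1024)  # 16 independently compressed chunks
restored = decompress_parallel(chunks)
```

The chunks have to be compressed independently for this to work at all; a single monolithic stream cannot be split across workers this way, which may be part of why existing titles end up on one or two threads.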
     
    BRiT likes this.
  9. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    So just to be clear, why do you think Microsoft are saying that they are, and what do you think is the real bottleneck preventing NVMe drives with 10x the throughput of SATA SSD's from delivering significant speed ups in game load time?

    Then why wouldn't it also make sense on PC? Granted there are much more powerful CPUs in the PC space, but unless you're rocking a 3700X, which is still pretty high end compared to the average PC, then you're likely just as CPU constrained or even more so than the consoles.

    So if this makes sense for the consoles then surely it makes sense for those PC's as well.

    And taking that one step further, even if you have a 5950X with 16 cores, why would you want to bog down 5 or 6 of them purely to service IO requests and decompression when DirectStorage can reduce that to a small fraction of a single core?

    I'm curious to see these benchmarks showing NVMe drives demonstrating multiple times the game load speed performance of SATA SSDs.
     
    PSman1700 likes this.
  10. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    17,879
    Likes Received:
    5,330
    I don't think the first quote says what you think it does ;)
     
  11. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,102
    Location:
    35.1415,-90.056
    I'm sorry that you think millions of incredibly high-end Windows servers running the Windows 10 kernel are limited to the speed of a SATA disk interface.

    I don't know why you think that.

    It isn't correct in the slightest.

    Microsoft pointed out an optimization to their console systems which reduces CPU overhead for issuing lots and lots of I/O requests. Yeah, that's nice and probably worthwhile. Yeah, it's easily ported to the PC.

    No, that doesn't equate to every Windows 10 PC being massively bottlenecked to SATA disk speeds.

    I'm done having this incredibly stupid and narrow-minded conversation. There is infinitely more data showing the Windows 10 kernel having enormous amounts of disk I/O headroom than there are data showing the contrary.

    There are plenty of sales pitches for it though, aren't there?

    NV RTX IO, available only for your $2000/ea video card! Buy a few :)

    I bet I can find a boxed copy of SoftRAM for you, too. I'll make you a deal...
     
  12. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,210
    Yeah, that appears to be the main selling point of DirectStorage on PC.

    You are absolutely correct, Windows servers can offer millions of I/Os, but the cost in CPU time is high, and CPUs also have to decompress game data during loading and during streaming. The problem is that current gaming APIs are incredibly inefficient in using CPU resources for these logistical tasks, as most of these tasks are still single threaded. DirectStorage is simply a way to correct these limitations in the DirectX API specifically, freeing the CPU to focus more on its natural tasks of handling game logic, physics and simulation rendering.
     
    PSman1700 likes this.
  13. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,393
    The key point of DS on PC is the standardization of GPU decompression. Read speeds on PC are not limited by storage, they are limited by data processing (decompression), which tends to happen on a couple of CPU threads at best.
     
  14. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,393
    https://devblogs.microsoft.com/directx/directstorage-developer-preview-now-available/

    So compatible with Win10 as old as 1909 even.
     
    PSman1700, Krteq and BRiT like this.
  15. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,088
  16. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    And that's actually a key feature here. Because even though we have had overlapped IO with IO completion ports for a long time now, it requires a decent amount of per-request overhead for bookkeeping what's in flight, both in the application and in the OVERLAPPED structures. Plus, even though there was already GetQueuedCompletionStatusEx for batched processing of completion events, batched submission on the other hand did not exist.

    Well, if you don't insist on batched submission, massive parallel scatter loading was already possible before too, though. Batching ain't that difficult either, if you are willing to pay for the bookkeeping across multiple pending batches.

    There is yet another catch though, if you can't batch requests, good old friends named AV software (and a bunch of other freeloaders on the IO stack) are also waking up for every single request, and these are often where the actual CPU time part of the IO overhead comes from.
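As a rough illustration of why batched submission matters, here is a toy Python model. It is not the Win32 or DirectStorage API: `ToyIoStack` and both submit helpers are hypothetical, and the idea that everything hooked into the stack runs once per submission call rather than once per request is an assumption taken from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Request = Tuple[int, int]  # (file offset, length in bytes)


@dataclass
class ToyIoStack:
    """Counts how often the (hypothetical) I/O stack is traversed."""
    traversals: int = 0
    completed: List[Request] = field(default_factory=list)

    def submit(self, requests: List[Request]) -> None:
        # One traversal per submission call, however many requests it carries.
        # In this model, filter drivers such as AV would wake up once here.
        self.traversals += 1
        self.completed.extend(requests)


def submit_one_by_one(stack: ToyIoStack, requests: List[Request]) -> None:
    for req in requests:
        stack.submit([req])


def submit_batched(stack: ToyIoStack, requests: List[Request], batch_size: int) -> None:
    for i in range(0, len(requests), batch_size):
        stack.submit(requests[i:i + batch_size])


requests = [(n * 65536, 65536) for n in range(1000)]

single = ToyIoStack()
submit_one_by_one(single, requests)

batched = ToyIoStack()
submit_batched(batched, requests, batch_size=64)

print(single.traversals, batched.traversals)  # 1000 16
```

Both paths complete the same 1000 requests; only the number of trips through the stack differs, which is where the per-request CPU cost would be paid.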
     
  17. Seanspeed

    Newcomer

    Joined:
    Apr 23, 2021
    Messages:
    137
    Likes Received:
    204
    I'm not trying to weigh in on who definitely is or isn't correct, just want to point out that using current PC games as a benchmark doesn't really work cuz they aren't built, in terms of data structure and memory management and all that, to take advantage of SSDs the way some select 'next gen' games are on consoles so far. The point is that IO requests will increase drastically. Maybe you're right and this still won't bottleneck the system and Microsoft are basically wasting their time on a fairly pointless update (given that next gen games will be targeting fairly well spec'd systems), but that seems a bit strange, no? You say it's just for CPU-limited systems like consoles, except the consoles have pretty good CPUs in them, that are only a mild glance back compared to desktop CPUs. It's hard to imagine that this gap includes, somewhere in its fairly short size, the defining line at which things go from not good enough to good enough, so that MS have needed to step in and do some emergency upgrades to ensure things work fine on consoles.

    I would guess MS do know something we don't and that there will be tangible benefits to come from them doing all this work.
     
  18. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,102
    Location:
    35.1415,-90.056
    Show us one and we can have that conversation.

    There are Optane P5800X reviews (PCIe 4.0 x4, single U.2 interface) on the net now showing Microsoft SQL Server performance nearing a million IOPS with I/O service times below 100 μs. This again is on a single storage device, using the same Windows kernel as a Windows 10 distribution. This isn't bulk-rate reading a single enormous flat file for maximum bandwidth, this is query results, which means a LOT of random I/O.

    I'm still waiting for someone to show any data at all that the Windows kernel is bottlenecking disk I/O today. I'd love to hear a realistic story as to why stupidly-simple-in-comparison game data files are somehow more complex and more onerous to disk throughput than an enterprise-scale transactional database.

    I suspect what we're really facing here is an API which does all the heavy lifting work for game designers who don't want to put in the code effort, which is truly fine. Making it easier for a dev is a rational and reasonable argument, far more so than making the kernel service I/O in some remarkable, game-changing (ha!) faster way. I'm sure the kernel can use more tweaking as all code can; it isn't bottlenecking disk I/O today on the crap storage we find in commodity grade consumer devices like typical NVMe drives.

    I also buy into the GPU decompression conversation being more of the "meat and potatoes" of a newfangled feature being added.
     
  19. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    But how many CPU cores are those servers using to hit that level of data transfer? And how much of that data are they having to decompress on the fly? This isn't about saying PC's can't move that much data, it's about reducing the massive overhead associated with it.

    We literally have it straight from Digital Foundry and Microsoft's Andrew Goossen:

    That's just the overhead associated with a 2.4GB/s SSD. You can multiply that by almost 3x for the fastest SSDs on the market, which are themselves capable of 1m IOPs. How many gaming PCs can afford to throw 15 Zen 2 cores at a game loading scenario? Or even worse, at in-game streaming? And that is of course assuming that games can spread the IO load and decompression over multiple CPU cores evenly rather than being single thread limited, which is more often the case. In which case you're limited to about 1.2GB/s on the IO side, and around 800MB/s on the decompression side.
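The arithmetic behind those figures can be sketched as follows. Two things here are assumptions rather than quoted numbers: the 5-core baseline at 2.4GB/s is inferred from the 15-core figure and the "almost 3x" multiplier, and 7.0GB/s is a stand-in for "the fastest SSDs on the market". Scaling cores linearly with bandwidth is also an assumption of this sketch, not an established fact.

```python
def cores_needed(target_gbps: float, baseline_gbps: float, baseline_cores: float) -> float:
    """Scale a baseline core count linearly with throughput (assumed model)."""
    return baseline_cores * target_gbps / baseline_gbps


def pipeline_throughput(*stage_gbps: float) -> float:
    """A serial pipeline runs no faster than its slowest stage."""
    return min(stage_gbps)


BASELINE_GBPS = 2.4   # figure from the post
BASELINE_CORES = 5.0  # assumption: inferred from "15 cores" at "almost 3x"
FAST_SSD_GBPS = 7.0   # assumption: roughly 3x the baseline

fast_cores = cores_needed(FAST_SSD_GBPS, BASELINE_GBPS, BASELINE_CORES)
print(round(fast_cores, 1))  # 14.6

# Single-thread limits from the post: ~1.2GB/s for IO, ~0.8GB/s for decompression.
single_thread_limit = pipeline_throughput(1.2, 0.8)
print(single_thread_limit)  # 0.8
```

Under those assumptions the single-threaded case is capped by the decompression stage, whatever the drive can deliver.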

    Then what's the explanation for this?


    https://www.techpowerup.com/review/western-digital-wd-black-sn850-1-tb-ssd/13.html
     
    PSman1700 likes this.
  20. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,102
    Location:
    35.1415,-90.056
    Yeah, so a desktop PC with eight cores at a minimum is going to be fine for level loading at the maximum rate of a commodity SSD. Remember, Zen 2 is a prior architecture -- and I've been panned for insinuating consoles are at a CPU deficit. Guess what? Consoles are at a CPU deficit compared to desktops.

    You're conflating terms here. CPU consumption isn't directly a function of I/O bandwidth; in fact you can achieve maximum disk bandwidth with a relatively tiny amount of I/O and related CPU. Processor consumption is a function of total outstanding I/Os in the stack, which doesn't necessarily relate to overall bandwidth. There are multiple factors there: total depth of the I/O queues, number of I/O queues available (at least four on commodity NVMe disks, only one for ATA), and service times of the disk serving up your requests. If you're going to proxy CPU to disk I/O, then IOPs is a closer way to track this, not bandwidth. Now to get to the point: how many IOPs are we dealing with during these level loads? Do we know? Because if we don't, then we need to find out.
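To make the bandwidth-versus-IOPs distinction concrete, here is a small Python sketch. The 2.4GB/s figure comes from earlier in the thread; the request sizes (1 MiB and 4 KiB) and the 100 μs service time are illustrative assumptions, and Little's law is used as a standard way to relate IOPs and service time to outstanding I/Os.

```python
def iops_for_bandwidth(bandwidth_bytes_per_s: float, request_bytes: int) -> float:
    """I/O operations per second needed to sustain a bandwidth at a given request size."""
    return bandwidth_bytes_per_s / request_bytes


def outstanding_ios(iops: float, service_time_s: float) -> float:
    """Little's law: average requests in flight = arrival rate * time in system."""
    return iops * service_time_s


BANDWIDTH = 2_400_000_000  # 2.4GB/s, as discussed earlier in the thread

large = iops_for_bandwidth(BANDWIDTH, 1024 * 1024)  # 1 MiB requests (assumed)
small = iops_for_bandwidth(BANDWIDTH, 4096)         # 4 KiB requests (assumed)

print(round(large))  # 2289
print(round(small))  # 585938

# Same bandwidth, 256x more operations to issue, track and complete.
print(small / large)  # 256.0

# With an assumed 100 microsecond service time, the average queue occupancy:
in_flight = outstanding_ios(small, 100e-6)
```

The same 2.4GB/s can mean a couple of thousand operations per second or more than half a million, which is why the request count during a level load is the number worth finding out.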

    Part of this goes back to file management of the game itself, which should be obvious now.

    Honestly? All of them. It's a level load! We aren't actively playing the game here, we're waiting to play the game. Want to make an argument about GPU decompression? Great, but you can't buy a "gaming" desktop today with less than eight CPU threads, that gives you six more threads for doing... something else.

    Writes are far less painful than reads, especially asynchronous writes (which would be indicative of recording your streamed video game to a video file.) Reads are pathological because they're blocking operations; writes are not blocking unless you have a specific need for 100% data integrity, which isn't what any of the commodity streaming recording studio software is doing. Writes are cached and coalesced by the OS and then written to disk as a few, large contiguous blocks rather than scattered in zillions of individual I/Os (aka, random reads.)
    Explanation is simple: they aren't waiting on disk.

    The level load time shows literally nothing about describing a bottleneck, pro or con. Want to actually demonstrate support for your attempt at a point? Toss in a perfmon log during level load and have it show disk service times. Here's a hint: you aren't waiting on the service times of the disk, which is why faster and faster disks aren't making a difference to the level load time.
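A perfmon log is the right tool for device-level service times. As a much cruder application-level stand-in, the load path itself can be timed in two phases to see where the wall-clock time goes. The Python sketch below is hypothetical: the file name, the zlib codec and the payload are made up, and because the file has just been written it will almost certainly be served from the OS cache, so the read figure here says nothing about a real disk.

```python
import os
import tempfile
import time
import zlib
from typing import Dict, Tuple


def timed_load(path: str) -> Tuple[bytes, Dict[str, float]]:
    """Load a compressed asset and report time spent reading vs. decompressing."""
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        compressed = f.read()            # blocks until the data has arrived
    t1 = time.perf_counter()
    data = zlib.decompress(compressed)   # pure CPU work
    t2 = time.perf_counter()
    return data, {"read_s": t1 - t0, "decompress_s": t2 - t1}


payload = os.urandom(1024) * 4096  # 4 MiB of repetitive sample data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "asset.bin")  # hypothetical asset file
    with open(path, "wb") as f:
        f.write(zlib.compress(payload))
    data, timings = timed_load(path)

print(timings)
```

If the read phase is a small share of the total on a cold cache, a faster disk cannot shorten the load by much, which is the claim being made above.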
     
    #80 Albuquerque, Jul 20, 2021
    Last edited: Jul 20, 2021