Velocity Architecture - more than 100GB available for game assets

Discussion in 'Console Technology' started by invictis, Apr 22, 2020.

  1. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    11,514
    Likes Received:
    12,373
    Location:
    The North
    There's a little more nuance to what MS built into XSX. We're waiting on more details, but it's a minor addition to a greater whole.
    One might look at 360 emulation as being entirely software, but there is a small custom hardware component that emulates one very specific part which would be too difficult to do in software.

    In the same way, the XVA is a bit like that: mainly software, but there is a small component governing how the GPU can access SSD storage that isn't done in software. Unfortunately we know little about it, which makes it hard to discuss and easy to dismiss as non-critical.

    If you recall earlier, they used XBO code to test 4K on Scorpio. Scorpio gave MS the data they needed to optimize for 4K, because so many titles were available to test. They could see what was happening in game code with respect to memory usage. Then they focused on building this. So now they have this software solution, and they have bespoke hardware to improve it. I'm not going to tell you that this is a superior solution or anything. I'm just saying that Sony and MS had different data to use to solve a problem, and each approached it differently based on their requirements.

    So I think what I see here is that MS opted for slower SSD hardware, in hopes that they could develop a solution in which streamed assets are significantly smaller in size. Sony may have been looking at more traditional methods and worked out that they needed a significantly faster drive to do the same thing.

    It might be easy to say that Sony can just leverage these learnings and integrate them into PS5, but it's not that straightforward to just pick up and go, unfortunately. So we don't know if Sony will support these features. They may have architected their API/chip differently here. And it happens: you can sometimes program your shit in such a way that the only way to get a different feature in is to rewrite everything. Sometimes that's not worth it.

    I would not be surprised if the bespoke hardware operated in this manner - a little bit like how your phone responds very well to 'Hey Google' or 'Hey Siri'. But if your voice command exceeds like 3-5 words it needs to be sent to the cloud for processing, whereas very short 3-5 word voice commands can be processed by a tiny NN locally.

    A small, quick-running NN for texture up-resolution in scenarios where the drive isn't fast enough.
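
    As a toy illustration of that kind of deadline-driven fallback (entirely hypothetical - none of these names or numbers come from MS), a sketch in C: if the streamed high-res mip can't arrive inside the frame budget, up-res the mip that's already resident and fetch the real one in the background.

    Code:
    #include <stdio.h>

    /* Hypothetical latency-driven texture fallback, not any real XVA API. */
    typedef struct {
        int    resident_mip;  /* best mip currently in RAM (higher = blurrier) */
        int    wanted_mip;    /* mip level the renderer asked for */
        double fetch_ms;      /* estimated time to stream the wanted mip */
    } TexRequest;

    #define FRAME_BUDGET_MS 16.7  /* one 60 fps frame */

    static const char *resolve(const TexRequest *r)
    {
        if (r->resident_mip <= r->wanted_mip)
            return "use resident mip";             /* already sharp enough */
        if (r->fetch_ms < FRAME_BUDGET_MS)
            return "stream high-res mip from SSD"; /* drive is fast enough */
        return "NN up-res of resident mip";        /* too slow: fake it this frame */
    }

    int main(void)
    {
        TexRequest slow = { 3, 0, 40.0 };  /* fetch would take ~40 ms */
        printf("%s\n", resolve(&slow));    /* -> NN up-res of resident mip */
        return 0;
    }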
     
    #21 iroboto, Apr 23, 2020
    Last edited: Apr 23, 2020
  2. mrcorbo

    mrcorbo Foo Fighter
    Veteran

    Joined:
    Dec 8, 2004
    Messages:
    3,957
    Likes Received:
    2,706
    This is from the DF article.

    This is separate and distinct from the hardware decompression block.

     
    blakjedi, Scott_Arm and PSman1700 like this.
  3. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,461
    Likes Received:
    3,127
    Location:
    Wrong thread
    Okay, so this is a post I've been thinking about for a couple of days (not a good sign), so it's a copy and paste from Notepad.

    I've been wondering about the "Velocity Architecture" and MS's repeated mentions of low latency for their SSD - something that's been sort of backed up by the Dirt 5 technical director. There's also the talk of the "100GB of instantly accessible data", aka "virtual RAM".

    Granted, I could be reading too much into some fairly vague comments, but I think there's probably something to them, and also that the two things are possibly related. So I think that maybe one of the key things that allows MS to have such (presumably) low latency from the SSD is also responsible for the strange-seeming "100GB" figure.

    Now, I'm assuming that the "virtual memory" is storing data as if it were already in, well, memory. So the setup, initialisation and all that is already done, and that saves you some time and overhead when accessing from storage compared to, say, loading assets from an SSD on PC. But this virtual memory will need to be accessed via a page table, which then has to go through a Flash Translation Layer. Normally this FTL is handled by the flash controller on the SSD, accessing, if I've got this right, an FTL stored either in an area of flash memory, in DRAM on the SSD, or in DRAM on the host system.

    XSX has a middling flash controller and no DRAM on the SSD. So that should be relatively slow. But apparently it's not (if we optimistically run with the comments so far).

    My hypothesis is that for the "100GB of virtual RAM" the main SoC is handling the FTL, doing so more quickly than the middling flash controller with no DRAM of its own, and storing a snapshot of the FTL covering 100GB for the current game in an area of system reserved / protected memory, to make the process secure for the system and transparent to the game. Because this is a proprietary drive with custom firmware, MS can access the drive in a raw-mode-like way, bypassing all kinds of checks and driver overhead that simply couldn't be bypassed on PC. And because it's mostly or totally read access other than during install / patching, data coherency shouldn't be a worry either.

    My thought is that this map of physical addresses for the system-managed FTL would be created at install time, updated when wear-levelling operations or patching take place, and stored perhaps in some kind of metadata file for the install. So you just load it in with the game.
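
    To make the hypothesis concrete, here's a minimal sketch (all names and sizes are my own illustration, nothing MS has described): the installer writes out a flat logical-to-physical page map, the system loads it into reserved memory at launch, and translation becomes a single array read on the host instead of a round trip through the DRAM-less controller.

    Code:
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical host-managed FTL snapshot: a flat array mapping each
     * logical 4 KiB page of the install to a physical flash page. Built
     * at install time, refreshed after wear-levelling or patching. */
    typedef struct {
        const uint64_t *log_to_phys;  /* index: logical page -> physical page */
        size_t          page_count;
    } FtlSnapshot;

    #define PAGE_SIZE 4096u

    /* One in-memory lookup instead of a controller round trip. */
    static uint64_t ftl_translate(const FtlSnapshot *ftl, uint64_t offset)
    {
        uint64_t page = offset / PAGE_SIZE;
        return page < ftl->page_count
             ? ftl->log_to_phys[page] * PAGE_SIZE + offset % PAGE_SIZE
             : UINT64_MAX;  /* outside the mapped window */
    }

    int main(void)
    {
        const uint64_t map[4] = { 7, 2, 9, 5 };  /* toy 16 KiB "install" */
        FtlSnapshot ftl = { map, 4 };
        /* logical page 1 lives at physical page 2, so this prints 0x2064 */
        printf("0x%llx\n", (unsigned long long)ftl_translate(&ftl, 4096 + 0x64));
        return 0;
    }

    Note that at 4 KiB granularity a 100GB map is roughly 25 million entries, i.e. around 200MB at 8 bytes each, which would line up with the "additional use of system reserved DRAM" downside below.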

    And as for the "100GB" number, well, the amount of reserved memory allocated to the task might be responsible for that arbitrary-seeming figure too.

    The best I could find out from Google, in an MS research paper from 2012 (https://static.usenix.org/events/fast12/tech/full_papers/Grupp.pdf), was that they estimated the FTL might be costing about 30 microseconds of latency. That wouldn't be insignificant if you could improve on it somewhat.

    So the plus side of this arrangement would be, by my thinking:
    - Greatly reduced read latency
    - Greatly improved QoS guarantees compared to PC
    - No penalty for a dram-less SSD
    - A lower cost SSD controller being just as good as a fast one, because it's doing a lot less
    - Simplified testing for, and lower requirements from, external add-on SSDs

    The down sides would be:
    - You can only support the SSDs that you specifically make for the system, with your custom driver and custom controller firmware
    - Probably some additional use of system reserved dram required (someone else will probably know more!)

    Any thoughts would be appreciated.

    @DSoup, as your earlier posts about accessing data from an SSD on Windows have sort of led to this idea, your feedback would be especially appreciated. Hopefully some of this makes sense to ... somebody!

    Edit: thanks for the move to the proper thread, mod! :oops:
     
    #23 function, Jul 12, 2020
    Last edited: Jul 12, 2020
  4. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    13,245
    Likes Received:
    8,759
    Location:
    London, UK
    I can't offer much insight because, as you said, these are thoughts based on a number of vague comments, and much of what I wrote was about the Windows I/O stack, which is likely very different from Xbox Series X's. But it would indeed be truly amazing if Sony have prioritised raw bandwidth and Microsoft have prioritised latency. :yes:

    My gut tells me that if this is what has happened, they'll largely cancel each other out except in cases where one scenario favours bandwidth over latency and another favours latency over bandwidth. Nextgen consoles have 16GB of GDDR6, so raw bandwidth is likely to be preferable in cases where you want to start/load a game quicker, e.g. loading 10GB in 1.7 seconds at 100ms latency compared to 3.6 seconds at 10ms latency. Where latency could make a critical difference is frame-to-frame rendering and pulling data off the SSD for the next frame, or the frame after.
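
    To spell out that arithmetic (using the illustrative numbers above, which are admittedly fiction, with the bandwidths back-derived from them): for a bulk load, total time is roughly one latency hit plus size divided by bandwidth, so a one-off 100ms of latency is noise next to seconds of transfer.

    Code:
    #include <stdio.h>

    /* Bulk-load time = one-off access latency + transfer time.
     * Bandwidth figures are reverse-engineered from the example, not specs. */
    static double load_seconds(double size_gb, double bw_gbps, double latency_ms)
    {
        return latency_ms / 1000.0 + size_gb / bw_gbps;
    }

    int main(void)
    {
        printf("high-bw, high-latency: %.2f s\n", load_seconds(10.0, 6.25, 100.0)); /* ~1.70 */
        printf("low-bw,  low-latency : %.2f s\n", load_seconds(10.0, 2.79, 10.0));  /* ~3.59 */
        return 0;
    }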

    The SSDs in both nextgen consoles hugely improve both latency and bandwidth over current gen (and PC) today, but it really feels like, no matter what decisions Microsoft and Sony have made for this generation, there will actually be only marginal differences between the actual games themselves. Look at launch PlayStation 4 vs Xbox One, with the clear disparity in GPU/compute (18 CUs vs 12 CUs), and how that really didn't make much difference to actual games.
     
    JPT and function like this.
  5. P_EQUALS_NP

    Joined:
    Jun 17, 2020
    Messages:
    8
    Likes Received:
    1
    That 100GB may be a special section of the NAND flash. For example, the first page of a flash erase block can be accessed quicker than the next few pages of the erase block.
     
  6. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    592
    Likes Received:
    312
    It will not be. The 100GB will not be a specific area of the drive; it's just 100GB worth of game assets, wherever they happen to be located.

    A lot of people here seem to think that the 100GB is some kind of scratchpad where things will be written. This cannot possibly work, because there simply isn't enough write endurance to allow frequent large writes to the drive. Instead, the way it probably works is that you are just allowed to map up to 100GB worth of game assets into memory, specifying what compression you are using for them, and the hardware will page them into RAM for the GPU on demand.
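
    The plain-POSIX version of that idea is easy to sketch (just an analogy for the mapping described above - the console mechanism would add compression-aware, GPU-visible paging on top): map a read-only asset file and let the OS fault pages in as they're touched.

    Code:
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("assets.pak", O_RDONLY);  /* hypothetical asset pack */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Nothing is read yet: the file is mapped, not loaded. */
        const unsigned char *assets =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (assets == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touching a byte faults just that page in from the drive; the OS
         * can evict it freely later, since read-only pages are never dirty. */
        printf("first byte: 0x%02x\n", assets[0]);

        munmap((void *)assets, st.st_size);
        close(fd);
        return 0;
    }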
     
    Silent_Buddha, milk, rntongo and 6 others like this.
  7. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,818
    Likes Received:
    800
    Location:
    Somewhere over the ocean
    And what if part of the disk is configured as SLC, to increase endurance for frequent writes and to increase speed?
     
    John Norum likes this.
  8. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    592
    Likes Received:
    312
    It still does not give enough endurance, and configuring part of the disk as SLC does not improve read speeds.
     
  9. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,818
    Likes Received:
    800
    Location:
    Somewhere over the ocean
    SLC has lower latency and better read/write performance compared to TLC/QLC.
    In fact, it's not my idea but a popular configuration in low-end M.2 SSDs to keep prices down and performance up on normal desktop workloads.
     
    John Norum likes this.
  10. Ronaldo8

    Newcomer

    Joined:
    May 18, 2020
    Messages:
    233
    Likes Received:
    232
    The "100 GB" thing is not a scratchpad. MS never implied that this is the case and it will be a mindbogglingly stupid way to use an SSD in a system where RAM quantity has been carefully provided for. The scratchpad idea has been pushed by people who have been lazily assuming that the XSX I/O is good old PC virtual memory paging when it is in fact memory mapping akin to bank switching of old but with a modern twist (Direststorage?).
     
    tinokun likes this.
  11. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,576
    Likes Received:
    16,031
    Location:
    Under my bridge
    You need to chill out a bit. I don't think there's any pushing whatsoever, and there are plenty of reasons other than 'laziness' for someone to not understand something which hasn't been openly explained.

    More realistically, "The scratchpad idea has been considered by people who haven't been informed by MS how it works..."
     
  12. Ronaldo8

    Newcomer

    Joined:
    May 18, 2020
    Messages:
    233
    Likes Received:
    232
    Ya... I apologise for the turn of phrase.
     
    Shifty Geezer likes this.
  13. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    592
    Likes Received:
    312
    SLC on a normal drive will not improve read speeds, only write speeds. It acts as a combining write buffer: the drive uses a small area in SLC mode to write everything first, and then copies it out to the large pool. The practical read speed from the SLC portion is the same as for the rest of the drive. Didn't we already go over this in some other thread?

    And this way of doing things will not be relevant for console drives, because their normal workloads will not be like desktop workloads. There will be effectively no writes. The sensible way of using a lot of flash in a read-dominant workload does not involve moving data around.
     
    Silent_Buddha and tinokun like this.
  14. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,461
    Likes Received:
    3,127
    Location:
    Wrong thread
    Thanks man. I know my thoughts and data-points were a little vague/PRish!!

    My main thought was split into two parts: there is the access penalty, and there is the loading/processing penalty.

    I don't think MS can hit 10 microseconds, but current QLC is about 100 microseconds for a simple access according to AnandTech. If you could shave 30 microseconds off that, and have a tiny penalty for copying a page from SSD to DRAM, I think that *might* be a game changer in terms of supplementing system DRAM with solid state storage.
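
    My own back-of-envelope on why tens of microseconds would matter (using only the figures quoted above): at ~100µs per access, a 16.7ms frame has room for only a few hundred strictly serial reads, and shaving 30µs grows that budget by over 40%.

    Code:
    #include <stdio.h>

    /* How many strictly serial page fetches fit in one 60 fps frame?
     * 100 us is the quoted QLC access time; 70 us assumes the ~30 us
     * FTL cost has been shaved off. */
    int main(void)
    {
        const double frame_us = 16700.0;
        printf("fetches/frame @ 100 us: %.0f\n", frame_us / 100.0); /* 167 */
        printf("fetches/frame @  70 us: %.0f\n", frame_us / 70.0);  /* 239 */
        return 0;
    }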

    Small, low-latency changes would better serve 99% of gameplay than a complete level change within two seconds.

    I don't think erase is something either MS or Sony are keen to encourage.

    For every extra bit stored per cell there is a penalty with flash. But in the short term, 100GB of special SLC flash would require many tens of seconds to fill when changing between games. And that is not the behaviour we've seen demonstrated so far.
     
    DSoup likes this.
  15. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,461
    Likes Received:
    3,127
    Location:
    Wrong thread
    SLC doesn't improve read bandwidth, you are quite correct, but it can improve read latency. This is one of those tricky things that folks have been doggedly trying to benchmark for years! :)
     
    John Norum likes this.
  16. P_EQUALS_NP

    Joined:
    Jun 17, 2020
    Messages:
    8
    Likes Received:
    1
    I was referring to the NAND blocks and the pages in those blocks. You see, one NAND flash chip is made up of many blocks, and within those blocks are the pages. SSDs can read/program at page granularity, but erasing is done at the larger block granularity.
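
    In code form, with illustrative sizes (real parts vary widely), that asymmetry looks like this:

    Code:
    #include <stdio.h>

    /* Illustrative NAND geometry: reads/programs happen per page,
     * erases only per (much larger) block. Sizes vary by part. */
    #define PAGE_BYTES      (16 * 1024)  /* unit of read/program */
    #define PAGES_PER_BLOCK 256          /* pages per erase block */

    int main(void)
    {
        long block_bytes = (long)PAGE_BYTES * PAGES_PER_BLOCK;
        printf("read/program granularity: %d KiB\n", PAGE_BYTES / 1024);
        printf("erase granularity       : %ld KiB\n", block_bytes / 1024);
        /* Overwriting one 16 KiB page in place would mean erasing a whole
         * 4 MiB block first - which is why drives remap writes instead. */
        return 0;
    }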
     
  17. rntongo

    Newcomer

    Joined:
    May 23, 2020
    Messages:
    11
    Likes Received:
    7
    This is exactly it. But it will be aided by Microsoft's version of the high-bandwidth cache controller, built for the Xbox Series X.
     
  18. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    13,245
    Likes Received:
    8,759
    Location:
    London, UK
    My latency numbers are pure fiction. We know that, next to XSX, the PS5 has around double the base hardware bandwidth - which doesn't account for certain types of data that may compress/expand better or worse on one console than the other - but I wanted to toss in a scenario where XSX had an order of magnitude advantage at latencies low enough that it might make a difference. I also think we're looking at much higher latencies.

    I'm certain both XSX and PS5 will have vastly lower latency than any SSD on PC, but you can't really build anything other than platform-exclusive games relying on that, and Microsoft are committed to bringing their games to Windows, where they may have to run off a 2.5" spinning HDD in a laptop.

    I still have difficulty comprehending a future where loading times are virtually eliminated, having gotten used to startup times sometimes being minutes long: first loading to the game's main menu, then continue/load game save.
     
  19. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,576
    Likes Received:
    16,031
    Location:
    Under my bridge
    I don't. Whether 70ms or 100ms or even 30ms, a direct fetch is too slow to be immediately addressable. You'll still be working with prefetching. The difference will be prefetching two frames in advance or six.
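
    Those frame counts fall straight out of the latency (my arithmetic, assuming 60fps): you have to issue a request at least ceil(latency / frame_time) frames before the data is needed.

    Code:
    #include <math.h>
    #include <stdio.h>

    /* Frames of prefetch needed to hide a given fetch latency at 60 fps. */
    int main(void)
    {
        const double frame_ms = 1000.0 / 60.0;  /* ~16.7 ms */
        const double latency_ms[] = { 30.0, 70.0, 100.0 };

        for (int i = 0; i < 3; i++)
            printf("%5.0f ms -> prefetch %d frames ahead\n", latency_ms[i],
                   (int)ceil(latency_ms[i] / frame_ms));
        /* 30 ms -> 2 frames, 100 ms -> 6 frames: the "two or six" above. */
        return 0;
    }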

    Sure, but you don't need some virtualised RAM to do that.
     
    DSoup likes this.
  20. DSoup

    DSoup meh
    Legend Veteran Subscriber

    Joined:
    Nov 23, 2007
    Messages:
    13,245
    Likes Received:
    8,759
    Location:
    London, UK
    Yup. Assuming Microsoft does have a latency advantage, I think the window of latency, and the relative difference, will be far less pronounced than the difference in raw bandwidth. But if you're talking 2.5 seconds to load to the point where you're ready to play vs 4.8 seconds, it's still a huge advance over what console and PC gamers experience now.
     
    temesgen and function like this.