There's a little more nuance to what MS built into XSX. We're waiting for more details, but it's a minor addition to a larger piece of the problem. Yes, the decompressor is a hardware block, as said before. It's the lack of the other IO hardware that takes you from 100% of the SSD's rated speed down to the ~20% real-world result; see Cerny's slide. No software can claw back that 80% without eating a lot of CPU resources.
One might look at 360 emulation as being entirely software, but there is a small custom hardware component that emulates one very specific part which would have been too difficult to do in software.
In the same way, XVA is mainly software, but there is a small hardware component governing how the GPU accesses SSD storage that isn't done in software. Unfortunately we know little about it, which makes it hard to discuss, and easy either to dismiss or to hold up as critical.
"We observed that typically, only a small percentage of memory loaded by games was ever accessed," reveals Goossen. "This wastage comes principally from the textures. Textures are universally the biggest consumers of memory for games. However, only a fraction of the memory for each texture is typically accessed by the GPU during the scene. For example, the largest mip of a 4K texture is eight megabytes and often more, but typically only a small portion of that mip is visible in the scene and so only that small portion really needs to be read by the GPU."
As textures have ballooned in size to match 4K displays, efficiency in memory utilisation has got progressively worse - something Microsoft was able to confirm by building in special monitoring hardware into Xbox One X's Scorpio Engine SoC. "From this, we found a game typically accessed at best only one-half to one-third of their allocated pages over long windows of time," says Goossen. "So if a game never had to load pages that are ultimately never actually used, that means a 2-3x multiplier on the effective amount of physical memory, and a 2-3x multiplier on our effective IO performance."
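Goossen's "eight megabytes" for the top mip lines up with a 4096x4096 texture in a 4-bits-per-texel block-compressed format such as BC1. A quick back-of-envelope sketch (my numbers, not from the article):

```python
def mip_chain_bytes(width, height, bytes_per_texel):
    """Byte size of each level in a full mip chain, finest level first.
    (Real GPU block formats round tiny mips up to 4x4 blocks; ignored here.)"""
    sizes = []
    w, h = width, height
    while True:
        sizes.append(int(w * h * bytes_per_texel))
        if w == 1 and h == 1:
            break
        w, h = max(w // 2, 1), max(h // 2, 1)
    return sizes

# BC1 block compression stores 4 bits = 0.5 bytes per texel.
sizes = mip_chain_bytes(4096, 4096, 0.5)
print(sizes[0] / 2**20)     # top mip alone: 8.0 MiB
print(sum(sizes) / 2**20)   # full chain: ~10.67 MiB
```

The top mip dominates the whole chain, which is exactly why loading it wholesale when only a sliver is visible wastes so much residency.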
A technique called Sampler Feedback Streaming - SFS - was built to more closely marry the memory demands of the GPU, intelligently loading in the texture mip data that's actually required with the guarantee of a lower quality mip available if the higher quality version isn't readily available, stopping GPU stalls and frame-time spikes. Bespoke hardware within the GPU is available to smooth the transition between mips, on the off-chance that the higher quality texture arrives a frame or two later.
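A conceptual sketch of the residency-fallback behaviour described above (my illustration of the idea, not Microsoft's implementation; mip 0 is the finest level):

```python
def sample(residency, requested_mip, coarsest_mip, feedback):
    """Return the finest resident mip at or below the requested detail,
    recording a miss so the streamer can load the finer mip later."""
    if requested_mip not in residency:
        feedback.add(requested_mip)    # sampler feedback: queue mip for streaming
    for mip in range(requested_mip, coarsest_mip + 1):
        if mip in residency:           # fall back to a coarser resident mip
            return mip
    return coarsest_mip                # coarsest mip assumed always resident

resident, wanted = {3, 4, 5}, set()
print(sample(resident, 0, 5, wanted))  # falls back to mip 3, no GPU stall
print(wanted)                          # {0}: stream in mip 0 for later frames
```

The guarantee in the article maps to the loop: the shader never stalls waiting for the fine mip, it just samples coarser data until the stream catches up.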
If you recall from earlier, MS used XBO code to test 4K on Scorpio. Scorpio gave MS the data they needed to optimize for 4K, because so many titles were targeting it, and they could see what was happening in game code with respect to memory usage. Then they focused on building this. So now they have a software solution, plus bespoke hardware to improve it. I'm not going to tell you this is the superior solution or anything. I'm just saying that Sony and MS had different data to use on the same problem, and each approached it differently based on their requirements.
So what I think I see here is that MS opted for slower SSD hardware in the hope that they could develop a solution in which the streamed asset footprint is significantly smaller. Sony may have looked at more traditional streaming methods and worked out that they needed a significantly faster drive to do the same thing.
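To put rough numbers on that trade-off: the raw throughput figures below are the public uncompressed rates for each console, while the multiplier is Goossen's own 2-3x claim from above, so treat this as purely illustrative arithmetic, not a benchmark:

```python
xsx_raw, ps5_raw = 2.4, 5.5   # GB/s uncompressed, publicly stated figures
# If SFS really delivers the claimed 2-3x effective-IO multiplier:
xsx_effective = [round(xsx_raw * m, 1) for m in (2, 3)]
print(xsx_effective)          # [4.8, 7.2] GB/s: in PS5-raw territory
```

Which would explain why MS might feel a 2.4 GB/s drive was enough, if the software/hardware stack actually delivers that multiplier in practice.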
It might be easy to say that Sony can just leverage these learnings and integrate them into PS5, but unfortunately it's not that straightforward to just pick up and go. So we don't know if Sony will support these features; they may have architected their API/chip differently here. And it happens: you can sometimes program your shit in such a way that the only way to get a new feature in is to rewrite everything, and sometimes that's not worth it.
I would not be surprised if the bespoke hardware operates in a similar manner to how your phone responds to 'Hey Google' or 'Hey Siri': a short 3-5 word voice command can be processed by a tiny NN locally, but anything much longer has to be sent to the cloud for processing.
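As a toy illustration of that split (the threshold and labels are made up for the analogy, not any real assistant's logic):

```python
def route_command(utterance, local_word_limit=5):
    """Short commands fit the tiny on-device model; longer ones go to the cloud."""
    return "local" if len(utterance.split()) <= local_word_limit else "cloud"

print(route_command("hey google lights off"))                              # local
print(route_command("hey google remind me to call mum tomorrow morning"))  # cloud
```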
"We knew that many inference algorithms need only 8-bit and 4-bit integer precision for weights and the math operations involving those weights comprise the bulk of the performance overhead for those algorithms," says Andrew Goossen. "So we added special hardware support for this specific scenario. The result is that Series X offers 49 TOPS for 8-bit integer operations and 97 TOPS for 4-bit integer operations. Note that the weights are integers, so those are TOPS and not TFLOPs. The net result is that Series X offers unparalleled intelligence for machine learning. (on a console, cough - iroboto's add)"
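Those TOPS figures are consistent with packed math running on the same shader ALUs, assuming a 4x rate for int8 and 8x for int4 relative to FP32 (my arithmetic and my rate assumption, not from the quote):

```python
fp32_tflops = 12.15            # Series X GPU compute, public figure
int8_tops = fp32_tflops * 4    # assume 4 packed int8 ops per FP32 lane
int4_tops = fp32_tflops * 8    # assume 8 packed int4 ops per FP32 lane
print(round(int8_tops), round(int4_tops))   # 49 97: matches the quoted numbers
```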
A small, quick-running NN for texture up-resolution could cover the scenario where the drive isn't fast enough.