Next-Generation NVMe SSD and I/O Technology [PC, PS5, XBSX|S]

They both matter. The greater utility of big bandwidth isn't only realized at max throughput.

In other words, low cycle latency doesn't mean much if you have a thin pipe with a lot of data to move. You can have a device where a single request performed in isolation may be outstanding, but in a sea of data requests latency may be poor, because the device lacks the bandwidth to feed its computational needs.
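As a toy illustration of that point (every number below is an assumption for illustration, not a measurement of any real drive): once requests are queued behind a limited pipe, time to completion is dominated by the backlog divided by bandwidth, not by the nominal access latency.

Code:
// Toy model: time to complete one request when N requests are queued ahead of it.
// Assumed, illustrative numbers only -- not measurements of any real drive.
#include <cstdio>

int main() {
    const double access_latency_s = 50e-6;  // 50 us nominal access latency (assumed)
    const double bandwidth_bytes  = 7e9;    // 7 GB/s sustained bandwidth (assumed)
    const double request_bytes    = 2e6;    // 2 MB per request (assumed)

    for (int queued = 0; queued <= 64; queued += 16) {
        // The request waits for everything ahead of it to drain through the pipe.
        double wait_s = access_latency_s + (queued + 1) * request_bytes / bandwidth_bytes;
        std::printf("%2d requests queued ahead -> %.2f ms to completion\n", queued, wait_s * 1e3);
    }
    return 0;
}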
Yes, but 40 GB/s??

I need to find the article or blog post again, but I think on the DirectStorage side they did some tests and concluded PCIe 4.0 is not enough; they need a PCIe 5.0 SSD, and that will be enough for the foreseeable future.
For what engines? Is that a perfect streaming engine, or the modern sort? For comparison, Trials had perfect streaming off HDD; every single element of the game could be put into one level.

But this is going too OT into a software architecture debate. Presently, the software state is what it is and isn't changing in a big way any time soon. SSDs and compression are going to get faster and faster. Lots of big numbers to post here and ogle over and get excited about when they get bigger and bigger. ;)
 

Did I talk about 40 GB/s? That number comes from a datacenter benchmark; the only interest is that it lets us know the cost of decompression on the GPU. If PCIe 5.0 is enough, it means a maximum of 28 GB/s.

And I talk a lot about a Doctor Strange game because there is a rumor that the character is in Insomniac's Spider-Man 2 as an NPC.

 
But the less memory you have, the more bandwidth you need, and games are more than geometry and textures. For example, I think next generation there may not be more RAM inside the consoles.

Yup. Everybody has good points because they're probably thinking about use-case scenarios. Low latency is key to loading data just-in-time; high bandwidth is key to keeping load times down, and it's great that the 1-2 minute load times of last generation are largely a thing of the past. Similarly with Quick Resume on Xbox, high bandwidth is key to swiftly switching between games. Neither of these I/O attributes negates the need for a good foundation of RAM to begin with.
 
RAM is still needed if you want decent latency.

Did I say you don't need RAM? I said I think it is possible we won't have more than 16 GB of RAM in next-generation consoles, or not much more, unless we go with a non-256-bit memory bus. If we go with a 256-bit bus, I don't believe we will have 32 GB in consoles. That means needing to load more data per frame.
 
Did I talk about 40 GB/s? That number comes from a datacenter benchmark; the only interest is that it lets us know the cost of decompression on the GPU. If PCIe 5.0 is enough, it means a maximum of 28 GB/s.

And I talk a lot about a Doctor Strange game because there is a rumor that the character is in Insomniac's Spider-Man 2 as an NPC.

I wasn't disagreeing with your particular point. However, the actual requirement for a 'perfect' streamed-data game is the data needed for what's in view. The calculations on that reveal it's actually sparse, e.g. your multi-dimensional multi-portal example. Current methods need to load whole levels for the renderer. Fully streamed resources would only load enough LOD to draw each portal, no more than to render the normal view.

This conversation hasn't really got a future until someone can do the math and present the theoretical 'maximum bandwidth requirements' (plus latencies etc.) for rendering photorealistic visuals at 4K120, or even 8K120 if we want to future-proof. Without that, I'm going by a rough extrapolation of Sebbbi's calculations on this point years ago, which were quite frankly mind-blowing when you realise how much 'waste' there is. So much of the data we consume is sitting unused, waiting for its brief moment of appearance before being archived in system RAM. But that waste might be necessary for various reasons and the perfect streaming engine might be impossible. ¯\_(ツ)_/¯ Again, that's a topic in its own right. ;)
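For a crude sense of scale, here is the kind of back-of-envelope I mean, with every per-pixel constant being an assumption rather than a measured figure:

Code:
// Crude upper-bound sketch of streaming bandwidth for "only what's on screen".
// Every constant here is an assumption for illustration, not a measured figure.
#include <cstdio>

int main() {
    const double pixels         = 3840.0 * 2160.0; // 4K frame
    const double texel_bytes    = 1.0;             // ~1 byte/texel with BC compression (assumed)
    const double texture_layers = 3.0;             // albedo + normal + material (assumed)
    const double geometry_bytes = 4.0;             // unique geometry bytes per pixel (assumed)
    const double fps            = 120.0;
    const double new_per_frame  = 0.05;            // 5% of the view newly revealed each frame (assumed)

    double resident_mb = pixels * (texel_bytes * texture_layers + geometry_bytes) / 1e6;
    double stream_mbps = resident_mb * new_per_frame * fps;

    std::printf("Unique visible data per frame : ~%.0f MB\n", resident_mb);
    std::printf("Streaming rate at %g fps : ~%.0f MB/s\n", fps, stream_mbps);
    return 0;
}

Under those assumed numbers it lands in the hundreds of MB/s, far below the multi-GB/s headline figures, which is the sort of gap that makes the 'waste' stand out.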
 

There is a little bit more than portals in Dr Strange. I suppose that to bend the scenery, if it is too heavy for real-time, they can use a baked Alembic animation and load it when needed. I mention this because I heard the demo Sony did for devs about the SSD was destruction changing depending on the viewport, probably using baked animation because it was too complex for real-time.




And Alembic animation data is big.

Page 12 of the Gears 5 presentation, where they use it for scenery destruction when the player shoots:


Houdini Tools – Alembic Cache Destruction
• Custom Alembic Cache Destruction workflows.
• Compressed transforms that we skin per-frame.
• 8 bytes per key frame per object
• Implemented for Gears 4 before Epic had fully implemented in UE4.
• Basically at parity with UE4 4.23’s implementation.
• These can be incredibly expensive for memory usage so use sparingly.


It is common to use Alembic animation in games when something is too heavy for real-time physics. Same with HFW, where they baked the sea water simulation offline.
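To get a feel for why the slide says these caches are "incredibly expensive for memory usage", a quick back-of-envelope using its 8 bytes per key frame per object figure; the chunk count, key rate and clip length below are assumptions for illustration:

Code:
// Rough Alembic-cache cost using the slide's 8 bytes/keyframe/object figure.
// Chunk count, key rate and clip length are assumed for illustration.
#include <cstdio>

int main() {
    const double bytes_per_key = 8.0;     // from the Gears 5 slide
    const double chunks        = 5000.0;  // debris pieces in one destruction event (assumed)
    const double keys_per_sec  = 30.0;    // keyframe rate (assumed)
    const double clip_seconds  = 10.0;    // length of the baked destruction (assumed)

    double megabytes = bytes_per_key * chunks * keys_per_sec * clip_seconds / 1e6;
    std::printf("One baked destruction event: ~%.1f MB of transform keys\n", megabytes);
    return 0;
}

Under those assumed numbers a single event is around 12 MB, so a handful per level quickly adds up, which is presumably why the slide says to use them sparingly.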
 
Did I say you don't need RAM? I said I think it is possible we won't have more than 16 GB of RAM in next-generation consoles, or not much more, unless we go with a non-256-bit memory bus. If we go with a 256-bit bus, I don't believe we will have 32 GB in consoles. That means needing to load more data per frame.

Bad wording on my part, sorry. What I meant was, I believe we'll have more RAM at some point, despite the SSD improvements and all. My "thinking" is that not every game engine and/or dev will be able, or willing, to make a good streaming engine, I guess? At some point, some devs will just need more than 16 GB of data with low latency. And I would guess there is some stuff you just can't stream into RAM but have to compute, like shadows and such?
 

But at least on PS5 you don't need to build one yourself to have a good streaming engine. For example, ND had a very good streaming engine and they scrapped it for Part 1 and the Uncharted collection*, because you just issue an API call asking for multiple files (which come back uncompressed), maybe set a priority level for each file, do your calls in big chunks, and it works. Not one fgetc() at a time, because that is inefficient ;) The PS5 API is invisible: outside of packaging the data, you don't care about the compression, managing memory, or managing coherency between the buffers for streaming data and the resident data in memory. Maybe it is a bit more difficult to implement and less transparent on the DirectStorage side, but probably much easier than when everything was constrained by the HDD.


*Kurt Margenau said this in an interview.
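To illustrate what that "ask for multiple files in big chunks, with a priority per file" pattern looks like from the engine side, here is a hypothetical sketch; the io namespace, read_file_async, Request and Priority below are invented placeholder names, not the real PS5 or DirectStorage API.

Code:
// Hypothetical streaming-request API to illustrate the "big chunks + priority"
// pattern described above. io::read_file_async, io::Request and io::Priority are
// invented names, not the real PS5 or DirectStorage interfaces.
#include <cstddef>
#include <string>
#include <vector>

namespace io {
    enum class Priority { Low, Normal, High };
    struct Request {
        std::string path;      // compressed file inside the game package
        void*       dest;      // where the decompressed data should land
        std::size_t dest_size; // size of the destination buffer
        Priority    priority;
    };
    // Placeholder: a real implementation would hand the whole batch to the
    // OS / hardware decompression path and complete it asynchronously.
    void read_file_async(const std::vector<Request>& /*batch*/) {}
}

// One big batched call per area instead of many tiny reads.
void stream_in_next_area(void* texture_pool, std::size_t texture_bytes,
                         void* mesh_pool,    std::size_t mesh_bytes) {
    io::read_file_async({
        { "area07/textures.pak", texture_pool, texture_bytes, io::Priority::High   },
        { "area07/meshes.pak",   mesh_pool,    mesh_bytes,    io::Priority::Normal },
    });
}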

This bug is more hilarious: tying the loading speed to the monitor refresh rate.

EDIT: If we have more memory in next-generation consoles, maybe 20 or 24 GB would be more realistic.
 

An interview with ND; it is in French, but I translated it:

Kurt Margenau said:
On PS4 we had a sophisticated streaming management solution to load different-quality assets depending on what the player was seeing. This was a very technical program for memory management with crazy algorithms. On PS5 we don't use it; we just load the best texture quality.

EDIT:

Do you foresee any changes to future storage standards beyond increases in raw bandwidth? For example, there must have been an understanding amongst those in the industry that SATA wasn't sufficient for new-fangled SSDs, spurring the development of NVMe as a standard which did more than just increase the available bandwidth. Is anything similar on the (distant) horizon, where we'll need something more than just a doubling of PCIe bandwidth each generation, or will that be sufficient for the foreseeable future?

Currently our focus is on the transition from SATA to NVMe, and more specifically NVMe Gen 4, with Gen 5 then on the horizon; the performance gains achieved will most likely be sufficient for the foreseeable future.
 
This bug is more hilarious: tying the loading speed to the monitor refresh rate.
Bethesda's Creation engine does something similar, limiting loading speed according to the software framerate - effectively the same result.

One of the most popular Nexus mods for Fallout 4 on PC patches the loading screen to an uncapped framerate, so when loading from an SSD most loads are near-instant compared to 10-40 seconds, depending on your PC. Some mods, like Stories of the Commonwealth, have a devastating effect on some load times in Fallout 4 for the bespoke areas that the mod adds.

This has been a 'feature' of the engine since Fallout 3, and probably before, and I am not optimistic it will be changed for Starfield, because Bethesda's physics are also tied to framerate. But theoretically, if Starfield offers a 120Hz mode, the game will load faster for 120Hz screen owners than for people using a 60Hz screen. :runaway: It's obviously completely logical to buy a better display to speed up load times!
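A toy model of why tying loading to the render loop caps throughput regardless of drive speed; the chunk size and area size below are assumptions for illustration.

Code:
// Toy model of a loader that services one fixed-size chunk per rendered frame.
// Chunk size and total size are assumptions for illustration.
#include <cstdio>

int main() {
    const double chunk_mb = 4.0;    // data consumed per frame tick (assumed)
    const double total_mb = 2000.0; // size of the area being loaded (assumed)
    const double frames   = total_mb / chunk_mb;

    for (double hz : {60.0, 120.0}) {
        // Load time is frames / refresh rate, no matter how fast the SSD is.
        std::printf("%3.0f Hz screen -> %.1f s load\n", hz, frames / hz);
    }
    return 0;
}

Once the loader only services a fixed amount per presented frame, load time scales with the refresh rate rather than with the SSD.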
 
But at least on PS5 you don't need to build one yourself to have a good streaming engine. For example, ND had a very good streaming engine
That's not what I mean by 'streaming engine'. I mean all virtual textures and virtual geometry, highly optimised data where you only need as much as what you are drawing. A classical engine will have a scene. If there's a house in the scene where you can only see a corner of its roof, the entire house needs to be loaded to draw that little piece. All house geometry and textures. Current 'asset streaming' preloads large caches of complete objects but that's far from the ideal case.

In a true virtualised engine, only the parts of the house actually being rendered need to be in memory. For 4K, you need about 24 MB of texture data. Nanite is a great step, providing 'unlimited detail' at existing transfer speeds. I can't find the post by Sebbbi on the actual requirements to texture every pixel using virtual texturing, but it was minuscule. Maybe someone else can recall it? @BRiT? In short, most of your storage and RAM usage is wasted, occupied by data waiting to be used. You can only render so much geometry given the pixel count, and only so many texture samples. You can only change a small amount of data from frame to frame unless doing a complete scene swap. So there are clear, calculable upper limits on the actual data that can be consumed. Everything else is just cache, buffering the data close enough for when it's used. That was necessary in the days of floppy drives and HDDs, but it's completely different in an age of super-fast SSDs. There would be value in re-evaluating the whole rendering system, but that's now limited by business decisions.
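For reference, a rough version of that texture figure, with the per-texel assumptions made explicit: 3840 × 2160 ≈ 8.3 million pixels, and if each visible pixel needs roughly one unique texel per material layer at about 1 byte per BC-compressed texel (assumed) across three layers (assumed), that's ≈ 25 MB, in the same ballpark as the ~24 MB mentioned above.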
 
My point is more that the pursuit of higher and higher BW via hardware and compression probably isn't necessary if the software were developing differently. The high BW of SSDs, the primary debate at the moment (40 GB/s of compressed data etc.), isn't their main benefit; the low latency is.

Yeah, I also think so. I think they went bonkers with max throughput (especially Sony) to guarantee the death of loading times for good (which still exist in many lazily-developed cross-gen titles).

A high max BW from storage is still good to have in extreme scenarios, such as sudden spawns of large assets or level swaps. But it does not really need to be sustained. I think as long as the SSD can reach high BW in certain burst moments, and then has a short "cool-down" time, most software would barely notice.
 
Frostbite Engine @ https://forum.beyond3d.com/threads/...a-20-rumors-and-discussion.59649/post-1974637

Frostbite needs 472 MB for 4K (all render targets + all temporary resources) in DX12. Page 58:
http://www.frostbite.com/2017/03/framegraph-extensible-rendering-architecture-in-frostbite/

Assets (textures, meshes, etc) are of course loaded on top of this, but this kind of data can be easily paged in/out based on demand. I'd say 4 GB is enough for 4K (possibly even 2 GB), but too early to say how well Vega's memory paging system works. Let's talk more when Vega has launched. I have only worked with custom software paging solutions that are specially engineered for a single engine's point of view. Obviously a fully generic automatic solution isn't going to be as efficient.

Game's data sets tend to change slowly (smooth animation). You only need to load new pages from DDR4 every frame. 95%+ of data in GPU memory stays the same.

Some additional nice tidbits in there about engine memory management with a link to a GDC presentation @ https://forum.beyond3d.com/threads/will-gpus-with-4gb-vram-age-poorly.58233/post-1973249

Current engines haven't been primarily designed for 4K in mind. Render targets and other temporary buffers have so far been quite small, but 4K makes them 4x larger compared to 1080p. Developers are improving their memory managers and rendering pipelines to utilize memory better.

Brand new Frostbite GDC presentation is perfect example of this:
http://www.frostbite.com/2017/03/framegraph-extensible-rendering-architecture-in-frostbite/

Pages 57 and 58 describe their PC 4K GPU memory utilization. Old system used 1042 MB with 4K. New system uses only 472 MB. This is a modern engine with lots of post processing passes. Assets (textures and meshes) are obviously additional memory cost on top of this, and this is where a good fine grained texture streaming technology helps a lot (whether it is a fully custom solution or automatic GPU caching/paging solution).

A Sebbbi post about wasting memory for textures because folks use Uncompressed textures @ https://forum.beyond3d.com/threads/is-4gb-enough-for-a-high-end-gpu-in-2015.56964/post-1852564

In both cases the developers are needlessly wasting GPU memory.

If you are afraid of texture popping from HDD streaming, you can load your assets to main RAM in a lossless compressed format (LZMA or such in addition to DXT). Uncompress the data when a texture region becomes visible and stream to GPU memory using tiled resources.

Uncompressed textures (no DXT compression in addition to ZIP/LZMA) are just a stupid idea in huge majority of the use cases. You just waste a lot of bandwidth (performance) and memory for no visible gain. With normalization/renormalization the quality is very good for material properties and albedo/diffuse and BC5 actually beats uncompressed R8B8 in quality for normal maps (the Crytek paper about this method is a good read). BC7 format in DX11 gives you extra quality compared to BC3 with no extra runtime cost.

Most games are not using BC7 yet on PC, because the developer needs to also support DX10 GPUs. Duplicate assets would double the download size. Tiled resources need DX11.2 and DX11.2 unfortunately needs Windows 8. This is not yet a broad enough audience. These problems will fix themselves in a few years. In addition, DX12 adds async copy queues and async compute allowing faster streaming with less latency (much reduced texture popping).

Hopefully these new features will stop the brute force memory wasting seen in some PC games. Everything we have seen so far could have been easily implemented using less than 2GB of video memory (even at 4K), if the memory usage was tightly optimized.
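Roughly the pattern Sebbbi describes above: keep BC-compressed textures LZ-packed in system RAM, decompress a tile only when it becomes visible, then map it into the GPU texture via tiled resources. A minimal sketch under those assumptions; decompress_lz, upload_tile_to_gpu and the visibility feedback list are invented placeholders, not real library calls.

Code:
// Sketch of "decompress on visibility, upload via tiled resources".
// decompress_lz and upload_tile_to_gpu are stand-ins for an LZ codec and the
// graphics API's tile-update path; they are not a real library interface.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct TileId { uint32_t texture, mip, x, y; };
inline bool operator==(const TileId& a, const TileId& b) {
    return a.texture == b.texture && a.mip == b.mip && a.x == b.x && a.y == b.y;
}
struct TileIdHash {
    std::size_t operator()(const TileId& t) const {
        return (std::size_t(t.texture) * 73856093u) ^ (std::size_t(t.mip) * 19349663u)
             ^ (std::size_t(t.x) * 83492791u) ^ t.y;
    }
};

std::vector<uint8_t> decompress_lz(const std::vector<uint8_t>& lz) { return lz; } // placeholder: identity stub
void upload_tile_to_gpu(const TileId&, const std::vector<uint8_t>&) {}            // placeholder: no-op stub

class TileStreamer {
public:
    // Called with the tiles the engine's visibility feedback says are needed this frame.
    void on_tiles_visible(const std::vector<TileId>& visible) {
        for (const TileId& id : visible) {
            if (resident_[id]) continue;                        // already mapped on the GPU
            auto it = lz_tiles_.find(id);
            if (it == lz_tiles_.end()) continue;                // not packaged; a higher mip is used instead
            upload_tile_to_gpu(id, decompress_lz(it->second));  // decompress only what is actually seen
            resident_[id] = true;
        }
    }
private:
    std::unordered_map<TileId, std::vector<uint8_t>, TileIdHash> lz_tiles_;  // BC data, LZ-compressed, in system RAM
    std::unordered_map<TileId, bool, TileIdHash>                 resident_;  // tile currently on the GPU?
};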
 
This is one thing that I was thinking about.
Any engine and hardware tries to just load and render what's visible onscreen, but with RT, what happens if I'm in front of an open door and near the door there's an illuminated object that casts reflections in front of the door?
I think that the engine must load the whole object, the textures, the room it is in and all the rest, just in case, to correctly bounce the light.
And what if, instead of an object, it is an entire city?
 
Yeah, objects and even materials on those objects can really influence what appears in the rendered view, so it's not just a matter of streaming in only what we can see. This is especially true with raytracing now more common. You can't really decide what light bounces off without that geometry being present in memory.
 

Insomniac claimed that for Ratchet & Clank on PS5 they unload everything from memory that's not on screen, and they found a solution to still get off-screen objects to appear in the RT reflections.
 
Clouds, the club in Cyberpunk 2077 (not the one in the sky), may be a clue?

The RT reflections on some of the glass/mirror/window surfaces there did not properly reflect reality.
 
That's a bespoke solution; I was thinking of the general problem for when RT will be used across the whole engine.
 
That's not what I mean by 'streaming engine'. I mean all virtual textures and virtual geometry, highly optimised data where you only need as much as what you are drawing.

I know what virtual texturing and virtual geometry (Nanite) are. One day Unreal Engine will use only Nanite for all geometry, plus virtual texturing. Since my first post I have said there is more to load than geometry and textures. First, it is cool to use virtual geometry, but if the game uses raytracing you need a BVH and some off-screen geometry. The game probably streams the BVH for static geometry; for the moment, RT uses proxy geometry and not Nanite triangle data.

If the game uses the type of rendering where they mix SDF tracing and triangle-based raytracing, you can stream two data structures for static geometry. In the paper, the SDF is generated at runtime.


After that, like I said, animation, sound, Alembic cache animation/destruction, or any other baked stuff can be streamed too. And if I remember the 2019 GDC Spider-Man postmortem well, animations take up tons of space on the disc.
 
Has some useful information.
In terms of the current discussion, from about 10:50.
Or a bit earlier, at 9:43, where it starts comparing why standard mip streaming may be a better fit.
 