Blazing Fast NVMEs and Direct Storage API for PCs *spawn*

Discussion in 'PC Hardware, Software and Displays' started by DavidGraham, May 18, 2020.

  1. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,123
    Likes Received:
    3,093
The GPU is probably the more advanced and much faster path. The fixed-function hardware is just that: fixed. NV (and AMD) can increase speed by allocating more resources, for example.

I think some have shot down NV's decompression tech way too early, without knowing how it even works.
     
  2. Remij

    Regular

    Joined:
    May 3, 2008
    Messages:
    684
    Likes Received:
    1,268
    Never bet against Nvidia.
     
    PSman1700 likes this.
  3. Vega86

    Newcomer

    Joined:
    Sep 25, 2018
    Messages:
    191
    Likes Received:
    131
    Have some new questions.

Next-gen consoles have about 13.5 GB available for games. I assume all data can be squeezed in there using their new, respective SSD compression systems.

    How is this gonna work on PC?

In Nvidia's graph, everything goes through the GPU. Does this mean Ampere cards with less than 13.5 GB of VRAM will have issues?

Does using DLSS reduce VRAM usage on PC, since rendering is really done at a lower resolution?

Or are PC developers going to have to split their data, so only some of it goes through Ampere's compression system and the rest through the typical PC I/O path?
     
  4. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,517
    Likes Received:
    24,424
Only what has to go to the GPU would go through RTX IO. Program code will always go through the CPU to main memory and be executed there. The PC can also use main memory as an even faster caching subsystem than the NVMe.
     
    PSman1700 and Vega86 like this.
  5. Vega86

    Newcomer

    Joined:
    Sep 25, 2018
    Messages:
    191
    Likes Received:
    131
Does that mean only consoles can do top-to-bottom significant compression, for both CPU and GPU data, while remaining performant?

While on PC, CPU-side data isn't meant to be significantly compressed, and only GPU-side data can be significantly compressed while staying performant?

    Would this lead to less optimized PC ports, worse than current gen?
     
  6. LordVulkan

    Newcomer

    Joined:
    Mar 31, 2015
    Messages:
    11
    Likes Received:
    25
I don't know why you would want to compress any CPU data on systems with at least 16GB of RAM; you would just want to fill as much RAM as possible at the game's start.

And there is no reason to think that NVIDIA doesn't have some big, performant lossless decompressor (either HW or using compute shaders) when they are advertising a decompressed throughput of 14 GB/s.
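The arithmetic behind such an advertised figure is simple: the decompressor multiplies the drive's raw read bandwidth by the compression ratio. A trivial sketch (the ~7 GB/s raw figure is an assumption here, a typical PCIe 4.0 NVMe number, not something stated in this thread):

```python
def effective_throughput(raw_gb_per_s: float, compression_ratio: float) -> float:
    """Decompressed (effective) throughput: every compressed byte read
    from the drive expands by the compression ratio after decompression."""
    return raw_gb_per_s * compression_ratio

# ~7 GB/s of compressed reads at an average 2:1 ratio yields ~14 GB/s
# of usable, decompressed data delivered to VRAM.
print(effective_throughput(7.0, 2.0))  # 14.0
```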
     
    Vega86 and PSman1700 like this.
  7. LordVulkan

    Newcomer

    Joined:
    Mar 31, 2015
    Messages:
    11
    Likes Received:
    25
Can't I edit messages in this forum? The first quote was unintentional, and by "16GB de RAM" I meant "16GB RAM".

    Sorry for the mistake.
     
  8. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,517
    Likes Received:
    24,424
No problem. It's the forum's anti-spam measures. After X number of posts or Y number of days, users should be able to edit posts within Z minutes of posting. I don't recall the specifics.
     
  9. Infinisearch

    Veteran

    Joined:
    Jul 22, 2004
    Messages:
    779
    Likes Received:
    146
    Location:
    USA
Yeah, I know what you mean... why would fixed-function vs. programmable have an effect on the nature of memory management for the decompression algorithm? Malloc-type functionality would still be the same. So if a fixed-function solution exists, then as long as the programmable one is similar with regard to memory allocation, what's the problem? As far as caches go, I've never heard of a special cache (something like unaligned access within a single line) for FF hardware.
     
    BRiT likes this.
  10. Vega86

    Newcomer

    Joined:
    Sep 25, 2018
    Messages:
    191
    Likes Received:
    131
    Would you happen to know what kinds of data are for RAM vs VRAM in terms of video games? I'm assuming textures go to VRAM but what else?
     
  11. Infinisearch

    Veteran

    Joined:
    Jul 22, 2004
    Messages:
    779
    Likes Received:
    146
    Location:
    USA
Consoles or PC? Before DX12/Vulkan, memory management for the GPU was handled by the drivers. I've heard they would sometimes put index buffer data into system RAM while vertex buffers went into VRAM. Of course, things like render targets go into VRAM. Beyond that, unless you use GPU compute for something, pretty much everything else goes into system RAM. BTW, just so you know, in the case of a UMA like the PS4's, VRAM and system RAM are the same/similar (potentially partitioned) but are potentially in different, though sometimes 'related', address spaces.
edit - different virtual memory translation units.
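Those driver-era placement heuristics can be summarised as a lookup table. This is purely illustrative, assuming the split described above; actual driver policy varied by vendor and workload:

```python
# Hypothetical placement table for a pre-DX12 discrete-GPU PC.
# On a UMA console these distinctions largely collapse into one pool.
PLACEMENT = {
    "render_target": "VRAM",  # written by the GPU every frame
    "texture":       "VRAM",  # sampled by the GPU
    "vertex_buffer": "VRAM",  # read per-draw by the GPU
    "index_buffer":  "RAM",   # drivers sometimes left these in system RAM
    "game_code":     "RAM",   # executed by the CPU
    "sim_state":     "RAM",   # AI/physics and other CPU-side data
}

def placement(resource_type: str) -> str:
    """Default to system RAM for anything the GPU doesn't touch."""
    return PLACEMENT.get(resource_type, "RAM")
```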
     
    #311 Infinisearch, Sep 3, 2020
    Last edited: Sep 3, 2020
    Vega86 and RagnarokFF like this.
  12. Dictator

    Regular

    Joined:
    Feb 11, 2011
    Messages:
    683
    Likes Received:
    3,975
    Source: Reddit Megathread on RTX 3000 launch (I would post it here, but the forum automatically interprets it as media and makes it fill the page)


It looks like it will run with any NVMe type; it's just a matter of finding out what the motherboard requirements are. Also good to see that the average/"typical" compression ratio is 2x.
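A quick way to see why 2x can only ever be a "typical" figure is to compress different kinds of data with an ordinary lossless codec. Here zlib is just a stand-in; the actual format NVIDIA uses is not public:

```python
import random
import zlib

random.seed(0)

# Highly redundant data (think tiled/flat asset regions) vs. data that is
# already near maximum entropy (e.g. previously compressed textures).
repetitive = b"grass_tile_" * 8192
noisy = bytes(random.getrandbits(8) for _ in range(len(repetitive)))

for name, data in [("repetitive", repetitive), ("noisy", noisy)]:
    ratio = len(data) / len(zlib.compress(data))
    print(f"{name}: {ratio:.2f}x")
```

The repetitive input compresses far beyond 2:1 while the noisy input barely compresses at all, so the advertised 2x is an average over realistic game data, not a guarantee.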
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Those techniques are crucial, agreed, and I would expect every dev to have tackled this problem in some way, up to the limits of the APIs they're working with.

    So far I've not seen a description of how NVidia's decompression system works. There are two basic models we can talk about:
    1. a "block" of data is loaded from storage into VRAM, then the decompressor reads the block and writes to a new block somewhere else in VRAM
    2. a "block" of data is streamed from storage through the decompressor and ends up as a block somewhere in VRAM
    2 appears to be the preferable model. In terms of VRAM fragmentation it would appear to be better. As I understand it, PS5 is using this latter model.
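The two models can be sketched with zlib standing in for the GPU decompressor and plain Python buffers standing in for VRAM allocations (purely illustrative; the real pipeline is not public):

```python
import zlib

payload = b"texture block " * 4096
compressed = zlib.compress(payload)

def model_1(compressed_block: bytes) -> bytes:
    """Model 1: the compressed block first becomes resident in 'VRAM',
    then the decompressor reads it and writes a second allocation."""
    staging = bytes(compressed_block)   # compressed copy occupies VRAM
    return zlib.decompress(staging)     # output needs a second allocation

def model_2(compressed_block: bytes, chunk_size: int = 4096) -> bytes:
    """Model 2: compressed data streams through the decompressor in small
    chunks; only the decompressed result ever occupies a VRAM allocation."""
    d = zlib.decompressobj()
    out = bytearray()
    for i in range(0, len(compressed_block), chunk_size):
        out += d.decompress(compressed_block[i:i + chunk_size])
    out += d.flush()
    return bytes(out)

assert model_1(compressed) == model_2(compressed) == payload
```

Both produce identical output, but model 1 transiently holds the compressed block and the output in VRAM at once, which is where the fragmentation concern comes from.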

    1 combined with "too large" VRAM, e.g. 24GB :) would probably suffer rarely if ever with fragmentation (if the dev is paying attention).

    Regardless of model, the problem with textures is that it's tricky to predict how much VRAM is required at a given instant. The PS5 "textures load as the camera moves" model makes texture consumption of VRAM even more fiddlesome, since it encourages devs to massively overcommit textures: now, instead of loading all textures that could possibly be used within the next hour of stealth game play, say, the game is loading all textures needed for the next 60ms of game play.

    Also, in games where there are no loading screens, there's no "dedicated time" to perform "memory defragmentation".

As a PC dev, how are you supposed to build a game where EVERY user is expected to have a PC powerful enough to load all required textures tens of times per second? Are you gonna ship a "1337" version of the game for the 0.1% and use load screens and blurry textures for everyone else?
     
  14. LordVulkan

    Newcomer

    Joined:
    Mar 31, 2015
    Messages:
    11
    Likes Received:
    25
The algorithms involved will scale with bandwidth; only PS5 first-party studios can afford to design an engine around a fixed storage bandwidth. And I doubt they will in the long term, as they seem to be expanding their games to PC and want their technology to be ready for the next console iteration. I think they will only do it during the PS5's first year or two, in order to have a showcase for their console.

I assume there are already some brilliant minds looking at how to scale properly with bandwidth, allowing optimal results at any bandwidth, and we will start to see their solutions at GDC 2022.

Although there are already some scalable technologies announced, like Epic's Nanite.
     
  15. Infinisearch

    Veteran

    Joined:
    Jul 22, 2004
    Messages:
    779
    Likes Received:
    146
    Location:
    USA
So an overprovisioned pool/chunk of memory is out of the question?
edit - as in pool allocators, and for anything that doesn't fit into those, a power-of-two malloc that runs off pools.
     
    #315 Infinisearch, Sep 3, 2020
    Last edited: Sep 3, 2020
  16. Infinisearch

    Veteran

    Joined:
    Jul 22, 2004
    Messages:
    779
    Likes Received:
    146
    Location:
    USA
Isn't this technology only for games where you can traverse parts of the level, either by teleportation or rapid movement, for a fairly extended period of time? I never got the impression they were trying to do away with 'VRAM'.
     
  17. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,834
    Likes Received:
    18,634
    Location:
    The North
There is that, and largely I think the goal is to allow games to be designed without the need for gates in your level design, where a small QTE occurs to unload parts of a level and load in the next parts.

And then there are other design paradigms where extremely detailed worlds and textures in an area forced the playing field to shrink immensely; having that level of fidelity across a vast area wasn't possible due to memory limitations.

And then everything else you sort of indicated.
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    No.

    I can imagine that PS5 style "continuously streaming textures" actually reduces the pressure on VRAM. With ultra-low latency texture streaming the problem becomes how much space is there on the disk for the game install, not VRAM.

For PC games I doubt latencies will ever allow for PS5-style ultra-streaming game engines, because the lowest common denominator of a PC with 300 MB/s max disk bandwidth is unavoidable.
     
  19. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
NVMe drives could be made a minimum requirement, I guess. Simply going from 4K textures to 2K textures would quarter your texture streaming requirements and let you scale from top-end NVMes down to the bottom end.

    Relative to consoles, decent amounts of system RAM should also alleviate a lot of pressure on the storage IO with good pre-caching.

    I wouldn't be that surprised to see requirements along the lines of "32GB RAM with SATA SSD or 16GB RAM with NVMe SSD"
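The quartering claim is just texel arithmetic: halving each edge of a square texture quarters its area and therefore its bytes. A back-of-envelope sketch (the one-byte-per-texel figure is an assumed stand-in for a block-compressed format):

```python
def texture_bytes(resolution: int, bytes_per_texel: float = 1.0) -> float:
    """Bytes for one square texture with the given edge resolution.
    bytes_per_texel=1.0 roughly approximates a block-compressed texture."""
    return resolution * resolution * bytes_per_texel

# Dropping from 4K to 2K textures cuts streaming load to 25%.
ratio = texture_bytes(2048) / texture_bytes(4096)
print(ratio)  # 0.25
```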
     
    tinokun, Dictator, PSman1700 and 3 others like this.
  20. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
So few people have 32GB of RAM that it really becomes easier to get folks to buy an entry-level NVMe SSD. Buy an NVMe SSD instead of a memory upgrade. Funnily enough, the laptop peasants are ahead of the SSD curve here.
     
    #320 manux, Sep 3, 2020
    Last edited: Sep 3, 2020


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.