Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

True, but it's still only an issue when you're reusing data in the cache between loads from the SSD.

Yes, but we now have two consoles that can each stream gigabytes of data from the SSD, so I would expect this to be a much more common scenario for next-gen exclusive games. If not, then Microsoft and Sony have just wasted a lot of time and effort designing these systems. :runaway:
 
I presume they're not a useless feature, so the reasoning for their inclusion must make sense. Which makes me wonder if it's something other than resetting caches for fresh data, because that just doesn't fit my understanding of cache functionality. ;)

Just going with that simple example, let's say you work in 1 MB chunks. As a dev, load one MB of data, then while the GPU is processing it, load the next MB into a different buffer. Then the GPU can work on that second buffer while you load replacement data into buffer one.
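
To make that concrete, here's a minimal C sketch of that ping-pong / double-buffer pattern. load_from_ssd() and gpu_process() are hypothetical stubs standing in for whatever the real streaming and submission APIs would be, and in a real engine the two steps would overlap (async I/O vs. GPU work) rather than run back to back:

```c
#include <stdio.h>
#include <string.h>

#define BUF_SIZE (1024 * 1024)          /* 1 MB per buffer */

static char buffers[2][BUF_SIZE];       /* two staging buffers in RAM */

/* Stand-ins for the real streaming / GPU APIs (hypothetical). */
static void load_from_ssd(char *dst, size_t size, int chunk_id)
{
    memset(dst, chunk_id & 0xFF, size); /* pretend we read chunk_id from disk */
}

static void gpu_process(const char *src, size_t size)
{
    printf("GPU consumes %zu bytes (first byte %d)\n", size, src[0]);
}

int main(void)
{
    int front = 0;                      /* buffer the GPU is reading from */
    load_from_ssd(buffers[front], BUF_SIZE, 0);

    for (int chunk = 1; chunk <= 4; ++chunk) {
        int back = front ^ 1;
        /* These two would overlap in practice; sequential here just to
         * show the ping-pong pattern. */
        load_from_ssd(buffers[back], BUF_SIZE, chunk);
        gpu_process(buffers[front], BUF_SIZE);
        front = back;                   /* the old back buffer becomes the new front */
    }
    gpu_process(buffers[front], BUF_SIZE);
    return 0;
}
```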

I just can't get my head around putting a few MBs of data into the cache and then loading more data into that RAM space! :runaway: You'd have to be using the SSD as virtual RAM and working with notably more than 16 (12?) MBs of game data per frame.
 
Why does it make a difference if it's coming from the SSD?
It still needs to process the same data?
The difference being you don't need to duplicate textures or store so much in memory, e.g. you only need to store 3 seconds' worth instead of 20 seconds' worth.
 

Presumably because that's a likely source of data requiring the caches to be scrubbed. From the DF/Eurogamer interview with Mark Cerny:

"Coherency comes up in a lot of places, probably the biggest coherency issue is stale data in the GPU caches," explains Cerny in his presentation. "Flushing all the GPU caches whenever the SSD is read is an unattractive option - it could really hurt the GPU performance - so we've implemented a gentler way of doing things, where the coherency engines inform the GPU of the overwritten address ranges and custom scrubbers in several dozen GPU caches do pinpoint evictions of just those address ranges."​
 

Without cache scrubbers, I suppose you would allocate the memory in two parts: one with assets for the current frame, and the other part for streaming assets.
 
This is what I think I understand:

Without additional mechanisms, anything writing to memory from outside the last-level cache breaks coherency, so any DMA operation breaks coherency. From inside the GPU, a process would ask the cache if it has a certain memory address, and it could end up getting a mix of new and old data for that region - maybe a piece of a texture that was just released, whose address got reused for an unrelated new texture. The cache cannot know it holds stale data.

With a lot of streaming, the engine will reuse the same addresses as it releases and reallocates memory for new data, meaning there's an off chance of corruption, even if tiny. So they basically need to flush the caches before using anything that came from outside.

In normal cases, it might be simple to just flush the cache once per frame: wait until everything it needs has been loaded from disk, flush the caches to ensure coherency, and render the whole thing. But the loaded data then has an entire frame of latency before it can be used. So much for the microsecond-class latency of NVMe.

Cerny indicated he wants to stream based on the view frustum as the player turns around, so I suppose it would help if all these little bits being loaded are ready to use individually as soon as possible, instead of waiting until all of them are loaded just to avoid flushing the caches thousands of times per frame.
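
As a toy model of the difference (not PS5's actual mechanism, just the idea in C): each cache line is tagged with the address it mirrors, a "scrub" only drops lines that fall inside the overwritten range, while a full flush throws the whole working set away:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define NUM_LINES 8
#define LINE_SIZE 64                       /* bytes per cache line */

typedef struct {
    bool     valid;
    uint64_t tag;                          /* base address of the cached line */
} cache_line_t;

static cache_line_t cache[NUM_LINES];

/* Full flush: every line is dropped, hot data included. */
static void flush_all(void)
{
    for (int i = 0; i < NUM_LINES; ++i)
        cache[i].valid = false;
}

/* Targeted scrub: drop only lines overlapping [base, base + size). */
static void scrub_range(uint64_t base, uint64_t size)
{
    for (int i = 0; i < NUM_LINES; ++i) {
        uint64_t line = cache[i].tag;
        if (cache[i].valid && line + LINE_SIZE > base && line < base + size)
            cache[i].valid = false;
    }
}

int main(void)
{
    /* Pretend the cache holds lines for addresses 0x000..0x1C0. */
    for (int i = 0; i < NUM_LINES; ++i) {
        cache[i].valid = true;
        cache[i].tag   = (uint64_t)i * LINE_SIZE;
    }

    /* The SSD/DMA just overwrote 128 bytes at address 0x80. */
    scrub_range(0x80, 128);

    int survivors = 0;
    for (int i = 0; i < NUM_LINES; ++i)
        survivors += cache[i].valid;
    printf("after scrub: %d of %d lines still valid\n", survivors, NUM_LINES);

    flush_all();                           /* the blunt alternative */
    return 0;
}
```

With the targeted scrub, only the two lines overlapping the overwritten 128 bytes get evicted and the other six stay hot; the full flush costs you everything.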
 
https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf

AMD states that L2 clients were limited to the CUs for Polaris, so the RBEs, CP and Copy Engine had to write directly to memory, which forced a lot of L2 flushes. In Vega, the CP and RBEs were made clients of the L2, which reduced L2 flushes. However, copy-queue uploads still required an L2 flush. In RDNA, the Copy Engine was added as a client of the L2, which further reduced L2 flushes. AMD describes L2 flushes as rare on RDNA.

"Next, to support the case where you want to use the GPU L2 cache simultaneously for both graphics processing and asynchronous compute, we have added a bit in the tags of the cache lines, we call it the 'volatile' bit. You can then selectively mark all accesses by compute as 'volatile,' and when it's time for compute to read from system memory, it can invalidate, selectively, the lines it uses in the L2. When it comes time to write back the results, it can write back selectively the lines that it uses. This innovation allows compute to use the GPU L2 cache and perform the required operations without significantly impacting the graphics operations going on at the same time -- in other words, it radically reduces the overhead of running compute and graphics together on the GPU."

Non-PS4 GCN hardware is not devoid of volatile cache line invalidation or write-back functionality; it's just that on GCN its usage is limited to the L1 cache. Sony extended the functionality to the L2, but now I wonder where that extension resides, as the cache hierarchy of RDNA is different from GCN's.
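
Extending that toy cache-line model, the "volatile" bit could be pictured roughly like this - again just an illustration of the idea in C, not AMD's or Sony's actual hardware logic. Compute accesses mark their lines, and the selective invalidate/write-back only touches marked lines, leaving the graphics working set in the L2 alone:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

#define NUM_LINES 8

typedef struct {
    bool     valid;
    bool     dirty;
    bool     vol_bit;   /* set when the line was touched by marked compute accesses */
    uint64_t tag;
} l2_line_t;

static l2_line_t l2[NUM_LINES];

/* Before compute reads fresh data from memory: drop only its own lines. */
static void invalidate_volatile(void)
{
    for (int i = 0; i < NUM_LINES; ++i)
        if (l2[i].vol_bit)
            l2[i].valid = false;
}

/* When compute results must become visible: write back only its own dirty lines. */
static void writeback_volatile(void)
{
    for (int i = 0; i < NUM_LINES; ++i)
        if (l2[i].vol_bit && l2[i].dirty) {
            printf("write back line 0x%llx\n", (unsigned long long)l2[i].tag);
            l2[i].dirty = false;
        }
}

int main(void)
{
    for (int i = 0; i < NUM_LINES; ++i) {
        l2[i].valid   = true;
        l2[i].dirty   = (i % 2 == 0);
        l2[i].vol_bit = (i >= 6);          /* pretend lines 6..7 belong to async compute */
        l2[i].tag     = (uint64_t)i * 64;
    }

    writeback_volatile();                   /* only the volatile *and* dirty line goes out */
    invalidate_volatile();                  /* graphics lines 0..5 stay valid and hot */
    return 0;
}
```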
 
So here is a diagram I super professionally made in MSPaint that should clarify how the memory in the Series X is split between two "pools" of memory with different bandwidths:


[Image: Krj0RF5.png - MSPaint diagram of the Series X memory layout]



The "two pools" aren't really two physical pools, that division is virtual. There are 10 chips, but 6 of them have twice the capacity. To achieve maximum bandwidth whenever possible, all data is interleaved among the 10 chips (10*32bit = 320bit). This means a 10MB file will supposedly be split into 10*1MB partitions, one for each chip, so that the memory controller can write/read from all chips in parallel, hence using a 320bit bus.

But the memory controller can only interleave the data among all 10 chips while all 10 chips have space available to do so. When the 1GB chips are full, then the memory controller can only interleave the data among the 2GB chips that still have space available. There are 6 chips with 1 extra GB after the 1st GB is full, so that leaves us with 6*32bit = 192bit.

Of course, the system knows of this, so it's making a virtual split from the get-go, meaning the memory addresses pointing to the red squares become the "fast pool", and the ones pointing to the orange squares become the "slow pool". This way the devs can determine if a certain data can go to the fast red pool or the slow orange pool, depending on how bandwidth-sensitive the data is. They're not left wondering if e.g. a shadow map is going to be accessed at 560GB/s or 336GB/s, as that could become a real problem.


BTW, I chose to make the distinction between memory chips and not PHYs because I don't know if AMD is using 32bit or 64bit wide units (IIRC it's usually the latter), but I do know each GDDR6 chip uses a 32bit / 2*16bit connection.
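
A rough C sketch of that interleaving argument. The 256-byte stripe size and the exact address-to-chip mapping are my own assumptions purely for illustration; the point is just that addresses in the first 10 GB stripe across all 10 chips (10 x 56 GB/s = 560 GB/s), while addresses above that only have the six 2 GB chips to stripe across (6 x 56 GB/s = 336 GB/s):

```c
#include <stdint.h>
#include <stdio.h>

#define GiB            (1ULL << 30)
#define STRIPE         256ULL              /* assumed interleave granularity, illustration only */
#define FAST_REGION    (10ULL * GiB)       /* 1 GB taken from each of the 10 chips */
#define PER_CHIP_BW    56.0                /* GB/s: 14 Gbps x 32-bit per GDDR6 chip */

/* Map a physical address to (chip, effective striped bandwidth) under the assumed layout. */
static void locate(uint64_t addr)
{
    if (addr < FAST_REGION) {
        int chip = (int)((addr / STRIPE) % 10);        /* striped across all 10 chips */
        printf("addr %llu GiB -> chip %d, region striped at %.0f GB/s\n",
               (unsigned long long)(addr / GiB), chip, 10 * PER_CHIP_BW);
    } else {
        uint64_t off = addr - FAST_REGION;
        int chip = (int)((off / STRIPE) % 6);          /* only the six 2 GB chips remain */
        printf("addr %llu GiB -> 2GB-chip %d, region striped at %.0f GB/s\n",
               (unsigned long long)(addr / GiB), chip, 6 * PER_CHIP_BW);
    }
}

int main(void)
{
    locate(3ULL * GiB);      /* lands in the 560 GB/s "fast" 10 GB */
    locate(12ULL * GiB);     /* lands in the 336 GB/s "slow" 6 GB */
    return 0;
}
```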


Cache scrubbers sound like they'd be sensible in the PC space too, unless the programming model on the PC makes this impractical.
IIRC Cerny specifically mentioned the cache scrubbers as blocks that AMD chose not to adopt for their RDNA2 PC architecture, so they stayed a feature exclusive to the PS5.
 
Yes, streaming based on the view frustum should improve latency, but not only that. It should also save a bit of time and bandwidth, because there should be moments of unused main RAM bandwidth during the rendering of the frame, so in the end it should also improve CU occupancy. It's basically applying the async approach to any kind of streaming from disk, with all the benefits. But they can only do that because the streaming completely bypasses the CPU and GPU, so those can be fully used for the frame rendering while stuff is being loaded into RAM, almost for free.

I could be completely wrong, though. But it makes sense given Cerny's objectives with PS5: efficiency.
 
That sort of explains why it's not useful on PC: in that case the data must go through the CPU to be decompressed and formatted correctly, so it works like any other write operation to VRAM. PCs cannot DMA from the NVMe drive directly into VRAM because they don't have all the other hardware building blocks.
 

Why not? I don't recall AMD PC APUs being restricted from directly loading data from a hard drive into the local video memory partition of their RAM.
 
Nvidia's GPUDirect Storage does exactly that. It's not available on commercial GPUs yet, but that could change with Ampere.
Ah... I feel like maybe you linked to this before:
https://devblogs.nvidia.com/gpudirect-storage/

hm...

DMA engines, however, need to be programmed by a driver on the CPU. When the CPU programs the GPU’s DMA, the commands from the CPU to GPU can interfere with other commands to the GPU. If a DMA engine in an NVMe drive or elsewhere near storage can be used to move data instead of the GPU’s DMA engine, then there’s no interference in the path between the CPU and GPU.
Sounds like it'll be a blast. :oops:

  • Explicit data transfers that don’t fault and don’t go through a bounce buffer are also lower latency; we demonstrated examples with 3.8x lower end-to-end latency.
  • Avoiding faulting with explicit and direct transfers enables latency to remain stable and flat as GPU concurrency increases.
  • Use of DMA engines near storage is less invasive to CPU load and does not interfere with GPU load. The ratio of bandwidth to fractional CPU utilization is much higher with GPUDirect Storage at larger sizes. We observed (but did not graphically show in this blog) that GPU utilization remains near zero when other DMA engines push or pull data into GPU memory.
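
For contrast with what that blog describes, here's a minimal C sketch of the traditional "bounce buffer" path. The file name and buffer sizes are made up, and gpu_buffer is just a plain array standing in for a real VRAM allocation obtained from a graphics/compute API; GPUDirect Storage's whole pitch is to cut out the middle copy by letting a DMA engine near the drive write straight into GPU memory:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (64 * 1024)

/* Stand-in for GPU memory; in reality this would be a VRAM allocation
 * from the graphics/compute API (hypothetical here). */
static unsigned char gpu_buffer[CHUNK];

int main(void)
{
    unsigned char *bounce = malloc(CHUNK);   /* CPU-side staging ("bounce") buffer */
    if (!bounce)
        return 1;

    FILE *f = fopen("asset.bin", "rb");      /* hypothetical asset file */
    if (!f) {
        free(bounce);
        return 1;
    }

    /* Traditional path: storage -> system RAM -> GPU memory.
     * The data is touched twice and the CPU programs both transfers. */
    size_t got = fread(bounce, 1, CHUNK, f);
    memcpy(gpu_buffer, bounce, got);         /* stands in for the host-to-VRAM copy */

    /* With a DMA engine near the NVMe drive writing directly into GPU memory,
     * this whole staging step disappears, which is where the latency and
     * CPU-utilization numbers above come from. */

    fclose(f);
    free(bounce);
    printf("staged %zu bytes through the bounce buffer\n", got);
    return 0;
}
```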
 
I meant: how do you decompress the game data if you don't go through the CPU? Do games get twice as big and require twice the bandwidth? What sort of hardware addition can we expect on PC to make it feasible to have games designed to DMA straight to VRAM?
 

More system RAM and VRAM?

5-9 GB/s worth of bandwidth from the drive to RAM is a ton of data. But how do you sustain that level of bandwidth with only a 100-200 GB game? The SSDs would deliver 200 GB of data in about ~20-40 secs. So unless we get huge game sizes that measure in TBs, we are talking about a ton of repetitive transfers of the same game data. Expansion of RAM (system and video) on a PC can mitigate some of the speed offered by the consoles' SSDs. The game can be more aggressive in how much data it prefetches into RAM. How fast does a HDD/SSD have to be if your PC has 10+ GB of VRAM and 16+ GB of system RAM?
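
Rough back-of-the-envelope numbers for that, as a quick C sketch; the game size, drive rates and prefetch window are all illustrative assumptions, not measurements:

```c
#include <stdio.h>

int main(void)
{
    const double game_size_gb    = 200.0;          /* illustrative install size */
    const double console_rates[] = { 5.0, 9.0 };   /* GB/s, rough raw vs. compressed figures */

    for (int i = 0; i < 2; ++i) {
        double rate = console_rates[i];
        printf("at %.1f GB/s the entire %.0f GB game streams in ~%.0f s\n",
               rate, game_size_gb, game_size_gb / rate);
    }

    /* If a PC instead prefetches aggressively into spare RAM/VRAM,
     * the drive only has to keep that prefetch window topped up. */
    const double prefetch_gb    = 10.0;  /* assumed spare RAM used as a streaming cache */
    const double window_seconds = 30.0;  /* assumed how far ahead the game prefetches */
    printf("keeping a %.0f GB window filled over %.0f s needs ~%.2f GB/s from the drive\n",
           prefetch_gb, window_seconds, prefetch_gb / window_seconds);
    return 0;
}
```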
 