PlayStation 5 [PS5] [Release November 12, 2020]

Without generalising too much, Japanese people are usually very conservative in their statements. Add the fact that he’s a tech guy, and the PR level here is zero. Any CEO or marketing-adjacent person would go “PS5 is the quietest console ever in the history of humankind”.
I'd imagine that a setup with a fan that size could make a lot of noise if it ramped to full speed. If Cerny's presentation is accurate, they'd know the upper power limit and shouldn't need to ramp the fan to a high level, unless even with that hardware the upper range demands more airflow than can be delivered while staying quieter than the prior consoles.
Maybe the PS5 could be louder at the top end than some of the non-jet-engine PS4s, or there's an allowance for non-ideal ambient conditions as well.
 
This is a good video about load times. It's pretty clear an SSD or a fast CPU alone does not do much. It's a complex issue, ranging from how the engine is designed down to how the low-level IO works. In simple terms, Sony's approach doesn't do anything that cannot be done on other hardware, but the proof may be in the ease-of-use and efficiency pudding. Microsoft will likely offer similar ease of use via DirectStorage, and we will find out about efficiency once there are PC games out there using DirectStorage; how much CPU/GPU resources the IO consumes under DirectStorage is the measurable part of efficiency.

 
An extreme case would be the new Ratchet & Clank, where the gameplay switches between levels almost seamlessly.
That's a high-level description that doesn't really map to what is happening at the level of the GPU caches.
For example, if a character hops from one region with a tree to a region with a soccer ball, what exactly is being done from the perspective of the system?
The tree and ball have associated textures, geometry, and shaders as part of what might be loaded.
If those elements are still in use for the tree, deciding to overwrite the regions the tree occupies would screw up the tree. So the SSD stalls?
The scrubbers can't help with that portion, but if the load is supposed to be timely, waiting for the tree to be discarded means the GPU is then waiting for thousands to millions of cycles for that data to come in and for the scrubbers to finally pinpoint those cache lines.
If the ball's assets are needed immediately, the SSD read would be going to a different virtual memory page. If that's the case, then the scrubbers aren't needed because the cache wouldn't confuse the two sets of resources.

In virtual memory, the pages dedicated to the tree could be marked as being unneeded pretty quickly relative to an SSD read, and there's no low-hanging ceiling in the address space like there is for physical RAM. The old cached lines of the unneeded page would be naturally evicted.
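As a rough analogy on a POSIX system (not Sony's actual API), dropping a no-longer-needed, read-only asset region can be about as cheap as telling the kernel the pages are disposable; the function below is a hypothetical helper:

Code:
#include <sys/mman.h>
#include <cstddef>

// Hypothetical helper: the asset's backing pages are read-only copies of
// on-disk data, so nothing needs to be written back before dropping them.
// MADV_DONTNEED marks the range as discardable; the physical pages can be
// reclaimed, and stale cached lines simply age out of the caches.
void drop_asset_pages(void* base, std::size_t length)
{
    // base and length must be page-aligned in real code.
    madvise(base, length, MADV_DONTNEED);
}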
 
That's a high-level description that doesn't really map to what is happening at the level of the GPU caches.

So let's take the Grand Theft Auto-type use case of entering a building. The engine would remove the detailed car and pedestrian meshes and textures from memory. They would be replaced with lower-level LODs that might or might not have already been in memory. The freed memory is then used to load in the building interiors like desks, walls, windows etc. The lower-quality meshes for cars/pedestrians would be used to draw whatever is visible through windows. The cache invalidation is needed because otherwise the new interior assets would misbehave, as the caches could still contain now-invalid high-detail assets that were removed to make room for the interior.

In the Ratchet & Clank case, the use case seems to be akin to legacy level loading: the current level is removed from memory, there is a short ~1s transition effect, and the new level appears.

From the engine's POV, in Sony's solution the developer would just say: load and decompress this data from the SSD and insert it directly at this memory address. There is no need for the CPU/OS/driver to invalidate caches, as the IO chip takes care of orchestrating cache invalidation. Once the data is available in RAM and the GPU/CPU tries to access it, the cache line has already been invalidated, causing the data to be loaded into the cache from RAM.
 
To say the above in a simpler way: in Sony's solution the content from the SSD is DMA'd directly to memory. There is very little in the way of OS/driver overhead, as that overhead has been designed into the hardware, allowing the IO chip to orchestrate the DMA, cache scrubbing, etc. This should also be a nice API for the developer, as file access looks very much like a memcpy where the data just appears in RAM in decompressed form. I believe Sony also supports the regular mmap way of mapping mass storage to RAM, but that is a whole other story..
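A minimal sketch of what such a memcpy-like call might look like from the engine side (every name and signature here is hypothetical, not Sony's actual SDK; stubbed so it compiles):

Code:
#include <cstdint>

// Hypothetical request descriptor and handle, for illustration only.
struct FileRange { uint64_t offset; uint64_t compressedSize; };
using RequestId = uint32_t;

// "Read this compressed range from the SSD, decompress it, and place the
// result at dst." In the design described above, the IO complex would handle
// the DMA, the decompression and the cache scrubbing of the destination
// range; the CPU only queues the request.
RequestId ssdReadDecompress(const char* /*assetId*/, FileRange /*range*/, void* /*dst*/)
{
    return 0; // stub
}

bool requestComplete(RequestId /*id*/) { return true; } // stub

void streamInterior(void* dst, FileRange range)
{
    RequestId req = ssdReadDecompress("interior_assets", range, dst);
    while (!requestComplete(req)) { /* do other frame work */ }
    // dst now holds decompressed data; the caches covering it were scrubbed,
    // so the next GPU/CPU access refetches from RAM.
}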
 
That's a high-level description that doesn't really map to what is happening at the level of the GPU caches.
For example, if a character hops from one region with a tree to a region with a soccer ball, what exactly is being done from the perspective of the system?
The tree and ball have associated textures, geometry, and shaders as part of what might be loaded.
If those elements are still in use for the tree, deciding to overwrite the regions the tree occupies would screw up the tree. So the SSD stalls?
The scrubbers can't help with that portion, but if the load is supposed to be timely, waiting for the tree to be discarded means the GPU is then waiting for thousands to millions of cycles for that data to come in and for the scrubbers to finally pinpoint those cache lines.
If the ball's assets are needed immediately, the SSD read would be going to a different virtual memory page. If that's the case, then the scrubbers aren't needed because the cache wouldn't confuse the two sets of resources.

In virtual memory, the pages dedicated to the tree could be marked as being unneeded pretty quickly relative to an SSD read, and there's no low-hanging ceiling in the address space like there is for physical RAM. The old cached lines of the unneeded page would be naturally evicted.

Could you dumb it down further for me?
PS5 cache scrubbers are useless?
 
So let's take the Grand Theft Auto-type use case of entering a building. The engine would remove the detailed car and pedestrian meshes and textures from memory.
At a low level, those resources take up pages in memory at various capacities, but typically 4KB per the standard for x86. That's 4KB in RAM, but also 4KB in the virtual address space.
What does it actually take for the engine to remove them?
If the data isn't needed anymore, a clean virtual page can have a single bit changed in its attributes and it's marked invalid. If it's not clean, then it probably needs to be written back to disk, which I suppose scrubbers might help with. However, most of these resources aren't being modified, and the console's SSD is more concerned with a read-heavy scenario. This situation is also not related to any SSD input.
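To illustrate the "single bit" point with a toy model (a simplified x86-64 page table entry, not the console's actual memory manager):

Code:
#include <cstdint>

// Toy model of an x86-64 page table entry: bit 0 is the Present bit.
// Marking a clean, no-longer-needed page as not present is just clearing
// that bit (plus, on real hardware, a TLB invalidation for the address).
struct PageTableEntry {
    uint64_t raw = 0;

    bool present() const  { return (raw & 0x1) != 0; }
    void markNotPresent() { raw &= ~uint64_t{0x1}; }
};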

They would be replaced with lower-level LODs that might or might not have already been in memory.
If the LOD is already in memory, then the scrubbers are unnecessary. The data in memory has a virtual address, and it would be distinct from the data being removed. The GPU cache would already track them separately.

The freed memory is then used to load in the building interiors like desks, walls, windows etc. The lower-quality meshes for cars/pedestrians would be used to draw whatever is visible through windows. The cache invalidation is needed because otherwise the new interior assets would misbehave, as the caches could still contain now-invalid high-detail assets that were removed to make room for the interior.
It only misbehaves if the SSD is directed to write to a virtual memory range that is in use or was recently in use. My question is what part of the process is being optimized with the scrubbers? Removing a page from memory is a bit flip. Writing to a page while there's a risk it is still in use could stall the SSD, since scrubbers can't stop shaders that are reading the old data from picking up the new data out of order.
If the SSD instead reads its data into memory with its own virtual page, the scrubbers are unnecessary and only a pointer or handle would need to be updated. The pointer operation seems like a minor update, particularly since the prior issue of needing to stall if the destination page's old data is still being used implies there's some kind of barrier anyway.
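A rough sketch of that pointer-update alternative (all names hypothetical): the SSD fills a fresh buffer, and once the read completes the engine swaps the asset's handle to point at it, leaving the old pages to be retired whenever in-flight work finishes.

Code:
#include <atomic>

// Hypothetical asset handle: readers always go through the pointer, so
// publishing new data is just an atomic pointer swap once the SSD read into
// freshBuffer has completed. No cache line holding the old buffer needs
// scrubbing, because the new data lives at different addresses.
struct AssetHandle {
    std::atomic<const void*> data{nullptr};
};

void publishNewVersion(AssetHandle& handle, const void* freshBuffer,
                       const void** outOldBuffer)
{
    // acq_rel ensures the buffer contents are visible before the pointer is.
    *outOldBuffer = handle.data.exchange(freshBuffer, std::memory_order_acq_rel);
    // The old buffer is retired once the GPU/CPU finish in-flight work
    // (e.g. at end of frame); its pages can then be freed or reused.
}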

From the engine's POV, in Sony's solution the developer would just say: load and decompress this data from the SSD and insert it directly at this memory address.
That sounds like reusing the same virtual address, but what scenario needs this versus changing a pointer? Is there some other pitfall that reusing the same page avoids, given that it's not free of overhead itself?


To say the above in a simpler way: in Sony's solution the content from the SSD is DMA'd directly to memory. There is very little in the way of OS/driver overhead, as that overhead has been designed into the hardware, allowing the IO chip to orchestrate the DMA, cache scrubbing, etc. This should also be a nice API for the developer, as file access looks very much like a memcpy where the data just appears in RAM in decompressed form. I believe Sony also supports the regular mmap way of mapping mass storage to RAM, but that is a whole other story..
If virtual addresses aren't being reused that close in time to those resources being used, much of that would be irrelevant to the GPU caches we know of. The CPU needs to be more careful because it has caches that are physically tagged, so updates to virtual memory may map to the same physical lines, but the GPU's scrubbers don't help with that.
How many pages are in use within such a narrow time window, yet in such a way that changing a few bits in the virtual memory subsystem would be a significant source of overhead relative to an IO operation and some form of GPU cache flush? How much memory in that state is saved, versus the swaths of texture and object data that could be marked invalid trivially?
 
At a low level, those resources take up pages in memory at various capacities, but typically 4KB per the standard for x86. That's 4KB in RAM, but also 4KB in the virtual address space.
What does it actually take for the engine to remove them?

Many high-performing engines avoid memory allocations. In essence, they allocate memory upfront as contiguous blocks. Then, whenever new content is needed, something old is removed by replacing it with the new data. The new data reuses both the same virtual and physical addresses. Game engines are paranoid about pointer chasing and it's all data-driven design. There aren't going to be 4kB mallocs and frees in real-time apps that go for high performance. In the Sony world, the developer would just DMA the data from the SSD into RAM. In that operation the IO controller will handle the decompression, the DMA, and making sure the caches are in the correct state. No CPU involved. The API probably looks a lot like memcpy, but the input is the SSD and the output is RAM. There is probably a tiny amount of metadata outside the bulk data that also needs to be updated for the engine to know what is where.

edit. There could also be a custom memory allocator that has chunks of preallocated memory in a pool. These chunks would never go through malloc/free; rather, the engine would give some chunk back to the pool and another piece of the engine would "allocate" the same chunk for a different purpose. DMA'ing data into these chunks would then again reuse the same virtual and physical memory when old data is replaced with new.
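A minimal sketch of that kind of chunk pool (hypothetical; fixed-size chunks reserved up front and recycled without ever touching malloc/free again):

Code:
#include <cstddef>
#include <vector>

// Hypothetical fixed-size chunk pool: one big upfront allocation, then
// chunks are handed out and returned without further malloc/free. A chunk
// returned to the pool keeps its virtual and physical addresses, so the
// next "allocation" reuses them for different data.
class ChunkPool {
public:
    ChunkPool(std::size_t chunkSize, std::size_t chunkCount)
        : storage_(chunkSize * chunkCount)
    {
        for (std::size_t i = 0; i < chunkCount; ++i)
            free_.push_back(storage_.data() + i * chunkSize);
    }

    void* acquire()            // e.g. used as the destination of an SSD read
    {
        if (free_.empty()) return nullptr;
        void* chunk = free_.back();
        free_.pop_back();
        return chunk;
    }

    void release(void* chunk)  // old asset retired; same addresses get reused
    {
        free_.push_back(static_cast<char*>(chunk));
    }

private:
    std::vector<char>  storage_;
    std::vector<char*> free_;
};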
 
Could you dumb it down further for me?
PS5 cache scrubbers are useless?
Cache scrubbers would be used to invalidate data loaded into the GPU caches that has been replaced by data from the SSD.
That implies that whatever the old data was, it was either recently used or still in use by the GPU. Some kind of synchronization would be needed in the latter case.

There are two different ways to look at something in-memory. There is a physical address range of bytes taken up by the data in RAM, which goes most directly to the 16GB limit of the console.
The other is the virtual address, which among other things tracks what is in RAM, or what has been moved to and from more permanent storage like the SSD.
The virtual address space is vast, particularly with 64-bit processors. You can generate many virtual pages whose total capacity exceeds physical memory; pages would be swapped to and from disk as needed. That was the original point of having virtual memory.

At least traditionally, GPU caches have cared more about virtual addresses, aside from a small amount of translation for CPU-coherent data. That is generally not something the SSD can provide, because the SSD is an IO device, which is also largely how the GPU is treated for most operations, for now.
If that is the case, then you only worry about needing cache scrubbers if the SSD is being told to write to the same virtual memory page. This would be different from swapping, because this is pushing a different version or unrelated data on top of the same page.
However, the alternative is to use a different virtual page, which you can make many of. Virtual pages that aren't immediately needed can often be removed with low overhead, so what use cases do the scrubbers enhance, and by how much?

As for how useful they are, we know that AMD has opted not to use them.
Similarly, the PS4 had something for marking cache lines used by compute to speed up cache invalidations that AMD never used and nobody seems to have missed on the AMD side.
A number of console optimizations seem to fall under the heading of nice-to-have, if frequently unimportant in the long run.

The PS4 introduced having 8 compute front ends, which Cerny took credit for when AMD's Hawaii did something similar. However, subsequent GPUs cut the count back and eventually replaced that method completely.
The PS4 also had a triangle sieve customization, which may have been an early instance of primitive shaders or the compute-based culling shaders. However, that specific formulation is quite different from all the others, and it's not clear how often that customization found use.
 
Is he doublespeaking? I think he also said AMD's roadmaps can change based on their collaboration, and AMD might use Sony-AMD collab results for its later products?
There’s no doublespeak, as far as I can tell. Cerny’s very precise in his language. “So, collaboration is born. If we bring concepts to AMD that are felt to be widely useful, then they can be adopted into RDNA2 and used broadly, including in PC GPUs. If the ideas are sufficiently specific to what we’re trying to accomplish—like the GPU cache scrubbers I was talking about—then they end up being just for us. If you see a similar discrete GPU available as a PC card at roughly the same time as we release our console, that means our collaboration with AMD succeeded in producing technology useful in both worlds.” (Back up to 25:01 for the full paragraph.)

“Similar discrete GPU” means ~36CU ~2.2GHz, not cache scrubbing. A discrete GPU doesn’t necessarily have the same memory contention issues as a (high performance) APU, so don’t read too much into its not existing in PC land. It must have some value to Sony, otherwise why dedicate transistors to it?
 
Many high-performing engines avoid memory allocations. In essence, they allocate memory upfront as contiguous blocks. Then, whenever new content is needed, something old is removed by replacing it with the new data. The new data reuses both the same virtual and physical addresses. Game engines are paranoid about pointer chasing and it's all data-driven design. There aren't going to be 4kB mallocs and frees in real-time apps that go for high performance.
Most allocations in x86 land tend to use 4KB pages, regardless of the total amount requested. Perhaps consoles make more use of 2/4MB or gigabyte pages; server workloads tended to care about those.
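For reference, on Linux (not necessarily on a console OS) a larger page size has to be requested explicitly as a property of the mapping, e.g.:

Code:
#include <sys/mman.h>
#include <cstddef>

// Linux example: request an anonymous mapping backed by huge pages.
// This just illustrates that page size is a property of the mapping,
// not of the amount allocated.
void* allocHugePages(std::size_t bytes)   // bytes should be a multiple of the huge page size
{
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}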

In the Sony world, the developer would just DMA the data from the SSD into RAM. In that operation the IO controller will handle the decompression, the DMA, and making sure the caches are in the correct state. No CPU involved.
Much of that's part of the extra IO logic according to Cerny, but there's some CPU consideration necessary since the CPU might need to invalidate cache lines in response to the IO operation.

edit. There could also be a custom memory allocator that has chunks of preallocated memory in a pool. These chunks would never go through malloc/free; rather, the engine would give some chunk back to the pool and another piece of the engine would "allocate" the same chunk for a different purpose. DMA'ing data into these chunks would then again reuse the same virtual and physical memory when old data is replaced with new.
Allocating a certain amount of virtual memory in advance into a pool is an optimization mentioned before. More virtual pages can be subscribed than hosted in RAM. The current gen may not have oversubscribed much, since the desire was to have those pages pinned in memory to avoid paging--but that's what loading from the SSD would be in this generation.
Having a supply of extra pages for the SSD to target seems comparatively low-cost once the game is already pre-allocating things. Other than tracking a pointer, which the system would need to be given when the SSD request was made, removing the old page from RAM would be cheap relative to a drive hit, especially if it's read-only like a lot of assets are. It could be flagged as non-resident and considered paged out.
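A rough sketch of that oversubscription pattern on a POSIX-like system (not the console's actual memory API): reserve a virtual range larger than you ever intend to keep resident, commit sub-ranges as SSD targets, and decommit retired ones.

Code:
#include <sys/mman.h>
#include <cstddef>

// Reserve a large virtual range with no backing pages committed yet.
void* reservePool(std::size_t bytes)
{
    void* p = mmap(nullptr, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? nullptr : p;
}

// Make a sub-range usable, e.g. as the destination of an SSD load.
bool commit(void* addr, std::size_t bytes)
{
    return mprotect(addr, bytes, PROT_READ | PROT_WRITE) == 0;
}

// Retire a sub-range: the physical pages can be reclaimed, while the
// virtual addresses stay reserved for later reuse.
void decommit(void* addr, std::size_t bytes)
{
    madvise(addr, bytes, MADV_DONTNEED);
    mprotect(addr, bytes, PROT_NONE);
}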
 
Can someone please define a “cache scrubber”? Is this akin to selective flushing of cache lines? Is the scrubber defined by the function, or by what orchestrates that function?

And how is this different from the volatile bit tag from GCN?
 
Can someone please define a “cache scrubber”? Is this akin to selective flushing of cache lines? Is the scrubber defined by the function, or by what orchestrates that function?

And how is this different from the volatile bit tag from GCN?
Beyond what was stated in Cerny's PS5 presentation, we don't have much more to go on. They allow the IO block in charge of the SSD to inform the GPU of addresses affected by an SSD read. The scrubbers then perform pinpoint invalidations of lines caching those addresses. The specifics of their process are undisclosed.

The volatile flag was a customization for the PS4 GPU's L2, which marked lines dirtied by compute writes so that the eventual cache flush needed to make all results visible to the system would only flush those lines specifically. Whether that carried beyond the PS4 isn't known.

The cache scrubbers do share a similar concept of limiting the number of invalidations. Whether they go further and avoid certain flushes or just lower the overhead of a flush like the volatile flag isn't clear.
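As a purely conceptual model of the behaviour being described (not the actual hardware implementation): the IO complex hands the GPU an address range, and only the cache lines whose cached addresses fall in that range get invalidated, instead of flushing the whole cache.

Code:
#include <cstdint>
#include <vector>

// Toy model of range-based "scrubbing": invalidate only the lines whose
// cached address falls inside the range just overwritten by an SSD read.
// Real GPU caches are set-associative and tagged differently; this only
// illustrates the selectivity compared with a full flush.
struct CacheLine {
    uint64_t address = 0;
    bool     valid   = false;
};

void scrubRange(std::vector<CacheLine>& cache, uint64_t base, uint64_t size)
{
    for (CacheLine& line : cache) {
        if (line.valid && line.address >= base && line.address < base + size)
            line.valid = false;   // pinpoint invalidation
    }
}

void fullFlush(std::vector<CacheLine>& cache)
{
    for (CacheLine& line : cache)
        line.valid = false;       // everything is lost, hot or not
}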
 
Beyond what was stated in Cerny's PS5 presentation, we don't have much more to go on. They allow the IO block in charge of the SSD to inform the GPU of addresses affected by an SSD read. The scrubbers then perform pinpoint invalidations of lines caching those addresses. The specifics of their process are undisclosed.

The volatile flag was a customization for the PS4 GPU's L2, which marked lines dirtied by compute writes so that the eventual cache flush needed to make all results visible to the system would only flush those lines specifically. Whether that carried beyond the PS4 isn't known.

The cache scrubbers do share a similar concept of limiting the number of invalidations. Whether they go further and avoid certain flushes or just lower the overhead of a flush like the volatile flag isn't clear.

That kind of seems backwards to me.

Maybe I’m interpreting your post incorrectly but I get the impression that the IO system is driving cache invalidation.

I have limited knowledge on this subject, but I’ve read that a lot of cache invalidation involves dead data. However, most ISAs’ cache invalidation schemes tend to involve write-backs, so a non-insignificant amount of DRAM bandwidth is wasted writing dead data back into RAM.

Cache scrubber instructions can be used to selectively remove the dead data without writing back to memory.

So why wouldn’t you use such a system to inform the IO system that the DRAM addresses backing these cache lines are free for SSD reads?
 
Asking for a new page can be kinda expensive versus more lightweight ways to manage memory?
See https://gamedev.stackexchange.com/q...-allocation-patterns-used-in-game-development
There are various allocation strategies in that link that involve allocating chunks of memory in advance, then tracking pointers to specific address ranges.
The double/single frame type seems like the kind of data Sony is hoping to keep minimal with more rapid asset loading.

An object pool is along the lines of what I was discussing for avoiding the write-after-read hazard that the scrubbers cannot help with. An object pool could be created that is larger than physical memory, and then individual objects can be filled by the SSD without overwriting a current one. If this is for primarily read-only assets, the virtual memory system could quietly drop a no-longer-used block from RAM while the new blocks loaded by the SSD become more recently active. The level of oversubscription would be limited, since the system will only go so long before a cache flush occurs at major events like the end of a frame.
 
That kind of seems backwards to me.

Maybe I’m interpreting your post incorrectly but I get the impression that the IO system is driving cache invalidation.
The IO complex that controls the SSD contains units that pass information on affected addresses to the GPU. Custom units associated with various GPU caches then invalidate lines in those ranges. The recent post from @Unknown Soldier links to that specific section of the Road to PS5 video.

I have limited knowledge on this subject, but I’ve read that a lot of cache invalidation involves dead data. However, most ISAs’ cache invalidation schemes tend to involve write-backs, so a non-insignificant amount of DRAM bandwidth is wasted writing dead data back into RAM.
Cache invalidation involves wanting to remove a line or lines from a cache. If the lines were never modified, invalidation just changes each line's status to invalid, which makes future attempts to access that address in that cache miss and makes the line a likely destination for the next cache fill. If the line has been written to in some way, then there is a writeback. Most of the assets would be treated as read-only, so scrubbing them would be relatively quick. The SSD being told to overwrite a range that other code is trying to write data to is a potential hazard.
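A tiny extension of the earlier toy cache model, just to separate the two costs (illustrative only): clean lines are dropped with a state change, dirty lines cost a write-back first.

Code:
#include <cstdint>

// Toy cache line with a dirty flag. Invalidating a clean line is only a
// state change; invalidating a dirty line requires writing it back first,
// which is where DRAM bandwidth gets spent.
struct Line {
    uint64_t address = 0;
    bool     valid   = false;
    bool     dirty   = false;
};

void writeBackToMemory(const Line&) { /* placeholder for a DRAM write */ }

void invalidateLine(Line& line)
{
    if (!line.valid)
        return;
    if (line.dirty)
        writeBackToMemory(line);  // only modified data costs bandwidth
    line.valid = false;
    line.dirty = false;
}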

Cache scrubber instructions can be used to selectively remove the dead data without writing back to memory.
Cerny described the action as evicting the data, rather than dropping it in a manner inconsistent with normal cache function. If the lines being scrubbed had been written to, it would be something of a contradiction for the protocol to lose data that something tried to write to memory.
 