Current Generation Hardware Speculation with a Technical Spin [post GDC 2020] [XBSX, PS5]

More system RAM and VRAM?

5-9 GB/s worth of bandwidth from the drive to RAM is a ton of data. But how do you sustain that level of bandwidth with only a 100-200 GB game? The SSDs will deliver 200 GB of data in roughly 20-40 seconds. So unless we get huge game sizes measured in TBs, we are talking about a ton of repetitive transfers of the same game data. Expanding RAM (system and video) on a PC can mitigate some of the speed advantage offered by the consoles' SSDs. The game can be more aggressive in how much data it prefetches into RAM. How fast does an HDD/SSD have to be if your PC has 10+ GB of VRAM and 16+ GB of system RAM?
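
As a rough sanity check on that math (Python, using the publicly quoted ~5.5 GB/s raw and ~9 GB/s typical-compressed figures, and an assumed 100-200 GB install):
[CODE]
# Rough sanity check: how long it takes to read an entire install at the
# quoted SSD rates. Figures are the publicly quoted ballpark numbers.
raw_gbps = 5.5          # PS5 raw SSD bandwidth, GB/s
typical_gbps = 9.0      # typical effective rate with compression, GB/s

for install_gb in (100, 200):
    for rate in (raw_gbps, typical_gbps):
        print(f"{install_gb} GB at {rate} GB/s -> {install_gb / rate:.0f} s to read everything")
[/CODE]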
Yes, that's the point: it needs twice or more of everything upstream unless the devs use the CPU to decompress some Oodle format. Games would need to be stored uncompressed on the SSD, and that would divide the efficiency by that ratio, with RDO or BCPack multiplying the gap even more. I don't see devs wanting that just because the DMA saves CPU cores (which scale more readily on PC anyway) and adds some small efficiency improvement from cache scrubbers needing to flush the caches less often. They would just require more memory instead, for streaming in a larger time window.

The question was whether cache scrubbers are useful in the reality of PC game design and hardware.
 
Great, thanks, I think I understand that now. So GPUs without some form of cache invalidation optimization are going to suffer a penalty in terms of increased cache flushes when used in concert with very fast storage solutions, which will presumably become far more prevalent next generation. So essentially the cache scrubbers are a GPU performance aid, as opposed to something that actually speeds up the IO, from the sounds of it?
It seems to be a performance measure for the GPU. Data being read in will be written to its destination regardless of what state the GPU's execution is in.
How the notification process works, and what sort of synchronization operations are needed with the scrubbers aren't clear.
Sony may be trying to reduce the time and cache thrashing related to the global stall and cache invalidates, meaning the events are cheaper but are still used in the same fashion as the normal invalidates.
If there's some kind of non-standard way of synchronizing with that hardware, maybe some workloads can use a custom localized barrier that might allow some parts of the GPU to be excluded from the stall--but that may be a more extensive change than what has been described.
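
A toy software model of the difference, just to illustrate why invalidating a targeted range thrashes less than wiping the whole cache (this is nothing like the real hardware, whose details aren't public):
[CODE]
# Toy model: a set of cached line addresses and two reactions to new data
# being DMA'd into an address range. Purely illustrative.
cache = {addr for addr in range(0, 4096, 64)}        # 64 resident cache lines
dma_lo, dma_hi = 1024, 2048                          # range overwritten by the SSD/DMA

def full_invalidate(lines):
    return set()                                     # everything must be refetched later

def scrub_range(lines, lo, hi):
    return {a for a in lines if not (lo <= a < hi)}  # only overlapping lines dropped

print("lines left after full invalidate:", len(full_invalidate(set(cache))))        # 0
print("lines left after targeted scrub:", len(scrub_range(set(cache), dma_lo, dma_hi)))  # 48
[/CODE]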

Does the XSX have something similar, or is its GPU going to suffer relatively compared with the PS5 in this regard? Or is its storage system simply not fast enough for this to matter that much?
It hasn't been mentioned by MS, and the PS5 presentation indicated scrubbers were something that were customized for Sony and that AMD did not find compelling enough to include in its own IP despite it being available.
This seems to be an optimization for one category of methods that is probably uncommon now and might be a subset of many other implementations. Sometimes these are optimizations that are nice to have, but might not find sufficient use or benefit in a broader market.
The PS4's volatile flag may have helped make GPU compute more capable of working alongside graphics, but the concept didn't catch on anywhere else and nobody's indicated that the other GPUs suffered significantly for the lack of it.
The PS4 had a form of triangle sieve that might have been a forerunner to the culling-focused primitive shaders in Vega, so the idea might make sense. However, the PS4's implementation in particular has only really been mentioned in the pre-launch articles in 2013, and I don't recall it being mentioned since.
The PS4 Pro's ID buffer and checkerboard optimizations have had an unclear amount of adoption. Many of the leading engines found something other than checkerboard relatively quickly.
There may be other areas that the XSX has emphasized, like sampler feedback customizations or other tweaks that might provide different benefits.

Cache scrubbers sound like they'd be sensible in the PC space too, unless the programming model on the PC makes this impractical. Assuming not, though, I wonder if this is one of the enhancements in the PS5 that we might see in future AMD GPUs, as Cerny mentioned. If not RDNA2, then perhaps RDNA3 (which may better match the timescale of very fast IO solutions becoming prevalent in the PC space).
The PC space has a wider range of hardware and has to worry about a broader legacy base that might not provide the IO capability that would justify them. If there are PS5-specific ways of utilizing SSD data by shaders or GPU hardware that interface with the scrubbers in a non-standard way, that may make them less likely to be used.
Discrete products have a PCIe bus to transfer over, and until there's more unified memory those explicit transfers may be heavyweight enough to exceed the savings from scrubbing.
APUs might be better-positioned due to the single memory pool, but then we'd need one with more of a performance focus.
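
Back-of-envelope numbers on why the explicit PCIe copy tends to dwarf whatever an invalidate costs on a discrete card (all figures below are assumptions for illustration):
[CODE]
# Time to move a streamed asset over PCIe vs the worst case of refilling a
# fully flushed GPU L2 from VRAM. Assumed, round numbers.
pcie4_x16_gbs = 28.0     # usable PCIe 4.0 x16 bandwidth, GB/s
vram_gbs      = 448.0    # VRAM bandwidth, GB/s
asset_mb      = 64       # size of the streamed asset
l2_mb         = 4        # RDNA-class L2 size, ballpark

copy_ms   = asset_mb / 1024 / pcie4_x16_gbs * 1000
refill_ms = l2_mb    / 1024 / vram_gbs      * 1000
print(f"PCIe copy of {asset_mb} MB: {copy_ms:.2f} ms")
print(f"Refilling a {l2_mb} MB L2 from VRAM: {refill_ms:.3f} ms")
[/CODE]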

I don't really understand that, though, as it only matters if the data in the caches changes, and how likely is that? Like, you've got a load of geometry and textures present drawing some scenery, and then a character. New scenery is loaded. By the time the GPU comes to draw that new scenery, the caches are filled with character info, so they'd naturally reload the scenery data with the latest copy in RAM.

It's only an issue if the GPU is drawing scenery, the scenery data is cached, and new scenery data is loaded over it. That seems a rare occurrence, the caches holding onto the same data for that long.
Perhaps this is an optimization with a certain class of workloads in mind, such as virtual texturing like in the later Trials games? A virtual texturing cache is a range of memory addresses that may be updated by data from different disk locations or different assets based on how the GPU/renderer chooses to update it. Couple that with some of the ideas about how the latest Unreal demo may be virtualizing its geometry, there could be objects or subsets of them at different levels of detail being read in or switched out of a limited working set.

Assigning specific ranges within the virtual asset caches may see benefit from the scrubbers, since they could be used to clean up a given allocation without thrashing other in-progress objects, allowing a new object to take it over. However, that may require a level of interaction between the scrubbers and the shaders, and a degree of fine-grained synchronization, that might not match reality, plus an unclear level of optimism with regards to SSD latency.
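
A sketch of the sort of bookkeeping that might pair with per-range scrubbing in a virtual asset cache; the slot layout, the scrub_range() stand-in, and how shaders would actually synchronize with it are all made up for illustration:
[CODE]
# Hypothetical: a virtual asset cache carved into fixed-size slots, where
# recycling one slot only needs its address range invalidated rather than a
# whole-cache flush. scrub_range() is a stand-in for whatever the real
# notification/invalidate mechanism looks like.
SLOT_BYTES = 2 * 1024 * 1024
CACHE_BASE = 0x2000_0000
NUM_SLOTS  = 64

free_slots = list(range(NUM_SLOTS))
resident   = {}                                  # asset id -> slot index

def scrub_range(lo, hi):
    print(f"scrub [{lo:#x}, {hi:#x})")           # placeholder for the hardware mechanism

def evict_one():
    victim, slot = next(iter(resident.items()))  # naive eviction policy
    del resident[victim]
    return slot

def load_asset(asset_id):
    slot = free_slots.pop() if free_slots else evict_one()
    lo = CACHE_BASE + slot * SLOT_BYTES
    scrub_range(lo, lo + SLOT_BYTES)             # only this slot's lines, not the whole L2
    resident[asset_id] = slot                    # the DMA write into [lo, lo+SLOT_BYTES) would follow
    return lo

load_asset("rock_lod0")
[/CODE]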
 
Cache invalidation seems weird to me. For PC, normal RAM has pretty much always been used as a cache; it was there before dedicated graphics RAM even existed. It's also much faster in terms of latency than an SSD and has all the bandwidth you'd need, so if AMD can't see the benefit even there... well, maybe it's just another weird Cerny fixation, like a dedicated audio shader module or the ID Buffer thing. I mean, so a new virtual texture block streams in a frame or so earlier; even if you make the stall cheaper, it's probably still not worth a stall.
 

Yes, this was/is one of the sources of my original confusion. Since main RAM in a PC is performing a similar function to the SSDs in the consoles (albeit smaller and faster), it suggests PC GPUs may also benefit from this. Unless the GPU sees system RAM, as opposed to VRAM, as the "last level cache", in reference to MrFox's post above.

MrFox said:
anything writing to memory from outside of the last level cache breaks coherency.
 
Cache invalidation seems weird to me. For PC, normal RAM has pretty much always been used as a cache; it was there before dedicated graphics RAM even existed. It's also much faster in terms of latency than an SSD and has all the bandwidth you'd need,
The big limiters would be capacity and the initial-load latencies running into little-improved HDD performance and unusually modest RAM capacity improvements for the consoles this time around.
Some devs are indicating that they have ambitions for scenes where the fidelity of some objects makes them impractical to load in their totality, or where the total amount of detail is too large to hold in memory with traditional methods. Exactly how crazy they can get with asset sizes, given that the SSDs aren't massive, is a question yet to be answered.
There have always been barriers and flushes at certain boundaries, like between frames and at context switches on some platforms. With glacial IO, the driver/OS could update resource status at those points, inside already high-overhead operations.

A more responsive IO system doesn't need to waste as much RAM capacity on data that may not be used, but more frequent changes can take the synchronization and flush operations out of the shadow of the larger barriers they were hidden by.
One additional thought I forgot until now is that Cerny mentioned the scrubbers invalidated targeted ranges in various caches, which may mean some other on-chip caches or buffers might get notifications as well, and the pitfalls for those may fall outside of the L2-focused examples I'm aware of.

if AMD can't see the benefit even there... well, maybe it's just another weird Cerny fixation, like a dedicated audio shader module or the ID Buffer thing. I mean, so a new virtual texture block streams in a frame or so earlier; even if you make the stall cheaper, it's probably still not worth a stall.
It's difficult to say how much of an impact the various tweaks have had, but it does seem like they haven't been game-changing.
I'm still not sure how aggressive Sony thinks it's going to be with gathering data from the SSD, and how consistent its latency is expected to be. If the latency could be made reliably intra-frame, or even within specific phases of a frame's time budget, that could be at least notable. If the stall cost and penalties caused by wiping the cache for all other workloads can be significantly reduced, that might make some algorithms more practical.
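
To put "intra-frame" in rough perspective (assumed figures):
[CODE]
# How much data could in principle arrive inside one frame's budget, if the
# request latency really can be kept that low. Assumed, round numbers.
frame_ms           = 1000 / 60     # 60 fps frame time
request_latency_ms = 0.5           # assumed end-to-end SSD request latency
effective_gbs      = 9.0           # assumed post-decompression rate, GB/s

usable_ms = frame_ms - request_latency_ms
print(f"~{effective_gbs * 1024 * usable_ms / 1000:.0f} MB deliverable in a {frame_ms:.1f} ms frame")
[/CODE]
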
AMD could have multiple reasons for not using the tech in this way. It can be that the win is modest, or at least too modest for the broader set of constraints AMD's market has.
 
Cold, hard numbers that you make up? Where's the "1900 MHz game clock" from? And listing unit counts for comparison is beneath this site; it's "units * clock for the workload" that matters. Replace TMUs with TxOps etc. (better yet, don't bother. The argument can be put to rest just by comparing TFLOPS, as that was what Chris got wrong, so leave it at an 18% advantage to XSX if the PS5 keeps its clock high, or 20% if you want to be more approximate).
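
i.e. from the publicly stated CU counts and clocks alone:
[CODE]
# TFLOPS from the public CU counts and clocks: 2 FP32 ops (FMA) x 64 lanes
# per CU x CUs x clock.
def tflops(cus, mhz):
    return 2 * 64 * cus * mhz * 1e6 / 1e12

xsx = tflops(52, 1825)   # ~12.15 TF, fixed clock
ps5 = tflops(36, 2230)   # ~10.28 TF, at its maximum clock
print(f"XSX {xsx:.2f} TF vs PS5 {ps5:.2f} TF -> {100 * (xsx / ps5 - 1):.0f}% advantage")
[/CODE]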
To be fair, it's not @John Norum who's making it up, it's techpowerup on their placeholder PS5 specs page.

His mistake is believing the data is based on anything other than whatever the TPU editor pulled out of his ass at the time, for both the PS5 and the SeriesX.
80 ROPs on the SeriesX, a 1750MHz base GPU clock on the PS5, TDP for either chip, L2 cache sizes and many other datapoints are completely made up. They're only "cold numbers" in the sense of how dead useless they are.
The SeriesX page even says there's only 10GB of GDDR6.
 
I meant, how do you decompress the game data if you don't go through the CPU? Do games get twice as big and require twice the bandwidth? What sort of hardware addition can we expect on PC to make it feasible to have games designed to DMA straight to VRAM?

Yes, that's the point: it needs twice or more of everything upstream unless the devs use the CPU to decompress some Oodle format. Games would need to be stored uncompressed on the SSD, and that would divide the efficiency by that ratio, with RDO or BCPack multiplying the gap even more. I don't see devs wanting that just because the DMA saves CPU cores (which scale more readily on PC anyway) and adds some small efficiency improvement from cache scrubbers needing to flush the caches less often. They would just require more memory instead, for streaming in a larger time window.

The question was whether cache scrubbers are useful in the reality of PC game design and hardware.

Don't the consoles decompress everything off the SSD before it goes into main memory? Therefore the space utilised in memory (system or VRAM) would be the same as is used in the consoles' unified RAM, wouldn't it?

So your relative issues in a PC environment with using uncompressed data would be SSD capacity and bandwidth from the disk to the rest of the system. Looking at those in turn:

Bandwidth to the rest of the system - not a massive concern at the high end (PCIe 4.0), as there's ~7.5 GB/s available, which already exceeds one console's uncompressed throughput and comes reasonably close to the other's. PCIe 5.0, which may arrive as early as 2021, more than alleviates any remaining concerns. People with slower drives just have to suck it up and lower texture quality, or accept longer loading screens if they have the RAM capacity to pre-load more data.
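
(The ~7.5 GB/s figure falls straight out of the link math, before protocol overhead:)
[CODE]
# PCIe 4.0 x4: 16 GT/s per lane with 128b/130b encoding.
gt_per_s = 16
lanes    = 4
raw_gbs  = gt_per_s * (128 / 130) / 8 * lanes   # bits -> bytes
# the ~5% packet/protocol overhead below is a rough estimate
print(f"PCIe 4.0 x{lanes}: {raw_gbs:.2f} GB/s raw, ~{raw_gbs * 0.95:.1f} GB/s after packet overhead")
[/CODE]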

SSD Capacity - This is an interesting one which I've been thinking about over the past few days. The obvious solution is just to buy more disk space, which is always an option with a PC, but it's not very elegant or efficient. A better option would be on-drive decompression. I know some drives do this already, but I'm not sure how much benefit it brings in those implementations. However, I'm talking about something standardised, perhaps even tied into DirectStorage, e.g. a DirectStorage-compatible drive would feature a hardware decompression module similar to that found in the XSX. It wouldn't help with bandwidth, as it'd be on the wrong side of the PCIe bus, but it would completely alleviate the relative capacity deficit. It would also be a pretty awesome selling point for the SSD vendor, who could sell a 1TB drive but claim "an effective capacity of 2TB" using the onboard hardware decompression.
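
Rough effective-capacity numbers, for a few assumed average ratios:
[CODE]
# Effective capacity of a drive with transparent on-drive decompression,
# for a few assumed (not measured) average compression ratios.
drive_tb = 1.0
for ratio in (1.5, 2.0, 2.5):
    print(f"{drive_tb:.0f} TB drive at {ratio}:1 -> ~{drive_tb * ratio:.1f} TB of installed data")
[/CODE]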

Game vendors could then package 2 different distributables of their games, one with the DirectStorage compression applied (zlib + BCPACK) and one without. The one without simply takes up more space for people without a DirectStorage certified drive but the application would not see a difference since the data would be identical once it passes over the PCIe bus.

The beauty of the decompression module is that it wouldn't have to apply only to DirectStorage-certified games. You could presumably software-compress (or even hardware-compress, if the drive was capable) any data you wish on that drive into the correct formats for on-the-fly decompression by any application that requires it. The application would presumably never need to know.
 
Don't the consoles decompress everything off the SSD before it goes into main memory? Therefore the space utilised in memory (system or VRAM) would be the same as is used in the consoles' unified RAM, wouldn't it?
Both the consoles and the PC need to decompress the data.


Game vendors could then package 2 different distributables of their games, one with the DirectStorage compression applied (zlib + BCPACK) and one without. The one without simply takes up more space for people without a DirectStorage certified drive but the application would not see a difference since the data would be identical once it passes over the PCIe bus.
You still seem convinced publishers/developers can just push non-compressed game distributions for the PC..
They can't... textures, geometry, shadowmaps, etc. still need to be compressed or they might even not fit a regular-sized SSD at all. Just google for any game + "compression" and you'll find references to compression in error reports.
It's also been explained that DXTC compressed textures (which GPUs work on) have a relatively low compression ratio, so when placed in storage the textures get further compressed over their DXTC form.
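
For a sense of scale on that point, here's the storage math for a single 4K texture, with a guessed ~1.5:1 ratio for the extra lossless pass on top of the BC data:
[CODE]
# Storage for a 4096x4096 texture: raw RGBA8, BC-compressed (what the GPU
# samples), and BC plus an extra lossless pass. The 1.5:1 extra ratio is a
# guess; real gains vary per texture.
texels = 4096 * 4096
raw_mb = texels * 4   / 2**20   # RGBA8: 4 bytes/texel
bc7_mb = texels * 1.0 / 2**20   # BC7: 16 bytes per 4x4 block = 1 byte/texel
bc1_mb = texels * 0.5 / 2**20   # BC1: 8 bytes per 4x4 block = 0.5 byte/texel
extra  = 1.5                    # assumed lossless ratio on top of BC data
print(f"raw {raw_mb:.0f} MB | BC7 {bc7_mb:.0f} MB | BC1 {bc1_mb:.0f} MB | "
      f"BC7 on disk ~{bc7_mb / extra:.1f} MB")
[/CODE]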

Also, what are you calling a "DirectStorage certified drive"? An M.2 SSD with an ASIC dedicated to hardware decompression?
If so, does that make sense? How would e.g. a RAID setup work on such hardware?
IMO it'd make more sense to develop a 16x PCIe card that houses one or more M.2 SSDs and has hardware decompression.


Am I wrong to say that only guys under NDA will be able to know if those are right or wrong?
Yes, some things in there are pretty wrong even for people without an NDA.

If the SeriesX can maintain its substantially wider GPU at 1825MHz 100% of the time, why would the PS5, with a narrower GPU, need to go all the way down to 1750MHz?
There are no limitations on what the GPU can address, and there's plenty of GPU data that can go to the slower pool without repercussions, so the 10GB for the GPU is just wrong.


Andrew Goossen is a Microsoft System Architect working on Xbox Series X.
and we have official sources too, for the float/int hardware capabilities:
Why are you excusing the made-up data on techpowerup with a single fact about INT throughput (which may not even be exclusive to the SeriesX), when the concern you expressed is in the clock comparisons which are, in fact, based on completely made up numbers?
 
Without cache scrubbers, I suppose you would allocate the memory in two parts: one with assets for the current frame, and the other for streaming assets.

And is there a memory access mechanism in the PS5 for the GPU to bypass the caches when reading?

Both the consoles and the PC need to decompress the data.
Yes, and on console you can be sure that devs will primarily target the best compression which for many will be the hardware accelerated option, which above all else, is one single standard.

Also, what are you calling a "DirectStorage certified drive"? An M.2 SSD with an ASIC dedicated to hardware decompression? If so, does that make sense? How would e.g. a RAID setup work on such hardware?
The RAID issue is a good one. With data striped across different drives, any on-drive compression may be entirely negated. For RAID support alone, the decompression needs to be done off-drive, further up the hardware I/O hierarchy.
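
To illustrate the striping problem (stripe size, block size, and offsets below are arbitrary):
[CODE]
# Why on-drive decompression and RAID 0 striping clash: a compressed block
# that crosses a stripe boundary is split across drives, so no single
# drive's decompressor ever sees the whole stream.
STRIPE_KB  = 64
NUM_DRIVES = 2

def drives_touched(start_kb, end_kb):
    stripes = range(start_kb // STRIPE_KB, (end_kb - 1) // STRIPE_KB + 1)
    return sorted({s % NUM_DRIVES for s in stripes})

# a 64 KB compressed block starting at offset 96 KB straddles a stripe boundary
print("compressed block lands on drives:", drives_touched(96, 160))   # [0, 1]
[/CODE]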
 
Yes, and on console you can be sure that devs will primarily target the best compression which for many will be the hardware accelerated option, which above all else, is one single standard.

This has always been the case, and never been a real problem.
 
Yes, and on console you can be sure that devs will primarily target the best compression which for many will be the hardware accelerated option, which above all else, is one single standard.
Are you suggesting BCPack and Kraken won't be widely adopted, only zlib?
 
Both the consoles and the PC need to decompress the data.

You still seem convinced publishers/developers can just push non-compressed game distributions for the PC..
They can't... textures, geometry, shadowmaps, etc. still need to be compressed or they might even not fit a regular-sized SSD at all. Just google for any game + "compression" and you'll find references to compression in error reports.
It's also been explained that DXTC compressed textures (which GPUs work on) have a relatively low compression ratio, so when placed in storage the textures get further compressed over their DXTC form.

As I've said before, there is a difference between system-level IO compression/decompression of everything coming off the disk, and selective compression/decompression of specific data sets. The consoles do the former; I'm suggesting that PCs can do the latter, and therefore the decompression requirements would not have to be as high as the "5 Zen 2 cores" quoted for the PS5. It's a trade-off between disk space and CPU requirements that the consoles don't have to make, so they can simply compress everything.

You get on average 45-64% additional compression with Kraken over the "uncompressed" data feed (which would already contain GPU native compressed textures) according to Sony's own figures.

Are you suggesting that, in order not to overburden every system with fewer than 16 CPU cores, developers would not accept an install footprint 45-64% larger on the PC? Maybe they wouldn't, but that doesn't seem like the impossible scenario you're painting it to be.

And that's before you consider that some data types could likely still be compressed and sent via the CPU for decompression because they represent a much lighter load than decompressing textures for example. So that potentially gains you some of that space back for minimal CPU burden.
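
Putting rough numbers on that footprint trade-off (the 80 GB baseline is an assumption):
[CODE]
# PC install footprint if the texture/geometry data shipped without the
# extra Kraken-style pass, using the 45-64% figures above. 80 GB is an
# assumed compressed install size.
compressed_gb = 80
for extra in (0.45, 0.64):
    print(f"{compressed_gb} GB compressed -> ~{compressed_gb * (1 + extra):.0f} GB without it")
[/CODE]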

Also, what are you calling a "DirectStorage certified drive"? An M.2 SSD with an ASIC dedicated to hardware decompression?
If so, does that make sense? How would e.g. a RAID setup work on such hardware?

I'm suggesting that is one possible route that could be taken. Actually, I find it unlikely that Microsoft would stipulate a dedicated hardware decompression block in order to gain DirectStorage compliance; I think that if drives need DirectStorage compliance at all, it will be more focused on their DMA capabilities, which is likely to be more important to get out into the market. That said, they could encourage the use of a hardware decompression block through some sort of "DirectStorage Ultra/Premium" type certification which does include such a block. Or SSD makers could simply do this off their own backs, given that it could be a good selling point in terms of capacity, but then it would be less likely to be tied into how games are packaged.

The RAID point is a good one, but given that these drives are likely to be coming close to saturating the available PCIe bandwidth already, is there much benefit to RAID? Perhaps that's one technology they'd simply have to forgo in favour of the compression. Since these would be gaming-targeted drives rather than server drives, and already very high speed, I'm not seeing that as a huge loss.

IMO it'd make more sense to develop a 16x PCIe card that houses one or more M.2 SSDs and has hardware decompression.

Such a card would still be limited by a 4x PCIe interface into the CPU on non-server chips, unless you were going to use it in place of a GPU, which wouldn't make much sense in a gaming system.

Incidentally, a decompression block on the SSD would also allow for increased bandwidth utilisation when PCIe 5.0 arrives, even if the new controllers are unable to saturate that bandwidth with raw reads.
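
Rough numbers on that last point (the NAND-side rate and the 2:1 ratio are assumptions):
[CODE]
# With decompression on the drive, the bus sees the expanded stream, so a
# controller that can't saturate PCIe 5.0 with raw reads still benefits.
nand_compressed_gbs = 7.0                         # assumed NAND-side read rate
ratio               = 2.0                         # assumed average compression ratio
pcie5_x4_gbs        = 32 * (128 / 130) / 8 * 4    # 32 GT/s/lane, 128b/130b, 4 lanes
print(f"bus sees ~{nand_compressed_gbs * ratio:.0f} GB/s vs PCIe 5.0 x4 {pcie5_x4_gbs:.2f} GB/s raw")
[/CODE]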
 
Are you suggesting BCPack and Kraken won't be widely adopted, only zlib?
Nope, it's easy for devs to support one standard (Kraken on PS5, BCPack on XSX) but what compression will devs use to package their PC games? Or is the expectation that the hardware will support them all? And new ones in the future?

standards.png
 
Nope, it's easy for devs to support one standard (Kraken on PS5, BCPack on XSX) but what compression will devs use to package their PC games? Or is the expectation that the hardware will support them all? And new ones in the future?

standards.png

That's where the DirectStorage certification could potentially come in. Microsoft could stipulate that DirectStorage-compatible games must have a zlib/BCPACK-compressed distributable, and lead the way with UWP games. I'm not suggesting this is overly likely, but it's one potential solution to the problem.
 
Compression/Decompression on the peripheral side of the PCIe bus makes little sense to me. You want that on the system side as "close" (shortest distance/fastest bus) to the system memory as possible and/or on the GPU as "close" to the video memory as possible.
 