General Next Generation Rumors and Discussions [Post GDC 2020]

Isn't this impossible, though? I mean, I know we have a 22GB/s theoretical best case with max compression, but this would be writing 13 and reading 13, giving a total of 26 in less than a second? Hasn't it been confirmed there's no compressor, only a decompressor? That would mean it would take more than 2 seconds to write what's in memory to the SSD at 5.5GB/s.
To be fair, we've only seen this in practice on Xbox One games played on Series X, and they were running the One X code path, so we can assume they use 10GB or less RAM, so Series X is writing max 10GB and reading max 10GB in the 6-8 seconds they've shown in the demos. 3.3GB/s total if my math is right there, though it's possible not all games have memory fully saturated.

Good point that the 6 - 8 seconds was for backwards-compatible games! And yeah, I'd expect unused memory to be passed over when the suspend file is created. That's what Windows seems to do, maybe with some kind of cheap and fast compression.

X1X has 9GB available for games, so assuming it's all being used and there's no compression you're probably at 18GB / 2.4GB/s = 7.5 seconds. A bit less if not all of that memory is being used. So pretty much bang on what we're seeing....
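The arithmetic above is quick to sketch; the 9GB and 2.4GB/s figures are the ones quoted in this post, not official suspend specs.

```python
# Back-of-envelope suspend/resume timing using the figures quoted above.
# Numbers are this thread's assumptions, not official specifications.

def suspend_resume_seconds(ram_gb: float, write_gbps: float,
                           read_gbps: float) -> float:
    """Seconds to write RAM out to SSD and read it back, ignoring overheads."""
    return ram_gb / write_gbps + ram_gb / read_gbps

# X1X title on Series X: ~9GB game-visible RAM, 2.4GB/s raw SSD throughput.
print(suspend_resume_seconds(9, 2.4, 2.4))  # 7.5
```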

And yes, PS5 should be much faster at doing something like that, but still not 6 - 8 times faster....

I've not seen anything about hardware compression in the SoC. I know some SSDs do that by default when writing data, but AFAIK that's hidden from the OS.
 
Isn't this impossible, though? I mean, I know we have a 22GB/s theoretical best case with max compression, but this would be writing 13 and reading 13, giving a total of 26 in less than a second? Hasn't it been confirmed there's no compressor, only a decompressor? That would mean it would take more than 2 seconds to write what's in memory to the SSD at 5.5GB/s.

You really don't want to write all memory to disk. You'd want to skip writing any dynamically allocated buffers like the g-buffer, framebuffer, z-buffer, etc. Also, some memory is used by the OS, so the game isn't using the full 16GB. And if the game is implemented smartly, things like textures, sounds, etc. aren't written to disk because they're already on disk; the game would store the small amount of metadata needed to read that content back from its original location.

In many ways fast suspend is a similar problem to fast game startup. Perhaps game developers can find some neat ways to minimise the cost.
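A sketch of that idea (all names and sizes invented): classify memory regions, write only true game state, and keep mere references for regenerable buffers and disk-backed assets.

```python
# Hypothetical suspend sketch: only real game state is written out; GPU
# buffers are rebuilt on resume and disk-backed assets are re-read, so the
# suspend image stores just their metadata. All names/sizes are made up.

from dataclasses import dataclass

@dataclass
class Region:
    name: str
    size_mb: int
    kind: str  # "state" = must save, "buffer" = regenerable, "asset" = on disk

def suspend_plan(regions):
    saved_mb = sum(r.size_mb for r in regions if r.kind == "state")
    asset_refs = [r.name for r in regions if r.kind == "asset"]  # metadata only
    return saved_mb, asset_refs

regions = [
    Region("sim_state", 1500, "state"),
    Region("gbuffer",    200, "buffer"),   # skipped: recreated on resume
    Region("textures",  6000, "asset"),    # skipped: re-read from package
]
print(suspend_plan(regions))  # (1500, ['textures'])
```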
 
I seem to remember Sony said their filesystem is based on IDs instead of filenames. Using IDs would fit all kinds of streaming/suspending use cases really well. Something like a 4-byte ID per object takes almost no RAM compared to regular filenames. Of course the IDs could be smartly stored in fewer bits to save RAM if needed.

Having the SSD combine something like a 4-byte app ID and a 4-byte object ID would be rather nice. Then when suspending, something like object ID + size + begin position would be enough. The engine would read the IDs back, allocate size amount of memory and read from SSD location app ID + object ID, starting from position begin for size bytes. This would store tiled textures nicely, for example. The same easily applies to streaming: if the game needs a new part of a known object, it's just object ID + amount to read + position to read from.

edit. Since Sony's controller puts data in the right place in RAM via DMA, the above requests would also have to include the RAM location to read into/write from. I omitted this to keep the idea simple.
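A record along those lines can be just a few machine words; the field layout below is purely illustrative.

```python
# Illustrative fixed-size suspend/streaming record: app id, object id, byte
# offset within the object, size, and destination RAM address for the DMA.
# The exact field widths are invented for this sketch.

import struct

RECORD = struct.Struct("<IIQQQ")  # app_id, obj_id, offset, size, ram_addr

def pack_record(app_id, obj_id, offset, size, ram_addr):
    return RECORD.pack(app_id, obj_id, offset, size, ram_addr)

rec = pack_record(7, 42, 4096, 65536, 0x2000_0000)
print(len(rec))             # 32 bytes per object, versus a filename string
print(RECORD.unpack(rec))   # (7, 42, 4096, 65536, 536870912)
```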
 
My question still comes down to: out of 13.5GB, how little would the game code, audio, and anything else that doesn't need high bandwidth take up? 3.5GB sounds pretty small to me; I'm more expecting the slow-access stuff to overflow into the fast section than the other way around.
But I'll be more than happy for someone to show me that the game engine only needs a fraction of 3.5GB and that 10GB is a huge hindrance compared to 11GB.

Yah, I don't know what's typical. I imagine it will vary based on the game. There is probably some advantage in PS5's "all memory is equal" approach, but I'd be surprised if Series X didn't have a bandwidth advantage. Bandwidth consumption by the CPU should be the same for both platforms in multi-platform games, and what's left should be in the Series X's favour. But I don't know that for sure, and devs will figure it out.


This is also where I believe BCPack comes into play: that it allows better partial texture retrieval than other package formats.
Could be misremembering, though.
But I thought that was one of the positives he was mentioning, and why throughput could essentially be better than initial perceptions.

In the end developers will have to write code to take advantage of these things, like always; it just depends how hard it is. But some things will just get used because they're the best way to get the performance you require.

I'm just curious to see if Microsoft hasn't gone down a route that's not compatible with what devs actually want to do. Maybe their texture streaming system is not compatible with the way BCPack, Sampler Feedback are intended to work. So it becomes a case of deciding to have different texture streaming solutions on each platform, or just sticking with one and dealing with the consequences which may be unfavourable for Xbox.
 
Are you talking about the PC you and I own today or PCs that will be built in a 12-24+ months time with new bus architectures and controllers and I/O chains that take advantage of DirectStorage? Because an API cannot negate the bottlenecks that exist in the PCs that you and I own today. You need better hardware. New hardware needs a new API. The API has to come first, the hardware will come after.

What I'm not fully understanding is which hardware/bus specifically is not up to the task and why, as from what I can see there is very little difference in that regard between Zen 2 based systems and the PS5. The PS5 uses a standard PCIe 4.0 NVMe drive at a fairly standard speed (we know this because it can use commercial drives for additional storage). It connects that drive to the main APU over a standard PCIe 4.0 interface - no difference there. Within the APU itself the data stream hits the IO block. This is where most of Sony's customisation kicks in, but it's most likely based on the standard Zen 2 IO block, which is essentially state of the art in itself, with some customisations specific to the console space. When the decompressed stream leaves the IO block it almost certainly travels over the same AMD Infinity Fabric as the uncompressed PC data stream would (one that started in uncompressed form on the SSD), and onwards from there to wherever it's needed, i.e. CPU, GPU or main memory.

Granted, on a PC with a dGPU there's an extra step of passing over the PCIe 16x interface before data can get to the GPU/graphics memory, and the question of whether PCs using DirectStorage will be able to make direct point-to-point requests from GPU to SSD without going via the CPU/system memory is yet to be fully answered. But I'm not seeing anything in any of that that requires a fundamental change in PC architecture.

We just need a much better IO software interface to manage it all efficiently at these new speeds. And that's supposed to be what DirectStorage is. I've seen no sign yet of DirectStorage being a new all-encompassing hardware standard, at least not beyond the hardware that's available today, although I wouldn't be surprised if SSDs needed to be certified DirectStorage compatible, especially if point-to-point access requests by the GPU are part of the standard. If it goes beyond the SSD, though, I'd be very surprised, to be honest. For DirectStorage to be useless unless you have an as-yet-unreleased and radically different whole-system architecture - i.e. at minimum new motherboards, CPUs, storage devices, and new interfaces between them all - seems unrealistic.

You need new hardware to support a fast SSD coupled with a controller that can decompress certain data and dump it at ~20GB/s to DDR4 and/or GDDR6 without impacting the rest of the system. That's the goal.

I think we have to assume that compression won't play a factor in the PC model, because we're not going to see hardware decompression units like those in the new consoles built into PC CPU dies any time soon. But that's a reasonable assumption to make, since it's not really needed. There are drives today which exceed the XSX SSD's compressed throughput without requiring compression, and when these consoles hit the market there will be drives whose uncompressed throughput is in the same ballpark as the PS5's compressed throughput (I'm talking typical throughput, not theoretical peaks).
 
HW decompression is hugely important. This is easy to try with something like a 4GB tarball: decompression time scales pretty much linearly with the number of CPU cores when using an NVMe SSD. If heavy decompression is needed, a lot of CPU can be consumed. Of course the PC solution is to buy more memory, eat a huge load time, and then play from memory. This is something the new consoles seem to be trying to break away from (i.e. heavy streaming instead of adding a lot more RAM).

Also, this is something so many PC applications get wrong. They do single-threaded IO and run very, very slowly, even when there are good standard tools that use many cores to (de)compress:

Parallel tar
https://www.peterdavehello.org/2015/02/use-multi-threads-to-compress-files-when-taring-something/

pigz, parallel gzip
https://zlib.net/pigz/

DirectStorage is a great idea. It will be interesting to see in a few years whether there's some kind of HW acceleration for it. Maybe something in the AMD IO chip could do it nicely...
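The pigz point is easy to demonstrate in miniature: when data is compressed as independent chunks, the chunks can be inflated in parallel. This toy (arbitrary chunk sizes/contents) leans on the fact that zlib releases the GIL during (de)compression, so even threads give real speedup in CPython.

```python
# Toy version of the pigz idea: independently compressed chunks inflated
# in parallel. zlib releases the GIL during (de)compression, so threads
# scale here. Chunk count and size are arbitrary.

import zlib
from concurrent.futures import ThreadPoolExecutor

def make_chunks(n, size=256_000):
    # n independently compressed chunks of trivially compressible data
    return [zlib.compress(bytes([i % 251]) * size) for i in range(n)]

def inflate_all(chunks, workers):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(len(b) for b in pool.map(zlib.decompress, chunks))

chunks = make_chunks(8)
print(inflate_all(chunks, workers=4))  # 2048000 == 8 * 256_000
```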
 
HW decompression is hugely important. This is easy to try with something like a 4GB tarball: decompression time scales pretty much linearly with the number of CPU cores when using an NVMe SSD. If heavy decompression is needed, a lot of CPU can be consumed. Of course the PC solution is to buy more memory, eat a huge load time, and then play from memory. This is something the new consoles seem to be trying to break away from (i.e. heavy streaming instead of adding a lot more RAM).

I was thinking less in terms of decompressing in software on the CPU and more in terms of not compressing in the first place. It'd be great if the console-style compression formats did come to the PC space, but as you say, the decompression cost is way too high with currently no hardware solution in place to deal with it. Also, how would the data be stored on the SSD in the first place, since you can't guarantee everyone has the hardware to do the required decompression?

DirectStorage is a great idea. It will be interesting to see in a few years whether there's some kind of HW acceleration for it. Maybe something in the AMD IO chip could do it nicely...

Wouldn't it be limited to specific formats though? I can't see AMD wanting to take up space on their CPU dies for a decompression technology specifically aimed at gaming. Could this be put on the GPU instead or does the data stream need to be processed by the CPU first?
 
I was thinking less in terms of decompressing in software on the CPU and more in terms of not compressing in the first place. It'd be great if the console-style compression formats did come to the PC space, but as you say, the decompression cost is way too high with currently no hardware solution in place to deal with it. Also, how would the data be stored on the SSD in the first place, since you can't guarantee everyone has the hardware to do the required decompression?



Wouldn't it be limited to specific formats though? I can't see AMD wanting to take up space on their CPU dies for a decompression technology specifically aimed at gaming. Could this be put on the GPU instead or does the data stream need to be processed by the CPU first?

I would prefer to have the compressed data on the SSD. Buying bigger SSDs is not fun when there's an alternative that saves space and also gets better streaming performance.

I assume some compressed data would be used by the CPU? Thinking from AMD's POV, they could stick the decompression into their IO chip and then give the uncompressed data either to the CPU or GPU as needed. We will have PCIe 4.0 based cards in that timeframe anyway, and moving compressed data to the GPU might not buy much.

edit. Game developers would probably use whatever format is available. MS could align the format between PC and Xbox and make everyone's life easier, i.e. easy porting between PC/Xbox, and the Microsoft Game Pass story gets stronger.
 
My question still comes down to: out of 13.5GB, how little would the game code, audio, and anything else that doesn't need high bandwidth take up?
It doesn't matter though, does it? Use 5GB for non-graphics and everything else for graphics - as long as you don't need more than 10GB of graphics data, the GPU accesses RAM at full speed.
 
What I'm not fully understanding is which hardware/bus specifically is not up to the task and why, as from what I can see there is very little difference in that regard when looking specifically at Zen2 based systems compared with the PS5.

Somebody asks this question every few days; the answer is always the same. On PC, data is pulled from the SSD over the controller to main RAM, where the CPU may need to unpack and/or decompress certain data (that's read/write from/to DDR4); anything for the GPU then gets shovelled over that bus to GDDR5/6. On PS5/XSX, data goes to the controller, where some may be decompressed on the fly, then it hits RAM ready for use by CPU and GPU.

Those are literally the I/O steps.

Granted, on a PC with a dGPU there's an extra step of passing over the PCIe 16x interface before data can get to the GPU/graphics memory, and the question of whether PCs using DirectStorage will be able to make direct point-to-point requests from GPU to SSD without going via the CPU/system memory is yet to be fully answered. But I'm not seeing anything in any of that that requires a fundamental change in PC architecture.

In addition to the hardware chain there is file system overhead, plus driver and I/O overhead (IDE/SCSI, PCI, video etc.), none of which is optimised for such transfers. This is probably as big a killer of I/O performance as the SSD's lack of a direct path to GDDR5/6.

We just need a much better IO software interface to manage it all efficiently at these new speeds. And that's supposed to be what DirectStorage is.

I think a lot of people are throwing a lot of conjecture onto what the Windows version of the DirectStorage API will be. Until Microsoft actually publish this, it's conjecture. Logically, it makes sense for Windows to support smarter SSD controllers and it makes sense for SSD controllers to potentially have a more direct route to VRAM than going via main memory - but this needs new hardware. You can't magic a bus out of nowhere, it's either there or it's not. This type of access may only be supported on future graphics cards. You may recall AMD Radeon Pro GPUs with an SSD bolted on top.

Before any of this can work, Windows needs a driver and I/O framework overhaul, that is I assume what DirectStorage will facilitate.

I would prefer to have the compressed data in ssd. Buying bigger ssd's is not fun when there is alternative to save space and also get better streaming performance.

Me too. This could make game installations more complex, because some people may favour the compression method that uses the least SSD storage at the expense of slower load/decompression times, while others may wish to tip that balance the other way: faster loads/decompression at the expense of more SSD space.
 
Good point that the 6 - 8 seconds was for backwards-compatible games! And yeah, I'd expect unused memory to be passed over when the suspend file is created. That's what Windows seems to do, maybe with some kind of cheap and fast compression.

X1X has 9GB available for games, so assuming it's all being used and there's no compression you're probably at 18GB / 2.4GB/s = 7.5 seconds. A bit less if not all of that memory is being used. So pretty much bang on what we're seeing....

And yes, PS5 should be much faster at doing something like that, but still not 6 - 8 times faster....

I've not seen anything about hardware compression in the SoC. I know some SSDs do that by default when writing data, but AFAIK that's hidden from the OS.

PS4 games are 5GB or 5.5GB. That's the memory available to games on PS4 and PS4 Pro.
 
I think a lot of people are throwing a lot of conjecture onto what the Windows version of the DirectStorage API will be. Until Microsoft actually publish this, it's conjecture. Logically, it makes sense for Windows to support smarter SSD controllers and it makes sense for SSD controllers to potentially have a more direct route to VRAM than going via main memory. You may recall AMD Radeon Pro GPUs with an SSD bolted on top.

Before any of this can work, Windows needs a driver and I/O framework overhaul, that is I assume what DirectStorage will facilitate.

Decompression on the SSD controller is not a great idea. One would have to make the PCIe 4.0 bus wider to transfer the decompressed data. In AMD terms a much better place for decompression would be their IO chip: pay the decompression price in the IO controller and move the uncompressed data to main RAM or GPU RAM as needed, while keeping only a 4-lane PCIe 4.0 bus between the SSD and the motherboard.
 
It doesn't matter though, does it? Use 5GB for non-graphics and everything else for graphics - as long as you don't need more than 10GB of graphics data, the GPU accesses RAM at full speed.
This is my point: it has been brought up that 10GB of high-bandwidth memory isn't enough for next-gen 4K games.
I have trouble accepting that at face value, and also that 3.5GB is overkill for everything that can reside in the slow section of memory.
 
Decompression on the SSD controller is not a great idea. One would have to make the PCIe 4.0 bus wider to transfer the decompressed data. In AMD terms a much better place for decompression would be their IO chip: pay the decompression price in the IO controller and move the uncompressed data to main RAM or GPU RAM as needed, while keeping only a 4-lane PCIe 4.0 bus between the SSD and the motherboard.

Yeah, but Windows needs an API for the future, when PCI is a few iterations on. Nobody can build better hardware to solve these problems without understanding the underlying Windows architecture. Microsoft has spent a decade making sure I/O processes and device handling are absolutely bullet-proof in Windows, both in terms of performance and I/O integrity (security), and now the design of the new consoles is highlighting how having all this disassociated can be bad for performance.

How do you manage integrity with the GPU potentially having access to the SSD, or vice versa? This is not a simple problem to solve, not unless Microsoft want to rebuild much of Windows' kernel, and I'm sure they don't. :nope:
 
Yeah, but Windows needs an API for the future, when PCI is a few iterations on. Nobody can build better hardware to solve these problems without understanding the underlying Windows architecture. Microsoft has spent a decade making sure I/O processes and device handling are absolutely bullet-proof in Windows, both in terms of performance and I/O integrity (security), and now the design of the new consoles is highlighting how having all this disassociated can be bad for performance.

How do you manage integrity with the GPU potentially having access to the SSD, or vice versa? This is not a simple problem to solve, not unless Microsoft want to rebuild much of Windows' kernel, and I'm sure they don't. :nope:

You create a SW API that is HW agnostic: data in, data out. Preferably data out to RAM at a specific address, like Sony does. The API can be implemented on the CPU, GPU or SSD. Implementing it on the SSD is never a great idea due to the PCIe bus and the limited number of lanes on regular CPUs. You want the implementation on the CPU/GPU, or in AMD's case the IO controller, to save PCIe lanes for better use.
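One way to sketch that hardware-agnostic "data in, data out" interface: the caller names an object and a destination buffer; whether the decompression happens on the CPU, GPU, or an IO block is the backend's business. All names here are invented for illustration.

```python
# Sketch of a HW-agnostic storage API. The abstract interface only promises
# "decompressed bytes of this object land in this buffer"; backends decide
# where the work happens. Everything below is illustrative, not a real API.

from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    def read_into(self, obj_id: int, offset: int, size: int,
                  dest: bytearray) -> int:
        """Read up to `size` decompressed bytes of `obj_id` into `dest`."""

class CpuBackend(StorageBackend):
    """Plain software fallback; a console-style backend would DMA instead."""
    def __init__(self, store):
        self.store = store  # obj_id -> decompressed bytes

    def read_into(self, obj_id, offset, size, dest):
        data = self.store[obj_id][offset:offset + size]
        dest[:len(data)] = data
        return len(data)

backend = CpuBackend({42: b"texture-bytes" * 100})
buf = bytearray(64)
print(backend.read_into(42, 0, 64, buf))  # 64
```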
 
So question on BCPack... Does what is "BCPack(ed)" need to be decompressed for GPU use or is it a new native GPU format?
Per the Digital Foundry article on the Series X, the BCPack hardware is part of the decompression block paired with the SSD. The GPU would appear to be a consumer of the output of that block.
https://www.eurogamer.net/articles/digitalfoundry-2020-inside-xbox-series-x-full-specs
"Our second component is a high-speed hardware decompression block that can deliver over 6GB/s," reveals Andrew Goossen. "This is a dedicated silicon block that offloads decompression work from the CPU and is matched to the SSD so that decompression is never a bottleneck. The decompression hardware supports Zlib for general data and a new compression [system] called BCPack that is tailored to the GPU textures that typically comprise the vast majority of a game's package size."
The GPU's lossy texture compression formats would seem to fit within the systems both consoles have for their SSD compression. Perhaps BCPack has implications for the data formats or compression settings within its payload, but I don't see a benefit in BCPack decompressing to fully uncompressed texture data, after which the GPU's routine accesses would choke.
The same goes for the PS5, where its decompression block should be a compression layer on top of the lossily-compressed textures that the GPU natively handles.
Perhaps BCPack does more to compress these, or it has additional lossy formats?

They are following multiples of the PS4's CUs, though. Their BC method is probably tied to that for whatever reason. Their other option was probably 64 CUs, which was likely avoided due to cost.
It's also the case that node transitions give a rough 2x of the transistor budget, which seems to bias the outcome a bit.

Thought experiment - assuming that the basic premise of the tweets was correct:

The likely places Series X could fall behind are: SSD throughput/latency, GPU bandwidth, GPU front-end (clock speed)
Can you expand on where the Series X falls behind on GPU bandwidth? Is that DRAM bandwidth, or bandwidth elsewhere?
The PS5's clock isn't going to make it win in total bandwidth for any per-CU caches.
Perhaps the L1, assuming the Series X GPU didn't adjust its size/bandwidth. One reason why it might need to depends on whether the L2's slice count increased to mirror the wider memory bus. In RDNA, the L1 is subdivided to match the number of L2 groups, since there are 4 slices per 64-bit controller, and the L1's subdivisions match how many requests it can respond to per clock.
The Series X may have 5 L2 groups, in which case the L1 might increase to have 5 sections, and thus 5 requests per clock, which would keep it above the PS5.
However, if the Series X doesn't create a 5th L2 group, it might mean that the L1 and L2 capabilities are as wide per-clock as the probable PS5 arrangement, and then clock speed could have an effect.
One possible complication to adding another cache division like that is that the ROP caches are aligned in a specific manner, and some of the no-flush benefits that Vega touted for making them L2 clients didn't hold if there was some kind of misalignment (maybe for an APU?).

There is some evidence from past product transitions that not expanding the internal network to match DRAM can have an impact, such as benchmarks and memory-intensive tests indicating that Fury didn't always do much better than Hawaii despite having an HBM interface, with signs that the internal L2 bandwidth didn't scale as expected.
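Rough numbers for the L1 request-rate point above: per-clock request capability times clock speed. The Series X slice counts are speculation from this post; the clocks are the announced GPU clocks.

```python
# Per-clock L1 request capability times clock, per the discussion above.
# Series X slice counts are speculative; clocks are the announced ones.

def l1_requests_per_ns(slices: int, clock_ghz: float) -> float:
    # one request per L1 subdivision per clock (RDNA-style assumption)
    return slices * clock_ghz

print(l1_requests_per_ns(4, 2.23))   # PS5-like: 4 slices, higher clock -> ~8.9
print(l1_requests_per_ns(5, 1.825))  # Series X with a 5th L2 group -> ~9.1
print(l1_requests_per_ns(4, 1.825))  # Series X staying at 4 groups -> ~7.3
```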



Pretty much. There's no reason for 36 CUs beyond that, and we know devs did target specific CUs with their code. It seems bizarre that the GPU is so constrained, as we're used to swapping GPUs with differing core counts on PC and it just working, and it's hard to imagine why devs would still be targeting things so low-level that games can break on compatible hardware. But if you think about it, there's some reason, even if odd, to go with 36 CUs, whereas there's no particular reason to go with a really hot, narrow chip. So BC seems the only justification.
That was the alleged reason for why the Pro's BC mode only exposed 18 CUs to old software. It didn't stop the Pro from having 36 CUs.
 
Somebody asks this question every few days; the answer is always the same. On PC, data is pulled from the SSD over the controller to main RAM, where the CPU may need to unpack and/or decompress certain data (that's read/write from/to DDR4); anything for the GPU then gets shovelled over that bus to GDDR5/6. On PS5/XSX, data goes to the controller, where some may be decompressed on the fly, then it hits RAM ready for use by CPU and GPU.

Those are literally the I/O steps.

Yep, I understand that. But if we're talking about an uncompressed data stream rather than a compressed one (as I can't see how PCs can work with a compressed one without dedicated decompression hardware, which they won't get any time soon outside of the GPU), then the steps you describe above are the same for either a PC or the PS5, with the exception of the additional step of data going over the PCIe 16x bus to the GPU. So yes, that's an extra step not required by consoles, which will add latency, but from a bandwidth perspective it's not a concern, as that interface is 4x wider than the one between the SSD and the APU/CPU in either the PS5 or the PC.

I totally agree with your points on decompression, though. If data needs to be compressed for whatever reason, and also processed by the CPU instead of or before heading to the GPU, then it would need to be decompressed by the CPU in software, and that is a disadvantage of the current PC architecture which won't be overcome for quite some time. So the big questions are: 1. do PCs actually have to use data compression beyond those formats already natively handled by the GPU (which, just to note, no one seems to be factoring into these bandwidth comparisons), and 2. how much of that data needs to be processed by the CPU, and how much can simply be sent over the PCIe bus in compressed form for the GPU to read natively?

I'd be curious to know how much extra compression the consoles' custom formats gain over the formats already used natively by GPUs today. Is 5GB/s on a modern PC drive already the equivalent of something higher using those formats?
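The comparison being asked about here is just raw drive throughput times compression ratio. The ratios below are illustrative guesses for the sake of the arithmetic, not measured figures.

```python
# Rough effective-bandwidth comparison: raw drive throughput times an assumed
# compression ratio. Ratios are illustrative guesses, not measurements.

def effective_gbps(raw_gbps: float, compression_ratio: float) -> float:
    """Delivered (decompressed) GB/s when reading compressed data."""
    return raw_gbps * compression_ratio

print(effective_gbps(5.5, 1.64))  # PS5-style: 5.5 raw -> ~9 "typical" claimed
print(effective_gbps(2.4, 2.0))   # XSX-style: 2.4 raw -> 4.8 claimed
print(effective_gbps(5.0, 1.0))   # fast PC drive reading uncompressed data
```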

In addition to the hardware chain there is file system overhead, plus driver and I/O overhead (IDE/SCSI, PCI, video etc.), none of which is optimised for such transfers. This is probably as big a killer of I/O performance as the SSD's lack of a direct path to GDDR5/6.

That's what I think we're all assuming DirectStorage is being introduced to address. No-one knows for sure yet, at least not those who aren't under NDA, so we can't definitively say that it will solve all these legacy software issues. But nor can we assume it won't.

I think a lot of people are throwing a lot of conjecture onto what the Windows version of the DirectStorage API will be. Until Microsoft actually publish this, it's conjecture. Logically, it makes sense for Windows to support smarter SSD controllers and it makes sense for SSD controllers to potentially have a more direct route to VRAM than going via main memory. You may recall AMD Radeon Pro GPUs with an SSD bolted on top.

Before any of this can work, Windows needs a driver and I/O framework overhaul, that is I assume what DirectStorage will facilitate.

It sounds like we're on the same page in this regard. So to summarise, my point is that if DirectStorage sorts out the software side, there's not a whole lot on the hardware side blocking modern PCs from having storage throughput quite similar to the new consoles', just without the custom data compression. The hardware capabilities and interfaces are largely the same, with the exceptions of Sony's customisations in the I/O block and the extra step to accommodate dGPUs.
 
That was the alleged reason for why the Pro's BC mode only exposed 18 CUs to old software. It didn't stop the Pro from having 36 CUs.
What other reason is there for choosing 36 CUs and then having to engineer complex, expensive cooling, when more CUs get you more performance at the same cost?
 
What other reason is there for choosing 36 CUs and then having to engineer complex, expensive cooling, when more CUs get you more performance at the same cost?
As in: you need 18 CUs for legacy PS4 and 36 CUs for legacy 4 Pro. Why not go just above that and disable down to the number of CUs required for BC? I suppose if the goal is to run as much BC as possible at native clocks, would it be advantageous to keep it at 18/36 CUs?
 
As in: you need 18 CUs for legacy PS4 and 36 CUs for legacy 4 Pro. Why not go just above that and disable down to the number of CUs required for BC?
I understand that argument, but if that's doable, why hasn't Sony done it? In the GitHub controversy, the idea was floated that Sony would do just that, with the 36 CUs showing only for BC testing. It turns out Sony did go narrow and then clock really high. It seems an odd choice.

I think a second part might be the SSD. I think Sony invested very heavily in that and thought that, coupled with the ~9 TF that 36 CUs would provide, it would be okay, and then they pushed that 36 CU part's clocks. If they had decided wider and more TF was the target, perhaps the SSD solution would have been simpler and cheaper.
 