Velocity Architecture - Limited only by asset install sizes

rntongo · Nov 3, 2020

function said:
Well it certainly sounds like you know an awful lot more about this than me, and with the points you're making I'm inclined to think what you're saying is right, so I guess I'm a convert. There's just a couple of things I'd like to try and clear up about what I was saying.

These three points I think I do understand. I was assuming there was something on the SoC to allow data to be presented to the GPU in a form it could use after the GPU had requested it - it wouldn't need to know (or care) about the format in which it was actually stored on the SSD as that was all abstracted away. And while BW of an SSD is low and its latency is high, I figured especially if the developer knew they only wanted to use the data once e.g. software decompression of assets as they are streamed in, they might as well get it done while it's on chip (but as you say this presents practical problems). I wasn't proposing, say, pulling the same texture samples off the SSD for multiple frames.

I am curious now about the different ways in which a system might bypass ram and stay on SoC, however impractical and unlikely it might be. Something fun to think about in bed at night, anyway.

I think persistent memory is the best chance of bypassing RAM but it's still not a replacement. But yeah all the major optimizations you're going to see are about efficiently utilizing the 13.5GB-14GB for games. Be it significantly increasing the amount of active RAM like Sony is doing or much more efficient demand paging by Microsoft. The biggest gains come from that. The only thing I would have changed about the Series X is the raw I/O bandwidth for the SSD. 3.75GB/s would have been better. Devs on the PS5 can load in well over 10GB of data into RAM without even worrying about anything. Considering the Series X has about 10GB for game optimal RAM, it would have been great if it could fill that up in a second. So around 3.75-4GB/s with a decompression ratio of 2.5. Otherwise 4.8-6GB/s is not bad at all. I can't wait to see what developers achieve with these systems.
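To put rough numbers on the fill-time point, here is a minimal sketch of the arithmetic. The 10GB pool, the 3.75-4GB/s raw figures and the 2.5 ratio come from the post; the 2.4GB/s raw figure and the 2.0 ratio are assumptions taken from the publicly quoted Series X spec, and the function names are made up for illustration.

```python
# Back-of-envelope fill times for a 10 GB pool of RAM.
# raw bandwidth * decompression ratio = effective (decompressed) bandwidth.

def effective_bandwidth(raw_gb_s, ratio):
    """Decompressed throughput in GB/s for a given raw SSD rate."""
    return raw_gb_s * ratio

def fill_time_seconds(pool_gb, raw_gb_s, ratio):
    """Seconds needed to fill pool_gb of RAM with decompressed data."""
    return pool_gb / effective_bandwidth(raw_gb_s, ratio)

POOL_GB = 10.0  # the "GPU optimal" pool mentioned in the post

for raw in (2.4, 3.75, 4.0):      # 2.4 is an assumed Series X raw rate
    for ratio in (2.0, 2.5):      # 2.0 is an assumed lower-bound ratio
        eff = effective_bandwidth(raw, ratio)
        secs = fill_time_seconds(POOL_GB, raw, ratio)
        print(f"raw {raw:.2f} GB/s x {ratio} -> {eff:.2f} GB/s, "
              f"{POOL_GB:.0f} GB in {secs:.2f} s")
```

Under these assumptions a 4GB/s raw drive at a 2.5 ratio fills the pool in about a second, while 2.4GB/s raw lands in the 4.8-6GB/s effective range quoted above.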

function · Nov 3, 2020

thicc_gaf said:
Now they could've been referring to that texture being in main memory but IIRC they were describing a feature of the storage with that response. And, it's possible (very possible, actually) that it was in line with what rtongo is describing, i.e. the process is virtually invisible to the developer in terms of requesting the data from the SSD because some slice of system memory is being reserved as a cache of sorts to map and stream in data from the SSD (in Series X's case, the 100 GB partition; I'm not using that to describe a hard-segmented region of the drive like drive partitions generally are, I just lack a better term for it), so it's automatically handling those operations for the developer in the background.

rtongo is probably best placed to answer this, but I'll have a crack at adding a little based on what I've read.

I think SFS is intended to be used (or be usable) transparently to the app. As in, I think you probably could query the SFS residency map and tile map manually, but you could also leave the system to manage itself once you've set it up. The description of SFS given at Hotchips indicates that you start by evicting no-longer needed pages from memory, then trigger loads of the new stuff. There are likely constraints which mean you'll only do this within the maximum size of some specified texture paging pool, otherwise the churn of what's in memory could end up being huge with ugly LOD fighting between textures.

(As an aside SFS mip map pages are, iirc, 64KB in size which I think probably matches the SSD block size, or a multiple of it but probably not less).
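The "evict first, then load, within a fixed pool" idea can be sketched as a toy model. This is not the SFS or DirectX API: the class name, the least-recently-used eviction policy and the page keys are all assumptions for illustration, and only the 64KB page size comes from the post.

```python
from collections import OrderedDict

PAGE_SIZE = 64 * 1024  # 64KB pages, as recalled in the post

class TexturePagePool:
    """Toy fixed-size paging pool with least-recently-used eviction."""

    def __init__(self, capacity_bytes):
        if capacity_bytes < PAGE_SIZE:
            raise ValueError("pool must hold at least one page")
        self.max_pages = capacity_bytes // PAGE_SIZE
        # Keys are hypothetical (texture_id, mip_level, tile_index) tuples,
        # ordered from least to most recently used.
        self.resident = OrderedDict()

    def request(self, page_key):
        """Make page_key resident; return the list of pages evicted for it."""
        if page_key in self.resident:
            self.resident.move_to_end(page_key)  # mark as recently used
            return []
        evicted = []
        # Evict no-longer-needed pages first, then "load" the new one.
        while len(self.resident) >= self.max_pages:
            old_key, _ = self.resident.popitem(last=False)
            evicted.append(old_key)
        self.resident[page_key] = True
        return evicted

pool = TexturePagePool(2 * PAGE_SIZE)
pool.request(("rock", 0, 0))
pool.request(("rock", 1, 0))
pool.request(("rock", 0, 0))          # touch again, now most recent
print(pool.request(("grass", 0, 0)))  # the mip 1 rock page gets evicted
```

Capping the pool size is what bounds the churn: a request can only ever push out as many pages as it brings in.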

I guess this is getting hypothetical now but, supposing there was ever a way for a processor to convert/break-up block data into byte data on the fly and then drop it within a small, high-level on-board cache (an L4$ for example), then pull that in to the lower-level caches of the processor....even knowing the gargantuan bandwidth difference between main memory and SSDs, if data of a given size, say 10 MB, were expected to be used for a few frames by the processor, does that come to a point where there isn't much of a difference between just doing it through the (not currently in existence) method I'm describing here, vs. the more traditional process?

Well that's kind of what I had been thinking about; an area of on-chip memory for the IO/decompression block to put stuff into. Kind of like a train marshalling yard, where data could be copied to main ram, or read into cache with the same address that it had while on SSD and mapped into "virtual memory", or both (with a virtual memory address for the on chip cached version and a separate one for the simultaneous copy in ram).

But I have to agree this would seem complex and I'm not the person to say how you'd do it (part of the fun of this is trying to work out how the thing you've just suggested would actually work!). But I think you uncovered a really interesting question (for me anyway) in this next bit:

Because if a game has a 60 FPS target for example, if it were doing nothing but pulling in data from storage and that storage were even, say, 2 GB/s for uncompressed data, on those frames that'd come to about 34 MB per frame. That's ignoring latency of SSDs though, but let's say the game in question just needs that 10 - 34 MB of data by x amount of time; do you think it would be worth this hypothetical implementation for storage data access to processor cache (let alone if it'd be possible even with changes in memory access schemes) if it meant simply accounting for the latency ahead of time and that processor buffering a read of that 10 - 34 MB of data from storage into a local L4$ so that the data is there by the time it's needed by that processor?

Well when you put it like that, it probably wouldn't be. 10 - 34MB would represent a very large investment in die area for the sram. CPU and GPU L2 combined are only 13 MB. If you could have a much smaller area of cache and work on it constantly for outputting to ram (e.g. decompression) perhaps it would work out better, but then that's a lot of work for the game developer to manage, and if they fall behind and the cache filled they would either lose data or parts of the IO would stall.

I guess this is another point in favour of rtongo's argument. Silicon is a precious resource, and there are better ways to use large amounts of it than on an SSD cache.
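The per-frame figure is easy to check. A minimal sketch, assuming binary megabytes (1024 MB per GB) for the 34 MB number; the 2 GB/s, 60 FPS and 13 MB values come from the posts above, and the function name is made up.

```python
def per_frame_mb(bandwidth_gb_s, fps, mb_per_gb=1024):
    """Data (in MB) that a storage link can deliver in one frame."""
    return bandwidth_gb_s * mb_per_gb / fps

budget = per_frame_mb(2.0, 60)             # binary MB
budget_decimal = per_frame_mb(2.0, 60, 1000)

L2_COMBINED_MB = 13.0  # CPU + GPU L2 figure quoted in the reply

print(f"per-frame budget: {budget:.1f} MB (binary), "
      f"{budget_decimal:.1f} MB (decimal)")
print(f"that is {budget / L2_COMBINED_MB:.1f}x the combined L2")
```

So a cache big enough to hold one frame's worth of streamed data would be well over twice the size of all the L2 on the chip, which is the die-area problem described above.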

Honestly wonder if that type of memory access design is feasible or desirable, or if it'd just be better to stick with increasing the bandwidth of components to boost the traditional memory access schemes as they already exist. Because I know that this hypothetical I'm describing could also complicate cache coherency, but I figure the offset benefit would be less bus contention on main system memory in hUMA designs by making use of SSD storage more flexible and less reliant on main memory acting as a middleman. Plus I've been thinking of technological features for 10th-gen consoles and this has been something on my mind.

A level 4 (or perhaps L3 using the RDNA L0, L1, L2 classification) cache might not be a bad idea, but given the number of BW and latency critical operations involving ram instead of SSD those resources might be best used elsewhere.

But to bring it back on topic....forgetting the whole conversion stuff, assuming the GPU could work with the block data, or supposing the SSD storage were a persistent memory solution but at SSD bandwidths and lower-quality persistent memory (so higher latency)....could the raw bandwidth difference between that and main memory be offset through considering what size of data would be needed on a per-frame basis, and ensuring the processor have a prefetch window large enough to simply read in that small amount of memory (we're talking like 10 or so MBs) for a given range of cycles, streaming it in, to work with it directly through its caches, and be a valid means of accessing data, versus still needing to copy the data to main memory and then do such?

If you have a prefetch window the length of a frame or two you can probably afford to put data in ram straight from the SSD (SSD latency is probably between 100 and 1000 times greater than ram for a busy GPU memory controller).

Addressable persistent memory like Optane wouldn't hurt (again like rtongo mentioned) but that's a lot of cost for a relatively small increase in speed. It does give you a crap load of directly addressable memory, but given the cost I think developers would prefer more ram, more performance, and being given the tools to efficiently manage what you put in ram off the SSD.

As always, not a ProLeetExpert, just my thoughts.

I hope that was clear x3.

Yeah man, I think I got what you were getting at. Hopefully you agree.

function · Nov 3, 2020

rntongo said:
I think persistent memory is the best chance of bypassing RAM but it's still not a replacement. But yeah all the major optimizations you're going to see are about efficiently utilizing the 13.5GB-14GB for games.

Yeah, after looking at the current cost of something like Optane, I don't see how it could realistically factor into a console vendor's BOM calculations. Faster SSDs, more ram, and more processing power seem to be the most universal and cost effective ways of driving performance from a high level view.

I'm really looking forward to seeing how SFS works out, although as it seems to require explicit use it might be a while till we see it or till developers talk about it.

Be it significantly increasing the amount of active RAM like Sony is doing or much more efficient demand paging by Microsoft. The biggest gains come from that.

I have a feeling that the amount of game available ram on PS5 will be very close to the XSX. I'm guessing something like 13 ~ 14GB for PS5 to the XSX's 13.5. Sony's SSD is faster, but I think the same types of core operations will be kept in ram - including buffering 4K video footage ready for saving.

The only thing I would have changed about the Series X is the raw I/O bandwidth for the SSD. 3.75GB/s would have been better. Devs on the PS5 can load in well over 10GB of data into RAM without even worrying about anything. Considering the Series X has about 10GB for game optimal RAM, it would have been great if it could fill that up in a second. So around 3.75-4GB/s with a decompression ratio of 2.5. Otherwise 4.8-6GB/s is not bad at all. I can't wait to see what developers achieve with these systems.

Yeah, I was a bit surprised at the speed of the XSX SSD at first as the controller can as you say go faster. But then I figured cost of higher frequency flash and power/heat of the expansion drive might be factors. I still think TLC might be on the cards (for both XSX and PS5) so that might factor into cost calculations too (durability vs read speed - Sony also retaining durability but getting more speed through lower clocks with more channels).

The "GPU optimal" 10 GB on the XSX seemed a bit odd at first, but the Hotchips presentation made it clear how much they needed BW for certain GPU activities. The final 3.5 GB of game ram at 336 GB/s is still overkill for CPU / Audio / OS accesses, and probably has room to spare for things like supplementary texture sampling. XSS only has 224 GB/s for its entire, slightly gimpy existence*.

*I actually quite like the XSS.

Speaking of BW, I'm hoping we'll find out more about how Infinity Cache is used on RDNA2 cards to effectively double BW (according to AMD). I know neither PS5 nor XSX have it, but I'd be interested to see exactly what they're putting in there.

Next gen is going to be great. PC, PS5, Xbox. I don't think there are going to be any losers.

scently · Nov 3, 2020

function said:
Yeah, after looking at the current cost of something like Optane, I don't see how it could realistically factor into a console vendor's BOM calculations. Faster SSDs, more ram, and more processing power seem to be the most universal and cost effective ways of driving performance from a high level view.

I'm really looking forward to seeing how SFS works out, although as it seems to require explicit use it might be a while till we see it or till developers talk about it.

I have a feeling that the amount of game available ram on PS5 will be very close to the XSX. I'm guessing something like 13 ~ 14GB for PS5 to the XSX's 13.5. Sony's SSD is faster, but I think the same types of core operations will be kept in ram - including buffering 4K video footage ready for saving.

Yeah, I was a bit surprised at the speed of the XSX SSD at first as the controller can as you say go faster. But then I figured cost of higher frequency flash and power/heat of the expansion drive might be factors. I still think TLC might be on the cards (for both XSX and PS5) so that might factor into cost calculations too (durability vs read speed - Sony also retaining durability but getting more speed through lower clocks with more channels).

The "GPU optimal" 10 GB on the XSX seemed a bit odd at first, but the Hotchips presentation made it clear how much they needed BW for certain GPU activities. The final 3.5 GB of game ram at 336 GB/s is still overkill for CPU / Audio / OS accesses, and probably has room to spare for things like supplementary texture sampling. XSS only has 224 GB/s for its entire, slightly gimpy existence*.

*I actually quite like the XSS.

Speaking of BW, I'm hoping we'll find out more about how Infinity Cache is used on RDNA2 cards to effectively double BW (according to AMD). I know neither PS5 nor XSX have it, but I'd be interested to see exactly what they're putting in there.

Next gen is going to be great. PC, PS5, Xbox. I don't think there are going to be any losers.

Yeah, I am really looking forward to this coming gen. Both consoles are engineered to last. On some level, I think we might not see mid-gen updates. These consoles are much more impressive relative to the current-gen at launch.

thicc_gaf · Nov 5, 2020

function said:
rtongo is probably best placed to answer this, but I'll have a crack at adding a little based on what I've read.

I think SFS is intended to be used (or be usable) transparently to the app. As in, I think you probably could query the SFS residency map and tile map manually, but you could also leave the system to manage itself once you've set it up. The description of SFS given at Hotchips indicates that you start by evicting no-longer needed pages from memory, then trigger loads of the new stuff. There are likely constraints which mean you'll only do this within the maximum size of some specified texture paging pool, otherwise the churn of what's in memory could end up being huge with ugly LOD fighting between textures.

(As an aside SFS mip map pages are, iirc, 64KB in size which I think probably matches the SSD block size, or a multiple of it but probably not less).

Speaking of SFS and the mip maps, so we already know they have hardware in the GPU for blending the mips. I've seen some documents with visualizations of this twice, maybe three times, over the past few months but didn't bother to hold on to them. Did those document drawings show if this mip blending hardware had any cache on it locally? If so would it have fallen within 64 KB in size to match the block size of the SSDs?

I think Jason Ronald or another person on the engineering team said something about even smaller page sizes for sampling when it came to SFS, but I can't recall the specific context it pertains to.

Well that's kind of what I had been thinking about; an area of on-chip memory for the IO/decompression block to put stuff into. Kind of like a train marshalling yard, where data could be copied to main ram, or read into cache with the same address that it had while on SSD and mapped into "virtual memory", or both (with a virtual memory address for the on chip cached version and a separate one for the simultaneous copy in ram).

But I have to agree this would seem complex and I'm not the person to say how you'd do it (part of the fun of this is trying to work out how the thing you've just suggested would actually work!).

Inclined to agree; it's probably a lot of work for not enough payoff. Some team of MIT engineers out there probably have a research paper on this kind of stuff, might be worth a Google search some day.

Well when you put it like that, it probably wouldn't be. 10 - 34MB would represent a very large investment in die area for the sram. CPU and GPU L2 combined are only 13 MB. If you could have a much smaller area of cache and work on it constantly for outputting to ram (e.g. decompression) perhaps it would work out better, but then that's a lot of work for the game developer to manage, and if they fall behind and the cache filled they would either lose data or parts of the IO would stall.

I guess this is another point in favour of rtongo's argument. Silicon is a precious resource, and there are better ways to use large amounts of it than on an SSD cache.

A level 4 (or perhaps L3 using the RDNA L0, L1, L2 classification) cache might not be a bad idea, but given the number of BW and latency critical operations involving ram instead of SSD those resources might be best used elsewhere.

It would probably only have a chance of working if something like embedded MRAM took off... and became affordable. There are already some embedded designs out there using it, but nothing for the consumer electronics market AFAIK, and it's crazy expensive. But the benefits seem worth it; nearly SRAM-level speeds and larger capacities, and able to be baked onto the chip itself.

A lot of projections for future designs, maybe 10 years out, show STT-MRAM starting to replace SRAM at the L3$ level; some even show something similar for DRAM, but capacities have to grow by orders of magnitude before that happens (and right now the largest capacities are still maybe 16 MB to 32 MB, last I checked).

If you have a prefetch window the length of a frame or two you can probably afford to put data in ram straight from the SSD (SSD latency is probably between 100 and 1000 times greater than ram for a busy GPU memory controller).

Addressable persistent memory like Optane wouldn't hurt (again like rtongo mentioned) but that's a lot of cost for a relatively small increase in speed. It does give you a crap load of directly addressable memory, but given the cost I think developers would prefer more ram, more performance, and being given the tools to efficiently manage what you put in ram off the SSD.

Personally I think persistent memory, once it can be had in good quantities for good prices, will make a great complement to volatile and NAND memories, even in console designs. RAM prices don't seem to be decreasing as quickly as they did in the past, if MS's statements at Hot Chips are anything to go by, and that rate of decrease will probably slow going into the future. So if even getting 32 GB in future console designs is pushing the high-end, that opens up a large window for persistent RAM.

Particularly the type that functions more like DRAM; IIRC Optane has SSD-style and DRAM-style variants. I don't know if the SSD version is byte-addressable, but it does have much lower latency than the typical SSD. Future designs could leverage a respectable amount of faster main memory plus a pool of directly addressable persistent memory at potentially 4x the main memory amount, with more mature bandwidths and latencies. I think that grants a lot of design freedom for creators as long as they're willing to learn how to naturally integrate another memory space in the hierarchy into their designs.

As always, not a ProLeetExpert, just my thoughts.

Yeah man, I think I got what you were getting at. Hopefully you agree.

No problems at all dude; I welcome these kind of discussions and feel like I learn a heck of a lot from the insights provided in general on the boards. And a lot of points raised here in particular on your end, make a lot of sense to me :yes:

To some of the other stuff you bring up in the other reply:

>I was one of the people honestly thinking persistent memory would be in PS5 and/or Series X. Feeling incredibly stupid for that in hindsight, I didn't know any better xD

>For next-next gen systems, I still think it has a good shot of being there. The bandwidth will never match main memory (especially if next-next gen go with an HBM-based main memory), but if the persistent memory is DRAM-class in speed, latency and function, it should be able to serve a great supplement between main memory and SSD memory as long as they can get the pricing down enough and amounts anywhere between 3x-4x the amount of main memory for similar (or lower) prices.

You can do a breakdown of the official MSRP for Intel's Optane DC Persistent Memory and I think it comes to around $6 per GB, tho I haven't checked the launch prices for it in a long time. That's for the 128 GB option. That price would include profit margins of course, but even if you halve the cost it'd still need to come down a good bit more per GB before it's a viable solution in a console design.
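As a rough illustration of that cost argument, here is the arithmetic at the recalled price. The $6 per GB figure and the 128 GB capacity are the poster's recollection, not a verified price, and the function name is made up.

```python
def module_cost_usd(capacity_gb, usd_per_gb):
    """Cost of a memory module at a flat price per GB."""
    return capacity_gb * usd_per_gb

USD_PER_GB = 6.0   # recalled MSRP per GB, unverified
CAPACITY_GB = 128  # the option mentioned in the post

full_price = module_cost_usd(CAPACITY_GB, USD_PER_GB)
half_price = module_cost_usd(CAPACITY_GB, USD_PER_GB / 2)

print(f"{CAPACITY_GB} GB at ${USD_PER_GB:.0f}/GB: ${full_price:.0f}")
print(f"{CAPACITY_GB} GB at half that: ${half_price:.0f}")
```

Even at half the recalled price, the module alone would cost several hundred dollars, which is why it would need to come down a good bit more per GB.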

>PS5's probably definitely using TLC; those were Toshiba modules (though that's interesting in itself because Toshiba now goes by some different name now IIRC? Something starting with a "K"...tbh their new name is kinda garbage IMO x3), but they would be very large TLC if they're 128 GB chips, that kind of pushes QLC territory. Couldn't make out what Series X is using, except Jeff Grub tore the expansion card open and it showed a SK Hynix module. Tried to google the model number but absolutely nothing came up matching it!

rntongo · Nov 5, 2020

>For next-next gen systems, I still think it has a good shot of being there. The bandwidth will never match main memory (especially if next-next gen go with an HBM-based main memory), but if the persistent memory is DRAM-class in speed, latency and function, it should be able to serve a great supplement between main memory and SSD memory as long as they can get the pricing down enough and amounts anywhere between 3x-4x the amount of main memory for similar (or lower) prices.

If Sony & MSFT are willing to rewrite the OS significantly then PMEM would be smart for next next gen. I think a much smarter thing would be to hit 32GB of RAM with high data transfer rates (>18Gbps per pin) and roughly 12GB/s (32GB RAM / 2.5 decomp ratio) of SSD bandwidth; either PCIe 4.0 x6 or 5.0 x3. If they can launch with 2TB base models of SSDs in such systems it would be a huge relief for consumers. And by forgoing PMEM maybe they can invest more into real RT & ML accelerators and a higher core-count CPU.
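The lane counts can be sanity-checked with a small sketch. The 32GB and 2.5 figures come from the post; the per-lane rates use the standard PCIe transfer rates (16 GT/s for 4.0, 32 GT/s for 5.0) with 128b/130b encoding, ignore protocol overhead, and the function names are made up. Note that x6 and x3 are not standard PCIe link widths, so this only checks the raw arithmetic.

```python
def lane_bandwidth_gb_s(gt_per_s):
    """Per-lane PCIe throughput in GB/s with 128b/130b encoding."""
    return gt_per_s * (128 / 130) / 8

def link_bandwidth_gb_s(gt_per_s, lanes):
    """Total one-direction link throughput in GB/s."""
    return lane_bandwidth_gb_s(gt_per_s) * lanes

def required_raw_gb_s(ram_gb, ratio, fill_seconds=1.0):
    """Raw SSD rate needed to fill ram_gb of RAM in fill_seconds."""
    return ram_gb / ratio / fill_seconds

need = required_raw_gb_s(32, 2.5)
gen4_x6 = link_bandwidth_gb_s(16.0, 6)
gen5_x3 = link_bandwidth_gb_s(32.0, 3)

print(f"raw rate to fill 32 GB in 1 s at 2.5x: {need:.1f} GB/s")
print(f"PCIe 4.0 x6: {gen4_x6:.1f} GB/s")
print(f"PCIe 5.0 x3: {gen5_x3:.1f} GB/s")
```

Both links come out just under 12GB/s, so the post's figures are consistent with each other, though slightly short of the 12.8GB/s that a strict one-second fill of 32GB would need.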

thicc_gaf said:
Speaking of SFS and the mip maps, so we already know they have hardware in the GPU for blending the mips. I've seen some documents with visualizations of this twice, maybe three times, over the past few months but didn't bother to hold on to them. Did those document drawings show if this mip blending hardware had any cache on it locally? If so would it have fallen within 64 KB in size to match the block size of the SSDs?

I think Jason Ronald or another person on the engineering team said something about even smaller page sizes for sampling when it came to SFS, but I can't recall the specific context it pertains to.

I honestly don't know the answers to these but I think from what James Stanard said SFS can identify up to the smallest page size. But I'm sure Microsoft got all the necessary caches in place. The whole die has about 76MB most of which is unaccounted for.

thicc_gaf · Nov 5, 2020

rntongo said:
If Sony & MSFT are willing to rewrite the OS significantly then PMEM would be smart for next next gen. I think a much smarter thing would be to hit 32GB of RAM with high data transfer rates (>18Gbps per pin) and roughly 12GB/s (32GB RAM / 2.5 decomp ratio) of SSD bandwidth; either PCIe 4.0 x6 or 5.0 x3. If they can launch with 2TB base models of SSDs in such systems it would be a huge relief for consumers. And by forgoing PMEM maybe they can invest more into real RT & ML accelerators and a higher core-count CPU.

I honestly don't know the answers to these but I think from what James Stanard said SFS can identify up to the smallest page size. But I'm sure Microsoft got all the necessary caches in place. The whole die has about 76MB most of which is unaccounted for.

It's like you seemingly read my mind on some of the next-next gen stuff on the SSD bandwidths (might go for PCIe 5.0 just depends if it would be ready in time). But could GDDR_nth or whatever version is available by then, really bring the needed bandwidth, even if clamshell mode were used (halving capacities but keeping bandwidth high per module)?

For where those consoles'll probably hit, that's why I'm more in favor of HBM-based memory technologies. But, it doesn't even need to be said, the issue there would be price; HBM suppliers place their stuff at a premium mainly because clients are willing to pay the premium. I don't know how much mass bulk contract purchases from MS and Sony would drive down the costs for them.

Anyone have guesses what the unaccounted portions of the 76 MB cache are for? I think someone accounted for all of the cache on the CPU and GPU, and other things, and still there was like 20-something MB not accounted for. Some users here speculated on it back in August, and some guy on Twitter with the Leviathan handle also gave some thoughts on it. But it was all more them pondering what it could be for, rather than giving any definitive conclusion.

And most people have ruled out Infinity Cache but...maybe a cut-down version of it in some implementation? Doesn't seem very likely but we never know.

function · Nov 5, 2020

thicc_gaf said:
Speaking of SFS and the mip maps, so we already know they have hardware in the GPU for blending the mips. I've seen some documents with visualizations of this twice, maybe three times, over the past few months but didn't bother to hold on to them. Did those document drawings show if this mip blending hardware had any cache on it locally? If so would it have fallen within 64 KB in size to match the block size of the SSDs?

The SFS specific blending mode appears to be intended to hide transitions from a higher detail mipmap page (where you could observe a drop of detail) to a lower detail mipmap page (where you couldn't observe the transition). So basically, where you blend from a higher detail to a lower detail page is moved to a screen location where the results of that change are invisible.

IMO this will use the standard cache hierarchy, just with the texture samplers directed towards the appropriate page of each mipmap to make this happen. The system maintains a map of where different LODs are used, so it should be able to calculate where to offset the blending. There's almost certainly some hardware acceleration for this, as this will have to work with up to 16x anisotropic filtering too and it would seem needlessly expensive to do this kind of weighted and conditional blending in software. But I think it will be integrated into the existing pipeline and cache hierarchy.

It would probably only have a chance of working if something like embedded MRAM took off... and became affordable. There are already some embedded designs out there using it, but nothing for the consumer electronics market AFAIK, and it's crazy expensive. But the benefits seem worth it; nearly SRAM-level speeds and larger capacities, and able to be baked onto the chip itself.

A lot of projections for future designs, maybe 10 years out, show STT-MRAM starting to replace SRAM at the L3$ level; some even show something similar for DRAM, but capacities have to grow by orders of magnitude before that happens (and right now the largest capacities are still maybe 16 MB to 32 MB, last I checked).

Sounds like I should look into MRAM more...

The thing about exotic tech though is that while it offers huge potential advantages, it has risks wrt enormous investments and potential snags, and that makes the gradual evolution of existing technologies a safer bet. We've been with sram and dram in consoles for a very long time because the risks are low, the tech is proven and long term planning with regards to cost and availability is reasonably solid.

If something like MRAM does make it into consoles, it will have had to establish itself elsewhere in a fairly strong way first.

Personally I think persistent memory, once it can be had in good quantities for good prices, will make a great complement to volatile and NAND memories, even in console designs. RAM prices don't seem to be decreasing as quickly as they did in the past, if MS's statements at Hot Chips are anything to go by, and that rate of decrease will probably slow going into the future. So if even getting 32 GB in future console designs is pushing the high-end, that opens up a large window for persistent RAM.

Particularly the type that functions more like DRAM; IIRC Optane has SSD-style and DRAM-style variants. I don't know if the SSD version is byte-addressable, but it does have much lower latency than the typical SSD. Future designs could leverage a respectable amount of faster main memory plus a pool of directly addressable persistent memory at potentially 4x the main memory amount, with more mature bandwidths and latencies. I think that grants a lot of design freedom for creators as long as they're willing to learn how to naturally integrate another memory space in the hierarchy into their designs.

I'd guess that the SSD Optane's controller is designed around traditional filesystem type access, as that would be the most natural way to make an SSD controller. Though I suppose you could make a controller that could access at the byte level through some kind of special set of commands? I don't know enough to say.

One of the big considerations in games now is portability - everything is multi platform (and currently multi-gen). Organising your data structures and access patterns to favour a unique memory layout may not go down well. Whatever hierarchy of memory and caches you come up with can't be too alien to the general world of game development.

When Sega introduced their Naomi GDROM arcade board, they used the Dreamcast high capacity CD, but loaded the entire damn thing into a 1GB+ ram "virtual cartridge" to save on the cost of GB arcade carts which cost serious dollar. A one time 60 second boot up fee costs almost nothing, but loading times during gameplay mean you lose credits from players. Nice. Unfortunately, I don't think a similar (in terms of how the game code sees the game data) persistent memory solution is going to be viable on consoles in the near future.

No problems at all dude; I welcome these kind of discussions and feel like I learn a heck of a lot from the insights provided in general on the boards. And a lot of points raised here in particular on your end, make a lot of sense to me

Just because something makes sense doesn't mean it's right.

And yeah, discussions like this are cool. Complex enough to get you thinking, but chilled enough that it's fun.

>I was one of the people honestly thinking persistent memory would be in PS5 and/or Series X. Feeling incredibly stupid for that in hindsight, I didn't know any better xD

It's okay to think things that are later proved to be incorrect! What's a bit dumb is not stopping to question what you think when given a good reason to do so. I wish I could remember to do this more often, when I'm being dumb ...

>For next-next gen systems, I still think it has a good shot of being there. The bandwidth will never match main memory (especially if next-next gen go with an HBM-based main memory), but if the persistent memory is DRAM-class in speed, latency and function, it should be able to serve a great supplement between main memory and SSD memory as long as they can get the pricing down enough and amounts anywhere between 3x-4x the amount of main memory for similar (or lower) prices.

And as always, cost is key! SSDs are so fast now (even on XSX) that the benefit of Dram like persistent memory may have limited benefit in the face of faster SSDs and better ways of predicting what you need. Likewise, HBM is kick ass, but costs per unit of performance (size, speed, power etc) don't come out in its favour when all factored in. It's great if you favour performance over cost, but consoles always have to balance perf and cost (unless you're Nintendo, and favour cost over anything, and still win because your games are great).

The thing about graphics is that you have a somewhat reliably predictable workload due to the nature of the job. It's constrained to some extent by the screen. FOV, distance, resolution etc. Where persistent memory seems to kick ass is with data sets that potentially scale enormously, and aren't constrained by for example screen space and screen resolution. In games, you can make things fit the screen. In some other problems, however, you have to make the computer hardware fit the problem.

128/256 GB of Optane backing up the dram would be awesome next gen. But more awesome than more dram and more CU on the GPU? I'm not qualified to say. But I think probably not.

You can do a breakdown of the official MSRP for Intel's Optane DC Persistent Memory and I think it comes to around $6 per GB, tho I haven't checked the launch prices for it in a long time. That's for the 128 GB option. That price would include profit margins of course, but even if you halve the cost it'd still need to come down a good bit more per GB before it's a viable solution in a console design.

Yeah, definitely.

>PS5's probably definitely using TLC; those were Toshiba modules (though that's interesting in itself because Toshiba goes by some different name now IIRC? Something starting with a "K"...tbh their new name is kinda garbage IMO x3), but they would be very large TLC if they're 128 GB chips, that kind of pushes QLC territory. Couldn't make out what Series X is using, except Jeff Grubb tore the expansion card open and it showed a SK Hynix module. Tried to google the model number but absolutely nothing came up matching it!

I tried searching the code too.

Seemed to be ... maybe ... in a family of stuff that included TLC. Which means ... basically nothing. Doh!

(Also, I'm a bit drunk now. Please excuse any gibberish).

thicc_gaf · Nov 6, 2020

function said:
The SFS specific blending mode appears to be intended to hide transitions from a higher detail mipmap page (where you could observe a drop of detail) to a lower detail mipmap page (where you couldn't observe the transition). So basically, where you blend from a higher detail to a lower detail page is moved to a screen location where the results of that change are invisible.

IMO this will use the standard cache hierarchy, just with the texture samplers directed towards the appropriate page of each mipmap to make this happen. The system maintains a map of where different LODs are used, so it should be able to calculate where to offset the blending. There's almost certainly some hardware acceleration for this, as this will have to work with up to 16x anisotropic filtering too and it would seem needlessly expensive to do this kind of weighted and conditional blending in software. But I think it will be integrated into the existing pipeline and cache hierarchy.

Thanks for the clarifications; I was under the impression MS were using a block of cache within the mip blending hardware to store a texture sample, but that was outside of the caches already listed on the GPU. So my thought was the data would get copied from the L2$ and into this other cache for the mip-blending hardware. That might over-complicate things though; I'm assuming the map the system uses is stored in the 2.5 GB of reserved GDDR6?

Sounds like I should look into MRAM more...

The thing about exotic tech though is that while it offers huge potential advantages, it has risks wrt enormous investments and potential snags, and that make the gradual evolution of existing technologies a safer bet. We've been with sram and dram in consoles for a very long time because the risks are low, the tech is proven and long term planning with regards to cost and availability is reasonably solid.

If something like MRAM does make it into consoles, it will have had to establish itself elsewhere in a fairly strong way first.

Oh trust me, I'm not foolish enough to think MRAM will be in 10th-gen systems xD. There's just no path to it getting the capacity gains needed in even 10 years for that to happen, and certainly not at good prices. Plus by the point such could happen, other more standard solutions would be even still cheaper, and more or less just as solid.

Which is why I can get this perspective now; when you think about it those are the big advantages of SRAM and DRAM as you put it; time-tested, good performance for price, good capacity for price.

I'd guess that the SSD Optane's controller is designed around traditional filesystem type access, as that would be the most natural way to make an SSD controller. Though I suppose you could make a controller that could access at the byte level through some kind of special set of commands? I don't know enough to say.

One of the big considerations in games now is portability - everything is multi platform (and currently multi-gen). Organising your data structures and access patterns to favour a unique memory layout may not go down well. Whatever hierarchy of memory and caches you come up with can't be too alien to the general world of game development.

When Sega introduced their Naomi GDROM arcade board, they used the Dreamcast high capacity CD, but loaded the entire damn thing into a 1GB+ ram "virtual cartridge" to save on the cost of GB arcade carts which cost serious dollar. A one time 60 second boot up fee costs almost nothing, but loading times during gameplay mean you lose credits from players. Nice. Unfortunately, I don't think a similar persistent memory solution is going to be viable on consoles in the near future.

Ah man, SEGA were something else back in the day! Constantly interested in learning about the tech they leveraged for their console and arcade systems, and how they did it. What they did with NAOMI was pretty clever; I think a modern-day arcade system mimicking that, even with a persistent memory, would be neat to see happen.

You're right though; ultimately if costs for persistent memories don't start to scale down in the next several years, it won't be a thing for 10th-gen consoles, most likely. And ever-increasing cross-gen/multi-platform ecosystem support from 3rd parties will probably also have a big impact, something that isn't always on my mind.

And as always, cost is key! SSDs are so fast now (even on XSX) that the benefit of Dram like persistent memory may have limited benefit in the face of faster SSDs and better ways of predicting what you need. Likewise, HBM is kick ass, but costs per unit of performance (size, speed, power etc) don't come out in its favour when all factored in. It's great if you favour performance over cost, but consoles always have to balance perf and cost (unless you're Nintendo, and favour cost over anything, and still win because your games are great).

The thing about graphics is that you have a somewhat reliably predictable workload due to the nature of the job. It's constrained to some extent by the screen. FOV, distance, resolution etc. Where persistent memory seems to kick ass is with data sets that potentially scale enormously, and aren't constrained by for example screen space and screen resolution. In games, you can make things fit the screen. In some other problems, however, you have to make the computer hardware fit the problem.

128/256 GB of Optane backing up the dram would be awesome next gen. But more awesome than more dram and more CU on the GPU? I'm not qualified to say. But I think probably not.

Yeah, I think you might be on the money here. It'll help with revamping some of my own thoughts on what another generation of consoles could bring, for sure (that's gonna be a lot of stuff to rewrite though xD). There's still a use-case for HBM I feel, particularly factoring in bus widths and potential bandwidth of future HBM iterations, and it's really up in the air how much further GDDR can be pushed. Because GDDR seems to favor "smaller" bandwidths (at least in comparison to HBM), so that's always going to put a cap on how much bandwidth and capacity a system with GDDR can have considering it can't really be packaged the same way HBM can (and generally has a higher watt consumption).

Say some future GDDR spec can hit 32 Gbps on the I/O pins, but there's still the traditional packaging standard for it which is going to take up PCB real estate (read somewhere about potential experimentation of stacking GDDR package-on-package style, but I don't think anything substantial's come from that), but if it stays at the same 32-bit sizes that's 128 GB/s per module, and it seems the consoles generally are shooting for 256-bit to 320-bit (MAYBE 384-bit or at least that was the case for One X), so that probably limits them to 8-10 modules. 1 TB/s - 1.28 TB/s main memory bandwidths, hopefully 4 GB capacities end up supported so you can get 32 GB - 40 GB capacities. But does that bandwidth seem too small for future system designs?
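As a rough sanity check on that arithmetic, here is a minimal Python sketch. It assumes peak bandwidth is just pin rate times bus width divided by 8, that every module is 32 bits wide, and that 4 GB modules exist. The function names are made up for illustration, and the 32 Gbps pin rate is the hypothetical future figure from the post, not a real spec.

```python
from typing import Tuple


def module_bandwidth_gbs(pin_rate_gbps: float, bits_per_module: int = 32) -> float:
    """Peak bandwidth of one module in GB/s (Gbit/s per pin * pins / 8 bits per byte)."""
    return pin_rate_gbps * bits_per_module / 8


def system_bandwidth_gbs(pin_rate_gbps: float, bus_width_bits: int,
                         bits_per_module: int = 32) -> Tuple[int, float]:
    """Return (module count, total peak GB/s) for a given total bus width."""
    modules = bus_width_bits // bits_per_module
    return modules, modules * module_bandwidth_gbs(pin_rate_gbps, bits_per_module)


PIN_RATE_GBPS = 32   # hypothetical future GDDR pin rate from the post
MODULE_GB = 4        # hoped-for module capacity, also an assumption

for bus_width in (256, 320, 384):
    modules, bandwidth = system_bandwidth_gbs(PIN_RATE_GBPS, bus_width)
    capacity = modules * MODULE_GB
    print(f"{bus_width}-bit: {modules} modules, {bandwidth:.0f} GB/s, {capacity} GB")
```

Under those assumptions a 256-bit bus works out to 8 modules, 1024 GB/s and 32 GB, and a 320-bit bus to 10 modules, 1280 GB/s and 40 GB, which matches the 1 TB/s - 1.28 TB/s range above. These are peak theoretical numbers only; real sustained bandwidth would be lower.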

Maybe that's where something like Infinity Cache could come in? We don't have any numbers on RDNA 2 GPUs with IC unfortunately. But I guess generous IC + some future GDDR spec at speeds listed above could do pretty well for future system designs after all. Really interesting to think about.

I tried searching the code too.

Seemed to be ... maybe ... in a family of stuff that included TLC. Which means ... basically nothing. Doh!

(Also, I'm a bit drunk now. Please excuse any gibberish).

Nah you're quite articulate if this is you being drunk xD.

I'll try seeing what I can find on the Toshiba and SK Hynix NAND modules over the weekend. Really hate how these manufacturers obfuscate this stuff at times. Although I can understand if they're customized parts not yet officially announced for the wider market. At least they didn't scrub off the brand naming and details from the chips altogether like oh so many cheap retro emulator knockoffs do!

And I guess to kinda get the thread back on topic xD....does anyone know if any of the Series BC games are leveraging Velocity Architecture? I'd assume they aren't, unless they're optimized, but "optimized" doesn't seem to automatically mean they're leveraging XvA. The load times are really impressive from what we're seeing right now.

function · Nov 6, 2020

thicc_gaf said:
Thanks for the clarifications; I was under the impression MS were using a block of cache within the mip blending hardware to store a texture sample, but that was outside of the caches already listed on the GPU. So my thought was the data would get copied from the L2$ and into this other cache for the mip-blending hardware. That might over-complicate things though; I'm assuming the map the system uses is stored in the 2.5 GB of reserved GDDR6?

Yeah, I think the simplest way (customisation work, silicon cost) is to add their SFS sampling mode to those already supported by the Texture Mapping Units. All the data they need (textures, tile map, residency map) are stored in texture format. I think the tile and residency maps will be stored in "game memory" as the game needs to be able to see them, and they might be useful for developers in some other way if they have access to them. (In the past I've suggested using the SFS data to decide on model / geometry LOD, as it could indirectly tell you about needed level of detail in dynamic res games).

Ah man, SEGA were something else back in the day! Constantly interested in learning about the tech they leveraged for their console and arcade systems, and how they did it. What they did with NAOMI was pretty clever; I think a modern-day arcade system mimicking that, even with a persistent memory, would be neat to see happen.

Yeah, I used to really like Sega. It's hard for people to understand, I think, what it was like to be gaming on the MD and SNES and then walk into an arcade and BAM see Daytona USA running on a sit down cab with something like a 50 inch screen and a great audio system turned up high. It was mindblowing! It's a long, long time since anything has completely redefined what games are for me. Those types of jumps don't exist these days (although the move to something like XVA / DirectStorage is still most welcome!).

Yeah, I think you might be on the money here. It'll help with revamping some of my own thoughts on what another generation of consoles could bring, for sure (that's gonna be a lot of stuff to rewrite though xD). There's still a use-case for HBM I feel, particularly factoring in bus widths and potential bandwidth of future HBM iterations, and it's really up in the air how much further GDDR can be pushed. Because GDDR seems to favor "smaller" bandwidths (at least in comparison to HBM), so that's always going to put a cap on how much bandwidth and capacity a system with GDDR can have considering it can't really be packaged the same way HBM can (and generally has a higher watt consumption).

If the industry can support a high end console with a high enough BOM, I suppose a HBM derivative isn't off the cards. As you say, BW, speed, power and footprint are all in its favour. Generally smaller BW's come from being further away (in both computer and physical terms) from the processor, and HBM sits on an interposer with a stupidly wide bus built on silicon. Something off package like GDDR6 is unlikely to match something on an interposer like HBM.

Where traditional memory types hold up is that with enough cache, and a wide enough bus and enough space for the dram on the board, you can get a better compromise for the cost and risk. For example, Radeon 7 was HBM, but Radeon 6900 is GDDR6 with a phat sram cache. Will this new approach hold up? Hopefully.

Say some future GDDR spec can hit 32 Gbps on the I/O pins, but there's still the traditional packaging standard for it which is going to take up PCB real estate (read somewhere about potential experimentation of stacking GDDR package-on-package style, but I don't think anything substantial's come from that), but if it stays at the same 32-bit sizes that's 128 GB/s per module, and it seems the consoles generally are shooting for 256-bit to 320-bit (MAYBE 384-bit or at least that was the case for One X), so that probably limits them to 8-10 modules. 1 TB/s - 1.28 TB/s main memory bandwidths, hopefully 4 GB capacities end up supported so you can get 32 GB - 40 GB capacities. But does that bandwidth seem too small for future system designs?

I mean, there are all kinds of whizz-kid silicon guys profiling this stuff and trying to predict the best set of tradeoffs to head towards. GDDR stacking might work, but there are inherent problems associated with running at high frequencies, using lots of power, stacking, and cooling. HBM works well with stacking because it uses lower clocks and less power, but it makes up for that with a tremendously phat bus that can only work on an interposer. When your memory has to go across a board and up through pins / solder bumps under a chip package you're going to struggle with bus widths beyond a certain point.

Maybe that's where something like Infinity Cache could come in? We don't have any numbers on RDNA 2 GPUs with IC unfortunately. But I guess generous IC + some future GDDR spec at speeds listed above could do pretty well for future system designs after all. Really interesting to think about.

Yeah, the success (or failure) of RDNA2 may give us some insight into the future. High end GPUs are much higher margin parts than consoles, so that probably gives them more area to play with on the silicon. Then again, the PHYs for external dram take up die area too, and as transistors get smaller that accounts for more and more cache that could be put on chip instead. But yeah, if GDDR hits a wall, and die area remains expensive with little cost reduction, eventually hardware vendors will be driven towards something more radical.

Maybe faster SSDs and persistent memory can save us.

And I guess to kinda get the thread back on topic xD....does anyone know if any of the Series BC games are leveraging Velocity Architecture? I'd assume they aren't, unless they're optimized, but "optimized" doesn't seem to automatically mean they're leveraging XvA. The load times are really impressive from what we're seeing right now.

XVA seems to consist of a fast(ish) SSD, DirectStorage using virtual memory addresses, SFS, and a custom decompression block, which I expect will need games to be compiled/built in a certain way to use.

Soooo..... other than using an SSD and having a very fast CPU, I'd agree with you that BC likely won't take advantage of most of XVA unless specifically patched to do so.

I think it will be a similar case for PS5 BC too. So I think BC games will mostly be limited (or boosted if you want to look at it that way!) by the CPU.

thicc_gaf · Nov 8, 2020

function said:
Yeah, I think the simplest way (customisation work, silicon cost) is to add their SFS sampling mode to those already supported by the Texture Mapping Units. All the data they need (textures, tile map, residency map) are stored in texture format. I think the tile and residency maps will be stored in "game memory" as the game needs to be able to see them, and they might be useful for developers in some other way if they have access to them. (In the past I've suggested using the SFS data to decide on model / geometry LOD, as it could indirectly tell you about needed level of detail in dynamic res games).

Ah, this clears up a ton! So the way described here could suggest this is one of the other customizations MS might've done for the GPU, although there's the chance that AMD adopt this across the RDNA 2 PC GPUs, while other parts like the mip-blending hardware are exclusive to the Series systems, at least going by what Xbox engineers were saying on Twitter.

Yeah, I used to really like Sega. It's hard for people to understand, I think, what it was like to be gaming on the MD and SNES and then walk into an arcade and BAM see Daytona USA running on a sit down cab with something like a 50 inch screen and a great audio system turned up high. It was mindblowing! It's a long, long time since anything has completely redefined what games are for me. Those types of jumps don't exist these days (although the move to something like XVA / DirectStorage is still most welcome!).

I actually remember doing exactly that xD! Had a MegaDrive (Genesis technically over here), and would often go to the arcade on the naval base where they just had so many amazing games with that experience you simply couldn't get at home, even with the then-next gen consoles of PS1/Saturn/N64 (aside from the 3D, which the arcade did still better). Makes me sad that arcades fell off; I do think they could still serve a big role of driving the market forward and provide the kind of full-fat experience you can't really get in the home unless you have a REALLY good (and expensive) setup, which most gamers don't.

Like, people are already struggling to make sure they have TVs 4K/120 compatible for the Series X and PS5, and that's just one component of several in a good gaming setup. Something that were an arcade equivalent would have a much better convenience factor, then add in the real estate problem when it comes to floorspace, that's another area arcades still have the advantage in because not all gamers have large houses or rooms with large amounts of floorspace.

Having those sort of things could easily enable for some new mindblowing tech, like full VR/AR/location-based gameplay designs that wouldn't be easy to translate home but...not enough visionaries in the gaming field looking at arcade as a blue ocean market I guess :S.

If the industry can support a high end console with a high enough BOM, I suppose a HBM derivative isn't off the cards. As you say, BW, speed, power and footprint are all in its favour. Generally smaller BW's come from being further away (in both computer and physical terms) from the processor, and HBM sits on an interposer with a stupidly wide bus built on silicon. Something off package like GDDR6 is unlikely to match something on an interposer like HBM.

Where traditional memory types hold up is that with enough cache, and a wide enough bus and enough space for the dram on the board, you can get a better compromise for the cost and risk. For example, Radeon 7 was HBM, but Radeon 6900 is GDDR6 with a phat sram cache. Will this new approach hold up? Hopefully.

Yeah, I think AMD are onto something with IC. If it works out well (even if it has issues in RDNA 2, they can refine it for RDNA 3), that could draw others like Nvidia to develop equivalents. They can refine to work even smarter at smaller capacities, or get results with "slower" caches at the L3$ in later generations comparable to what they could be getting in the first generations. That might open up the chance of seeing it in future gaming consoles as well.

I think Nvidia's shown through GDDR6X that there's still room for GDDR to grow, though there's no way to tell where the ceiling is at this time. Whatever the ceiling is, I think they'll probably hit it by the end of the decade.

I mean, there are all kinds of whizz-kid silicon guys profiling this stuff and trying to predict the best set of tradeoffs to head towards. GDDR stacking might work, but there are inherent problems associated with running at high frequencies, using lots of power, stacking, and cooling. HBM works well with stacking because it uses lower clocks and less power, but it makes up for that with a tremendously phat bus that can only work on an interposer. When your memory has to go across a board and up through pins / solder bumps under a chip package you're going to struggle with bus widths beyond a certain point.

Speaking of HBM, that could run into its own issues in the future. I've been reading a lot into FGDRAM (Fine-Grained Dynamic Random Access Memory), some great research and thesis papers exploring it and highlighting limitations in HBM technologies. If GDDR really does hit a wall sooner rather than later, HBM could be a great substitute provided costs go down, but for server and big data markets unless there's radical changes in HBM architecture, something like FGDRAM might have to step up to keep pushing larger bandwidths and wider buses, with better granularity levels.

Question is who's the first to develop a successful implementation of it? Or it could just end up that the ideas surrounding FGDRAM get brought forward into a future HBM spec, that's also possible. It'd be neat if yourself and some of the other super-technical posters around here had some thoughts on things like FGDRAM, if taken a look into. I'll have to re-read the papers at a future date, for sure.

Yeah, the success (or failure) of RDNA2 may give us some insight into the future. High end GPUs are much higher margin parts than consoles, so that probably gives them more area to play with on the silicon. Then again, the PHYs for external dram take up die area too, and as transistors get smaller that accounts for more and more cache that could be put on chip instead. But yeah, if GDDR hits a wall, and die area remains expensive with little cost reduction, eventually hardware vendors will be driven towards something more radical.

Maybe faster SSDs and persistent memory can save us.

I think you're on the money with SSDs, can't see them going away. Costs should continue to scale down, if there's any use of persistent memory it'll probably be as a large-ish (16 GB - 32 GB) byte-addressable cache replacement for DRAM or SRAM on the flash controller's side. Hopefully with lower latencies than 1st-generation Optane DC Persistent Memory which doesn't have terrible latencies in and of itself, I just think you'd probably want lower amounts to supplement it as a large cache in this instance.

XVA seems to consist of a fast(ish) SSD, DirectStorage using virtual memory addresses, SFS, and a custom decompression block, which I expect will need games to be compiled/built in a certain way to use.

Soooo..... other than using an SSD and having a very fast CPU, I'd agree with you that BC likely won't take advantage of most of XVA unless specifically patched to do so.

I think it will be a similar case for PS5 BC too. So I think BC games will mostly be limited (or boosted if you want to look at it that way!) by the CPU.

We're actually seeing some instances of BC titles running better on PS5 vs Series X, but that seems to have more to do with the fact PS5 would have the PS4 version to work with, while Series X has the XBO version (generally lower framerate and/or resolution vs. PS4 ver) or One X (generally higher native resolution vs. even PS4 Pro, but sometimes worse framerate as a result) to work with.

One take I'm not agreeing with that some are trying to go for, though, is that variable frequency is affecting the SSD performance on PS5's side and that somehow is making for the (generally) longer load times for BC games there vs. Series X. That take doesn't make a lot of sense to me; it's already been well described variable frequency is a CPU/GPU thing, nothing in relation to the SSD or in PS5's case, the SSD I/O hardware block. AMD's own version of variable frequency they showed off at the RDNA 2 event is also CPU/GPU related.

Granted, BC games aren't stressing the SSDs in either PS5 or Series X, but it's just weird to see people rationalizing BC load times down to SSD being affected by variable frequency and Smartshift, or that variable frequency as-is would be affecting load times of BC games on PS5, because I don't think any of these unoptimized BC titles are necessarily pushing the GPU to its limits, if much at all. Half the GPU is disabled anyway for BC on PS5 (IIRC), so there's virtually no way the GPU would have workloads stressing it enough to require power from the CPU's power budget (and therefore lower CPU profile performance, which actually would affect BC).

That's also accounting for the fact PS5's CPU doesn't have a non-SMT mode (not that it'd be needed for BC; its clock is still much faster than PS4 or Pro's CPUs, though maybe the additional MHz overhead for Series X running BC games in non-SMT mode does help some with loading times there for non-optimized BC games?).

Deleted member 11852 · Nov 8, 2020

thicc_gaf said:
One take I'm not agreeing with that some are trying to go for, though, is that variable frequency is affecting the SSD performance on PS5's side and that somehow is making for the (generally) longer load times for BC games there vs. Series X. That take doesn't make a lot of sense to me; it's already been well described variable frequency is a CPU/GPU thing, nothing in relation to the SSD or in PS5's case, the SSD I/O hardware block. AMD's own version of variable frequency they showed off at the RDNA 2 event is also CPU/GPU related.

This is almost-definitely attributed to the CPU. Most game data check-in will be CPU-driven, i.e. once a file has actually been loaded into RAM the data is often a mishmash of compressed and uncompressed textures, shaders, audio, geometry and other stuff and separating and processing the data so it's usable by the game engines is on the CPU, aside from zlib decompression which is managed in hardware in both consoles. Series X's CPU clock is a shade faster than PS5 and most games exhibit a shade faster loading on Series X.

Coincidence? :nope:

Conspiracy? :nope:

Results fitting the facts? :yep2:

thicc_gaf · Nov 8, 2020

DSoup said:
This is almost-definitely attributed to the CPU. Most game data check-in will be CPU-driven, i.e. once a file has actually been loaded into RAM the data is often a mishmash of compressed and uncompressed textures, shaders, audio, geometry and other stuff and separating and processing the data so it's usable by the game engines is on the CPU, aside from zlib decompression which is managed in hardware in both consoles. Series X's CPU clock is a shade faster than PS5 and most games exhibit a shade faster loading on Series X.

Coincidence? Conspiracy? Results fitting the facts?

And there we go, case closed

So I'm wondering if this can also basically be answerable for non-BC games on the next-gen platforms as well. It makes me curious, then, how much of the CPU is actually "out of the way" on PS5's side. Like is the I/O block also handling this processing before then moving the data into RAM? If so, what are the specs on those aspects of the block? I know they've compared it to general Zen 2 core performance but...Zen 2 cores can run a gamut of overall frequencies, cache amounts etc. and while the Coherency Engines are likely just Zen 2 cores I find it hard to think they're exactly the same as the cores in the PS5 CPU proper.

Then that also raises the question of whether the extra step of another processor component doing that before putting the data in RAM adds latency to the process compared to just letting the CPU proper handle some of that? It will be very interesting to see how it all plays out for next-gen games in particular, but I'm expecting very impressive results from both that are relatively performant with each other.

Deleted member 11852 · Nov 8, 2020

thicc_gaf said:
So I'm wondering if this can also basically be answerable for non-BC games on the next-gen platforms as well. It makes me curious, then, how much of the CPU is actually "out of the way" on PS5's side. Like is the I/O block also handling this processing before then moving the data into RAM?

Games on today's consoles and PCs generally have their data stored in 'packs' (.pak files) which are often just .zip archives with or without actual compression. There can be many of these and they're often organised so their contents have all the data needed for any given level or area. This is obviously done to improve loading/streaming times by reducing seek times. This is why there is so much data duplication, e.g. trees found in pack files for level 1 will probably be needed in other levels with trees.

On nextgen consoles I'd expect organisation to be based on data type, i.e. here is a pack file with all the foliage geometry, here is a pack file with all the foliage textures etc. Because there are no seek times and virtually no overhead for file access you can pull 60 trees from these two packs and the same will be true of most other assets. Textures compressed with zlib / kraken / oodle / bcpack will decompress on the fly during load and this approach will virtually eliminate CPU-bound check-in.
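To put a number on the duplication point, here is a minimal Python sketch comparing the two layouts. Everything in it is hypothetical: the level names, asset names and sizes are invented, and real pack formats are obviously more involved. It only shows why per-level packs inflate install size when assets are shared.

```python
# Hypothetical assets used by each level (names and sizes are invented).
LEVEL_ASSETS = {
    "level1": {"tree", "rock", "castle"},
    "level2": {"tree", "rock", "cave"},
    "level3": {"tree", "ship"},
}
ASSET_SIZE_MB = {"tree": 40, "rock": 25, "castle": 120, "cave": 90, "ship": 60}


def per_level_install_mb(level_assets, sizes):
    """Seek-friendly layout: every level pack carries its own copy of each asset."""
    return sum(sizes[name] for assets in level_assets.values() for name in assets)


def shared_install_mb(level_assets, sizes):
    """Type-based layout: each asset is stored once and shared by all levels."""
    unique_assets = set().union(*level_assets.values())
    return sum(sizes[name] for name in unique_assets)


duplicated = per_level_install_mb(LEVEL_ASSETS, ASSET_SIZE_MB)
shared = shared_install_mb(LEVEL_ASSETS, ASSET_SIZE_MB)
print(f"per-level packs: {duplicated} MB, shared packs: {shared} MB")
```

With these made-up numbers the per-level layout comes to 440 MB against 335 MB for the shared layout, because the tree and rock assets are stored two or three times over.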

Ideally you want to be storing game data in the format that is immediately usable by the game engine, even if it's a little larger. It'll likely compress anyway and it saves a CPU-process during check-in to manipulate data into the format it is required. It probably won't eliminate all CPU check-in but it would alleviate a ton of it. You can see a practical demonstration of this with Spider-Man Miles Morales on PS5 where you go from game menu to city in 2 seconds of game load. But I bet PS5 running the PS4 Spider-Man, with smaller assets lumped together, takes much, much longer.

turkey · Nov 8, 2020

DSoup said:
This is almost-definitely attributed to the CPU. Most game data check-in will be CPU-driven, i.e. once a file has actually been loaded into RAM the data is often a mishmash of compressed and uncompressed textures, shaders, audio, geometry and other stuff and separating and processing the data so it's usable by the game engines is on the CPU, aside from zlib decompression which is managed in hardware in both consoles. Series X's CPU clock is a shade faster than PS5 and most games exhibit a shade faster loading on Series X.

Coincidence? Conspiracy? Results fitting the facts?

Is the slight CPU clock difference enough to offset the large (over 2x) raw SSD speed difference?

I do wonder if the PS4's slightly novel HDD-via-USB-bridge arrangement, when emulated, slows the I/O throughput. Who knows what strange API or implementation bottlenecks are in there that still need to function for the games that used them.

Deleted member 11852 · Nov 8, 2020

turkey said:
Is the slight CPU clock difference enough to offset the large (over 2x) raw SSD speed difference?

Yes, because I'd wager that in most games the asynchronous I/O-powered check-in spends roughly as much time waiting on the CPU as it does waiting for data from the storage device. Go have a look at what happened when DF put an 8TB SSD into a PS4 Pro. You see some great improvements but nowhere near what you might hope for given the relative bandwidth differences between the stock 5400rpm PS4 Pro HDD and the Samsung 870 QVO SSD. This just demonstrates it's not just about I/O bandwidth - at least not with how devs build current-gen games.

Introduce a faster storage device and you address maybe half the issue but you're still left with waiting for the CPU. What I think we're seeing with loading times in b/c games is relatively short I/O reads where PS5's higher bandwidth doesn't really offer much of an improvement over Series X because they're both so damn fast - much faster than required - but where Series X's faster CPU gives it a distinct edge.
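A back-of-envelope sketch of that argument, assuming a purely serial model (read everything, then do single-core check-in). The bandwidth and clock figures are the publicly quoted peak specs (5.5 GB/s raw and up to 3.5 GHz for PS5, 2.4 GB/s raw and 3.8 GHz for Series X); the workload of 1 GB read and 21 billion CPU cycles is invented for illustration, and b/c titles may well not run at those clocks.

```python
def load_time_s(data_gb, ssd_gb_per_s, cpu_gigacycles, cpu_ghz):
    """Serial toy model: I/O time plus single-core CPU check-in time."""
    io_s = data_gb / ssd_gb_per_s          # time spent reading from storage
    cpu_s = cpu_gigacycles / cpu_ghz       # time spent processing on the CPU
    return io_s + cpu_s


# Hypothetical b/c-style load: a short read followed by heavy CPU check-in.
DATA_GB = 1.0
CPU_GIGACYCLES = 21.0

ps5_s = load_time_s(DATA_GB, ssd_gb_per_s=5.5, cpu_gigacycles=CPU_GIGACYCLES, cpu_ghz=3.5)
xsx_s = load_time_s(DATA_GB, ssd_gb_per_s=2.4, cpu_gigacycles=CPU_GIGACYCLES, cpu_ghz=3.8)

print(f"PS5: {ps5_s:.2f} s, Series X: {xsx_s:.2f} s")
```

In this toy case the read takes well under half a second on either machine, so the CPU term dominates and the slightly faster clock wins despite the slower SSD. A real loader overlaps I/O and CPU work and uses multiple cores, so treat this as the shape of the argument rather than a prediction.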

iroboto · Nov 8, 2020

DSoup said:
Yes, because I'd wager that in most games the asynchronous I/O-powered check-in spends roughly as much time waiting on the CPU as it does waiting for data from the storage device. Go have a look at what happened when DF put an 8TB SSD into a PS4 Pro. You see some great improvements but nowhere near what you might hope for given the relative bandwidth differences between the stock 5400rpm PS4 Pro HDD and the Samsung 870 QVO SSD. This just demonstrates it's not just about I/O bandwidth - at least not with how devs build current-gen games.

Introduce a faster storage device and you address maybe half the issue but you're still left with waiting for the CPU. What I think we're seeing with loading times in b/c games is relatively short I/O reads where PS5's higher bandwidth doesn't really offer much of an improvement over Series X because they're both so damn fast - much faster than required - but where Series X's faster CPU gives it a distinct edge.

They might just be coded differently, which is what I think is happening here. It may just come down to older engine history, where someone decided that for PS4 or Xbox they would load it this way. And maybe something is more modern or older in the other version, and no one cared because loading didn't matter.

I cannot see any difference in CPU or SSD speed accounting for a 3x gap like this. To me it's a code issue.

Allandor · Nov 8, 2020

DSoup said:
Yes, because I'd wager that in most games the asynchronous I/O-powered check-in spends roughly as much time waiting on the CPU as it does waiting for data from the storage device. Go have a look at what happened when DF put an 8TB SSD into a PS4 Pro. You see some great improvements but nowhere near what you might hope for given the relative bandwidth differences between the stock 5400rpm PS4 Pro HDD and the Samsung 870 QVO SSD. This just demonstrates it's not just about I/O bandwidth - at least not with how devs build current-gen games.

Introduce a faster storage device and you address maybe half the issue but you're still left with waiting for the CPU. What I think we're seeing with loading times in b/c games is relatively short I/O reads where PS5's higher bandwidth doesn't really offer much of an improvement over Series X because they're both so damn fast - much faster than required - but where Series X's faster CPU gives it a distinct edge.

The new patch for The Last of Us also proves that. Faster loading times were always possible; it was just never a design target. Sure, the SSD and hardware decompression help, but overall we would already have much shorter loading times if that had been something customers wanted and long loading times had been a deal breaker. It is just how you design your engine.

Yes, you can't make miracles happen with an HDD, but faster loading was always possible; it was just never the priority, e.g. when opening the main menu or loading saves. It is just a question of data management and maybe game design. If you need many GBs of textures etc. for a scene, then you have to load them, and for sure not as fast as with an SSD, but the Last of Us patch shows what is possible with a more modern design (even with an HDD on the PS4).

BRiT · Nov 8, 2020

If I were a console OS coder, I would put in a "cheat", a predictive pre-emptive cheat. If you don't have a game loaded, the moment the user moves over a title in the UI, I would start reading the game into memory so that when the user does choose to start it, you have a head start on their action.
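A minimal sketch of that idea, assuming a UI that fires a hover event and some loader function that reads a title's boot data. `HoverPrefetcher`, `load_fn` and the event names are all hypothetical; this is not how any real console OS is known to work.

```python
from concurrent.futures import ThreadPoolExecutor


class HoverPrefetcher:
    """Start loading a title when it gains focus; reuse that work on launch."""

    def __init__(self, load_fn, max_workers=1):
        self._load_fn = load_fn  # hypothetical: reads a title's boot data
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending = {}       # title -> Future for the background read

    def on_hover(self, title):
        # Kick off a background read only once per title.
        if title not in self._pending:
            self._pending[title] = self._pool.submit(self._load_fn, title)

    def launch(self, title):
        # Use the head start if there is one, otherwise load right now.
        future = self._pending.pop(title, None)
        if future is None:
            return self._load_fn(title)
        return future.result()

    def shutdown(self):
        # Wait for any outstanding reads and release the worker threads.
        self._pool.shutdown(wait=True)
```

A real version would also need to cancel or evict speculative reads when the user moves on to another tile, otherwise browsing the dashboard would fill memory with games nobody launched.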

chris1515 · Nov 8, 2020

BRiT said:
If I were a console OS coder, I would put in a "cheat", a predictive pre-emptive cheat. If you don't have a game loaded, the moment the user moves over a title in the UI, I would start reading the game into memory so that when the user does choose to start it, you have a head start on their action.

There is a Sony patent that talks about doing this, but it is more about preemptively initializing the game engine (some loading included, but that is far from being the longest part). A cold boot is far from being loading only; if it were, optimized PS5 games would load as fast as a Ratchet and Clank portal.
