I am also a bit puzzled by the idea of mid-frame usage of something from the SSD. How small is the chunk or texture that could realistically be used mid-frame? For that matter, within a 16.6 ms or 8.3 ms frame for that game (60 Hz / 120 Hz)?
Why tie a mid-frame piece of data to the slowest, highest-latency hardware sub-component?
The transactions the SSD actually works with best are NAND pages of 2KB or more, and virtual memory pages are 4KB at a minimum.
I'm not sure there's sufficient benefit to going smaller, since the physical properties of the arrays, virtual memory, and the PCIe bus favor multiple KB granularity for good utilization.
I think the rumor is that the XBSX is using a PHISON controller with ~440K read IOPS? That would require an average granularity at the drive level of roughly 1.3x 4KB per operation to reach the bandwidth quoted for the console.
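As a quick sanity check on that figure, here's a back-of-envelope sketch; the 2.4 GB/s and ~440K IOPS numbers are the rumored values above, not confirmed specs:

bandwidth_bytes = 2.4e9         # rumored raw throughput, bytes per second
read_iops = 440_000             # rumored controller read IOPS
avg_transfer = bandwidth_bytes / read_iops
print(avg_transfer)             # ~5450 bytes per operation
print(avg_transfer / 4096)      # ~1.33x a 4KB page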
A 2.4 GB/s drive can provide ~40MB of data per frame at 60 FPS, though the proposed scenario has a load/replace cycle, meaning two phases each with an aggregate of 20MB max.
These figures can be doubled from the point of view of the GPU if we assume the claimed average 2x compression ratio holds.
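Putting rough numbers on that, using the same rumored figures, a flat 60 FPS, and the claimed 2x compression:

raw_bandwidth = 2.4e9                    # bytes per second, raw SSD throughput
frame_time = 1 / 60                      # seconds per frame at 60 FPS
per_frame = raw_bandwidth * frame_time   # ~40 MB raw per frame
per_phase = per_frame / 2                # ~20 MB for each of the load/replace phases
effective_per_phase = per_phase * 2      # ~40 MB per phase if the 2x compression claim holds
print(per_frame / 1e6, per_phase / 1e6, effective_per_phase / 1e6)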
It's also likely that I'm being too generous, since "mid-frame" doesn't give a precise window within the frame. There would be intervals at the start and end where the shaders in question hadn't launched yet or were still ramping, and a period of time prior to the end where a renderer wouldn't expect any intermediate loads to be useful. Hundreds of microseconds of latency for GPU ramp and SSD response are likely to be appreciable when working with single-digit millisecond budgets.
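To get a feel for how much of the window that eats, a rough sketch; the 0.3 ms figures for GPU ramp and SSD response are illustrative guesses, not measurements:

frame_ms = 8.3                              # 120 Hz frame budget in milliseconds
overhead_ms = 0.3 + 0.3                     # guessed GPU ramp + SSD response latency
usable_ms = frame_ms - overhead_ms          # ~7.7 ms left for mid-frame work
print(usable_ms, overhead_ms / frame_ms)    # ~7% of the budget lost to latency alone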
This may be where some of the unknowns come in for the SSD overheads that might be removed relative to PC SSD benchmarks. ~100 usec seems to be part of what Intel assumes is likely (in Optane marketing):
https://builders.intel.com/datacent...-with-qlc-unleashing-performance-and-capacity, which seems much lower than the hundreds of usec that many consumer drives have been benchmarked as having at Anandtech.
Those are average latencies as well, rather than 99th percentile figures, which range over two orders of magnitude depending on the drive and things like drive type (SLC/MLC/TLC/QLC) and empty/full status.
This may be an area where Microsoft's choice of a fixed, proprietary expansion drive reduces the complexity of navigating the massive variation in quality and performance consistency in the standard SSD space.
PRT has fallbacks for when a page is found to not be resident. At an ISA level, AMD's GPUs have an option to substitute 0 for any failed load if need be. I think there have been some unspecified changes to the filtering algorithms in Microsoft's sampler feedback implementation to help account for non-resident pages and whatever fallback is in place.
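As a rough illustration of the fallback behavior only (a hypothetical sketch, not AMD's ISA semantics or Microsoft's actual sampler feedback implementation; the page_table and texture objects and their methods are made up for the example):

def sample_with_fallback(texture, page_table, uv):
    # Hypothetical sketch: check residency of the page backing this sample.
    page = page_table.lookup(uv)
    if not page.resident:
        # Mirror the ISA option of substituting 0 for a failed load,
        # and record the miss so the streaming system can page the data in later.
        page_table.record_miss(page)
        return (0.0, 0.0, 0.0, 0.0)
    return texture.sample(page, uv)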
One thing the scenario didn't mention was whether it was mid-frame load, consume, unload, replace, and then consume again. There could be a difference in cost if this process is being used to help pre-load data for the next frame, versus intra-frame reuse.
I mean, I guess I would ask why not just leave it in memory? Unless you've run out of memory entirely, in which case you're going to need to start relying on the SSD to stretch the limits of your VRAM.
More aggressive memory management is what the upcoming generation is counting on to compensate for the historically modest gains in RAM capacity.
"AMD makes a thread leave the chiplet, even when it’s speaking to another CCX on the same chiplet, because that makes the control logic for the entire CPU a lot easier to handle. This may be improved in future generations, depending on how AMD controls the number of cores inside a chiplet."
Again, this is talking about multiple chiplets, but it does highlight that there are probably gains to be had from controlling which threads share a core and a CCX / L3. Maybe there are changes to the control logic that would make sense in a console that wouldn't be worth it on PC, or that weren't ready when Matisse launched.
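On the software side, at least some of that placement control already exists; a minimal Linux sketch, assuming logical CPUs 0-3 map to one CCX (the real numbering depends on the SKU and SMT layout):

import os

# Pin the calling process to CPUs 0-3, assumed here to share one CCX/L3,
# so cooperating threads stay within a single L3 instead of bouncing across CCXs.
ccx0_cpus = {0, 1, 2, 3}
os.sched_setaffinity(0, ccx0_cpus)   # pid 0 = the calling process
print(os.sched_getaffinity(0))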
AMD's cache subsystem depends on the memory controllers, or the logic attached to those blocks, to be the home agents in the memory subsystem. Uprooting the logic that arbitrates all remote cache transactions would likely impinge on the fabric and CCX architectures, and it touches a design fundamental that AMD hasn't changed since (I think) the K8. Even assuming AMD were willing, I'd suspect the risk and price would be significant for a modest latency benefit.