Xbox Series X [XBSX] [Release November 10 2020]

But you can't stream it mid-frame on demand. The GPU can't say, "where's this texture?" and have the system say, "hang on, I'll just fetch it from the SSD." The system will need to be looking way ahead and thinking, "okay, the GPU is gonna need that texture in 1 ms' time. I'd better grab it now." At which point, streaming in 16 ms earlier rather than 0.5 ms earlier is safer, completely reliable, and an extension of the already-present systems. In the case that you are trying to use a texture still sat on the SSD mid-frame and you miss the timing, the worst that happens is you are one frame out and have to use a lower LOD for one frame. That's a really, really marginal and minimal problem to solve with mid-frame loads! ;)

Perhaps more realistic are non-graphics workloads, where asynchronous processing can work through jobs on persistent data without being tied to framerate. I expect all graphics tasks to be working on RAM-resident data, prefetched at least a frame in advance.
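A minimal sketch of that "prefetch a frame ahead, never block mid-frame" flow, purely to illustrate the idea; all names, and the simple one-frame lookahead policy, are hypothetical:

Code:
#include <chrono>
#include <cstdio>
#include <future>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical illustration of the flow described above: the streaming system
// predicts what the GPU will need next frame and kicks off async SSD reads
// now, so nothing ever blocks mid-frame. A missed prediction just means one
// frame at the lower LOD that is already resident.
struct TextureStreamer {
    // Lower LOD number = finer detail; 99 = nothing resident.
    std::unordered_map<std::string, int> residentLod;
    std::unordered_map<std::string, std::future<int>> pending;

    // Stand-in for an asynchronous SSD read; returns the LOD it loaded.
    static int readFromSsd(std::string /*name*/, int lod) { return lod; }

    // Called during frame N with predictions for frame N+1.
    void prefetch(const std::vector<std::pair<std::string, int>>& nextFrameNeeds) {
        for (const auto& [name, wantedLod] : nextFrameNeeds) {
            auto it = residentLod.find(name);
            int haveLod = (it == residentLod.end()) ? 99 : it->second;
            if (haveLod > wantedLod && !pending.count(name))
                pending[name] = std::async(std::launch::async, readFromSsd, name, wantedLod);
        }
    }

    // Called while rendering: never waits on the SSD.
    int lodForDraw(const std::string& name) {
        auto p = pending.find(name);
        if (p != pending.end() &&
            p->second.wait_for(std::chrono::seconds(0)) == std::future_status::ready) {
            residentLod[name] = p->second.get();
            pending.erase(p);
        }
        auto it = residentLod.find(name);
        return (it == residentLod.end()) ? 99 : it->second;
    }
};

int main() {
    TextureStreamer streamer;
    streamer.residentLod["rock_albedo"] = 4;      // only a coarse mip in RAM
    streamer.prefetch({{"rock_albedo", 1}});      // predicted need for next frame
    std::printf("drawing with LOD %d\n", streamer.lodForDraw("rock_albedo"));
}

The point of the sketch is the split: prefetch() is fed predictions and talks to the SSD asynchronously, while lodForDraw() only ever looks at what is already in RAM.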
 
I agree with that. Async probably makes more sense. As for caching locations etc., that should reduce seek time to very little and make the operation a page-in from the drive. I believe SSD latencies are measured in microseconds, so I'm not exactly sure how much time they have available to do things.

I do agree they are probably just messing around. I don’t expect the SSD to ever be integrated into the pipeline.
 
I am also a bit ??? at the idea of mid-frame usage of something from the SSD. Like, how small is that chunk or texture that could realistically be used mid-frame? And how does that fit within a 16.6 or 8.3 ms frame (60 Hz / 120 Hz)?

Tying mid-frame use of data to the slowest, highest-latency hardware sub-component?

The actual transactions the SSD works with best are the 2KB or greater NAND pages, and the virtual memory pages are at a minimum 4KB.
I'm not sure there's sufficient benefit to going smaller, since the physical properties of the arrays, virtual memory, and the PCIe bus favor multiple KB granularity for good utilization.
I think the rumor is that the XBSX is using a Phison controller rated at ~440K read IOPS? That would require an average granularity at the drive level of something like 1.3x 4KB per operation to reach the bandwidth quoted for the console.
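As a rough sanity check on those figures (taking the ~440K IOPS rumor and the console's quoted 2.4 GB/s raw bandwidth at face value):

$$440{,}000 \times 1.3 \times 4\,\text{KB} \approx 440{,}000 \times 5.2\,\text{KB} \approx 2.3\,\text{GB/s}$$

which lands close enough to 2.4 GB/s for the 1.3x average to be in the right ballpark.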

A 2.4 GB/s drive can provide ~40MB of data per frame at 60 FPS, though the proposed scenario has a load/replace cycle, meaning two phases each with an aggregate of 20MB max.
These figures can be doubled from the point of view of the GPU if we assume the claimed average 2x compression ratio holds.
It's also likely that I'm being too generous, since "mid-frame" doesn't give a precise window within the frame. There would be intervals at the start and end where the shaders in question hadn't launched yet or need to ramp, and a period of time prior to the end where a renderer wouldn't expect any intermediate loads to be useful. Hundreds of microseconds of latency for GPU ramp and SSD response are likely to be appreciable when working with single-digit millisecond budgets.
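Putting rough numbers on that budget (raw drive rate and the claimed ~2x compression both taken at face value, before subtracting the ramp/latency windows just described):

$$\frac{2.4\,\text{GB/s}}{60\,\text{fps}} = 40\,\text{MB per frame}, \qquad \frac{40}{2} = 20\,\text{MB per phase}, \qquad 20 \times 2 \approx 40\,\text{MB effective after compression}$$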

This may be where some of the unknowns come in for the SSD overheads that might be removed relative to PC SSD benchmarks. ~100 usec seems to be part of what Intel assumes is likely (in Optane marketing) https://builders.intel.com/datacent...-with-qlc-unleashing-performance-and-capacity, which seems much lower than the hundreds of usec that many consumer drives have been benchmarked as having at AnandTech.
Those are average latencies as well, rather than the 99th percentile figures, which range over two orders of magnitude depending on the drive and things like drive type (SLC/MLC/TLC/QLC) and empty/full status.

This may be an area where Microsoft's choice of a fixed, proprietary expansion drive may reduce the complexity of navigating the massive variation in quality and performance consistency in the standard SSD space.

PRT has fallbacks for when a page is found to not be resident. At an ISA level, AMD's GPUs have an option to substitute 0 for any failed load if need be. I think there have been some unspecified changes to the filtering algorithms for Microsoft's sampler feedback implementation to help account for non-resident pages and whatever fallback is in place.
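A conceptual sketch of that kind of fallback, written on the CPU side for readability; this is not the actual ISA behaviour or the sampler feedback mechanism, just an illustration of "use the page if resident, otherwise drop to a coarser resident mip, otherwise substitute 0", with all names hypothetical:

Code:
#include <cstdio>
#include <vector>

// Hypothetical residency-aware sampler: mip 0 is the finest level. If the
// requested level isn't resident, fall back to the next coarser mip that is;
// if nothing is resident at all, substitute 0 (analogous to the ISA-level
// option mentioned above).
struct PartiallyResidentTexture {
    std::vector<std::vector<float>> mips;   // texel data per mip (tiny toy example)
    std::vector<bool> mipResident;          // coarse-grained residency flags

    float sampleWithFallback(int requestedMip, int texel) const {
        for (int m = requestedMip; m < (int)mips.size(); ++m) {
            if (mipResident[m]) {
                int idx = texel >> (m - requestedMip); // shift index for the coarser mip
                return mips[m][idx];
            }
        }
        return 0.0f;                        // nothing resident: substitute zero
    }
};

int main() {
    PartiallyResidentTexture tex;
    tex.mips        = {{1, 2, 3, 4}, {5, 6}, {7}};
    tex.mipResident = {false, true, true};  // finest mip not streamed in yet
    std::printf("%f\n", tex.sampleWithFallback(0, 3)); // falls back to mip 1 -> 6
}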

One thing the scenario didn't mention was whether it was mid-frame load, consume, unload, replace, and then consume again. There could be a difference in cost if this process is being used to help pre-load data for the next frame, versus intra-frame reuse.




I mean, I guess I would ask why not just leave it in memory? Unless you've run out of memory entirely, in which case you're going to need to start relying on the SSD to stretch the limits of your VRAM.
More aggressive memory management is something the upcoming gen is assuming will compensate for the historically modest RAM capacity gains.

"AMD makes a thread leave the chiplet, even when it’s speaking to another CCX on the same chiplet, because that makes the control logic for the entire CPU a lot easier to handle. This may be improved in future generations, depending on how AMD controls the number of cores inside a chiplet."

Again this is talking about multiple chiplets, but it does highlight that there are probably gains to be had by controlling which threads share a core and a CCX / L3. Maybe there are changes to the control logic that would make sense in a console but wouldn't be worth it on PC, or weren't ready when Matisse launched.
AMD's cache subsystem depends on the memory controllers or the attached logic in those blocks to be the home agents in the memory subsystem. Uprooting the logic that arbitrates for all remote cache transactions likely impinges on the fabric and CCX architectures, and touches a design fundamental that AMD hasn't changed since (I think) the K8. Assuming AMD would be willing, I'd suspect the risk and price would be significant for a modest latency benefit.
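Whatever happens at the hardware level, the software-visible knob for "which threads share a core / CCX / L3" already exists in the form of thread affinity. A minimal Linux-flavoured sketch; which logical CPUs map to which CCX is machine-specific (check lstopo or /sys/devices/system/cpu), so the 0-3 below is purely an assumption for illustration:

Code:
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread (and anything it spawns afterwards) to a set of
// logical CPUs. On a Zen 2 part, choosing CPUs that belong to the same CCX
// keeps cooperating threads on one L3 slice.
static int pin_to_cpus(const int* cpus, int count) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < count; ++i)
        CPU_SET(cpus[i], &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const int same_ccx[] = {0, 1, 2, 3};  // hypothetical: one CCX's logical CPUs
    if (pin_to_cpus(same_ccx, 4) != 0) {
        std::fprintf(stderr, "pthread_setaffinity_np failed\n");
        return 1;
    }
    std::printf("threads created from here inherit this affinity mask\n");
    return 0;
}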
 
A vexing problem of going wide is managing the added parallelism. As succinctly explained by Mark Cerny, there is the issue arising from dynamic parallelism where the number of independent executable tasks is exceeded by the number of threads, resulting in poor saturation and inefficiency. Attempts to saturate threads can result in irregular, data-dependent workloads where thread contention issues (ABA, spinlocks, etc.) can arise, and the larger the number of cores, the bigger the problem. MS seems to have come up with a new flavour of non-blocking technique for concurrent data structures in the patent "FIFO Queue, Memory Resource and Data Management for Graphics Processing" by J. M. Gould and I. Nevraev, which claims to alleviate contention issues while keeping a tight handle on memory usage.
This sheds some light on MS's choice of massive parallelism and its struggles to deal with it.
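For a flavour of what a non-blocking concurrent data structure looks like in practice, below is a textbook single-producer/single-consumer ring buffer built on atomics with no locks. To be clear, this is not the mechanism from the Gould/Nevraev patent, just a minimal example of the general class of structure it belongs to:

Code:
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>

// Classic lock-free single-producer/single-consumer FIFO: the producer only
// writes head_, the consumer only writes tail_, so neither ever waits on a
// lock held by the other.
template <typename T, std::size_t N>
class SpscQueue {
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
public:
    bool push(const T& v) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N) return false; // full
        buf_[h % N] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t) return false;     // empty
        out = buf_[t % N];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};

int main() {
    SpscQueue<int, 1024> q;
    std::thread producer([&] { for (int i = 0; i < 100000; ++i) while (!q.push(i)) {} });
    long long sum = 0; int v = 0; int received = 0;
    while (received < 100000) if (q.pop(v)) { sum += v; ++received; }
    producer.join();
    std::printf("sum = %lld\n", sum);  // 100000 * 99999 / 2 = 4999950000
}

Multi-producer/multi-consumer variants are where the ABA and contention problems mentioned above really start to bite, which is presumably the territory the patent is aimed at.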
 
All directions point towards increased parallelism, as power consumption is the harder problem to solve. With each generation, our video cards and CPUs have been given more cores, not fewer. We've never gone backwards, because backwards is much less efficient in the power/performance trade-off.
 
I agree. But parallelism is not a panacea and comes with its own set of problems, particularly because current computer-science concepts, be it command scheduling or cache management, are more suited to CPUs than GPUs. It reminds me of the classical Operations Research problems I studied in college.
 

I also wish to point out that if the claims of the patent do pan out, it's a massive win for MS.
 
Specifically for GPUs, does the compute pipeline have scheduling issues? I don't believe the command processor is really used all that much; otherwise we couldn't do massive calculations with CUDA split over several cards.

I largely suspect CU scheduling for efficiency is not the main issue. Supporting bandwidth for so much compute is.
 

When you are no longer facing a bandwidth bottleneck, it is.

https://egrove.olemiss.edu/cgi/viewcontent.cgi?article=2587&context=etd

https://vtechworks.lib.vt.edu/bitst...d=3301894A2069755AA2DDC449C58A057A?sequence=1
 
What do you mean by "choice for massive parallelism"? They've done nothing different to anyone else and have no more parallelism than anyone else.

Who is "anyone else" ? The only other relevant party is of course Sony as in operating with the same form factor and power/cost budget.
 
I guess he's referring to CU count.
That's no different to any other GPU though - Arcturus has been found configured to 128 CUs! The only people not pursuing width and parallelism are Sony, and that's not because parallelism is difficult or wasteful but for BC reasons, using a strangely narrow GPU.

Who is "anyone else" ? The only other relevant party is of course Sony as in operating with the same form factor and power/cost budget.
It's not that MS chose wide, but that Sony uniquely and counter-intuitively chose narrow, which has left us all wondering why, and the only apparent rational explanation is CU count matching with the PS4 Pro for BC reasons. Having settled on narrow, they then set about clocking it up the wazoo to get more power from it. It wasn't a design choice to solve parallelism issues that caused Sony to go narrow and clock high.
 

I thought so too for a long time. But it really does not make sense. High clocks were baked into the design from the very start.
 

I think the goal was not to "solve parallelism issues" (a non sequitur when speaking of a GPU) but to have fewer of its problems. Also, was Mark Cerny lying?
 
Not so much that Cerny was lying but trying to justify their choice to a fanbase who had just seen the relatively massive GPU Microsoft rolled out. And it worked because now the PS5 subreddit is full of people parroting Cerny's comments as a means to say the PS5 is somehow more powerful than the XSX.
 
Says who? I've never seen any mention of any reason why they would clock it that high. It takes roughly double the power to go from 1825 MHz to 2230 MHz. That's the power envelope of the same chip with potentially 72 CUs, which is not a small thing. And I largely suspect that's where and how the XSX is able to have more CUs and TF and still be a smaller box at 52 CUs. There's no magic here. Wide and slow produces way more TF than narrow and fast, with dramatic power savings.
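For a rough sense of why a ~2x power figure at those clocks is plausible: dynamic power scales roughly with frequency times voltage squared, and the top of the frequency range needs disproportionately more voltage. Assuming, purely for illustration, that the voltage has to rise by about 28% to hold the higher clock:

$$\frac{P_{2230}}{P_{1825}} \approx \frac{2230}{1825} \times \left(\frac{V_{2230}}{V_{1825}}\right)^{2} \approx 1.22 \times 1.28^{2} \approx 2$$

The actual voltage/frequency curve of the PS5 silicon isn't public, so the 28% is just an assumed value chosen to match the ~2x claim, not a measurement.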
 

Seems very unlikely they had planned to go with 2.3 GHz from the beginning, around 2015. How could they, or even AMD, know those clocks would be sustainable then? They could have aimed for high clocks of course, but 2.3 GHz?
Most likely they wanted solid BC, it's on the same arch as the PS4 this time around, and they went with 36 CUs, perhaps targeting double the Pro's power in TF terms, just as MS doubled their premium console (12.2 TF).
They would then have upped the clocks from there as high as possible; perhaps they could not maintain a fixed 2 GHz/9 TF, so they went with boost clocks to reach a higher peak value of 10.2 TF at 2.23 GHz, with a custom SmartShift implementation.
Something that floats between 9 and 10 TF (dipping under high load) is always a win over a fixed 9 TF.

Sony is the only one going this fast and narrow for a higher-end GPU; I strongly doubt the RTX 3000 series or higher-end RDNA2 dGPUs will go very narrow and very fast. Rumors point to that not being the case, at least.
And no, I doubt it would have mattered much for the content we got if Sony had gone with a more powerful, wider GPU; the "true next-gen leap" some expected was just not attainable with the power increases we got on either next-generation console.
 