Xbox Series X [XBSX] [Release November 10, 2020]

Sure, but in any given period of time, if your processing is limited by how much data you can fetch in X amount of time, it doesn't matter if you process items more quickly due to clock speed or due to more CUs.

Basically if your SOC is sitting idle because you can only get X amount of data to the SOC then it does not matter how quickly it can process that data. The faster you process it, the faster you get to sit there and wait for more data to come in.

In that case, regardless of the system architecture, the system that can provide more data to the SOC will have an advantage.

That's what happens if you become limited by your bandwidth. Your SOC could be operating at 10 GHz instead of 3 GHz, but both are rendering at the same speed if they both have the same bandwidth and are limited by it. Another SOC could have 1000 CUs or 300 CUs, but they are both rendering a scene at the same speed if they are limited by the same bandwidth.

In situations like this, where pure bandwidth is the main limitation, whatever system has more bandwidth will have the advantage regardless of how much faster or wider one SOC is than another. It doesn't matter how quickly your SOC can process the data if your system can get less data to the SOC than another architecture.
  • The PS5 can, in ideal situations, process 448 GB/s of data from main memory. The SOC can process more than that depending on the workload, but that is the absolute limit if it has to access main memory.
  • The XBS-X can, in ideal situations, process 560 GB/s of data from main memory. The SOC can process more than that depending on the workload, but that is the absolute limit if it has to access main memory.
A bandwidth-limited situation means the SOC could process more than those data rates at that point in time, but it's limited to pulling in data at those rates. When this situation arises, it doesn't matter that you have similar or greater bandwidth per CU; your system is still sitting there twiddling its thumbs if it's already pulling in the maximum amount of data that it can.
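To put rough numbers on that ceiling, here's a quick Python sketch. The 6 GB-per-frame traffic figure is purely an assumption for illustration, not a measured value; only the two peak bandwidth numbers come from the spec sheets.

Code:
# If a frame is purely bandwidth-bound, its render time is simply
# (bytes moved per frame) / (memory bandwidth), regardless of clocks or CU counts.
def bandwidth_bound_fps(traffic_gb_per_frame, bandwidth_gb_per_s):
    frame_time_s = traffic_gb_per_frame / bandwidth_gb_per_s
    return 1.0 / frame_time_s

for name, bw in (("PS5", 448), ("XBS-X", 560)):
    print(f"{name}: {bandwidth_bound_fps(6.0, bw):.1f} fps ceiling at 6 GB of traffic per frame")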

Bandwidth obviously isn't the only limitation you can run into, so how much of a limitation it is depends on how often your system needs more data than it can pull from main memory.

Regards,
SB
Yes, as I wrote, the XSX has a 25% bandwidth advantage (if we take into account only the 10 GB of RAM).
 
That's what happens if you become limited by your bandwidth. Your SOC could be operating at 10 GHz instead of 3 GHz, but both are rendering at the same speed if they both have the same bandwidth and are limited by it. Another SOC could have 1000 CUs or 300 CUs, but they are both rendering a scene at the same speed if they are limited by the same bandwidth.

In situations like this, where pure bandwidth is the main limitation, whatever system has more bandwidth will have the advantage regardless of how much faster or wider one SOC is than another. It doesn't matter how quickly your SOC can process the data if your system can get less data to the SOC than another architecture.
The only thing I would add to this is that you can also have a CU bottleneck as well. If you have a workload that is significantly larger than the number of available ALUs, you have a compute deficiency that cannot be rectified by bandwidth.

So, for an extreme example in this case: 1 CU @ 78 GHz vs 36 CUs @ 2.3 GHz. The latter will outperform the former with the same bandwidth and memory setup. This is because the former has to make 36x more write trips and 36x more read trips than the latter. So while the bandwidth is available, there aren't enough CUs to take advantage of it.

So you could have 1 TB/s of bandwidth, but a single CU is only capable of requesting so much data before it's full. You can process it fast, sure, but requesting and writing data is likely the slowest part of the process here, because latency becomes a factor the more requests you make. We traditionally hide latency by introducing more threads, but once again there is a limit to that as well.

I definitely don't think having a 36x faster front end on the graphics side will make up for the number of memory trips later down the pipeline.

1 CU will need to make 36x the requests versus the 36 CUs' single request to fulfill the same amount of work. You can eventually extrapolate this to other items over time.

The reason why we don't see things like 80 CUs have a huge lift over smaller CU counts is likely because the workload just hasn't been large enough for the smaller ALU counts combined with smaller caches to fall off a cliff.

It's not always linear, and more often than not most things run very well until you reach the workload point that breaks the camel's back and things run progressively worse.
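A toy Python model of the request-trip argument; the latency and in-flight-request numbers below are invented purely to illustrate the shape of the problem, not to describe either console.

Code:
# Each CU can only keep a limited number of memory requests in flight, so one
# very fast CU still pays for far more serialized round trips than 36 slower
# CUs issuing requests in parallel.
MEM_LATENCY_NS = 300             # assumed round-trip latency per batch of requests
TOTAL_REQUEST_BATCHES = 36_000   # assumed total batches the workload needs
IN_FLIGHT_PER_CU = 10            # assumed outstanding requests a CU can sustain

def memory_wait_us(cu_count):
    parallel_slots = cu_count * IN_FLIGHT_PER_CU
    serial_waves = -(-TOTAL_REQUEST_BATCHES // parallel_slots)   # ceiling division
    return serial_waves * MEM_LATENCY_NS / 1000

print(f"1 CU  @ very high clock: {memory_wait_us(1):.0f} us waiting on memory")
print(f"36 CU @ modest clock   : {memory_wait_us(36):.0f} us waiting on memory")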
 
Compute performance from higher clocks relies on being fed; it relies on data being in cache to maximize performance. Inversely, idle clock cycles result in lower performance.
Yup. Within a certain envelope, the clock speed and number of compute units do not matter; it's about how many free compute cycles you have available in any given timeframe, and Series X should have more, unless that higher ceiling has been eaten into by running a higher resolution.

All things being equal, Series X should support better RT than PS5.
 
The only thing I would add to this is that you can also have a CU bottleneck as well. If you have a workload that is significantly larger than the number of available ALUs, you have a compute deficiency that cannot be rectified by bandwidth.

So, for an extreme example in this case: 1 CU @ 78 GHz vs 36 CUs @ 2.3 GHz. The latter will outperform the former with the same bandwidth and memory setup. This is because the former has to make 36x more write trips and 36x more read trips than the latter. So while the bandwidth is available, there aren't enough CUs to take advantage of it.

So you could have 1 TB/s of bandwidth, but a single CU is only capable of requesting so much data before it's full. You can process it fast, sure, but requesting and writing data is likely the slowest part of the process here, because latency becomes a factor the more requests you make. We traditionally hide latency by introducing more threads, but once again there is a limit to that as well.

I definitely don't think having a 36x faster front end on the graphics side will make up for the number of memory trips later down the pipeline.

1 CU will need to make 36x the requests versus the 36 CUs' single request to fulfill the same amount of work. You can eventually extrapolate this to other items over time.

The reason why we don't see things like 80 CUs have a huge lift over smaller CU counts is likely because the workload just hasn't been large enough for the smaller ALU counts combined with smaller caches to fall off a cliff.

It's not always linear, and more often than not most things run very well until you reach the workload point that breaks the camel's back and things run progressively worse.

Yes. In a nutshell, many people don't realize that every architecture out there is going to be limited at X time by Y thing. Until such time as there are unlimited resources and capabilities on some piece of hardware, there will always be times when it's limited by some bit of its architecture.

It's just that when something is the fastest piece of hardware, you don't think about the times when it hits its limitations, because it's human nature to assume it's not limited since it's the fastest thing on the market. But anytime some part of the hardware is idling, it means another part of the hardware is operating at its limits, thus limiting the overall performance of the hardware at that point.

The practical holy grail of hardware isn't to design something with no limits, but to design something where each piece can be the limiting factor at some point. IE - the HD 2900 XT wasn't a good design because its bandwidth was never a limitation for any real-world use (thus those transistors could have been better spent on something else).

Now when comparing 2 pieces of hardware, the interesting thing is to tease out how one architecture might be limited by X feature in Y situations versus another architecture. Unfortunately, a lot of noise often comes in with partisan comments that this means one architecture is overall better than another because of that, when that isn't necessarily the case at all.

Just because one system might be slightly better at RT doesn't suddenly make the other architecture not good. Just because one arch has a lower clock speed doesn't make it worse. Just because one arch has more CUs doesn't mean the other arch is bad. Etc., etc.

If someone can't acknowledge when their arch might have a limitation that another arch is less limited by, then there's no way they can fairly judge different architectures. Likewise if they can never admit that another arch than the one they like is better in some areas, the same problem arises.

Of course, in a good technical discourse there will always be a back and forth about the relative strengths or weaknesses and how they impact the overall performance of an arch, or even discussion about whether some feature is a weakness or a strength, with evidence provided by real-world applications after a product has spent enough time on the market.

It's unfortunate that I sometimes see too much of X is better than Y because ... limited data. It's still early in the product cycle. Each arch has been on the market for less than a year. Very little software has been written to utilize the features of either product. Yet, some are already making claims that X is better at Y thing on such limited data.

That said, I do appreciate all the people that keep an open mind and attempt to steer the discourse into talking about why A might be better than B in S product doing X, Y, or Z thing. Is it the hardware? Is it the software? Is it the development environment? Is it something non-obvious? Is it the skill level of the developer? Is it the time spent on A or B arch? Etc.

It's also a little frustrating if someone keeps pointing out that X thing is true because this is how that person interprets Cerny's words, yet at the same time dismisses anything Andrew Goosen might have said about the arch he helped create. Likewise, going the other way around, pointing out things Goosen said while ignoring things that Cerny said.

Also, if someone is going to go through a video frame by frame to find places where X arch is doing something better than Y arch, your argument will be stronger if you also point out the frames where Y arch is doing something better than X arch. Otherwise you'll often come out looking partisan, whether or not that's your intention. There have been a lot of screenshots posted here attempting to show that X arch is better or worse, only for someone else to come in later and show that the opposite is true depending on which frame or screenshot was cherry-picked to show some alleged superiority or inferiority. Or, at the very least, while you're looking for proof that your arch is definitely better, try to make sure there isn't also evidence in the footage of your favorite arch doing the exact same thing? :)

People, it's not the end of the world if your system of preference is slightly worse or slightly better at this or that. :) The fun is in looking at what is happening and trying to tease out any details we can from it.

Bleh, this turned out a lot longer than I intended. Perhaps a side-effect of my not wanting to put people on ignore. :p

Regards,
SB
 
It's tough to saturate a GPU. I have nvidia-smi running when I profile my code for data work, and I can barely make it blip up to 14%. It's just sitting around waiting for data to do work. Computation is just so fast.

I get the challenges that everyone puts forward as pros and cons. But so much of that is just hardware talk; making software maximize said hardware is incredibly difficult and we probably don’t put enough focus on how hard that may be.
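For anyone wanting to reproduce that kind of observation, here is a minimal polling sketch. It assumes an NVIDIA GPU with the driver's nvidia-smi tool on the PATH and just samples the reported utilization once per second.

Code:
import subprocess, time

# Sample GPU utilization and memory use once per second for ten seconds.
for _ in range(10):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    util, mem = out.strip().splitlines()[0].split(", ")
    print(f"GPU utilization: {util}%   memory used: {mem} MiB")
    time.sleep(1)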
 
The only thing I would add to this is that you can also have a CU bottleneck as well. If you have a workload that is significantly larger than the number of available ALUs, you have a compute deficiency that cannot be rectified by bandwidth.

So, for an extreme example in this case: 1 CU @ 78 GHz vs 36 CUs @ 2.3 GHz. The latter will outperform the former with the same bandwidth and memory setup. This is because the former has to make 36x more write trips and 36x more read trips than the latter. So while the bandwidth is available, there aren't enough CUs to take advantage of it.

So you could have 1 TB/s of bandwidth, but a single CU is only capable of requesting so much data before it's full. You can process it fast, sure, but requesting and writing data is likely the slowest part of the process here, because latency becomes a factor the more requests you make. We traditionally hide latency by introducing more threads, but once again there is a limit to that as well.

I definitely don't think having a 36x faster front end on the graphics side will make up for the number of memory trips later down the pipeline.

1 CU will need to make 36x the requests versus the 36 CUs' single request to fulfill the same amount of work. You can eventually extrapolate this to other items over time.

The reason why we don't see things like 80 CUs have a huge lift over smaller CU counts is likely because the workload just hasn't been large enough for the smaller ALU counts combined with smaller caches to fall off a cliff.

It's not always linear, and more often than not most things run very well until you reach the workload point that breaks the camel's back and things run progressively worse.
OK, so we now have a quite good benchmark in RE8's RT mode and the gap is still minimal, so maybe CU count * clock is indeed a better performance indicator than CU count alone? ;)
 
We only have a time-limited demo; perhaps wait until the full release, where tech-heads have more time and more regions to analyze?

Though I do think that it sometimes can be closer to Count*Clock-vs-Count*Clock than just Count-vs-Count. It all depends where the limiters are for the workloads.
 
*Ahem* Behave. Also, not the thread for console comparisons. We generally avoid those because of how quickly it loses the ability to remain as a civil discussion.
 
OK, so we now have a quite good benchmark in RE8's RT mode and the gap is still minimal, so maybe CU count * clock is indeed a better performance indicator than CU count alone? ;)
They just do different things, if we're talking about particular functions required in a pipeline. Clock speed increases the speed of the whole pipeline, while having more CUs may only assist in improving one or two aspects of the pipeline (in scenarios where more hardware units would be advantageous to have).

Since you're measuring just the final output, you don't know which aspects the PS5 is doing better or worse at than the XSX. This isn't the same as talking about strictly RT performance; and for there to be a larger gap in RT performance, the RT workload has to be greater, as it must be a greater % of frame time in order to see a differential in RT performance.
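A back-of-the-envelope, Amdahl-style sketch of that last point; the 30% RT-throughput advantage and the frame-time shares are assumed numbers purely for illustration.

Code:
def overall_speedup(rt_share, rt_advantage):
    # Only the RT portion of the frame gets faster; the rest is unchanged.
    new_frame = (1 - rt_share) + rt_share / (1 + rt_advantage)
    return 1 / new_frame

for share in (0.10, 0.25, 0.50):
    gain = overall_speedup(share, 0.30) - 1
    print(f"RT at {share:.0%} of frame time -> only {gain:.1%} faster overall")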
 
They just do different things, if we're talking about particular functions required in a pipeline. Clock speed increases the speed of the whole pipeline, while having more CUs may only assist in improving one or two aspects of the pipeline (in scenarios where more hardware units would be advantageous to have).

Since you're measuring just the final output, you don't know which aspects the PS5 is doing better or worse at than the XSX. This isn't the same as talking about strictly RT performance; and for there to be a larger gap in RT performance, the RT workload has to be greater, as it must be a greater % of frame time in order to see a differential in RT performance.
Yeah, but that's generally how games work. Maybe you are right and there exists a synthetic benchmark that would show an increase in RT performance relative to compute performance, though I'm not sure.
 
Yeah, but that's generally how games work. Maybe you are right and there exists a synthetic benchmark that would show an increase in RT performance relative to compute performance, though I'm not sure.
Yes, games work that way because not all parts of the game are run in parallel across the CUs. At least with the older APIs, there is a heavy reliance on the 3D pipeline and the fixed-function units to do a lot of the heavy lifting.

But on the question of whether having more CUs (and RT units) would be more advantageous for RT performance versus having fewer at a higher clock speed, I would say yes, it likely is. Ray traversal takes a while; having more units and more bandwidth would be ideal in this scenario. But if the workload is not large enough, that advantage in RT units will likely not make up the deficit elsewhere (i.e., being much slower on the front end of the pipeline).

When games leave last generation behind, I would re-assess this, because the new APIs rely significantly less on the fixed-function hardware and more on the compute units to do the work. And the only engine I know of that is capable of doing next-gen things without using the newer API features is UE5... which isn't out.
 
Yes, games work that way because not all parts of the game are run in parallel across the CUs. At least with the older APIs, there is a heavy reliance on the 3D pipeline and the fixed-function units to do a lot of the heavy lifting.

But on the question of whether having more CUs (and RT units) would be more advantageous for RT performance versus having fewer at a higher clock speed, I would say yes, it likely is. Ray traversal takes a while; having more units and more bandwidth would be ideal in this scenario. But if the workload is not large enough, that advantage in RT units will likely not make up the deficit elsewhere (i.e., being much slower on the front end of the pipeline).

When games leave last generation behind, I would re-assess this, because the new APIs rely significantly less on the fixed-function hardware and more on the compute units to do the work.
OK, do you think it would still have an advantage if there were the same TF and the same bandwidth but more CUs (and so a proportionally slower clock)?
 
OK, do you think it would still have an advantage if there were the same TF and the same bandwidth but more CUs (and so a proportionally slower clock)?
The best way to represent this would likely be a graph of CU count and clock speed vs performance while locking teraflops and bandwidth. Looking at the graph you'll find a local maximum, and you're also likely to find particular combinations in which too many CUs underperform significantly and too few CUs underperform as well.

The answer would probably just be reading off that graph, if that makes sense. You're going to want to choose the profile that gives you the best maximum in this case. But different workloads will likely result in different maxima.

So there is that consideration as well. I'm not sure if Sony and MS went for the overall best average performance, or if they biased their configurations toward what they thought the future would look like. Unfortunately this is a complete unknown, but an interesting discussion nonetheless.
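A purely invented toy sweep of that graph idea: hold TFLOPS (so CUs * clock) and bandwidth fixed, vary the CU count, and look for the frame-time minimum. Every constant here is an assumption chosen only to show that the optimum moves with the assumed workload, not a measurement of either console.

Code:
import math

TFLOPS = 10.3
CU_CLOCK_PRODUCT = TFLOPS * 1000 / 128   # RDNA-style: 64 lanes * 2 ops per clock
FRONTEND_WORK = 0.35                     # assumed work that scales with clock only
SHADER_WORK = 10.0                       # assumed work that scales with CUs * clock

def frame_time(cus):
    clock_ghz = CU_CLOCK_PRODUCT / cus
    # Too few CUs: not enough waves in flight to hide memory latency.
    # Too many CUs: each CU gets a smaller slice of the fixed bandwidth.
    efficiency = (1 - math.exp(-cus / 40)) * (1 - 0.003 * cus)
    return FRONTEND_WORK / clock_ghz + SHADER_WORK / (cus * clock_ghz * efficiency)

best = min(range(8, 81), key=frame_time)
print(f"Toy optimum for these made-up constants: {best} CUs @ {CU_CLOCK_PRODUCT / best:.2f} GHz")
for cus in (12, best, 36, 80):
    print(f"{cus:>2} CUs -> relative frame time {frame_time(cus):.3f}")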
 
Maybe the new Metro: Exodus will shed some light on this matter. I personally think that RT will play less of a role the further we get into the next gen. The new consoles are not powerful enough, with first-gen AMD RT hardware, for advanced RT. And to be honest the RT cost vs visual fidelity trade-off is very bad; if I remember correctly, Alex from DF said that the simplest RT effect adds 6 ms of rendering time on a 6800. That budget could be spent elsewhere with possibly a bigger impact on the overall visual presentation. But maybe in time engines will change and RT will play a bigger role and have a bigger impact. Right now I feel it's like tessellation on the X360.
 
OK, so we now have a quite good benchmark in RE8's RT mode and the gap is still minimal, so maybe CU count * clock is indeed a better performance indicator than CU count alone? ;)

Singular games are almost never a great benchmark for anything other than overall performance of an arch for a given game. IE - you can't usually make judgements about any specific implementation detail of hardware or software.

Even on the PC, this is the case, although on PC you have greater control in that you can enable or disable things to try to tease out possibilities.

Imagine it this way. Let's say some hypothetical game is hypothetically
  • Limited by physics calculations 5% of the time.
  • Limited by compute 25% of the time.
  • Limited by rasterization 10% of the time.
  • Limited by RT 15% of the time.
  • Limited by trips to main memory (bandwidth) 10% of the time.
  • Etc.
If the game runs well on X hardware versus Y hardware, can you definitively say that A portion of hardware X, B portion of hardware X, or C portion of hardware X is the main reason that X hardware performs better than Y hardware? Now, to further complicate things, let's say those percentages apply to X hardware, but the percentages change for Y hardware because Y hardware has made different choices in how it was designed. Yeah, not easy to say...
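As a purely illustrative sketch (every percentage below is invented), two very different limiter profiles can add up to the same frame time, which is exactly why the final number alone can't tell you which block of the hardware deserves the credit or the blame.

Code:
frame_ms = 16.6
profile_x = {"physics": 0.05, "compute": 0.25, "raster": 0.10, "rt": 0.15, "bandwidth": 0.10, "other": 0.35}
profile_y = {"physics": 0.05, "compute": 0.15, "raster": 0.20, "rt": 0.25, "bandwidth": 0.05, "other": 0.30}

for name, profile in (("X", profile_x), ("Y", profile_y)):
    breakdown = ", ".join(f"{k} {share * frame_ms:.1f} ms" for k, share in profile.items())
    print(f"Hardware {name}: {sum(profile.values()) * frame_ms:.1f} ms total -> {breakdown}")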

We can take guesses, certainly, but they'll never be anything more than guesses without being able to see the developer's internal performance graphs.

Things are further complicated by the fact that we generally can't run console games with unlocked framerate. Games often run at different resolutions, further complicated by dynamic resolution where we're relying on someone attempting to determine resolution of a very VERY (as in extremely) small sample of frames. The potential for error there is huge even with people and sites that have been doing this for years.

This is further complicated when the performance lead for 2 pieces of hardware might swap positions depending on what level or what portions of a level are benchmarked. In the PC space, you can (and some sites have) cherry picked where they benchmark a game because that particular section will put their favored hardware in a more favorable light, while another site might choose a different point in the game to benchmark because it puts their hardware ahead in that portion of the game.

Benchmarking to determine the performance of specific parts of a given piece of hardware is already hard, and more often than not the results are either indeterminate or misinterpreted. And that's where you at least have some level of control over the benchmark conditions in attempting apples-to-apples comparisons. And that's also where you have some tools available to attempt to see what the hardware is doing.

Trying to do that on consoles where you have almost no control over the benchmark conditions and no hardware level tools?

If one console performs relatively better in RT on mode versus RT off mode (if it has that setting), this might or might not tell us anything about relative RT performance ... assuming that anything that changes between RT on and RT off is limited only to RT. But if other things also change (like resolution, effects, etc.)? It'll be interesting certainly, but I'm not sure how much we would be able to take away from it.

BTW - this isn't to say that it won't be fun to talk about it. Just trying to emphasize that there is likely no case and no one game we can look at that will definitely tell us that X platform or Y platform is better or worse due to A hardware implementation or B hardware implementation.

At the end of the generation after looking at the body of work? Maybe we can make some generalizations? But talking about it and trying to figure it out is fun. :) Just no-one should take any of the discussion as evidence of a fact.

Oh and back to why I originally posted this. There are no good benchmarks on console. :) If we had performance graphs like we do on the PC version of Gears 5? And we could toggle individual rendering features on and off? Man that would help a lot. But we don't and we likely never will have something that breaks things down to that level on console.

Regards,
SB
 
OK, do you think it would still have an advantage if there were the same TF and the same bandwidth but more CUs (and so a proportionally slower clock)?
I'm going to revise my answer here now, thinking about it.
You should never have more CUs than the bandwidth can support, and vice versa; otherwise either the bandwidth or the CUs will sit idle.

So your hypothetical would never occur. A single CU with a high clock rate could never withdraw more than a couple of GB/s from a larger memory pool, and this is a result of transport and latency. You can only make so many read/write requests per second from each CU.

This is largely the reason why GPUs continually increase in bandwidth (more cores with each generation) while CPUs generally require more or less the same amount of bandwidth (fewer, but much more powerful, cores).

So if you lower the clock rate and increase the CU count to match the computational power, you must increase bandwidth to support the additional CUs; otherwise they technically have nothing to pull.

If you lower the CU count, having more bandwidth won't help even if you vastly increase the clock speed.

There is a bit of wiggle room in this argument, however; there is definitely a point where a marginal CU increase is easily offset by more clock speed. But there is a limit on clock speed at which the returns taper off as you push further.
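A rough sketch of that pairing argument; the per-CU demand figure is an assumption for illustration, not a real spec, and the real ceiling depends on latency and requests in flight as described above.

Code:
PER_CU_GB_S = 14.0   # assumed sustainable memory demand per CU

def usable_bandwidth(cu_count, peak_bw_gb_s):
    # Whatever the memory system offers, the CUs can only generate so much demand.
    return min(peak_bw_gb_s, cu_count * PER_CU_GB_S)

for cus, peak in ((36, 448), (52, 560), (4, 1000)):
    print(f"{cus:>2} CUs with {peak:>4} GB/s peak -> ~{usable_bandwidth(cus, peak):.0f} GB/s actually usable")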
 
Perhaps one final thought on the concept of clock speed and CU count. Perhaps this hypothetical case will make more sense.
If you design a synthetic benchmark which only leverages the fixed-function portions of the hardware and skips the unified shader pipeline entirely,
then if XSX ran this benchmark at 100 fps, PS5 would run it at 123 fps, or 23% faster, as per their clock speed difference.
There is absolutely nothing XSX can do to mitigate this difference, because they have exactly the same FF hardware but PS5 runs it 23% faster.
Which means, in any benchmark in which XSX and PS5 are pretty much identical, XSX essentially made up that deficit on the back half of the frame, where compute and the unified shaders do their work, despite that part also being clocked 23% slower.
And so the CU advantage there is putting in work to make up the clock speed differential twice.
I hope that makes sense.
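A quick worked version of that argument in Python, using the public clocks and CU counts and an assumed split of the frame between fixed-function-bound and shader-bound work.

Code:
XSX_CLOCK, XSX_CUS = 1.825, 52
PS5_CLOCK, PS5_CUS = 2.23, 36

def relative_xsx_frame_time(ff_share):
    # XSX frame time relative to PS5 (= 1.0) when ff_share of the PS5 frame is
    # fixed-function-bound (scales with clock) and the rest is shader-bound
    # (scales with CUs * clock).
    shader_share = 1 - ff_share
    ff_term = ff_share * PS5_CLOCK / XSX_CLOCK
    shader_term = shader_share * (PS5_CUS * PS5_CLOCK) / (XSX_CUS * XSX_CLOCK)
    return ff_term + shader_term

for ff_share in (0.2, 0.4, 0.6):
    print(f"FF-bound {ff_share:.0%} of the frame -> XSX frame time {relative_xsx_frame_time(ff_share):.2f}x PS5's")

With these numbers the break-even point lands around 40% fixed-function-bound, but that is just a consequence of the assumed split, not a claim about any real game.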

Effectively, the larger the back half of the frame is, or the further XSX can get away from the fixed-function pipeline, the more it will leverage its silicon strengths.

I think this typically isn't the case with other GPU families. Larger GPUs often also ship with larger front ends, so that you don't run into this clock speed vs core count scenario. It's more like you get it all, so there is no way the more expensive card will perform worse than a model lower in the family.

Thinking about it this way, XSX is actually the outlier, as its front-end performance and general compute performance are not well matched. Increasing the clock speed would help narrow that particular gap with respect to PS5, but it wouldn't make up for the fact that it's mismatched with respect to itself.

I don't know if MS went this route hoping the transition to mesh shaders would happen sooner, since CUs are what they use to process geometry, and it's not clear whether they hoped developers would skip the 3D pipeline altogether in favour of just using compute shaders to render out pixels instead of the ROPs. It does come across, again, as a cost-cutting measure.
 
Perhaps one final thought on the concept of clock speed and CU count. Perhaps this hypothetical case will make more sense.
If you design a synthetic benchmark which only leverages the fixed-function portions of the hardware and skips the unified shader pipeline entirely,
then if XSX ran this benchmark at 100 fps, PS5 would run it at 123 fps, or 23% faster, as per their clock speed difference.
There is absolutely nothing XSX can do to mitigate this difference, because they have exactly the same FF hardware but PS5 runs it 23% faster.
Which means, in any benchmark in which XSX and PS5 are pretty much identical, XSX essentially made up that deficit on the back half of the frame, where compute and the unified shaders do their work, despite that part also being clocked 23% slower.
And so the CU advantage there is putting in work to make up the clock speed differential twice.
I hope that makes sense.

Effectively, the larger the back half of the frame is, or the further XSX can get away from the fixed-function pipeline, the more it will leverage its silicon strengths.

I think this typically isn't the case with other GPU families. Larger GPUs often also ship with larger front ends, so that you don't run into this clock speed vs core count scenario. It's more like you get it all, so there is no way the more expensive card will perform worse than a model lower in the family.

Thinking about it this way, XSX is actually the outlier, as its front-end performance and general compute performance are not well matched. Increasing the clock speed would help narrow that particular gap with respect to PS5, but it wouldn't make up for the fact that it's mismatched with respect to itself.

I don't know if MS went this route hoping the transition to mesh shaders would happen sooner, since CUs are what they use to process geometry, and it's not clear whether they hoped developers would skip the 3D pipeline altogether in favour of just using compute shaders to render out pixels instead of the ROPs. It does come across, again, as a cost-cutting measure.
And IMHO, if some think the XSX advantage shouldn't be close to its theoretical 20% compute and 25% bandwidth advantage but rather closer to its 44% CU-count advantage, that is simply wrong, but I'll end this theoretical dispute here ;)
 
I don't know if MS went this route hoping the transition to mesh shaders would happen sooner, since CUs are what they use to process geometry, and it's not clear whether they hoped developers would skip the 3D pipeline altogether in favour of just using compute shaders to render out pixels instead of the ROPs. It does come across, again, as a cost-cutting measure.

It's not really "hoping"; things are moving to favor parallelism and wider designs in the GPU space. MS wanted a design that was future-proofed in this respect, and we have benchmarks on mesh shader routines showing incredible degrees of performance uplift over the traditional 3D pipeline process that operates mainly off fixed-function hardware. So perhaps in some way it's a cost-cutting measure but it's no more of one than, say, Sony settling for pseudo-programmable logic with the primitive shaders in their design...and I'd say Microsoft's choices are easily the more future-proofed of the two even in spite of compromises.

Not to say that in practice we will see games leveraging mesh shaders having a tenfold increase over games on the traditional pipeline path, but it's sensible to assume there will be a notable advantage once devs fully move over to that setup and build their pipelines and tools around it. This is hopefully where Sony's alterations (whatever they are) with their primitive shaders are sensible and able to match up with what true mesh shaders provide in the near future.

What we're probably going to see is a clash in 3D design philosophies, at least for the short term: Sony favoring fixed function (and pseudo-programmable graphics with the prim shaders) and Microsoft favoring programmable mesh shaders. However, the advantage seems to be on Microsoft's side here because it's not just the Series systems pushing mesh shaders; AMD GPUs, Nvidia GPUs and Intel GPUs are also pushing in that direction, and collectively that is a market of hundreds of millions of devices, more than whatever numbers PS5 can reach. So 3P devs will certainly start to optimize their pipelines for mesh shaders (though not at the expense of fixed-function hardware), and it's honestly just a matter of when that happens, not "if".

If you want a point of comparison, IMO it's like the quadrilaterals/triangles scenario from the '90s between SEGA and Sony, only this time Sony are on the quadrilaterals side of that scenario and Microsoft are on the triangles side of it. What happened then looks like it will happen again (in terms of the general shift; both companies will be perfectly fine in the long-term in spite of this IMHO and support for the fixed-function 3D pipeline is more ingrained than quads were in the gaming space at their time).
 