Digital Foundry Article Technical Discussion [2023]

So, then, is it also your opinion that Sony got the "balancing act" wrong in this particular case?

Series X pretty much murders PS5 in this game. So did Sony screw up their hardware design?

Or is it possible that games differ, engines differ, and development priorities differ?

No, they didn't get it wrong in this particular case.

And your post is overly aggressive and feels like a response I see on Twitter.
 
Well, it really shouldn't. Due to potentially faster clocks, the PS5 can have faster triangle setup (primitive rate, culling, tessellation, etc), pixel fill rate, texture filtering rate, thread scheduling (front end), and memory subsystem (caches). Also, as we've learned recently: faster Async Compute.

Series X is definitely faster in Compute, Ray Tracing and memory bandwidth. So in the end, they both end up as equals in their own ways.

Texture filtering rate should be a wash, as texture filtering units are per CU/DCU. Series S/X have the same texture filtering rate per CU per clock as the PS5. Likewise, Series S/X have the same amount of L0 and LDS per CU as PS5.

PS5 has more L1 cache per CU in its favour, and about the same total L2 BW, but Series X has more L2 and will have a higher L2 hit rate, and it has more memory BW.

That dev chap said that async was 'generally' faster on PS5, so that's good for PS5, but it would also mean there are cases where PS5 isn't faster (the particular combination of sync and async workloads will likely be important).

But yeah, overall results are very close. I guess that makes outliers all the more interesting!
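
To put rough numbers on the clock-scaling point above (pixel fill rate as one example): a minimal sketch, assuming the commonly cited 64 ROPs on both consoles and the published clocks, and paper figures only.

```python
# Back-of-envelope: fixed-function rates scale with clock when the
# per-clock throughput is the same. Assumes the commonly cited 64 ROPs
# on both consoles and the published clocks (peak/paper figures only).
PS5_CLOCK_GHZ = 2.23   # variable clock, upper bound
XSX_CLOCK_GHZ = 1.825  # fixed clock
ROPS = 64              # assumed identical on both parts

for name, clock in [("PS5", PS5_CLOCK_GHZ), ("XSX", XSX_CLOCK_GHZ)]:
    pixel_fill = ROPS * clock  # pixels/clock * GHz = Gpixels/s
    print(f"{name}: {pixel_fill:.1f} Gpixels/s peak pixel fill")
# PS5: 142.7 Gpixels/s, XSX: 116.8 Gpixels/s
```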
 
If we take CU redundancy out of the equation:

448 / 40 = 11.2 GB/s per CU
560 / 56 = 10 GB/s per CU

It’s more or less the same.

And if you take clock speeds into account, while also factoring in redundancy:

448 / (36 * 2.3) = 5.4 GB/s per CU per GHz
560 / (52 * 1.8) = 6.0 GB/s per CU per GHz

Series X has a distinct, though not enormous, bandwidth per compute unit per cycle advantage. The Series X's two memory ranges will probably lower this a little in practice, but it does seem that the Series X is well provisioned in terms of BW.

Some more ROPs for the Series X might have been nice. It does seem pretty decent at doing transparencies, but I'm guessing that in depth-only passes or shadow passes the memory bus is left twiddling its thumbs.
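
The same arithmetic as a quick script, using the rounded clocks from above (back-of-envelope only):

```python
# Bandwidth per CU, with and without the clock factor, using the rounded
# clocks from the post above (2.3 GHz PS5, 1.8 GHz XSX).
configs = {
    "PS5": {"bw_gbs": 448, "cus": 36, "clock_ghz": 2.3},
    "XSX": {"bw_gbs": 560, "cus": 52, "clock_ghz": 1.8},
}

for name, c in configs.items():
    per_cu = c["bw_gbs"] / c["cus"]
    per_cu_per_ghz = c["bw_gbs"] / (c["cus"] * c["clock_ghz"])
    print(f"{name}: {per_cu:.1f} GB/s per CU, "
          f"{per_cu_per_ghz:.1f} GB/s per CU per GHz")
# PS5: 12.4 GB/s per CU, 5.4 GB/s per CU per GHz
# XSX: 10.8 GB/s per CU, 6.0 GB/s per CU per GHz
```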
 
If we take CU redundancy out of the equation:

448 / 40 = 11.2 GB/s per CU
560 / 56 = 10 GB/s per CU

It’s more or less the same.

They're not though are they?

PS5 has 36 CUs, so using 40 is incorrect and gives an incorrect value.

PS5 is 12.4GB/s per CU

XSX is 10.7GB/s per CU

That's nearly a 20% advantage in bandwidth per CU in favour of PS5.

~20% is not 'more or less the same'
 
They're not though are they?

PS5 has 36 CUs, so using 40 is incorrect and gives an incorrect value.

PS5 is 12.4GB/s per CU

XSX is 10.7GB/s per CU

That's nearly a 20% advantage in bandwidth per CU in favour of PS5.

~20% is not 'more or less the same'
The original design for a 5700 is 40 CUs at 448 GB/s. In a discussion about ratios and having enough, this is the original planned ratio. They only removed CUs for redundancy. Having enormous bandwidth per CU doesn't necessarily increase performance either. It's not a metric that we've ever used to determine performance for a GPU. If that was a heavily weighted metric for performance we would have maximized that number long ago. The only time I've ever seen bandwidth per CU come up is in console discussion, and in particular only this generation; it didn't show up last generation.

I’m fairly positive 4090 is 1TF of bandwidth for 128SM units or a ratio of 7.8.

The number is not meaningful.
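
For scale, here's that ratio next to the console figures, using the 4090's commonly quoted 1008 GB/s and 128 SMs (CUs and SMs aren't equivalent units, so treat this as illustrative only):

```python
# Bandwidth per compute block for a few parts, to show how widely the
# ratio varies. 4090 figures are the commonly quoted specs; CUs and SMs
# are not equivalent units, so this is only illustrative.
gpus = {
    "RTX 4090": (1008, 128),  # GB/s, SM count
    "5700 XT":  (448, 40),    # GB/s, CU count
    "PS5":      (448, 36),
    "XSX":      (560, 52),
}
for name, (bw_gbs, units) in gpus.items():
    print(f"{name}: {bw_gbs / units:.1f} GB/s per CU/SM")
# RTX 4090: 7.9, 5700 XT: 11.2, PS5: 12.4, XSX: 10.8
```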
 
They're not though are they?

PS5 has 36 CUs, so using 40 is incorrect and gives an incorrect value.

PS5 is 12.4GB/s per CU

XSX is 10.7GB/s per CU

That's nearly a 20% advantage in bandwidth per CU in favour of PS5.

~20% is not 'more or less the same'

"BW per CU" in itself doesn't actually mean anything, because the CUs are running at different speeds.

The rate at which a compute unit will consume BW will vary more or less in line with clockspeed for a given task on a given GPU.

SX CUs run slower, so they require less BW per CU than PS5 CUs to avoid being BW starved. It's like with ROPs, or TMUs, or anything else. You have to take speed into account because it determines the rate of data input and output.

Adjusted for clockspeeds, Series X CUs are actually better provided for than PS5's if you look just at GPU-optimal memory - to the tune of 10%+. Use of the 'slow' 192-bit range of memory for some less BW-demanding tasks may result in some reduction in ability to feed the GPU, but unless something is going very wrong on SX it's going to compare comparably or favourably with the PS5 in this particular regard.
 
The original design for a 5700 is 40 CUs at 448 GB/s.
Irrelevant
In a discussion about ratios and having enough, this is the original planned ratio.
For RDNA1
They only removed CUs for redundancy.
I know
Having enormous bandwidth per CU doesn’t necessarily increase performance either.
Proof?
It’s not a metric that we’ve ever used to determine performance for a GPU.
Doesn't mean it doesn't have an effect.
If that was a heavily weighted metric for performance we would have maximized that number long ago.
Doesn't mean it's not a metric that does affect performance.
The only time I've ever seen bandwidth per CU come up is in console discussion, and in particular only this generation; it didn't show up last generation.
Again, and?
I’m fairly positive 4090 is 1TF of bandwidth for 128SM units or a ratio of 7.8.
TF?
The number is not meaningful.
To you.
 
"BW per CU" in itself doesn't actually mean anything, because the CUs are running at different speeds.

The rate at which a compute unit will consume BW will vary more or less in line with clockspeed for a given task on a given GPU.

SX CUs run slower, so they require less BW per CU than PS5 CUs to avoid being BW starved. It's like with ROPs, or TMUs, or anything else. You have to take speed into account because it determines the rate of data input and output.

Adjusted for clockspeeds, Series X CUs are actually better provided for than PS5's if you look just at GPU-optimal memory - to the tune of 10%+. Use of the 'slow' 192-bit range of memory for some less BW-demanding tasks may result in some reduction in ability to feed the GPU, but unless something is going very wrong on SX it's going to compare comparably or favourably with the PS5 in this particular regard.

PS5 has ~20% more BW per CU but is also clocked *up to* 17% higher.

So what BW advantage does XSX have with its lower clocks, as PS5's higher bandwidth per CU nullifies that?

But, back to talking about DF videos.
 
Irrelevant

For RDNA1

I know

Proof?

Doesn't mean it doesn't have an effect.

Doesn't mean it's not a metric that does affect performance.

Again, and?

TF?

To you.
  • It's a relevant number because you said XSX has the absolute worst bandwidth per CU metric of any GPU - and so the 5700 XT should suffice as a comparison.
  • You can build a GPU with a single CU and a huge bandwidth-per-CU metric. That's not going to net you more performance.
  • Under your interpretation of this metric, a 5700 XT with 36 CUs would outperform a 5700 XT with 40 CUs - and I chose this example because the PS5 is a custom RDNA 1 design with RDNA 2 features.
  • The only time bandwidth per CU is a meaningful metric is when the CUs are bandwidth starved and waiting around for work to do because there isn't enough bandwidth. Any bandwidth beyond the point where that bottleneck is removed will not generate additional performance (see the roofline-style sketch after this list).
  • And because this particular bandwidth-per-CU figure only comes up in discussions about PS5 vs XSX, and in no other GPU discussions, I think that speaks to the lack of credibility of this metric; its main purpose is just to point at something PS5 has more of than XSX. This is about as significant as when people said that PS4 has 400% more ACE units than XB1. That really didn't lead to anything at all.
  • 1 TB/s - my mistake
  • If you can showcase in a graph how bandwidth per CU/SM increases performance beyond the point where there is already sufficient bandwidth to not bottleneck the compute, I'm more than happy to concede.
  • As per this question here: https://forums.developer.nvidia.com/t/memory-bandwidth-in-terms-of-sm-number/174020
    • You cannot make a single SM take full advantage of the bandwidth (say 1 TB/s), because no single SM is capable of saturating 1 TB/s of bandwidth.
    • This also implies an upper limit on how much bandwidth each CU can pull.
    • If the kernel you write can already fully saturate the bandwidth provided by the memory subsystem, having even more bandwidth than that will not result in additional performance.
    • Thus, back to the original point - 560 GB/s is paired with 56 CUs and 448 GB/s with 40. Reducing the number of CUs will not net any additional performance just because there is suddenly more bandwidth per CU. In order to use more bandwidth you need more CUs, not fewer.
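
The roofline-style sketch mentioned above, with purely hypothetical numbers, just to show the shape of the argument (once a kernel stops being bandwidth bound, extra GB/s buys nothing):

```python
# Roofline-style model: attainable throughput is capped by whichever is
# lower, raw compute or bandwidth * arithmetic intensity. All numbers
# below are hypothetical, purely to show the shape of the curve.
def attainable_tflops(peak_tflops, bw_gbs, flop_per_byte):
    bw_limited_tflops = bw_gbs * flop_per_byte / 1000.0
    return min(peak_tflops, bw_limited_tflops)

PEAK_TF = 12.0      # hypothetical GPU compute peak
INTENSITY = 30.0    # hypothetical kernel: 30 FLOP per byte of DRAM traffic

for bw in (300, 400, 500, 600, 700):
    print(f"{bw} GB/s -> {attainable_tflops(PEAK_TF, bw, INTENSITY):.1f} TFLOP/s")
# 300 GB/s -> 9.0, then 400+ GB/s all land on 12.0: once the kernel is
# compute bound, adding bandwidth (or raising GB/s per CU) changes nothing.
```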
 
Why so? XSX has more unified shader units and texture units. Doesn't that make it at least on par?
but Series X has more L2 and will have a higher L2 hit rate, and it has more memory BW .. Texture filtering rate should be a wash
Texture filtering (using FP16 precision) on AMD GPUs is often half the texture fill rate, and filtering also requires far more memory bandwidth and more cache usage. The Series X has 44% more CUs/TMUs than PS5 (208 TMUs vs 144 TMUs), however it only has 25% more L2 cache (5 MB vs 4 MB), its L1 cache per CU is smaller, and memory bandwidth is also split on Series X, with a 560 GB/s portion and a 336 GB/s portion, while the PS5 enjoys a fixed 448 GB/s at all times.

So in practice, while texture fill rate on Series X is a huge 380 GT/s vs the PS5's 320 GT/s (a difference of 60 GT/s), the texture filtering rate stands at 190 GT/s vs 160 GT/s (a meager 30 GT/s difference), easily compensated for by the higher-clocked caches on the PS5, the larger L1 per CU, and the more consistent memory bandwidth.

Keep in mind that even the 160 GT/s filter rate is a crazily high number for consoles, one that is never reached in practice. Even if we assume a console game running at native 4K120 (which never happened), we only need about 1 billion texels/s for filtering; multiply that by 8 (using 8x AF, which is rarely used on consoles) and you get about 8 billion texels/s, still far from the 160 GT/s rate. You would become bottlenecked by memory bandwidth, ROPs, or compute far sooner than by the filtering rate.

Of course, other advanced texture techniques (such as render-to-texture and cubemaps) need more texture fill rate, but the available fill rate in modern GPUs is still often way more than enough. That's why I believe the PS5 can filter more textures than Series X (even if by a small amount), or at the very least, like you said, is not at a deficit compared to Series X.
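
The 4K120 back-of-envelope, written out (assuming one texture fetch per pixel per frame, as above):

```python
# The 4K120 filtering example, written out. Assumes one texture fetch
# per pixel per frame (the simplification used above).
width, height, fps = 3840, 2160, 120
af_taps = 8  # 8x anisotropic filtering, rarely used on consoles

base = width * height * fps        # texels per second, ~1.0 billion
with_af = base * af_taps           # ~8.0 billion
print(f"{base / 1e9:.2f} GTexel/s base, {with_af / 1e9:.2f} GTexel/s with 8x AF")
# ~1.0 and ~8.0 GTexel/s -- a long way below the 160 GT/s (PS5) and
# 190 GT/s (XSX) theoretical FP16 filtering rates quoted above.
```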
 
PS5 has ~20% more BW per CU but is also clocked *up to* 17% higher.

So what BW advantage does XSX have with its lower clocks, as PS5's higher bandwidth per CU nullifies that?

Bandwidth is a rate - it's data over time. A CU count on its own doesn't have any element of time - that's why you need to take into account the relative speed of the CUs. It adds in the element of time (things per second) so you can compare the relative bandwidth provision.

"BW per CU" means essentially nothing, but "BW per (CU * clockspeed)" allows you to make at least some kind of rough comparison of how each CU in the system can be supported by the memory bus (L1 and L2 cache differences aside).

I'll try and use your style of GB/s per CU figures to highlight this, but add in relative clockspeed to provide a (somewhat) meaningful comparison of processing and bandwidth.

(448 / 36) / (560 / 52) = 12.44 / 10.77 = 1.16 [PS5 has 16% more "BW per CU"]

2230 MHz / 1825 MHz = 1.22 [but each PS5 CU will be running up to 22% faster]

In other words, when only considering GPU-optimal RAM on the SX (to keep things simple), the SX is slightly better able to feed its CUs on paper. It's not a huge difference though.

Another, simpler way of doing essentially the same comparison is (GB/s) / (TF/s). Time (s) cancels out and you get a straightforward ratio. It's about 46.1 for SX and 43.6 for PS5.

These are just theoretical numbers and all kinds of stuff could make a difference here or there, but it roughly shows us that in this one area, SX is about the same or maybe a little better than the PS5. And that probably shouldn't be too surprising given they're basically the same architecture and those CUs are going to need very similar support from main RAM.
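
The same sums as a script, for anyone who wants to check (FP32 TF taken as CUs × 64 lanes × 2 ops per clock × clock, using the published clocks):

```python
# GB/s per TFLOP/s for both consoles, with FP32 TF computed as
# CUs * 64 lanes * 2 FLOP/cycle * clock (published clocks).
consoles = {
    "PS5": {"bw_gbs": 448, "cus": 36, "clock_ghz": 2.23},
    "XSX": {"bw_gbs": 560, "cus": 52, "clock_ghz": 1.825},
}
for name, c in consoles.items():
    tflops = c["cus"] * 64 * 2 * c["clock_ghz"] / 1000.0
    print(f"{name}: {tflops:.2f} TF, {c['bw_gbs'] / tflops:.1f} GB/s per TF")
# PS5: 10.28 TF, ~43.6 GB/s per TF
# XSX: 12.15 TF, ~46.1 GB/s per TF
```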
 
The original design for a 5700 is 40 CUs at 448 GB/s. In a discussion about ratios and having enough, this is the original planned ratio. They only removed CUs for redundancy. Having enormous bandwidth per CU doesn't necessarily increase performance either. It's not a metric that we've ever used to determine performance for a GPU. If that was a heavily weighted metric for performance we would have maximized that number long ago. The only time I've ever seen bandwidth per CU come up is in console discussion, and in particular only this generation; it didn't show up last generation.

I’m fairly positive 4090 is 1TF of bandwidth for 128SM units or a ratio of 7.8.

The number is not meaningful.

I doubt that. Shader count increases have mandated an increase in VRAM bandwidth. While bandwidth/shader count is not a common figure we regularly discuss, that doesn't mean it doesn't have a performance impact.

You probably can't argue much with the ratio alone because the GPU caches play a role in memory performance too.

The 4090 has a low ratio, but the 4000 series saw a massive increase in on-chip memory (72 MB versus 6 MB for the 3090), and so have AMD GPUs since the 6000 series.

However, I think we as a forum should explore this topic more readily because it may represent a performance impact that we have ignored. We have seen huge leaps in theoretical TFLOPS performance, but the GDDR bandwidth increases have been minuscule in comparison. I wonder if the cache increases are enough to compensate.
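
As a rough illustration of that divergence, using commonly quoted 3090/4090 specs (approximate figures; the L2 sizes are the ones given above):

```python
# Compute has grown far faster than GDDR bandwidth. Commonly quoted specs
# for the 3090 and 4090 (approximate); L2 sizes as per the post above.
gpus = {
    "RTX 3090": {"tflops": 35.6, "bw_gbs": 936,  "l2_mb": 6},
    "RTX 4090": {"tflops": 82.6, "bw_gbs": 1008, "l2_mb": 72},
}
for name, g in gpus.items():
    print(f"{name}: {g['bw_gbs'] / g['tflops']:.1f} GB/s per TF, {g['l2_mb']} MB L2")
# RTX 3090: ~26.3 GB/s per TF, 6 MB L2
# RTX 4090: ~12.2 GB/s per TF, 72 MB L2 -- the DRAM ratio roughly halved
# while on-chip cache grew 12x.
```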
 
I doubt that. Shader count increases have mandated an increase in VRAM bandwidth. While bandwidth/shader count is not a common figure we regularly discuss, that doesn't mean it doesn't have a performance impact.

You probably can't argue much with the ratio alone because the GPU caches play a role in memory performance too.

The 4090 has a low ratio, but the 4000 series saw a massive increase in on-chip memory (72 MB versus 6 MB for the 3090), and so have AMD GPUs since the 6000 series.

However, I think we as a forum should explore this topic more readily because it may represent a performance impact that we have ignored. We have seen huge leaps in theoretical TFLOPS performance, but the GDDR bandwidth increases have been minuscule in comparison. I wonder if the cache increases are enough to compensate.
I absolutely agree with all your points, but because of how one can manipulate bandwidth per CU, you don't get any sort of reading here of how it affects the overall outcome. Therefore I don't see value in the metric. Having fewer CUs with massive bandwidth will spike this particular metric, yet performance will be worse than the same bandwidth paired with more CUs, closer to an appropriate ratio.

Caches are the major factor for performance here; more cache is likely leading to significantly more performance even though the VRAM ratio is progressively getting smaller.

The TLDR for me:
If one cannot graph bandwidth per CU against performance and find a correlation indicating that moving up or down the ratio leads to more or less performance, then the metric is worthless.
 
  • It's a relevant number because you said XSX has the absolute worst bandwidth per CU metric of any GPU - and so the 5700 XT should suffice as a comparison.
  • You can build a GPU with a single CU and a huge bandwidth-per-CU metric. That's not going to net you more performance.
  • Under your interpretation of this metric, a 5700 XT with 36 CUs would outperform a 5700 XT with 40 CUs - and I chose this example because the PS5 is a custom RDNA 1 design with RDNA 2 features.
  • The only time bandwidth per CU is a meaningful metric is when the CUs are bandwidth starved and waiting around for work to do because there isn't enough bandwidth. Any bandwidth beyond the point where that bottleneck is removed will not generate additional performance.
  • And because this particular bandwidth-per-CU figure only comes up in discussions about PS5 vs XSX, and in no other GPU discussions, I think that speaks to the lack of credibility of this metric; its main purpose is just to point at something PS5 has more of than XSX. This is about as significant as when people said that PS4 has 400% more ACE units than XB1. That really didn't lead to anything at all.
  • 1 TB/s - my mistake
  • If you can showcase in a graph how bandwidth per CU/SM increases performance beyond the point where there is already sufficient bandwidth to not bottleneck the compute, I'm more than happy to concede.
  • As per this question here: https://forums.developer.nvidia.com/t/memory-bandwidth-in-terms-of-sm-number/174020
    • You cannot make a single SM take full advantage of the bandwidth (say 1 TB/s), because no single SM is capable of saturating 1 TB/s of bandwidth.
    • This also implies an upper limit on how much bandwidth each CU can pull.
    • If the kernel you write can already fully saturate the bandwidth provided by the memory subsystem, having even more bandwidth than that will not result in additional performance.
    • Thus, back to the original point - 560 GB/s is paired with 56 CUs and 448 GB/s with 40. Reducing the number of CUs will not net any additional performance just because there is suddenly more bandwidth per CU. In order to use more bandwidth you need more CUs, not fewer.
Here is a summary:
PS5 went narrower and faster, whereas Series X went wider and slower. Cerny's approach was that it is easier to fill fewer CUs but utilise them better at higher clocks, whereas Series X went wider for more computational power.
Both approaches have their pros and cons. The Series X's CUs are probably less well utilised, which is why the PS5 manages good parity, but we will see how things look if games ever end up requiring more CUs.
 
I wouldn't have thought there'd be any trouble filling more CUs. GPUs are inherently parallel and should just use the whole CU offering - how do you prevent a workload from actually being distributed across all 52 CUs on XBSX?
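
For a sense of scale, assuming RDNA-style 64-lane wavefronts and a plain one-thread-per-pixel full-screen pass (illustrative only):

```python
# Rough sense of scale for "filling" 52 CUs with a full-screen pass.
# Assumes RDNA-style 64-lane wavefronts and one thread per pixel.
width, height = 3840, 2160
threads = width * height              # ~8.3 million threads at 4K
wavefronts = threads // 64            # ~129,600 wavefronts
cus = 52
print(f"{wavefronts} wavefronts -> ~{wavefronts // cus} per CU")
# ~129,600 wavefronts, roughly 2,500 per CU: occupancy is rarely the
# problem for this kind of workload; keeping the CUs fed is.
```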
 