Digital Foundry Article Technical Discussion [2021]

I guess that's for OG Xbox games or 360 games, right? I was only thinking in terms of last gen and next gen, but it does make sense. It certainly makes that level of emulation much easier to achieve.

I still hear about Sony choosing their CU count and preconfigured clocks to make PS4 and Pro titles work on PS5, so I guess it shouldn't be surprising.

They only had 7.x cores to use last-gen, so it applies to Xbox One games too.
 
All these GPUs suffered from the 4-cycle instruction issue of the GCN architecture.
The reason clock speed helped GCN is that a faster clock churns through those idle cycles sooner, so instructions get issued more quickly.

https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf

See slides 6 and 7.

RDNA resolves some of the largest issues with scaling to more CUs.
Increasing clock speed buys more performance up to the point where data can no longer be made available fast enough for the work to be performed, and you will hit physical limits of power draw. Going wider is the natural route to increasing performance per watt.

Stating the reason for the issue doesn't make the issue or the poor CU scaling of those cards go away, or make my comment about them having poor CU scaling any less true.

RDNA2 is better, but it's still around 72% scaling on average with additional CUs, so there's still room for improvement. Although this is scaling on the desktop parts, which have a larger memory bandwidth difference between them than PS5 and XSX do, so the consoles' CU scaling is likely lower than on the PC parts.

Clock speed scaling vs CU scaling is likely the reason Sony chose the narrow/fast option they did.
 
I guess that's for OG Xbox games or 360 games, right? I was only thinking in terms of last gen and next gen, but it does make sense. It certainly makes that level of emulation much easier to achieve.

I still hear about Sony choosing their CU count and preconfigured clocks to make PS4 and Pro titles work on PS5, so I guess it shouldn't be surprising.

This is how things pan out.
  • Xbox - 1 CPU core capable of executing 1 thread at a time.
    • Can only execute 1 CPU thread at a time.
  • X360 - 3 CPU cores capable of executing 2 threads at a time.
    • Can execute 6 CPU threads simultaneously.
  • XBO - 8 CPU cores capable of executing 1 thread at a time.
    • Can execute 8 CPU threads simultaneously.
  • XBS consoles - 8 CPU cores capable of executing 2 threads at a time.
    • SMT off - Can execute 8 CPU threads simultaneously
    • SMT on - Can execute 16 CPU threads simultaneously
Basically running any previous generation of Xbox game doesn't benefit from SMT being on with XBS consoles.

BC mode with SMT off can therefore use the extra CPU cycles from the higher clock speed to potentially further enhance a BC title.

Regards,
SB
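Purely as an illustration of the thread arithmetic in the list above (a trivial sketch, not anything from the post itself):

```python
# Hardware threads = cores x threads-per-core, per the list above.
consoles = {
    "Xbox":          (1, 1),  # cores, threads per core
    "Xbox 360":      (3, 2),
    "Xbox One":      (8, 1),
    "XBS (SMT off)": (8, 1),
    "XBS (SMT on)":  (8, 2),
}
for name, (cores, smt_width) in consoles.items():
    print(f"{name:14s} {cores * smt_width:2d} hardware threads")
# Any prior-gen Xbox title needs at most 8 hardware threads,
# so it already fits on an XBS console with SMT off.
```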
 
Stating the reason for the issue doesn't make the issue or the poor CU scaling of those cards go away, or make my comment about them having poor CU scaling any less true.

RDNA2 is better, but it's still around 72% scaling on average with additional CUs, so there's still room for improvement. Although this is scaling on the desktop parts, which have a larger memory bandwidth difference between them than PS5 and XSX do, so the consoles' CU scaling is likely lower than on the PC parts.

Clock speed scaling vs CU scaling is likely the reason Sony chose the narrow/fast option they did.
You don't get 100% efficiency from clock speed increases either, as there are obvious diminishing returns on clock speed, and going wide has saturation issues.

CU saturation will depend on the title here. If you rely heavily on the 3D pipeline, then CU saturation will diminish and CU scaling is, as you say, not the greatest.
If you rely heavily on the compute pipeline then the opposite would occur.

The real question is where you're sampling from. Game engines are still largely in transition, despite having had compute shaders since 2011.
Games like Doom Eternal perform very well on AMD hardware as a function of CU scaling.

If you compare the 5700XT's 40 CUs vs the 6800XT's 80 CUs (edit: sorry, 72 CUs), you can see it's nearly perfect scaling; most certainly better than 72%.
AMD Radeon RX 6800 XT Review - NVIDIA is in Trouble - DOOM Eternal | TechPowerUp

nearly 82% at 1080p
nearly 91% at 1440p
nearly 95% at 4K

At 4K they nearly achieved perfect scaling.
Fixed function hardware is still playing a role here of course, but in games that are entirely software rasterization, the CU scaling will be very good comparatively.

Compared to the 6900XT, which is 80 CUs:
AMD Radeon RX 6900 XT Review - The Biggest Big Navi - DOOM Eternal | TechPowerUp
344.4fps vs 171 fps @ 1080p -- 101%
293 fps vs 131 fps @ 1440p -- 123%
169 fps vs 74.6 fps @ 4K -- 125%

I think scaling is fine here if the title is coded to do it.
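For reference, a minimal sketch of how those percentages fall out of the quoted TechPowerUp numbers, assuming the figure is the FPS gain measured against the ideal gain implied by the CU counts:

```python
# Scaling efficiency = (fps gain) / (ideal gain implied by the CU-count increase).
# FPS figures are the 5700XT vs 6900XT numbers quoted above (40 CUs vs 80 CUs).
def scaling_efficiency(fps_small, fps_big, cus_small, cus_big):
    actual_gain = fps_big / fps_small - 1.0   # e.g. 344.4 / 171 - 1 ~= 1.01
    ideal_gain = cus_big / cus_small - 1.0    # 80 / 40 - 1 = 1.0, i.e. +100%
    return actual_gain / ideal_gain

for res, fps_5700xt, fps_6900xt in [("1080p", 171.0, 344.4),
                                    ("1440p", 131.0, 293.0),
                                    ("4K",     74.6, 169.0)]:
    print(res, f"{scaling_efficiency(fps_5700xt, fps_6900xt, 40, 80):.0%}")
# Prints roughly 101%, 124%, 127% -- in line with the 101% / 123% / 125% quoted above.
```

Anything over 100% means the larger card gained more than its extra CUs alone would predict, which usually points to other differences between the cards (clocks or bandwidth) also contributing.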
 
If you compare the 5700XT's 40 CUs vs the 6800XT's 80 CUs (edit: sorry, 72 CUs), you can see it's nearly perfect scaling; most certainly better than 72%.
AMD Radeon RX 6800 XT Review - NVIDIA is in Trouble - DOOM Eternal | TechPowerUp

nearly 82% at 1080p
nearly 91% at 1440p
nearly 95% at 4K

At 4K they nearly achieved perfect scaling.
Fixed function hardware is still playing a role here of course, but in games that are entirely software rasterization, the CU scaling will be very good comparatively.

Compared to the 6900XT, which is 80 CUs:
AMD Radeon RX 6900 XT Review - The Biggest Big Navi - DOOM Eternal | TechPowerUp
344.4fps vs 171 fps @ 1080p -- 101%
293 fps vs 131 fps @ 1440p -- 123%
169 fps vs 74.6 fps @ 4K -- 125%

I think scaling is fine here if the title is coded to do it.

Consoles don't have RDNA1 GPUs, so comparing scaling between a 5700XT and a 6800XT is not relevant.

When comparing CU scaling between RDNA2 GPUs, it's 72% on average, much lower than your RDNA1 vs RDNA2 figures.
 
Consoles don't have RDNA1 GPUs, so comparing scaling between a 5700XT and a 6800XT is not relevant.

When comparing CU scaling between RDNA2 GPUs, it's 72% on average, much lower than your RDNA1 vs RDNA2 figures.
There's no difference between RDNA 1 and RDNA 2 CUs except for the RBEs, the RT units and support for the DX12U feature set; performance-wise everything else stays the same.
Performance per watt improved, but if you're running the same clock speeds it doesn't matter. They're equivalent unless you're using DX12U features.

But as per my original post, I would disagree. Compare the benchmarks between the 6800XT and the 6900XT:
the 6800XT has 72 CUs, the 6900XT has 80 CUs, which is an 11% difference, so perfect scaling should suggest an 11% improvement in Doom Eternal.

The frame rate differences are 12%, 17% and 18% better on the 6900XT at the respective resolutions.
So once again we are seeing perfect scaling.
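The expected-vs-observed arithmetic behind that, as a quick sketch (using only the numbers already quoted above):

```python
# Ideal gain from CU count alone: 80 CUs vs 72 CUs.
ideal_gain = 80 / 72 - 1                                # ~0.111, i.e. ~11%
observed = {"1080p": 0.12, "1440p": 0.17, "4K": 0.18}   # gains quoted above
for res, gain in observed.items():
    print(res, f"observed {gain:.0%} vs ideal {ideal_gain:.0%}")
```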

What you're calling out as a 72% scaling figure really comes from selecting titles that are built entirely around the 3D pipeline, where in general (AMD and NV) command processors have a harder time fully saturating the CUs/SMs because of the unified shader pipeline. Games that are extremely heavy on compute shaders will not share the same behaviour.
 
But as per my original post, I would disagree. Compare the benchmarks between the 6800XT and the 6900XT:
the 6800XT has 72 CUs, the 6900XT has 80 CUs, which is an 11% difference, so perfect scaling should suggest an 11% improvement in Doom Eternal.

The frame rate differences are 12%, 17% and 18% better on the 6900XT at the respective resolutions.
So once again we are seeing perfect scaling.

computerbase.de did an article on RDNA2 comparing it on a per-clock basis to RDNA1 (RDNA1 is slightly faster per clock), and also CU scaling between 40, 60, 72 and 80 CUs at the same clock speed (important to show pure CU scaling). Their results showed CU scaling was 72% on average.

In their testing, Doom Eternal showed 8% scaling going from a 72 CU RDNA2 GPU at 2GHz to 80 CUs at 2GHz.

Moving from 40 CUs at 2GHz to 80 CUs at 2GHz in Doom Eternal resulted in a 70% performance improvement.
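Expressed with the same efficiency definition as earlier in the thread (a sketch of the arithmetic, using only the figures quoted in this post):

```python
# Doubling CUs (40 -> 80) at a fixed 2 GHz would ideally double performance.
ideal_gain = 80 / 40 - 1     # 1.0, i.e. +100%
observed_gain = 0.70         # +70% in Doom Eternal per the computerbase test
print(f"CU scaling efficiency: {observed_gain / ideal_gain:.0%}")   # 70%
```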
 
computerbase.de did an article on RDNA2 comparing it on a per-clock basis to RDNA1 (RDNA1 is slightly faster per clock), and also CU scaling between 40, 60, 72 and 80 CUs at the same clock speed (important to show pure CU scaling). Their results showed CU scaling was 72% on average.
Yeah, I'm well aware. So which game titles did they use?
 
computerbase.de did an article on RDNA2 comparing it on a per-clock basis to RDNA1 (RDNA1 is slightly faster per clock), and also CU scaling between 40, 60, 72 and 80 CUs at the same clock speed (important to show pure CU scaling). Their results showed CU scaling was 72% on average.

In their testing, Doom Eternal showed 8% scaling going from a 72 CU RDNA2 GPU at 2GHz to 80 CUs at 2GHz.

Moving from 40 CUs at 2GHz to 80 CUs at 2GHz in Doom Eternal resulted in a 70% performance improvement.
You can't lock clocks and do comparisons like that, since the 6000 series relies on higher clock speed for higher bandwidth on the Infinity Cache, which is needed to feed the CUs.
That's my only issue with what they did. Since clock speed is now part of the memory subsystem, by locking the clocks they've starved the CUs without compensating for it.

If they want to normalize for clocks and CUs, they need to do it in post. But that information doesn't exist: this was the only way they could do it, which doesn't necessarily mean it's accurate; at least, not with how these particular AMD cards are set up.
 
You can't lock clocks and do comparisons like that, since the 6000 series relies on higher clock speed for higher bandwidth on the Infinity Cache, which is needed to feed the CUs.
That's my only issue with what they did. Since clock speed is now part of the memory subsystem, by locking the clocks they've starved the CUs without compensating for it.

If they want to normalize for clocks and CUs, they need to do it in post. But that information doesn't exist: this was the only way they could do it, which doesn't necessarily mean it's accurate; at least, not with how these particular AMD cards are set up.

It wasn't a perfect test, no, but it's the best we have (unless you know of a better article?) and is likely close enough to reality to be useful when discussing CU scaling for RDNA2-based GPUs.

It's also worth remembering that the lower CU cards in this test also have a smaller memory bus, so they have less bandwidth than the 60+ CU cards. Bandwidth was not equalized between the cards.

The IC also does very little at 1080p, so running at 2GHz wouldn't be an issue; 1080p shows roughly the same CU scaling as 4K does.
 
It wasn't a perfect test, no, but it's the best we have (unless you know of a better article?) and is likely close enough to reality to be useful when discussing CU scaling for RDNA2-based GPUs.

It's also worth remembering that the lower CU cards also have a smaller memory bus, so they have less bandwidth than the 60+ CU cards. Bandwidth was not equalized between the cards.

The IC also does very little at 1080p, so running at 2GHz wouldn't be an issue; 1080p shows roughly the same CU scaling as 4K does.
There won't be a perfect test, and I don't rule out that the number is somewhat useful for talking about it. But the reason I made the point to begin with is that, technically speaking, more CUs are not the issue.

For the tests themselves, they don't need equalized bandwidth between cards; we want equalized bandwidth per CU.
The 6700XT has 1.5TB/s of Infinity Cache bandwidth for 40 CUs.
The 6800XT and 6900XT have 2.0TB/s for 72 and 80 CUs respectively.
There's no comparison as to which card's CUs are being supplied better.

Without the cache, they only have 512GB/s of bandwidth. That's actually slower than the XSX with 560GB/s.
Whereas the 6700XT is 384GB/s plus 1.5TB/s of cache, compared to a PS5 which has 448GB/s with 36 CUs; at least the differential there is around 60GB/s. They would actually need over 768GB/s of bandwidth plus 1.5TB/s of cache to double the memory system of a 6700XT when doubling the compute power. The 6800/6900 series, with the cache removed, is pretty pitiful in bandwidth per CU. So pitiful it's hard to believe; the PS5 has roughly 2x the bandwidth per CU looking at off-chip only.

The point of my post was to showcase that CU scaling works and can work perfectly provided the data can be fed to the CUs properly, which is a function of both programming and of course consumption of available bandwidth.

Performance doesn't get worse just because you added in more CUs. They just aren't fed well because the cost of bandwidth is significantly higher than the cost of adding more ALUs. It's not an architectural thing, it's a cost thing.

We have more stages of cache in the SoC to try to mitigate the need to hit main memory because it's cheaper to do that than it is to build more and more bandwidth off chip.

Of the group, if we remove the cache entirely, PS5 still has the most off-chip bandwidth per CU, followed by XSX, then the 6700XT, 6800 and 6900 respectively.
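A quick sketch of that ordering using off-chip bandwidth only; the bandwidth figures are the ones mentioned in this thread, while the CU counts for the XSX (52) and the 6800 (60) are publicly listed specs I'm assuming here:

```python
# Off-chip (GDDR6) bandwidth per CU, ignoring Infinity Cache entirely.
gpus = {                      # (GB/s, active CUs)
    "PS5":       (448, 36),
    "XSX":       (560, 52),
    "RX 6700XT": (384, 40),
    "RX 6800":   (512, 60),
    "RX 6800XT": (512, 72),
    "RX 6900XT": (512, 80),
}
for name, (bw, cus) in sorted(gpus.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"{name:10s} {bw / cus:5.1f} GB/s per CU")
# PS5 ~12.4, XSX ~10.8, 6700XT ~9.6, 6800 ~8.5, 6800XT ~7.1, 6900XT ~6.4
```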

CU scaling tests will always favour the lower CU count in this case, for those reasons; it really comes down to where the bottlenecks are.
 
You can't lock clocks and do comparisons like that, since the 6000 series relies on higher clock speed for higher bandwidth on the Infinity Cache, which is needed to feed the CUs.
That's my only issue with what they did. Since clock speed is now part of the memory subsystem, by locking the clocks they've starved the CUs without compensating for it.

If they want to normalize for clocks and CUs, they need to do it in post. But that information doesn't exist: this was the only way they could do it, which doesn't necessarily mean it's accurate; at least, not with how these particular AMD cards are set up.
What? The GPU with fewer CUs is the one at a disadvantage in this comparison, as it has lower bandwidth, which always has some impact on performance. So yes, it wasn't a perfect test, but not in that direction ;d It would be a perfect test of CU scaling if all cards had the same bandwidth, but it's still good.

From computerbase:
The Radeon RX 6800, Radeon RX 6800 XT and Radeon RX 6900 XT all have a 256-bit interface with 16 Gbps memory and a 128 MB "Infinity Cache" with 2.0 TB/s, so they are absolutely identical. The Radeon RX 6700 XT, on the other hand, only accesses a 192-bit interface, 16 Gbps memory and a 96 MB "Infinity Cache" with 1.5 TB/s. The Radeon RX 6700 XT has less memory bandwidth and a smaller and slower "Infinity Cache". That cannot be compensated for either.

This means that the graphics card is inevitably at a disadvantage, which should, however, be quite small due to the lower GPU clock and the significantly lower number of CUs.
 
What? The GPU with fewer CUs is the one at a disadvantage in this comparison, as it has lower bandwidth, which always has some impact on performance. So yes, it wasn't a perfect test, but not in that direction ;d It would be a perfect test of CU scaling if all cards had the same bandwidth, but it's still good.
No, it would be a perfect test of CU scaling if you normalized bandwidth per CU, not bandwidth per card.
 
No, it would be a perfect test of CU scaling if you normalized bandwidth per CU, not bandwidth per card.
Perfect would be if they all had something like 10TB/s, so we could be sure none of them is bandwidth limited. The 40 CU card with less bandwidth at 4K surely takes some performance hit from bandwidth, but it's still a good test.
 
Perfect would be if they all had something like 10TB/s, so we could be sure none of them is bandwidth limited. The 40 CU card with less bandwidth at 4K surely takes some performance hit from bandwidth, but it's still a good test.
Yes, that would be great, because you remove bandwidth as a bottleneck and look purely at how well the data is sent to be parallelized over the CUs for work.
 
This is how things pan out.
  • Xbox - 1 CPU core capable of executing 1 thread at a time.
    • Can only execute 1 CPU thread at a time.
  • X360 - 3 CPU cores capable of executing 2 threads at a time.
    • Can execute 6 CPU threads simultaneously.
  • XBO - 8 CPU cores capable of executing 1 thread at a time.
    • Can execute 8 CPU threads simultaneously.
  • XBS consoles - 8 CPU cores capable of executing 2 threads at a time.
    • SMT off - Can execute 8 CPU threads simultaneously
    • SMT on - Can execute 16 CPU threads simultaneously
Basically running any previous generation of Xbox game doesn't benefit from SMT being on with XBS consoles.

BC mode with SMT off can therefore use the extra CPU cycles from the higher clock speed to potentially further enhance a BC title.

Regards,
SB

That kinda gets to my original confusion, I guess.

If last gen was pretty much the same for Sony and MS, with both CPUs not having SMT, why doesn't Sony need a specific non-SMT mode for their machine to emulate last gen?

Unless they already have one of those, but I only recall Microsoft making such an option possible with the Series consoles, not Sony with the PS5.
 
That kinda gets to my original confusion, I guess.

If last gen was pretty much the same for Sony and MS, with both CPUs not having SMT, why doesn't Sony need a specific non-SMT mode for their machine to emulate last gen?

Unless they already have one of those, but I only recall Microsoft making such an option possible with the Series consoles, not Sony with the PS5.

The only reason that MS does it is to enable a higher clock for the CPU cores when SMT is off. It's not about needing to turn it off to emulate last gen (for MS) but about getting more CPU performance to use for potential enhancements for last gen.

Last gen titles would also work just as well on MS consoles with SMT on, there would just be less CPU headroom in that case for potential enhancements or performance improvements.

Sony aren't looking to enhance any BC titles, so they don't need any extra CPU speed for BC. As long as previous gen titles work, that's all they need.

Regards,
SB
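For a rough sense of the headroom involved: a sketch using the publicly reported Series X CPU clocks (3.8 GHz with SMT off, 3.66 GHz with SMT on), which are my assumption rather than something stated in the post:

```python
clock_smt_off = 3.8    # GHz, reported Series X CPU clock with SMT disabled (assumed spec)
clock_smt_on  = 3.66   # GHz, reported Series X CPU clock with SMT enabled (assumed spec)
extra_cycles = clock_smt_off / clock_smt_on - 1
print(f"~{extra_cycles:.1%} more per-core cycles available for a BC title with SMT off")
```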
 
The only reason that MS does it is to enable a higher clock for the CPU cores when SMT is off. It's not about needing to turn it off to emulate last gen (for MS) but about getting more CPU performance to use for potential enhancements for last gen.

Last gen titles would also work just as well on MS consoles with SMT on, there would just be less CPU headroom in that case for potential enhancements or performance improvements.

Sony aren't looking to enhance any BC titles, so they don't need any extra CPU speed for BC. As long as previous gen titles work, that's all they need.

Regards,
SB

That's why I come here; you guys know your stuff. Thanks for explaining it to me!

Basically it's not necessary, but in cases where SMT isn't being used in legacy emulation, higher clock speeds are preferred instead for better performance. I guess they also do this to accommodate their FPS-doubling initiative as well as the res boosts?!
 
The point of my post was to showcase that CU scaling works and can work perfectly provided the data can be fed to the CUs properly, which is a function of both programming and of course consumption of available bandwidth

But again, computerbase's results completely go against what you're saying.

The IC has little effect at 1080p and becomes more important as the resolution increases, and yet in their testing 1080p still shows roughly the same 70% CU scaling as 4K does.

If it were a bandwidth issue, 1080p would scale the most (provided it's not CPU limited, etc.) as there's more available bandwidth, but it doesn't, indicating that bandwidth isn't really the issue.

Looking at your TechPowerUp example, especially the overclocking section where they show average clocks at stock, it would seem that the clock speeds were not matched.

So it's not a CU scaling example but a CU+clock speed scaling example.
 