Digital Foundry Article Technical Discussion [2021]

But again, Computerbase's results completely go against what you're saying.

The IC has little effect at 1080p and becomes more important as the resolution increases, and yet in their testing 1080p still shows roughly the same 70% CU scaling as 4K does.

If it was a bandwidth issue, 1080p would scale the most (provided it's not CPU limited, etc.) as there's more available bandwidth, but it doesn't. That indicates bandwidth isn't really the issue.

Looking at your TechPowerUp example, especially the overclocking section where they show average clocks at stock, it would seem that the clock speeds were not a match.

So it's not a CU scaling example but a CU + clock speed scaling example.
Bandwidth requirements are resolution * framerate. If you double the framerate, you double the bandwidth regardless of resolution. It's unlikely the whole game can fit into the 128MB IC at ultra settings - I think you were thinking of the footprint requirements that come with resolution. Since we know everything can't fit into 128MB, the GPU is constantly hitting off-chip memory as well, for everything else, so off-chip memory is a massive factor and the CUs obviously rely heavily on the IC to make up for how little off-chip memory bandwidth is available. Normalizing the bandwidth on the IC would be an unfair penalization, because it needs as much bandwidth as possible to go as fast as it can.
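To put a rough number on that, here's a minimal Python sketch; the 4 bytes per pixel and the single touch per pixel are simplifying assumptions, and real frames touch memory many times over:

```python
def framebuffer_gbs(width, height, fps, bytes_per_pixel=4, touches_per_pixel=1):
    """Rough lower bound: bytes written to the framebuffer per second."""
    return width * height * bytes_per_pixel * touches_per_pixel * fps / 1e9

print(framebuffer_gbs(1920, 1080, 60))    # ~0.5 GB/s
print(framebuffer_gbs(1920, 1080, 120))   # ~1.0 GB/s - doubling fps doubles it
print(framebuffer_gbs(3840, 2160, 60))    # ~2.0 GB/s - 4x the pixels, 4x the traffic
```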

If increasing bandwidth results in increased performance, then bandwidth is the bottleneck. This is why, if you want to test CU scaling, you must normalize bandwidth per CU. The off-chip bandwidth per CU of the 6800XT is 7.1 GB/s. The bandwidth per CU for the 6700XT is 9.6 GB/s. 7.1 / 9.6 is roughly 75%, very close to your 72% scaling. Or put another way, 512 GB/s is only 33% more than 384 GB/s, once again very close to the "Despite 50 percent more compute units in the Radeon RX 6800 and a good deal more memory bandwidth, the performance only increases by an average of 28 percent compared to the 40 CU configuration."
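The arithmetic spelled out in plain Python, using the figures above (72 CUs for the 6800XT, 40 for the 6700XT):

```python
# Off-chip bandwidth per CU from the figures quoted above.
bw_per_cu_6800xt = 512 / 72    # ~7.1 GB/s per CU
bw_per_cu_6700xt = 384 / 40    # 9.6 GB/s per CU

print(bw_per_cu_6800xt / bw_per_cu_6700xt)   # ~0.74, i.e. roughly 75%
print(512 / 384 - 1)                         # ~0.33, i.e. only 33% more total bandwidth
```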

A good deal more bandwidth is relative, because that bandwidth gets used up outputting more frames per second. If you have unlimited bandwidth, you move the bottleneck to compute and processing. If you have more compute and processing available than bandwidth, the bottleneck is bandwidth. The only way to measure CU scaling is to either max out everything else or normalize everything, especially the bandwidth per CU if that's what you want to measure. But of course there are other factors, like fixed-function hardware, etc. None of which was done here.
 
The IC has little effect at 1080p and becomes more important as the resolution increases, and yet in their testing 1080p still shows roughly the same 70% CU scaling as 4K does.
To be clear, are we talking about this article? Going from 40 CUs to 80 CUs at 1080p shows a 50% increase in performance, while it's 60% at 1440p and 69% at 4K. This makes sense because adding more compute power at lower resolutions is going to give you less of an increase, because you are less compute limited.
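To make explicit the two ways "scaling" gets quoted in this thread (quick Python, numbers from the article as listed above):

```python
# Gain from doubling CUs (40 -> 80) at fixed clocks, per resolution.
gains = {"1080p": 0.50, "1440p": 0.60, "2160p": 0.69}
for res, gain in gains.items():
    speedup = 1.0 + gain
    # e.g. at 4K: a 1.69x speedup vs the ideal 2.0x from doubled CUs
    print(f"{res}: +{gain:.0%} gain, {speedup / 2.0:.0%} of ideal doubled throughput")
```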
 
The bandwidth per CU for the 6700XT is 9.6 GB/s. 7.1 / 9.6 is roughly 75%, very close to your 72% scaling. Or put another way, 512 GB/s is only 33% more than 384 GB/s, once again very close to the "Despite 50 percent more compute units in the Radeon RX 6800 and a good deal more memory bandwidth, the performance only increases by an average of 28 percent compared to the 40 CU configuration."

I know all the data won't fit into the IC.

What you're not factoring in is the smaller IC in the 6700XT and the smaller amount of IC bandwidth when compared to the 6800XT.

So while it has more bandwidth per CU on paper than a 6800XT, it will also need to access off-die memory much more frequently, nullifying any on-paper advantage it has.
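As a rough illustration of how that can play out - a Python sketch where the hit rates and the simple bandwidth-amplification model (off-chip bandwidth divided by miss rate) are stand-in assumptions, not measured figures:

```python
def delivered_bw_per_cu(dram_bw_gbs, ic_hit_rate, cus):
    # Crude model: the cache filters requests, so each GB/s of off-chip bandwidth
    # effectively goes 1 / (1 - hit_rate) times further. Ignores IC bandwidth limits.
    return dram_bw_gbs / (1.0 - ic_hit_rate) / cus

# Placeholder hit rates: the larger 128MB IC is assumed to hit noticeably more often
# than the 6700XT's smaller cache at the same settings.
print(delivered_bw_per_cu(512, ic_hit_rate=0.60, cus=72))   # ~17.8 GB/s per CU
print(delivered_bw_per_cu(384, ic_hit_rate=0.45, cus=40))   # ~17.5 GB/s per CU
```

With those made-up hit rates, the 6700XT's on-paper 9.6 vs 7.1 GB/s per CU advantage largely disappears, which is the point being made here.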
 
Bandwidth requirements are resolution * framerate. If you double the framerate, you double the bandwidth regardless of resolution. It's unlikely the whole game can fit into the 128MB IC at ultra settings - I think you were thinking of the footprint requirements that come with resolution. Since we know everything can't fit into 128MB, the GPU is constantly hitting off-chip memory as well, for everything else, so off-chip memory is a massive factor and the CUs obviously rely heavily on the IC to make up for how little off-chip memory bandwidth is available. Normalizing the bandwidth on the IC would be an unfair penalization, because it needs as much bandwidth as possible to go as fast as it can.

If increasing bandwidth results in increased performance, then bandwidth is the bottleneck. This is why, if you want to test CU scaling, you must normalize bandwidth per CU. The off-chip bandwidth per CU of the 6800XT is 7.1 GB/s. The bandwidth per CU for the 6700XT is 9.6 GB/s. 7.1 / 9.6 is roughly 75%, very close to your 72% scaling. Or put another way, 512 GB/s is only 33% more than 384 GB/s, once again very close to the "Despite 50 percent more compute units in the Radeon RX 6800 and a good deal more memory bandwidth, the performance only increases by an average of 28 percent compared to the 40 CU configuration."

A good deal more bandwidth is relative, because that bandwidth gets used up outputting more frames per second. If you have unlimited bandwidth, you move the bottleneck to compute and processing. If you have more compute and processing available than bandwidth, the bottleneck is bandwidth. The only way to measure CU scaling is to either max out everything else or normalize everything, especially the bandwidth per CU if that's what you want to measure. But of course there are other factors, like fixed-function hardware, etc. None of which was done here.
Sure, but you need to profile the game to see if main memory bandwidth is really the main limiting factor. It's probably going to depend on the game.
 
Benchmarking games to get real-world performance is always useful, but I think if we are trying to determine whether adding more CUs adds more compute power, we would need a single game or application with adjustable compute loads and fixed bandwidth, memory, ROP, fixed-function, etc. loads.
 
Benchmarking games to get real-world performance is always useful, but I think if we are trying to determine whether adding more CUs adds more compute power, we would need a single game or application with adjustable compute loads and fixed bandwidth, memory, ROP, fixed-function, etc. loads.

In certain games (was it Control?), the XSX shows almost the exact percentage of advantage that its paper specs suggest. It would be interesting to explore what those games are doing when that happens.
 
Sure, but you need to profile the game to see if main memory bandwidth is really the main limiting factor. It's probably going to depend on the game.
I agree, it's what I wrote. Those tests were never conducted. The article locked the clock speed at 2GHz and began 'CU scaling tests'.

I know all the data won't fit into the IC.

What you're not factoring in is the smaller IC in the 6700XT and the smaller amount of IC bandwidth when compared to the 6800XT.

So while it has more bandwidth per CU on paper than a 6800XT, it will also need to access off-die memory much more frequently, nullifying any on-paper advantage it has.
While that's certainly true, we are not trying to determine why the 6700XT is slower. We want to determine how well work is divided over the CUs. If you don't have enough bandwidth, the CUs sit idle; that's not a CU scaling problem, that's a bandwidth problem - in contrast to the 6700XT, where the CUs are not sitting idle if there is enough bandwidth to keep them all fed.
 
While that's certainly true, we are not trying to determine why the 6700XT is slower. We want to determine how well work is divided over the CUs. If you don't have enough bandwidth, the CUs sit idle; that's not a CU scaling problem, that's a bandwidth problem - in contrast to the 6700XT, where the CUs are not sitting idle if there is enough bandwidth to keep them all fed.

Your argument was that the 6700XT has more bandwidth per CU, which gives it a benefit; there's evidence to suggest that in the real world it doesn't.
 
Bandwidth requirements are resolution * framerate.
Plus overdraw. Multiple full-screen samples such as post effects quickly add up; likewise particle effects. As part of the suggested benchmark, you'd want a game that's just rendering opaque geometry.

Edit: I'll just tech this up a bit. A 9-tap Gaussian, as needed for bloom, is going to need 9 x resolution per update. I don't think 9 taps is anything like enough - Google throws up 17 taps. That's 17 full-screen buffer reads, taking your res*framerate requirement up to res*framerate*post-effect-samples. Hence these are undersampled at quarter res or worse.
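Putting rough numbers on a single 17-tap full-screen pass (Python; RGBA16F at 4K60 is an assumption, and this naive count ignores the texture cache and separable filtering, so treat it as a worst case):

```python
width, height, fps = 3840, 2160, 60
bytes_per_pixel = 8                    # assumed RGBA16F buffer
taps = 17

reads  = width * height * bytes_per_pixel * taps   # 17 full-screen buffer reads
writes = width * height * bytes_per_pixel          # one full-screen write
print((reads + writes) * fps / 1e9)                # ~72 GB/s for a single blur pass
```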

I bet ML could do amazing quality post effects without crushing the RAM bus!
 
Your argument was that the 6700XT has more bandwidth per CU, which gives it a benefit; there's evidence to suggest that in the real world it doesn't.
Real-world performance hits all over the GPU. If you want to know CU scaling alone, you run synthetic benchmarks that use compute shaders only, so you get memory to CU and back to memory. If you want to benchmark the unified shader pipeline, then geometry output and ROPs are automatically part of the equation as well, and they are likely the largest consumers of bandwidth; they are the ones that do all the read/write work in the 3D pipeline.
 
The article locked the clock speed at 2GHz and began 'CU scaling tests'.
Regarding the effect of clock speed on the cache, it might be helpful to look at NVIDIA's alternatives to determine the original effect of the cache to begin with.

The 6700XT is equivalent to the 3060Ti in performance; the memory bandwidth of the 6700XT is 384GB/s while the 3060Ti is 448GB/s, so the 96MB of cache in the 6700XT is only compensating for a 64GB/s deficit.

However, it gets serious as we move up the ladder. The 6800XT is the equal of the 3080; the 6800XT is 512GB/s, the 3080 is 760GB/s. So the 128MB cache of the 6800XT is compensating for about 250GB/s.

The 6900XT is a 512GB/s GPU too, but its direct competitor, the 3080Ti, is a monstrous 912GB/s GPU, so the 128MB of cache in the 6900XT is compensating for a large 400GB/s deficit this time.
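Spelling those gaps out, including how much of the NVIDIA card's bandwidth the cache has to stand in for (trivial Python, same numbers as above):

```python
pairs = {
    "6700XT vs 3060Ti": (384, 448),
    "6800XT vs 3080":   (512, 760),
    "6900XT vs 3080Ti": (512, 912),
}
for name, (amd_bw, nv_bw) in pairs.items():
    deficit = nv_bw - amd_bw
    print(f"{name}: {deficit} GB/s deficit "
          f"({deficit / nv_bw:.0%} of the NVIDIA card's bandwidth)")
```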
 
Regarding the effect of clock speed on the cache, it might be helpful to look at NVIDIA's alternatives to determine the original effect of the cache to begin with.

The 6700XT is equivalent to the 3060Ti in performance; the memory bandwidth of the 6700XT is 384GB/s while the 3060Ti is 448GB/s, so the 96MB of cache in the 6700XT is only compensating for a 64GB/s deficit.

However, it gets serious as we move up the ladder. The 6800XT is the equal of the 3080; the 6800XT is 512GB/s, the 3080 is 760GB/s. So the 128MB cache of the 6800XT is compensating for about 250GB/s.

The 6900XT is a 512GB/s GPU too, but its direct competitor, the 3080Ti, is a monstrous 912GB/s GPU, so the 128MB of cache in the 6900XT is compensating for a large 400GB/s deficit this time.
I would, but then we get into architectural differences in how they approach work scheduling. I was really tempted to go that route, but I wanted to minimize the amount of variation that could arise.
 
Real-world performance hits all over the GPU. If you want to know CU scaling alone, you run synthetic benchmarks that use compute shaders only, so you get memory to CU and back to memory. If you want to benchmark the unified shader pipeline, then geometry output and ROPs are automatically part of the equation as well, and they are likely the largest consumers of bandwidth; they are the ones that do all the read/write work in the 3D pipeline.
Then you either shove those benchmarks in rivals' faces as proof of your superiority and market the bejeezus out of them, or throw them out the window if you are making a game and just use the profiler. :D
 
Real-world performance hits all over the GPU. If you want to know CU scaling alone, you run synthetic benchmarks that use compute shaders only, so you get memory to CU and back to memory. If you want to benchmark the unified shader pipeline, then geometry output and ROPs are automatically part of the equation as well, and they are likely the largest consumers of bandwidth; they are the ones that do all the read/write work in the 3D pipeline.

I feel that we're going around in circles now.

We've established that the comparison is far from perfect; we've also established that it's the only one available at the moment, and that it shows an improvement over previous AMD GPU CU scaling.

So time to leave it there.
 
Then you either shove those benchmarks in rivals' faces as proof of your superiority and market the bejeezus out of them, or throw them out the window if you are making a game and just use the profiler. :D
Yeah, my bad. I'm not interested in pursuing this any further. I'm pretty sure we just ended up agreeing that the test was not perfect, but there isn't anything better either, and that architecturally the CU saturation/scaling improved.

But in the face of being challenged to find synthetic benchmarks, I couldn't resist.

These aren't easy to find; unfortunately they don't have DL benchmarks for AMD stuff, and in particular I need to look for GEMM benchmarks. It really just depends on what type of GEMM benchmark is being run.

Compute & Synthetics - The NVIDIA GeForce RTX 2070 Founders Edition Review: Mid-Range Turing, High-End Price (anandtech.com)

[Chart: compute benchmark results from the linked AnandTech review]


This is a synthetic benchmark that outputs its result in single-precision TFLOPS.
Relative to their theoretical limits:
The 2070 is 98% of max theoretical
The 2080 is 95% of max theoretical
The 2080 Ti is 93% of max theoretical
The Titan V is 97% of max theoretical

So if a benchmark shows a GPU reaching its maximum theoretical limit, then we are seeing the effects of CU scaling - since it cannot achieve more TFLOPS than it is rated for, this must imply near-perfect CU/SM scaling. This is an ideal benchmark to showcase CU scaling.

Of note, these GEMM benchmarks are designed to do what you see above. As per the original commentary about CU scaling, @davis.anthony is correct that no game will do this, and in real-world performance the results can be far worse than these best-case access patterns. But I wanted to point out that GPUs are capable of dividing work so that all of their CUs are fully saturated, which is what I was trying to communicate: they can technically do it, even though in practice games won't achieve it. I don't think any game ever achieves more than 40-50% saturation of its CUs, IIRC, and that's fairly high. Unfortunately, there are no benchmarks here for AMD cards; I don't know many who run these types of benches.
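If anyone wants to reproduce the idea, here's a sketch of that kind of synthetic GEMM test using CuPy as one possible tool (assumes an NVIDIA GPU with CuPy installed; the peak figure is a placeholder you'd replace with your card's rated FP32 spec):

```python
import time
import cupy as cp

n = 8192
a = cp.random.rand(n, n, dtype=cp.float32)
b = cp.random.rand(n, n, dtype=cp.float32)
cp.matmul(a, b)                          # warm-up so setup costs don't get timed
cp.cuda.Device().synchronize()

iters = 10
start = time.perf_counter()
for _ in range(iters):
    c = cp.matmul(a, b)
cp.cuda.Device().synchronize()
elapsed = time.perf_counter() - start

flops = 2 * n**3 * iters                 # multiply-add counted as 2 FLOPs
achieved_tflops = flops / elapsed / 1e12
peak_tflops = 20.0                       # placeholder: your GPU's FP32 peak
print(f"{achieved_tflops:.1f} TFLOPS = {achieved_tflops / peak_tflops:.0%} of peak")
```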
 
Df Article @ https://www.eurogamer.net/articles/...romising-campaign-with-tech-issues-to-address

Halo Infinite tech preview: a promising campaign with tech issues to address
Does Halo work in an open world? We finally have answers.

Halo Infinite's impressive multiplayer component is now available and based on what we've played in prior test flights, it's highly impressive. However, it's the campaigns that we really love to play and in the wake of last year's controversial gameplay trailer, 343 Industries chose to delay and rework the campaign significantly, re-revealing the single-player component only a couple of weeks back. Despite clear visual improvements, it didn't answer the key question: do we really need an open world Halo? Can a series defined by skilfully crafted combat encounters work in a sandbox format? After going hands-on with preview code, we're optimistic - but there's still significant work to do in polishing the game to perfection.

Let's quickly discuss what we can talk about in this preview phase. Essentially this boils down to the first four missions of the game, two of which introduce the new campaign and take place indoors, so yes, right away we can confirm that 'classic' Halo levels are present in the new game. We can also share our thoughts on a further two missions, both of which take place in the new Zeta Halo open world. Halo Infinite launches on all current and last-gen Xbox consoles and PC, but the code we had runs only on Xbox Series consoles. We'll be looking at all versions in a much more granular fashion closer to launch.

...
 
Df Article @ https://www.eurogamer.net/articles/...romising-campaign-with-tech-issues-to-address

Halo Infinite tech preview: a promising campaign with tech issues to address
Does Halo work in an open world? We finally have answers.

Halo Infinite's impressive multiplayer component is now available and based on what we've played in prior test flights, it's highly impressive. However, it's the campaigns that we really love to play and in the wake of last year's controversial gameplay trailer, 343 Industries chose to delay and rework the campaign significantly, re-revealing the single-player component only a couple of weeks back. Despite clear visual improvements, it didn't answer the key question: do we really need an open world Halo? Can a series defined by skilfully crafted combat encounters work in a sandbox format? After going hands-on with preview code, we're optimistic - but there's still significant work to do in polishing the game to perfection.

Let's quickly discuss what we can talk about in this preview phase. Essentially this boils down to the first four missions of the game, two of which introduce the new campaign and take place indoors, so yes, right away we can confirm that 'classic' Halo levels are present in the new game. We can also share our thoughts on a further two missions, both of which take place in the new Zeta Halo open world. Halo Infinite launches on all current and last-gen Xbox consoles and PC, but the code we had runs only on Xbox Series consoles. We'll be looking at all versions in a much more granular fashion closer to launch.

...

Cutscene animations being rendered at lower update frequencies when the cutscene itself is being rendered at 60 Hz seems a bit strange. I'm curious why they chose to go that route.

Regards,
SB
 
Cutscene animations being rendered at lower update frequencies when the cutscene itself is being rendered at 60 Hz seems a bit strange.
More curious is the comment by Alex that the animation frequency is very variable. It almost feels like a bug.
 