Games and bandwidth

That's also my take from the Google translation. Nice find btw, and nice work from Tridam!

Interestingly, the better the shader structure is suited to the AMD ALUs, the more bandwidth seems to become an issue, as is apparent in GRID. Also, (partly) deferred renderers benefit the most from additional BW.
 
Well done by Tridam! He has a good understanding of GPU workload, as witnessed here:
Each stage of building a 3D image can be limited by a different component. For example, the first 10% may be limited by the CPU, the next 10% by the setup engine, the next 60% by the GPU's compute core, and finally the last 20% by memory bandwidth. In that case, doubling the bandwidth only benefits the last phase of rendering and increases overall performance by "only" 11%.
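To spell that arithmetic out (just a sketch of the hypothetical 10/10/60/20 split from the quote, nothing measured):

Code:
# The quote's hypothetical frame: 10% CPU, 10% setup, 60% shading, 20% memory BW.
# Doubling bandwidth only halves the last slice.
phases = {"cpu": 0.10, "setup": 0.10, "shader": 0.60, "bandwidth": 0.20}

old_time = sum(phases.values())                # 1.00, normalized frame time
new_time = old_time - phases["bandwidth"] / 2  # 0.90 after doubling bandwidth

print(f"Overall gain: {old_time / new_time - 1:.1%}")  # ~11.1%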
I'm going to do the regression on all the games in that review soon. Getting values for MB per frame will be neat, particularly since we have AA and non-AA data.
 
Nice! What about adding the % gain from OC GPU + OC mem to your OC graphs, to see if both combined skew the perf picture a little more?

Can do, but I'm getting too tired to add it now...

I just added a 4870 vs. 4850 @ equal clocks graph on post 4.
 
It's not just bandwidth. It's about fillrate and the amount of bandwidth needed to use that fillrate properly. AA is very dependent on bandwidth. Most games are not SP limited; rather, they're fillrate hungry.

RV770 is much more efficient than GT200 or G92. Its FP16 blending fillrate even surpasses the GTX 280's despite having less theoretical fillrate, although GT200 surpasses RV770 when it comes to bilinear fillrate.

G80 did have lower bilinear fillrate than G92, but its FP16 fillrate was very close to the 9800 GTX's. So we have the 9800 GTX beating out the Ultra at lower resolutions, but it loses steam when it's bandwidth limited at higher resolutions with AA.

GT200 has a lot of processing power, but its fillrate is only 40% better than the 9800 GTX's, hence you get the performance difference, and more when it's bandwidth constrained. Bandwidth is useless without fillrate. A good example would be the 2900 XT.
 
It's not just bandwidth.
The point of this thread is to illustrate and quantify that very point. Even with the 4850, most games are only BW limited about 30% of the time.

It's about fillrate and the amount of bandwidth needed to use that fillrate properly. AA is very dependent on bandwidth. Most games are not SP limited; rather, they're fillrate hungry.
If that was true then the 32-ROP, 512-bit GT200 would be a lot faster than the 16-ROP, 256-bit G92 (core clock speed is only 11% lower). It's more likely that setup speed and texturing/SP is holding back GT200 from blowing out G92.

RV770 is much more efficient than GT200 or G92. Its FP16 blending fillrate even surpasses the GTX 280's despite having less theoretical fillrate, although GT200 surpasses RV770 when it comes to bilinear fillrate.

G80 did have lower bilinear fillrate than G92, but its FP16 fillrate was very close to the 9800 GTX's. So we have the 9800 GTX beating out the Ultra at lower resolutions, but it loses steam when it's bandwidth limited at higher resolutions with AA.
Higher resolutions don't increase the percentage of time you're BW limited unless CPU or vertex limitations are present at the lower resolution (which is possible, and the G92 has a substantial advantage here over G80). This is a common myth. AA does, though.

Anyway, it depends on the game. If a game is 40% limited by BW or ROPs, you'd prefer the 28% advantage of G80 in that department over the 17-30% advantage of G92 elsewhere.

GT200 has a lot of processing power, but its fillrate is only 40% better than the 9800 GTX's, hence you get the performance difference, and more when it's bandwidth constrained. Bandwidth is useless without fillrate. A good example would be the 2900 XT.
It's not 40%, it's 78%. GT200 isn't lacking in fillrate at all. Few if any games perform 78% faster on the GTX 280 than the 9800 GTX.
 
The point of this thread is to illustrate and quantify that very point. Even with the 4850, most games are only BW limited about 30% of the time.

It's not just bandwidth with the 4870 vs. 4850. Bandwidth is useless if you don't have fillrate backing it up. Do you understand the generator and carrier concept?

If bandwidth is the key, why did the 2900 XT fail? Why isn't the GTX 280 twice as fast as G92 with twice the bandwidth?

You can't just point and say it's bandwidth is limited by 30%. That's absurd.


If that was true then the 32-ROP, 512-bit GT200 would be a lot faster than the 16-ROP, 256-bit G92 (core clock speed is only 11% lower). It's more likely that setup speed and texturing/SP is holding back GT200 from blowing out G92.

I wasn't talking about pixel fillrate, rather texel fillrate. GT200 isn't much more than a G92 in this department. Games for the most part load and unload textures from memory. It's no surprise that GT200 doesn't perform much better than G92. GT200 performs better because it's not being constrained by bandwidth, plus a little more fillrate combined with 1 GB of VRAM, and there you go. :!:


Higher resolutions don't increase the percentage of time you're BW limited unless CPU or vertex limitations are present at the lower resolution (which is possible, and the G92 has a substantial advantage here over G80). This is a common myth. AA does, though.

What are you really trying to say here? Some percentage at higher resolution, vertex at low resolution? G92 better than G80... :?:

G80 and G92 have much more in common than you think. There really isn't a vertex difference or what have you, because the SPs haven't changed. Just the ROPs and memory bus width did. The real difference between G80 and G92 is that G92 has more texture address units per SP to balance the lower bandwidth and ROP count.



Anyway, it depends on the game. If a game is 40% limited by BW or ROPs, you'd prefer the 28% advantage of G80 in that department over the 17-30% advantage of G92 elsewhere.

Since you can't measure percentage differences I don't know why you keep mentioning this percentage limitation.


It's not 40%, it's 78%. GT200 isn't lacking in fillrate at all. Few if any games perform 78% faster on the GTX 280 than the 9800 GTX.

Again, I wasn't talking about color fill, rather texel fillrate. Sorry, it's not 40%, rather 15-20% depending on how you want to look at it.

GTX 280: 48.2 GTex/s peak bilinear, 24.1 GTex/s FP16
9800 GTX: 43.2 GTex/s peak bilinear, 21.6 GTex/s FP16
 
It's not just bandwidth with the 4870 vs. 4850. Bandwidth is useless if you don't have fillrate backing it up. Do you understand the generator and carrier concept?
It's not useless, because the 4870 is faster per clock than the 4850 (i.e. it is more than 20% faster).

If bandwidth is the key, why did the 2900 XT fail? Why isn't the GTX 280 twice as fast as G92 with twice the bandwidth?
Who said bandwidth is key? Not me.

You can't just point and say it's bandwidth is limited by 30%. That's absurd.
Did you even read this thread?

I'm not just pointing and saying. I have a model, I fit data to it, and I have a very low standard error.

I'm not saying "it's bandwidth is limited by 30%", whatever the heck that means. I'm saying ~30% of a typical frame's rendering time (in HL2:E2, ET:QW, and F.E.A.R.) is consumed by operations that are BW limited on the 4850.

That part of the workload will be completed faster with more bandwidth. The other parts will be completed quicker with a faster GPU.
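In case it helps to see what I mean concretely, here's a minimal sketch in Python of that kind of fit, with made-up clocks and framerates standing in for real data: frame time is modeled as a GPU-limited part that scales with core clock plus a BW-limited part that scales with memory bandwidth, and the two coefficients come out of a least-squares fit across cards that differ only in clocks.

Code:
import numpy as np

# Model: frame time t = a / core_clock + b / mem_bandwidth
# 'a' is the GPU-limited work, 'b' the BW-limited work. The data points below
# are hypothetical, just to show the mechanics of the fit.
cards = [
    # (core MHz, bandwidth GB/s, measured fps)
    (625,  64.0, 55.0),   # 4850-like
    (750, 115.2, 68.0),   # 4870-like
    (750,  76.8, 63.0),   # 4870-like with underclocked memory
]

A = np.array([[1.0 / core, 1.0 / bw] for core, bw, _ in cards])
t = np.array([1.0 / fps for _, _, fps in cards])   # frame times in seconds

(a, b), *_ = np.linalg.lstsq(A, t, rcond=None)

core, bw, fps = cards[0]
bw_share = (b / bw) / (a / core + b / bw)
print(f"BW-limited share of a frame on the first card: {bw_share:.0%}")

With more clock combinations than unknowns you also get residuals out of the fit, which is where a standard error comes from.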

I wasn't talking about pixel fillrate, rather texel fillrate.
Then use the right terminology. Pixels fill polygons, but texels don't fill anything. You should have said texturing/sampling/fetch rate. You even used the word blending, which implies pixel rate.

Anyway, you are still wrong in suggesting games are texturing limited. The number of texture samples required for a frame is proportional to screen pixel count (for the most part). Framerate falls off much more slowly than that when you increase resolution.

What are you really trying to say here? Some percentage at higher resolution, vertex at low resolution? G92 better than G80... :?:
No, I'm educating you on two points:

- Increasing resolution doesn't increase the bandwidth usage per pixel. Contrary to myth, GPUs do not become more bandwidth limited at higher resolution. The exception is when polygon speed is a factor, because then higher resolution increases the percentage of time spent limited on pixel processing. AA, however, does increase BW usage per pixel, so I agree with you there.

- For some rendering tasks, the 23% advantage of the 8800 GTX in BW/ROPs is important. In others, the 17-29% advantage of G92 in everything else is important. A typical frame has a mixture of both types of rendering tasks.

(sorry about the 28% figure in my previous post, as I used the wrong G80 memory clock.)

G80 and G92 have much more in common than you think. There really isn't a vertex difference or what have you, because the SPs haven't changed.
You forgot about clock speed. See above.

Since you can't measure percentage differences I don't know why you keep mentioning this percentage limitation.
Again, read the thread. The whole point is that with the 4850 and 4870 - which differ only in BW/clock - we can measure it using multiple regression.
 
It's not useless, because the 4870 is faster per clock than the 4850 (i.e. it is more than 20% faster).

That's my exact point, and all this time you are here arguing about games and bandwidth?


Who said bandwidth is key? Not me.

This thread is about games and bandwidth, is it not? You put up a bunch of numbers and say bandwidth is this much, this percentage, etc...


Did you even read this thread?

I'm not just pointing and saying. I have a model, I fit data to it, and I have a very low standard error.


You have a low standard error? Says who? You? :p


I'm not saying "it's bandwidth is limited by 30%", whatever the heck that means. I'm saying ~30% of a typical frame's rendering time (in HL2:E2, ET:QW, and F.E.A.R.) is consumed by operations that are BW limited on the 4850.

That part of the workload will be completed faster with more bandwidth. The other parts will be completed quicker with a faster GPU.

The obvious. :smile:





Then use the right terminology. Pixels fill polygons, but texels don't fill anything. You should have said texturing/sampling/fetch rate. You even used the word blending, which implies pixel rate.

Anyway, you are still wrong in suggesting games are texturing limited. The number of texture samples required for a frame is proportional to screen pixel count (for the most part). Framerate falls off much more slowly than that when you increase resolution.

I am using the right terminology, which has been used by many hardware sites. You just assumed I was talking about something else. Anyway, you are wrong. Nvidia said games unload textures to memory for the most part, which dictates performance in modern games.

No, I'm educating you on two points:

I never appointed you to be my teacher and I'm educated enough.

- Increasing resolution doesn't increase the bandwidth usage per pixel. Contrary to myth, GPUs do not become more bandwidth limited at higher resolution. The exception is when polygon speed is a factor, because then higher resolution increases the percentage of time spent limited on pixel processing. AA, however, does increase BW usage per pixel, so I agree with you there.

Absurd. Bandwidth is vital for moving pixels on to the screen.

- For some rendering tasks, the 23% advantage of the 8800 GTX in BW/ROPs is important. In others, the 17-29% advantage of G92 in everything else is important. A typical frame has a mixture of both types of rendering tasks.

(sorry about the 28% figure in my previous post, as I used the wrong G80 memory clock.)

I agree with this statement. The percentage issue aside, all games are programmed to be limited by certain aspects of the GPU. One might be SP limited, another might be texture hungry. The key is bringing a more efficient GPU to eliminate the bottlenecks.



You forgot about clock speed. See above.

Was that even relevant to what I was talking about?
 
Marvelous, may I point out that you're just looking pretty stupid in those posts? :!: You're just missing the point completely, pretty badly, a number of times... (example: G80 vs G92 clock speeds; why do you think Mint said G92 had an advantage for vertex limited cases? 675 vs 575MHz)

And "Absurd. Bandwidth is vital for moving pixels on to the screen." is horribly wrong and highly arrogant. The point is that bandwidth *per pixel* doesn't go up (in fact, it goes down for at least three distinct reasons on modern GPUs, including vertex attributes, texture magnification, and framebuffer compression), so you're less likely to be bandwidth limited.

There's nothing wrong with being wrong. But there is with being overconfident and wrong at the same time. So may I point out it'd be a good idea to be a bit more open to others' points? Oh, and nice thread Mintmaster, since I didn't comment on it yet! :)
 
RV770 is much more efficient than GT200 or G92. Its FP16 blending fillrate even surpasses the GTX 280's despite having less theoretical fillrate, although GT200 surpasses RV770 when it comes to bilinear fillrate.

You're talking about 3DMark Vantage's rather strange fillrate tests? Several hundred GTex/sec – I wonder how they arrive at that number. Haven't found anything in the docs as of yet.
 
Thanks Arun. I'm not going to bother with marvelous anymore because he clearly has no interest in learning anything about 3D performance or clearing up his misconceptions.

-----------------------------------------

So I've now analyzed the hardware.fr data, and there's a lot to present. While I mull over the best way to post it, here are some key findings:

1. They recently added a data point of the 4870 with GDDR5 @ 64 GB/s. The performance is well below GDDR3 @ 64 GB/s, thus throwing a bit of a wrench into the calculations. The question is whether the full speed GDDR5 results are still valid for comparison with the GDDR3 results in the regression, because it could simply be a matter of not being tuned for the low speed. Latency can be hidden by GPUs, but it may not be for GDDR5 at 993MHz.

In any case, there are two options for the regression. I can ignore the 4870 @ 725/993 and do it with the other 4 data points, or I can do it with just the three 4870 data points. I think the first method is more sound due to the latency argument. As expected, the second method shows games to be far more dependent on memory bandwidth since the 64 GB/s data point is much slower in all games.

2. Adding a constant term to the regression changes the outcome notably. It represents the average time per frame that performance is limited by the CPU or PCIe bus. The hardware.fr data makes a much better case for including this term than the Firingsquad data, as only one of the 18 benchmarks wound up with a negative coefficient. Standard error is 0.35 fps with the constant term (sweet! :cool:) and 0.92 fps without. (There's a rough sketch of this fit after point 3.)

3. 4xAA makes a huge difference to the percentage of the workload that's BW limited. Most games are seeing about 2.5 times the BW limited load with AA enabled, the exceptions being Crysis, CoH, and GRID. The GPU limited load is not changing much, which is an encouraging sign for the model.
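For point 2, here's what the constant term changes relative to the earlier sketch: it's just an extra column of ones in the design matrix, and the fitted intercept is the per-frame time that doesn't scale with either clock, i.e. the CPU/PCIe-limited part. (Same made-up style of data as before, not the hardware.fr numbers.)

Code:
import numpy as np

# Model with intercept: t = c + a / core_clock + b / mem_bandwidth
# 'c' captures per-frame time limited by the CPU or PCIe bus. Hypothetical data.
cards = [
    (625,  64.0, 55.0),
    (750, 115.2, 68.0),
    (750,  76.8, 63.0),
    (625, 115.2, 60.0),
]

A = np.array([[1.0, 1.0 / core, 1.0 / bw] for core, bw, _ in cards])
t = np.array([1.0 / fps for _, _, fps in cards])

coef, *_ = np.linalg.lstsq(A, t, rcond=None)
c, a, b = coef
print(f"CPU/PCIe-limited time per frame: {c * 1000:.2f} ms")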
 
Here's more data with HD4870 at 625MHz core with memory clock unchanged:

http://www.digit-life.com/articles3/video/rv770-2-part1-p3.html

Real shame there's no 8xMSAA comparisons or tent/edge-detect comparisons. Surely they would stretch bandwidth?

Jawed

Here you go:

Sys:
Xeon E3110 @ 3.6GHz
4GB DDR2-800
Vista x64
Catalyst 8.7 Beta

Test: 3DMark06 - 1680x1050 AA8x AF16x
HD4870 512MB @ 650/2000 -> 6516 (SM2:2475 | SM3:2556)
HD4870 512MB @ 650/3200 -> 8091 (SM2:3122 | SM3:3339)
HD4870 512MB @ 650/3600 -> 8245 (SM2:3203 | SM3:3403)
HD4870 512MB @ 650/4000 -> 8394 (SM2:3279 | SM3:3453)

Test: Company of Heroes - 1680x1050 All Max
HD4870 512MB @ 650/2000 -> (MIN:41.1 | AVG: 95.2)
HD4870 512MB @ 650/3200 -> (MIN:52.5 | AVG:107.0)
HD4870 512MB @ 650/3600 -> (MIN:49.4 | AVG:112.7)
HD4870 512MB @ 650/4000 -> (MIN:50.4 | AVG:110.1)

Test: Devil May Cry 4 - 1680x1050 DX9 8xMSAA All Max
HD4870 512MB @ 650/2000 -> (S1: 85.16 | S2:60.51 | S3: 99.88 | S4:65.80)
HD4870 512MB @ 650/3200 -> (S1:111.26 | S2:74.34 | S3:135.19 | S4:81.75)
HD4870 512MB @ 650/3600 -> (S1:111.53 | S2:77.69 | S3:153.11 | S4:84.38)
HD4870 512MB @ 650/4000 -> (S1:112.22 | S2:74.80 | S3:141.23 | S4:84.16)

Test: Call of Juarez - 1680x1050 DX10 4xMSAA All Max
HD4870 512MB @ 650/2000 -> (MIN:12.6 | AVG:28.9)
HD4870 512MB @ 650/3200 -> (MIN:15.6 | AVG:34.0)
HD4870 512MB @ 650/3600 -> (MIN:15.7 | AVG:34.3)
HD4870 512MB @ 650/4000 -> (MIN:15.0 | AVG:34.5)


The chip seems b/w limited with 2GHz memory.

Source: link
 
I'll join in the fun in this great thread too. Yes, I know 3DMark isn't a game but I can't be bothered with FRAPS these days!

Q6600 @ 3GHz (9 x 333)
4GiB DDR2-666
GeForce 8800 GTX
175.16 ForceWare
Vista 64
Core / Shader / Memory
Code:
3DMark06 - Graphics Test 1: Return to Proxycon
1680 x 1050 - No AA / Trilinear		1680 x 1050 - 8x AA / Trilinear
576 / 1350 / 900 - 38.97 fps		576 / 1350 / 900 - 26.16 fps
288 / 1350 / 900 - 23.50 fps		288 / 1350 / 900 - 15.69 fps
288 /  675 / 900 - 20.53 fps		288 /  675 / 900 - 14.27 fps
576 / 1350 / 450 - 31.90 fps		576 / 1350 / 450 - 18.05 fps


3DMark Vantage - Graphics Test 2: New Calico (Extreme settings)
1680 x 1050 - No AA / Optimal		1680 x 1050 - 8x AA / Optimal
576 / 1350 / 900 - 11.67 fps		576 / 1350 / 900 - 9.30 fps
288 / 1350 / 900 - 7.03 fps		288 / 1350 / 900 - 6.37 fps
288 /  675 / 900 - 6.22 fps		288 /  675 / 900 - 5.09 fps	
576 / 1350 / 450 - 8.18 fps		576 / 1350 / 450 - 6.09 fps


Crysis - Standard GPU benchmark (DX10 Very High settings, 64 bit)
1680 x 1050 - No AA / Optimal		1680 x 1050 - 8xQ AA / Optimal
576 / 1350 / 900 - 19.06 fps		576 / 1350 / 900 - 10.46 fps
288 / 1350 / 900 - 12.62 fps		288 / 1350 / 900 - 7.08 fps
288 /  675 / 900 - 10.61 fps		288 /  675 / 900 - 6.21 fps	
576 / 1350 / 450 - 15.19 fps		576 / 1350 / 450 - 6.87 fps
Unfortunately I couldn't get the shader clock to drop by itself - it would only play ball if the core clock was dropped too. The AA settings are in-game/benchmark.
 
Here you go:

The chip seems b/w limited with 2GHz memory.
Thanks.

I think I saw a comparison somewhere that showed that HD4870 with GDDR5 at the same clocks as HD4850's GDDR3 is considerably slower, so it prolly isn't a useful baseline.

I'm not sure why you presented 650MHz core clock when stock is 750MHz. At 800MHz the benefits of greater bandwidth are slightly more pronounced.

Best case I could find is at 800MHz core, Devil May Cry DX9 8xMSAA S3:
  • 3200MHz = 153.36
  • 4000MHz = 174.85
which is 14% against a 25% bandwidth increase. Otherwise at 800MHz scaling seems to be in the range 5-9%.
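Plugging those two numbers into the simple split-workload model from earlier in the thread gives a back-of-the-envelope estimate of how much of that scene is BW limited at these settings (assuming everything else stays fixed):

Code:
# If a fraction f of the frame scales with memory clock, then
#   speedup = 1 / ((1 - f) + f / bw_ratio)
# Solving for f from the DMC4 S3 numbers above.
bw_ratio = 4000 / 3200          # 1.25x the bandwidth
speedup  = 174.85 / 153.36      # ~1.14x the performance

f = (1 - 1 / speedup) / (1 - 1 / bw_ratio)
print(f"Implied BW-limited fraction: {f:.0%}")   # roughly 60%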

So overall I guess more resolution is required to really exercise the bandwidth of HD4870 with 8xMSAA...

Jawed
 
Thanks.

I think I saw a comparison somewhere that showed that HD4870 with GDDR5 at the same clocks as HD4850's GDDR3 is considerably slower, so it prolly isn't a useful baseline.

I'm not sure why you presented 650MHz core clock when stock is 750MHz. At 800MHz the benefits of greater bandwidth are slightly more pronounced.

I reported the 650MHz results as it's roughly the HD4850's stock frequency, since I don't have a 4850 to test with and I wanted to stress that the 4850 is b/w limited, but I didn't have any clue about what I highlighted in bold. Do you remember where you found that comparison? It'd be interesting.

So overall I guess more resolution is required to really exercise the bandwidth of HD4870 with 8xMSAA...

Agreed...
 
Marvelous, may I point out that you're just looking pretty stupid in those posts? :!: You're just missing the point completely, pretty badly, a number of times... (example: G80 vs G92 clock speeds; why do you think Mint said G92 had an advantage for vertex limited cases? 675 vs 575MHz)

There are many different kinds of G92 chips that are clocked differently. Wasn't Mint talking about shader clocks, and why are you comparing core clocks?

And "Absurd. Bandwidth is vital for moving pixels on to the screen." is horribly wrong and highly arrogant. The point is that bandwidth *per pixel* doesn't go up (in fact, it goes down for at least three distinct reasons on modern GPUs, including vertex attributes, texture magnification, and framebuffer compression), so you're less likely to be bandwidth limited.

You are telling me you don't need more bandwidth to run higher resolutions and AA, when this thread is showing benchmarks at high resolutions with AA? :rolleyes:


There's nothing wrong with being wrong. But there is with being overconfident and wrong at the same time. So may I point out it'd be a good idea to be a bit more open to others' points? Oh, and nice thread Mintmaster, since I didn't comment on it yet! :)

I see you are buddy-buddy with Mintmaster. :rolleyes: Perhaps it's time you took your own advice and were open to others' points.
 