Games and bandwidth

Mintmaster

Veteran
So in another thread, Jawed pointed out some interesting OC data at Firingsquad:
http://www.firingsquad.com/hardware/ati_radeon_4850_4870_performance/page16.asp
What's great is that we have identical hardware with 4 fairly different core:mem ratios, with the highest BW config having 75% more BW/clk than the lowest.

My thinking is as follows:

Most tasks for the GPU are either clearly BW limited or GPU limited. Some workloads would run just as fast with 1/10th the bandwidth (i.e. they scale perfectly with core clock), some would speed up 3x with 3x more BW (i.e. core clock does nothing for them), and others fall somewhere in between. On the whole, it won't be often that a task lies in the range where a 75% BW/clk boost changes it from BW limited to GPU limited. What we can do, then, is characterize the workload into parts A and B: A requires a certain number of cycles to get done, and B requires a certain amount of data transfer. A higher GPU clock reduces the time to finish A, and a higher memory clock reduces the time to finish B.
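In other words, the model is simply:

Render time = A / core clock + B / bandwidth

with A in clocks and B in bytes (the same form as the fitted equation in the edit below).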

What I did was invert the framerates from the above link and use multiple regression to fit the rendering times (i.e. the inverse of FPS) to the inverses of core clock and bandwidth, with no constant term (including one gave messed-up results from overfitting). The model fit the data exceptionally well, with a standard error of 0.6 fps. Then I took B and divided it by the bandwidth to get the time that the card was BW limited. Expressing this as a percentage of total render time:

ET:QW 2560x1600 4xAA/16xAF
4850: 30%
4870: 22%

HL2:E2 2560x1600 4xAA/16xAF
4850: 29%
4870: 21%

FEAR 2560x1600 4xAA/16xAF
4850: 36%
4870: 27%

CoH 1920x1200 4xAA/16xAF
4850: 11%
4870: 7%

These seem like pretty good estimates of how often you get BW limited in these games, though the numbers would be larger if the CPU were a limit for parts of the tests. Crysis definitely has this happen, as evidenced by the 4850 -> 4870 gain being less than 20%, so using the above model there is flawed (you get a negative BW dependence :) ). Maybe some other data from these games (e.g. resolution scaling on different GPUs) would let me extract this factor accurately.

Interesting stuff, though. We can see that the 4850 isn't overwhelmingly BW limited, but GDDR5 definitely makes an impact. The games' data yields coefficients suggesting 280-500 MB per frame of BW limited operations.


*****************************

EDIT: Looks like this thread has a little more appeal than I thought it would, so I'll elaborate on one example. With regression, I came up with the following for RV770 running HL2:EP2 at 2560x1600 w/ 4xAA/16xAF:

Predicted HL2 fps = 1 / ( 9.12M clocks / RV770 freq + 375.6MB / bandwidth )

4870 stock (750/1800): 64.9 predicted fps, 64.7 actual fps
4870 OC'd (790/2200): 70.4 predicted fps, 70.6 actual fps
4850 stock (625/993): 48.8 predicted fps, 49.1 actual fps
4850 OC'd (690/1140): 54.5 predicted fps, 54 actual fps

Not bad at all! With other GPUs, the 9.12M figure will change, but 375.6MB should be similar unless compression efficiencies are different. It might be interesting to test that out if we had the data...
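For anyone who wants to reproduce the fit, it's only a few lines. This is just a sketch rather than exactly what I ran -- I'm treating the listed memory clocks as half the effective data rate (so the stock 4870 is 3600 MT/s on a 256-bit bus = 115.2 GB/s) and quoting B in decimal MB, so the coefficients may come out slightly different from the ones above:

```python
# Rough sketch of the fit, using the four HL2:EP2 data points above.
# Assumptions: 256-bit bus on both cards, and the listed memory clocks are
# half the effective data rate (so stock 4870 = 3600 MT/s = 115.2 GB/s).
import numpy as np

# (core MHz, effective memory MT/s, measured fps)
samples = [
    (750, 3600, 64.7),   # 4870 stock
    (790, 4400, 70.6),   # 4870 OC'd
    (625, 1986, 49.1),   # 4850 stock
    (690, 2280, 54.0),   # 4850 OC'd
]

core = np.array([s[0] for s in samples]) * 1e6            # core clock in Hz
bw   = np.array([s[1] for s in samples]) * 1e6 * 256 / 8  # bandwidth in bytes/s
t    = 1.0 / np.array([s[2] for s in samples])            # render time per frame

# Fit t = A/core + B/bw by least squares, with no constant term
X = np.column_stack([1.0 / core, 1.0 / bw])
(A, B), *_ = np.linalg.lstsq(X, t, rcond=None)

print(f"A = {A / 1e6:.2f} M clocks per frame")
print(f"B = {B / 1e6:.1f} MB per frame")
print("predicted fps:", np.round(1.0 / (X @ np.array([A, B])), 1))
```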

Common question: Is the 4850 bandwidth limited? How about the 4870?

Answer: This is the wrong way to think about it. If you chopped a typical HL2:EP2 frame into 1000 pieces that take the same amount of time on the 4850, it would be BW limited for 288 of them. If you doubled the bandwidth with a 512-bit bus, you'd crunch through those parts in half the time, giving you a 17% framerate boost. For some GPUs, this is worth it -- I doubt a GTX 280's total cost would go down 14% by using eight 1 Gbit chips and a simpler PCB instead of the current sixteen 512 Mbit chips. On the other hand, if you had a 128-bit version of the 4850, you would double the time spent on the BW limited parts, knocking 23% off the fps.
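Here's the same what-if in code, if you want to play with other bus widths (just a sketch, plugging in the fitted HL2 coefficients and the stock 4850's ~63.6 GB/s):

```python
# What-if: scale only the bandwidth term of the fitted HL2:EP2 model.
A_CLOCKS = 9.12e6      # part A, core clocks per frame
B_BYTES  = 375.6e6     # part B, bytes per frame
CORE_HZ  = 625e6       # stock 4850 core clock
BW_BPS   = 63.6e9      # stock 4850 bandwidth (approx.)

def fps(bw_scale):
    return 1.0 / (A_CLOCKS / CORE_HZ + B_BYTES / (BW_BPS * bw_scale))

base = fps(1.0)
print(f"512-bit (2x BW):   {fps(2.0) / base - 1:+.1%}")  # about +17%
print(f"128-bit (0.5x BW): {fps(0.5) / base - 1:+.1%}")  # about -22%
```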
 
This is the thread that I wanted :)

After watching the GTX280 vs 8800Ultra results, you can see that the GTX280 is 2-3x faster in some games (it has 2-3x the shading power of the 8800Ultra), but only 40-50% faster in a lot of games (45% higher bandwidth + tweaks to save BW).

I think that if nVidia had raised the GTX280's bandwidth 2-3x too, the card could always perform 2-3x faster than the 8800Ultra.

I did some sort of study about gaming & bw too:
http://forum.beyond3d.com/showpost.php?p=1147916&postcount=60

If you can, please read it & tell me what you think about it.

Basically I started to play with the GPU & memory freqs to check what was limiting the performance of my favorite game.

Recently I got scores from this game running on a GTX280: 1920x1200 with 2x SSAA at 100 FPS (roughly what my overclocked 8800GTX gets at 1280x1024). So the game performs about 70-75% faster. It could be due to the GTX280's higher bandwidth (45%), the new compression techniques, and the faster ROPs.

So I really agree with you about the importance of bandwidth.
 
In the case of really high resolution + AA + compressed textures, the texture cache works quite well. So how much of your "bandwidth" limited cases are actually texture instruction/latency/address/filter limited instead of bandwidth limited?
 
This is the thread that I wanted :)

After watching the GTX280 vs 8800Ultra results, you can see that the GTX280 is 2-3x faster in some games (it has 2-3x the shading power of the 8800Ultra), but only 40-50% faster in a lot of games (45% higher bandwidth + tweaks to save BW).
On top of what Timothy mentioned, there's also setup rate, which is lower on the GTX 280 than on many other chips. This is very important, because things like shadow map and environment map rendering are often limited by it.

I think that if nVidia had raised the GTX280's bandwidth 2-3x too, the card could always perform 2-3x faster than the 8800Ultra.
I really doubt it, because the biggest increases are math at 62% and bilinear RGBA8 texturing at 146% (and much of the time texturing is only a 23% advantage).
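Roughly where those figures come from, assuming the usual spec numbers (my assumptions, so double-check them: GTX 280 = 240 SPs @ 1296 MHz and 80 TMUs @ 602 MHz; 8800 Ultra = 128 SPs @ 1512 MHz and 32 address / 64 filter units @ 612 MHz):

```python
# Back-of-the-envelope check on the 62% / 146% / 23% figures, using the
# spec numbers listed above (treat them as assumptions).
gtx280_flops = 240 * 1296e6 * 3          # MAD + MUL per SP per clock
ultra_flops  = 128 * 1512e6 * 3
print(f"math: +{gtx280_flops / ultra_flops - 1:.0%}")   # prints ~+61%, in line with the 62% above

gtx280_bilinear      = 80 * 602e6        # bilinear RGBA8 texels/s
ultra_addr_limited   = 32 * 612e6        # when limited by address units
ultra_filter_limited = 64 * 612e6        # when limited by filter units
print(f"texturing (addr limited):   +{gtx280_bilinear / ultra_addr_limited - 1:.0%}")    # ~+146%
print(f"texturing (filter limited): +{gtx280_bilinear / ultra_filter_limited - 1:.0%}")  # ~+23%
```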

I did some sort of study about gaming & bw too:
http://forum.beyond3d.com/showpost.php?p=1147916&postcount=60

If you can, please read it & tell me what do you think about it.

Basically I started to play with the GPU & memory freqs to check what was limiting the performance of my favorite game.
I don't have enough data there, but that's the right idea. For any game that you want to characterize, just run a few tests, downclocking and/or upclocking each frequency individually (core, shader, mem) and/or in pairs.

With a regression fit, we can figure out how much is core limited (setup, texturing, ROPs), shader limited (math), BW limited, and CPU/PCI-e limited. The more data samples and the more reproducible the test, the better the outcome. I'd be glad to do that for you with BloodRayne if you give me the data.
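Something like this is all it would take (just a sketch -- the run data below is a made-up placeholder, not real BloodRayne numbers):

```python
# Sketch of the multi-variable fit: t = A/core + B/shader + C/mem + D,
# where the constant D soaks up the CPU/PCI-e limited portion of the frame.
# The run data below is a made-up placeholder, NOT real measurements.
import numpy as np

# (core MHz, shader MHz, mem MHz, fps) -- one tuple per clocked-up/down run
runs = [
    (575, 1350, 1800, 60.0),
    (500, 1350, 1800, 56.1),
    (650, 1350, 1800, 63.4),
    (575, 1100, 1800, 57.6),
    (575, 1500, 1800, 61.1),
    (575, 1350, 1500, 57.3),
    (575, 1350, 2200, 62.7),
]

core, shader, mem, fps = (np.array(c, dtype=float) for c in zip(*runs))
t = 1.0 / fps

# Columns: 1/core, 1/shader, 1/mem, constant
X = np.column_stack([1.0 / core, 1.0 / shader, 1.0 / mem, np.ones_like(core)])
coef, *_ = np.linalg.lstsq(X, t, rcond=None)

labels = ["core limited", "shader limited", "BW limited", "CPU/PCI-e limited"]
shares = (X * coef) / t[:, None]        # per-run breakdown of the render time
for name, share in zip(labels, shares.mean(axis=0)):
    print(f"{name:>18}: {share:5.1%} of render time")
```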

So I really agree with you about the importance of bandwidth.
Well don't read too much into the results. Even in the most bandwidth limited game (FEAR), doubling the BW would only get the 4850 a 22% increase in FPS (36%/2 = 18% reduction in render time).

BTW, just to show you how well the model fits, here's the regression data for the HL2 workload:
Part A: 9.12 million clocks
Part B: 375.6 MB of data

Divide those numbers by the GPU clock and bandwidth to get the two render-time components, add them, and invert. The numbers match up almost exactly with FiringSquad's.
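For example, plugging in the stock 4870 (my assumption: ~115.2 GB/s of bandwidth):

```python
# Stock 4870: 750 MHz core, ~115.2 GB/s of bandwidth
t = 9.12e6 / 750e6 + 375.6e6 / 115.2e9   # ~0.0122 s (part A) + ~0.0033 s (part B)
print(1.0 / t)                           # ~64.9 fps, vs. 64.7 measured
```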
 
Throw in some hideous lod bias and find out.
Actually, this gobbles BW too. If the cache lines are 64 bytes, then a GTX 280 gets reduced to 2.2 GSamples per second when a very large -ve LOD bias causes incoherent fetches.

Unless you meant a +ve LOD bias, but that doesn't guarantee you low BW since the pixel traffic is still there.
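(That 2.2 GSamples/s is just bandwidth divided by the cache line size -- assuming ~141.7 GB/s on the GTX 280:)

```python
# Each incoherent fetch pulls in a whole 64-byte cache line
print(141.7e9 / 64 / 1e9)   # ~2.2 GSamples/s
```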
 
These seem like pretty good estimates of how often you get BW limited in these games, though the numbers would be larger if the CPU were a limit for parts of the tests. Crysis definitely has this happen, as evidenced by the 4850 -> 4870 gain being less than 20%, so using the above model there is flawed (you get a negative BW dependence :) ).

Sadly, the review didn't say how they performed the Crysis benchmark.
Particle effects, for example, usually have high BW requirements.
If they benchmarked a flyby demo they wouldn't get that many particle effects, but it would have a lot of streaming, which can cause CPU limitations.

For example, the Tech Report review shows better scaling between the 4870 & 4850 than the FiringSquad one does.
 
Removal of the CPU/PCI-e limitation is enough to account for most of the 22% increase in the TR review. I don't doubt that BW matters in Crysis, as some tests show G80 doing well vs G92 (though that could be ROPs rather than BW), but it doesn't seem to make much difference.

A few megapixels of particle effects per frame wouldn't be enough to make a huge difference in framerate, as the framerates are already quite low.
 
No problem. I figured that several people here would be quite interested.

Even more interesting is when you vary 3 parameters on the NVidia cards: core clock, shader clock, and BW. Some people here (I'm looking at you Jawed ;) ) would be surprised at how much G80 is limited by the core speed, which includes texturing, ROPs, and triangle setup, and how little the relatively underpowered shader core matters.
 
No problem. I figured that several people here would be quite interested.

Even more interesting is when you vary 3 parameters on the NVidia cards: core clock, shader clock, and BW. Some people here (I'm looking at you Jawed ;) ) would be surprised at how much G80 is limited by the core speed, which includes texturing, ROPs, and triangle setup, and how little the relatively underpowered shader core matters.

Don't count me in that group. I've heard "triangle setup-limited" preached around these parts far too often to believe G80 isn't held back by its core clock :D
 
how much G80 is limited by the core speed, which includes texturing, ROPs, and triangle setup, and how little the relatively underpowered shader core matters.

And to go one step further, I'd count ROPs out, as the G80/G92 is still often core clock limited without AA, and the pixel fillrate of even the G92 should be more than sufficient. Similarly, if those cards were texel fillrate limited, then what would happen to an RV670 card?
That leaves us with triangle setup, and I'm still suspecting that the scheduler of the G80/G92 parts isn't always up to the task.

Nice analysis on the bandwidth part, btw!
 
Expressing this as a percentage of total render time:

ET:QW 2560x1600 4xAA/16xAF
4850: 30%
4870: 22%

HL2:E2 2560x1600 4xAA/16xAF
4850: 29%
4870: 21%

FEAR 2560x1600 4xAA/16xAF
4850: 36%
4870: 27%

CoH 1920x1200 4xAA/16xAF
4850: 11%
4870: 7%

Interesting stuff, though. We can see that the 4850 isn't overwhelmingly BW limited, but GDDR5 definitely makes an impact. The games' data yields coefficients suggesting 280-500 MB per frame of BW limited operations.
Interesting indeed - are we running into some kind of Amdahl's bandwidth barrier here?

I mean - honestly, and if I got those numbers right - we're looking at a massive ~93 percent BW advantage for the 4870, yet this enormous increase only shaves 4-9 percentage points off the BW limited share of render time?
 
Interesting indeed - are we running into some kind of Amdahl's bandwidth barrier here?

I mean - honestly, and if I got those numbers right - we're looking at a massive ~93 percent BW advantage for the 4870, yet this enormous increase only shaves 4-9 percentage points off the BW limited share of render time?
Well, I like to look at the 4870 as a 50% increase in BW per clock, so we can isolate the effect of increasing BW. We know that a 20%/20% increase for core/mem would result in a 20% boost. Also, it's not really alleviating the BW restrictions; instead, it's simply crunching through the BW limited parts faster.

That 50% BW boost knocks off a third of the rendering time for the BW limited parts, and aside from Company of Heroes and Crysis, that amounts to chopping off ~10-12% from rendering time. We are approaching a bit of a barrier with the perf/clk and BW/clk of the 4850, but it's not too bad. At 40nm I expect to see a 50/50 ratio for these older games, at least for the value boards.
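Quick sanity check on that figure, using the shares from the first post:

```python
# 50% more BW per clock means the BW limited parts take 2/3 as long
for game, share in [("ET:QW", 0.30), ("HL2:EP2", 0.29), ("FEAR", 0.36)]:
    saved = share * (1 - 1 / 1.5)
    print(f"{game}: {saved:.0%} of render time saved")
```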

Also interesting are the board economics. Take HL2 as an example, and assume RAM is 1/4 of the board cost for the 4850. If perf/$ is your goal, then you are willing to pay 11% more per MB for every 10% increase in RAM speed. For the 4870, though, if RAM is 1/3 of the board cost, you'd only pay 6% more per MB for a 10% increase in RAM speed.
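In case the arithmetic there isn't obvious, the break-even premium works out like this (the 29%/21% BW limited shares are the HL2 numbers from the first post; the RAM cost shares are just the guesses above):

```python
def breakeven_ram_premium(bw_share, ram_cost_share, ram_speedup=0.10):
    # Only the BW limited part of the frame speeds up with faster RAM
    new_time = (1 - bw_share) + bw_share / (1 + ram_speedup)
    perf_gain = 1 / new_time - 1
    # For constant perf/$, the board cost may rise by perf_gain,
    # and all of that increase goes into the RAM
    return perf_gain / ram_cost_share

print(f"4850 (HL2, 29% BW limited, RAM = 1/4 of cost): {breakeven_ram_premium(0.29, 0.25):.0%}")
print(f"4870 (HL2, 21% BW limited, RAM = 1/3 of cost): {breakeven_ram_premium(0.21, 1/3):.0%}")
```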
 
Interesting indeed - are we running into some kind of Amdahl's bandwidth barrier here?

I mean - honestly, and if I got those numbers right - we're looking at a massive ~93 percent BW advantage for the 4870, yet this enormous increase only shaves 4-9 percentage points off the BW limited share of render time?


Shader limited, or overdraw?
 
I think that if nVidia had raised the GTX280's bandwidth 2-3x too, the card could always perform 2-3x faster than the 8800Ultra.

If someone has a GTX 280 lying around... take a GTX and downclock the non-memory clocks by 3x and find out.

This would also reduce the effective latency (relative to the core clock) by 3x.
 