Games and bandwidth

Discussion in 'Architecture and Products' started by Mintmaster, Jun 27, 2008.

  1. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    So in another thread, Jawed pointed out some interesting OC data at Firingsquad:
    http://www.firingsquad.com/hardware/ati_radeon_4850_4870_performance/page16.asp
    What's great is that we have identical hardware with 4 fairly different core:mem ratios, with the highest BW config having 75% more BW/clk than the lowest.

    My thinking is as follows:

    Most tasks for the GPU are either clearly BW limited or GPU limited. Some workloads would run just as fast with 1/10 the bandwidth (i.e. they scale perfectly with core clock), some would speed up 3x with 3x more BW (i.e. core clock does nothing for them), and others lie elsewhere in this range. On the whole, it won't be often that a task lies in the range where a 75% BW/clk boost changes it from BW limited to GPU limited. What we can do, then, is split the workload into parts A and B. A requires a certain number of cycles to get done, and B requires a certain amount of data transfer. Higher GPU clock reduces the time to finish A, and higher memory clock reduces the time to finish B.

    What I did was invert the framerates from the above link and use multiple regression to fit the rendering times (inverse of FPS) to the inverses of clock speed and BW, with no constant term (including one gave messed up results from overfitting). The model fit the data exceptionally well, with a standard error of 0.6 fps. Then I took B and divided it by the bandwidth to get the time that the card was BW limited. Expressing this as a percentage of total render time:

    ET:QW 2560x1600 4xAA/16xAF
    4850: 30%
    4870: 22%

    HL2:EP2 2560x1600 4xAA/16xAF
    4850: 29%
    4870: 21%

    FEAR 2560x1600 4xAA/16xAF
    4850: 36%
    4870: 27%

    CoH 1920x1200 4xAA/16xAF
    4850: 11%
    4870: 7%

    These seem like pretty good estimates of how often you get BW limited in these games, though the numbers would be larger if the CPU is a limit for parts of the tests. Crysis has this happen for sure, as evidenced by the 4850 -> 4870 gain being less than 20%, i.e. less than the core clock increase alone, so using the above model is flawed (you get negative BW dependence :) ). Maybe some other data from these games (e.g. resolution scaling on different GPUs) would let me extract this factor accurately.

    Interesting stuff, though. We can see that the 4850 isn't overwhelmingly BW limited, but GDDR5 definitely makes an impact. The games' data yields coefficients suggesting 280-500 MB per frame of BW limited operations.


    *****************************

    EDIT: Looks like this thread has a little more appeal than I thought it would, so I'll elaborate one example. With regression, I came up with the following for RV770 running HL2:EP2 at 2560x1600 w/ 4xAA/16xAF:

    Predicted HL2 fps = 1 / ( 9.12M clocks / RV770 freq + 375.6MB / bandwidth )

    4870 stock (750/1800): 64.9 predicted fps, 64.7 actual fps
    4870 OC'd (790/2200): 70.4 predicted fps, 70.6 actual fps
    4850 stock (625/993): 48.8 predicted fps, 49.1 actual fps
    4850 OC'd (690/1140): 54.5 predicted fps, 54 actual fps

    Not bad at all! With other GPUs, the 9.12M figure will change, but 375.6MB should be similar unless compression efficiencies are different. It might be interesting to test that out if we had the data...
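    A minimal sketch of the fit described above, using only the four HL2:EP2 results listed in this post. The bandwidth conversion is an assumption (256-bit bus, two transfers per listed memory clock, MB meaning 10^6 bytes), and `fit_workload` is a name made up for illustration; the original fit was not necessarily done this way.

```python
import numpy as np

# (core MHz, memory MHz, measured fps) for HL2:EP2, copied from the list above.
CONFIGS = {
    "4870 stock": (750, 1800, 64.7),
    "4870 OC": (790, 2200, 70.6),
    "4850 stock": (625, 993, 49.1),
    "4850 OC": (690, 1140, 54.0),
}

# Assumption: 256-bit bus and two transfers per listed memory clock,
# so bandwidth in MB/s = memory MHz * 2 * 32 bytes.
BYTES_PER_CLOCK = 2 * 32


def fit_workload(configs):
    """Least-squares fit of frame time = A / core_clock + B / bandwidth."""
    rows, times = [], []
    for core_mhz, mem_mhz, fps in configs.values():
        bandwidth_mb_s = mem_mhz * BYTES_PER_CLOCK
        rows.append([1.0 / core_mhz, 1.0 / bandwidth_mb_s])
        times.append(1.0 / fps)  # seconds per frame
    # No column of ones: the model has no constant term.
    (a_mclocks, b_mb), *_ = np.linalg.lstsq(
        np.array(rows), np.array(times), rcond=None
    )
    return a_mclocks, b_mb


a_mclocks, b_mb = fit_workload(CONFIGS)
print(f"A = {a_mclocks:.2f} M clocks, B = {b_mb:.1f} MB")
```

    With only four points and two unknowns the fit is sensitive to the bandwidth assumption, so the coefficients should be read as estimates.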

    Common question: Is the 4850 bandwidth limited? How about the 4870?

    Answer: This is the wrong way to think about it. If you chopped a typical HL2:EP2 frame into 1000 pieces that take the same amount of time on the 4850, it would be BW limited for 288 of those. If you doubled the bandwidth with a 512-bit bus, you'd crunch through those parts in half the time, thus giving you a 17% framerate boost. For some GPUs, this is worth it -- I doubt a GTX 280's total cost would go down 14% by using eight 1 GBit chips and a simpler PCB instead of the current sixteen 512 MBit chips. On the other hand, if you had a 128-bit version of the 4850, you would double the time on the BW limited parts, knocking 23% off the fps.
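    The arithmetic behind that answer can be sketched as follows. `fps_change` is a hypothetical helper, and it assumes the non-BW-limited part of the frame is untouched by the bus width change.

```python
def fps_change(bw_limited_share, bandwidth_scale):
    """Fractional fps change when bandwidth is multiplied by bandwidth_scale.

    bw_limited_share is the fraction of frame time spent BW limited at the
    original bandwidth; the rest of the frame is assumed unaffected.
    """
    new_time = (1.0 - bw_limited_share) + bw_limited_share / bandwidth_scale
    return 1.0 / new_time - 1.0


share_4850 = 0.288  # 288 of 1000 equal-time slices, from the post
print(f"512-bit bus: {fps_change(share_4850, 2.0):+.1%}")
print(f"128-bit bus: {fps_change(share_4850, 0.5):+.1%}")
```

    The results land close to the 17% gain and the roughly 23% loss quoted above; small differences come down to rounding of the BW-limited share.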
     
    #1 Mintmaster, Jun 27, 2008
    Last edited by a moderator: Jul 3, 2008
  2. Rayne

    Newcomer

    Joined:
    Jun 23, 2007
    Messages:
    91
    Likes Received:
    0
    This is the thread that I wanted :)

    Looking at the GTX280 vs 8800Ultra results, you can see that the GTX280 is 2-3x faster in some games (it has 2-3x more shading power than the 8800Ultra), but only 40-50% faster in a lot of games (45% higher bandwidth + tweaks to save BW).

    I think that if nVidia had raised the BW of the GTX280 2-3x too, the card would always perform 2-3x faster than the 8800Ultra.

    I did some sort of study about gaming & bw too:
    http://forum.beyond3d.com/showpost.php?p=1147916&postcount=60

    If you can, please read it & tell me what you think about it.

    Basically I started playing with the GPU & memory freqs to check what was bounding the performance of my fav game.

    Recently I got the scores from this game running on a GTX280, and it was running 1920x1200 with 2x SSAA at 100 FPS (roughly what my overclocked 8800GTX does at 1280x1024). So, the game performs 70-75% faster. It could be due to the higher bandwidth of the GTX280 (45%), the new compression techniques, and the faster ROPs.

    So, I really agree with you about the importance of the bandwidth.
     
  3. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    In the case of really high resolution + AA + compressed textures, the texture cache works quite well. So how many of your "bandwidth" limited cases are actually texture instruction+latency+address+filter limited rather than bandwidth limited?
     
  4. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,750
    Likes Received:
    470
    Throw in some hideous lod bias and find out.
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    On top of what Timothy mentioned, there's also setup rate which is lower on the GTX 280 than many other chips. This is very important, because things like shadow map and environment map rendering are often limited by this.

    I really doubt it, because the biggest increases are math at 62% and bilinear RGBA8 texturing at 146% (much of the time it's only a 23% advantage).

    I don't have enough data there, but that's the right idea. For any game that you want to characterize, just run a few tests, downclocking and/or upclocking each frequency individually (core, shader, mem) and/or in pairs.

    With a regression fit, we can figure out how much is core limited (setup, texturing, ROPs), shader limited (math), BW limited, and CPU/PCI-e limited. The more data samples and the more reproducible the test, the better the outcome. I'd be glad to do that for you with BloodRayne if you give me the data.
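    A rough sketch of what that extended fit could look like. Everything numeric here is hypothetical: the coefficients, clocks and bandwidths are made-up stand-ins rather than measurements, and the frame times are generated from the model itself purely to show that the regression recovers each term.

```python
import itertools

import numpy as np

# Hypothetical per-frame workload, NOT measured data:
# M clocks of core work, M clocks of shader work, MB of traffic,
# and a fixed CPU/PCI-e time in seconds.
TRUE_COEFFS = np.array([5.0, 8.0, 300.0, 0.004])

# Hypothetical clock settings (MHz) and bandwidths (MB/s) for the test runs.
CORE_MHZ = (500.0, 575.0, 650.0)
SHADER_MHZ = (1200.0, 1350.0, 1500.0)
BANDWIDTH_MB_S = (70000.0, 86400.0, 100000.0)


def design_row(core_mhz, shader_mhz, bandwidth_mb_s):
    """One regression row; the trailing 1 is the clock-independent term."""
    return [1.0 / core_mhz, 1.0 / shader_mhz, 1.0 / bandwidth_mb_s, 1.0]


# A full grid for simplicity; one-at-a-time and pairwise changes also work
# as long as every clock varies independently of the others somewhere.
rows = np.array([
    design_row(c, s, b)
    for c, s, b in itertools.product(CORE_MHZ, SHADER_MHZ, BANDWIDTH_MB_S)
])

# Stand-in for 1 / measured fps; real runs would add benchmark noise.
frame_times = rows @ TRUE_COEFFS

coeffs, *_ = np.linalg.lstsq(rows, frame_times, rcond=None)
for name, value in zip(("core Mclk", "shader Mclk", "traffic MB", "fixed s"), coeffs):
    print(f"{name}: {value:.4g}")
```

    With real benchmark numbers the constant term is the one most likely to soak up noise, which is why more samples and reproducible runs matter.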

    Well don't read too much into the results. Even in the most bandwidth limited game (FEAR), doubling the BW would only get the 4850 a 22% increase in FPS (36%/2 = 18% reduction in render time).

    BTW, just to show you how well the model fits, here's the regression data for the HL2 workload:
    Part A: 9.12 million clocks
    Part B: 375.6 MB of data

    Divide those numbers by the GPU clock and BW to get the render times, add, and invert. The numbers match up almost exactly with FiringSquad's numbers.
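    As a sketch, the recipe above in code, checked against the HL2:EP2 clocks and framerates from the first post. The bandwidth conversion (256-bit bus, two transfers per listed memory clock, MB as 10^6 bytes) is an assumption, and `predict` is just an illustrative name.

```python
A_MCLOCKS = 9.12  # Part A: millions of GPU clocks per frame
B_MB = 375.6      # Part B: MB of data per frame

# Assumption: 256-bit bus and two transfers per listed memory clock,
# so bandwidth in MB/s = memory MHz * 2 * 32 bytes.
BYTES_PER_CLOCK = 2 * 32


def predict(core_mhz, mem_mhz):
    """Return (predicted fps, share of frame time that is BW limited)."""
    core_time = A_MCLOCKS / core_mhz                  # seconds
    bw_time = B_MB / (mem_mhz * BYTES_PER_CLOCK)      # seconds
    total = core_time + bw_time
    return 1.0 / total, bw_time / total


# (core MHz, memory MHz, FiringSquad fps) from the first post.
RESULTS = {
    "4870 stock": (750, 1800, 64.7),
    "4870 OC": (790, 2200, 70.6),
    "4850 stock": (625, 993, 49.1),
    "4850 OC": (690, 1140, 54.0),
}

for name, (core, mem, actual) in RESULTS.items():
    fps, share = predict(core, mem)
    print(f"{name}: {fps:.1f} predicted, {actual} actual, {share:.0%} BW limited")
```

    Because A and B are rounded to three or four digits here, the predictions can differ from the earlier list in the last decimal.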
     
  6. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Actually, this gobbles BW too. If the cache lines are 64 bytes, then a GTX 280 gets reduced to 2.2 GSamples per second with a very large -ve LOD bias causing incoherent fetches.

    Unless you meant +ve LOD bias, but that doesn't guarantee you low BW since pixels are still there.
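    The 2.2 GSamples/s figure can be reproduced with a one-line estimate. This assumes the GTX 280's published peak bandwidth of 141.7 GB/s (not stated in the thread) and that every incoherent sample pulls one full cache line from memory.

```python
GTX280_BANDWIDTH_GB_S = 141.7  # assumption: published peak memory bandwidth
CACHE_LINE_BYTES = 64          # the "if" in the post above


def incoherent_sample_rate(bandwidth_gb_s, line_bytes):
    """GSamples/s when each sample costs one full cache line of traffic."""
    return bandwidth_gb_s / line_bytes


rate = incoherent_sample_rate(GTX280_BANDWIDTH_GB_S, CACHE_LINE_BYTES)
print(f"{rate:.1f} GSamples/s")
```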
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Nice work Mintmaster.

    There's extra data for CoH at 1600x1200 and 2560x1600:

    http://www.firingsquad.com/hardware/ati_radeon_4850_4870_performance/page9.asp

    but that's it sadly.

    I still wonder if 2560x1600 4xMSAA is running against the end-stops of 512MB of memory on the ATI cards. I suppose the 8xMSAA results, later in the review, at the same resolutions should provide some clues.

    Jawed
     
  8. Hyp-X

    Hyp-X Irregular
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,170
    Likes Received:
    5
    Sadly the review didn't say how they performed the Crysis benchmark.
    For example particle effects usually have high BW requirements.
    If they benchmarked a flyby demo they wouldn't get that many particle effects, but it would have a lot of streaming which can cause CPU limitations.

    For example the Tech Report review shows better scaling between the 4870 & 4850 than the FiringSquad one.
     
  9. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Removal of the CPU/PCI-e limitation is enough to account for most of the 22% increase in the TR review. I don't doubt that BW matters in Crysis, as some tests show G80 doing well vs G92 (though that could be ROPs rather than BW), but it doesn't seem to make much difference.

    A few megapixels of particle effects per frame wouldn't be enough to make a huge difference in framerate, as the framerates are already quite low.
     
  10. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,056
    Likes Received:
    1,020
    Good analysis.
    Thanks for the effort and educational value.
     
  11. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    No problem. I figured that several people here would be quite interested.

    Even more interesting is when you vary 3 parameters on the NVidia cards: core clock, shader clock, and BW. Some people here (I'm looking at you Jawed ;) ) would be surprised at how much G80 is limited by the core speed, which includes texturing, ROPs, and triangle setup, and how little the relatively underpowered shader core matters.
     
  12. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    3,984
    Likes Received:
    34
    Don't count me in that group. I've heard "triangle setup-limited" preached around these parts far too often to believe G80 isn't held back by its core clock :D
     
  13. Pantagruel's Friend

    Newcomer

    Joined:
    Jun 17, 2007
    Messages:
    59
    Likes Received:
    0
    Location:
    Budapest, Hungary
    And, to go one step further, I'd count ROPs out, as the G80/G92 is still often core clock limited without AA, and the pixel fillrate of even the G92 should be more than sufficient. Similarly, if those cards were texel fillrate limited, then what would happen to an RV670 card?
    That leaves us with triangle setup. And I still suspect that the scheduler of the G80/G92 parts is not always up to the task.

    nice analysis on the bandwidth part btw!
     
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    Interesting indeed - are we running into some kind of Amdahl's-law bandwidth barrier here?

    I mean - honestly, and if I got those numbers right - we're looking at a massive 93-percentish BW advantage for the 4870, yet this enormous increase alleviates BW restrictions on only 4 - 9 percent of render times?
     
  15. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Well, I like to look at the 4870 as a 50% increase in BW per clock, so we can isolate the effect of increasing BW. We know that a 20%/20% increase for core/mem would result in a 20% boost. Also, it's not really alleviating the BW restrictions; instead, it's simply crunching through the BW limited parts faster.

    That 50% BW boost knocks off a third of the rendering time for the BW limited parts, and aside from Company of Heroes and Crysis, that amounts to chopping off ~10-12% from rendering time. We are approaching a bit of a barrier with the perf/clk and BW/clk of the 4850, but it's not too bad. At 40nm I expect to see a 50/50 ratio for these older games, at least for the value boards.

    Also interesting are the board economics. Take HL2 as an example, and assume RAM costs 1/4 of board cost for the 4850. If perf/$ is your goal, then you are willing to pay 11% more per MB for every 10% increase in RAM speed. For the 4870, though, if RAM is 1/3 the board cost, you'd only pay 6% more per MB for 10% increase in RAM speed.
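    A sketch of the break-even calculation behind those percentages, assuming the HL2:EP2 BW-limited shares from the first post (29% for the 4850, 21% for the 4870) and that a perf/$ buyer accepts a board price increase equal to the performance gain. `ram_premium` is a hypothetical helper.

```python
def ram_premium(bw_limited_share, ram_cost_share, speedup=0.10):
    """Break-even RAM price increase for a given RAM speed increase.

    bw_limited_share: fraction of frame time that is BW limited.
    ram_cost_share:   fraction of board cost that is RAM.
    speedup:          fractional increase in RAM speed (0.10 = 10%).
    """
    # Faster RAM shrinks only the BW-limited part of the frame.
    time_saved = bw_limited_share * (1.0 - 1.0 / (1.0 + speedup))
    perf_gain = 1.0 / (1.0 - time_saved) - 1.0
    # Constant perf/$ lets board cost rise by perf_gain, all of it spent on RAM.
    return perf_gain / ram_cost_share


premium_4850 = ram_premium(0.288, 1 / 4)
premium_4870 = ram_premium(0.211, 1 / 3)
print(f"4850: {premium_4850:.1%} more per MB for 10% faster RAM")
print(f"4870: {premium_4870:.1%} more per MB for 10% faster RAM")
```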
     
    #15 Mintmaster, Jul 1, 2008
    Last edited by a moderator: Jul 2, 2008
  16. Vincent

    Newcomer

    Joined:
    May 28, 2007
    Messages:
    235
    Likes Received:
    0
    Location:
    London

    Shader-limited or Overdraw ?
     
  17. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    17,219
    Likes Received:
    1,738
    Location:
    Winfield, IN USA
    Interesting stuff, thanks Mint! :D
     
  18. coredump

    Newcomer

    Joined:
    Jan 7, 2004
    Messages:
    35
    Likes Received:
    0
    Location:
    Here
    If someone has a GTX280 lying around...take a GTX and down clock the non-memory clocks by 3x and find out.

    This would also reduce the effective latency by 3x
     
  19. Arty

    Arty KEPLER
    Veteran

    Joined:
    Jun 16, 2005
    Messages:
    1,906
    Likes Received:
    55
  20. Arnold Beckenbauer

    Veteran

    Joined:
    Oct 11, 2006
    Messages:
    1,413
    Likes Received:
    347
    Location:
    Germany
    What about this one: Radeon HD 4800: GDDR5 utile?

    If I understand it correctly, they tested both RV770s at the same GPU clock (725 MHz) but left the RAM clocks unchanged (GDDR5 1800 MHz vs. GDDR3 933 MHz).
     