The nvidia future architecture thread (G100/GT300 and such)

Discussion in 'Architecture and Products' started by CarstenS, Jul 14, 2008.

  1. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,807
    Likes Received:
    2,073
    Location:
    Germany
    In this thread I'd love to see a discussion about what Nvidia needs to (or can, mind you) do to catch up in terms of performance leaps over GPU generations.

    In other words: if ATI continues at this pace, Nvidia will be in for a very hard time by the next refresh. What can they possibly do about it, architecture-wise?


    --

    Me, I think they've invested a bit too much in granularity, and their texturing power - as much as I love the nearly free, high-quality 16x aniso - seems a bit over the top.

    Obviously, using more modern tech, AMD's RV770 at sufficient clock speeds is often a fair match even for the GTX 280, despite the latter using about 50% more transistors on more than twice the die size (allegedly).

    So, what could Nvidia do?
     
  2. Freak'n Big Panda

    Regular

    Joined:
    Sep 28, 2002
    Messages:
    898
    Likes Received:
    4
    Location:
    Waterloo Ontario
    It's really hard to say what NV will do this early. But they have two fundamental options:

    1. Continue to build monolithic chips for the high end
    2. SLI on a stick for the high end

    The path they pick will affect the architectural decisions they make during chip design. But in general I would expect to see vector-based ALUs to increase perf/mm^2, and an increase in the math:tex ratio.
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    The 512-bit memory bus curse distorts the picture really badly with GT200, I reckon.

    With 256-bit GDDR5, 128GB/s+, and half the ROPs running at, say, 750MHz+ I imagine GT200b would perform better than GT200 - though the cache per ROP partition would prolly need to be doubled to retain the overall cache amount.

    The ROPs + MCs + memory IO consumes about 28-29% of the die according to Arun. If halved that would be quite a significant gain. Though GDDR5 would be more costly in area I imagine. So, say, 60%?... A saving of about 17% of the die?

    I still think NVidia could get away with only 64 TMUs at this kind of performance level (saving 5% of the die). So 8 clusters.

    Then, increase the ALU:TEX ratio some more, i.e. 4 SIMDs to retain ALU capability and Bob's your uncle. 32 SIMDs in 8 clusters versus 30 SIMDs in 10 clusters prolly isn't much of a reduction in die space, though.

    Alternatively keep 3 SIMDs per cluster and rely on ALU clocks in the region of 1.6GHz? That would save another few percent of die space?

    Overall, this seems to be a die saving of ~25%. Shrunk to 55nm, perhaps somewhere around 20% bigger than RV770 (310mm2)?
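    The back-of-envelope arithmetic above can be sanity-checked in a few lines. A minimal sketch - the ~576 mm^2 GT200 (65nm) and ~256 mm^2 RV770 (55nm) die sizes are assumed figures from contemporary reports, not stated in this post:

    ```python
    # Sanity check of the die-saving estimates above.
    gt200_area = 576.0  # assumed GT200 die size at 65nm, mm^2

    # Claimed savings: ~17% from halving the ROP/MC/memory-IO complex
    # (net of GDDR5 overhead), ~5% from dropping to 64 TMUs, plus a
    # few percent elsewhere -- call it ~25% overall.
    trimmed_65nm = gt200_area * (1 - 0.25)       # ~432 mm^2 at 65nm

    # An ideal optical shrink from 65nm to 55nm scales area by (55/65)^2.
    shrink = (55.0 / 65.0) ** 2                  # ~0.72
    trimmed_55nm = trimmed_65nm * shrink         # ~309 mm^2

    print(round(trimmed_55nm))                   # ~310 mm^2
    print(round(trimmed_55nm / 256.0, 2))        # ~1.2x RV770's assumed 256 mm^2
    ```

    So the "~25% saving, then shrink to 55nm, ~20% bigger than RV770" chain is at least internally consistent.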

    Is AMD likely to refresh RV770, e.g. RV780? Would that be ~20% more capable, e.g. 12 SIMDs? +10% clocks?

    Overall I think the size of GT200 meant that clocks were considerably lower than what should transpire with GT200b. In other words I think GT200 gives a distorted picture of NVidia's future - it inclines us to think too low.

    If GT200b comes in at around 55% of the die size of GT200 allowing NVidia to seriously boost clocks, RV770 just won't be in the same picture.

    As a base for discussion of GT300, I think GT200b should be the real deal - GT200 is misleading - though prolly not to the same degree that R600 misled on what RV770 would be, though :twisted:

    Apart from that I think NVidia needs fine-grained redundancy in the ALUs. Turning off clusters just seems clumsy. When you're trailing ATI's peak compute density by ~35% and when that includes 6% redundancy in ATI and your GPUs will, for other reasons, be bigger - fine-grained redundancy just looks like low-hanging fruit.

    Jawed
     
  4. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I guess the biggest question is whether GT200's lower clocks and per-mm2 performance compared to G92 along with its delay were a result of monolithic design or some other factors.

    Monolithic design is obviously the best choice for scalability. It's just a matter of whether that advantage is worth the extra design effort, tapeout cost, and drain on resources that could be used for larger parts of the market.
     
  5. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Jawed, a simpler way of looking at your suggestions is just taking G92b and adding GDDR5 and more math per cluster (24 or 32 SIMDs instead of 16).

    I don't see how it would become faster than RV770, and it'll definitely be substantially larger. Sure, these changes help in math heavy applications, but those are the apps where G92b falls furthest behind RV770 anyway.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    True enough. If we estimate that the SIMDs in G92b are 20% of the die, then 24 SIMDs with a 10% die-size increase, plus say 5% for GDDR5, puts GT200b at around 300mm2.
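    As a quick sketch of that estimate - the ~260 mm^2 G92b die size at 55nm is an assumed figure from contemporary reports, not stated in the thread:

    ```python
    # Sketch of the estimate above: G92b grown to 24 SIMDs plus GDDR5.
    g92b_area = 260.0               # assumed G92b die size at 55nm, mm^2

    simd_fraction = 0.20            # SIMDs estimated at 20% of the die
    extra_simds = 8 / 16            # 16 -> 24 SIMDs, i.e. +50% SIMD area
    simd_growth = simd_fraction * extra_simds   # +10% of the die
    gddr5_growth = 0.05             # assumed GDDR5 interface overhead

    estimate = g92b_area * (1 + simd_growth + gddr5_growth)
    print(round(estimate))          # ~300 mm^2
    ```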

    It'd be faster in the same way that GTX280 is faster - and that's before getting a clock boost.

    Don't forget the prodigal MUL - missing from G92b - distorting any performance scaling argument that uses G92b as a base.

    ATI's architecture needs to be rated at ~75-80% utilisation in static (no DB) code it seems. Although, to be fair, the interpolation of attributes does present a significant overhead on NVidia's architecture - one that hasn't really been quantified. Hmm...

    Jawed
     
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,436
    Likes Received:
    443
    Location:
    New York
    With Nvidia's stuff it all comes down to clocks. An 8-cluster, 256-shader, GDDR5 G9x variant at G92b clocks should run circles around RV770. It'll be considerably bigger than RV770, though, and there's still the challenge of RV770's higher texturing and AA efficiency to overcome.
     
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I'm not so sure. In shaders that aren't so simple that they're fillrate limited, even the 625MHz RV770 is about even with the GTX 280 (except for two shaders where only the PS4.0 version runs badly on ATI's GPUs for some reason):
    http://www.ixbt.com/video3/rv770-part2.shtml

    Besides, as we learned from R580, math alone can only do so much. Heck, even doubling math and texturing doesn't do too much, as we see with G94 vs. G92.

    I'd expect 10-15% more speed overall in games when doubling G92's math speed. Along with GDDR5 it would probably approach the 4870 in speed, but would need at least 25% more die space. That's worth it, IMO, but doesn't improve NVidia's perf/$.
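    A rough way to see why this doesn't help perf/$, using the midpoint of the 10-15% estimate above:

    ```python
    # Trade-off sketch: doubling G92's math for ~12.5% more game speed
    # (midpoint of the 10-15% estimate) at >= 25% more die area.
    speedup = 1.125        # assumed midpoint of the 10-15% range
    area_growth = 1.25     # "at least 25% more die space"

    perf_per_mm2 = speedup / area_growth
    print(round(perf_per_mm2, 2))   # ~0.9 -> perf per mm^2 actually drops
    ```

    So even if the chip approaches 4870 speed, performance per unit area (and hence perf/$) goes slightly backwards under these numbers.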
     
  9. bowman

    Newcomer

    Joined:
    Apr 24, 2008
    Messages:
    141
    Likes Received:
    0
    No, it's due to have a 6-month life cycle on the high end, before R800 in Q1 next year. Maybe a 40nm refresh next year for the mid-range.
     
  10. Shadowmage

    Newcomer

    Joined:
    Sep 30, 2005
    Messages:
    60
    Likes Received:
    3
    What NVIDIA needs now is the new equivalent of G80->G92: a die shrink with half the bus width, GDDR5, 55nm, and much higher clock speeds.

    However I still think that RV770 would beat this at high AA.

    On a related note, do we know yet how RV770 performs so well at 4xAA and 8xAA? Do you think it might be some kind of new compression algorithm?
     
  11. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,583
    Likes Received:
    703
    Location:
    Guess...
    GT200 is still faster than RV770. All NV needs to do is make it small enough to sell cheaper and put two on a board.

    A die shrink, GDDR5 with a 256-bit bus, remove the DP unit, and maybe halve the ROPs (at a higher clock speed).

    That may be enough to bring the cost close enough in line with RV770 to match or exceed its price/performance.

    Then as long as they can manage to squeeze 2 of them on a board they should be able to keep the overall performance advantage as well.

    It's no short order of course, and I don't want to take anything away from the amazing achievement that R7xx is, but it shouldn't be forgotten that GT200 is still the faster chip. ATI only has a performance advantage by using two chips vs. one. Credit where it's due though, because that's not possible for NV atm.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    That's usually a bandwidth/fillrate issue though, something that's identical on those two GPUs. There are times when G92 is 50%+ faster per clock and framerate minima are often where the real value lies.

    GTX280 comfortably leads RV770 in games with AF/MSAA off - I don't think math is a useful basis for comparisons here.

    RV770's per-unit and per-mm2 AF/MSAA performance seems to be the key.

    I think NVidia's route to performance/$ or performance/mm2 is by cutting back on the excessive unit counts of TMUs and ROPs. GT200's TMUs are more efficient than G92's. The ROPs might get a new lease of life with a redesign for GDDR5 - regardless they've been choked by GDDR3 bandwidth, so should bounce back somewhat.

    Jawed
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Hmm, you appear to be saying that R800 will be a 55nm GPU (presumably in this case two on one board). Presumably the die can't get any smaller than RV770 on 55nm for a 256-bit bus - unless they radically reduce power consumption?

    Jawed
     
  14. wishiknew

    Regular

    Joined:
    May 19, 2004
    Messages:
    332
    Likes Received:
    6
    Does Nvidia have to spend some transistors to get more double-precision flops?
     
  15. Freak'n Big Panda

    Regular

    Joined:
    Sep 28, 2002
    Messages:
    898
    Likes Received:
    4
    Location:
    Waterloo Ontario
    Yeah, they would have to.
     
  16. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    That's precisely my point. When setup, fillrate, and BW are equal, game performance isn't affected as much as you'd expect by having double the ALU and TEX. Adding only ALUs will have an even smaller effect.

    Again, exactly my point. I don't think adding math to G92b (or, almost equivalently, chopping everything else from GT200) will do much.

    You do realize that a few sentences before this you attributed G94's speed to its ROPs/BW, right? :wink:

    If 16 ROPs are useful to a 4 cluster GPU, you can't say 32 ROPs are excessive for a GPU with 10 even faster clusters. You can be sure that GT200 would take a hit with half the ROPs, and likewise RV770 would be notably faster with double the ROPs.

    There's no easy fix here. These adjustments that you're suggesting will change perf/$ by a few percentage points at best. NVidia didn't really make any mistakes in the balance between the execution units. RV770 simply raised the bar on how good each part of a GPU -- TMU, ROP, MC, ALU, TEX, scheduling, etc -- can be for a given silicon budget.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,436
    Likes Received:
    443
    Location:
    New York
    So Mint, what's your theory for where GT200 needs to be improved upon relative to RV770? You've already eliminated math and texturing.....
     
  18. ChrisRay

    ChrisRay R.I.P. 1983-
    Veteran

    Joined:
    Nov 25, 2002
    Messages:
    2,234
    Likes Received:
    26
    The main things I see are better memory management ((so 512 cards don't die as quickly)), increased shader clocks ((I really do feel they have plenty of pixel/texture fillrate)), and perhaps improved AA cycles through the ROPs at 8x MS ((not all that important to me)).
     
  19. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I think you misinterpreted my comments. I'm saying that the problem doesn't lie in the balance of ROPs, math, texturing, etc. The balance is fine, and no single aspect is disproportionately sized for the improvement it brings, IMO.

    The problem is, again, that RV770 is just too good in all those areas. Throw in worse memory management than ATI's and the low clocks of GT200, and you have a much worse product from a perf/$ standpoint.

    Again, there's no easy fix. NVidia will have to improve everything if they want to get their margins up to the level they used to be at. It's not only areal efficiency, but aside from the ALUs they need better per-cycle efficiency too. Doing both is tough, and the only reason ATI was able to do it was the mediocrity of their previous design.
     
  20. seahawk

    Regular

    Joined:
    May 18, 2004
    Messages:
    511
    Likes Received:
    141
    The key, imho, would be not to overdo things like PhysX and GPGPU ideas.

    It's all nice and well, but in the end game performance is what sells cards to gamers.
     