NVIDIA Kepler speculation thread

Discussion in 'Architecture and Products' started by Kaotik, Sep 21, 2010.

  1. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    I had to get rid of the former "claim" in order to make the above assumption a little bit clearer. Please note that it's truly just an assumption, based of course on the GK104 specifications. And that's exactly why I called bullshit when I saw Lenzfire's claimed 6.4B transistors for the GK110.

    Depends on what you mean by bottlenecks exactly; for the record, I don't expect to see a 512-bit bus, for one.
     
  2. whitetiger

    Newcomer

    Joined:
    Feb 5, 2012
    Messages:
    57
    Likes Received:
    0
    The GF110 wasn't bw limited AFAIK
    - but a GK110 could be (just like the GK114)
    - but the bottlenecks that I'm aware of are
    1) not enough TMUs in the SM (fixed in the GF104/114)
    2) Bus width problem into/out of the SM, limiting the fill rate
    (I can't remember off the top of my head if it was out of, or into the SM - probably out of, if it's fill-rate limited)

    But anyway, there were supposed to be some limitations of the architecture, that aren't immediately obvious from a raw functional unit count, although the GF110 was known to be low on ROPs also.

    So, basically, one would hope they'd fixed a few of these issues...

    6.4B transistors?
    50% more than the GK104 would get you there more-or-less, would it not?
     
  3. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    "Fixed" is debatable; GF1x4 were merely designs to have a performance part within a reasonable distance of the lowest top dog salvage part. In a relative sense you could say that texel fillrate and on-paper single precision FLOPs were redundant on GF1x4, but the main culprit would always have been bandwidth. You can either think that GF110 had too little texel fillrate (for which I'd need some solid indication I haven't seen so far) or GF114 too much (which is obviously closer to reality, since the texel fillrate to bandwidth ratio is quite a bit different compared to GF110).

    Care to elaborate since I don't understand what you mean?

    Why is the GF110 low on ROPs? When ROPs are coupled to the MC like in this case, it's normal to expect 48 ROPs when there are 8 ROPs in each partition (6*64bits). Each of the 4 rasterizers (one per GPC) is capable of 8 pixels/clock (32 pixels/clock in total), but I don't see what that would have to do with the ROP amount. What am I missing?
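A minimal sketch of the arithmetic in this post, assuming only the figures quoted in it (64-bit partitions, 8 ROPs per partition, 8 pixels/clock per rasterizer). The function names are illustrative and do not come from any NVIDIA documentation.

```python
# Fermi-style coupling as described in the post: ROPs hang off the
# 64-bit memory partitions, so the ROP count follows from the bus width.
PARTITION_WIDTH_BITS = 64   # one memory partition (the post's "6*64bits")
ROPS_PER_PARTITION = 8      # figure quoted in the post


def rop_count(bus_width_bits: int) -> int:
    """ROPs available for a given bus width when ROPs are tied to the MC."""
    if bus_width_bits % PARTITION_WIDTH_BITS != 0:
        raise ValueError("bus width must be a multiple of the partition width")
    partitions = bus_width_bits // PARTITION_WIDTH_BITS
    return partitions * ROPS_PER_PARTITION


def raster_pixels_per_clock(gpcs: int, pixels_per_rasterizer: int = 8) -> int:
    """Peak rasterizer output, assuming one raster unit per GPC."""
    return gpcs * pixels_per_rasterizer


print(rop_count(384))              # 48 ROPs on a 384-bit bus
print(raster_pixels_per_clock(4))  # 32 pixels/clock from 4 GPCs
```

The two results are independent of each other, which is the point being made: the ROP count falls out of the bus width, not out of the rasterizer rate.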

    There's no safe equation for that as long as the exclusive HPC additional functionalities of the top dog are unknown. However, twice or almost twice as many transistors as GK104 sounds idiotic, especially considering that the die area of GK110 is most likely ~550mm2 as SA stated.
     
  4. TKK

    TKK
    Newcomer

    Joined:
    Jan 12, 2010
    Messages:
    148
    Likes Received:
    0
    Oh dear, that's what you get for making rough, uneducated guesses :lol:

    It's the 'corrected' number AMD's PR gave out, so it's not my fault. I'm actually aware that it's very unlikely that this number is correct, that's why I added the "officially, at least" :wink:

    The only thing I noticed in computerbase reviews is that GF110 takes a slightly higher performance hit when enabling 16xAF compared to Tahiti and to a lesser extent Cayman. Nothing major, though.
     
  5. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Do they have a recent review where they exclusively investigated AF without AA? If yes, then I've missed it. In any other case, if it's the typical 1xAA/1xAF and 4xAA/16xAF tests (amongst others), how can you attribute the higher performance drop in the second case just to filtering performance? The framebuffer difference (2GB vs. 1.5GB) should be enough to make a slightly higher difference for Cayman, mostly for 4xMSAA and not AF. With 8xAA at >=1080p those two, depending on the case, either break even or Cayman occasionally pulls slightly ahead.

    ***edit: this one is a wee bit more interesting: http://www.computerbase.de/artikel/...7970-crossfire/6/#abschnitt_leistung_mit_ssaa I'm just not sure if they've offset LOD on GeForces for that comparison, but either way the framebuffer differences are a bit clearer in that one.
     
  6. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Here you go, separate AA/AF scaling:
    http://www.computerbase.de/artikel/...-radeon-hd-7970/7/#abschnitt_skalierungstests
    (You can also find the same tests for HD5870/HD6870; unfortunately not for GTX460/560, though I'd really expect them to lose less performance there.)

    The difference to Tahiti is barely worth mentioning (don't forget Tahiti actually has the same TMU/ALU ratio as GF110 anyway, though you could argue GF110 has somewhat more ALUs as it has dedicated SFUs).
     
  7. whitetiger

    Newcomer

    Joined:
    Feb 5, 2012
    Messages:
    57
    Likes Received:
    0
    It's the path from the SMs to the ROPs - 64-bits per SM

    http://www.behardware.com/articles/795-3/report-nvidia-geforce-gtx-460.html



    It's why AA on Fermi appears to cost less than on other architectures
    - it's actually because it's only with AA that the ROPs can get fully utilised
    - without AA the SMs can't supply enough pixels to the ROPs...

    So, it may not be an important limitation, given that everyone uses AA anyway...
     
  8. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    I see what you mean. However the 64bit datapath between SMs and ROPs is the same between GF110 and GF114 and it's more an architectural decision (for whatever reason) than anything else.

    It's bleedingly obvious that the 4 raster units on GF110, capable of 8 pixels/clock each, can process only 32 pixels/clock in total (which is exactly what I wrote in one of my former posts). Since ROPs and memory controller aren't decoupled on Fermi, the number of ROPs depends on the bus width in a relative sense. I don't expect the latter to have changed in Kepler and the only other difference would be that GK104 with 4 GPCs will be capable of 32 pixels/clock from the 4 raster units this time.

    Tessellation aside, if NV should also use the GK104 for Quadros this time, I don't expect the desktop variant to be capable of as much geometry.

    Thank you. Hadn't seen that one.
     
  9. whitetiger

    Newcomer

    Joined:
    Feb 5, 2012
    Messages:
    57
    Likes Received:
    0
    So the GK104 'fixes' this by having 4 GPCs, and it's probably safe to assume that the GK110 'fixes' this by having 8 GPCs, each with 4 SMs, with 64SPs each...
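As a quick sanity check on the numbers, here is a small sketch that multiplies out the speculated hierarchy. The GK110 figures are the unconfirmed guess from this post; the GF110 line uses the shipping layout of 4 GPCs with 4 SMs of 32 SPs each for comparison. The function name is made up for illustration.

```python
def shader_cores(gpcs: int, sms_per_gpc: int, sps_per_sm: int) -> int:
    """Total shader processors for a GPC -> SM -> SP hierarchy."""
    return gpcs * sms_per_gpc * sps_per_sm


# GF110 as shipped: 4 GPCs x 4 SMs x 32 SPs
gf110 = shader_cores(4, 4, 32)

# Speculated GK110 from the post (unconfirmed): 8 GPCs x 4 SMs x 64 SPs
gk110_guess = shader_cores(8, 4, 64)

print(gf110)                 # 512
print(gk110_guess)           # 2048
print(gk110_guess // gf110)  # 4x the SP count of GF110
```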

    And there, ladies & gentlemen, we have the spec of the GK110!
    :lol:

    The 384-bit bus is a given
    - so how about TMUs?
    - hopefully some sensible increase over the GF110....
    :grin:
     
  10. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Yes. But with GF114 having "fatter" SMs (and half the SMs in total) it's probably more of a problem there (if it's really an issue outside synthetic benchmarks). Also on GF114 there are more ROPs (per SM) so essentially half the (color) ROPs are always idle (I don't think this should change with 4xmsaa in theory at least).
    Also, the rasterizer matching that rate is only sort of true: since it's 64bit/SM, if you've got 4-channel fp16 (for instance) your effective pixel fill rate is now down to 16 pixels/clock (full GF110) or 8 pixels/clock (full GF114). Maybe such trivial shaders, which would need a higher export rate, don't really matter much for overall performance, but with the synthetic tests the limitation is easily visible.
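The export limit above can be sketched as follows, assuming the 64 bits per SM per clock quoted in the thread and that nothing else limits throughput. The SM counts (16 for a full GF110, 8 for a full GF114) and the helper name are assumptions for illustration.

```python
EXPORT_BITS_PER_SM_PER_CLOCK = 64  # SM-to-ROP path width quoted in the thread


def export_pixels_per_clock(sm_count: int, bits_per_pixel: int) -> float:
    """Peak pixels/clock the SMs can hand to the ROPs for a given format."""
    return sm_count * EXPORT_BITS_PER_SM_PER_CLOCK / bits_per_pixel


for name, sms in (("GF110", 16), ("GF114", 8)):
    rgba8 = export_pixels_per_clock(sms, 32)   # 4 x 8-bit channels
    fp16 = export_pixels_per_clock(sms, 64)    # 4 x fp16 channels
    print(name, rgba8, fp16)
# GF110 32.0 16.0
# GF114 16.0 8.0
```

With 32-bit pixels a full GF114 exports 16 pixels/clock against 32 ROPs, which matches the remark that half the color ROPs sit idle there.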

    That makes sense, especially since GK110 comes quite a bit later. Nvidia might simply decide to wait to release new Quadros though; it's not like AMD is flooding the market with GCN-based workstation cards right now which could threaten their high-end cards (if they even can, since Tahiti's geometry throughput is still no match for GF110). (For that matter, no FireStream GCN parts either so far, even though the chip has all the bloat bits needed for that market.)
     
  11. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Jebus, as if it would have been so damn hard to reach such an assumption, and no, I don't know yet if it's true. It makes sense though. Now all you need is to tell me what it actually "fixes", since how many pixels/clock the raster units can process is completely irrelevant to the ROPs, as Damien points out in the article you linked to.

    How many do you think the GK104 has?
     
  12. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Does it really matter that much in the end in a GF114 vs. GF100 comparison, given that the former has way less bandwidth than the latter anyway?

    I don't consider that "quite a bit" as confirmed yet. It'll come down to what shape GK110 is really in and whether, and how many, metal spins it might need. With none (highly unlikely IMO) the "quite a bit" is probably not valid; with just one, chances are high that it might be somewhere mid-year. More than that is more in the "quite a bit" region.
     
  13. whitetiger

    Newcomer

    Joined:
    Feb 5, 2012
    Messages:
    57
    Likes Received:
    0
    If you've got something useful to contribute, by all means go ahead and do so!
    :razz:
     
  14. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,022
    Likes Received:
    122
    Well, even taking the bandwidth difference into account, GF114 still has considerably less pixel export capability (as GF110 has roughly 1.5 times the bandwidth but twice the shader export capability), so the picture doesn't really change there.
    As said, I'm not sure it really makes much of a difference in the real world, but those fillrate tests were typically always bound by memory bandwidth, with ROP capabilities usually far exceeding what the memory could sustain (well, at least int8 is now ROP bound with Tahiti actually). But with GF114 (and even GF110 too) the ROP capabilities exceed that of shader export (usually by a factor of 2 for GF114), and often it's shader export limiting these tests, not memory bandwidth.
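A rough sketch of the "1.5 times the bandwidth but twice the export" comparison. It assumes both chips run the same memory data rate and core clock, so bandwidth scales with bus width (384 vs. 256 bits) and export scales with SM count (16 vs. 8). Real boards differ in clocks, so treat the result as a ratio on paper only; the function name is illustrative.

```python
from fractions import Fraction


def relative_export_per_bandwidth(sm_a: int, bus_a: int,
                                  sm_b: int, bus_b: int) -> Fraction:
    """Export-to-bandwidth ratio of chip A relative to chip B."""
    export_ratio = Fraction(sm_a, sm_b)        # shader export scales with SMs
    bandwidth_ratio = Fraction(bus_a, bus_b)   # bandwidth scales with bus width
    return export_ratio / bandwidth_ratio


# GF114 (8 SMs, 256-bit) relative to GF110 (16 SMs, 384-bit)
print(Fraction(384, 256))                               # 3/2 bandwidth advantage for GF110
print(relative_export_per_bandwidth(8, 256, 16, 384))   # 3/4
```

Under these assumptions GF114 has about three quarters of GF110's shader export per unit of bandwidth, so the bandwidth gap alone does not close the difference.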

    I dunno, with the latest rumor I find it unlikely it's before August. Though maybe that fits your definition of mid-year; in any case it should be "a couple of months later" :).
     
  15. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    I thought you like riddles :razz:

    A couple of months later in the strict sense, is 3 months before August :lol:
     
  16. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    Eight setup pipes in Kepler are actually not that far from plausible. Yes, 8 primitives per clock is probably overkill by a wide margin, and so is the logic complexity, but if NV is seeking an easy way to boost scan-out throughput, they could just use simpler setup units with half-rate speed (and 1/4 rate for consumer SKUs). This will keep the logic block size in check and will provide more optimized wiring to the SIMD multi-processors, avoiding critical hotspots.
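To put numbers on the half-rate idea, a tiny sketch: eight setup units at half rate give the same peak as four full-rate units. The unit counts and rates are the hypothetical ones from this post, not confirmed Kepler specifications.

```python
from fractions import Fraction


def primitives_per_clock(setup_units: int, rate: Fraction) -> Fraction:
    """Peak primitive setup rate for a number of units at a given per-unit rate."""
    return setup_units * rate


print(primitives_per_clock(4, Fraction(1)))     # 4: four full-rate units
print(primitives_per_clock(8, Fraction(1, 2)))  # 4: eight half-rate units
print(primitives_per_clock(8, Fraction(1, 4)))  # 2: eight quarter-rate units (consumer SKU idea)
```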
     
  17. whitetiger

    Newcomer

    Joined:
    Feb 5, 2012
    Messages:
    57
    Likes Received:
    0
    Well, I think the GK114 will have the same ratio of TMU to SP as the GF114
    - and the GK114 has 4x the SPs, but it depends on which side of the hotclock the GF114 TMUs were on.
    - as I said the GF104/114 fixed the problem that the GF100 had with not having enough TMUs ...


    On a different note:
    Yikes:
    http://techreport.com/discussions.x/22478
     
  18. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    961
    Likes Received:
    855
    The more important question is whether Nvidia will decouple the ROPs from the memory controller on GK110.. as Ailuros points out, the supposed GK110 has 8 GPCs, 32 SMs and 2048 CCs with a 384-bit wide bus.. on Fermi you couldn't use 32 SMs & 64 ROPs with a 384-bit bus, you'd need 512-bit for it..
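The 512-bit figure follows from the Fermi coupling of 8 ROPs to each 64-bit memory partition. A minimal sketch of that inverse calculation, with an illustrative function name and the per-partition figures taken from earlier in the thread:

```python
def bus_width_for_rops(rops: int,
                       rops_per_partition: int = 8,
                       partition_bits: int = 64) -> int:
    """Bus width needed for a ROP count when ROPs are tied to memory partitions."""
    if rops % rops_per_partition != 0:
        raise ValueError("ROP count must be a multiple of ROPs per partition")
    return (rops // rops_per_partition) * partition_bits


print(bus_width_for_rops(48))  # 384: what a 384-bit Fermi-style chip gets
print(bus_width_for_rops(64))  # 512: why 64 ROPs would need a 512-bit bus
```

Raising `rops_per_partition` instead of the bus width is the other way out, short of decoupling the ROPs from the memory controller altogether.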
     
  19. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,552
    Likes Received:
    514
    Location:
    Varna, Bulgaria
    NV still has room for improvement in the ROPs on some surface formats that are now half or 1/4 rate compared to AMD's architecture. Boosting the count (even if that would mean decoupling) isn't imperative, methinks. Also, decoupling means yet another mesh of wires for the crossbar (i.e. a hot-spot), which NV had particularly bad dealings with in Fermi's design process, especially with a large number of end-points.
     
  20. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,742
    Likes Received:
    152
    @ Man from Atlantis, they could just increase the number of ROPs per memory channel. But I don't really think ROPs are that much of a performance issue...
     