The G92 Architecture Rumours & Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Aug 8, 2007.

Thread Status:
Not open for further replies.
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    My math earlier was off, so a billion transistor chip is doable within the confines of the 65nm process at a die size smaller than G80.

    I'd expect that as long as SLI and Crossfire continue to have headaches, whichever GPU manufacturer can keep single-die solutions at a higher market segment will win out, if they don't lose money in the process.

    Yes, but the wafer size and its effect on the area of silicon that can be produced by a fab has an impact on the costs of manufacturing a given die size.

    My argument that larger die sizes than G80 and R600 are undesirable is a primarily economic one.
    The cost per good die from a foundry goes up and variation starts to hurt binning out of that smaller pool of dies.
    Lower binned chips can't be sold with good margins, even if the silicon itself is functional.
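The economic argument above can be made concrete with a rough sketch. It uses a common textbook dies-per-wafer approximation and a simple Poisson defect model; the wafer cost and defect density are hypothetical placeholders, not foundry figures, and binning losses are not modelled at all.

```python
import math

def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> float:
    """Textbook approximation: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_mm / 2.0
    whole = math.pi * radius * radius / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return whole - edge_loss

def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson model: probability that a die has no killer defect."""
    return math.exp(-defects_per_cm2 * die_area_mm2 / 100.0)

def cost_per_good_die(die_area_mm2: float, wafer_cost: float,
                      defects_per_cm2: float,
                      wafer_diameter_mm: float = 300.0) -> float:
    """Wafer cost spread over the dies that survive the defect model."""
    good = (gross_dies_per_wafer(die_area_mm2, wafer_diameter_mm)
            * poisson_yield(die_area_mm2, defects_per_cm2))
    return wafer_cost / good

# Hypothetical inputs, chosen only to show the trend.
WAFER_COST = 5000.0   # assumed, arbitrary currency units
DEFECTS = 0.5         # assumed defects per cm^2

for area in (200.0, 300.0, 400.0, 500.0):
    print(area, round(cost_per_good_die(area, WAFER_COST, DEFECTS), 1))
```

Under these assumptions the cost per good die grows much faster than linearly with die area, which is the shape of the argument rather than a prediction for any real chip.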

    I put too much emphasis on timings as opposed to the leakage variance.

    On that note, G80 has parts that are not standard cell design, and R600 is an example where clock timings are likely good.

    What has turned up is that AMD(ATI) with R600 has discovered what the CPU guys in every market but the extreme high end have known for years: that TDP and power draw are a first order limiting factor, regardless of circuit performance.

    GPU price segmentation (and by extension the required binning) is odd.
    For reasons I'm not sure I am fully aware of, the number of speed grades a given GPU die can be assigned is incredibly small compared to CPUs these days.

    CPUs have over a half-dozen speed grades per chip stepping that become products.

    A GPU like R600 will at the end of its life have only 3, and one is a cut down version, probably due to defects.

    It can't go higher because of power draw, while it can't go too much lower with the cheaper to produce RV cores in the way.

    I'm betting there is a fair amount of selection bias going on with GPU fabbing that neither Nvidia nor AMD will disclose.

    The CPU side is little better; they don't really give details on binning, but the wider selection of products provides more data points.

    That may be more of a market segmentation thing and an engineering concern with regards to fixed TDP brackets for marketability.

    It does help with crummy drive currents when you want that last GHz.
    Intel does well on this account as well.

    Most of those products don't seem to need to push the envelope for performance or die size like top-flight GPUs, which in turn don't push the envelope on circuit performance like CPUs do.

    That sort of begs the question why AMD's going the other route was so much slower... ;)

    I agree that as long as the design is highly repetitive internally, the debugging effort need only incrementally increase over a smaller design based on the same building blocks.

    My concern is that the company might not make much money if the die size places the product at the wrong end of scaling trends.
    The foundry isn't going to shield GPU designers from increased costs and fewer good dies per wafer start.
    Once past the inflection point where the cost of reaching a given level of performance with one large die exceeds the cost of reaching it with several smaller ones, why not go multi-die?
     
  2. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    As I see it, nV already went multichip with their NVIO. I think we may see it going further along this trend, all the non-3D related stuff in one chip, the rest in the other. And while it could be possible to employ another one of those on one single board via SLI or whatever future connection, I don't see it happening very soon. Though I think if they do a dual card, we'll rather see that than a GX2-like card.
     
  3. Vincent

    Newcomer

    Joined:
    May 28, 2007
    Messages:
    235
    Likes Received:
    0
    Location:
    London

    Several AICs have said that G92 is a 64 SP G80.

    My assumption is that

    G92 is a 96 (MADD+MADD) with 35% clock increase.

    Therefore, the 8800GT can have performance similar to the 8800GTS.

    By next year, Nvidia will introduce the G92X2 with a peak of 1 TFlop.
     
  4. dnavas

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    375
    Likes Received:
    7
    Hmm, well, I found Even's paper. Not having much in the way of a hardware background, it's a little puzzling, but it looks like their single precision multiplier has a latency of two and a throughput of one result/clock, while their double precision multiplier has a latency of three and a throughput of one result every other clock.

    While the SSR paper indicates the gate cost is minimal, I don't see much in the way of details in the Even paper (probably because I'm a layperson ;^/) However, if I am to believe SSR, half-speed double-precision isn't as expensive as I would have thought it would be (I have a funny feeling of having gone down this road before...?).

    Are there any details regarding G80's "double-pumping" and how that might work in the context of a dual-mode MUL?

    I'm still puzzled by what G92 is or isn't, and if it's 64-wide, why NV thought that was all they needed this November. Competing against 320 shaders with 64, and remaining behind the node curve while Intel spins up its new architecture, smacks of slacking. But, I can say that NV's parking lot wouldn't seem to indicate that that's an issue. Speaking of smack, I seem to recall the earlier-in-the-year talk of NV really going after math performance. 64 MADs would be a rather unconvincing outcome. At any rate, looks like it'll be an interesting end to this year one way or the other.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    It seems to me it'll be 800MHz G92 versus 850MHz RV670.

    G92:
    • core 800
    • 16 TMUs = 12.8 G bilinear/s, 12.8 G trilinear/s
    • 16 ROPs = 12.8 G pixels/s, 102.4 G Z/s
    RV670:
    • core 850
    • 16 TMUs = 13.6 G bilinear/s, 6.8 G trilinear/s
    • 16 ROPs = 13.6 G pixels/s, 27.2 G Z/s
    G80-GTS:
    • core 500
    • 24 TMUs = 12 G bilinear/s, 12 G trilinear/s
    • 20 ROPs = 10 G pixels/s, 80 G Z/s
    64 SP G92 needs to run its SPs at about 1.9GHz to retain the ALU:TEX ratio of 8800GTS (500v1200 upgraded to 800v1920).
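The figures in the list above follow from unit count times core clock. A minimal sketch that reproduces them is below; the per-clock factors (trilinear rate per TMU, Z samples per ROP) are inferred from the posted numbers rather than taken from any vendor document, and the function and key names are made up for illustration.

```python
def throughput(core_mhz, tmus, rops, trilinear_per_tmu_clk, z_per_rop_clk):
    """Peak rates in G ops/s from unit counts and core clock."""
    ghz = core_mhz / 1000.0
    return {
        "bilinear": tmus * ghz,
        "trilinear": tmus * ghz * trilinear_per_tmu_clk,
        "pixels": rops * ghz,
        "z": rops * ghz * z_per_rop_clk,
    }

# Per-clock factors inferred from the posted figures (assumption).
parts = {
    "G92":     throughput(800, 16, 16, 1.0, 8),
    "RV670":   throughput(850, 16, 16, 0.5, 2),
    "G80-GTS": throughput(500, 24, 20, 1.0, 8),
}

# Shader clock that keeps the 8800GTS shader:core clock ratio (1200:500)
# at an 800MHz core clock.
required_sp_mhz = 800 * (1200 / 500)

for name, rates in parts.items():
    print(name, {k: round(v, 1) for k, v in rates.items()})
print("required SP clock (MHz):", round(required_sp_mhz))
```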

    Jawed
     
  6. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Why are we talking of codenames and so forth if we have die sizes? It might not be for 'G92', but CJ's 17x17mm chip (assuming he's right here) is certainly slated for this year or very early next year if he already knows the die size in late September.

    Of course, there also is the possibility that G92 is that 17x17mm chip, that it does have a 192 SP shader core running at 2.4GHz+ with nearly 1 TFlop of easily benchmarkable performance... And even if that's G90, it's probably worth discussing too (but that does make things a lot more confusing!)

    EDIT: As I said previously, the big big question right now is whether there will be a GDDR4 G92 SKU. If that does exist, then it implies G92 is very possibly bandwidth-starved with 256-bit GDDR3, and a 64SP chip likely wouldn't be, really!
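As a rough check on the near-1 TFlop figure for the speculated 192 SP part, peak throughput can be sketched as SP count times flops per SP per clock times shader clock. Counting only the MAD (2 flops per clock) is an assumption made here, as is the 96 SP count for the 8800GTS; the 1.2GHz GTS shader clock is the one quoted earlier in the thread.

```python
def peak_gflops(sps: int, shader_ghz: float, flops_per_sp_per_clk: int = 2) -> float:
    """Peak GFLOPS; the default of 2 counts one MAD per SP per clock (assumption)."""
    return sps * flops_per_sp_per_clk * shader_ghz

gts_8800 = peak_gflops(96, 1.2)    # 8800GTS, assuming 96 SPs
rumoured = peak_gflops(192, 2.4)   # the speculated 192 SP chip at 2.4GHz
ratio = rumoured / gts_8800

print(round(gts_8800, 1), round(rumoured, 1), round(ratio, 2))
```

On these assumptions the speculated chip lands a little under 1 TFlop and at about four times the 8800GTS.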
     
  7. max-pain

    Regular

    Joined:
    Feb 13, 2004
    Messages:
    309
    Likes Received:
    2
    A 17x17mm chip @ 65nm means more than 700 million transistors...
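One way to sanity-check that estimate is to scale G80's transistor density from 90nm to 65nm. The G80 figures used here (about 681M transistors on roughly 484mm²) are commonly quoted numbers and are assumptions as far as this thread goes, and ideal area scaling is optimistic, so treat the result as an upper bound.

```python
G80_TRANSISTORS_M = 681.0   # assumed, millions
G80_AREA_MM2 = 484.0        # assumed die area at 90nm
OLD_NODE_NM = 90.0
NEW_NODE_NM = 65.0

def ideal_density_scale(old_nm: float, new_nm: float) -> float:
    """Ideal shrink: density scales with the square of the feature-size ratio."""
    return (old_nm / new_nm) ** 2

g80_density = G80_TRANSISTORS_M / G80_AREA_MM2                  # M transistors per mm^2
density_65 = g80_density * ideal_density_scale(OLD_NODE_NM, NEW_NODE_NM)

die_area = 17.0 * 17.0                                          # 289 mm^2
estimate_m = density_65 * die_area

print(round(estimate_m), "million transistors (ideal scaling)")
```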
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Why would NVidia implement 4x the GFLOPs of 8800GTS for their new upper-mainstream GPU?

    Jawed
     
  9. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    Well, there are two ways to look at this of course: you're presuming G92 is upper-mainstream, I'm presuming the 17x17 chip is more than just upper-mainstream and that it *might* (or might not) be G92.

    As I pointed out previously, a 65nm shrink of G80 that had G86-like stream processors running at 2GHz+ could easily be marketed as reaching nearly 1TFlop. However, it is true that such a chip would only make sense in the CUDA marketplace with enough bandwidth to go along with it.
     
  10. Farhan

    Newcomer

    Joined:
    May 19, 2005
    Messages:
    152
    Likes Received:
    13
    Location:
    in the shade
    You'd probably need more than 4 cycles (mul+3mad+add) unless you have a double width adder (54+bits) for the ADD part of the MAD32 pipeline.

    1. I think it's easier to just do a Multiply/Accumulate within each MAD32 pipeline over multiple clocks.

    2. Depends on where the extra add logic is placed in the pipeline.

    3. A MAD64 would be quite complicated. But if they're going for IEEE compliant DP (except for denorms perhaps), then i don't think IEEE has a specified behavior for MADs (i am not completely sure). So it might not be too important.
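The multi-pass idea in the points above can be illustrated on the mantissa arithmetic alone: a 54-bit by 54-bit product built from four 27-bit by 27-bit products, one MUL followed by three multiply-accumulates. This is only an integer sketch with made-up names; it ignores exponents, rounding and the final add, and it says nothing about how G80 or G92 actually do it.

```python
import random

HALF_BITS = 27
MASK = (1 << HALF_BITS) - 1

def split(x: int):
    """Split a value of up to 54 bits into (high, low) 27-bit halves."""
    return x >> HALF_BITS, x & MASK

def wide_mul_multipass(a: int, b: int) -> int:
    """Multiply two mantissas of up to 54 bits using only 27x27-bit products."""
    assert 0 <= a < (1 << 54) and 0 <= b < (1 << 54)
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    acc = a_lo * b_lo                          # pass 1: MUL
    acc += (a_hi * b_lo) << HALF_BITS          # pass 2: MAD
    acc += (a_lo * b_hi) << HALF_BITS          # pass 3: MAD
    acc += (a_hi * b_hi) << (2 * HALF_BITS)    # pass 4: MAD
    return acc

# Quick self-check against Python's arbitrary-precision multiply.
rng = random.Random(0)
for _ in range(1000):
    x, y = rng.getrandbits(53), rng.getrandbits(53)
    assert wide_mul_multipass(x, y) == x * y
```

Note that the accumulator here is far wider than 27 bits, which is the point about needing a wider adder or extra passes for the ADD part.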


    The "dual-mode" multiplier in Even's paper has a throughput of one SP per cycle or one DP every 2 cycles. To do this he uses a 27*53 multiplier array, half of which is unused in SP mode. The definition of "dual-mode" in the SSR paper is 2 SP per cycle and 1 DP per cycle (all fully pipelined). So it's not exactly the same thing as Even's. A single cycle DP MUL requires a full ~54*54 multiplier array. You can split that up into 2 independent arrays for SP mode. That's quite a bit more hardware than a ~27*27 multiplier array for SP only. A much larger array like that will have to be deeper pipelined to achieve the same clocks. Pipeline stages cost hardware as well. So maybe the SSR paper is being overly optimistic about dual mode, or the datapaths really don't take up that much hardware relative to the rest of the chip as they claim (i don't know the ratio of ALU transistors to the other stuff in GPUs, i have only worked a bit on ALU datapaths in isolation). Or i am missing something fundamental here about the modifications required to the datapath.

    But assuming ALU transistors make up a significant amount of the total transistor count in a GPU, i don't think they would want to spend that much extra hardware and/or latency for half speed/throughput DP. So 1/4 speed or less seems much more logical.
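To put rough numbers on the hardware comparison above, partial-product cell count (width times width) can stand in for multiplier array size. That proxy is an assumption: it ignores the reduction tree, the extra pipeline registers and the rounding logic, so it only shows relative scale. The array widths and throughputs are the ones quoted in the post.

```python
def array_cells(m_bits: int, n_bits: int) -> int:
    """Partial-product cells in an m x n multiplier array (crude area proxy)."""
    return m_bits * n_bits

designs = {
    # name: (cells, SP results per cycle, DP results per cycle)
    "SP only (~27*27)":       (array_cells(27, 27), 1.0, 0.0),
    "Even dual-mode (27*53)": (array_cells(27, 53), 1.0, 0.5),
    "SSR dual-mode (~54*54)": (array_cells(54, 54), 2.0, 1.0),
}

base_cells = designs["SP only (~27*27)"][0]
for name, (cells, sp_rate, dp_rate) in designs.items():
    print(f"{name}: {cells} cells, {cells / base_cells:.2f}x SP-only, "
          f"SP/clk={sp_rate}, DP/clk={dp_rate}")
```

By this proxy the full-rate design needs about four times the array of an SP-only multiplier, while the half-rate design needs about twice.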

    I don't think the double pumping changes anything in this case.
     
  11. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,027
    Likes Received:
    90
    And this invalidates the fact that 65nm is doing well how, precisely? The why is irrelevant.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Hmm, there's an idea. What if the rumours of G92's 64 SPs actually means 64 double precision SPs? Which translates into 128 single precision SPs :razz:

    Jawed
     
  13. Vincent

    Newcomer

    Joined:
    May 28, 2007
    Messages:
    235
    Likes Received:
    0
    Location:
    London

    There is another possibility: that Nvidia will launch the 8800GT, 8950GT, and 8850GX2 with G92 at a transistor count of around 500M.

    680M * 3/4 = 510M
     
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    The fact that a limited subset of the total output of AMD's 65nm fab is doing well does not indicate that it is doing well or badly in the overall scheme of things.

    At the point in time that the data for the G1 stepping was looked at over at ihub, 65nm chips showed a wide leakage variance between the best binning chips and the worst.

    It is likely there have been improvements since then, as the Black Edition 5000+ appears to be running at a marginally lower voltage. Any improvements in recent months do not change my assertion that process variation is an increasing problem that requires significant effort to combat, and that it has an increasing impact on binning.

    Regardless of whether AMD's 65nm process is doing "fine" by whatever standard you decide is good, there was still significant variation in the chips released prior to July.

    The Phenom or Barcelona demonstration AMD had with a chip running at 3.0 GHz is another example.

    As it was the same stepping as some of the other reviewed K10 chips out there, we see that the early steppings of Barcelona have a range from <2 GHz to 3.0 GHz.
    The higher end's power draw was not a value I saw when reading on the demo.

    This range is likely a confluence of speedpath issues and TDP concerns.
    The gap is not necessarily because of process variation, due to the rawness of the chips (6 months late, for some reason...), but it is not inconsistent with the hypothesis.
     
  15. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,027
    Likes Received:
    90
    It would be interesting to see the overclockability of these chips as well as the volumes in which they are produced as a means to determine the overall health of the 65nm process. Clearly AMD has been holding back higher-clocked 65nm K8 chips so as not to "show up" K10 by outperforming it in the vast majority of workloads.

    I think we'll have a much better picture of 65nm's health by year-end.
     
  16. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,742
    Likes Received:
    152
    Where are you getting 3/4 from?
     
  17. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,245
    Likes Received:
    3,409
    It doesn't work like this. G84 has only 1/4 of G80's SPs but nearly half of its transistors. You're forgetting about control logic. And vice versa, a possible "G90" with 192 SPs should only be around 800M or so...
     
  18. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
  19. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    How can we say it's the same size without actually knowing the scale of those images? Am I missing something? :)
     
  20. Farhan

    Newcomer

    Joined:
    May 19, 2005
    Messages:
    152
    Likes Received:
    13
    Location:
    in the shade
    About 1200 pins according to that image.
     