The G92 Architecture Rumours & Speculation Thread

Fair enough. We'll really have to see what happens. My belief is not that GX2-like solutions won't happen, but that they don't necessarily mean larger dies than R600 and G80 are impossible. I think we'll see both solutions for quite some time.
My math earlier was off, so a billion transistor chip is doable within the confines of the 65nm process at a die size smaller than G80.

I'd expect that as long as SLI and Crossfire continue to have headaches, whichever GPU manufacturer can keep single-die solutions at a higher market segment will win out, if they don't lose money in the process.

I'm walking a bit out of my comfort zone here, but there aren't really that many technology parameters that are related to wafer diameter. There are mainly two reasons to go to larger wafers, both economic:
  • increase fab capacity: higher die throughput per handled wafer
  • reduce the amount of unusable wafer real estate: this is important when dies grow larger. Even with current large die sizes, it's still not that much of a factor, though it's definitely part of some equation in some cost calculation spreadsheet.
Yes, but wafer size, through its effect on the area of silicon a fab can produce, has an impact on the cost of manufacturing a given die size.
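
To put rough numbers on that, here's a back-of-the-envelope dies-per-wafer calculation using the usual gross-dies approximation (no yield, and the two die areas are just illustrative picks of mine: a 17x17 mm die and a G80-sized ~22x22 mm die):

    import math

    def gross_dies(wafer_diameter_mm, die_area_mm2):
        # usable-area term minus an edge-loss term (standard approximation)
        r = wafer_diameter_mm / 2.0
        return (math.pi * r * r / die_area_mm2
                - math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2))

    for area in (289.0, 484.0):   # ~17x17 mm and ~22x22 mm dies
        print(area, int(gross_dies(200, area)), int(gross_dies(300, area)))

A 300 mm wafer has 2.25x the area of a 200 mm one but comes out ahead of 2.25x in gross dies, and the gap widens as the die grows, which is exactly the edge-loss point above.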

My argument that larger die sizes than G80 and R600 are undesirable is a primarily economic one.
The cost per good die from a foundry goes up and variation starts to hurt binning out of that smaller pool of dies.
Lower-binned chips can't be sold with good margins, even if the silicon itself is functional.

I tend to conveniently ignore these kinds of issues. ;) They are really not much of a concern for vendors who are using external fabs and standard cell design. So let me clarify: worst-case electrical characteristics (which are always used to calculate the critical timing path of a chip) are really quite reliable, even at 65nm. Going forward, I don't expect major changes here.
I suspect this is because fabs, for one, tend to add a certain margin of error precisely to make sure customers get what they expect.

I put too much emphasis on timings as opposed to the leakage variance.

On that note, G80 has parts that are not standard cell design, and R600 is an example where clock timings are likely good.

What has turned up is that AMD (ATI) with R600 has discovered what the CPU guys in every market but the extreme high end have known for years: that TDP and power draw are a first-order limiting factor, regardless of circuit performance.

This is less of a concern for the CPU fabs of Intel and AMD, so they'll try to get closer to the limits of their process. As long as AMD continues to produce GPUs externally (which is obviously not a given), I'd like to stick with that model, and there I believe my argument still holds.
GPU price segmentation (and by extension the required binning) is odd.
For reasons I'm not fully aware of, the number of speed grades a given GPU die can be assigned is incredibly small compared to CPUs these days.

CPUs have over a half-dozen speed grades per chip stepping that become products.

A GPU like R600 will, at the end of its life, have only 3, and one of those is a cut-down version, probably due to defects.

It can't go higher because of power draw, while it can't go too much lower with the cheaper-to-produce RV cores in the way.

I'm betting there is a fair amount of selection bias going on with GPU fabbing that neither Nvidia nor AMD will disclose.

The CPU side is little better: they don't really give details on binning either, but the wider selection of products provides more data points.

If there are multiple speed-grade SKUs for RV670/G92/... it will be interesting to see how much the clocks differ from each other. They are really close for e.g. the 8800 Ultra/GTX/GTS, so there is clearly not yet a problem.
That may be more of a market segmentation thing and an engineering concern with regard to fixed TDP brackets for marketability.

If wires continue to play a larger role in the overall delay, I expect variance in speed to actually go down. (Just as we're already seeing now.) You can't significantly reduce wire delays by increasing voltages.

It does help with crummy drive currents when you want that last GHz.
Intel does well on this account as well.
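
To illustrate the voltage point with a toy model (every constant below is a made-up placeholder, not anyone's process data): gate delay roughly follows C*Vdd/Idsat with Idsat growing as (Vdd-Vt)^alpha, so raising Vdd buys you gate speed, while the delay of an unbuffered RC wire has no Vdd term at all.

    # toy model only; all constants are arbitrary placeholders
    VT, ALPHA = 0.35, 1.3            # threshold voltage (V), alpha-power-law exponent
    C_GATE, K = 2e-15, 1e-3          # gate load capacitance (F), drive constant
    R_WIRE, C_WIRE = 500.0, 200e-15  # lumped wire resistance (ohm) and capacitance (F)

    def gate_delay(vdd):
        return C_GATE * vdd / (K * (vdd - VT) ** ALPHA)   # shrinks as Vdd rises

    def wire_delay():
        return 0.38 * R_WIRE * C_WIRE                     # distributed RC: no Vdd dependence

    for vdd in (1.0, 1.1, 1.2):
        print(vdd, gate_delay(vdd), wire_delay())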

I agree that leakage variation can be quite high within the same process. Speed variation is much less so, once again keeping my more restricted rules in mind. Unlike the GPU world, there are a lot of silicon products where all chips have to run at the same speed. (Think cell phones, modems, TV chips, ...)
Most of those products don't seem to need to push the envelope for performance or die size like top-flight GPUs, which in turn don't push the envelope on circuit performance like CPUs do.

I'd argue that this is more a matter of getting to market faster. Just like the 7950GX2 was a nice way to crash the R580 party while the next big thing (in the same process!) was getting ready back-stage.
That sort of begs the question why AMD's going the other route was so much slower... ;)

Anyway, my main initial argument was that debugging chips with 1B transistors didn't have to be a major burden. We deviated quite a bit from that. ;)
I agree that as long as the design is highly repetitive internally, the debugging effort need only incrementally increase over a smaller design based on the same building blocks.

My concern is that the company might not make much money if the die size places the product at the wrong end of scaling trends.
The foundry isn't going to shield GPU designers from increased costs and fewer good dies per wafer start.
Once we're past the inflection point where the cost of achieving a given level of performance on a single die makes multi-chip more cost effective, why not go multi-die?
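
As a crude illustration of that inflection point, here's a toy cost model using a simple Poisson yield assumption, Y = exp(-D*A). The defect density and wafer cost are made-up numbers, and it ignores edge loss, packaging, the duplicated memory interfaces and all the multi-GPU software pain, so treat it as the shape of the argument, not a real cost estimate:

    import math

    D = 0.005                      # hypothetical defect density, defects per mm^2
    WAFER_COST = 5000.0            # hypothetical cost per 300 mm wafer start
    WAFER_AREA = math.pi * 150**2  # mm^2, edge loss ignored for brevity

    def cost_per_good_die(die_area_mm2):
        gross = WAFER_AREA / die_area_mm2
        good = gross * math.exp(-D * die_area_mm2)   # Poisson yield model
        return WAFER_COST / good

    one_big   = cost_per_good_die(576)       # one 24x24 mm die
    two_small = 2 * cost_per_good_die(288)   # two dies of half the area
    print(one_big, two_small)                # the single big die costs several times more silicon

Of course, the multi-die option pays its tax elsewhere (two memory interfaces, board complexity, SLI/Crossfire scaling), which is exactly the trade-off being argued about.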
 
As I see it, nV already went multi-chip with their NVIO. I think we may see it going further along this trend: all the non-3D related stuff in one chip, the rest in the other. And while it could be possible to employ another one of those on a single board via SLI or whatever future connection, I don't see it happening very soon. Though if they do a dual card, I think we're more likely to see that than a GX2-like card.
 


Several AICs have said that G92 is a 64 SP G80.

My assumption is that

G92 has 96 SPs (MADD+MADD) with a 35% clock increase.

Therefore, the 8800GT can have performance similar to the 8800GTS.

By next year, Nvidia will introduce the G92X2 with 1 TFlop at peak.
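
Taking those rumoured numbers at face value (96 SPs, dual MADD counted as 4 flops per SP per clock, a shader clock 35% above the 8800GTS's 1.2 GHz, two chips per board; all of that is the rumour's speculation, not a confirmed spec), the arithmetic does land in the right neighbourhood:

    sps, flops_per_sp = 96, 4          # dual MADD = 4 flops/clock (rumoured)
    shader_ghz = 1.2 * 1.35            # 35% over the 8800GTS's 1.2 GHz (assumed baseline)
    per_chip = sps * flops_per_sp * shader_ghz
    print(per_chip, 2 * per_chip)      # ~622 GFLOPS per chip, ~1.24 TFLOPS for an X2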
 

Hmm, well, I found Even's paper. Not having much in the way of a hardware background, it's a little puzzling, but it looks like their single precision multiplier has a latency of two and a throughput of one result/clock, while their double precision multiplier has a latency of three and a throughput of one result every other clock.

While the SSR paper indicates the gate cost is minimal, I don't see much in the way of details in the Even paper (probably because I'm a layperson ;^/). However, if I am to believe SSR, half-speed double precision isn't as expensive as I would have thought (I have a funny feeling of having gone down this road before...?).

Are there any details regarding G80's "double-pumping" and how that might work in the context of a dual-mode MUL?

I'm still puzzled by what G92 is or isn't, and if it's 64-wide, why NV thought that was all they needed this November. Competing against 320 shaders with 64, and remaining behind the node curve while Intel spins up its new architecture, smacks of slacking. But I can say that NV's parking lot wouldn't seem to indicate that that's an issue. Speaking of smack, I seem to recall the earlier-in-the-year talk of NV really going after math performance. 64 MADs would be a rather unconvincing outcome. At any rate, it looks like it'll be an interesting end to the year one way or the other.
 
It seems to me it'll be 800MHz G92 versus 850MHz RV670.

G92:
  • core 800
  • 16 TMUs = 12.8 G bilinear/s, 12.8 G trilinear/s
  • 16 ROPs = 12.8 G pixels/s, 102.4 G Z/s
RV670:
  • core 850
  • 16 TMUs = 13.6 G bilinear/s, 6.8 G trilinear/s
  • 16 ROPs = 13.6 G pixels/s, 27.2 G Z/s
G80-GTS:
  • core 500
  • 24 TMUs = 12 G bilinear/s, 12 G trilinear/s
  • 20 ROPs = 10 G pixels/s, 80 G Z/s
A 64 SP G92 needs to run its SPs at about 1.9GHz to retain the ALU:TEX clock ratio of the 8800GTS (500/1200 upgraded to 800/1920).

Jawed
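
A quick check of the arithmetic in those tables, assuming the rumoured clocks and unit counts; the Z-per-clock figures (8 per ROP for the G8x-style parts, 2 for RV670) are simply inferred from the numbers quoted above:

    def rates(core_mhz, tmus, rops, z_per_rop):
        ghz = core_mhz / 1000.0
        return tmus * ghz, rops * ghz, rops * z_per_rop * ghz   # G bilinear/s, G pixels/s, G Z/s

    print(rates(800, 16, 16, 8))    # rumoured G92:   12.8, 12.8, 102.4
    print(rates(850, 16, 16, 2))    # rumoured RV670: 13.6, 13.6,  27.2
    print(rates(500, 24, 20, 8))    # 8800GTS (G80):  12.0, 10.0,  80.0

    # keeping the 8800GTS's ALU:TEX clock ratio (1200/500) at an 800 MHz core:
    print(800 * 1200 / 500)         # -> 1920, i.e. ~1.9 GHz shader clock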
 
Why are we talking of codenames and so forth if we have die sizes? It might not be for 'G92', but CJ's 17x17mm chip (assuming he's right here) is certainly slated for this year or very early next year if he already knows the die size in late September.

Of course, there also is the possibility that G92 is that 17x17mm chip, that it does have a 192 SP shader core running at 2.4GHz+ with nearly 1 TFlop of easily benchmarkable performance... And even if that's G90, it's probably worth discussing too (but that does make things a lot more confusing!)

EDIT: As I said previously, the big big question right now is whether there will be a GDDR4 G92 SKU. If that does exist, then it implies G92 is very possibly bandwidth-starved with 256-bit GDDR3, and a 64SP chip likely wouldn't be, really!
 
Of course, there also is the possibility that G92 is that 17x17mm chip, that it does have a 192 SP shader core running at 2.4GHz+ with nearly 1 TFlop of easily benchmarkable performance... And even if that's G90, it's probably worth discussing too (but that does make things a lot more confusing!)
Why would NVidia implement 4x the GFLOPs of 8800GTS for their new upper-mainstream GPU?

Jawed
 
Why would NVidia implement 4x the GFLOPs of 8800GTS for their new upper-mainstream GPU?
Well, there are two ways to look at this of course: you're presuming G92 is upper-mainstream, I'm presuming the 17x17 chip is more than just upper-mainstream and that it *might* (or might not) be G92.

As I pointed out previously, a 65nm shrink of G80 that had G86-like stream processors running at 2GHz+ could easily be marketed as reaching nearly 1TFlop. However, it is true that such a chip would only make sense in the CUDA marketplace with enough bandwidth to go along with it.
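
Just to show where a "nearly 1 TFlop" figure can come from (these are the rumoured configurations from the posts above, and the flops-per-SP count depends on whether you include the MUL the way marketing tends to):

    # GFLOPS = SPs * flops per SP per clock * shader clock in GHz
    print(192 * 2 * 2.4)   # 192 SPs, MADD only, 2.4 GHz         -> 921.6
    print(128 * 3 * 2.4)   # 128 SPs, MADD+MUL counted, 2.4 GHz  -> 921.6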
 
Yes, sorry, I believe I was trying to say you'd have (well, might have) twice the *count* of fp32 as you would fp64 shaders, rather than twice the speed. However, as long as my ignorance is firmly out of the closet, I have a question (several) about implementation.

If my hopelessly human long multiplication is any guide, fp64 ~= a MUL followed by 3 MADs, but it looks to me like between the second and third MAD you could be "carrying" a 64-bit intermediate (well, okay, 53 bits or whatever the non-exponent size is).

Question 1: does it make sense to deconstruct the wider MUL64 across four clocks, or use two SPs and two clocks (or four SPs) with additional add logic?

Question 2: does this additional add logic get surfaced in any way when running fp32?

Question 3: if you run over four clocks, and you're already using the existing ADD32 logic for MUL64, how do you get MAD64?

Pointers and clue-by-fours equally welcome :)
You'd probably need more than 4 cycles (mul+3mad+add) unless you have a double width adder (54+bits) for the ADD part of the MAD32 pipeline.

1. I think it's easier to just do a Multiply/Accumulate within each MAD32 pipeline over multiple clocks.

2. Depends on where the extra add logic is placed in the pipeline.

3. A MAD64 would be quite complicated. But if they're going for IEEE-compliant DP (except for denorms perhaps), then I don't think IEEE has a specified behavior for MADs (I am not completely sure). So it might not be too important.
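
For what it's worth, here's the long-multiplication intuition from the question written out as integer arithmetic on the 53-bit significands: split each into two halves and accumulate four partial products, i.e. one MUL plus three MAD-like steps. Purely illustrative; it says nothing about how any real GPU pipeline does it.

    import random

    a = random.getrandbits(53) | (1 << 52)      # a 53-bit significand (implicit leading 1 set)
    b = random.getrandbits(53) | (1 << 52)

    a_hi, a_lo = a >> 26, a & ((1 << 26) - 1)   # split into 27-bit high / 26-bit low halves
    b_hi, b_lo = b >> 26, b & ((1 << 26) - 1)

    p  = a_lo * b_lo                 # MUL
    p += (a_hi * b_lo) << 26         # MAD 1 (shifted accumulate)
    p += (a_lo * b_hi) << 26         # MAD 2
    p += (a_hi * b_hi) << 52         # MAD 3
    assert p == a * b                # the full 106-bit product, before rounding/normalisation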


Hmm, well, I found Even's paper. Not having much in the way of a hardware background, it's a little puzzling, but it looks like their single precision multiplier has a latency of two and a throughput of one result/clock, while their double precision multiplier has a latency of three and a throughput of one result every other clock.

While the SSR paper indicates the gate cost is minimal, I don't see much in the way of details in the Even paper (probably because I'm a layperson ;^/). However, if I am to believe SSR, half-speed double precision isn't as expensive as I would have thought (I have a funny feeling of having gone down this road before...?).

Are there any details regarding G80's "double-pumping" and how that might work in the context of a dual-mode MUL?
The "dual-mode" multiplier in Even's paper has a throughput of one SP per cycle or one DP every 2 cycles. To do this he uses 27*53 a multiplier array, half of that is unused in SP mode. The definition of "dual-mode" in the SSR paper is 2 SP per cycle and 1 DP per cycle (all fully pipelined). So it's not exactly the same thing as Even's. A single cycle DP MUL requires a full ~54*54 multiplier array. You can split that up into 2 independent arrays for SP mode. That's quite a bit more hardware than a ~27*27 multiplier array for SP only. A much larger array like that will have to be deeper pipelined to achieve the same clocks. Pipeline stages cost hardware as well. So maybe the SSR paper is being overly optimistic about dual mode, or the datapaths really don't take up that much hardware relative to the rest of the chip as they claim (i don't know the ratio of ALU transistors to the other stuff in GPUs, i have only worked a bit on ALU datapaths in isolation). Or i am missing something fundamental here about the modifications required to the datapath.

But assuming ALU transistors make up a significant amount of the total transistor count in a GPU, I don't think they would want to spend that much extra hardware and/or latency for half-speed/throughput DP. So 1/4 speed or less seems much more logical.

I don't think the double pumping changes anything in this case.
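
To put crude numbers on the array sizes being compared (counting partial-product bits as a very rough proxy for multiplier datapath area, ignoring Booth encoding, the compressor tree and pipeline registers):

    sp_only  = 27 * 27   # 729:  a single-precision-only significand multiplier
    even_arr = 27 * 53   # 1431: Even's dual-mode array (1 SP/clk, 1 DP every 2 clks)
    full_dp  = 54 * 54   # 2916: single-cycle DP array, splittable into two SP halves
    print(sp_only, even_arr, full_dp, full_dp / (2.0 * sp_only))   # full DP ~2x two SP arrays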
 
Hmm, there's an idea. What if the rumours of G92's 64 SPs actually means 64 double precision SPs? Which translates into 128 single precision SPs :p

Jawed
 
Why are we talking of codenames and so forth if we have die sizes? It might not be for 'G92', but CJ's 17x17mm chip (assuming he's right here) is certainly slated for this year or very early next year if he already knows the die size in late September.

Of course, there also is the possibility that G92 is that 17x17mm chip, that it does have a 192 SP shader core running at 2.4GHz+ with nearly 1 TFlop of easily benchmarkable performance... And even if that's G90, it's probably worth discussing too (but that does make things a lot more confusing!)

EDIT: As I said previously, the big big question right now is whether there will be a GDDR4 G92 SKU. If that does exist, then it implies G92 is very possibly bandwidth-starved with 256-bit GDDR3, and a 64SP chip likely wouldn't be, really!


There is another possibility: that Nvidia will launch the 8800GT, 8950GT, and 8850GX2 based on a G92 with a 500M transistor count.

680M * 3/4 = 510M
 
And this invalidates the fact that 65nm is doing well how, precisely? The why is irrelevant.

The fact that a limited subset of the total output of AMD's 65nm fab is doing well does not indicate whether the process is doing well or badly in the overall scheme of things.

At the point in time that the data for the G1 stepping was looked at over at ihub, 65nm chips showed a wide leakage variance between the best binning chips and the worst.

It is likely there have been improvements since then, as the Black Edition 5000+ appears to be running at a marginally lower voltage. Any improvements in recent months do not change my assertion that process variation is an increasing problem that requires significant effort to combat, and that it has an increasing impact on binning.

Regardless of whether AMD's 65nm process is doing "fine" by whatever standard you decide is good, there was still significant variation in the chips released prior to July.

The Phenom or Barcelona demonstration AMD had with a chip running at 3.0 GHz is another example.

As it was the same stepping as some of the other reviewed K10 chips out there, we see that the early steppings of Barcelona span a range from under 2 GHz to 3.0 GHz.
I didn't see a power draw figure for the higher-clocked part when reading about the demo.

This range is likely a confluence of speedpath issues and TDP concerns.
Given the rawness of the chips (6 months late, for some reason...), the gap is not necessarily due to process variation, but it is not inconsistent with the hypothesis.
 
The fact that a limited subset of the total output of AMD's 65nm fab is doing well does not indicate whether the process is doing well or badly in the overall scheme of things.

It would be interesting to see the overclockability of these chips as well as the volumes in which they are produced as a means to determine the overall health of the 65nm process. Clearly AMD has been holding back higher-clocked 65nm K8 chips so as not to "show up" K10 by outperforming it in the vast majority of workloads.

I think we'll have a much better picture of 65nm's health by year-end.
 
[attached images: showoriginal-11538.jpg, showoriginal-11539.jpg]

http://we.pcinlife.com/thread-826798-1-1.html

Looks like the same big heat-spreader as on G80.
 
How can we say it's the same size without actually knowing the scale of those images? Am I missing something? :)
 