G70, G71, R520, R580 die size discrepancy

{Sniping}Waste said:
What a sad place. Talk about people soaking up PR garbage while the few who see it for what it is are ignored.

I am just measuring photos and referencing from one photo to another. Not the most accurate approach, since the errors could add up. Even a conservative approximation still yields ~220mm^2. G73 is much easier since the Samsung chips (11mm x 14mm) are nearby, so it can't be 125mm^2 either.
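A minimal sketch in Python of the photo-scaling arithmetic being described here, using a known-size reference object in the same photo; the function name and all pixel values below are made-up placeholders, not measurements from any actual board shot:

```python
# Estimate a die's size from a photo by scaling against a reference
# object of known dimensions (e.g. an 11mm x 14mm Samsung GDDR3 package).

def estimate_die_size(ref_mm, ref_px, die_px):
    """Scale the die's pixel dimensions by a reference of known size."""
    scale_w = ref_mm[0] / ref_px[0]   # mm per pixel, horizontal
    scale_h = ref_mm[1] / ref_px[1]   # mm per pixel, vertical
    w_mm = die_px[0] * scale_w
    h_mm = die_px[1] * scale_h
    return w_mm, h_mm, w_mm * h_mm

# Hypothetical numbers: GDDR3 package at 110x140 px, die at 145x155 px.
w, h, area = estimate_die_size((11.0, 14.0), (110, 140), (145, 155))
print(f"estimated die: {w:.1f}mm x {h:.1f}mm = {area:.0f} mm^2")
```

The caveats from the post apply in full: camera tilt, lens distortion, and a wrong aspect ratio in the published image all feed straight into the result.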

Someone with a 7800GT card could give a more accurate reference to determine the G71 die size.

People just like to debate or speculate. I have given up on trying to get a more accurate measurement since I don't have any Nvidia or PCI-E card for a dimension reference.

A few more days and the truth will be out.
 
{Sniping}Waste said:
What a sad place. Talk about people soaking up PR garbage while the few who see it for what it is are ignored.

Just one of several sad places. B3D is an oasis in the desert.
 
That die pic gave me a giggle. I tend to think it's tilted, with part of the package cut off, specifically to make it harder to measure.

But really, we have an official NV slide ("Nvidia Corporation Do Not Distribute") that says 196mm2, and I don't see any reason to question that. At least not yet. Quick fiddling with the calculator suggests 13.5w x 14.5h (since it looks longer than it is wide).
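For what it's worth, the arithmetic checks out: 13.5 x 14.5 = 195.75mm2, which rounds to the slide's 196mm2, and a perfectly square die of the same area would be sqrt(196) = 14.0mm on a side.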
 
{Sniping}Waste said:
The pic you have there could be a G73 and not a G71. It looks about the right size for a G73.

That, per Sunrise and others, is a G73. However, compare it to the 8 Samsung GDDR3 chips: both are about the same size. A GDDR3 package is 11x14mm, which is 154mm^2.

G73, per the Nvidia slide, is 125mm^2.
 
Jawed said:
Well you could turn that argument around and say that ATI has fucked-up by making R580 so huge.
Well, I think the majority of the die size increase of the R580 over the G7x is in one form of cache or another (in the register file, or in I/O cache, or buffers between units, etc.). And ATI does obviously benefit from having more cache on the die, with better AA performance and better dynamic branching performance (both of which require more on-die storage).

But seriously fucked up? Well, we'll have to see. It might just be that ATI jumped the gun on adding in that extra on-die storage. That's not necessarily such a bad thing: it gives them a step up on implementing similar tech for their next architecture, even if it does cost money in the short term.

Of real importance for ATI's bottom line, I think, will be how G71 fares in AA benchmarks. Oblivion will likely be quite important as a benchmark for many people as well (especially given the timing of its release).
 
geo said:
Quick fiddling with the calculator suggests 13.5w x 14.5h (since it looks longer than it is wide).

That is the problem. Is G71 really longer than it is wide? Some pictures show the G70 package to be square and others rectangular. I printed the image on a printer because on an LCD the aspect ratio may not be correct. I don't have a photo/scan that I am positive is accurate.
 
atomt said:
That is the problem. Is G71 really longer than it is wide? Some pictures show the G70 package to be square and others rectangular. I printed the image on a printer because on an LCD the aspect ratio may not be correct. I don't have a photo/scan that I am positive is accurate.

14x14 is also 196!
 
geo said:
Remembering the "hidden quads" conversation, I'd say you weren't the only one who thot that G70 was too big! :smile:

That's an interesting thot to juxtapose against atomt's speculation that they squeezed out some transistors, isn't it?

Edit: Or was that too subtle? :LOL: I calculated 225mm2 myself for the G70 transistor count at 90nm. Somewhat smaller than where atomt came out, but still quite a bit bigger than 196mm2. So, yeah, if a bunch of transistors suddenly go missing in G71 we might want to reconsider whether the "hidden quads" theory had some merit after all, and they've now jettisoned them as unnecessary on a much smaller die.

EditII: And if you don't like "hidden quad(s)", substitute any other level of redundancy to up yields on a gehenna big part, that wouldn't be nearly as attractive on a much smaller part.
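A plausible way to reproduce that 225mm2 figure, assuming it comes from a straight optical shrink of G70's ~334mm2 at 110nm: area scales with the square of the feature size, so 334 x (90/110)^2 ≈ 334 x 0.67 ≈ 224mm2, essentially where geo lands.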

Let me make a hypothesis here...

Executive Summary: Fine Grained Clock Gating

The long story:
There are a few ways to do conditional transfers of data into a register. Up to, say, 2002 almost everybody used the plain vanilla way: you put a multiplexer in front of the flip-flop. One input is connected to the provider of new data. The other input is connected to the output of the FF. The control input of the multiplexer determines if you want to feed the FF with new data or if you want to keep the old data stored in the FF.

This configuration has a number of advantages:
- it's very easy to control testability, but, more importantly, ...
- ... it results in 1 clock tree for the whole clock domain.
The last part is *very* important: one of the hard problems during the back-end stage of layout is to create a clock tree that has minimal skew, meaning that the rising edge of the clock arrives at different locations on the die at the same time.
Bad skew can result in all kinds of really nasty problems, including reduced maximum clock speed.
Generally, the less clock tree you have, the easier things get.

There is, however, one really big disadvantage to this system: power consumption.
Even if your circuit doesn't do *anything* (meaning: all FFs keep their current value), the whole clock tree keeps on switching from 0 to 1 to 0 to 1 etc., thereby consuming gobs of power (typically 10-20% of total chip consumption).

One way of reducing power is to shut down parts of the chip that aren't working. This way, you still keep large clock trees for those subblocks, but at least they are not toggling when you don't use them. You still have to tune the different trees to minimize inter-sub-domain clock skew, but it's manageable. This is called coarse-grained clock gating. Big win, but you have to explicitly design for it, and once a small part of a block is active, the whole block is active. Early-2000s Intel and AMD processors are good examples of where this was used.

For even more power savings, there's fine-grained clock gating. Instead of using the multiplexer to make sure the FF keeps its value, you simply shut down its clock input when it doesn't need a new value. Basically, the 'select' input of the multiplexer is now connected to an AND gate that kills the clock. There are some details to get right, but the theory is very simple.
New clock tree synthesis tools introduced around 2002 made it possible to automate the clock tree skew tuning *through* those AND gates. In addition, there's a tool called PowerCompiler (among others) that takes all existing code and, instead of adding multiplexers, simply adds the clock gate. Those two together can easily reduce power by 30% (depending on the application).
The disadvantages of clock gating are that, initially, the skew was still a bit higher than without gating, resulting in a speed impact. Also, testability is a bit more difficult (though not too much), and late-in-the-game design changes (called ECOs) are also quite a bit harder. But the power benefits are definitely there.

Did I forget something?

Oh yes, one more detail: in the old style, you need 1 multiplexer per FF. So if you have a pipeline register of, say, 32 flip-flops, you need 32 multiplexers. With clock gating, you can use 1 AND gate to shut down the clock of all 32 FFs. Clock gating has some overhead of its own, so typically you start gating at registers that are 3 or 4 bits wide.
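If it helps to see the bookkeeping, here is a back-of-the-envelope sketch in Python of the toggle-count argument; the register width, load rate, and cycle count are illustrative assumptions, not numbers from any real chip:

```python
# Toy model: count clock edges arriving at FF clock pins over N cycles
# for one 32-bit register that only loads new data on a fraction of cycles.

import random

N_CYCLES = 10_000
WIDTH = 32         # FFs in the register, as in the 32-FF example above
LOAD_RATE = 0.05   # fraction of cycles the register actually takes new data

random.seed(1)
load = [random.random() < LOAD_RATE for _ in range(N_CYCLES)]

# Old style: mux in front of each FF. Every FF sees the clock every cycle,
# whether the mux selects new data or recirculates the old value.
mux_edges = N_CYCLES * WIDTH

# Fine-grained gating: one AND gate per register kills the clock whenever
# the load/enable signal is low, so the FFs only clock on load cycles.
gated_edges = sum(load) * WIDTH

print(f"mux-style clock edges: {mux_edges}")
print(f"gated clock edges:     {gated_edges}")
print(f"clock activity saved:  {1 - gated_edges / mux_edges:.0%}")
```

The same AND gate also replaces all 32 muxes, which is where the area saving in the anecdote below comes from.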

I was project leader of the first 130nm chip at my company. Our initial test version of the chip did not use clock gating, because of back-end problems due to the fragility of the (clock-gating-related) tools. The core of the chip had a size X.
The next version in the same technology, with some *additional* features added and with clock gating, had a size of 0.9 * X! That's a 10% area improvement right there.
Now keep in mind that this design was not highly pipelined and had fairly narrow busses, so the FF area to total area ratio was fairly low. For a highly pipelined design, we would have had even better results.

I asked a Magma sales guy some time ago how many of his customers are using clock gating these days for their new designs. Answer: all of them.

Now tell me if you know a type of hardware that's highly pipelined. Anyone? :smile:

It's a hypothesis, all right... But if it's true, you should also see significantly reduced power consumption.

For some real numbers, have a look here:
http://www.deepchip.com/items/0396-06.html
Look at the second table, column 'relative area'.
 
geo said:
14x14 is also 196!
Your assumption was that Nvidia is correct, to arrive at 13.5mm x 14.5mm.

14x14 = 196. Whew!!! That was a huge relief. What if I read the slide wrong and it was 186? Would the die dimensions auto-adjust to 13.63x13.63?
 
atomt said:
Your assumption was that Nvidia is correct, to arrive at 13.5mm x 14.5mm.

14x14 = 196. Whew!!! That was a huge relief. What if I read the slide wrong and it was 186? Would the die dimensions auto-adjust to 13.63x13.63?

I'd offer to bet on it, but I try not to take easy money from junior members.

See ya on Thursday.
 
geo said:
I'd offer to bet on it, but I try not to take easy money from junior members.

See ya on Thursday.
I would like to accept the bet, but I try not to make senior members look .....

The reality is that I don't have a 7800GT for a more accurate verification, and all the great minds here prefer not to use a ruler to make some measurements. I still think it is > 200mm^2.

Thursday is just 2 days away, and by month's end we will know whether G73 is 125mm^2 or 150mm^2.

No problem for a junior member to eat crow. I just scratch my balls wondering where all the scaling went bad.
 
silent_guy said:
It's a hypothesis, all right... But if it's true, you should also see significantly reduced power consumption.
Do you think that ATI and NVidia are only now just implementing fine-grained clock-gating?

I was under the impression that they've both been doing this kind of thing for a while now, to achieve low-power in laptop/mobile (and using that tech in desktop, too).

Perhaps what you're saying is that they're moving from coarse-grained to fine-grained. Or at least that NVidia may well have done so with G71.

I'm just trying to get a feel for timescales here. If we assume that they're lagging behind Intel/AMD, by how much?

Jawed
 
Razor1 said:
http://theinq.com/?article=30130

Inq is saying 278 million; the rumor of an extra quad in the G70 seems to be singing :)

That would put them at about a 58% increase in density moving from 110nm to 90nm, taking both the 334mm2 and 196mm2 numbers seriously.
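Spelling that density math out, with G70's commonly quoted ~302M transistors as the other input: G70 comes to 302/334 ≈ 0.90 Mtransistors/mm2 and G71 to 278/196 ≈ 1.42, a ratio of about 1.57, i.e. roughly the 58% quoted above.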

I'm tempted to go look at Wavey's "hidden quads" thread and see how many extra transistors he calculated G70 had. :D
 
Razor1 said:
http://theinq.com/?article=30130

Inq is saying 278 million; the rumor of an extra quad in the G70 seems to be singing :)

Hmmmm, interesting. If that was the case though, why not leave the extra quad in there for G71? It would still have ended up pretty small. And if you're dumping ~30M transistors, why not use it for something else - like HDR+AA or better AF?
 
trinibwoy said:
Hmmmm, interesting. If that was the case though, why not leave the extra quad in there for G71? It would still have ended up pretty small. And if you're dumping ~30M transistors, why not use it for something else - like HDR+AA or better AF?
Why bother adding HDR+AA if G80 is coming soon?
 
atomt said:
Of course, if someone else can independently measure the die size, that would be most reassuring, in case I made a big BOO-BOO in my elementary calculations.

I measured it at 180-220mm^2 in the earlier thread, likely at the lower end of that range. I thought that too low then, but now I think it was spot on... we shall see!
 