geo said:
Remembering the "hidden quads" conversation, I'd say you weren't the only one who thot that G70 was too big! :smile:
That's an interesting thot to juxtapose against atomt's speculation that they squeezed out some transistors, isn't it?
Edit: Or was that too subtle?
I calculated 225mm2 myself for G70 trans count at 90nm. Somewhat smaller than where atomt came out, but still quite a bit bigger than 196mm2. So, yeah, if a bunch of transistors suddenly go missing in G71 we might want to reconsider if the "hidden quads" theory had some merit after all, and they've now jettisoned them as unnecessary on a much smaller die.
EditII: And if you don't like "hidden quad(s)", substitute any other level of redundancy to up yields on a gehenna big part, that wouldn't be nearly as attractive on a much smaller part.
Let me make a hypothesis here...
Executive Summary: Fine Grained Clock Gating
The long story:
There are a few ways to do conditional transfers of data into a register. Up to, say, 2002 almost everybody used the plain vanilla way: you put a multiplexer in front of the flip-flop. One input is connected to the provider of new data. The other input is connected to the output of the FF. The control input of the multiplexer determines if you want to feed the FF with new data or if you want to keep the old data stored in the FF.
This configuration has a number of advantages:
- it's very easy to control testability, but, more important, ...
- ... it results in 1 clock tree for the whole clock domain.
The last part is *very* important: one of the hard problems during the back-end stage of layout is to create a clock tree that has minimal skew, meaning that the rising edge of the clock arrives at different locations on the die at the same time.
Bad skew can result in all kinds of really nasty problems, including reduced maximal clock speed.
Generally, the fewer clock trees you have, the easier things get.
There is, however, one really big disadvantage to this system: power consumption.
Even if your circuit doesn't do *anything* (meaning: all FF's keep their current value), the whole clock tree keeps on switching from 0 to 1 to 0 to 1 etc, thereby consuming gobs of power (typically, 10 - 20% of total chip consumption.)
One way of reducing power is to shut down parts of the chip that aren't working. This way, you still keep large clock trees for those subblocks, but at least they are not toggling when you don't use them. You still have to tune the different trees to minimize inter-sub-domain clock skew, but it's manageable. This is called coarse grained clock gating. Big win, but you have to explicitly design for it, and once a small part of the block is active, the whole block is active. Early-2000s Intel and AMD processors are good examples of where this was used.
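To put rough numbers on the coarse grained case, here is a minimal Python sketch. The block names, flip-flop counts and activity pattern are all made up for illustration. It just counts how many flip-flops receive a clock edge with no gating at all versus with one gate per block:

```python
# Hypothetical chip: block name -> number of flip-flops in that block.
BLOCK_SIZES = {"fetch": 1000, "alu": 4000, "fpu": 5000}

# Per cycle: how many FFs in each block actually need new data (made-up numbers).
ACTIVITY = [
    {"fetch": 200, "alu": 50, "fpu": 0},
    {"fetch": 200, "alu": 0, "fpu": 0},
    {"fetch": 0, "alu": 0, "fpu": 1},
    {"fetch": 0, "alu": 0, "fpu": 0},
]


def clocked_ffs_ungated(sizes, activity):
    # Without gating, every FF sees every clock edge, busy or not.
    return sum(sizes.values()) * len(activity)


def clocked_ffs_coarse(sizes, activity):
    # With one gate per block, the whole block is clocked
    # as soon as a single FF in it is busy.
    total = 0
    for cycle in activity:
        for name, busy in cycle.items():
            if busy > 0:
                total += sizes[name]
    return total


ungated = clocked_ffs_ungated(BLOCK_SIZES, ACTIVITY)
coarse = clocked_ffs_coarse(BLOCK_SIZES, ACTIVITY)
print(ungated, coarse)
```

Note the third cycle: a single busy flip-flop in the "fpu" block wakes up all 5000 of them, which is exactly the "whole block is active" limitation.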
For even more power savings, there's fine grained clock gating. Instead of using the multiplexer to make sure the FF keeps its value, you simply shut down its clock input when it doesn't need a new value. Basically, the signal that used to drive the 'select' input of the multiplexer is now connected to an 'AND' gate that kills the clock. There are some details to get right, but the theory is very simple.
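A behavioral sketch of the two styles, in plain Python rather than HDL. The class names `MuxEnableFF` and `ClockGatedFF` are invented for this example, and the model ignores timing entirely. Both registers end up holding the same values, but the gated one only sees a clock edge when the enable is high:

```python
class MuxEnableFF:
    """Flip-flop with a recirculating multiplexer in front of its D input."""

    def __init__(self):
        self.q = 0
        self.clock_edges = 0  # edges that reach the FF clock pin

    def tick(self, enable, data):
        # The clock always reaches the FF; the mux picks new data or the old Q.
        self.clock_edges += 1
        self.q = data if enable else self.q


class ClockGatedFF:
    """Flip-flop whose clock is ANDed with the enable signal."""

    def __init__(self):
        self.q = 0
        self.clock_edges = 0

    def tick(self, enable, data):
        # The gate blocks the clock while enable is low, so Q simply holds.
        if enable:
            self.clock_edges += 1
            self.q = data


def run(ff, stimulus):
    """Apply (enable, data) pairs, one per clock cycle, and record Q."""
    trace = []
    for enable, data in stimulus:
        ff.tick(enable, data)
        trace.append(ff.q)
    return trace


stimulus = [(1, 5), (0, 7), (0, 9), (1, 3), (0, 1), (0, 2), (0, 4), (1, 8)]
mux_ff, gated_ff = MuxEnableFF(), ClockGatedFF()
mux_trace = run(mux_ff, stimulus)
gated_trace = run(gated_ff, stimulus)

print(mux_trace == gated_trace)                   # same stored values
print(mux_ff.clock_edges, gated_ff.clock_edges)   # edges seen by each FF
```

One of the details this glosses over: real clock gating cells usually put a latch in front of the AND gate so that a change on the enable can't glitch the clock.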
New clock tree synthesis tools introduced around 2002 made it possible to automate the clock tree skew tuning *through* those AND gates. In addition, there's a tool called PowerCompiler (among others) that takes all existing code and, instead of adding multiplexers, simply adds the clock gate. Those 2 together can easily reduce power by 30% (depending on the application).
The disadvantages to clock gating are that, initially, the skew was still a bit higher than without gating, resulting in a speed impact. Also, testability is a bit more difficult (though not too much) and late-in-the-game design changes (called ECO's) are also quite a bit harder. But the power benefits are definitely there.
Did I forget something?
Oh yes, one more detail: in the old style, you need 1 multiplexer per FF. So if you have a pipeline register of, say, 32 flipflops, you need 32 multiplexers. With clock gating, you can use 1 AND gate to shut down the clock of all those 32 FF's. Clock gating has some additional overhead to it, so typically you only start gating registers that are at least 3 or 4 bits wide.
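The break-even can be sketched with some back-of-the-envelope Python. The cost of one clock gate, expressed in multiplexer areas, is an assumption picked to match the "3 or 4 bits" rule of thumb above. It is not a figure from any cell library:

```python
# ASSUMPTION: one clock gate (plus its overhead) costs about as much area
# as 3 multiplexers. Illustrative only, not taken from a real library.
GATE_COST_IN_MUX_AREAS = 3.0


def mux_overhead(width):
    # Old style: one recirculating multiplexer per flip-flop.
    return float(width)


def gating_overhead(width, gate_cost=GATE_COST_IN_MUX_AREAS):
    # Clock gating: one gate shared by the whole register, whatever its width.
    return gate_cost


def gating_saves_area(width, gate_cost=GATE_COST_IN_MUX_AREAS):
    return gating_overhead(width, gate_cost) < mux_overhead(width)


for width in (1, 2, 4, 32):
    print(width, gating_saves_area(width))

# Area saved on a 32-bit register, in units of one multiplexer.
savings_32 = mux_overhead(32) - gating_overhead(32)
print(savings_32)
```

Under that assumption, a 2-bit register isn't worth gating for area alone, while a 32-bit pipeline register gets rid of almost all of its multiplexer area.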
I was project leader of the first 130nm chip at my company. Our initial test version of the chip did not use clock gating because of back-end problems caused by the fragility of the tools (related to the clock gating). The core of the chip had a size X.
The next version in the same technology with some *additional* features added and with clock gating, had a size of 0.9 * X ! That's a 10% area improvement right there.
Now keep in mind that this design was not highly pipelined and had fairly narrow busses, so the FF area/total area ratio was fairly low. For a highly pipelined design, we would have had even better results.
I asked a Magma sales guy some time ago how many of his customers are using clock gating these days for their new designs. Answer: all of them.
Now tell me if you know a type of hardware that's highly pipelined. Anyone? :smile:
It's a hypothesis, all right... But if it's true, you should also see significantly reduced power consumption.
For some real numbers, have a look here:
http://www.deepchip.com/items/0396-06.html
Look at the second table, column 'relative area'.