Azuro PowerCentric

Geo · May 31, 2007

http://www.beyond3d.com/content/pr/31

SANTA CLARA, Calif.--(BUSINESS WIRE)--Azuro, Inc. the provider of advanced clock implementation tools for nanometer (nm) chip design, today announced that NVIDIA Corporation, the worldwide leader in programmable graphics processor technologies, has entered into a multi-year agreement to purchase Azuro's PowerCentric™. NVIDIA selected PowerCentric after a successful evaluation that demonstrated PowerCentric's ability to reduce power and also meet complex variability-driven clock tree implementation requirements.

Anybody care to give a little more english-like color to what this is all about?

trinibwoy · Jun 1, 2007

Nvidia $$$ --->> Azuro

Azuro (power saving thingy) --->> Nvidia

Geo · Jun 1, 2007

Well, I'm sure power saving is part of it. And any clock thing can be expressed as being about power saving like any death could be expressed as, in the end, "heart failure". It was all the "variability" and "tree" parts that I was more interested in, and how broadly they might be looking to apply that. Could be just handheld and mobile. But then again maybe not.

digitalwanderer · Jun 1, 2007

trinibwoy said:
Nvidia $$$ --->> Azuro

Azuro (power saving thingy) --->> Nvidia

Hey, that's even easy enough for me to understand! :yep2:

Thanks.

Rufus · Jun 1, 2007

Clock Tree 101

One of my projects this semester was hacking at clock trees, so hopefully I can explain this.

When a designer is working on a chip, or a part of a chip, one of the primary concerns (other than basic functionality) is that the amount of logic you have within every pipeline stage fits within the clock period that has been decided on. So to start off someone decides "this chip will run around 700 MHz" and then the designers go off and make an ALU that works at 700 MHz (adding pipeline stages and such). From a designer's perspective there's just this magical "700 MHz clock" that clocks everything they touch all at once.

Now the designer hands it off to someone to do the layout. You generally have 1 clock source. You have some 100 million transistors that all need to be switched by this 1 clock (well, more like some 100,000 registers, but still a lot). The major problem is how do you get this 1 clock source to all 100,000 registers at the same time? Obviously bad things happen if the ALU is designed for a 700 MHz clock, but the source register of an ALU switches half a clock later than the destination register switches (clock skew).

The "standard" method of doing this is a balanced H- or X- tree. Here is an image showing an H tree. The basic idea is you take the source, split it 4 ways, and then split each of those 4 ways. If this is done properly then every path from the source to a register is the same length and has the same number of splits, so the clock will arrive at every register at the same time.
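A rough sketch of why the H-tree balances, assuming an idealized tree where each level is one "H" (four branches) and wire length stands in for delay. The function name `h_tree` and the dimensions are made up for illustration; a real clock tree also has buffers, wire RC and load differences that this ignores.

```python
def h_tree(x, y, half, levels, length=0.0):
    """Return (sink_x, sink_y, wire_length_from_root) for every leaf.

    Each level is one 'H': from the centre, a wire runs `half` sideways and
    then `half` up or down, giving four branches. The next level repeats the
    same shape at half the size.
    """
    if levels == 0:
        return [(x, y, length)]
    sinks = []
    for dx in (-half, half):
        for dy in (-half, half):
            # Manhattan wire from this centre to one corner of the H
            sinks += h_tree(x + dx, y + dy, half / 2, levels - 1,
                            length + 2 * half)
    return sinks


sinks = h_tree(0.0, 0.0, 4.0, levels=3)
lengths = {length for _, _, length in sinks}
print(len(sinks), "sinks, distinct path lengths:", lengths)
```

With three levels this reaches 64 sinks, and every one of them sits at the same wire distance from the source, which is the whole point of the balanced tree.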

This method works well, but its major flaw is power and size. Every branch requires a buffer (area) and every buffer is switched every clock cycle (power). It's amazing but in some large chips as much as 50% of the power can be wasted on nothing more than clock distribution (no computation or anything). The other problem is dynamic power adjustments / clock gating. We all know turning off unused portions of the chip is a good thing, but turn off half of one leg and you no longer have a balanced tree. Saving power shouldn't cause the rest of the chip to malfunction. Also, figuring out what to turn off and when is non-trivial. Azuro apparently can automatically gate certain areas above what a designer has manually gated.
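To put rough numbers on "every buffer is switched every clock cycle", here is a minimal sketch using the standard switching-power relation P = C · Vdd² · f, with the activity factor taken as 1 for a free-running clock. The buffer capacitance, voltage and tree depth are made-up values for illustration, not figures from Azuro or Nvidia, and the model assumes one buffer per branch with a gate at the root of each top-level quadrant.

```python
def buffer_count(levels):
    """Buffers in a 4-way tree with one buffer per branch: 4 + 16 + ... + 4**levels."""
    return sum(4 ** k for k in range(1, levels + 1))


def clock_tree_power(levels, cap_per_buffer, vdd, freq_hz, gated_quadrants=0):
    """Switching power C * Vdd^2 * f summed over the buffers still toggling.

    Gating a top-level quadrant is assumed to stop every buffer in it.
    """
    active = buffer_count(levels) * (4 - gated_quadrants) // 4
    return active * cap_per_buffer * vdd ** 2 * freq_hz


# Hypothetical numbers: 50 fF per buffer, 1.2 V, 700 MHz, 3-level tree.
full = clock_tree_power(3, 50e-15, 1.2, 700e6)
half = clock_tree_power(3, 50e-15, 1.2, 700e6, gated_quadrants=2)
print(f"{buffer_count(3)} buffers: {full * 1e3:.2f} mW ungated, "
      f"{half * 1e3:.2f} mW with two quadrants gated")
```

The sketch only counts power. It says nothing about the balance problem mentioned above, which is what makes gating hard in practice.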

I haven't looked much into what Azuro's product does, but this is the problem space. I would not be surprised if the clock tree took up on the order of 1/3 - 1/2 of a large GPU's power, so a 15% overall power reduction from standard solutions with a good clock tree is certainly possible.
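As a quick sanity check on that estimate, the arithmetic can be sketched like this. The only inputs are the guesses from the post (clock tree at 1/3 to 1/2 of total power, 15% overall saving); the helper name is made up.

```python
def clock_saving_needed(clock_share, overall_saving):
    """Fraction of clock-tree power that must be cut to hit an overall saving."""
    return overall_saving / clock_share


for share in (1 / 3, 1 / 2):
    needed = clock_saving_needed(share, 0.15)
    print(f"clock share {share:.0%}: clock power must drop by {needed:.0%}")
```

So a 15% overall reduction would mean trimming the clock tree's own power by roughly 30% to 45%, depending on which end of the guess is closer.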

DISCLAIMER: do not expect the next Nvidia GPU using this technology to magically have drastically lower power consumption. Sure Azuro must have something good for Nvidia to have picked it up, but I doubt Nvidia has been using just the normal baseline ungated clock tree that Azuro compares against. This is simply one tool in the overall power management strategy.

Davros · Jun 1, 2007

That's a great explanation, Rufus.
So the signal from the clock has to reach all the components, and they have to be able to finish their instructions before the next clock tick and signal arrive?

Rufus · Jun 1, 2007

Here's a real-world example that hopefully makes this more concrete (anyone who's taken any Comp Arch should recognize this as your classic 5-stage MIPS pipeline)

The pink vertical bars are your pipeline registers that split the computation across multiple clock cycles. On a rising clock edge every value going in on the left of a register is captured and provided to the right for the next cycle. So on one rising clock edge the values from the registers are captured by the ID/EX register and provided to the ALU. On the next clock edge the value from the ALU is captured by the EX/MEM register.

Let's say for simplicity that this CPU is designed to run at 1 GHz. That means whoever designs the ALU has ideally 1ns of computation time for the ALU, and they design an ALU that works in 1ns. In the designer's ideal world the ID/EX register is clocked at 0ns, it takes 1ns to compute through the ALU, providing its result at 0+1=1ns, and the EX/MEM register is also clocked at 1ns.

Now let's say whoever did the layout does a bad job and the ID/EX register is always clocked 0.2ns late. This means that at 0.2ns the ALU gets its data and it takes 1ns to calculate, so the result is ready at 0.2 + 1 = 1.2ns. But the EX/MEM register already captured whatever the halfway done result of the ALU was at 1ns. Oops. This is the basic functional requirement for clocks.
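The same check can be written as a tiny slack calculation. This is a simplified sketch using the numbers above, in integer picoseconds; register setup time and clock-to-output delay are ignored, and the function name is just for illustration.

```python
def setup_slack_ps(period, logic_delay, launch_late=0, capture_late=0):
    """Slack = time the capturing register samples minus time the data is ready.

    Negative slack means the capture register grabs a half-finished result.
    """
    data_ready = launch_late + logic_delay
    capture_edge = period + capture_late
    return capture_edge - data_ready


ideal = setup_slack_ps(1000, 1000)                    # both registers on time
skewed = setup_slack_ps(1000, 1000, launch_late=200)  # ID/EX clocked 0.2ns late
both_late = setup_slack_ps(1000, 1000, launch_late=200, capture_late=200)
print(ideal, skewed, both_late)
```

The third case is there to show that what matters is the difference between the two clock arrivals (the skew), not how late the clock is in absolute terms.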

Optimizing the clock tree to minimize power usage while still providing correct results is just one part of what Azuro does. This picture shows another example of how they can save power. The top adder in the EX stage (takes "shift left 2" as an input) is only used on branch instructions. Usually the PC+4 adder on the left is used when you're just executing straight-line code. Normal implementations will just calculate this second addition and then throw away the result, wasting power. A designer could manually disable the adder, but it becomes a pain to find every possible place you could do this. Azuro claims they can automatically locate logic like this and turn it off when not in use, while obviously still leaving the clock tree balanced and functionally correct.

silent_guy · Jun 1, 2007

Geo said:
Anybody care to give a little more english-like color to what this is all about?

Cherry picking from their website:
- Variability-aware: They take all process corners into account instead of just 1. As a result, your tree will be better balanced over large temperature/process ranges. I don't think others have this. Don't know how important this is, but I assume it will reduce the effort of backend people in getting all timing corners correct. (Fewer tuning iterations.)

- Physically-aware: In a traditional flow, clock gating is done early in the netlist creation process, before initial placement. It looks like they are doing gating based on already placed cells. This has the advantage that they already know how much they can potentially save by clock gating... or lose, in which case they won't do anything. Very smart to do this at the end instead of in the beginning.

- Vector-less power analysis: in a traditional flow, the register of an incrementer will either be fully gated or fully active. But some bits (LSB's) will toggle all the time, while others will not. So when it increments, all FF's will consume power instead of just a few. With statistical analysis, you can do a much better job, and clock gate the upper bits differently than the lower bits. Don't know how well it works in practise, but it's very interesting. For even better results, you can use vector-driven analysis, though I believe others also do this (but it's always a pain to get meaningful vectors, as is the case for all power estimation tools.)

- Useful skew: this means that you will deliberately introduce skew when 1 pipeline stage is not critical while the next one is. By unbalancing the FF in the middle towards the 'left', the next stage gets more breathing room and the overall circuit can run faster. This is not really new though.
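The useful-skew point can be made concrete with a small sketch. It assumes a chain R1 -> logic A -> R2 -> logic B -> R3 with hypothetical delays of 600 ps and 1000 ps, looks only at setup timing (hold checks are ignored), and uses made-up names throughout.

```python
def min_period_ps(delay_a, delay_b, mid_skew=0):
    """Smallest clock period for R1 -> A -> R2 -> B -> R3.

    mid_skew is the arrival offset of R2's clock; negative means R2 is
    clocked early (pushed 'left'). Stage A then has period + mid_skew to
    finish, while stage B has period - mid_skew.
    """
    return max(delay_a - mid_skew, delay_b + mid_skew)


balanced = min_period_ps(600, 1000)  # zero skew: the slow stage sets the period
best_skew = min(range(-500, 501, 50),
                key=lambda s: min_period_ps(600, 1000, s))
skewed = min_period_ps(600, 1000, best_skew)
print(balanced, best_skew, skewed)
```

With these numbers, clocking the middle register 200 ps early lets the slack from the short stage be lent to the long one, so the period drops from 1000 ps to 800 ps.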

I'm sure it must be a nice design win for Azuro and they're definitely on to something. I won't be surprised at all if they end up being acquired by either Synopsys or Cadence.

Not so sure if it's that important in the grand scheme of things, though.

Pressure · Jun 1, 2007

Geo said:
Well, I'm sure power saving is part of it. And any clock thing can be expressed as being about power saving like any death could be expressed as, in the end, "heart failure". It was all the "variability" and "tree" parts that I was more interested in, and how broadly they might be looking to apply that. Could be just handheld and mobile. But then again maybe not.

Different clock domains as used in the G80 for example.

INKster · Jun 2, 2007

Pressure said:
Different clock domains as used in the G80 for example.

And in the GeForce 7 too.

silent_guy · Jun 2, 2007

Pressure said:
Different clock domains as used in the G80 for example.

Yes, but in the context of this tool, that's really irrelevant. The number of clocks used in GPUs is trivial compared to some telecom chips, where it's normal to have dozens of clocks.

Farhan · Jun 2, 2007

Rufus said:
It's amazing but in some large chips as much as 50% of the power can be wasted on nothing more than clock distribution (no computation or anything).

Just as an extreme example: in the POWER4, clock distribution and latches consume about 70% of total power, 80 out of 115 W. zomg
