Clock Tree 101
One of my projects this semester was hacking at clock trees, so hopefully I can explain this.
When a designer is working on a chip, or a part of a chip, one of the primary concerns (other than basic functionality) is that the amount of logic within every pipeline stage fits within the clock period that has been decided on. So to start off, someone decides "this chip will run at around 700 MHz", and then the designers go off and make an ALU that works at 700 MHz (adding pipeline stages and such). From a designer's perspective there's just this magical "700 MHz clock" that clocks everything they touch all at once.
Now the designer hands it off to someone to do the layout. You generally have 1 clock source. You have some 100 million transistors that all need to be switched by this 1 clock (well, more like some 100,000 registers, but still a lot). The major problem is: how do you get this 1 clock source to all 100,000 registers at the same time? Obviously bad things happen if the ALU is designed for a 700 MHz clock, but the source register of the ALU switches half a clock later than the destination register does (clock skew).
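To put rough numbers on that, here is a small sketch of a setup-time check. The function names, the 1.3 ns logic delay, and the sign convention for skew are my own assumptions for illustration, and it ignores hold time and real register setup requirements.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def meets_setup(freq_mhz, logic_delay_ns, skew_ns, setup_ns=0.0):
    """Return True if data launched on one clock edge arrives before the next capture edge.

    skew_ns is (clock arrival at destination) - (clock arrival at source).
    A negative value means the source register is clocked late, which eats
    into the time the logic has to finish. This sign convention is an assumption.
    """
    period = clock_period_ns(freq_mhz)
    return logic_delay_ns + setup_ns <= period + skew_ns


# Hypothetical ALU stage with 1.3 ns of logic, designed for 700 MHz (~1.43 ns period).
period = clock_period_ns(700)
ok_without_skew = meets_setup(700, 1.3, skew_ns=0.0)
# Source register clocked half a period later than the destination register.
ok_with_half_clock_skew = meets_setup(700, 1.3, skew_ns=-0.5 * period)
```

With no skew the 1.3 ns stage fits; with the source clocked half a period late the same stage only has about 0.71 ns and fails.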
The "standard" method of doing this is a balanced H- or X- tree.
Here is an image showing an H tree. The basic idea is you take the source, split it 4 ways, and then split each of those 4 ways. If this is done properly then every path from the source to a register is the same length and has the same number of splits, so the clock will arrive at every register at the same time.
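The equal-path-length property is easy to check with a toy model. This is only a geometric sketch under my own assumptions (ideal wires, Manhattan routing, arm length halving at each level, no buffers or wire delay model); `h_tree_leaves` is a made-up helper, not anything from a real tool.

```python
def h_tree_leaves(x, y, half, levels, length=0.0):
    """Return (x, y, wire_length_from_source) for every leaf of an idealized H-tree.

    Each level draws one "H": from the center, run `half` horizontally to the
    left or right, then `half` vertically up or down, reaching 4 new centers.
    The next level repeats this with half the arm length.
    """
    if levels == 0:
        return [(x, y, length)]
    leaves = []
    for dx in (-half, half):
        for dy in (-half, half):
            # Horizontal arm + vertical arm = 2 * half of wire per level.
            leaves += h_tree_leaves(x + dx, y + dy, half / 2,
                                    levels - 1, length + 2 * half)
    return leaves


leaves = h_tree_leaves(0.0, 0.0, half=4.0, levels=3)
distinct_lengths = {length for _, _, length in leaves}
```

Three levels give 4 x 4 x 4 = 64 leaves, and `distinct_lengths` contains a single value, meaning every leaf sits at the same wire distance from the source.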
This method works well, but its major flaws are power and size. Every branch requires a buffer (area) and every buffer is switched every clock cycle (power). It's amazing, but in some large chips as much as 50% of the power can be spent on nothing more than clock distribution (no computation or anything). The other problem is dynamic power adjustment / clock gating. We all know turning off unused portions of the chip is a good thing, but turn off half of one leg and you no longer have a balanced tree. Saving power shouldn't cause the rest of the chip to malfunction. Also, figuring out what to turn off and when is non-trivial. Azuro apparently can automatically gate certain areas beyond what a designer has manually gated.
I haven't looked much into what Azuro's product does, but this is the problem space. I would not be surprised if the clock tree took up on the order of 1/3 to 1/2 of a large GPU's power, so a 15% overall power reduction compared with standard solutions is certainly possible with a good clock tree.
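As a back-of-the-envelope check on that 15% figure: if the clock tree is some fraction of total power, the overall saving is just that fraction times whatever you shave off the clock tree. The function name below is made up, and the 1/3 and 1/2 fractions are my guesses from above, not measured numbers.

```python
def required_clock_reduction(overall_saving, clock_fraction):
    """Fraction by which clock-tree power must drop to hit an overall saving.

    overall_saving = clock_fraction * clock_reduction, solved for clock_reduction.
    """
    return overall_saving / clock_fraction


# Clock tree at 1/3 of total power: needs roughly a 45% cut in clock power.
needed_if_one_third = required_clock_reduction(0.15, 1 / 3)
# Clock tree at 1/2 of total power: needs roughly a 30% cut in clock power.
needed_if_one_half = required_clock_reduction(0.15, 1 / 2)
```

So a 15% overall saving means cutting clock-tree power by somewhere around 30% to 45% under those assumptions, which is a lot but not crazy if the baseline is an ungated tree.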
DISCLAIMER: do not expect the next Nvidia GPU using this technology to magically have drastically lower power consumption. Sure, Azuro must have something good for Nvidia to have picked it up, but I doubt Nvidia has been using just the baseline ungated clock tree that Azuro compares against. This is simply one tool in the overall power management strategy.