I think what Carsten is getting at is: why can a 32nm 315mm² Bulldozer be mass-produced at 3.6 GHz while a 28nm 365mm² Radeon 7970 only clocks at 925 MHz and yet consumes more power?
My guess would be it's mainly transistor density. Tahiti has 4.3 billion transistors, BD only has 1.2 (officially, at least). BD's clockspeed is nearly 4 times as high, while its transistor density is roughly 3.5 times lower.
Mostly wrong.
It's all about pipeline length.
Bulldozer's pipeline is long enough that there are far fewer transistors (and much less wire length) in series within one pipeline stage.
The following is somewhat oversimplified, but explains the principles:
Say the transistors are capable of switching state in about 10 picoseconds (100 GHz), but on Bulldozer there are maybe 25 of those transistors in series in each pipeline stage. That means every pipeline stage takes at least 250 picoseconds, putting the clock speed at about 4 GHz.
In AMD GPUs, if the transistors are equally fast but there are 100 transistors in series in each pipeline stage, then each pipeline stage takes at least 1 nanosecond, putting the clock speed at about 1 GHz.
In Nvidia GPUs, if the transistors are equally fast but there are 65 transistors in series in each pipeline stage (on the shader/hot clock domain), then each pipeline stage takes at least 650 picoseconds, putting the clock speed at around 1540 MHz.
In reality, wire lengths and the delays they cause might have more effect than the transistor delays, but the principle is still the same. And the GPU might be manufactured on a somewhat slower process: it might be that the GPU transistors take 12.5 picoseconds to change state and there are only about 80 of them in series on ATI and 52 on Nvidia.
Btw, your transistor count for Bulldozer is way off. 1.2G is an impossible number; the correct figure is about 1.5G.
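If anyone wants to play with those numbers, here is a quick Python sketch of the same arithmetic. The function name and the table of cases are mine, and the delays and stage depths are just the illustrative guesses from above, not measured values.

```python
def max_clock_ghz(switch_delay_ps: float, transistors_per_stage: int) -> float:
    """Upper bound on clock speed if one stage is a serial chain of transistors."""
    stage_delay_ps = switch_delay_ps * transistors_per_stage
    return 1000.0 / stage_delay_ps  # a 1000 ps period corresponds to 1 GHz


# (switching delay in ps, transistors in series per stage): illustrative only
cases = {
    "Bulldozer": (10.0, 25),
    "AMD GPU": (10.0, 100),
    "Nvidia hot clock": (10.0, 65),
    "AMD GPU, slower process": (12.5, 80),
    "Nvidia hot clock, slower process": (12.5, 52),
}

for name, (delay_ps, depth) in cases.items():
    print(f"{name}: {max_clock_ghz(delay_ps, depth):.2f} GHz")
```

Note that the slower-process variants give the same clocks as the first guesses, because 12.5 ps × 80 and 10 ps × 100 are both 1000 ps, and 12.5 ps × 52 and 10 ps × 65 are both 650 ps.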
The reason for the different transistor densities is that transistors in different structures take up different amounts of space.
In CPUs most of the space is consumed by "dedicated logic transistors" doing something complex, where each transistor has to be positioned "for its job".
Only the register files (a very small part of the chip) and the caches in CPU chips are very tightly packed, and >80% of the transistor count comes from the caches, even though only about half of the die area comes from the caches.
In GPUs most of the space is consumed by register files, which are very regular structures and can be packed very tightly. The logic can also be packed more tightly in GPUs because most of it is highly symmetric vector units.
But of course 28nm allows packing more transistors to same space than 32nm.
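For reference, a rough Python sketch of the average densities using the die sizes and transistor counts mentioned in this thread. The helper name is made up for illustration, and 1.5G for Bulldozer is my estimate from above, not an official figure.

```python
def density_mtr_per_mm2(transistors: float, area_mm2: float) -> float:
    """Average density in millions of transistors per mm^2 over the whole die."""
    return transistors / area_mm2 / 1e6


tahiti = density_mtr_per_mm2(4.3e9, 365)         # Radeon 7970, 28nm
bd_official = density_mtr_per_mm2(1.2e9, 315)    # Bulldozer with the quoted 1.2G count
bd_estimated = density_mtr_per_mm2(1.5e9, 315)   # Bulldozer with my ~1.5G estimate

print(f"Tahiti: {tahiti:.2f} MTr/mm^2")
print(f"Bulldozer (1.2G): {bd_official:.2f} MTr/mm^2, ratio {tahiti / bd_official:.2f}x")
print(f"Bulldozer (1.5G): {bd_estimated:.2f} MTr/mm^2, ratio {tahiti / bd_estimated:.2f}x")
```

These are whole-die averages, so they hide exactly the cache vs. logic split described above.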