Going by the diagram, with separate AGUs and ALUs, I think they intend to crack instructions with memory operands, similar to what Intel does and unlike K7/8/x.
Separate AGUs are nothing new for AMD chips.
I'd view the ALU/AGU pairs as evidence that this hasn't changed.
Micro-op fusion from the Pentium M onwards indicates that Intel has also moved toward not cracking ALU/MEM ops (or cracking them and then immediately fusing them back together, however Intel's design handles this internally).
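To make the distinction concrete, here's a rough sketch of the two tracking styles for something like "add eax, [rbx]". The op names and formats are made up for clarity; they aren't AMD's or Intel's actual internal encodings.

/* Illustration only: cracked vs. uncracked handling of a memory-operand
 * instruction. The micro-op naming is invented for this example. */
#include <stdio.h>

int main(void) {
    /* Cracked (classic Intel P6 style): two separately scheduled ops. */
    const char *cracked[] = {
        "uop1: tmp <- load [rbx]   ; issued to an AGU/load port",
        "uop2: eax <- eax + tmp    ; issued to an ALU port",
    };
    /* Not cracked (K7/K8-style macro-op, or a fused micro-op): one entry
     * tracked through rename/retire that uses the AGU and then the ALU. */
    const char *fused[] = {
        "op:   eax <- eax + [rbx]  ; one tracked entry",
    };

    puts("cracked:");
    for (int i = 0; i < 2; i++) puts(cracked[i]);
    puts("fused:");
    puts(fused[0]);
    return 0;
}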
I wondered a little about the Bulldozer CMT feature after reading / skimming the PDFs mentioned in the latest information from Dresdenboy:
http://citavia.blog.de/2009/09/06/architecture-cpu-cluster-microarchitecture-multithreading-6910375/
It seems that CMT hurts the per-clock performance of normal single-threaded workloads by somewhere between 5% and 40%. So does it really make sense to use a clustered integer pipeline?
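As a back-of-the-envelope check: the 5%-40% per-clock penalty is the range quoted above, but the assumption that a second thread on the module scales at about 90% is mine, purely for illustration.

/* Rough arithmetic on the CMT trade-off; the second_thread_gain value is an
 * assumption, not something from the PDFs. */
#include <stdio.h>

int main(void) {
    double penalties[] = { 0.05, 0.20, 0.40 };  /* single-thread per-clock loss */
    double second_thread_gain = 0.9;            /* assumed scaling of thread 2 */

    for (int i = 0; i < 3; i++) {
        double st = 1.0 - penalties[i];            /* one thread on the module */
        double mt = 2.0 * st * second_thread_gain; /* both clusters busy */
        printf("penalty %2.0f%%: single thread %.2fx, two threads ~%.2fx of a "
               "conventional core running one thread\n",
               penalties[i] * 100, st, mt);
    }
    return 0;
}

Even at the pessimistic end, the module's two-thread throughput still comes out ahead of one conventional core, which is presumably the bet AMD is making.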
Clustering cuts down on the complexity of structures like the register file, reorder buffer, and forwarding network compared to a design where the same number of execution resources are fully interconnected.
If all else were equal, clustering could potentially be a loss.
However, as the diagram suggests, integer resources increase by a third.
Each local register file can be physically smaller for a given amount of capacity, as it doesn't need to have as many register read/write ports.
This can either allow a larger register file per cluster, or a faster register file.
A larger file can mean more instructions in flight, which means more opportunities for OOE.
A faster register file can permit a higher clock speed for the cluster.
Either of these could offset some of the penalties of clustering.
Either way, it isn't an equal comparison against the larger unclustered core that AMD apparently decided would not be worth implementing.
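On the register-file point, a rough illustration of why fewer ports help. The quadratic area growth with port count is a common rule of thumb for multiported RF cells (each extra port adds wordlines and bitlines in both dimensions), and the port counts below are assumed, not AMD's numbers.

/* Rule-of-thumb scaling of a multiported register file: area ~ ports^2,
 * wire delay ~ ports. Port counts are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    int shared_ports  = 12;  /* assumed: all ALUs/AGUs sharing one RF */
    int cluster_ports = 6;   /* assumed: half the units per clustered RF */

    printf("shared RF : relative area %3d, relative delay %2d\n",
           shared_ports * shared_ports, shared_ports);
    printf("cluster RF: relative area %3d, relative delay %2d\n",
           cluster_ports * cluster_ports, cluster_ports);
    /* The ~4x area saving can buy more entries (more instructions in flight),
     * or the shorter wires can help close timing at a higher clock. */
    return 0;
}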
Given that a resource monitor block appears in a lot of places in the diagram, I'm curious whether there is another opportunity for power savings and higher clocks.
In a single-threaded case where the resource monitor detects that the core cannot achieve full issue width (some kind of narrow pointer-chasing code), it would be potentially easier to hit a higher turbo mode if a whole cluster is deactivated.
That means half of the issue logic is shut off, and half of the resources that would have been wasted anyway are shut off.
The rest of the core can ramp up frequency higher with a given power envelope than a big core that must ramp up a full complement of units.
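Here's a sketch of the kind of policy I'm imagining. This is purely speculative pseudologic on my part; nothing in the PDFs says the resource monitor actually does this.

/* Speculative sketch of a per-module power policy: if only one thread is
 * active and its sustained issue-slot utilization is low, power-gate the
 * idle cluster and spend the headroom on a higher turbo clock. All names
 * and thresholds here are invented for illustration. */
#include <stdbool.h>
#include <stdio.h>

struct module_state {
    int    active_threads;
    double avg_issue_util;   /* fraction of issue slots used over a window */
    bool   cluster1_gated;
    int    freq_mhz;
};

static void resource_monitor_tick(struct module_state *m,
                                  int base_mhz, int turbo_mhz)
{
    bool narrow = (m->active_threads == 1) && (m->avg_issue_util < 0.5);
    if (narrow && !m->cluster1_gated) {
        m->cluster1_gated = true;   /* shut off half the issue logic/units */
        m->freq_mhz = turbo_mhz;    /* spend the saved power on frequency */
    } else if (!narrow && m->cluster1_gated) {
        m->freq_mhz = base_mhz;     /* give the power headroom back first */
        m->cluster1_gated = false;  /* then wake the second cluster */
    }
}

int main(void) {
    struct module_state m = { 1, 0.3, false, 3200 };
    resource_monitor_tick(&m, 3200, 3800);
    printf("cluster gated=%d freq=%d MHz\n", m.cluster1_gated, m.freq_mhz);
    return 0;
}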
The front end and dispatch parts are more complex, however.
This might not impact clock speeds directly, since the work can be distributed over multiple stages, but it would hurt branch mispredict penalties and power.
There seem to be additional levels of logic dedicated to speeding up mispredict handling.
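To put a rough number on the mispredict concern: the pipeline depth, mispredict rate, and number of extra stages below are all assumed for illustration, not Bulldozer's actual figures.

/* Extra front-end/dispatch stages add directly to the branch mispredict
 * penalty. Illustrative numbers only. */
#include <stdio.h>

int main(void) {
    double mispredicts_per_insn = 0.01;  /* assumed: 1 mispredict per 100 insns */
    int base_penalty = 14;               /* assumed fetch-to-resolve depth */
    int extra_stages = 2;                /* assumed cost of fancier dispatch */

    double base_cpi_loss  = mispredicts_per_insn * base_penalty;
    double extra_cpi_loss = mispredicts_per_insn * (base_penalty + extra_stages);

    printf("CPI lost to mispredicts: %.2f -> %.2f (%.0f%% worse)\n",
           base_cpi_loss, extra_cpi_loss,
           100.0 * (extra_cpi_loss - base_cpi_loss) / base_cpi_loss);
    return 0;
}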