Ok here's my summary of what I'm thinking after reading some reviews:
The good:
- TurboCore, which actually does something. Of course for perf/power this is not good, but it's nice to finally see it truly working.
- lower idle power. In some reviews it made a much larger difference (almost to the point of matching Intel idle power consumption), in others not so much, but in any case it looks like an improvement, most likely thanks to power gating the cores.
- higher memory bandwidth than Phenom II (not quite as good as SNB but definitely an improvement).
- new instructions: AES and AVX to catch up with SNB, plus FMA4 etc.
- shared FPU actually looks ok. Only some rare synthetics seem to show it causing a performance hit; otherwise scaling to multiple threads seems largely independent of whether it's an integer or float workload (of course this is most likely also a result of the beefed-up FPU; it will probably show bad scaling with code using AVX-256, where the FPU suddenly doesn't look all that beefy anymore).
- CMT is a neat idea and scaling isn't too bad (hardware.fr has some numbers: 4->8 threads scaling is better than going from 4 to 6 Phenom II cores, and of course better than HT), so it might be a win on a perf/area scale for multithreaded workloads, if only the single-threaded baseline were a bit higher...
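
For anyone who wants to redo the scaling comparison with their own review numbers, here is a minimal Python sketch. The function name and the scores in the usage line are made-up placeholders, not hardware.fr measurements; plug in real benchmark scores from whichever review you trust.

```python
def scaling(score_base, score_scaled, units_base, units_scaled):
    """Return (speedup, efficiency) when going from units_base to units_scaled.

    speedup    = how much faster the scaled run is
    efficiency = fraction of ideal linear scaling actually achieved
    """
    speedup = score_scaled / score_base
    ideal = units_scaled / units_base
    return speedup, speedup / ideal


# Placeholder scores only, NOT measured data: 4 -> 8 threads.
speedup, efficiency = scaling(100.0, 180.0, 4, 8)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Comparing the efficiency figure for 4->8 threads against the one for 4->6 Phenom II cores is one way to put a number on "scaling isn't too bad".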
The bad:
- nowhere near the once promised "30% higher" clock speed. Whether that's due to CPU design or manufacturing trouble I don't know.
- high load power consumption. Efficiency actually didn't increase compared to an X6 1100T, which is worrisome.
- low single-thread performance: even with some clock increase it's just barely at Phenom II levels, and not in the same ballpark as SNB CPUs at all (of course that was expected, but the difference is bigger than it should be).
The ugly:
- AMD promised roughly the same IPC for single-threaded performance and they clearly missed it for typical workloads by about 10-15% or so. I think that's the biggest problem actually. Why did they miss it? Is it just the removal of the 3rd ALU in an integer core? It looked like it should have been possible to "compensate" for that with other improvements (like memory disambiguation, better branch prediction, larger scheduling queues), given that it's hard to schedule 3 ALU ops in the first place, but it didn't happen.
- L1I thrashing issues with only iffy OS bandaids for a problem which should have been avoided by better cache design (higher associativity).
- not convinced of the whole cache design. It requires a large area and the latency is just bad, both compared to Phenom II and even more so compared to the competition, so the large size might not be worth it. Moreover, I wonder if the very low L1D write bandwidth isn't a real problem reducing throughput quite heavily in some cases. It's not even 1/5 of the read bandwidth, while traditionally it "should" be roughly half; this is a result of the write-through L1D design together with the low L2 write bandwidth. There is a "Write Coalescing Cache" to help with that, but at least in the hardware.fr numbers its effect didn't show up.
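
On the L1I point above, here is a toy Python sketch of why higher associativity would have helped. Everything in it is an assumption for illustration: the 64 KB / 2-way / 64-byte-line parameters are the commonly reported figures for the shared L1I, the replacement policy is assumed to be plain LRU, and the trace is invented (three hot code lines that happen to sit 32 KB apart, e.g. coming from the two cores sharing the cache). It is not a model of the real hardware.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size


def set_index(addr, cache_bytes, ways):
    """Set that an address maps to in a set-associative cache."""
    num_sets = cache_bytes // (LINE_BYTES * ways)
    return (addr // LINE_BYTES) % num_sets


def count_misses(addresses, cache_bytes, ways):
    """Simulate an LRU set-associative cache and return the miss count."""
    sets = {}
    misses = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        lru = sets.setdefault(set_index(addr, cache_bytes, ways), OrderedDict())
        if line in lru:
            lru.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            lru[line] = True
            if len(lru) > ways:
                lru.popitem(last=False)  # evict least recently used line
    return misses


# Invented trace: 3 hot lines, 32 KB apart, fetched round-robin 100 times.
trace = [i * 32 * 1024 for i in range(3)] * 100

print(count_misses(trace, 64 * 1024, ways=2))  # 300: every fetch misses
print(count_misses(trace, 64 * 1024, ways=4))  # 3: only the cold misses
```

With 2 ways, the three lines fight over two slots in the same set and every single fetch misses; with 4 ways at the same capacity they all fit. The OS workarounds essentially try to keep hot code from lining up like this in the first place.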