Looking at performance, a module with 2 CMT cores does seem to deliver the promised performance of around 80% of 2 independent cores, I think.
What is killing "Bulldozer" is the low single-thread performance (and poor power efficiency). But maybe CMT is a barrier to improving single-thread performance and power efficiency, considering they are not using CMT (or SMT) for their most power-efficient CPUs. If "Jaguar" is the basis for the new architecture, it would be natural to drop CMT!?
The catch in the promise of 80% of 2 independent cores is the implicit assumption that these are 2 independent cores with the same resources, outside of the shared ones that are now split off.
The reality is that an independent core isn't bound by the constraint that it be designed so it can serve as the point of comparison for a CMT design.
There's a rough generalization that each context has a footprint, or budget, of active transistors it can draw on. CMT's static split ensures that a significant fraction of that footprint cannot readily go where it is needed at any given moment, and the cost of forcing the other half to be active when it isn't needed, along with CMT's other overheads, shrinks the per-context activity budget further. This is on top of the other physical and engineering constraints that likely cut the budget further still.
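As a back-of-envelope illustration of why the baseline matters (a sketch with made-up numbers: the 1.6x figure is just the "80% of two cores" claim restated, and the 1.2x single-thread advantage of the unconstrained core is purely an assumed value), the headline gain shrinks once the comparison core isn't required to be a CMT counterpart:

# Back-of-envelope: why the comparison baseline matters for the "80%" claim.
# All numbers here are illustrative assumptions, not measurements.

module_vs_own_core = 2 * 0.8      # the claim: a module gives ~80% of two of its own cores
wide_core_advantage = 1.2         # assumed single-thread edge of an unconstrained, wider core

# Throughput of the CMT module measured against the wider independent core
# instead of against one half of the module itself:
module_vs_wide_core = module_vs_own_core / wide_core_advantage
print(f"module vs. its own core:       {module_vs_own_core:.2f}x")
print(f"module vs. a wider, free core: {module_vs_wide_core:.2f}x")
# ~1.33x -- the headline 1.6x shrinks as soon as the baseline core isn't
# constrained to be a CMT counterpart.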
As far as using Jaguar as an example goes, I have doubts. Jaguar's other design parameters, besides (somewhat) low power, were that it be cheap and dense. AMD's sights need to be set a bit higher for its next design.
Jaguar's line doesn't seem capable of hitting the really low-power range covered by Intel and ARM, and it's getting squeezed by the lower end of the higher-performance lines.
In terms of cheap and low power, it's hard to beat an ARM core AMD doesn't have to design, but that's not good enough to stand against the designs that are compressing Jaguar from above.
In terms of engineering, there's probably an R&D footprint AMD needs to be able to concentrate, which means not diverting even a Bobcat- or Jaguar-sized team away from the band its next in-house cores target.
In an SMT processor, many resources (ROB, store buffers, etc.) are split between contexts;
True, but generally a significantly smaller set of them is split statically; most are shared dynamically between the contexts.
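A toy sketch of that distinction, under the simplifying assumptions of an even split under contention and no per-thread minimums (not a model of any real core):

# Static partitioning vs. dynamic sharing of a resource pool (e.g. ROB or
# store-buffer entries). Toy illustration only.

TOTAL_ENTRIES = 64

def static_share(active_contexts: int) -> int:
    # Each context always owns a fixed half, even when the other is idle.
    return TOTAL_ENTRIES // 2

def dynamic_share(active_contexts: int) -> int:
    # Entries are competitively shared; a lone context can claim the whole pool.
    return TOTAL_ENTRIES // max(active_contexts, 1)

for n in (1, 2):
    print(f"{n} active context(s): static={static_share(n)} entries, "
          f"dynamic={dynamic_share(n)} entries")
# With one thread running, the static split strands half the entries; the
# dynamic scheme hands them to the thread that can actually use them.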
Some resources, like the data caches and load buffers that Bulldozer wound up duplicating, actually raise the cost of the design over a unified one, because coherent cores have obligations to each other and to the memory system as a whole. The costs that come from burdening inter-core communication get missed in the CMT narrative. Bulldozer's memory subsystem, overly complicated and frequently underperforming even on its own merits, indicates this cost was higher than they could handle.
AMD's rationale for CMT was that duplicating integer execution units for each context was a small incremental cost. The premise for this rationale was based on the K7/K8 microarchitectures, where the integer units were a tiny fraction of a core.
I'm not sure AMD focused much on the integer units as part of its rationale. At least publicly, they all but lied about the nature of the four integer pipelines in the core until they finally divulged some architectural details.
If they focused on it after the fact (and I'm not sure it ranked above other things, like the FPU and front-end savings), it might be putting the cart before the horse.
If integer execution resources were that cheap, why did they cut Bulldozer's per-core integer unit count by 33% from its predecessor, a chip that had no problem hosting that many units?
Unless, that is, other architectural decisions, like chasing hardware savings and the reduction in per-stage complexity demanded by higher clock targets, made that incremental cost unpalatable.
It wasn't only the INT logic being duplicated, but also the L/S pipelines and data caches, so the cost wasn't that small. AMD banked on the future of the IGP, as an integral part of a common architecture, where both intensive and more casual FP code would be "naturally" offloaded. As a consequence, the FPU was left shared, inefficient, and mostly underpowered.
I personally have trouble accepting this premise. The architectures and workloads of the IGP and a core like Bulldozer are so different that I don't see how their engineers could have thought it could be made to work, not in the time frame of Bulldozer or of any CPU in its line prior to replacement. The gaps in granularity, latency, features, infrastructure, implementation needs, and software ecosystem are so vast that they had to have known it would take years, possibly into the line that replaced Bulldozer or beyond, just to get past the point of "this solution is incredibly unacceptable"; they haven't really gotten past it even now.
Why make the CPU of that day match a hypothetical design reality it and its descendants could not live to see?