Why do you blame the failure of this CMT implementation on CMT itself?
Shared units can be done for hardware the design barely cares about.
When dealing with something where performance matters, the architect who designed the concept wanted it for certain specific things:
1) Enabling a tighter critical loop for a fireball-type architecture -- this has completely lost out to the physical realities of today, although that was clear much earlier than BD
2) Enabling newer, interesting modes of execution such as speculative threading and other memory tricks. He flat out said CMT didn't make sense in the absence of doing things in a more interesting manner, and Bulldozer does nothing interesting.
Problem is, nobody is doing any of these interesting things on an ongoing basis, or if they are they aren't doing CMT.
If the interesting things that the designer of CMT said were needed to justify it are either impractical or don't need CMT at all, then CMT is pointless.
Something I simply don't understand is why AMD went for such a big chip. They made it tough for themselves to fight on price; extra transistors don't pay for themselves through an increase in performance.
The transistors themselves don't add cost. Kaveri isn't bigger than Trinity, and it's on a bulk process.
It likely trends towards being mildly cheaper to make.
AMD should simply rework the Jaguar architecture for more scalability upwards (laptop, desktop, etc.) with a relaxed pipeline to allow for higher clock rates and a full-speed L2. I would rather see eight pumped Jaguar cores in Kaveri than yet another botched attempt to iterate on the poor Bulldozer.
Hmm, an amped high-clocking Jaguar.
It probably needs some extra pipe stages. With the current 15-cycle mispredict latency, we could maybe squeeze in a few more stages. Sure, there's a performance hit, but maybe boosting the length to something like 18-19 cycles wouldn't be too bad.
So its integer resources would be about 2 wide, with about two instructions per core per cycle and about 32KB of L1 Icache per core.
Shared, long latency L2.
It might be hard getting the L1 to clock as high, especially if we want to avoid losing performance with its very low associativity. Maybe a 4-way 16 KB L1 data cache.
The higher end workloads might enjoy having a more flexible load/store situation than one dedicated store and one load pipe.
Roughly 8 FLOPs per cycle per core.
Does that sound about right?
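To put a rough number on the mispredict trade-off floated above (15 cycles today versus 18-19 with extra stages), here is a back-of-the-envelope sketch using the usual first-order CPI model. The base CPI, branch fraction and mispredict rate are made-up illustrative assumptions, not measured Jaguar figures, and the function names are hypothetical.

```python
def cpi_with_mispredicts(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """First-order model: average cycles per instruction including mispredict stalls."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def per_clock_slowdown(base_cpi, branch_fraction, mispredict_rate, old_penalty, new_penalty):
    """Ratio of new CPI to old CPI; the clock must rise by at least this factor to break even."""
    old = cpi_with_mispredicts(base_cpi, branch_fraction, mispredict_rate, old_penalty)
    new = cpi_with_mispredicts(base_cpi, branch_fraction, mispredict_rate, new_penalty)
    return new / old


if __name__ == "__main__":
    # Assumed workload: 1.0 CPI before mispredicts, 20% branches, 5% of them mispredicted.
    for new_penalty in (18, 19):
        ratio = per_clock_slowdown(1.0, 0.20, 0.05, old_penalty=15, new_penalty=new_penalty)
        print(f"{new_penalty}-cycle penalty: needs about {100 * (ratio - 1):.1f}% more clock to break even")
```

Under those assumptions the per-clock hit is only a few percent, so the longer pipe pays off if it buys more clock than that. Branchier or harder-to-predict code would shift the balance, and the model ignores everything else a deeper pipeline costs.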
Anyway, the whole point is that Kaveri as we know it is a 240mm^2 chip, two thirds of the PS4 chip, that performs poorly in comparison.
Orbis would not be a good desktop chip. Kaveri might not be that compelling, but it doesn't need a completely custom platform and custom code that doesn't do a fraction of the things of a desktop to not fall on its face.
Back to Entity279's point, another thing with CMT is that it scales poorly from a production POV: you move in increments of 2 cores / 1 module, with the matching amount of L2.
There is no rule stating that the number of cores has to be an even number; those Phenom X3s were nice.
Those still had four cores.
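The granularity argument in this exchange can be spelled out with a small sketch. It assumes 2 cores and 2 MB of L2 per module purely for illustration, and the function names are hypothetical. The second function shows the Phenom X3 point: an odd core count comes from fusing off a core on a die that was still built with the full even number.

```python
def cmt_die_options(max_modules, cores_per_module=2, l2_mb_per_module=2):
    """Physical die configurations when the design grows one module at a time.

    Returns a list of (modules, cores, l2_mb) tuples. The per-module figures are
    illustrative assumptions.
    """
    return [
        (m, m * cores_per_module, m * l2_mb_per_module)
        for m in range(1, max_modules + 1)
    ]


def sellable_core_counts(physical_cores, max_disabled=1):
    """Core counts a single die can be sold as if up to max_disabled cores are fused off."""
    return [physical_cores - d for d in range(0, max_disabled + 1)]


if __name__ == "__main__":
    for modules, cores, l2_mb in cmt_die_options(4):
        print(f"{modules} module(s): {cores} cores, {l2_mb} MB L2")
    # A triple-core product still pays for a four-core die.
    print(sellable_core_counts(4))
```

Either way the silicon cost is set by the physical die, which is why the step size matters for production more than the number printed on the box.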