Why not just go with the Waternoose again? It's what, 170M transistors? That's at 90nm. I'm pretty sure we will be looking at 32nm when these things launch. So why not the 12-core monster they envisioned, add back in OoOE support, and beef up the cache and each of the cores? They already have tools based around the design.
I think that if a company goes with a "Larrabee wannabe", it will likely be used as both CPU and GPU. Implementing OoO in a many-core design may lead to a chip that is too power hungry.
I thought about using a modified Waternoose as a building block mainly because of cost (R&D cost), but there are other reasons. Also, I'm not saying it has to be that way; it was just a hypothesis that allowed me to roughly estimate the kind of "computational density" MS could reach.
By using the Xenon as a building block, MS would not have to start from scratch.
The point is that, for compatibility's sake, MS would have to reach 3.2GHz; this way backward compatibility would not be an issue. As MS is doing way better, the 360 is likely to coexist for some time with its successor, so passing on BC would not be a good choice.
MS could go the same way as Intel and base its design on something old like a PowerPC 603/604.
They share similarities with the Pentium: simple, ~3 million transistors, short pipeline.
But it would prove difficult to reach 3.2GHz with such a short pipeline; for reference, the PowerPC 604 (which is OoO, 4-issue, and ~3.6 million transistors) is a six-stage design.
MS could split each stage in two and end up with a 12-stage pipeline, more likely to reach high frequencies. It's possible this might yield better results, but it would mean more work, and MS already has quite some work to do. That's why, for the sake of my hypothesis, I used the Waternoose.
The Waternoose is a long-pipeline design that is not that power hungry (more than the Cell, but nothing worrying), so running modified Xenons at 3.2GHz should be less of a challenge.
As I said, MS would have quite some work on the table:
They would need a new instruction set for the VPU.
They would have to design a really good VPU. To give the design a chance, they should aim at the kind of density STI reached with the SPU (1.4x the number of transistors per mm² compared to the PPU/PX). That alone would be a huge engineering effort.
Then come the caches, once again a huge effort:
better latencies, good prefetching capability, determining how the L2 cache(s) are shared among the different cores, and implementing the same tricks as Intel (some new instructions for the L2 cache, a read-only policy within the different "non-shared" L2s, no matter how they are shared or laid out). Once again, density would also be a concern.
Then come the fixed-function units (texture samplers) and how they are shared among the different cores.
Here is a rough estimate based on Cell data (die areas in mm²):
Process    SPE      PPE      Xenon
90 nm      14.76    26.86    168
65 nm      11.08    19.6     122
45 nm       6.47    11.32     70
Then the rough estimate, @45nm:
Xenon x4
12 cores + 4MB L2
~300GFlops
~280mm²
XenonII x4
~1200GFlops
~320mm² (an arbitrary increase to account for a bigger VPU, texture samplers, and a lot more registers: 4-way SMT + 512-bit wide VPU)
So @45nm one XenonII => ~80mm²
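To make the "computational density" comparison explicit, here is a minimal Python sketch that uses only the figures quoted above. The names are mine, and the GFlops values are the rough peak numbers from this post, not measurements.

```python
# Figures quoted in the post: areas in mm^2 at 45 nm, peak GFlops.
XENON_45NM_MM2 = 70.0       # one 3-core Xenon, from the table
XENON_X4_GFLOPS = 300.0     # rough peak for four Xenons
XENON2_X4_MM2 = 320.0       # guessed area for four reworked blocks
XENON2_X4_GFLOPS = 1200.0   # guessed peak for four reworked blocks


def density(gflops, area_mm2):
    """Peak GFlops per mm^2 of silicon."""
    return gflops / area_mm2


xenon_x4_mm2 = 4 * XENON_45NM_MM2            # area of four plain Xenons
xenon2_block_mm2 = XENON2_X4_MM2 / 4         # area of one reworked block
xenon2_block_gflops = XENON2_X4_GFLOPS / 4   # peak of one reworked block

print(f"Xenon x4:   {xenon_x4_mm2:.0f} mm2, "
      f"{density(XENON_X4_GFLOPS, xenon_x4_mm2):.2f} GFlops/mm2")
print(f"XenonII x4: {XENON2_X4_MM2:.0f} mm2, "
      f"{density(XENON2_X4_GFLOPS, XENON2_X4_MM2):.2f} GFlops/mm2")
print(f"One XenonII: {xenon2_block_mm2:.0f} mm2, {xenon2_block_gflops:.0f} GFlops")
```

Under these assumptions the reworked block gets about 3.5x the GFlops per mm² of a plain Xenon, which is the whole point of the extra VPU work.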
As it stands, MS would not lag that far behind the GPU manufacturers. Say that @32nm they manage to pack 60% more transistors per mm²; we get this:
@32nm a XenonII would be about 48mm² and worth 300GFlops.
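As a quick check of the shrink arithmetic, here is a small sketch (the function name is hypothetical). The 48mm² figure corresponds to multiplying the 45nm area by 0.6; reading "60% more transistors per mm²" literally means dividing the area by 1.6 instead, which gives 50mm². Either way it is the same ballpark.

```python
def shrink_area(area_mm2, density_gain):
    """Area of the same design on a process that packs
    `density_gain` times more transistors per mm^2."""
    return area_mm2 / density_gain


XENON2_45NM_MM2 = 80.0  # the post's estimate for one XenonII at 45 nm

literal = shrink_area(XENON2_45NM_MM2, 1.6)  # 60% more transistors per mm^2
shortcut = XENON2_45NM_MM2 * 0.6             # area simply scaled by 0.6

print(f"divide by 1.6:   {literal:.0f} mm2")
print(f"multiply by 0.6: {shortcut:.0f} mm2")
```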
It's likely to be less, as MS would focus on density. I feel like 40mm² (@45nm) for a wider VPU + a bunch of registers + some fixed-function units is a healthy estimate if we look at the kind of density ATI achieves, for example.
I would say that, depending on the engineering effort and how good the process is, MS could land anywhere between 35 and 45mm² for the reworked Xenon.
If we consider 40mm² = 300GFlops, MS would have room to design something worth it.
For example, depending on the silicon budget and power consumption/thermal dissipation (and obviously the retail price they aim for at launch), they could end up with these kinds of designs:
One-chip systems:
8 XenonII (24 cores)
L2 cache 8MB
~320mm²
2.4TFlops @3.2GHz
6 XenonII (18 cores)
L2 cache 6MB
~240mm²
2TFlops @3.6GHz
A two-chip design could look like this:
2x 4 XenonII (24 cores)
2x160mm²
2x1.5TFlops @4GHz
2x 6 XenonII (36 cores)
2x240mm²
2x1.8TFlops @3.2GHz
These examples are arbitrary, just to show that MS would have ways to adapt the design to meet its performance goals.
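The configurations above all follow from the same back-of-the-envelope rule, sketched here in Python. It assumes one reworked block is 3 cores, 40mm² and 300GFlops at 3.2GHz, and that peak throughput scales linearly with clock, which is what the figures above seem to imply. The `Config` class and its field names are mine, purely for illustration.

```python
from dataclasses import dataclass

REF_GHZ = 3.2          # clock at which one block is assumed to peak at 300 GFlops
BLOCK_GFLOPS = 300.0   # assumed peak of one reworked 3-core block at REF_GHZ
CORES_PER_BLOCK = 3


@dataclass
class Config:
    chips: int
    blocks_per_chip: int
    ghz: float
    block_mm2: float = 40.0  # assumed area of one reworked block

    @property
    def cores(self) -> int:
        return self.chips * self.blocks_per_chip * CORES_PER_BLOCK

    @property
    def chip_mm2(self) -> float:
        return self.blocks_per_chip * self.block_mm2

    @property
    def chip_tflops(self) -> float:
        # Assumption: peak throughput scales linearly with clock.
        return self.blocks_per_chip * BLOCK_GFLOPS * (self.ghz / REF_GHZ) / 1000.0


configs = [
    Config(chips=1, blocks_per_chip=8, ghz=3.2),
    Config(chips=1, blocks_per_chip=6, ghz=3.6),
    Config(chips=2, blocks_per_chip=4, ghz=4.0),
    Config(chips=2, blocks_per_chip=6, ghz=3.2),
]

for c in configs:
    print(f"{c.chips} x {c.blocks_per_chip} blocks: {c.cores} cores, "
          f"{c.chips} x {c.chip_mm2:.0f} mm2, "
          f"{c.chips} x {c.chip_tflops:.2f} TFlops @ {c.ghz} GHz")
```

The 6-block chip at 3.6GHz comes out slightly above 2TFlops with this rule, which the list rounds down. Changing `block_mm2` (35 to 45mm²) shows how much the die size moves with the density MS manages to reach.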
My most optimistic forecast would be that MS manages to get the XenonII down to around 35mm² or less (huge effort on density + good process) with a good TDP, in which case I would go with a single chip:
10 XenonII (30 cores)
L2 cache 10MB
350mm² or slightly less
3TFlops @3.2GHz or higher
EDIT: those are arbitrary numbers for the sake of discussion.
The real question remains: can MS afford the R&D behind such a project?