Urian, the basis for the PPE core, and thus for the Xenon/Waternoose core, is a project that IBM had started long before either of these projects began.
MS and SCE had different objectives for their respective processors. MS wanted more L2 cache, a better VMX implementation, and some custom instructions to better fit their graphics APIs and to cooperate with the GPU. SCE wanted a lower-penalty SMT implementation, with less of a speed hit when both threads are active and trying to run at full speed. Supposedly there is much more resource duplication inside the PPE than in the way SMT is implemented in each of Xenon's cores; the fact that MS recommends pairing one worker thread with a more lightweight one (a memory-bound thread rather than a compute-bound one) MIGHT lend some confidence to this claim.
The DD2+ revisions of the CBE processor show a PPE that is 2x the size of the one in the earlier DD1 revision and larger than each of the 3 cores in the Xenon CPU chip. Compare the PPE with each of Xenon's cores: the integer register file is fundamentally the same (32x64-bit GPRs), the PPE's L2 cache is 0.5 MB versus 1.0 MB for the whole of Xenon, and CELL's VMX implementation has a quarter of the architectural registers (32x128 bits versus 128x128 bits). That makes you wonder why the PPE still takes up so much space (and I do not think it is because of bad chip design).
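
To put rough numbers on that comparison, here is a quick back-of-the-envelope sketch in Python that uses only the figures quoted above. The helper name `regfile_bytes` is made up for illustration, it counts the architectural registers of a single thread only, and the even three-way split of Xenon's L2 is my own simplifying assumption (the cache is shared, not partitioned per core). It says nothing about actual die area.

```python
# Back-of-the-envelope storage comparison, using only the figures quoted above.
# Assumption: one thread's worth of architectural registers; no rename
# registers or per-thread SMT duplication are counted.

def regfile_bytes(num_regs: int, width_bits: int) -> int:
    """Storage needed for num_regs registers of width_bits each, in bytes."""
    return num_regs * width_bits // 8

gpr_bytes = regfile_bytes(32, 64)          # same for PPE and each Xenon core: 256
ppe_vmx_bytes = regfile_bytes(32, 128)     # PPE VMX: 512
xenon_vmx_bytes = regfile_bytes(128, 128)  # Xenon VMX: 2048

# How many times larger the Xenon vector register file is than the PPE's.
vmx_ratio = xenon_vmx_bytes // ppe_vmx_bytes  # 4

# L2 cache: 0.5 MB for the PPE versus 1.0 MB for the whole Xenon chip.
ppe_l2_kib = 512
xenon_l2_kib_total = 1024
# Assumption: an even split across the 3 cores, purely for comparison.
xenon_l2_kib_per_core = xenon_l2_kib_total / 3

print(f"GPR file:        {gpr_bytes} bytes (both)")
print(f"VMX file:        PPE {ppe_vmx_bytes} bytes, Xenon core {xenon_vmx_bytes} bytes")
print(f"VMX ratio:       Xenon core has {vmx_ratio}x the PPE's vector registers")
print(f"L2 per core:     PPE {ppe_l2_kib} KiB, Xenon ~{xenon_l2_kib_per_core:.0f} KiB")
```

So on register storage the PPE should be the smaller of the two, and the L2 is a separate block anyway, which is exactly why the extra PPE area has to be coming from somewhere else.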