Look at the EE: two Vector Units (each with 32x128-bit registers... a whopping 40 KB of Local Storage for instructions and data if you add both VUs together) and a SIMD-enhanced, single-threaded MIPS core (the R5900i) with 16 KB of Instruction Cache, 8 KB of Data Cache and 16 KB of SPRAM.
Look at the CELL chip presented at ISSCC: 8 independent Vector Processors (each with a TLB-enabled DMA engine, 128x128-bit registers and 256 KB of Local Storage for instructions and data) and a Multi-Threaded (2-way SMT) core with dedicated Vector Processing extensions (VMX for Integer and Floating-Point processing), 32 KB of L1 Instruction Cache, 32 KB of L1 Data Cache and 512 KB of L2 cache. Quite likely the SPEs also have some access to the PPE's L2 cache.
Also, from what we have seen, XDR should have lower latency than Direct RDRAM (in which data, addresses and control were all multiplexed on the same shared bus): XDR was not chosen just for the higher bandwidth it provides, but also for its lower latency compared to its predecessor (an improvement on two fronts).
I think they are coming towards developers in lots of ways: they saw one of the biggest shortcomings of the EE (its RISC core) and observed how it pulled the whole system down. They made the Vector Processors self-feeding and gave them a MUCH larger Local Storage (we no longer have situations like VU0, which offered only 4 KB of Instruction Memory and 4 KB of Data Memory, and whose VIF0 did not even support double buffering like VIF1 did). This way they do not need to wait for the central "managing" processor as much, and the central processor does not have to waste tons of cycles feeding each Vector Processor every time it needs data or when they share data with each other.
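(To make the idea concrete, here is a minimal sketch in plain C of the double-buffering pattern a large Local Storage makes practical: the processor fetches its own next chunk while it crunches the current one, instead of waiting on the central core. dma_get(), dma_wait() and process() are hypothetical stand-ins, not the real SDK interface, since the final tool-set is not public yet.)

```c
#include <stddef.h>

#define CHUNK 4096  /* bytes per transfer; tiny next to 256 KB of LS */

/* Hypothetical primitives standing in for the real DMA interface:
   dma_get() starts an asynchronous transfer from main memory into
   local storage, dma_wait() blocks until the tagged transfer lands. */
extern void dma_get(void *ls_dst, unsigned long ea_src, size_t bytes, int tag);
extern void dma_wait(int tag);
extern void process(float *chunk, size_t n);

static float buf[2][CHUNK / sizeof(float)];  /* two buffers in Local Storage */

void consume_stream(unsigned long ea, size_t total_chunks)
{
    int cur = 0;

    dma_get(buf[cur], ea, CHUNK, cur);             /* prime the pipeline */

    for (size_t i = 0; i < total_chunks; ++i) {
        int next = cur ^ 1;
        if (i + 1 < total_chunks)                  /* prefetch chunk i+1 ...  */
            dma_get(buf[next], ea + (i + 1) * CHUNK, CHUNK, next);

        dma_wait(cur);                             /* ... while chunk i lands */
        process(buf[cur], CHUNK / sizeof(float));  /* compute overlaps DMA    */
        cur = next;
    }
}
```

With only 4 KB of Data Memory (VU0-style) you barely have room for one buffer, let alone two, which is exactly why this pattern was off the table there.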
When they looked at the Vector Processors and at their planned role in CELL, they realized that programmers would be helped by a good compiler that could do loop unrolling for them and do it well: VCL taught some lessons there... they saw that the register file was so tight that VCL often had to lengthen loops and still could not remove all the stalls, because there were not enough registers to take the existing VU code and unroll all its loops efficiently, so programmers had to do that by hand too. Thus the register file was increased in size by a factor of 4x.
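(A toy C example of why unrolling eats registers: every extra copy of the loop body needs its own live values in flight, so the usable unroll factor is capped by the register file. This is an illustration of the general technique, not VU or SPE code.)

```c
/* Naive loop: each iteration's multiply depends on a fresh load, so a
   long-latency pipeline stalls between iterations. */
void scale_naive(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

/* Unrolled 4x: four independent streams in flight hide the latency,
   but each copy of the body needs its own live registers.  With a
   32-entry register file an unroll factor big enough to cover the
   latency may simply not fit; with 128 entries it does. */
void scale_unrolled(float *dst, const float *src, float k, int n)
{
    int i = 0;
    for (; i + 3 < n; i += 4) {
        float a = src[i]     * k;   /* four independent values live */
        float b = src[i + 1] * k;   /* at once: 4x the register     */
        float c = src[i + 2] * k;   /* pressure of the naive loop   */
        float d = src[i + 3] * k;
        dst[i]     = a;
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
    for (; i < n; ++i)              /* remainder iterations */
        dst[i] = src[i] * k;
}
```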
What I am hearing developers say is that they see the PPE as one of CELL's saving graces, while the RISC core in the EE was seen as one of the worst parts of that architecture, the one with basically few or no redeeming qualities. Getting a compiler to do a decent job with the resources the R5900i offers is not easy... getting GCC 2.95 to optimize things perfectly there... well... the result is that someone has to pick up tons of C/C++ code and manually convert it to optimized ASM if you want the EE to run decently fast and not pull everything down.
As far as Data Cache misses and latency for random memory accesses go, the R5900i is clearly on another planet: 8 KB of L1 D-Cache vs 32 KB of L1 D-Cache + 512 KB of L2 Cache... the winner here is quite clear IMHO.
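(One way to see why this matters: a pointer-chasing walk, where every load depends on the previous one, runs at cache speed while the working set fits on-chip and at full main-memory latency once it spills. A rough, hypothetical C sketch:)

```c
#include <stddef.h>

/* Walk a linked list scattered across a working set.  When the set fits
   in 8 KB of D-Cache every hop is a hit; once it spills, every hop is a
   trip to main memory and the average latency per hop explodes.  A core
   backed by 32 KB of L1 plus 512 KB of L2 keeps far larger working sets
   on-chip before hitting that cliff. */
struct node { struct node *next; char pad[60]; };  /* ~one cache line each */

size_t chase(struct node *p, size_t hops)
{
    size_t visited = 0;
    while (hops--) {
        p = p->next;   /* each load depends on the previous one:      */
        ++visited;     /* no prefetching or overlap can hide a miss   */
    }
    return visited + (size_t)(p != NULL);  /* keep p live vs. the optimizer */
}
```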
What can this allow? It allows a C/C++ compiler to be optimized better for the PPE and to do a good job (it has access to enough CPU resources to do so: a compiler and a CPU core should not be developed separately, but together; they should complement each other), which in turn allows the programmer to spend less time re-writing code in PPE-optimized ASM. How many developers pursue the same strategy as PlayStation developers on Xbox? How many trust ICC 8.x on the XCPU only as much as GCC is trusted for the RISC core in the EE?
The DMA engines on the CELL processor now understand Virtual addresses, which is another help they have given to developers dealing with more complex OSes that have Virtual Memory support (Inane Dork, get a Linux kit, install SPS2 and have fun managing DMA transfers... oh yes, you can do it no doubt, it is not an impossible task requiring the courage of a heroic genius to solve... but is it effortless and painless? I do not think so).
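(For illustration, a hedged sketch of the difference in C; every function name here is made up, they just mirror the two workflows: on the EE under Linux you must allocate pinned memory and translate virtual to physical addresses yourself before kicking a DMA, while a TLB-enabled DMA engine accepts the same pointer your C code already uses.)

```c
#include <stddef.h>

/* Hypothetical interfaces, for illustration only. */

/* SPS2-style: the EE's DMA controller only understands physical
   addresses, so the programmer must allocate pinned, physically
   contiguous memory and look up its physical address before building
   a transfer. */
extern void *sps2_alloc_pinned(size_t bytes);
extern unsigned long sps2_virt_to_phys(void *va);
extern void ee_dma_kick(unsigned long phys_addr, size_t bytes);

void send_old_way(size_t bytes)
{
    void *buf = sps2_alloc_pinned(bytes);        /* can't come from malloc() */
    /* ... fill buf ... */
    ee_dma_kick(sps2_virt_to_phys(buf), bytes);  /* manual translation step  */
}

/* CELL-style: a TLB-enabled DMA engine walks the same translations as
   the CPU, so any mapped buffer works and no manual step is needed. */
extern void cell_dma_kick(void *virt_addr, size_t bytes);

void send_new_way(void *buf, size_t bytes)
{
    cell_dma_kick(buf, bytes);                   /* an ordinary pointer is enough */
}
```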
After the POWER4+ to POWER5 transition (everyone expected minor changes... yeah... add SMT, tweak things here and there, etc., but they got an incredible jump forward that exceeded people's expectations, because IBM was able to spot the shortcomings of the POWER4+ core and work around them, making sometimes minor changes, but in all the right spots), I'd have even more faith in IBM's R&D labs being able to assist SCE and Toshiba in developing the PlayStation 3 SDK and the related tool-set.