In the case of the 2-Cell blade versus the 2 Woodcrests, it is a matter of 2 coherent caches on the Cell side, serving the usually non-performance-critical PPEs, versus a more complex set of caches that on average behaves like something between 4 and 8 caches for the performance-critical cores on the x86 platform.
Ah ok, in this example - the 2-Cell blade - there's still going to be off-chip communication going on, so it's not necessarily internal bandwidth. It's still having to divide up into multiple independent working sets, with main memory being the final point of sharing.
On the 9 v 4 issue - what's the count of execution units (and the functionality per unit)?
"What's the difference from a software point of view between a peer core and a synergistic core?" .
Well the biggest difference from a software point of view is the manual async cache management/"distributed memory" ???
In porting regular code to DMA you're having to encode not just the locality, but also assumptions about object ownership per thread.
It's almost like the concept of 'syntactic Salt' at the program design level- a hoop you have to jump through to prove you know whats going on.
You've done this work at design/compile time, so the processor doesn't need to waste silicon/watts on guessing it at run time.
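A minimal sketch of what that explicit encoding can look like on an SPE, assuming the PPE passes the effective address of the chunk this SPE "owns" in argp (the chunk size and the xor loop are made up purely for illustration):

    #include <spu_mfcio.h>

    #define CHUNK 4096      /* bytes per DMA transfer; must be a multiple of 16 */

    static char buf[CHUNK] __attribute__((aligned(128)));   /* local-store buffer */

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        const unsigned int tag = 0;

        /* Pull the chunk this SPE owns from main memory into the local store. */
        mfc_get(buf, argp, CHUNK, tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();          /* block until the DMA completes */

        /* ...work on buf entirely out of the local store... */
        for (int i = 0; i < CHUNK; i++)
            buf[i] ^= 0xFF;                 /* placeholder computation */

        /* Push the results back; ownership of this region stays with this SPE. */
        mfc_put(buf, argp, CHUNK, tag, 0, 0);
        mfc_read_tag_status_all();

        return 0;
    }

The locality and the ownership are both right there in the source: the programmer has decided which region lives in which local store, and when it moves.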
Isn't it <20m transistors for an SPE, vs approx 50m for a Xenon core or PPE? (And god knows how many for a Core2... but I suppose you have to count execution units x clock etc to make a comparison.) So that's where the benefit of the approach shows up...
I know the Cell also gets a benefit over other processors from its clean SIMD-based instruction set... I've no idea how much that contributes here.
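For what it's worth, that SIMD flavour looks roughly like this (SPU intrinsics, everything is a 128-bit quadword; the saxpy-style loop is just an illustrative example):

    #include <spu_intrinsics.h>

    /* y[i] = a*x[i] + y[i], four floats at a time */
    void saxpy4(float a, vec_float4 *x, vec_float4 *y, int nvec)
    {
        vec_float4 va = spu_splats(a);        /* replicate the scalar across all four lanes */
        for (int i = 0; i < nvec; i++)
            y[i] = spu_madd(va, x[i], y[i]);  /* fused multiply-add on the whole vector */
    }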
The question some of my colleagues have is "but how much of this can you do with decent cache control instructions?"
- a properly shared L2 should be able to do the job of inter-LS transfers?
- prefetch + (hyperthreading?) could help deal with cache misses, without the expense of OoOE? (see the prefetch sketch below)
- maybe increase the line size to get the effect of the larger DMA transfers?
- the difference is the wasted control logic, right?
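On the prefetch point, a hedged sketch of the "cache control instruction" alternative, using the GCC/Clang __builtin_prefetch hint (the linked-list walk and the names are made up for illustration):

    /* Walk a linked list, hinting the next node into cache while working on the
       current one. On x86 this lowers to a PREFETCH instruction; it hides latency
       but, unlike a DMA into a local store, gives no guarantee the line is there. */
    struct node { struct node *next; int payload; };

    long sum_list(struct node *n)
    {
        long total = 0;
        while (n) {
            if (n->next)
                __builtin_prefetch(n->next, 0 /* read */, 3 /* keep in all cache levels */);
            total += n->payload;            /* work overlaps with the prefetch */
            n = n->next;
        }
        return total;
    }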
When I say "I like the clarity", I mean I like the fact that you appear to be able to take more implementation decisions based on reasoning (one big factor being code coherence: you actually know the size of each module earlier) rather than by measuring random cache effects.

That seems to point to something being very 'right' about the Cell.
What's Larrabee going to do? I had heard that they definitely won't do Cell-style DMA, "but for high throughput they will definitely extend the memory model; you won't be far off if you think of the Cell". Locked cache lines?