First Cell Benchmarks

Neb

Iron "BEAST" Man
Legend
The fastest processors Intel makes - i.e. high end versions of the Core 2 chips.

Not the fastest Intel CPU, since they say "A single Cell SPE outperforms a 3.0 GHz single-core Xeon processor". Is that Xeon based on P4 tech or Core 2 tech?
 

ADEX

Newcomer
maybe it means a single core from a core 2?

IBM just published a paper about a financial app on Cell; their end result was a dual-Cell blade beating a dual Xeon (Woodcrest, quad core) by 11x on single precision and 3x on double precision.

That's 8 high-end Core 2 cores vs 16 SPEs - but the margin is way bigger than the difference in the number of cores.
 

pjbliverpool

B3D Scallywag
Legend
IBM just published a paper about a financial app on Cell; their end result was a dual-Cell blade beating a dual Xeon (Woodcrest, quad core) by 11x on single precision and 3x on double precision.

That's 8 high-end Core 2 cores vs 16 SPEs - but the margin is way bigger than the difference in the number of cores.

One quad core at 3 GHz theoretically achieves more than double the DP FLOPS of Cell, so we are talking a >6x gap between theoretical and "reality" there.

Something sounds off to me.

What it's essentially saying is that a single SPU is 50% faster than a 3 GHz Core 2 at DP FLOPS, despite the fact that the Core 2 has about a 4-5x higher theoretical peak and a similar transistor-count advantage. Sounds a bit ludicrous to me.
 

Npl

Veteran
One quad core at 3 GHz theoretically achieves more than double the DP FLOPS of Cell, so we are talking a >6x gap between theoretical and "reality" there.

Something sounds off to me.

What it's essentially saying is that a single SPU is 50% faster than a 3 GHz Core 2 at DP FLOPS, despite the fact that the Core 2 has about a 4-5x higher theoretical peak and a similar transistor-count advantage. Sounds a bit ludicrous to me.
FLOPS aren't everything. Cell also has huge internal bandwidth - don't know how many times higher than a Core 2's.
 

pjbliverpool

B3D Scallywag
Legend
FLOPS aren't everything. Cell also has huge internal bandwidth - don't know how many times higher than a Core 2's.

Internal bandwidth to what, though? Between cores? Between cores and LS? How does that compare with the equivalent on a Core 2? I'm not doubting the results, and some architectural trait of Cell was obviously exploited to attain them. I am doubting the fairness of such a comparison given that it was compiled by IBM, though. I.e. either the Core 2 was not fully taken advantage of, or the results were not exactly the same on both architectures. Or perhaps this is a very special case where a slightly altered input would completely change the results.
 

Npl

Veteran
Internal bandwidth to what, though? Between cores? Between cores and LS? How does that compare with the equivalent on a Core 2? I'm not doubting the results, and some architectural trait of Cell was obviously exploited to attain them. I am doubting the fairness of such a comparison given that it was compiled by IBM, though. I.e. either the Core 2 was not fully taken advantage of, or the results were not exactly the same on both architectures. Or perhaps this is a very special case where a slightly altered input would completely change the results.
I can't speak for that specific test, but Cell's architecture allows it to get closer to its theoretical peak than other processors - provided you can fit the code to suit it.
You have an LS that's highly predictable: you (or the compiler) know how long reads/writes will take and - in the optimal case - can effectively remove all stalls caused by memory latency. Further, having separate local stores allows the SPUs to operate independently of each other - no shared cache that hinders scaling across multiple cores.
 

3dilettante

Legend
Alpha
Woodcrest has an FSB that provides ~10 GB/sec.
Cell has an on board memory controller that provides ~25 GB/sec.

Woodcrest is already considered FSB-limited in many workloads.

It's not enough to explain the great disparity between Cell and Woodcrest, but it likely explains part of it.

Cell is designed to work with a memory subsystem that isn't as limiting on scalability as it is for Intel.
Coherency may also be an issue, but there are system and program details that can make it worse or make it less of an issue.
 

Jesus2006

Regular
Woodcrest has an FSB that provides ~10 GB/sec.
Cell has an on board memory controller that provides ~25 GB/sec.

Woodcrest is already considered FSB-limited in many workloads.

It's not enough to explain the great disparity between Cell and Woodcrest, but it likely explains part of it.

Cell is designed to work with a memory subsystem that isn't as limiting on scalability as it is for Intel.
Coherency may also be an issue, but there are system and program details that can make it worse or make it less of an issue.

Perhaps it's the interoperability between cores? SPUs can communicate with each other directly over the internal (extremely high bandwidth) EIB, something that "normal" SMP/cache architectures like Woodcrest just don't have.
 

Kryton

Regular
Perhaps it's the interoperability between cores? SPUs can communicate with each other directly over the internal (extremely high bandwidth) EIB, something that "normal" SMP/cache architectures like Woodcrest just don't have.

What about comparisons with HyperTransport-based systems? A MOESI cache allows this if you manage line ownership sensibly (i.e. avoid lots of shared/owned lines).
 

ebola

Newcomer
It could simply be that the act of porting to the Cell architecture forced improvements in the algorithm: sorting traversals for cache coherency, keeping each core's working set separate, and tweaking so that the working set really does fit in LS/L1, given that you control it programmatically.
I really like the clarity Cell gives you with these issues.
 

3dilettante

Legend
Alpha
Any algorithm running on x86 would try to optimize on-chip memory usage.

The downfall for broadcast coherency is that even when such usage is optimal, coherency still generates traffic and adds latency.

In the case of the 2-Cell blade versus the 2 Woodcrests, it is a matter of 2 coherent caches on the Cell side - serving the usually non-critical PPEs - versus a more complex set of caches that on average looks like something between 4 and 8 caches for the performance-critical cores on the x86 platform.
 

ebola

Newcomer
Point taken.

Where my comment is coming from: most programmers I know won't find as much coherence as is really possible when they've got an L2 to lean on, and they (unfairly) cite the lack of L2 as a serious drawback to the SPU's potential.
(OK, so I'm getting bitter as I sit here trawling through someone else's code trying to trace all the damn pointers...)

Agreed, the intra-local-store transfers are the biggest potential physical difference that IBM would be likely to exploit in a benchmark designed to show Cell's superiority.
That still involves some latency though, doesn't it (and manual management...)?

Woodcrest... that does at least share L2 between its 2 cores, doesn't it? So at least it's not totally stuffed sharing results between cores on the same die.
I suppose there's OOOE, which can hide some latency too.
 

ADEX

Newcomer
It could simply be that the act of porting to the Cell architecture forced improvements in the algorithm: sorting traversals for cache coherency, keeping each core's working set separate, and tweaking so that the working set really does fit in LS/L1, given that you control it programmatically.
I really like the clarity Cell gives you with these issues.

I think this is one of, if not *the*, major reasons Cell can perform so well.
When you are programming a conventional CPU, the hardware issues are all hidden away from you. Processors don't like reading small chunks of memory randomly - they're very slow at it. OOO execution and large caches do a remarkably good job of hiding this from the programmer, but can't hide it from the CPU itself, which ends up waiting on memory much of the time.

On the SPEs nothing is hidden; SPE development forces you to think about these issues, and you'll soon find out if your reads from memory aren't efficient. The result is new code which runs a great deal faster than before.
 

MarketingGuy

Newcomer
I think a lot of folks may be missing the obvious. Ask yourself, "What's the difference from a software point of view between a peer core and a synergistic core?" :idea: The right answer will tell you not only why the peer cores have a harder time achieving peak performance but also why the completion time of running on peer cores has so much more variability. I'm just MarketingGuy -- I leave it to you smart technical people to figure this out. ;)
 

3dilettante

Legend
Alpha
In the duel of marketing buzzwords, synergistic has more letters than peer, and more letters=better performance. ;)

There are a lot of reasons - different design constraints, a more specialized platform, greater explicit control of resources, restricted coherence, a restricted application space, and other variables - why the given examples perform so well on Cell.

Other designs decades ago did the same thing, long before someone put the word synergistic to Cell's cores.
 