Maybe it means a single core from a Core 2?
IBM just published a paper about a financial app on Cell; their end result was a dual Cell blade beating a dual Xeon (Woodcrest, quad core) by 11x on single precision and 3x on double precision.
That's 8 high-end Core 2 cores vs 16 SPEs, but the margin is way bigger than the difference in core counts.
Flops aren't everything. Cell also has huge internal bandwidth, dunno how many times higher than a Core Duo.

One quad core at 3GHz theoretically achieves more than double the DP FLOPS of Cell, so we are talking a >6x difference between theoretical and "reality" there.
Something sounds off to me.
What it's essentially saying is that a single SPU is 50% faster than a 3GHz Core 2 at DP FLOPS, despite the fact that the Core 2 has about 4-5x higher theoretical peak and a similar transistor-count advantage. Sounds a bit ludicrous to me.
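As a rough sanity check on that claim, here is the back-of-envelope peak arithmetic. The per-core throughput figures are my own approximate assumptions (not from the paper): a Core 2 core retiring 4 DP FLOPS/cycle via SSE2, and an original-Cell SPE managing only ~1.8 DP GFLOPS because its double-precision pipeline is not fully pipelined.

```python
# Back-of-envelope theoretical peak DP FLOPS. All figures are approximate
# assumptions for illustration, not numbers taken from the IBM paper.

core2_core_gflops = 3.0 * 4               # 3 GHz * ~4 DP FLOPS/cycle (SSE2 add + mul)
quad_core2_gflops = 4 * core2_core_gflops # ~48 DP GFLOPS for one quad core

spe_gflops = 1.8                          # ~1.8 DP GFLOPS per SPE (assumed)
cell_gflops = 8 * spe_gflops              # ~14.4 DP GFLOPS for 8 SPEs

print(f"quad Core 2: ~{quad_core2_gflops:.0f} GFLOPS DP")
print(f"Cell (8 SPEs): ~{cell_gflops:.1f} GFLOPS DP")
print(f"ratio: ~{quad_core2_gflops / cell_gflops:.1f}x in Core 2's favour")
```

With those assumptions the quad core's theoretical DP peak is indeed "more than double" Cell's, which is what makes the measured 3x win for Cell so surprising.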
Flops aren't everything. Cell also has huge internal bandwidth, dunno how many times higher than a Core Duo.
I can't speak for that specific test, but Cell's architecture allows it to get closer to its theoretical peak than other processors. That's assuming you can fit the code to suit it.

Internal bandwidth to what though? Between cores? Between cores and LS? How does that compare with similar on a Core 2? I'm not doubting the results, and thus some architectural trait of Cell was obviously exploited to attain them. I am doubting the fairness of such a comparison given that it was compiled by IBM, though. I.e. either the Core 2 was not fully taken advantage of, or the results were not exactly the same on both architectures. Or perhaps this is a very special case where a slightly altered input factor would completely change the results.
Woodcrest has an FSB that provides ~10 GB/sec.
Cell has an on board memory controller that provides ~25 GB/sec.
Woodcrest is already considered FSB-limited in many workloads.
It's not enough to explain the great disparity between Cell and Woodcrest, but it likely explains part of it.
Cell is designed to work with a memory subsystem that isn't as limiting on scalability as it is for Intel.
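A roofline-style bound shows how that bandwidth gap could cap achieved FLOPS regardless of peak. The ~10 GB/s and ~25 GB/s figures are from the posts above; the peak-FLOPS numbers and the arithmetic-intensity value are illustrative assumptions.

```python
# Roofline-style bound: attainable GFLOPS = min(peak, bandwidth * arithmetic intensity).
# Bandwidths are from the thread (~10 GB/s Woodcrest FSB, ~25 GB/s Cell memory);
# peaks and the intensity value are made-up illustrative numbers.

def attainable_gflops(peak_gflops, bw_gb_s, flops_per_byte):
    return min(peak_gflops, bw_gb_s * flops_per_byte)

intensity = 0.5  # flops per byte moved from memory (assumed, streaming-style kernel)

woodcrest = attainable_gflops(48.0, 10.0, intensity)  # FSB caps it at 5 GFLOPS
cell      = attainable_gflops(14.4, 25.0, intensity)  # capped at 12.5 GFLOPS

print(f"Woodcrest attainable: {woodcrest} GFLOPS")
print(f"Cell attainable:      {cell} GFLOPS")
```

Under these assumptions the lower-peak Cell comes out ahead for a bandwidth-bound kernel, which is consistent with the point that the disparity is partly (but not wholly) explained by memory bandwidth.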
Coherency may also be an issue, but there are system and program details that can make it worse or make it less of an issue.
Perhaps it's the interoperability between cores? SPUs can communicate with each other directly over the internal (extremely high bandwidth) EIB, something that "normal" SMP/cache architectures like Woodcrest just don't have.
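As a loose analogy (this is not Cell code), the SPU style of work distribution looks like a pipeline where each worker hands results straight to the next, rather than every core contending on one shared structure. Queues stand in for the EIB/mailbox path here.

```python
# Loose analogy of SPU-to-SPU pipelining: each stage receives from its inbox and
# pushes directly to the next stage's inbox, with no shared central store.
# Queues play the role of the EIB/mailboxes; the stage functions are made up.
import queue
import threading

def stage(inbox, outbox, f):
    """Forward transformed items until the None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)
            return
        outbox.put(f(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(q1, q2, lambda x: x * 2)).start()
threading.Thread(target=stage, args=(q2, q3, lambda x: x + 1)).start()

for v in [1, 2, 3]:
    q1.put(v)
q1.put(None)  # sentinel flows through and shuts the pipeline down

results = list(iter(q3.get, None))
print(results)
```

The point of the analogy is only that data flows core-to-core; on a conventional SMP the equivalent hand-off goes through the cache-coherent memory hierarchy instead.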
It could simply be that the act of porting to the Cell architecture forced improvements in the algorithm.
Sorting traversals for cache coherency, keeping each core's working set separate, and tweaking such that the working set really does fit in LS/L1, given that you programmatically control it.
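The "make the working set fit" part can be sketched as plain tiling: process a big array one chunk at a time, with the chunk sized to a fixed local store (256 KB on an SPE). The buffer split and the per-tile kernel here are made-up placeholders.

```python
# Tiling sketch: size each chunk so it fits in a fixed local store, as you would
# when explicitly staging data into an SPE's 256 KB LS. The half-LS budget and
# the per-tile kernel are illustrative assumptions.

LS_BYTES = 256 * 1024
DOUBLE = 8

def tile_elems(budget_bytes=LS_BYTES // 2):
    """Doubles per tile, leaving half the LS for code and double-buffering (assumed split)."""
    return budget_bytes // DOUBLE

def process(data):
    """Walk data one tile at a time so each chunk fits in the local store."""
    step = tile_elems()
    total = 0.0
    for start in range(0, len(data), step):
        tile = data[start:start + step]    # on Cell this slice would be a DMA transfer in
        total += sum(x * x for x in tile)  # placeholder per-tile kernel
    return total

print(process([1.0] * 100_000))
```

On a cached architecture the hardware does this staging implicitly; the post's point is that doing it explicitly forces you to confront the working-set size up front.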
I really like the clarity Cell gives you on these issues.