Cell benchmarked

DarkRage said:
Ok, so it looks like all of you believe that this comparison is fair? That all the compilation options and code have received the same attention? The IBM documents even state the effort needed to extract as much performance as possible from the SPEs, but give no details on the implementation on the 970, for example.

Well, bogging the article down with implementation details on other processors, when the article is supposed to be about Cell, doesn't make much sense. But it is not strictly as you say - for example in the TnL example they do explicitly say that the G5 implementation was a "best effort".

DarkRage said:
For example, anyone have a valid explanation, from an architectural point of view, about why a single SPE is 4 times more efficient with Ray Tracing than a 970+Altivec?

Maybe the memory architecture? I'm not an expert on RT implementations, maybe someone else can comment more fully.

edit - beaten by Zeross.
 
Have they given out the power rating of Cell yet?

They bragged about performance/power ratio, but with no power figure it's kinda hard to know that ratio.
 
V3 said:
Have they given out the power rating of Cell yet?

They bragged about performance/power ratio, but with no power figure it's kinda hard to know that ratio.


This is all we know publicly about Cell's (the SPE's) power consumption:

[Image: cell-8.gif, chart of SPE power draw at various voltages and clock speeds]
 
Seeing how both the G5 and Cell are made by IBM, why would they make the test "unfair" with their own hardware? It's not like the 2 chips are competing, one is for consoles.
 
xbdestroya said:
Good read and good find.

Whatever ends up happening with Cell's push for recognition, I think it likely that it will gain traction in at least a few fields. With several to choose from in which it offers superior performance (and performance per watt I'm sure) to the traditional options, it'll likely stick in one or more places.

Cryptography is one area I hadn't really thought of before.

cryptography, protein folding, weather mapping/prediction, financial modeling...
 
That's only for the SPEs, what about the rest? The EIB, FlexIO, memory controller, PPE, ... And what voltage have they chosen for stable operation?
 
DarkRage said:
For example, anyone have a valid explanation, from an architectural point of view, about why a single SPE is 4 times more efficient with Ray Tracing than a 970+Altivec?
Please, go ahead, including Shifty Geezer. Maybe I have acquired suitable education to understand you ;-) .
Easy-peasy! Raytracing's a pretty simple procedure requiring only a little code, which'll fit neatly into the LS's. The scenery data was generated on the fly from 2 bitmaps, which is a regular data structure that can easily be double/triple buffered with DMAs to ensure the SPEs always have data to work on. This means all the code and data is directly available on fast LS, without the latency incurred by needing to work along with L2 cache. Even though the FP of the Altivec may well be comparable with the FP unit of the SPE clock for clock, it's keeping the units fed with data that the SPEs are good at when the data accesses are pretty straightforward.
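To make the double-buffering point concrete, here is a minimal Python sketch. It is not taken from IBM's code: a plain slice copy stands in for the DMA transfer into local store, and the names `process_double_buffered`, `pipeline_time` and `kernel` are made up for illustration. The timing model assumes fetch and compute overlap perfectly, which is the best case.

```python
def process_double_buffered(source, chunk_size, kernel):
    """Run `kernel` over `source` in chunks, ping-ponging between two buffers."""
    n_chunks = (len(source) + chunk_size - 1) // chunk_size
    buffers = [None, None]  # stand-ins for two local-store buffers
    results = []
    if n_chunks == 0:
        return results
    # Prime the pipeline: fetch the first chunk into buffer 0.
    buffers[0] = source[0:chunk_size]
    for i in range(n_chunks):
        cur, nxt = i % 2, (i + 1) % 2
        if i + 1 < n_chunks:
            # Start fetching chunk i+1 into the idle buffer. Real DMA would run
            # asynchronously while the kernel below executes; here it is a copy.
            start = (i + 1) * chunk_size
            buffers[nxt] = source[start:start + chunk_size]
        results.append(kernel(buffers[cur]))
    return results


def pipeline_time(n_chunks, fetch, compute, overlapped):
    """Idealized total time for n_chunks, with or without fetch/compute overlap."""
    if n_chunks == 0:
        return 0
    if not overlapped:
        return n_chunks * (fetch + compute)
    # First fetch is exposed, then each step costs the slower of the two stages.
    return fetch + (n_chunks - 1) * max(fetch, compute) + compute
```

With 4 chunks, a fetch cost of 1 and a compute cost of 2 (arbitrary units), this model gives 12 without overlap and 9 with it.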

An interesting comparison to make would be the PPC970 versus a SPE on a typical raytracing scene, of lots of objects and more erratic data fetching. This should see SPE's advantages reduced and the attained performance be a lot less than theoretical max.

As XBD's already suggested, take a peep at this independent company's results using Cell: http://www.mc.com/cell/demo.cfm. In a task very well suited to Cell's architecture, Mercury Computer Systems are getting 100x the performance that they were getting from a 'high end server core'. In contrast, Alias' cloth simulation was only getting a few times the performance of a P4 from Cell. So not every application is going to be much, much faster on Cell, and some'll be worse, but, strangely, those tasks that Cell was designed to do well at, it does well at!
 
V3 said:
That's only for the SPEs, what about the rest? The EIB, FlexIO, memory controller, PPE, ... And what voltage have they chosen for stable operation?

Well, it seems that 1 V is what's been chosen for the PS3, as indicated by the 'PS3 Power Brick' thread on the main console page. The EIB, FlexIO, mem controller - no one knows for sure, to my knowledge. As for the PPE, best guess is 30 W or something, similar to the individual cores in the XeCPU.

I know it's not much to go on, but at least it gives us a picture of some solid power efficiency on the SPE-end of things.
 
V3 said:
That's only for the SPEs, what about the rest? The EIB, FlexIO, memory controller, PPE, ... And what voltage have they chosen for stable operation?
I'm sure it's fairly safe to assume they're akin to a normal processor. Besides, looking at the SPEs, at 3.2 GHz and 1 V they're drawing 3 W. 3 W for 25 GFlops is very efficient. Now if we factor in the extra componentry, if Cell is drawing 2x the power of a PPC970, it's still got far greater performance per watt. So at, say, 10x the FPU performance, as long as Cell isn't drawing 10x the power of the PPC970, it's going to have better performance per watt.
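The arithmetic behind that argument, as a quick Python sketch. The 25 GFlops, 3 W, 10x and 2x figures are the ones quoted in the post, not measurements, and the function names are just for illustration.

```python
def gflops_per_watt(gflops, watts):
    """Raw efficiency figure for one unit."""
    return gflops / watts


def efficiency_ratio(perf_ratio, power_ratio):
    """How many times better perf/W is, given relative performance and power."""
    return perf_ratio / power_ratio


# One SPE, using the numbers from the post: 25 GFlops at 3 W.
print(f"SPE: {gflops_per_watt(25.0, 3.0):.1f} GFLOPS/W")   # SPE: 8.3 GFLOPS/W

# Hypothetical whole-chip comparison against a PPC970.
print(efficiency_ratio(10.0, 2.0))    # 5.0 -> 10x perf at 2x power
print(efficiency_ratio(10.0, 10.0))   # 1.0 -> break-even at 10x power
```

Anything above 1.0 means the chip with the higher performance also wins on performance per watt.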
 
Alpha_Spartan said:
So do the T&L benchmarks lend credence to the assertion that Cell will handle as much T&L work in the PS3 as the EE does in the PS2?

Would it need to?

(The answer is no, of course, since RSX has vertex shaders. Although if you look in the other Cell-related thread this afternoon, it seems NVidia might indeed be providing a Cg compiler for Cell.)
 
Shifty Geezer said:
Easy-peasy! Raytracing's a pretty simple procedure requiring only a little code, which'll fit neatly into the LS's. The scenery data was generated on the fly from 2 bitmaps, which is a regular data structure that can easily be double/triple buffered with DMAs to ensure the SPEs always have data to work on. This means all the code and data is directly available on fast LS, without the latency incurred by needing to work along with L2 cache. Even though the FP of the Altivec may well be comparable with the FP unit of the SPE clock for clock, it's keeping the units fed with data that the SPEs are good at when the data accesses are pretty straightforward.

An interesting comparison to make would be the PPC970 versus a SPE on a typical raytracing scene, of lots of objects and more erratic data fetching. This should see SPE's advantages reduced and the attained performance be a lot less than theoretical max.

As XBD's already suggested, take a peep at this independent company's results using Cell: http://www.mc.com/cell/demo.cfm. In a task very well suited to Cell's architecture, Mercury Computer Systems are getting 100x the performance that they were getting from a 'high end server core'. In contrast, Alias' cloth simulation was only getting a few times the performance of a P4 from Cell. So not every application is going to be much, much faster on Cell, and some'll be worse, but, strangely, those tasks that Cell was designed to do well at, it does well at!

Ok, that is a perfect explanation showing that the code is not optimized for any platform other than Cell.
Because you can issue commands to the cache of the 970 in order to prefetch the relevant data, avoiding any latency. Even more, if access is sequential (it is a bitmap, as you say) the L2 and L1 caches will perform the "buffering" with the same efficiency (a different way of buffering, but with the same results).
Of course, you can forget about any optimization on the 970. But then it is an unfair comparison between heavily optimized code on the Cell and crappy code on the other platforms.
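A toy model of the prefetch argument, as a rough Python sketch. It is not a model of the 970's real cache hierarchy: there is no eviction, a prefetch is assumed to always arrive in time, and the 128-byte line size and the name `count_misses` are assumptions chosen for illustration.

```python
def count_misses(addresses, line_size=128, prefetch_next=False):
    """Count demand misses for a sequence of byte addresses.

    `cached` is an unbounded set of resident line numbers (no eviction).
    With prefetch_next=True, every access also pulls in the following line,
    a crude stand-in for software or hardware sequential prefetch.
    """
    cached = set()
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line not in cached:
            misses += 1
            cached.add(line)
        if prefetch_next:
            cached.add(line + 1)
    return misses


sequential = range(0, 1024, 4)               # walk 1 KB word by word
scattered = [i * 4096 for i in range(8)]     # jump 4 KB between accesses

print(count_misses(sequential), count_misses(sequential, prefetch_next=True))
print(count_misses(scattered), count_misses(scattered, prefetch_next=True))
```

Under these assumptions the sequential walk drops from 8 misses to 1 with next-line prefetch, while the scattered pattern stays at 8 either way, so the benefit depends entirely on how regular the accesses are.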
 
Shifty Geezer said:
I'm sure it's fairly safe to assume they're akin to a normal processor. Besides, looking at the SPEs, at 3.2 GHz and 1 V they're drawing 3 W. 3 W for 25 GFlops is very efficient. Now if we factor in the extra componentry, if Cell is drawing 2x the power of a PPC970, it's still got far greater performance per watt. So at, say, 10x the FPU performance, as long as Cell isn't drawing 10x the power of the PPC970, it's going to have better performance per watt.
How did you get 3W?
 
DarkRage said:
Ok, that is a perfect explanation showing that the code is not optimized for any platform other than Cell.
Because you can issue commands to the cache of the 970 in order to prefetch the relevant data, avoiding any latency. Even more, if access is sequential (it is a bitmap, as you say) the L2 and L1 caches will perform the "buffering" with the same efficiency (a different way of buffering, but with the same results).
Of course, you can forget about any optimization on the 970. But then it is an unfair comparison between heavily optimized code on the Cell and crappy code on the other platforms.

Actually, we know nothing about the level of optimisation on the G5. How do you know they did not implement these optimisations?
 
Alpha_Spartan said:
How did you get 3W?

The chart in post #24 of this thread.

That chart indicates the watts drawn by the individual SPE's at certain voltages and speeds.
 
Can anyone explain further the 20 microseconds for context switching on an SPE?
Is it a good value or not?
Aaron pink talks a lot in other threads about context switching being a pain in the ass as far as the SPEs are concerned.
Can this be related to the supposed weakness of Cell in "general purpose" processing and the lack of branch prediction?
If I misunderstand (which I believe I do, lol), ignore this post...
 
M'kay, got it. So will Cell underclock/overclock itself depending on the task at hand? This may not be useful in a multi-processor grid, but in a home console that also functions as a movie player etc, that may come in handy.

Secondly, there are benchies for 8 SPE's, shouldn't they be reduced to 7 since the 8th SPE (in PS3's implementation) is for redundancy?
 
Alpha_Spartan said:
M'kay, got it. So will Cell underclock/overclock itself depending on the task at hand? This may not be useful in a multi-processor grid, but in a home console that also functions as a movie player etc, that may come in handy.

Secondly, there are benchies for 8 SPE's, shouldn't they be reduced to 7 since the 8th SPE (in PS3's implementation) is for redundancy?

* 7/8 :p

Most of the benchies are based on a single SPE, anyway.
 
liolio said:
Can anyone explain further the 20 microseconds for context switching on an SPE?
Is it a good value or not?
Aaron pink talks a lot in other threads about context switching being a pain in the ass as far as the SPEs are concerned.
Can this be related to the supposed weakness of Cell in "general purpose" processing and the lack of branch prediction?
If I misunderstand (which I believe I do, lol), ignore this post...
Do you mean 20 ms or 20 ns? Cuz 20 ms seems kinda high for context switching.
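For a sense of scale, a small Python sketch converting each candidate figure into clock cycles. It assumes the 3.2 GHz SPE clock quoted earlier in the thread, and `cycles_for` is just an illustrative helper.

```python
def cycles_for(duration_ns, clock_mhz):
    """Clock cycles elapsed in duration_ns at clock_mhz (integer arithmetic)."""
    return duration_ns * clock_mhz // 1000


CLOCK_MHZ = 3200  # assumed: 3.2 GHz, as quoted earlier in the thread

for label, ns in [("20 ns", 20), ("20 us", 20_000), ("20 ms", 20_000_000)]:
    print(label, "=", cycles_for(ns, CLOCK_MHZ), "cycles")
# 20 ns = 64 cycles
# 20 us = 64000 cycles
# 20 ms = 64000000 cycles
```

So the 20 microsecond figure from the question works out to roughly 64,000 cycles at that clock. Whether that hurts depends on how often a context actually gets swapped.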
 