Cell benchmarked

Alpha_Spartan said:
M'kay, got it. So will Cell underclock/overclock itself depending on the task at hand? This may not be useful in a multi-processor grid, but in a home console that also functions as a movie player etc, that may come in handy.

Secondly, there are benchies for 8 SPEs; shouldn't they be reduced to 7, since the 8th SPE (in PS3's implementation) is for redundancy?

The benchmarks aren't for the Cell in PS3, but just for Cell in general. Mercury Systems, for example (as well as any other OEM customers that step up), will be receiving full 8-SPE chips. Also, a lot of the benchmarks are for a single SPE rather than the full 8 - the arrangement used is indicated on a per-benchmark basis at the far right of the benchmark chart.

Sony definitely is working on some sort of power-savings scheme as well, as indicated by the alliance they entered into with Transmeta earlier this year. Link

Not sure if that's for Cell in its present iteration though, or PS3, or what. But anyway, it's interesting info.
 
liolio said:
Can anyone go further in explaining the 20 microseconds for a context change on an SPE?
Is that a good value or not?

I don't know how it compares, but context switches are expensive for any thread that is not "in hardware", on any chip really.

liolio said:
Aaron Spink talks a lot in other threads about context switching being a pain in the ass where SPEs are concerned.
Can this be related to the supposed weakness of Cell in "general purpose" processing and the lack of branch prediction?

I don't think a lack of hardware threading support on the SPUs has much to do with their GP capability.

With a SPU, you'd be looking to put a thread/task on the SPU, execute it as much as possible without any blocks (plan out your memory accesses in advance, bring in data pre-emptively, etc.), and then when that thread/task is finished, it of course makes sense to switch. That's really the only circumstance where you'd want to switch on the SPU - you wouldn't be switching every other cycle or anything like that. Assuming 60 frames per second is your target, each SPU would have ~16,666 microseconds per frame to work with - so obviously if, for example, one task finished after 5,000 microseconds, you're not going to avoid switching to another task for the sake of the 20 microsecond wait, are you? But on the other hand, you're not going to be switching every few cycles either, since the wait is, to put it in other terms, 64,000 cycles. So in summary - avoid blocking, run to completion, then switch.
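The arithmetic in that paragraph can be sketched out quickly (an illustrative calculation using only the figures from the post - the 3.2 GHz clock, the 20 microsecond switch, and the 60 fps target):

```python
# Frame budget vs. context-switch cost on an SPU, using the post's figures.
CLOCK_HZ = 3.2e9    # SPU clock: 3.2 GHz
SWITCH_US = 20.0    # quoted context-switch cost, in microseconds
FPS = 60            # target frame rate

frame_budget_us = 1e6 / FPS                  # ~16,666 microseconds per frame
switch_cycles = SWITCH_US * 1e-6 * CLOCK_HZ  # ~64,000 cycles per switch

# One switch at task completion eats only a tiny fraction of the frame;
# it's fine-grained switching every few thousand cycles that the 20
# microsecond figure rules out.
switch_fraction = SWITCH_US / frame_budget_us  # ~0.12% of the frame
```

So a handful of run-to-completion switches per frame are negligible; it's only fine-grained switching that the 20 microsecond figure makes prohibitive.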

That's ignoring any potential for "packing multiple threads onto a SPU" type work, which for now at least isn't really usable AFAIK.

edit - I don't actually want to prejudge the last point - Alex Chow's presentation at the Fall Processor Forum touched on this again, so maybe it is more usable than was being suggested previously. There's a good summary of his presentation on programming models for Cell here: http://www-128.ibm.com/developerworks/library/pa-fpfunleashing/
 
avaya said:
They're including MADDs.

3.2 * 4 * 2 = 25.6

Mmh, then why aren't they including MADDs for 8-bit ints?

3.2*16*2 = 102.4, not the 51.2 they are talking about.
There's something wrong anyway.
(Side question: where is it written that they do single-cycle MADD?)
 
Ingenu said:
Mmh, then why aren't they including MADDs for 8-bit ints?

3.2*16*2 = 102.4, not the 51.2 they are talking about.
There's something wrong anyway.

MADD isn't there for every datatype, I don't think.
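The peak-rate arithmetic being argued over works out as follows. The 3.2 GHz clock and 128-bit SIMD registers are from the thread; reading the 51.2 figure as evidence that there's no 8-bit MADD is exactly the open question here, not an established fact:

```python
CLOCK_GHZ = 3.2
FP32_LANES = 128 // 32   # 4 single-precision floats per 128-bit register
INT8_LANES = 128 // 8    # 16 bytes per 128-bit register

# Counting a fused multiply-add (MADD) as 2 ops per lane per cycle:
fp32_madd = CLOCK_GHZ * FP32_LANES * 2    # 25.6 GFLOPS (avaya's number)
int8_madd = CLOCK_GHZ * INT8_LANES * 2    # 102.4 GOPS (Ingenu's number)

# Counting only 1 op per lane per cycle, i.e. no 8-bit MADD:
int8_single = CLOCK_GHZ * INT8_LANES * 1  # 51.2 GOPS (the quoted figure)
```

The quoted 51.2 GOPS matches the no-MADD case, which is at least consistent with the reply that MADD isn't there for every datatype.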
 
Alpha_Spartan said:
Do you mean 20ms or 20ns, cuz 20ms seems kinda high for context switching.
Microseconds. That's 1/1000th of a millisecond and 1000x a nanosecond. At 1 GHz, 1 microsecond = 1000 cycles. At 3.2 GHz, that's 3200 cycles. 20 of those is 64,000 cycles. Sounds like a lot to me! Not that I imagine one would be context-switching SPEs very often. That'd go against their ideal function.
 
xbdestroya said:
The benchmarks aren't for the Cell in PS3, but just for Cell in general. Mercury Systems, for example (as well as any other OEM customers that step up), will be receiving full 8-SPE chips. Also, a lot of the benchmarks are for a single SPE rather than the full 8 - the arrangement used is indicated on a per-benchmark basis at the far right of the benchmark chart.

Sony definitely is working on some sort of power-savings scheme as well, as indicated by the alliance they entered into with Transmeta earlier this year. Link

Not sure if that's for Cell in its present iteration though, or PS3, or what. But anyway, it's interesting info.
Okay. I saw PS3 in the intro so I thought this was discussing the PS3 implementation. Others are mentioned as well though.
 
DarkRage said:
Ok, that is a perfect explanation showing that the code is not optimized for any platform other than Cell.
Because you can issue commands to the 970's cache in order to prefetch the relevant data, avoiding any latency. Even more, if access is sequential (it is a bitmap, as you say), the L2 and L1 caches will perform the "buffering" with the same efficiency (a different way of buffering, but with the same results).
Of course, you can forget about any optimization on the 970. But it is an unfair comparison between heavily optimized code on the Cell and crappy code on the other platforms.
I'm not sure how you got the idea that it's "heavily optimized code on the Cell". Is it "heavily optimized" when code is built around SPE LS?
 
Alpha_Spartan said:
M'kay, got it. So will Cell underclock/overclock itself depending on the task at hand? This may not be useful in a multi-processor grid, but in a home console that also functions as a movie player etc, that may come in handy.

Secondly, there are benchies for 8 SPE's, shouldn't they be reduced to 7 since the 8th SPE (in PS3's implementation) is for redundancy?
I remember reading that whole SPEs or parts of them will be switched off if not needed. I doubt there will be independent clock-throttling, as the EIB seems to be fixed at a factor of the PPE/SPE frequency (an asynchronous bus would be more complicated and possibly higher latency). So if there's clock-throttling, it's everything or nothing.

If the SPEs' full performance isn't needed, they will just idle often, waiting for external conditions like more data, switching off most of the logic until they can continue.
 
Zeross said:
They explained the results of the 970 in another paper - I don't remember which one - but the explanation is that the 970 is not FP-bound in the TRE program, it is memory-bound: a lot of cycles are wasted just waiting for memory. The SPEs, thanks to their fast local store, are more capable of approaching their peak performance.


The TRE line (grey highlighted) @ 30fps shows the use of the Cell BE, which presumably is the Broadband Engine, or 4 Cell chips coupled together... i.e. 4 PPEs + 32 SPEs, IIRC.

Is that correct? If so, the Cell isn't realistically rendering that data; it's probably achieving 1/4 of that, or around 7.5fps...
 
Lord Darkblade said:
The TRE line (grey highlighted) @ 30fps shows the use of the Cell BE, which presumably is the Broadband Engine, or 4 Cell chips coupled together... i.e. 4 PPEs + 32 SPEs, IIRC.

Is that correct? If so, the Cell isn't realistically rendering that data; it's probably achieving 1/4 of that, or around 7.5fps...

That would be pretty deceiving if they got those numbers from coupling 4 Cell chips together and comparing it to one CPU (that isn't a Cell).
 
DarkRage said:
Because you can issue commands to the 970's cache in order to prefetch the relevant data, avoiding any latency. Even more, if access is sequential (it is a bitmap, as you say), the L2 and L1 caches will perform the "buffering" with the same efficiency (a different way of buffering, but with the same results).
Well, I can't comment on this because I don't know enough about PPC to know what degree of data prefetching and double-buffering it can achieve with the L1 caches. Still, if you don't take these examples as legitimate comparisons - something no one can prove without looking at the code IBM used - how about the Mercury demo? Do you think after years of working on conventional processors they haven't achieved optimized code, and the 100x improvement of Cell is just because of optimizations?
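The local-store double buffering being compared here follows a standard ping-pong pattern. A minimal sketch in plain Python - the `fetch` and `process` helpers are made-up stand-ins for an asynchronous DMA get and the per-chunk computation; on a real SPE these would be DMA transfers into local store:

```python
def fetch(source, i, chunk):
    """Stand-in for an async DMA get: pull one chunk into a 'local' buffer."""
    return source[i * chunk:(i + 1) * chunk]

def process(buf):
    """Stand-in for the per-chunk computation."""
    return sum(buf)

def double_buffered(source, chunk):
    """Overlap the fetch of chunk i+1 with the processing of chunk i."""
    n = len(source) // chunk
    total = 0
    buf = fetch(source, 0, chunk)       # prime the first buffer
    for i in range(n):
        # Kick off the "DMA" for the next chunk before working on this one.
        nxt = fetch(source, i + 1, chunk) if i + 1 < n else None
        total += process(buf)           # compute while the next chunk is "in flight"
        buf = nxt                       # swap buffers
    return total

# double_buffered(list(range(8)), 2)  # -> 28, same as sum(range(8))
```

On sequential access the 970's hardware prefetch and caches achieve a similar overlap automatically, which is DarkRage's point; the SPE version just makes the schedule explicit in software.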
 
Titanio said:
(The answer is no, of course, while RSX has vertex shaders. Although if you look in the other Cell-related thread this afternoon, it seems it might be that NVidia is indeed providing a Cg compiler for Cell)

Do you have the link?
 
Lord Darkblade said:
The TRE line (grey highlighted) @30fps shows using the cell BE, which presumably is the Broadband Engine, or 4 Cell chips coupled together... ie: 4 PPC + 32 SPEs iirc.

A "Cell Broadband Engine" isn't any particular configuration, so I'm not sure where you're getting that from.

The original paper makes it clear it's an 8-SPU Cell, then at 3.2GHz IIRC. This gives us figures for a 3.2GHz Cell vs a 2.7GHz G5. It should be pretty obvious it's an 8-SPU Cell - a 32-SPU Cell would obviously be much greater than 35x the performance of the 2.7GHz G5, if an 8-SPU Cell was 50x greater than a 2GHz G5.

A dual-blade Cell was also used, but it was giving 75x the performance (at 2.4GHz).
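A rough cross-check of that reasoning, assuming (simplistically) that performance scales linearly with clock on both chips; the 50x and 35x figures are the ones quoted in the post:

```python
# 8-SPU Cell @ 2.4 GHz was quoted as ~50x a 2 GHz G5.
base_speedup = 50.0

# Rescale to the 3.2 GHz Cell vs 2.7 GHz G5 pairing:
speedup = base_speedup * (3.2 / 2.4) * (2.0 / 2.7)   # ~49x

# ~49x is the same ballpark as the ~35x TRE figure, which fits a single
# 8-SPU chip; a 32-SPU configuration would land roughly 4x higher still.
```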
 
Jawed said:
What memory bandwidth does this Apple G5 have?

The Apple G5s seem to have a variable FSB that is 1/3 the core clock. link

PS. Or, depending on how you want to look at it, the G5 has a fixed FSB and Pentium 4 and the like have a variable one.
 
BlueTsunami said:
That would be pretty deceiving if they got those numbers from coupling 4 Cell chips together and comparing it to one CPU (that isn't a Cell).
The Terrain Demo was on a dual Cell blade. That is quite an iffy statistic to chuck in there, though, if they haven't normalised the performance to a standard single Cell.
 
Titanio said:
A "Cell Broadband Engine" isn't any particular configuration, so I'm not sure where you're getting that from.

The original paper makes it clear it's an 8-SPU Cell, then at 2.4GHz IIRC. The 50x performance improvement there was comparing the 2.4GHz Cell to a 2GHz G5. This gives us figures for a 3.2GHz Cell vs a 2.7GHz G5.

I guess it was a 3.2GHz Cell.
 
Shifty Geezer said:
The Terrain Demo was on a dual Cell blade. That is quite an iffy statistic to chuck in there, though, if they haven't normalised the performance to a standard single Cell.

Well, looking at the link (about the Terrain Demo itself), it seems that one 3.2GHz Cell was rendering the Terrain Demo @ 50FPS (am I correct about that?), so if some type of optimisation has been implemented between then and now, it could possibly be doing that framerate now.
 