Cell benchmarked

darkblu · Dec 1, 2005

ok, this is getting out of hand.

can we cut down on the totally pointless, purely oppositional and often rude remarks?

Edge · Dec 1, 2005

I love discussing technology too, but we get pages and pages of posts from people cutting down CELL's DP performance, from the very same people who rant against CELL day in and day out, just because they now, let me get this straight, say they don't like the hype surrounding CELL and want to inject us with a dose of reality.

Well I'm pointing out, you don't need DP performance in a game console, and that's a dose of reality. So keep that in mind.

patsu · Dec 1, 2005

These (The 2 Cell tech threads) are all great stuff. I think similar to the other long Cell thread, some comments are directed at the Cell architecture, and others at the Cell implementation DD2. The DP discussion probably fits better as a standalone architecture discussion using DD2 as a reference, but there's no need to criticize DD2 for what it's not.

DD2 is designed for PS3. Crippled/defective ones are going to Toshiba HDTVs (as I recall), and 8-SPE ones to blade servers (MRI and other single-precision scientific apps that seem to need custom hardware support but aren't getting any because all the money has gone to DP and expensive supercomputing work). The IBM Cell SDK forum has a few posts/suggestions from visitors.

Thanks for keeping me company guys, I've been up 50 hours with no sleep (well, 2 hours here and there). Keep it going and I promise I won't be one of those Chinese MMORPG players who died in front of their monitors.

aaronspink · Dec 2, 2005

Edge said:
And in a forum labeled "Console Technology", this is important for what?

In this context:

Originally Posted by Shifty Geezer
Very good indeed even, if they're IEEE Flops. I'd like to know what the increase in size is though. Presumably quite a hefty jump in size and costs, else why didn't they use improved DP on the current Cell version? Unless it's a simple 'trick' that wasn't obvious but can get greater DP boosts at little extra cost.

It is important, as it roughly quantifies the cost of making the SPE fully support pipelined DP operations. So get that bee out of your bonnet.

Aaron Spink
speaking for myself inc.

Fafalada · Dec 2, 2005

aaron said:
In this context:

IIRC it was you that brought the particular context of DP into the discussion in the first place.

Titanio said:
Does your rate halve with every light..?

Define 'light'.
We can literally go from illumination models where each added light adds no per vertex/fragment cost at all - to models where a single light could be prohibitively expensive to even talk about realtime processing.

For that particular benchmark - the model I gave as an example earlier in this thread (3-4 parallel lights) could fit given the performance stated. But it's still just a guess, and I could be way off.

AlphaWolf · Dec 2, 2005

Fafalada said:
IIRC it was you that brought the particular context of DP into the discussion in the first place.

Nope, wasn't him.

aaronspink · Dec 2, 2005

Fafalada said:
IIRC it was you that brought the particular context of DP into the discussion in the first place.

Nope, check again.

Aaron Spink
speaking for myself inc.

Shifty Geezer · Dec 2, 2005

Well, I'm going to push on with DP talk regardless, 'coz I'm reckless like that!

(I'd like to know why PS3's Cell isn't managing 1:2 performance SP vs DP floats).

So Aaronspink, you're saying a DP enhanced SPE occupies something like 3x the space of a conventional SPE? Or to go to DP enhanced would lose 2 SPEs from the 1:8 config? Assuming the same footprint for this DP+ Cell, would it be 1:6 or nearer 1:4?

I guess even at 1:6, if Sony want their redundancy in, they're dropping down to a 1:5 instead of a 1:7, for much less SP capacity at the gain of unnecessary DP. Which makes sense for a 'SP only' format for the console, while IBM go on to provide a DP chip for scientific applications off the back of the same architecture.

Fafalada · Dec 2, 2005

Shifty Geezer said:
I'd like to know why PS3's Cell isn't managing 1:2 performance SP vs DP floats

The SPEs' DP is only partially pipelined - so with DP it's reduced to issuing once per 7 cycles instead of once every cycle like with SP.
It's obvious that the only reason DP is even supported is for forward compatibility - high performance DP would be a complete waste of die-area in PS3 Cell.

aaronspink · Dec 2, 2005

Shifty Geezer said:
Well, I'm going to push on with DP talk regardless, 'coz I'm reckless like that! (I'd like to know why PS3's Cell isn't managing 1:2 performance SP vs DP floats).

The DP hardware in the SPEs isn't pipelined. For SP operations, there is a result latency of 7 cycles (IIRC) and an issue rate of 1 every cycle. SP operations are 4 wide SIMD.

For DP operations there is a result latency of 7 cycles and an issue rate of 1 every 7 cycles. DP operations are 2 wide SIMD. I don't know the exact hardware config of the DP MACs to know if they are sharing any logic with the SP hardware.
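As a rough worked example of the issue rates and SIMD widths quoted above: this is only a back-of-the-envelope sketch. It assumes each issued operation is a multiply-add counted as 2 flops per lane, and it assumes a 3.2 GHz clock, which is not stated in this thread. The function name is made up for illustration.

```python
from fractions import Fraction


def flops_per_cycle(simd_width, cycles_per_issue, flops_per_lane=2):
    """Peak flops per cycle for one SPE.

    flops_per_lane=2 is an assumption: every issued op is a multiply-add.
    """
    return Fraction(simd_width * flops_per_lane, cycles_per_issue)


# Figures from the post: SP is 4 wide, 1 issue per cycle;
# DP is 2 wide, 1 issue every 7 cycles.
sp = flops_per_cycle(simd_width=4, cycles_per_issue=1)
dp = flops_per_cycle(simd_width=2, cycles_per_issue=7)
ratio = sp / dp  # SP:DP peak throughput ratio

CLOCK_GHZ = Fraction(32, 10)  # assumed clock, not from the thread
sp_gflops = float(sp * CLOCK_GHZ)
dp_gflops = float(dp * CLOCK_GHZ)

print(f"SP: {sp} flops/cycle, DP: {dp} flops/cycle, ratio {ratio}:1")
print(f"Per SPE at the assumed clock: SP {sp_gflops:.1f} GFLOPS, DP {dp_gflops:.2f} GFLOPS")
```

Under those assumptions the gap comes out at 14:1 rather than the 2:1 you would get from SIMD width alone, with the factor of 7 coming entirely from the issue rate.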

Shifty Geezer said:
So Aaronspink, you're saying a DP enhanced SPE occupies something like 3x the space of a conventional SPE? Or to go to DP enhanced would lose 2 SPEs from the 1:8 config? Assuming the same footprint for this DP+ Cell, would it be 1:6 or nearer 1:4?

No, a DP MAC will occupy on the order of 2x the space of 2 SP MACs. This is due to the much greater size of the multiplier needed for a 64x64 + offsets multiplication. http://www.iccd-conference.org/proceedings/2001/12000497.pdf provides some area estimates for SP and DP floating point multipliers.
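A quick sanity check on that "order of 2x" figure: this is a first-order sketch only, assuming multiplier area grows with the square of the significand width (as for a plain array multiplier) and ignoring adders, alignment shifters and rounding logic. The numbers are not taken from the linked paper.

```python
SP_SIGNIFICAND_BITS = 24  # IEEE 754 single: 23 stored bits + implicit leading 1
DP_SIGNIFICAND_BITS = 53  # IEEE 754 double: 52 stored bits + implicit leading 1


def multiplier_area(bits):
    """Relative area of an n x n array multiplier (assumed proportional to n^2)."""
    return bits * bits


two_sp = 2 * multiplier_area(SP_SIGNIFICAND_BITS)  # two SP multipliers
one_dp = multiplier_area(DP_SIGNIFICAND_BITS)      # one DP multiplier
area_ratio = one_dp / two_sp

print(f"one DP multiplier ~ {area_ratio:.2f}x the area of two SP multipliers")
```

With this crude model the ratio lands a little above 2, which is at least consistent with the estimate in the post.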

If you look at the SPU on the die photo, the non-pipelined DP MAC occupies roughly 1/2 the area of the 4 SP MAC units. IBM likely saved a significant amount of space by not making the DP MACs pipelined. IBM has two options: they can either put in 2 pipelined DP MACs with the necessary logic to also do 2 SP MACs, or just increase the functionality of the current DP MAC block to support pipelined operation.

It would be nice if I could find die photos of two processors of the same micro-architecture, one supporting SSE and one supporting SSE2, because that would give a good idea of the area difference we are talking about.

Realistically, IBM will probably just increase the die size. Looking at the layout, to fit in the fully pipelined DP MACs they will likely require more horizontal real estate.

Aaron Spink
speaking for myself inc.

Titanio · Dec 2, 2005

Fafalada said:
Define 'light'.
We can literally go from illumination models where each added light adds no per vertex/fragment cost at all - to models where a single light could be prohibitively expensive to even talk about realtime processing.

Indeed - I was thinking of an OpenGL-style lighting model.

Fafalada said:
For that particular benchmark - the model I gave as an example earlier in this thread (3-4 parallel lights) could fit given the performance stated. But it's still just a guess, and I could be way off.

Cheers, I missed your earlier post! It's as good a guess as any.

Heinrich4 · Dec 2, 2005

Titanio said:
Possibly, but I guess we can't know for sure. That detail would have been nice. I guess since it was a TnL demo, though, at least one light needs to be involved. Does your rate halve with every light..?

I was imagining the case where the SPE processes vertices (SIMD/MIMD etc.) in a similar way to a GPU.

And if the SPE has the same "behavior" as an early GPU, it would reach 10-15% of the total of 800 MVtx/sec (Gouraud shaded, without effects) for 8 lights in the scene.

Gubbi · Dec 2, 2005

Heinrich4 said:
<snip>

*parse error*

Cheers

Heinrich4 · Dec 2, 2005

Gubbi said:
*parse error*

Cheers

I've edited my post... is it easier to understand now?

(sorry if not ... English is not my native language)

one · Mar 1, 2006

Just FYI, on the "Optimization of Triangle Transform and Lighting" benchmark
http://www-128.ibm.com/developerwor...reeDisplayType=threadmode1&forum=739#13794471

Re: Where to get the source of the used applications examples?
Originally posted: 2006 March 01 08:34 AM
brokensh

I assume that you are referring to the vertex transformation and lighting workload benchmarked in the CBE Performance paper. I would love to share the code with the readers on the forum, but to do so requires significant legal approval within IBM.
Short of that, the workload is fairly simple.

For each 3-component vertex and normal pair:
1) apply a 4x4 matrix transformation on the vertex w/ perspective division.
2) compute a standard OpenGL lighting equation including ambient, diffuse, and specularity.
3) clamp the resulting color and pack into a 32-bit (8-bits per component) color.
4) store the NDC coordinate and pack color back out to memory.
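The four steps listed above can be sketched roughly as follows. This is not IBM's benchmark code (which, per the post, was not released); it is a minimal scalar Python sketch that assumes a single directional light, a row-major matrix, normals already in the space where lighting is done, and an RGBA byte order for the packed color. All function names are made up for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def transform(m, v):
    """Step 1: apply a row-major 4x4 matrix to (x, y, z, 1), then divide by w."""
    x, y, z, w = (dot(row, (*v, 1.0)) for row in m)
    return (x / w, y / w, z / w)


def light(normal, light_dir, half_vec, ambient, diffuse, specular, shininess):
    """Step 2: ambient + diffuse + specular for one directional light (assumed)."""
    n_dot_l = max(dot(normal, light_dir), 0.0)
    n_dot_h = max(dot(normal, half_vec), 0.0)
    # No specular highlight on surfaces facing away from the light.
    spec = n_dot_h ** shininess if n_dot_l > 0.0 else 0.0
    return tuple(a + d * n_dot_l + s * spec
                 for a, d, s in zip(ambient, diffuse, specular))


def pack_rgba8(rgb, alpha=1.0):
    """Step 3: clamp to [0, 1] and pack as 0xRRGGBBAA (assumed layout)."""
    packed = 0
    for c in (*rgb, alpha):
        byte = int(round(min(max(c, 0.0), 1.0) * 255))
        packed = (packed << 8) | byte
    return packed


def transform_and_light(m, vertex, normal, **light_params):
    """Step 4: return the NDC position and packed color for one vertex."""
    ndc = transform(m, vertex)
    color = pack_rgba8(light(normalize(normal), **light_params))
    return ndc, color
```

A real SPE version would process several vertices per SIMD instruction and stream them through local store; the sketch only shows the per-vertex arithmetic.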

darkblu · Mar 2, 2006

one said:
Just FYI, on the "Optimization of Triangle Transform and Lighting" benchmark
http://www-128.ibm.com/developerwor...reeDisplayType=threadmode1&forum=739#13794471

one, what brokensh talks about is nothing cell specific. that's the general algorithm (sans the fact that lighting is calculated in a different space, but that's a detail).

one · Mar 2, 2006

darkblu said:
one, what brokensh talks about is nothing cell specific. that's the general algorithm (sans the fact that lighting is calculated in a different space, but that's a detail).

Well, the fact that they tested a general model is the confirmation of what Titanio wrote here in this thread so I posted it :smile:

chris1515 · Mar 3, 2006

http://www-128.ibm.com/developerwor...reeDisplayType=threadmode1&forum=739#13795131

On this page you can find a document about performance evaluation of the communication mechanism of the CELL
http://hpc.pnl.gov/people/fabrizio/publications.html
