Cell benchmarked

Cell vs GPU


http://gametomorrow.com/blog/index.php/2005/11/30/gpus-vs-cell/

First we directly translated the Cg code line for line to C + SPE intrinsics. All the Cg code structures and data types were maintained. Then we wrote a Cg framework to execute this shader for Cell that included a backend image compression and network delivery layer for the finished images. To our surprise (well, not really), we found that, using only 7 SPEs for rendering, a 3.2 GHz Cell chip could outrun an Nvidia 7800 GT OC card at this task by about 30%.
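For a flavour of what that line-for-line translation looks like (my own sketch, not code from the actual port): Cg's four-wide float4 arithmetic maps almost directly onto the SPE's 128-bit vector intrinsics.

    /* Illustrative only -- not the actual ported code. Builds with the
       Cell SDK's SPU compiler, which provides spu_intrinsics.h. */
    #include <spu_intrinsics.h>

    /* Cg:  float4 r = a * b + c; */
    vec_float4 madd4(vec_float4 a, vec_float4 b, vec_float4 c)
    {
        /* spu_madd computes a*b + c across all four float lanes in one op */
        return spu_madd(a, b, c);
    }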

By converting this Cg shader from AoS to SoA form, SIMD utilization was much higher, which resulted in Cell outperforming the Nvidia 7800 by a factor of 5-6x using only 7 SPEs for rendering. Given that the Nvidia 7800 GT is listed as having 313 GFLOPS of computational power and seven 3.2 GHz SPEs only have 179.2 GFLOPS, this seems impossible, but then again maybe we should start reading more white papers and less marketing hype.
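For anyone wondering what the AoS-to-SoA rewrite actually changes, here's a rough plain-C sketch (struct names mine): in AoS a vector register holds the x, y, z, w of a single pixel, so component-wise math wastes lanes on cross-lane shuffles; in SoA each register holds the same component of four different pixels, so every SIMD lane always does useful work.

    /* AoS: one struct per pixel. A 4-wide SIMD register holds x,y,z,w of
       ONE pixel, so e.g. a dot product needs cross-lane shuffles. */
    struct RayAoS { float x, y, z, w; };

    /* SoA: one array per component. A 4-wide SIMD register holds the same
       component of FOUR pixels, so four pixels advance per instruction
       with no cross-lane traffic. */
    struct RaysSoA {
        float x[4], y[4], z[4], w[4];
    };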
 
drpepper said:
PC-Engine likes to brag that the Cell doesn't have double precision floating point capabilities, or that they're diminished in some way, yet the article points out a test using Linpack for the DP floating point benchmark. So the Cell is capable of DP-FP work?

What is PC-Engine going on about?

Learn how to read. In fact, the XeCPU has equal DP performance and takes up less die area than Cell.
 
Jawed said:
It does make me wonder if the original Cg is well-tuned for the NVidia hardware, though, in its AoS form :???:
Obviously not. As this thread is showing, all of Cell's performance advantages come from crippling the competing processor. Obviously Nvidia's implementation of their language on their GPUs isn't optimized, accounting for Cell's better performance.

:p

Seriously, the only logical explanation has to be the LS architecture helping, no? If the problem requires 'almost zero bandwidth' then it has to be an efficiency thing? Very confusing.

Following some links, here's a page of what the Cell was doing including an executable if you have Cg installed.
http://graphics.cs.uiuc.edu/svn/kcrane/web/project_qjulia.html
I was going to take a look but I doubt my lowly 4200 has the PS level needed for this.

Here's a direct link to a copy of the code
http://graphics.cs.uiuc.edu/svn/kcrane/web/project_qjulia_source.html

Definitely interesting to see this runs faster on Cell than a GPU. Could it be the difference in loops stalling pixel pipes on different iterations? If I've got this right, the same instruction is performed on each pixel of a 16-pixel quad on the GPU. So if 15 of those pixels fall on an area of the fractal that only iterates once, and the 16th pixel has to be iterated 20 times, won't the other 15 pixels of the quad and their associated pipes be twiddling their thumbs until the 16th has finished? Whereas on Cell each SPE can tackle a pixel at a time, so they'll all be running all the time.
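Back-of-envelope, the cost difference I'm describing would look something like this (a toy model in C, not real shader code):

    /* Toy model: a lockstep batch finishes only when its slowest pixel
       does, so every lane effectively pays max(iterations). */
    int batch_cost(const int iters[16])
    {
        int worst = 0;
        for (int i = 0; i < 16; i++)
            if (iters[i] > worst)
                worst = iters[i];
        return 16 * worst;          /* GPU batch: all lanes run the worst case */
    }

    int per_pixel_cost(const int iters[16])
    {
        int total = 0;
        for (int i = 0; i < 16; i++)
            total += iters[i];      /* SPE-style: each pixel pays its own cost */
        return total;
    }

With 15 pixels at 1 iteration and one pixel at 20, that's 16 x 20 = 320 iterations of work in lockstep versus 15 + 20 = 35 done per-pixel.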
 
Shifty Geezer said:
Definitely interesting to see this runs faster on Cell than a GPU. Could it be the difference in loops stalling pixel pipes on different iterations? If I've got this right, the same instruction is performed on each pixel of a 16-pixel quad on the GPU. So if 15 of those pixels fall on an area of the fractal that only iterates once, and the 16th pixel has to be iterated 20 times, won't the other 15 pixels of the quad and their associated pipes be twiddling their thumbs until the 16th has finished? Whereas on Cell each SPE can tackle a pixel at a time, so they'll all be running all the time.

It's the branching. G70 has terrible branching performance because of its large batch sizes.

http://graphics.cs.uiuc.edu/svn/kcrane/web/project_qjulia.html said:
Unfortunately, GPUs still suffer from coarse branching granularity, which means that incoherent rays will waste time waiting for each other to finish.
 
Just about all the tests in that chart are best-case scenarios for Cell, especially the encryption, which is very streamable.

Considering single precision is not IEEE compliant, and double precision is only fully compliant if you do some icky hacking (see the IBM SPE ISA manual, p. 192), I'm not sure this particular version of Cell has much future in scientific computing.
 
Dunno. There doesn't seem to be a huge amount of branching at first glance, but there's certainly one conditional per iteration. There also seems to be a bug in void iterateIntersect().

Is Cg available on ATI's latest R520? It'd be worth trying it with its branching capabilities.
 
In Table 13 that has been posted, comparing the Cell vs other CPUs, what is the difference between the Cell BE and the Cell with 8 SPEs?...
 
aaaaa00 said:
How is that a bug? The function doesn't return anything. It takes inout parameters.
This line

if( dot( q, q ) > ESCAPE_THRESHOLD )

dot(q,q) is just the squared magnitude |q|^2 IIRC, so why do a dot, and why have a qp variable if you're not using it? I'm guessing this ought to be:

if( dot( q, qp ) > ESCAPE_THRESHOLD )
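For reference, dot(q,q) on a 4-component quaternion is the sum of squares, i.e. the squared magnitude |q|^2 (a quick C sketch of my own; the types and threshold value are illustrative, not taken from the shader source):

    /* Quick sketch of what dot(q,q) computes for a 4-component quaternion. */
    typedef struct { float x, y, z, w; } quat;

    float qdot(quat a, quat b)
    {
        return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
    }

    /* dot(q,q) == |q|^2: the usual escape test checks whether the
       iterated point has left a bounding sphere around the origin. */
    #define ESCAPE_THRESHOLD 10.0f   /* illustrative value */

    int escaped(quat q)
    {
        return qdot(q, q) > ESCAPE_THRESHOLD;
    }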
 
Platon said:
In Table 13 that has been posted, comparing the Cell vs other CPUs, what is the difference between the Cell BE and the Cell with 8 SPEs?...

I'm guessing the one using "Cell BE" is using the PPE also, as opposed to the others which are benched on the SPUs only, so it makes sense to make a distinction.
 
xbdestroya said:
Let me ask you DarkRage, what in the world would be IBM's motive to talking down the 970?

Let me ask you xbdestroya: with Apple going to Intel, how many customers does the 970 have?

The 970 is dead. IBM is taking care not to publish any benchmark against the Power5, for example.
 
Titanio said:
I'm guessing the one using "Cell BE" is using the PPE also, as opposed to the others which are benched on the SPUs only, so it makes sense to make a distinction.

So in those benches they managed to get Cell working without using the PPE at all?...
 
Platon said:
So in those benches they managed to get Cell working without using the PPE at all?...


I suppose they got the whole chip started up (as the PPE is needed for that, I believe), but for the bench only the SPEs were "doing stuff". The PPE wasn't computing anything, therefore the bench was only for the SPEs.
 
Shifty Geezer said:
Definitely interesting to see this runs faster on Cell than a GPU. Could it be the difference in loops stalling pixel pipes on different iterations? If I've got this right, the same instruction is performed on each pixel of a 16-pixel quad on the GPU. So if 15 of those pixels fall on an area of the fractal that only iterates once, and the 16th pixel has to be iterated 20 times, won't the other 15 pixels of the quad and their associated pipes be twiddling their thumbs until the 16th has finished? Whereas on Cell each SPE can tackle a pixel at a time, so they'll all be running all the time.
It's worse than that. No NVidia GPU can work on a batch as small as 16 pixels (or rather, the effective batch size can't be so small).

A 7800GTX works on batches of about 800 pixels. R520 works on batches of 16 pixels, but looking forward to newer ATI GPUs, R520 and RV515 would appear to be the first and last that work on such small batches. Xenos works on batches of 64 pixels.
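To put rough numbers on that (hypothetical iteration counts, purely to illustrate): in an 800-pixel batch where one pixel needs 20 iterations and the other 799 need only 1, all 800 pixels run for 20 iterations. That's 16,000 lane-iterations spent to produce 819 iterations of useful work, about 5% utilization. Cut the batch size to 16 and only the one unlucky batch pays the 20-iteration worst case.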

Well done for bothering to look at the code. I was leaving that to this evening...

Jawed
 
DarkRage said:
Let me ask you xbdestroya: with Apple going to Intel, how many customers does the 970 have?

The 970 is dead. IBM is taking care not to publish any benchmark against the Power5, for example.
You're just continuing to build on your house of cards with all the nonsensical conspiracy theory stuff you're presenting without any facts to back it up. I'd suggest you stop wasting your time, and ours, as you're clearly getting nowhere with it.

I suppose you're really ticked off that Cell performs so well, but hey, it's not our problem. Get a life or something, will ya?
 