GPUs vs Cell benchmarked

hey69 · Dec 1, 2005

Read the original post, with links, here:

http://gametomorrow.com/blog/index.php/2005/11/30/gpus-vs-cell/

*-----------------------------------------------------------8OO8---------------
GPUs vs Cell

Blogged under Cell by Barry Minor on Wednesday 30 November 2005 at 7:39 pm
Recently I came across a link on www.gpgpu.org that I found interesting. It described a method of ray-tracing quaternion Julia fractals using the floating point power in graphics processing units (GPUs). The author of the GPU code, Keenan Crane, stated that "This kind of algorithm is pretty much ideal for the GPU - extremely high arithmetic intensity and almost zero bandwidth usage". I thought it would be interesting to port this Nvidia Cg code to the Cell processor, using the public SDK, and see how it performs given that it was ideal for a GPU.

First we directly translated the Cg code line for line to C + SPE intrinsics. All the Cg code structures and data types were maintained. Then we wrote a Cg framework to execute this shader for Cell that included a backend image compression and network delivery layer for the finished images. To our surprise, well not really, we found that using only 7 SPEs for rendering, a 3.2 GHz Cell chip could outrun an Nvidia 7800 GT OC card at this task by about 30%. We reserved one SPE for the image compression and delivery task.

Furthermore, the way Cg structures its SIMD computation is inefficient, as it causes large percentages of the code to execute in scalar mode. This is due to the way they structure their vector data, AOS vs SOA. By converting this Cg shader from AOS to SOA form, SIMD utilization was much higher, which resulted in Cell outperforming the Nvidia 7800 by a factor of 5 - 6x using only 7 SPEs for rendering. Given that the Nvidia 7800 GT is listed as having 313 GFLOPs of computational power and seven 3.2 GHz SPEs only have 179.2 GFLOPs this seems impossible, but then again maybe we should start reading more white papers and less marketing hype.
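For anyone unfamiliar with the AOS vs SOA distinction in the post, here is a rough sketch in Python. It is only an illustration, not the blog author's code: the 4-wide "registers" are modeled as plain lists, and the function names (`dot_aos`, `dot_soa`, `aos_to_soa`) are made up for this example. A dot product is used because it shows the difference clearly.

```python
# Illustrative only: a 4-wide SIMD register is modeled as a list of 4 numbers.
LANES = 4

def dot_aos(a, b):
    """AOS: one register holds (x, y, z, w) of a single vector.

    The multiply is 4-wide, but the sum across lanes is a horizontal
    reduction, and the whole thing produces only one result.
    """
    products = [a[i] * b[i] for i in range(LANES)]  # one 4-wide multiply
    return products[0] + products[1] + products[2] + products[3]

def dot_soa(ax, ay, az, aw, bx, by, bz, bw):
    """SOA: each register holds the same component of 4 different vectors.

    Every operation is lane-wise, so all 4 lanes do useful work and
    4 dot products come out at once.
    """
    return [ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i] + aw[i] * bw[i]
            for i in range(LANES)]

def aos_to_soa(vectors):
    """Transpose 4 (x, y, z, w) tuples into 4 per-component registers."""
    xs, ys, zs, ws = (list(component) for component in zip(*vectors))
    return xs, ys, zs, ws
```

In the SOA form each lane carries a different pixel, so scalar-looking shader math (lengths, comparisons, loop counters) still fills the full register width.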

Carl B · Dec 1, 2005

Sort of already posted by -tkf- here, but probably warrants its own thread in all honesty.

It's a cool experiment they ran.

mckmas8808 · Dec 1, 2005

Not to sound too dumb, but how could this new information help PS3 devs make PS3 games better?

Jov · Dec 1, 2005

mckmas8808 said:
Not to sound too dumb, but how could this new information help PS3 devs make PS3 games better?

Shift the focus of where the _tru_ power is? j/k

Seriously, at least it gives ppl in general or some devs what the Cell is capable of, thus dev more balanced engines/games, more ideas to try, etc..

Bobbler · Dec 1, 2005

mckmas8808 said:
Not to sound too dumb, but how could this new information help PS3 devs make PS3 games better?

Probably not at all -- I'd just take it at face value; it's an experiment.

London Geezer · Dec 1, 2005

Bobbler said:
Probably not at all -- I'd just take it at face value; it's an experiment.

That's what I thought. Cell is a geometry monster, which we all knew already. Problem is that RSX will almost NEVER be geometry limited, as it will be fillrate or bandwidth hungry much sooner than that.

So Cell will have to be used for something other than geometry if it is to really help RSX.

ShootMyMonkey · Dec 1, 2005

I'd be interested to know what sort of depth he's calculating those Julias to. I showed Keenan my own Julia set raycast code about 3 and a half years ago (regular old software thing for an S&V piece), and it was capable of running in realtime on a PC back then (640x480), but only if you limited the number of iterations of each point sample down to about 10-12 iterations. That makes for a pretty fast, but relatively low-detail low-quality result.
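To make the iteration-depth point concrete, here is a minimal Python sketch of the per-sample loop for a quaternion Julia set, iterating q <- q^2 + c up to a fixed cap. It is not Keenan Crane's shader or the code described above; the names `quat_square` and `escape_iterations`, the default cap of 12 and the bailout value of 4.0 are all assumptions chosen for illustration.

```python
def quat_square(q):
    """Square a quaternion q = (w, x, y, z).

    With q = w + v (v the vector part), q^2 = (w^2 - |v|^2) + 2*w*v.
    """
    w, x, y, z = q
    return (w * w - x * x - y * y - z * z, 2 * w * x, 2 * w * y, 2 * w * z)

def escape_iterations(q, c, max_iter=12, bailout=4.0):
    """Iterate q <- q^2 + c and count steps until |q|^2 exceeds bailout.

    Returns max_iter if the point never escapes within the cap, which is
    the usual "treat it as inside the set" approximation.
    """
    for i in range(max_iter):
        if sum(t * t for t in q) > bailout:
            return i
        squared = quat_square(q)
        q = tuple(s + k for s, k in zip(squared, c))
    return max_iter
```

The cost per sample grows linearly with `max_iter`, and a low cap misclassifies points that would have escaped later, which is where the loss of detail comes from.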

one · Dec 1, 2005

How does this code perform on a general-purpose processor such as P4/A64?

Urian · Dec 1, 2005

I have always believed that one of the big problems today is the lack of optimization of code targeted at components like the system GPU.

Perhaps I'm mistaken here, but... does a tool like gprof exist for code written directly for GPUs?

Thowllly · Dec 2, 2005

A GF7800GT has 8x2x20x400 = 128 GFLOPs of computing power (counting only MADs) in the pixel shaders. So 7 SPEs should be able to beat it, but it's strange that they are as much as 5 to 6 times faster.
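The peak numbers being thrown around can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope helper, not a benchmark: `peak_gflops` is a made-up name, the 7800 GT factors are one reading of the 8x2x20x400 product (8 flops per 4-wide MAD, 2 ALUs, 20 pixel pipelines, 400 MHz), and the SPE figure assumes one 4-wide single-precision multiply-add per cycle (8 flops).

```python
def peak_gflops(units, flops_per_unit_per_clock, clock_mhz):
    """Theoretical peak = units * flops per unit per clock * clock rate."""
    return units * flops_per_unit_per_clock * clock_mhz / 1000.0

# 7800 GT pixel shaders, reading the product as 8 x 2 x 20 x 400 (assumption)
gt_pixel = peak_gflops(units=20, flops_per_unit_per_clock=8 * 2, clock_mhz=400)

# Seven SPEs at 3.2 GHz, one 4-wide multiply-add (8 flops) per cycle each
seven_spes = peak_gflops(units=7, flops_per_unit_per_clock=8, clock_mhz=3200)

ratio = seven_spes / gt_pixel
print(f"7800 GT pixel shaders: {gt_pixel:.1f} GFLOPs")
print(f"7 SPEs:                {seven_spes:.1f} GFLOPs")
print(f"peak ratio:            {ratio:.2f}x")
```

On these assumptions the peak ratio is well under 2x, so a measured 5 to 6x gap has to come from utilization rather than raw peak throughput.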

one said:
How does this code perform on a general-purpose processor such as P4/A64?

I'm not sure, but AFAIK a 3.2GHz dual core P4 would have the same GFLOP rating as a single SPE (again counting MADs), so I don't think it would do too well in comparison.

The GameMaster · Dec 2, 2005

A Geforce 7800 GTX with 8 vertex pipelines and a clock rate of 550MHz should have roughly 35.2 GFLOPs (another 211 GFLOPs in pixel shaders assuming no texture operations). A Cell processor with 8 SPEs and a clock rate of 3.2GHz should have 218 GFLOPs (7 SPEs at 3.2GHz would make it more like 179 GFLOPs). So yea... I can believe the claim the Cell CPU could theoretically be 5-6 times faster at geometry than a Geforce 7800GTX, only if we are counting geometry. The only problem is that if you were to do everything that is normally done in a modern graphics processor, the Cell's performance would likely be far lower. After all... one of the main reasons why Sony decided to use nVidia's technology is that the Cell processor itself was inferior to even nVidia's Geforce at the time... especially where shaders are concerned.

The general consensus is that we already knew the Cell CPU would be a very powerful processor for geometry so it is not surprising to hear this kind of a claim.

It does bring up an interesting thought of using the Cell to handle the geometry and the Geforce GPU for texturing and shaders (this would be similar to how the Emotion Engine in the PS2 interacted with the Graphics Synthesizer). The only problem is that would make the vertex pipelines kinda useless in the Geforce GPU... and this would increase the difficulty of game development also.

Like most things in life... there are things that we are not being told.

ector · Dec 2, 2005

I did an SSE implementation of this thing yesterday and ran it today on a hyperthreaded P4 3GHz, using both threads. Was able to trace 256x256 at a "decent" number of iterations at about 1-3 fps depending on the parameters. Don't know what this means though, except that both Cell and the nVidia card most likely crush my P4 implementation in raw performance

Fafalada · Dec 2, 2005

TheGamemaster said:
So yea... I can believe the claim the Cell CPU could theoretically be 5-6 times faster at geometry than a Geforce 7800GTX, only if we are counting geometry.

G70 version was a pixelshader program.

Jawed · Dec 2, 2005

http://graphics.stanford.edu/~yoel/notes/

Jawed

one · Dec 3, 2005

From the comments in the blog

# Comment by Barry Minor - December 2, 2005 @ 12:21 pm

Juice,

I used a UP Cell bringup system.
3.2 GHz DD3.1 Cell processor with 8 good SPEs, 512 MB XDR Memory, 100 Mb network.
Pixels were 128 bit, 32 bit float per color channel.
All rendering parameters were the defaults set by the Cg program with the exception of the window size which was increased to 1024×1024.

Oh yeah? It now seems the DD2 disclosed at Cool Chips has already been superseded by DD3.1, which is probably the production model of Cell that appeared at Hot Chips with even more transistors.

Also another comment is interesting in the context of the discussion in b3d threads.

# Comment by Barry Minor - December 2, 2005 @ 2:08 pm

Marco,

I have heard that the dynamic branching in the current Nvidia GPUs may be causing performance degradations in the ray-tracer and that the ATI X1800 may address some of these issues. If anyone has access to an X1800 I would be interested in hearing its 1024×1024 frame rate.

No I didn't modify the code structure (removing branches, unrolling loops, etc) when porting it to Cell. Yes this could be done but I wanted to preserve the code structure so it would be a fair comparison and a simple conversion that any tool chain could achieve. Branch hints were added by the compiler and I didn't add any __BUILTIN_EXPECTs to the code.

mckmas8808 · Dec 3, 2005

Where is that from, one?

nAo · Dec 3, 2005

mckmas8808 said:
Where is that from, one?

it's from the very first page linked in this thread, he was replying to a comment I wrote on that blog

mckmas8808 · Dec 3, 2005

nAo said:
it's from the very first page linked in this thread, he was replying to a comment I wrote on that blog

Oh darn you are Marco. I knew that.

Shifty Geezer · Dec 3, 2005

DD3.1, eh?!

Google hasn't found anything public about changes from DD2. I'm seriously intrigued!

Mintmaster · Dec 3, 2005

After reading the original post of this thread, I also knew that dynamic branching was a major factor here. The 7800 can't do dynamic branching well at all. There is a PVR Voxel PS3.0 demo that came out a long time ago which does raytracing-like calculations, and ATI's X1800XT performed 5-9 times faster, just to give you an idea of how "well suited" this task is to the 7800. In fact, the author comments that the algorithm runs faster in the version without branching.
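A toy model shows why branching granularity matters for this kind of shader. The Python sketch below is not a description of how any particular GPU schedules work: it simply assumes pixels run in lockstep batches, so each batch costs as many iterations as its slowest pixel. The function names, the batch sizes and the sample iteration counts are all hypothetical.

```python
def lockstep_cost(iterations, batch_size):
    """Lane-iterations spent when pixels execute in lockstep batches.

    Every pixel in a batch keeps stepping until the slowest pixel in
    that batch has finished its loop.
    """
    total = 0
    for start in range(0, len(iterations), batch_size):
        batch = iterations[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

def useful_work(iterations):
    """Lane-iterations actually needed if each pixel could stop on its own."""
    return sum(iterations)

# Hypothetical per-pixel loop counts: most rays stop early, one runs long.
per_pixel = [1, 1, 1, 12, 2, 2, 2, 2]
print(useful_work(per_pixel))       # work actually required
print(lockstep_cost(per_pixel, 4))  # coarse batches: slow pixel stalls its neighbors
print(lockstep_cost(per_pixel, 1))  # per-pixel branching: no wasted iterations
```

The coarser the batch, the more iterations are spent on pixels that have already finished, which is one plausible reading of why a branch-heavy ray marcher suits some hardware much better than others.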

Raytracing is not a fast way to render realtime graphics, and only offers an image quality improvement when you toss in things like photon mapping. At that point, speed is far from realtime. A 10x speedup with CELL, if even possible (I doubt it due to branching and prefetch unpredictability) is far from enough, especially in a gaming scenario.

More importantly to the context of this thread, raytracing is not the domain of the CPU, so you're giving RSX a big handicap. If you did scanline rendering on CELL, it couldn't touch RSX, especially when you consider the calculations in doing perspective divide & correction, z-buffer, LOD, filtering (esp. anisotropic), and TEXTURE LOADS. This last one is a huge factor. A texture load requires memory access, and that means latency. The only way to absorb this latency is to have many pixels in flight for each SPE, and that requires a lot of coding effort, especially when you consider memory thrashing if not accessed correctly. GPUs are amazing at reordering memory requests for maximum bandwidth.

Note one more very important consideration: This shader has NO texture accesses at all! There are only two triangles as well for a full screen quad, so the pixel ordering is very regular as well. This is very ideal for Cell, so it's no wonder it did well. Basically you can iterate through the whole ray intersection without any interleaving or data access until you're ready to write out the final pixel.

Take home message: This is a worthless comparison for GPU vs. CPU. There are better examples out there.
