Barry Minor video of Cell demo

Nice, thanks for the link!

I couldn't comment on the texturing performance - others would be better placed to do that - but it certainly looks nice, they seem pleased with their results, and the performance gap between the texturing and non-texturing approaches seems quite low :)

I found the data about the software caches interesting. 12-cycle cache hits? That's a lot better than another software cache implementation that was presented before, which sported >20-cycle hits, IIRC, and they think they can do better still. I guess that's the nice thing about software caches - you can tune them to the work you're doing.
 
Very nice demo. The chromatic aberration is a little exaggerated; I don't think there's any material with an index of refraction that is that non-linear. But I guess they did that to show off that yes, it is ray-traced.
 
Good stuff. :cool:

First, colleague Mark Nutter implemented a software cache abstraction layer for the SPE, giving us the ability to both hide the complexity of DMAs and benefit from transparent data reuse.

What a great name!
 
I'll ask a couple of questions for everyone else :p

1) As far as texturing performance goes, how does this compare to a regular CPU? To a GPU? Unfortunately, IBM doesn't provide a point of comparison, except that they're pleased with the result (and it seems implicit in how the article is presented that the performance may be better than expected).

2) Would a software cache with a 12-cycle hit latency on the SPEs make anyone reconsider the desirability of their use for certain tasks? Or is the implementation they used here likely not relevant beyond texturing? It just strikes me that a 12-cycle latency should be quite workable at face value - it's twice as much as a direct read, but it's less than an L2 cache hit on the PPE (which is close to 40 cycles, AFAIK? I know the L1 cache is faster, and direct LS reads are fast, but still... :))
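
On the software cache, for anyone wondering what such a thing might roughly look like: here's a minimal sketch of a direct-mapped version in plain C. All names here are hypothetical and the DMA is stubbed out with a memcpy so it runs on any host - take it as an illustration of why a hit path (mask, compare, index) can be made to cost on the order of a dozen cycles, not as IBM's actual implementation.

Code:
/* Minimal sketch of a direct-mapped software cache, roughly as one might
 * build on an SPE.  Hypothetical names, not IBM's implementation.  On a
 * real SPE the memcpy in fetch_line() would be an MFC DMA from main memory
 * into local store; here it is stubbed so the sketch runs anywhere. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define LINE_SIZE   128                 /* bytes per cache line            */
#define NUM_LINES   64                  /* lines resident in "local store" */

static uint8_t  backing[1 << 20];       /* stands in for main memory       */
static uint8_t  lines[NUM_LINES][LINE_SIZE];
static uint32_t tags[NUM_LINES];        /* line-aligned address, or ~0u    */

static void fetch_line(int slot, uint32_t line_addr)
{
    /* On Cell this would be an async DMA plus a tag-group wait.           */
    memcpy(lines[slot], &backing[line_addr], LINE_SIZE);
    tags[slot] = line_addr;
}

/* The hit path is just a mask, a compare and an index.                    */
static uint8_t *cache_lookup(uint32_t addr)
{
    uint32_t line_addr = addr & ~(uint32_t)(LINE_SIZE - 1);
    int      slot      = (line_addr / LINE_SIZE) % NUM_LINES;

    if (tags[slot] != line_addr)        /* miss: pay the DMA latency       */
        fetch_line(slot, line_addr);

    return &lines[slot][addr & (LINE_SIZE - 1)];
}

int main(void)
{
    for (int i = 0; i < (int)sizeof backing; ++i)
        backing[i] = (uint8_t)i;

    memset(tags, 0xff, sizeof tags);    /* mark every line invalid         */

    printf("%d %d\n", *cache_lookup(5), *cache_lookup(5 + LINE_SIZE));
    return 0;
}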
 
Nice video.

"...we still have ideas pending to further reduce the 12 cycle software cache hit access time so we believe the 13% performance gap between the two shaders will continue to close."

Closing the gap even further sounds good, although only 12 cycles at 3.2GHz already sounds nice. Wonder how far they can push it.
 
When they say 15FPS at 1024^2, I guess I should utter an "OMG" and have my eyes go pop, but considering how jaded all of us are getting these days, spoiled by ever-increasing computing performance, it's hard to really put this in perspective. We'd really need something to compare with. Like, how much would a 3.2GHz P4, for example, manage running the same basic algorithm? Naturally, the implementation and optimizations would be different, but creating the same output, what FPS could one expect to get? 1FPS? 0.5? Even less? :?:

Still, it's cool stuff reading about what looks like real coders who actually try to squeeze max performance out of a chip. So much today is just programmers writing huge shitty bloated code that may or may not necessarily do what it's supposed to.

Oblivion, for example, runs like crap on a fair system without the indoor sections necessarily looking much better than, say, Quake4, and it actually crashes if one switches back to the desktop (or has the desktop pop to the front on its own, like when a buddy of mine called me over GTalk yesterday). Reading stuff like this makes me drool, and feel that maybe there's still hope in this world! :)
 
Pathetic texturing performance

Guys, this is absolutely horrible texturing performance.

A 13% hit down to 15fps means the texturing adds 9 milliseconds to the rendering time per frame. The algebraically ray-traced object, which needs 4 lookups per pixel, occupies less than half of the screen, and the background occupies the rest with one lookup per pixel. So we have 1024x1024x(0.5 + 0.5 * 4) = 2.6 million texture lookups. Divide by 9 milliseconds and we get...

300 Mtexels/sec. Slower than the original Geforce.

But with 10x the transistors and 25x the clockspeed :D
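
(If anyone wants to check that arithmetic, here's the back-of-envelope version, taking the 13% hit and the 15fps figure at face value, plus my 0.5/0.5 lookup-mix guess above:)

Code:
/* Back-of-envelope check of the texel-rate estimate above. */
#include <stdio.h>

int main(void)
{
    double frame_ms   = 1000.0 / 15.0;       /* ~66.7 ms per textured frame  */
    double texture_ms = frame_ms * 0.13;     /* ~8.7 ms of that is texturing */

    /* 1024x1024 pixels, 4 lookups on half the screen, 1 on the other half. */
    double lookups = 1024.0 * 1024.0 * (0.5 * 4.0 + 0.5 * 1.0);
    double mtexels = lookups / (texture_ms / 1000.0) / 1e6;

    printf("texturing: %.1f ms/frame, %.0f Mtexels/s\n", texture_ms, mtexels);
    return 0;
}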

You have to realize they are running a very computationally intensive shader (one that needs virtually no external data, though) that takes hundreds of cycles per pixel. Horrible texturing speed barely makes a dent. I really want to see this shader run on R580. I'm thinking of purchasing an X1600, so maybe I'll make a port if I do. By the way, RSX would take 0.2 milliseconds to texture that same load, not including the refraction calculations.
 
http://gametomorrow.com/blog/index.php/2005/11/30/gpus-vs-cell/

Not sure about R580 or RSX, but it seems that Cell was 5-6 times faster than a 7800 GT OC for just the ray-traced lighting, and then took a 13% hit after adding a 5-pass texture shader. I would guess R580 and RSX would be faster than a 7800 GT OC, but there still seems to be a wide margin to overcome, given performance didn't drop by 75-90% for texturing. They also seem confident they can reduce the hit for texturing on Cell.

Perhaps in some instances the hit is worth it.
 
Mintmaster said:
300 Mtexels/sec. Slower than the original Geforce.

But with 10x the transistors and 25x the clockspeed :D
I don't see whatever point you're apparently trying to make here. Cell isn't geared for 3D texturing; if it were still being marketed as such, then your remark would have had merit.
 
Guden Oden said:
I don't see whatever point you're apparently trying to make here. Cell isn't geared for 3D texturing; if it were still being marketed as such, then your remark would have had merit.
Titanio was inquiring about texture performance.
 
Guden Oden said:
I don't see whatever point you're apparently trying to make here. Cell isn't geared for 3D texturing; if it were still being marketed as such, then your remark would have had merit.
Not only did Titanio ask about it, but did you even read the blog entry?

Title: "Cell Can’t Texture?"
First sentence: "Much has been said about Cell’s presumed inability to texture map well."
Misleading observation: "The performance penalty for using the five pass texture shader vs the lighting only shader was just 13%."

The guy is obviously making a point that Cell is just fine at texturing. If you analyze these results properly, they clearly show you why Sony desperately needed RSX and couldn't rely on Cell for graphics. It can't texture worth a damn, unless these IBM guys don't know how to code.
 
scificube said:
http://gametomorrow.com/blog/index.php/2005/11/30/gpus-vs-cell/

Not sure about R580 or RSX, but it seems that Cell was 5-6 times faster than a 7800 GT OC for just the ray-traced lighting, and then took a 13% hit after adding a 5-pass texture shader. I would guess R580 and RSX would be faster than a 7800 GT OC, but there still seems to be a wide margin to overcome, given performance didn't drop by 75-90% for texturing. They also seem confident they can reduce the hit for texturing on Cell.

Perhaps in some instances the hit is worth it.
This shader has an outer loop for marching the ray and an inner loop for doing the fractal iteration, not to mention a lack of texture lookups until the very end. For a shader like this, I would expect R580 to have well over 10x the performance of G71/RSX. The branching is even more intense than the PVR solid voxel shader (see the last graph on this page).

The only problem is that the shader is written in GL, and I think it uses NV extensions for SM3.0 support, so it'll require a rewrite to run on the ATI cards. I'd do it because it's not too hard and fractals are fun, but I don't have an X1K card. I'm planning on making a version that runs on the 9700 via multipassing with the stencil buffer to do outer loop branching.

Don't expect just a 13% hit on any shader other than this. This is a crazy shader which bears no relation to any real graphics load. Cell will never come close to a GPU in real situations. Consider a shader that uses a geometrical object instead of fractals, with the same refractions and reflections. I'd estimate a theoretical RSX without bandwidth restrictions would render that scene at 5000fps. Multiplying 15 fps by 1/0.13 (or alternatively just inverting the 9 ms of texturing time) gives you 115 fps - that's the ceiling Cell's texturing alone would impose, even if the math were free.
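
To make the structure I'm describing concrete, here's a toy per-pixel sketch in C: an outer ray-marching loop, an inner fractal-iteration loop, and the texture fetch only at the very end. It's purely illustrative - the "distance estimate" is fake and none of this is the actual Cg or SPE code - but it shows why a shader shaped like this hides the cost of texturing.

Code:
#include <math.h>
#include <stdio.h>

/* Toy stand-in for a fractal distance estimate: iterate z -> z^2 + c and
 * turn the escape count into a fake step size.  Not a real DE. */
static float toy_fractal_de(float x, float y)
{
    float zr = x, zi = y;
    int i;
    for (i = 0; i < 32; ++i) {              /* inner loop: fractal iteration */
        float t = zr * zr - zi * zi - 0.4f;
        zi = 2.0f * zr * zi + 0.6f;
        zr = t;
        if (zr * zr + zi * zi > 4.0f)
            break;
    }
    return (i == 32) ? 0.0f : 0.02f + 0.01f * (float)i;
}

/* Stand-in for the texel fetch (on Cell, the software-cache lookup). */
static float texture_lookup(float u, float v)
{
    return 0.5f + 0.5f * sinf(20.0f * u) * sinf(20.0f * v);
}

int main(void)
{
    float ox = 0.0f, oy = -1.5f, dx = 0.1f, dy = 1.0f;   /* one pixel's ray */
    float t = 0.0f;
    int hit = 0;

    for (int step = 0; step < 128 && !hit; ++step) {     /* outer loop: ray marching */
        float d = toy_fractal_de(ox + t * dx, oy + t * dy);
        if (d <= 0.0f)
            hit = 1;             /* "surface" reached                       */
        else
            t += d;              /* step forward by the estimated distance  */
    }

    /* Texture memory is only touched here, after all the per-pixel math.   */
    float shade = hit ? texture_lookup(ox + t * dx, oy + t * dy) : 0.0f;
    printf("t = %.3f, hit = %d, shade = %.3f\n", t, hit, shade);
    return 0;
}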
 
I guess IBM wanted to highlight the characteristics of the software cache as an enabler for textures larger than could be held in the LS at any one time. Obviously for this demo, as atypical as it may be, it's quite reasonable to add texturing.

I'd also wonder if performance would be the same in a pure texturing benchmark - does it scale linearly?

Mintmaster said:
The only problem is that the shader is written in GL, and I think it uses NV extensions for SM3.0 support, so it'll require a rewrite to run on the ATI cards.

If you were doing so to measure performance against the Cell implementation, realise that with a best-effort rewrite the Cell version would probably perform better still - IIRC, IBM mentioned at the time that the algorithm etc. remained unchanged from the Cg implementation. One wonders where performance would be if it were rearchitected from scratch for Cell.
 
You sort of have a point, but take a gander at the original GPU vs. Cell blog entry. Here's what he said:
Barry Minor said:
No I didn’t modify the code structure (removing branches, unrolling loops, etc) when porting it to Cell. Yes this could be done but I wanted to preserve the code structure so it would be a fair comparison and a simple conversion that any tool chain could achieve. Branch hints were added by the compiler and I didn’t add any __BUILTIN_EXPECTs to the code.
Barry's trying to say that Cell is 5-6x faster than a GPU at doing math using only the standard tool chain and no hardcore rewrites. He's not talking about a Cell-oriented algorithm retooling. I was just talking about a straight conversion to an X1K setup, using HLSL instead of Cg so the changes are minimal, and without special optimizations.
 
FWIW, Barry Minor says the whole source code might be made available in the next version of the Cell SDK (and if not, probably the software cache anyway). Should make it possible to assess that cache with other kinds of tasks, which would be neat. For example, even if an SPU was significantly slower than a PPE at tree traversal, it could still be a win to have the SPUs do it if your data is parallelisable. Though a software cache wouldn't necessarily be the only answer to doing that on the SPUs, just the one closest to how things might be done on a PPE (and possibly the easiest, if the cache is already written and its use is fairly transparent ;))
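
Just to illustrate the kind of thing I mean, here's a toy sketch of pointer-chasing tree traversal where every node access goes through a software-cache lookup. cache_get() is hypothetical (stubbed as an identity function so it runs on a normal host), not anything from the actual SDK; on an SPE it would hand back a local-store copy of the line containing the effective address.

Code:
#include <stdio.h>

struct node {
    int          key;
    struct node *left, *right;   /* on an SPE these would be effective addresses */
};

/* Stand-in for the software cache: host build just returns the pointer.   */
static void *cache_get(void *effective_addr)
{
    return effective_addr;
}

static struct node *find(struct node *root, int key)
{
    struct node *n = root;
    while (n) {
        n = cache_get(n);                 /* every dereference goes         */
        if (key == n->key)                /* through the cache              */
            return n;
        n = (key < n->key) ? n->left : n->right;
    }
    return NULL;
}

int main(void)
{
    struct node a = { 10, NULL, NULL };
    struct node c = { 30, NULL, NULL };
    struct node b = { 20, &a, &c };       /* b is the root                  */

    struct node *hit = find(&b, 30);
    printf("found: %d\n", hit ? hit->key : -1);
    return 0;
}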
 
Titanio said:
FWIW, Barry Minor says the whole source code might be made available in the next version of the Cell SDK (and if not, probably the software cache anyway).

I wonder if any Sony devs or third party devs will try to do what Barry is doing, but in a real game?
 