Cell benchmarked

Jawed · Nov 30, 2005

ralexand said:
The question I have, though, is whether Cell and the GPU can be used in conjunction with each other in a way that leads to a synergistic increase in performance.

With high bandwidth between Cell and RSX, I guess it's a prime design goal to make the two cooperate. Either that or the fact that Cell has that bandwidth by design, anyway (i.e. irrespective of the implementation of Cell in PS3), meant it was inevitable that RSX would get a mirror image of that interface for high bandwidth.

It's just frustrating getting no decent evidence for where this is really going to be cool. Procedural textures and CPU-tessellation always seemed like the most likely fields for cooperation to me.

Jawed

Edge · Nov 30, 2005

Jawed said:
Compared to what? A GPU? I'd be interested in comparative data for a GPU if you have some.

I would be interested in how you think a high floating point rating on Cell does not make a good gaming processor, but MS went out of their way to devote a single PPE (with the ability to lock parts of the cache), and its powerful VMX unit, to graphics? Going by what you have shared, that would be wasteful for MS to do, as by your argument all transformation and lighting is done on the GPU. Or, by any chance, can high bandwidth and math capability be used for something that aids the GPU besides transformation and lighting?

Jawed, going by your many posts, which I have read and respected because you are very knowledgeable, I can't see how or why you are downplaying these benchmarks here. Maybe you just like a good debate?

Jawed · Nov 30, 2005

Titanio said:
Well do we know the paper peak transform rate of a 7800GTX, for example? Or better, do we have a benchmark? The benched peak of one SPU is >200m verts/sec.

The problem being that the benchmark stated in the paper is vague - what's it doing?

I'm not saying Cell would make a good replacement for vertex shading on the GPU, but you would think it would speed it up to do some in parallel with the GPU if possible.

I sort of agree - but I'm suspicious because the load/store overhead that's implicit in a CPU implementation of a vertex shader (i.e. FLOPs lost to those operations getting data into and out of registers) is being glossed over.

The 217MVPS figure stated in the article implies 118FLOPs per vertex - I dare say that's one mother of a vertex shader rather than a simple transform and light, or it's not running very efficiently. There's no efficiency stated for this benchmark.

Jawed
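The 118 FLOPs per vertex figure can be reproduced with a short calculation. This is a minimal sketch, assuming the usual paper peak for one SPE (3.2 GHz, 4-wide single-precision SIMD, a fused multiply-add counted as 2 FLOPs); those numbers are not stated in the thread, and the function name is illustrative only.

```python
# Assumption (not from the thread): one SPE peaks at
# 3.2 GHz * 4 SIMD lanes * 2 FLOPs per fused multiply-add.
SPE_CLOCK_HZ = 3.2e9
SIMD_LANES = 4
FLOPS_PER_FMA = 2


def flops_per_vertex(peak_flops: float, verts_per_sec: float) -> float:
    """FLOP budget per vertex if the unit ran at its paper peak."""
    return peak_flops / verts_per_sec


spe_peak = SPE_CLOCK_HZ * SIMD_LANES * FLOPS_PER_FMA
budget = flops_per_vertex(spe_peak, 217e6)  # 217 MVPS is the quoted benchmark
print(f"{spe_peak / 1e9:.1f} GFLOPS peak, about {budget:.0f} FLOPs per vertex")
```

The result is an upper bound on the work per vertex: if the benchmark ran below peak, the shader itself used fewer FLOPs than this.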

Jawed · Nov 30, 2005

Edge said:
I would be interested in how you think a high floating point rating on Cell does not make a good gaming processor, but MS went out of their way to devote a single PPE (with the ability to lock parts of the cache), and its powerful VMX unit, to graphics?

Xenon "looks" more tailored to gaming than Cell - but that's because of the obviously tight integration that exists between Xenon and Xenos. Well, that's how it seems to me.

Cache-locking, by the way, appears to be a "standard" part of some variants of PPC - Cell PPE appears to be able to do so.

Going by what you have shared, that would be wasteful for MS to do, as by your argument all transformation and lighting is done on the GPU. Or, by any chance, can high bandwidth and math capability be used for something that aids the GPU besides transformation and lighting?

The intention with XB360 appears to be that Xenon will do the absolute minimum of geometry/graphics - Xenos is more independent in this respect than older, SM3-like, GPUs (i.e. what we're expecting RSX to be).

Jawed, going by your many posts, which I have read and respected because you are very knowledgeable, I can't see how or why you are downplaying these benchmarks here. Maybe you just like a good debate?

I was really excited by the thread title, only to be somewhat disappointed that the article doesn't really relate to gaming.

Aside from gaming, I can't wait to see how GPGPU and Cell develop over the next few years - this is genuinely exciting stuff.

http://graphics.stanford.edu/~yoel/notes/

They're very close competitors. Cell looks like it is far better suited to a broad range of scientific, streamable, apps. The coming generations of GPUs, on the other hand, appear to be spectacularly well-suited to ultra-high bandwidth streaming applications.

By comparison, ordinary general purpose CPUs such as P4 or A64 are really not very interesting (to me) benchmarks for this high-end stuff. So the article is interesting, just frustratingly narrow in its compass.

Jawed

Mythos · Nov 30, 2005

macabre said:
If there are any open questions about how a single cell does the Terrain demo:

How much of ray casting [cell/rsx] will go into a full featured game?

ihamoitc2005 · Nov 30, 2005

T&L

Jawed said:
The problem being that the benchmark stated in the paper is vague - what's it doing?

I sort of agree - but I'm suspicious because the load/store overhead that's implicit in a CPU implementation of a vertex shader (i.e. FLOPs lost to those operations getting data into and out of registers) is being glossed over.

The 217MVPS figure stated in the article implies 118FLOPs per vertex - I dare say that's one mother of a vertex shader rather than a simple transform and light, or it's not running very efficiently. There's no efficiency stated for this benchmark.

Jawed

A single PS2 Vector Unit at 300 MHz, with 2.4 GFLOPS, does 75 million verts/sec for a basic geometry transform.

Titanio · Nov 30, 2005

Jawed said:
The problem being that the benchmark stated in the paper is vague - what's it doing?

I sort of agree - but I'm suspicious because the load/store overhead that's implicit in a CPU implementation of a vertex shader (i.e. FLOPs lost to those operations getting data into and out of registers) is being glossed over.

The 217MVPS figure stated in the article implies 118FLOPs per vertex - I dare say that's one mother of a vertex shader rather than a simple transform and light, or it's not running very efficiently. There's no efficiency stated for this benchmark.

Jawed

True, we don't know what TnL means exactly here. It may not necessarily be a raw throughput demo, but something more sophisticated. But you could compare to paper peaks on GPUs if you wished - that may be conservative. As for efficiency or the load/store issue, I'm not sure if load/stores for registers are on the same pipeline as floating point ops? They might be, I can't recall.

Panajev2001a · Nov 30, 2005

ihamoitc2005 said:
A single PS2 Vector Unit at 300 MHz, with 2.4 GFLOPS, does 75 million verts/sec for a basic geometry transform.

It is better, IMHO, to quote the perspective-corrected transform, which is 60 MVertices/s for VU1 (Perspective Transform reduced to 5 cycles using the EFU's FDIV in addition to the Lower Pipeline's FDIV) and about 42.85 MVertices/s for VU0 (7 cycles for the Perspective Transform), than the Geometric Transform (no Perspective Divide).

Edit: basic Transformation is a Matrix Multiplication against a Vector, divide 1 by w and multiply all components of the vector... in SoA form it is a bit more involved, but I am too tired.

Edit #2: way too tired... I quoted myself instead of editing the post.
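The basic transformation described in the edit above (matrix times vector, then one divide for 1/w and a multiply of the remaining components) could be sketched as follows. This is a scalar, array-of-structures illustration only, not VU or SPE code; the function name and the sample projection matrix are hypothetical.

```python
from typing import Sequence, Tuple

Matrix4 = Sequence[Sequence[float]]


def perspective_transform(m: Matrix4, v: Sequence[float]) -> Tuple[float, float, float]:
    """Multiply a 4x4 matrix by a 4-component vertex, then apply the perspective divide."""
    # Matrix multiplication against the vector: one dot product per row.
    x, y, z, w = (sum(m[r][c] * v[c] for c in range(4)) for r in range(4))
    # A single divide produces 1/w; the other components are then multiplied by it.
    inv_w = 1.0 / w
    return (x * inv_w, y * inv_w, z * inv_w)


# Hypothetical matrix that copies z into w, standing in for a real projection.
proj = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
print(perspective_transform(proj, (2.0, 4.0, 2.0, 1.0)))  # (1.0, 2.0, 1.0)
```

Doing one reciprocal and three multiplies, rather than three divides, is why the divide latency matters so much for the cycle counts quoted in this post.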

Shifty Geezer · Nov 30, 2005

Mythos said:
How much of ray casting [cell/rsx] will go into a full featured game?

Well, if you're writing a flight-sim, I guess it might be quite widely used.

Inane_Dork · Nov 30, 2005

mckmas8808 said:
Why do you see this as PR?

I think it's obvious the goal is to promote and sell Cell. That's PR to me. It's factual, but PR can have fact in it.

And if half of MS couldn't understand that article then I guess that $250,000 per programmer is going to waste. Must be why my OS keeps crashing.

For one, I kinda doubt they're spending that much per programmer. Two, if your OS keeps crashing, I'm sorry about that, but there are millions of people without that problem. Don't act or imply crashing is the norm.

Third, and most importantly, you missed the most critical aspect of what I meant: managers.

Jawed · Nov 30, 2005

Titanio said:
As for efficiency or the load/store issue, I'm not sure if load/stores for registers are on the same pipeline as floating point ops? They might be, I can't recall.

I'm thinking it's more of an issue in terms of latency between these operations. You can't execute math on registers until the registers have been loaded, and you can't re-use registers until they've been stored.

Between reading pipeline diagrams (and explanatory texts) for Xenon VMX, Cell PPE and Cell SPE I have to admit to being somewhat blurry on this topic - I'm hoping someone will come up with some concrete answers, which is why I raise the question.

I'm wondering whether a simple vertex shader could be bandwidth-bound (register bandwidth) on SPEs, i.e. very few operations on very high quantities of data. If that were so, then it might not be a good benchmark. Not knowing the complexity of the shader is annoying.

Any insight would be appreciated...

Jawed

ihamoitc2005 · Nov 30, 2005

Panajev2001a said:
It is better, IMHO, to quote the perspective-corrected transform, which is 60 MVertices/s for VU1 (Perspective Transform reduced to 5 cycles using the EFU's FDIV in addition to the Lower Pipeline's FDIV) and about 42.85 MVertices/s for VU0 (7 cycles for the Perspective Transform), than the Geometric Transform (no Perspective Divide).

Edit: basic Transformation is a Matrix Multiplication against a Vector, divide 1 by w and multiply all components of the vector... in SoA form it is a bit more involved, but I am too tired.

Edit #2: way too tired... I quoted myself instead of editing the post.

I intended to provide a baseline figure, but I feel you are right. Perhaps the VU0 rating for perspective transform of 42.85 million vertices/sec is the most applicable for SPE discussion, no?
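The vertex rates traded in these posts all follow from clock rate divided by cycles per vertex. A minimal sketch: the 5 and 7 cycle counts are the ones quoted above, while the 4 cycles for the basic transform is only inferred from 75 million verts/sec at 300 MHz and is an assumption.

```python
VU_CLOCK_HZ = 300e6  # PS2 Vector Unit clock quoted in the thread


def verts_per_sec(clock_hz: float, cycles_per_vertex: float) -> float:
    """Peak vertex rate when each vertex costs a fixed number of cycles."""
    return clock_hz / cycles_per_vertex


cases = (
    ("basic transform (4 cycles, inferred)", 4),
    ("VU1 perspective transform (5 cycles)", 5),
    ("VU0 perspective transform (7 cycles)", 7),
)
for label, cycles in cases:
    rate = verts_per_sec(VU_CLOCK_HZ, cycles)
    print(f"{label}: {rate / 1e6:.2f} MVertices/s")
```

The 7-cycle case comes out at 42.857..., which the posts truncate to 42.85.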

mckmas8808 · Dec 1, 2005

Inane_Dork said:
For one, I kinda doubt they're spending that much per programmer. Two, if your OS keeps crashing, I'm sorry about that, but there are millions of people without that problem. Don't act or imply crashing is the norm.

Well I was joking about the OS crashing thing.

scificube · Dec 1, 2005

How about this: Let RSX handle T&L and other traditional vertex tasks and task Cell to provide for or do displacement mapping on the side.

Sort of a two pass rendering deal where Cell displaces surfaces and then passes on the displaced geometry to RSX for T&L, texturing, rasterization etc.

Seems to me a good way to consume some of that latent vertex power Cell has, if possible.
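The first pass of that idea, displacing vertices before handing them on for T&L, might look roughly like this. It is a plain Python sketch assuming unit-length normals and one pre-sampled height per vertex; the names `displace` and `displace_mesh` are hypothetical and nothing here reflects an actual Cell or RSX API.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def displace(position: Vec3, normal: Vec3, height: float, scale: float = 1.0) -> Vec3:
    """Move one vertex along its (assumed unit-length) normal by height * scale."""
    px, py, pz = position
    nx, ny, nz = normal
    d = height * scale
    return (px + nx * d, py + ny * d, pz + nz * d)


def displace_mesh(positions: Sequence[Vec3], normals: Sequence[Vec3],
                  heights: Sequence[float], scale: float = 1.0) -> List[Vec3]:
    """First pass: produce displaced positions to pass on to the GPU stage."""
    return [displace(p, n, h, scale) for p, n, h in zip(positions, normals, heights)]


# Two vertices of a flat patch facing +Y, with sampled heights 0.0 and 0.5.
flat = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
up = [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
print(displace_mesh(flat, up, [0.0, 0.5], scale=2.0))
# [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
```

Each vertex is independent of the others, which is what would make this kind of pass a candidate for spreading across several SPEs.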

aaronspink · Dec 1, 2005

xbdestroya said:
Let me ask you, DarkRage, what in the world would be IBM's motive for talking down the 970?

Define IBM... There is certainly motive for the Cell related people to take down 970.

I mean no doubt Cell got some 'best-case' stuff, but I don't think anyone went so far as to artificially cripple the competition.

In Linpack they certainly did. There are reported linpack results for the P4 in the 70-80% efficiency range yet they only managed <50%.

There are many ways to cripple the competition; one such way is to not optimize the competition, nor even optimize to what is publicly available. These oversights put all their performance comparisons in doubt, because the question has to be: what level of optimization did the non-Cell competitors receive in these benchmarks?

Aaron Spink
speaking for myself inc.

Fafalada · Dec 1, 2005

Jawed said:
The 217MVPS figure stated in the article implies 118FLOPs per vertex - I dare say that's one mother of a vertex shader rather than a simple transform and light, or it's not running very efficiently.

A trivial vertex transform + 3 diffuse directional lights + Ambient (probably the most common lighting scheme used this generation) comes out at roughly 108FLOPs.

And I certainly wouldn't call that a "mother of a VS", it's one of the simplest VS examples out there.
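If the benchmarked shader really cost about 108 FLOPs per vertex, the quoted 217 MVPS would correspond to a fairly high fraction of one SPE's paper peak. A rough sketch, assuming a 25.6 GFLOPS single-precision peak per SPE (not stated in the thread) and assuming the benchmark used this lighting setup, which the article does not confirm:

```python
SPE_PEAK_FLOPS = 25.6e9  # assumed paper peak for one SPE, single precision


def implied_efficiency(flops_per_vertex: float, verts_per_sec: float,
                       peak_flops: float = SPE_PEAK_FLOPS) -> float:
    """Fraction of peak FLOPs needed to sustain the given vertex rate."""
    return flops_per_vertex * verts_per_sec / peak_flops


eff = implied_efficiency(108, 217e6)
print(f"Implied efficiency: {eff:.1%}")
```

Under those assumptions the figure lands a little above 90% of peak, so the number is plausible for a simple shader only if load/store overhead is almost entirely hidden.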

fearsomepirate · Dec 1, 2005

TTP said:

The double-precision performance of this thing is balls compared to the single-precision. Considering the SPEs are 32-bit, that's not what I like to hear. I'd like to see those monster gains hold for double precision as well. I mean, I'm sure my CFD code would run faster than on my Pentium, but still...

aaronspink · Dec 1, 2005

Shifty Geezer said:
The current SPEs perform DP float work at 1/10th their SP performance (which also takes shortcuts to be faster, so isn't as precise as other SP float computers).

Actually, 1/14th. SP math goes at 4 per cycle. DP math goes at 2 per 7 cycles.

Aaron Spink
speaking for myself inc.
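The 1/14 ratio follows directly from the two issue rates given in the post. A small check using exact fractions, with no figures beyond the ones quoted:

```python
from fractions import Fraction

sp_ops_per_cycle = Fraction(4, 1)  # SP: 4 operations every cycle
dp_ops_per_cycle = Fraction(2, 7)  # DP: 2 operations every 7 cycles

ratio = dp_ops_per_cycle / sp_ops_per_cycle
print(f"DP throughput is {ratio} of SP throughput")  # 1/14
```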

darkblu · Dec 1, 2005

in that table ttp quoted, i think there's a slight misunderstanding in the linpack1k figures for the intel: the DP performance of sse2 and iirc sse3 is 2 flops per clock, so 7.2Gflops at DP is the theoretical maximum of the cpu @ 3.6GHz. where they get that 14.4Gflops peak in "Table.9" is a mystery to me, plus i highly doubt that sse2/3 can achieve 2 flops/cycle sustained at linpack1k DP.
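The arithmetic behind that objection is simple enough to check. A minimal sketch using only the figures in the post (3.6 GHz, 2 DP flops per clock, and the 14.4 Gflops from "Table.9"); the function name is illustrative.

```python
def peak_gflops(clock_ghz: float, flops_per_clock: float) -> float:
    """Theoretical peak in GFLOPS for a given clock and per-clock issue rate."""
    return clock_ghz * flops_per_clock


sse_dp_peak = peak_gflops(3.6, 2)  # 2 DP flops per clock, as stated in the post
implied_rate = 14.4 / 3.6          # flops per clock the table's peak would need
print(f"DP peak at 2 flops/clock: {sse_dp_peak:.1f} GFLOPS")
print(f"Table peak implies {implied_rate:.1f} flops/clock")
```

The table's 14.4 figure would need 4 flops per clock, twice the DP rate given in the post.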

Jawed · Dec 1, 2005

Fafalada said:
A trivial vertex transform + 3 diffuse directional lights + Ambient (probably the most common lighting scheme used this generation) comes out at roughly 108FLOPs.

Ah well, caught out by looking up stuff that's too old, RacorX3:

http://www.gamedev.net/reference/articles/article1807.asp

RacorX5 on that page looks more like the kind of complexity you're talking about, though the lighting isn't the same.

Jawed
