A bit of info on Cell's physics abilities.

The demonstration application would scale linearly on a grid of P4s, too. That's the nature of the demo! In fact it would prolly scale with a better linear multiplier, as 8 SPEs are not 8x faster than 1 SPE.

So something is going on in their code that just doesn't add-up.

Is it shit code or are we seeing the effects of memory bandwidth crunch?

Jawed
 
Jawed said:
The demonstration application would scale linearly on a grid of P4s, too. That's the nature of the demo! In fact it would prolly scale with a better linear multiplier, as 8 SPEs are not 8x faster than 1 SPE.

So something is going on in their code that just doesn't add-up.

Is it shit code or are we seeing the effects of memory bandwidth crunch?

Jawed

As I said, they're Alias. They never really bothered with making "fast" code. So personally I wouldn't count on them to code for Cell and make it run fast (when even game developers are nagging about it, and they're more knowledgeable in making code that runs fast than Alias ever will be).
And maybe being in-order affects the Cell chip more than we expected.
 
Jawed said:
The demonstration application would scale linearly on a grid of P4s, too. That's the nature of the demo! In fact it would prolly scale with a better linear multiplier, as 8 SPEs are not 8x faster than 1 SPE.

Of course, but at what cost? I mean for the same number of transistors, how many Cell chips/SPUs could you have? (As an answer, I figure it'd be roughly 6, or 48 SPUs). You might as well say that 1,000 P4s would be better than one Cell ;)

Jawed said:
So something is going on in their code that just doesn't add-up.

Is it shit code or are we seeing the effects of memory bandwidth crunch?

If you're asking if it could be improved, the answer is probably, as with most code. But we could hardly have any idea as to how. If the code was built from scratch, perhaps, would we see better performance? We're also looking at the work of Alias people here, perhaps with help from IBM, so bear that in mind if comparing to previous examples that came from the chip's creators.
 
Last edited by a moderator:
~50% improvement of a 3.2GHz SPE compared to a 3.6GHz P4 is not too bad; to be expected in some ways considering the P4's relatively poor float performance. On the other hand I have read of Dual Xeon servers struggling to push 10GFLOPs in the real world, so theoretically you would expect SPEs to do more. But alas we all know theoretical performance is just that, theoretical peak.
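To put that ~50% figure on a per-clock footing, here is a rough sketch in Python. It only uses the numbers quoted above (a 3.2GHz SPE roughly 1.5x as fast as a 3.6GHz P4). The function name `per_clock_speedup` is made up for illustration, and treating performance as scaling linearly with clock is an assumption, not something the report shows.

```python
def per_clock_speedup(speedup, clock_ghz, baseline_clock_ghz):
    """Speedup normalised for clock rate.

    Assumes (hypothetically) that performance scales linearly with clock,
    so the raw speedup is multiplied by baseline_clock / clock.
    """
    return speedup * baseline_clock_ghz / clock_ghz


# Numbers from the post: ~1.5x for a 3.2GHz SPE against a 3.6GHz P4.
spe_vs_p4 = per_clock_speedup(1.5, clock_ghz=3.2, baseline_clock_ghz=3.6)
print(f"per-clock advantage: {spe_vs_p4:.2f}x")
```

Under that assumption, the SPE's advantage looks somewhat larger per clock than the raw figure, since it is running at the lower frequency.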

Admittedly I was expecting more from a single SPE. The hype gets your expectations pretty high at times. For the server/workstation market, where this test is being run, I am not sure this speedup is enough. Looking at your current HW infrastructure and software (and possibly having to not only buy new software but also train your employees on it), it may be cheaper to go the add-on route (like Clearspeed or some other type of expansion card aimed at compensating for the relatively poor performance of the x86 chips in these areas). In the context of multicore+multiprocessor Opterons and Xeons (not to mention dual P4 and dual-core Athlon 64 chips at street prices relatively close to their single-core counterparts) I am sure some IT guys are hesitant.

I would have really liked to see this test done against a 4- or 8-way Opteron. That would be a much more relevant comparison in many ways, as the Opterons and Xeons of the world, not the P4, are most likely CELL's competition in the server and workstation market. My guess is that a quad-processor Opteron arrangement (with dual cores) would be comparable, maybe even faster.

The poor PPE performance is odd considering the PPE is theoretically well over 1.5x as powerful as the SPEs in peak floating point. Between the 512K cache and the extra silicon for pipeline efficiency this is confusing... unless VMX was not utilized (could it be relying on the FP unit instead?), the PPE is completely boinked, or something else is going on. Unless the DD1 PPE is messed up, something else has to be going on. VMX runs pretty well on Macs, so those results are kind of confusing. :???:
 
Jawed said:
Presumably this is a DD1 Cell, judging from the 8 SPEs and 2.4GHz clock.

DD1 tends to have higher clockspeed than DD2. The Cell Broadband Engine in those Cell blades is the so-called DD2.

Jawed said:
The demonstration application would scale linearly on a grid of P4s, too. That's the nature of the demo!

How about bandwidth? They use Local Storage intensively.
 
Acert93 said:
The poor PPE performance is odd considering the PPE theoretically is well over 1.5x as powerful as the SPEs in peak theoretical floating point. Between 512K cache and extra silicon for pipeline efficiency this is confusing... unless VMX was not utilized (could it be relying on the FP unit instead?), the PPE is completely boinked, or something else is going on. Unless the DD1 PPE is messed up, something else has to be going on. VMX runs pretty well on Macs, so those results are kind of confusing. :???:

It says it's PPE VMX.

edit - and thanks, One, that's pretty interesting and perplexing..
 
rabidrabbit said:
I'd guess this level of cloth simulation would be limited to games with maybe two to four characters on screen.

Cheers.

Not a big deal then, I guess, and the power is probably better used for other stuff in most games (e.g. particles and post fx), although I'd love to see a Virtua Fighter with detailed cloth and hair simulation...
 
london-boy said:
As i said, they're Alias. Never really bothered with making "fast" code. So personally i wouldn't count on them to code for Cell (when even game developers are nagging about it, and they're more knowledgeable in making code that runs fast than Alias ever will) and make it run fast.
And maybe being In-Order affects the Cell chip more than we expected.

Well, the "fast code" issue would apply to the P4 equally then. It's not like they were trying to cripple the SPEs in comparison to the P4.

Overall I would NOT put this type of test in the context of game development. This "test" is geared more at the server/workstation market. Obviously it is not an exhaustive benchmark, so I won't knock them too hard, but it would have been much more meaningful against an Opteron. Alas, in that situation you may be using double precision anyhow, so things are bound to change.

Very interesting results, no question, but what do they REALLY tell us? Probably not much in regards to the server market and much less about the PS3. It does confirm the SPEs work and work very well at that.
 
Acert93 said:
Well, the "fast code" issue would apply to the P4 equally then. It's not like they were trying to cripple the SPEs in comparison to the P4.

True, with the caveat that the code may have originally been resident on Intel. Unlike other cases we've seen, this is one where the code was ported to Cell. A thorough/careful port, perhaps, but still a port. I can only guess the original architecture was Intel.
 
pipo said:
Cheers.

Not a big deal then, I guess, and the power is probably better used for other stuff in most games (e.g. particles and post fx), although I'd love to see a Virtua Fighter with detailed cloth and hair simulation...

This really does not tell us much about games. Game engines use a lot of hacks and shortcuts. Believability--not accurate simulation--is the goal. If you can reduce the number of interaction points, cut down on self-collision, etc. and retain the same level of believability in a game, then so be it.

Titanio said:
It says it's PPE VMX.

Yeah... so either their PPE is boinked (less likely IMO) or the tester/Alias is doing something wrong (more likely). The PPE is quite a bit more robust than the SPEs, so unless they had a problem with the 32-bit VMX units there is something going wrong here.

Doesn't matter though, in a server market IF you are going to use CELL servers you are going to specialize for the SPEs for the very fact every processor has 8 of them. I would not even waste time writing for a PPE when the real gold is in writing for the SPEs--8x the return!
 
Jawed said:
Well I'm glad you guys are impressed.

This is rubbish in my eyes.

Jawed

You are better as a GPU..."critic"...

Seriously, "rubbish"?

How many P4s would you need to match this? 5? 6? 7? 8? More? How much silicon is that? How much cost? Really, get some perspective Jawed. The standards you're setting here I doubt are ones you'd apply elsewhere.

Nevermind, it's pointless.
 
The perfect linear scaling is very impressive.

But I don't understand why going from 1xSPE to 2xSPEs results in a 30% improvement...

It's almost like they are missing a step in that very linear graph. It would make more sense to me if that baseline 1 SPE figure is actually the PPE+1xSPE producing that output.
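One way to poke at that "missing step" idea is a back-of-envelope model. The sketch below assumes the graph is a straight line with a fixed offset, throughput(n) = a + b*n, where b is what each SPE adds and a is whatever else contributes (the PPE, under the guess above). That model and the function names are assumptions for illustration only; the only number taken from the thread is the 30% gain from 1 to 2 SPEs.

```python
def offset_in_spe_units(step_gain):
    """For throughput(n) = a + b*n, return a/b.

    Uses throughput(2) / throughput(1) = 1 + step_gain, i.e.
    (a + 2b) / (a + b) = 1 + g  =>  a/b = (1 - g) / g.
    """
    return (1.0 - step_gain) / step_gain


def relative_throughput(n_spes, offset):
    """Throughput at n SPEs relative to the 1-SPE figure (b normalised to 1)."""
    return (offset + n_spes) / (offset + 1.0)


offset = offset_in_spe_units(0.30)  # 30% gain from 1 to 2 SPEs, per the post
print(f"fixed contribution is worth about {offset:.2f} SPEs")
for n in (1, 2, 4, 8):
    print(n, round(relative_throughput(n, offset), 2))
```

If the fixed part really is that large compared with one SPE, that would sit oddly with the poor PPE numbers discussed earlier, so treat this as a way of framing the question rather than an answer.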
 
Titanio said:
You are better as a GPU..."critic"...

Seriously, "rubbish"?

How many P4s would you need to match this? 5? How much silicon is that? How much cost? Really, get some perspective Jawed. The standards you're setting here I doubt are ones you'd apply elsewhere.

A P4 setup is going to be a lot cheaper.

For example, there are X2 (dual core AMD64) processors with 154M transistors (147mm^2). An X2 has 60% fewer transistors than a DD2 CELL. With manufacturing considerations taken into account, I am sure you could get 2 dual cores (4 total processors) CHEAPER than a CELL. And that does not even consider the other costs (XDR, completely new setups, etc). That puts it into perspective, doesn't it?

This is a server/workstation test. So I think in that vein you really need to put things into perspective. These results are good, but when put into context of their market they have a very large uphill battle.

If I had a choice of a Xeon or Opteron multicore rig or CELL, at this point I would have to go with the former for a number of reasons (including performance value).
 
Acert93 said:
This "test" is geared more at the server/workstation market. Obviously it is not an exhaustive benchmark, so I won't knock them too hard, but it would have been much more meaningful against an Opteron. Alas, in that situation you may be using double precision anyhow, so things are bound to change.

I think the point here is that the Cell SPE has Local Storage which can be explicitly controlled by a programmer, so its strength will still be able to stand against Opteron. As Alias writes:

The most notable features of the Cell processor architecture are the IBM® Power-based RISC CPU core (PPE), the 8 SIMD vector units (SPEs) and the impressive bandwidth that can be leveraged to obtain maximum performance from the architecture.
 
All these comparisons are nigh-on useless for comparing Cell to a P4 quantitatively; although you get a bit of implementation detail for the Cell version, we don't have a clue how optimised their baseline P4 implementation is (does it use SIMD etc.).

Also, I would really like to see the comparative speed-up graph of P4 vs. Cell performance for 1 or 2 simultaneous simulations...
 
Acert93 said:
A P4 setup is going to be a lot cheaper.

For example, there are X2 (dual core AMD64) processors with 154M transistors (147mm^2). An X2 has 60% fewer transistors than a DD2 CELL. With manufacturing considerations taken into account, I am sure you could get 2 dual cores (4 total processors) CHEAPER than a CELL. And that does not even consider the other costs (XDR, completely new setups, etc). That puts it into perspective, doesn't it?

If we're comparing what was compared in the report, how much is 7 or 8 3.6GHz P4s going to cost you? Or 5 if you want to hold the Cell down at 2.4GHz? And that's assuming you get linear scaling across the P4s.

And with regard to the AMDs, could you get 4 of them for cheaper than a Cell? Assuming it had similar results to the P4..

I would be surprised if one single Cell cost more than these alternatives.
 
Anyone remember this pic:

slide118al.jpg


Jawed
 
one said:
I think the point here is Cell SPE has Local Storage which can be explicitly controlled by a programmer so its strength will still be able to stand against Opteron. As Alias writes

It's not like Opteron L2 and L3 cache is slow, and Opterons are quite a bit more beefy than a P4. So I don't really get your point. From a market perspective CELL is going to have to beat up on Opterons, not just be a competent foe. CELL may be able to do this, but this benchmark does not tell us that. Considering the sizable gap between an Opteron and a P4, a multi-processor/multi-core Opteron arrangement should be very compelling for many reasons (not just performance in this one area, but performance across a wide array of metrics, plus cost, software, infrastructure, training, and other reasons as well).

Titanio said:
If we're comparing what was compared in the report, how much is 7 or 8 3.6GHz P4s going to cost you? Or 5 if you want to hold the Cell down at 2.4GHz? And that's assuming you get linear scaling across the P4s.

Hey, you brought up cost ;)

It costs Intel $40 to make a chip. Further, there is very little additional cost for Intel to add a second core (see: multicore chips are not 2x as expensive, not even close). Further, with over 80% of the desktop market (a 200M unit/year market) Intel has an insane advantage in cost.

And this does translate into street price. Dual core Xeons, Opterons, P4s, X2s, etc are on the market. The MBs and Memory are readily available and CHEAP.

This can be purchased today--have it delivered tomorrow if you want. And it will work with your suite of tools and plug right into your network. Even more, you can hire thousands of blokes who can work efficiently with it in a productive environment using the software that is out right now.

How much is a CELL workstation again? How fast of a processor? (You keep assuming there are faster CELL workstations readily available in the market right now. No point comparing a P4 of today with a CELL of tomorrow). How much memory? Availability? What about all the other stuff? Can I get overnight mail and on my door TOMORROW?

Comparing right now, today, is relevant. Down the road? You're guessing as much as anyone else. But in a year, when CELL workstations become available, Intel and AMD will have progressed and moved on as well (quad core processors). And that still does not answer the cost issue.

Does anyone have an idea how much a 2.4GHz CELL workstation costs today? I can get a P4 3.6GHz chip for $340--so setting up a farm of those would be dirt cheap (even cheaper if I went with Athlon64s!). I can also set up an 8-processor Opteron today if I am so inclined and need the power.

How much such a setup will cost when CELL arrives on the mass market (and at what speeds and configuration) is unknown, but as history shows, AMD/Intel will react accordingly.

Anyhow, beyond the cost of the machine (a P4 is dirt cheap, and I would be surprised if anyone suggested a CELL workstation would be cheaper) there are all the other costs mentioned.

So, how much does a 2.4GHz CELL cost right now anyhow?
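For the chip side of that comparison, the arithmetic is easy to sketch. The Python below uses only the $340 street price quoted above and the P4 counts floated earlier in the thread (5, 7 or 8). The names `P4_PRICE_USD` and `chips_cost` are made up for illustration, and it covers processors only; motherboards, memory, power and networking are deliberately left out because nobody here has quoted figures for them.

```python
P4_PRICE_USD = 340  # street price for a 3.6GHz P4, as quoted in the post


def chips_cost(count, unit_price=P4_PRICE_USD):
    """Total processor cost in USD for `count` chips (CPUs only)."""
    return count * unit_price


# P4 counts suggested earlier in the thread as a match for one Cell.
for count in (5, 7, 8):
    print(f"{count} x P4 3.6GHz: ${chips_cost(count)}")
```

Without a price for a CELL system there is nothing to put on the other side of the comparison, which is the open question.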
 
Jawed said:
Anyone remember this pic:

slide118al.jpg


Jawed

Yes. It showed outstanding FFT performance. And Alias's FIRST attempt to port an EXPERIMENTAL cloth solver to Cell has shown a 5x improvement in real-world results.

Question: Does the .pdf mention anywhere that they're working in single precision? Not that I could see. Offline graphics uses DP AFAIK. Certainly Laa-Yosh has repeatedly mentioned that the SP performance of Cell won't be much use in offline rendering work, though I don't know if that's restricted to rendering or applies to modelling also. As we know, Cell's DP performance isn't anywhere near as strong as its SP, so it seems likely to me we're seeing a DP process. And again, it's a first take on an experimental solver. People are jumping to conclusions.

Much as it's nice to imagine Cell will offer an amazing 20+ times improvement in absolutely everything, some things like double precision work aren't likely to show that much advance. Though it'll be interesting to see if later work on Cell finds more optimizations.
 