A very interesting article about CELL programming

Titanio · Mar 13, 2007

[maven];946326 said:
An alternative, slightly more provocative formulation would be: CELL is such a complicated architecture that getting good performance out of it takes great effort / redesign..

Well, the relative performance they're seeing here is a little better than "good" might imply..that effort was very well rewarded here.

chris1515 · Mar 13, 2007

Titanio said:
Well, the relative performance they're seeing here is a little better than "good" might imply..that effort was very well rewarded here.

I don't think 22 times faster is negligible.:smile:

On the game side, the transition was probably painful for PS3 programmers, and the example helps to understand the problem multiplatform developers face. But I am curious to see the exclusive PS3 games of 2008-2009. They will probably rock.

Shifty Geezer · Mar 13, 2007

[maven];946326 said:
And on that note, I would be quite interested in the speed difference between running the straight-forward C-version on the P4 and CELL.

That's the key question, but still somewhat irrelevant, because we know Cell isn't the easiest processor in the world to develop for. Yes, Cell is harder to develop for. But when you do develop for it, you have more performance. If you're finding it hard to develop for, and getting only the same performance, then there's an issue using Cell. Chances are you're not developing for it correctly in those cases.

Plus when you're optimizing for Cell, just to get algorithms to work quickly, it's because you want that performance. I can't imagine a case where someone wants to implement a Breadth First Search but just wants to implement it as easily as possible without caring how quickly it runs. "I've got a fast processor here, but rather than develop specifically for it and use its speed, I want to use a generic solution that runs much slower, and I'm finding this processor can't handle that generic solution at all well." If that's what you want, use a Pentium 4! If you're using Cell, it's because you want the performance, and articles like this show it's obtainable and in ways that many initially thought wouldn't map well to the system. I take them as very positive, showing the ability for human ingenuity to adapt to any challenge.

The real concern for the consoles is whether devs can afford to invest in developing techniques for Cell. Otherwise Cell's performance in generic algorithms does become an issue over and above its top-end performance. You're not going to find those sorts of metrics in public papers on the whole. Who's going to release a story 'We ran a BFS implementation on Cell and found it wasn't very fast at all?' when they could be writing 'We got a 22x speedup'?

pjbliverpool · Mar 13, 2007

chris1515 said:
You can probably optimize the x86 version but I'm sure that you will not be as fast as CELL with the Xeon.

Not as fast on a single core P4 but could you get it running as fast on a quad core Conroe with equally heavy optimisation?

If so, then aside from the obvious (I assume) cost advantage, Cell's performance advantage would disappear. Say, you either get a decent portion of optimised Cell performance from a non-optimised (easy to implement) quad implementation, or you put the same amount of effort into optimising for the quad as you would into the Cell and attain similar performance.

Is that a realistic scenario?

[maven] · Mar 13, 2007

Shifty Geezer said:
The real concern for the consoles is whether devs can afford to invest in developing techniques for Cell. Otherwise Cell's performance in generic algorithms does become an issue over and above its top-end performance. You're not going to find those sorts of metrics in public papers on the whole. Who's going to release a story 'We ran a BFS implementation on Cell and found it wasn't very fast at all?' when they could be writing 'We got a 22x speedup'?

I wholeheartedly agree. My post was aimed at the common but unfair apples-to-oranges comparisons that are often presented. Usually the main point of such a paper is the particular implementation of an algorithm for a certain architecture, but unless the paper includes a comparative evaluation of the work done, it is very likely to be rejected. The authors are also unlikely to expend the same amount of effort on this "other / alternative implementation", simply because it isn't the focus of what they're interested in in the first place.

In other news, there is no free lunch.

Shifty Geezer · Mar 13, 2007

[maven];946370 said:
The authors are also unlikely to expend the same amount of effort on this "other / alternative implementation", simply because it isn't the focus of what they're interested in in the first place.

Not necessarily. If their research is mapping algorithms to Cell, it won't. But if the research is to find the best possible implementations, it will, within the limits of their ability to develop for different architectures. E.g. the Mercury imaging results, where obviously they've got the fastest method they could find for the standard architecture because that's their livelihood, and when they moved to Cell and it outperformed the old system by a lot, it wasn't for want of optimization on the standard CPU.

They might inflate the numbers to encourage sales of new technology though!

chris1515 · Mar 13, 2007

pjbliverpool said:
Not as fast on a single core P4 but could you get it running as fast on a quad core Conroe with equally heavy optimisation?

If so, then aside from the obvious (I assume) cost advantage, Cell's performance advantage would disappear. Say, you either get a decent portion of optimised Cell performance from a non-optimised (easy to implement) quad implementation, or you put the same amount of effort into optimising for the quad as you would into the Cell and attain similar performance.

Is that a realistic scenario?

I don't know, but all the explanations are in the following document:

http://hpc.pnl.gov/people/fabrizio/papers/ipdps07-graphs.pdf

They compare the CELL against a P4/Opteron running their in-house BFS implementation and against a dual-core Woodcrest running a scalable pthread BFS implementation.

The CELL implementation is faster than the pthread implementation on a dual-core Woodcrest, but the performance difference varies with the graph arity. With a graph arity of 200, where the CELL is 22 times faster than the Xeon, it is 12 times faster than the dual-core Woodcrest.

With a graph of smaller arity, the performance difference between CELL and conventional architectures is smaller.
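
For readers who want to see what "arity" changes in a BFS, here is a minimal Python sketch of a level-by-level search that counts how many edges it scans. It is not the paper's Cell implementation; `bfs_levels` and the `ring_graph` test graph are hypothetical names made up for illustration. It only shows that a higher arity means more edge scanning per frontier vertex and fewer levels, and makes no claim about why the measured speedups differ.

```python
def bfs_levels(adj, root):
    """Level-synchronous BFS.

    Returns (level, edges_scanned), where level maps each reached
    vertex to its distance from root.
    """
    level = {root: 0}
    frontier = [root]
    edges_scanned = 0
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for v in frontier:
            for w in adj[v]:
                edges_scanned += 1          # every outgoing edge is inspected once
                if w not in level:          # first visit fixes the level
                    level[w] = depth
                    next_frontier.append(w)
        frontier = next_frontier
    return level, edges_scanned


def ring_graph(n, arity):
    """Hypothetical test graph: vertex i points to the next `arity` vertices (mod n)."""
    return {i: [(i + k) % n for k in range(1, arity + 1)] for i in range(n)}


if __name__ == "__main__":
    for arity in (4, 200):
        level, scanned = bfs_levels(ring_graph(1000, arity), 0)
        print(arity, "levels:", max(level.values()), "edges scanned:", scanned)
```

On this synthetic graph the number of vertices stays the same while the arity changes, so the per-vertex edge work grows and the number of levels shrinks.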

patsu · Mar 13, 2007

I think the P4 optimization should be pretty good since they use it in-house too. Even if fully optimized, I don't expect a P4 to outrun a 128-CPU BlueGene for a BFS search problem of decent size.

The question is what Shifty raised. In game programming, I think it's more about near real-time programming than computational throughput. Developers have to get tons of things done in between frames, and then maintain the resulting code.

All I want to know by the end of this year or early next is which developers, cross-platform or not, are willing to devote the extra time to surprise the gamers... because they will get my money.

homy · Mar 13, 2007

Here is a nice comparison graph from the paper.

Datasegment · Mar 13, 2007

The algorithm noted here is very well conceived, but one important factor was missed. Somebody touched on it earlier (that the CELL has 9 cores)...

Basically, when the Woodcrest, Pentium 4 HT or any of the other single-chip, multiple-core processors are performing this operation, they are fully committed to the task. The CELL PPE core is, however, waiting for the completion of the SPEs in order to use the results. If this operation were pipelined early enough in (for example) a real-time game or physics simulation, the two HT threads on the PPE core could happily continue to process other tasks until the results were available. Although memory bandwidth would be somewhat compromised due to the frequent load/store DMAs triggered by the SPE units, there should still be plenty of bandwidth left for the main core threads to continue.

This basically means that in its current implementation, aside from setting up the initial graph, the whole latency of the operation could be hidden by pipelining other tasks on the PPE core.
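
The shape of that idea (start the search early, do other work, block only when the result is needed) can be sketched in Python. This is purely an analogy: a worker thread stands in for the SPEs and the main thread for the PPE, all function names are hypothetical, and nothing here models DMA traffic or memory bandwidth. CPython threads also share the interpreter lock, so the sketch shows the structure only, not a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def offloaded_search(adj, root):
    """Stand-in for the work handed to the SPEs: a plain BFS returning visit order."""
    seen = {root}
    order = [root]
    for v in order:                 # the list grows while iterating and acts as the queue
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                order.append(w)
    return order


def other_frame_work(n):
    """Hypothetical unrelated task the main thread performs while waiting."""
    return sum(i * i for i in range(n))


def run_frame(adj, root):
    """Kick off the search early, overlap other work, then collect the result."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(offloaded_search, adj, root)  # issued as early as possible
        side_result = other_frame_work(1000)                # overlapped with the search
        order = pending.result()                            # block only when it is needed
    return order, side_result


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(run_frame(graph, 0))
```

Whether the overlap pays off on real hardware depends on how much the offloaded work competes with the main threads for memory, which the sketch does not capture.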

Another implementation would be to redesign the algorithm to take into account the two other threads and gain, perhaps, some more bandwidth? This would give *true* performance figures for a fully optimized CELL implementation.

3dilettante · Mar 13, 2007

That point is not valid with the Pentium 4 with HT.

You also might be underestimating how much the memory traffic from the SPEs would impact the performance of the PPE, and just how much of its resources are actually available.

LunchBox · Mar 17, 2007

Thanks for posting the links to the article!
It's a great find!
great read!

"Nerve-Damage" · Mar 17, 2007

Nice article!!

pjbliverpool · Mar 17, 2007

Could this run on G80? And if so, assuming a highly optimised implementation like Cell received, what kind of performance could we expect?

Datasegment · Mar 20, 2007

3dilettante said:
That point is not valid with the Pentium 4 with HT.

You also might be underestimating how much the memory traffic from the SPEs would impact the performance of the PPE, and just how much of its resources are actually available.

P4 with HT - what was I thinking - Sorry!

I did mention the impact on the main threads of memory transfers, but I don't believe I underestimated them. Bear in mind the PPE core has its own L1 cache memory set up for data/instruction, backed up by a fairly substantial L2, so I'm absolutely certain that anybody who spends the amount of time required to optimize this algorithm to this level will be very aware of the extra power available (albeit with a little more work) and make use of it as efficiently as possible.

D.
