A very interesting article about CELL programming

[maven];946326 said:
An alternative, slightly more provocative formulation would be: CELL is such a complicated architecture that getting good performance out of it takes great effort / redesign...

Well, the relative performance they're seeing here is a little better than "good" might imply... that effort was very well rewarded here.
 
Well, the relative performance they're seeing here is a little better than "good" might imply... that effort was very well rewarded here.

I don't think 22 times faster is negligible. :smile:

On the game side, the transition was probably painful for PS3 programmers, and the example helps in understanding the problems multiplatform developers face. But I am curious to see the exclusive PS3 games of 2008-2009. They will probably rock.
 
[maven];946326 said:
And on that note, I would be quite interested in the speed difference between running the straight-forward C-version on the P4 and CELL.
That's the key question, but still somewhat irrelevant, because we know Cell isn't the easiest processor in the world to develop for. Yes, Cell is harder to develop for. But when you do develop for it, you have more performance. If you're finding it hard to develop for, and getting only the same performance, then there's an issue using Cell. Chances are you're not developing for it correctly in those cases ;)

Plus, when you're optimizing for Cell just to get algorithms to work quickly, it's because you want that performance. I can't imagine a case where someone wants to implement a Breadth-First Search but just wants to implement it as easily as possible without caring how quickly it runs. "I've got a fast processor here, but rather than develop specifically for it and use its speed, I want to use a generic solution that runs much slower, and I'm finding this processor can't handle that generic solution at all well." If that's what you want, use a Pentium 4! If you're using Cell, it's because you want the performance, and articles like this show it's obtainable, and in ways that many initially thought wouldn't map well to the system. I take them as very positive, showing the ability of human ingenuity to adapt to any challenge.
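For reference, the "generic solution" being talked about here is something like the queue-based BFS below: a minimal sketch in plain C, assuming a CSR (compressed sparse row) adjacency layout. The names and layout are mine, not the paper's.

[code]
/* Minimal queue-based BFS over a graph in CSR form.
   Hypothetical sketch, not the paper's code. */
#include <stdlib.h>

void bfs(int nverts, const int *row_ptr, const int *col_idx,
         int source, int *level)
{
    int *queue = malloc(nverts * sizeof *queue);
    int head = 0, tail = 0;

    for (int v = 0; v < nverts; v++)
        level[v] = -1;                      /* -1 == not yet visited */

    level[source] = 0;
    queue[tail++] = source;

    while (head < tail) {
        int u = queue[head++];
        /* Scan u's adjacency list. Every col_idx[] and level[] access
           is a data-dependent load: cheap behind a cache hierarchy,
           but exactly what has to become explicit, batched DMA on
           the SPEs. */
        for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
            int w = col_idx[e];
            if (level[w] < 0) {
                level[w] = level[u] + 1;
                queue[tail++] = w;
            }
        }
    }
    free(queue);
}
[/code]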

The real concern for the consoles is whether devs can afford to invest in developing techniques for Cell. Otherwise Cell's performance in generic algorithms does become an issue over and above its top-end performance. You're not going to find those sorts of metrics in public papers on the whole. Who's going to release a story 'We ran a BFS implementation on Cell and found it wasn't very fast at all' when they could be writing 'We got a 22x speedup'?
 
You can probably optimize the x86 version, but I'm sure the Xeon still won't be as fast as CELL.

Not as fast on a single-core P4, but could you get it running as fast on a quad-core Conroe with equally heavy optimisation?

If so, then aside from the obvious (I assume) cost advantage, Cell's performance advantage would disappear. Say, you either get a decent portion of optimised Cell performance from a non-optimised (easy to implement) quad implementation, or you put the same amount of effort into optimising for the quad as you would into the Cell and attain similar performance.

Is that a realistic scenario?
 
The real concern for the consoles is whether devs can afford to invest in developing techniques for Cell. Otherwise Cell's performance in generic algorithms does become an issue over and above its top-end performance. You're not going to find those sorts of metrics in public papers on the whole. Who's going to release a story 'We ran a BFS implementation on Cell and found it wasn't very fast at all' when they could be writing 'We got a 22x speedup'?

I wholeheartedly agree. My post was aimed at the common but unfair apples-to-oranges comparisons that are often presented. Usually the main point of such a paper is the particular implementation of an algorithm for a certain architecture, but unless the paper includes a comparative evaluation of the work done, it is very likely to be rejected. The authors are also unlikely to expend the same amount of effort on this "other / alternative implementation", simply because it isn't the focus of their interest in the first place.

In other news, there is no free lunch.
 
[maven];946370 said:
The authors are also unlikely to expend the same amount of effort on this "other / alternative implementation", simply because it isn't the focus of their interest in the first place.
Not necessarily. If their research is mapping algorithms to Cell, it won't. But if the research is to find the best possible implementations, it will, within the limits of their ability to develop for different architectures. E.g. the Mercury imaging results, where they've obviously got the fastest method they could find for the standard architecture, because that's their livelihood, and when they moved to Cell and it outperformed the old system by a lot, it wasn't for want of optimization on the standard CPU.

They might inflate the numbers to encourage sales of new technology though!
 
Not as fast on a single-core P4, but could you get it running as fast on a quad-core Conroe with equally heavy optimisation?

If so, then aside from the obvious (I assume) cost advantage, Cell's performance advantage would disappear. Say, you either get a decent portion of optimised Cell performance from a non-optimised (easy to implement) quad implementation, or you put the same amount of effort into optimising for the quad as you would into the Cell and attain similar performance.

Is that a realistic scenario?

I don't know, but the explanations are all in the following document:

http://hpc.pnl.gov/people/fabrizio/papers/ipdps07-graphs.pdf

They compare the CELL against a P4/Opteron running their in-house BFS implementation, and against a dual-core Woodcrest running a scalable pthread BFS implementation.

The CELL implementation is faster than the pthread implementation on the dual-core Woodcrest, but the performance difference varies with the arity of the graph. Where the CELL is 22 times faster than the Xeon, it is 12 times faster than the dual-core Woodcrest, at a graph arity of 200.

With graphs of smaller arity, the performance difference between CELL and the conventional architectures shrinks.
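For a flavour of what that pthread version looks like, here is a minimal sketch of a level-synchronous BFS: the current frontier is expanded in parallel, the next frontier is built behind a lock, and a barrier separates the levels. All names here are hypothetical; this shows the general technique, not the paper's actual code.

[code]
/* Minimal sketch of a level-synchronous, pthread-parallel BFS.
   Hypothetical names and layout -- not the paper's implementation. */
#include <pthread.h>

#define NTHREADS 4

/* Graph in CSR form plus shared BFS state, set up by the caller. */
static int *row_ptr, *col_idx, *level;
static int *frontier, frontier_len;    /* vertices of the current level */
static int *next_front, next_len;      /* next level, built in parallel */
static pthread_mutex_t next_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;      /* initialised for NTHREADS */

static void *worker(void *arg)
{
    long tid = (long)arg;

    while (1) {
        /* Each thread expands a strided slice of the frontier. */
        for (int i = (int)tid; i < frontier_len; i += NTHREADS) {
            int u = frontier[i];
            for (int e = row_ptr[u]; e < row_ptr[u + 1]; e++) {
                int w = col_idx[e];
                if (level[w] < 0) {            /* cheap unlocked pre-check */
                    pthread_mutex_lock(&next_lock);
                    if (level[w] < 0) {        /* re-check under the lock */
                        level[w] = level[u] + 1;
                        next_front[next_len++] = w;
                    }
                    pthread_mutex_unlock(&next_lock);
                }
            }
        }

        pthread_barrier_wait(&barrier);
        if (tid == 0) {                /* one thread swaps the frontiers */
            int *tmp = frontier; frontier = next_front; next_front = tmp;
            frontier_len = next_len;
            next_len = 0;
        }
        pthread_barrier_wait(&barrier);

        if (frontier_len == 0)         /* no new vertices: search is done */
            break;
    }
    return NULL;
}
[/code]

The single lock on the next frontier is the obvious bottleneck of a naive version like this; a properly scalable implementation gives each thread a private buffer and merges them at the barrier.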
 
I think the P4 optimization should be pretty good, since they use it in-house too. Even fully optimized, I don't expect a P4 to outrun a 128-CPU BlueGene on a BFS problem of decent size.

The question is the one Shifty raised. In game programming, I think it's more about near-real-time constraints than raw computational throughput. Developers have to get tons of things done in between frames, and then maintain the resulting code.

All I want to know, at the end of this year or early next, is which developers, cross-platform or not, are willing to devote the extra time to surprise the gamers... because they will get my money. :)
 
The algorithm noted here is very well conceived, but one important factor was missed. Somebody touched on it earlier (that the CELL has 9 cores)...

Basically, when the Woodcrest, Pentium 4 HT or any of the other single-chip, multi-core processors are performing this operation, they are fully committed to the task. The CELL PPE core, however, is only waiting for the SPEs to complete in order to use the results. If this operation were pipelined early enough in (for example) a real-time game or physics simulation, the two hardware threads on the PPE core could happily continue to process other tasks until the results were available. Although memory bandwidth would be somewhat compromised by the frequent load/store DMAs triggered by the SPEs, there should still be plenty of bandwidth left for the main core's threads to continue.

This basically means that in its current implementation, aside from setting up the initial graph, the whole latency of the operation could be hidden by pipelining other tasks on the PPE core.
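As a minimal sketch of that idea, assuming the libspe2 interface (the bfs_spu program handle, bfs_args_t struct and do_other_frame_work() are hypothetical stand-ins), the PPE can launch the SPE contexts from helper threads and get on with other work until it actually needs the results:

[code]
/* PPE side: kick the BFS kernel off on the SPEs asynchronously, do
   other work, then join. bfs_spu, bfs_args_t and do_other_frame_work()
   are hypothetical stand-ins, not names from the paper. */
#include <libspe2.h>
#include <pthread.h>

#define NSPES 8   /* 8 on a full CELL; 6 usable on a PS3 */

extern spe_program_handle_t bfs_spu;   /* embedded SPU program image */

typedef struct {
    unsigned long long graph_ea;       /* effective address of the graph */
    int first_vertex, last_vertex;     /* this SPE's partition */
} bfs_args_t;

void do_other_frame_work(void);        /* whatever else the frame needs */

static void *run_spe(void *argp)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);

    spe_program_load(ctx, &bfs_spu);
    spe_context_run(ctx, &entry, 0, argp, NULL, NULL);  /* blocks here */
    spe_context_destroy(ctx);
    return NULL;
}

void ppe_frame(bfs_args_t args[NSPES])
{
    pthread_t tid[NSPES];

    /* One helper thread per SPE; each blocks in spe_context_run
       while its SPE grinds through the graph. */
    for (int i = 0; i < NSPES; i++)
        pthread_create(&tid[i], NULL, run_spe, &args[i]);

    /* Both PPE hardware threads are now free for other game /
       simulation work while the search runs on the SPEs. */
    do_other_frame_work();

    /* Pick up the BFS results only when they're actually needed. */
    for (int i = 0; i < NSPES; i++)
        pthread_join(tid[i], NULL);
}
[/code]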

Another implementation would be to redesign the algorithm to take the two other threads into account and perhaps gain some more bandwidth. This would give *true* performance figures for a fully optimized CELL implementation.
 
That point is not valid with the Pentium 4 with HT.

You also might be underestimating how much the memory traffic from the SPEs would impact the performance of the PPE, and just how much of its resources are actually available.
 
Could this run on G80? And if so, assuming a highly optimised implementation like Cell received, what kind of performance could we expect?
 
That point is not valid with the Pentium 4 with HT.

You also might be underestimating how much the memory traffic from the SPEs would impact the performance of the PPE, and just how much of its resources are actually available.

P4 with HT - what was I thinking - Sorry!


I did mention the impact of the memory transfers on the main threads, but I don't believe I underestimated it. Bear in mind the PPE core has its own L1 data and instruction caches backed by a fairly substantial L2, so I'm absolutely certain that anybody who spends the amount of time required to optimize this algorithm to this level will be very aware of the extra power available (albeit with a little more work) and make use of it as efficiently as possible ;)

D.
 