Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

Discussion in 'Console Technology' started by Alpha_Spartan, Nov 16, 2005.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yep. PPE has basically nothing to do with it.

    SPEs have looping. It's an 18-cycle penalty for getting a branch wrong, I think. The lack of branch prediction simply means the programmer has to do all the work, providing code that does what a branch predictor would do.
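    In practice, "doing the branch predictor's work yourself" often means removing the branch entirely and computing both sides with a select. A minimal C sketch of the idea (ordinary scalar C, not SPU intrinsic code — the function name is illustrative):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Branchless max: the kind of hand transform SPE programmers apply
       where a mispredicted branch would otherwise cost ~18 cycles.
       The compare produces an all-ones/all-zeros mask, which selects
       between the two operands without any branch. */
    static int32_t select_max(int32_t a, int32_t b)
    {
        int32_t mask = -(int32_t)(a > b);   /* 0xFFFFFFFF if a > b, else 0 */
        return (a & mask) | (b & ~mask);
    }

    int main(void)
    {
        assert(select_max(3, 7) == 7);
        assert(select_max(-5, -9) == -5);
        assert(select_max(4, 4) == 4);
        return 0;
    }
    ```

    On the actual SPU the compare/select would map onto the vector compare and select instructions, but the principle is the same: both paths execute, and no prediction is needed.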

    Just the same as the programmer has to do all the work with the LS, rather than having the convenience of a cache (even if that convenience comes with certain restrictions).

    You could call the SPE a hardware engineer's revenge on programmers for being so lazy all these years and soaking up everything that Moore's law has so far provided...

    Jawed
     
  2. ERP

    ERP
    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    Unfortunately instruction scheduling isn't something that "modern" compilers have to do well, because "modern" processors do a lot of the job for them.

    Instruction scheduling on an in-order core with large instruction latencies is a HARD problem, because the ruleset is large, and it's generally done as a collection of heuristics in the compiler back end.

    IME "modern" compilers do a totally crappy job at hiding instruction latency on both X360 and PS3. In some cases I don't think they have enough information to do a good job. If you have a simple linear series of operations on a memory sequence on one of these cores, you're looking at interleaving many (10+) copies of that loop to hide the latency. A C compiler in most cases simply can't ascertain whether that unroll is a good optimisation: it doesn't know how much data might be being worked on, and it isn't allowed to "pad" the data to avoid multiple exit conditions inside the loop. That leaves two options: add hints to the language or use some sort of profile-guided optimisation.
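    The interleaving described above can be sketched with independent accumulators — each add feeds its own dependency chain, so on an in-order core with multi-cycle FP latency the adds overlap instead of stalling back-to-back. A minimal sketch (assumes n is a multiple of 4, exactly to dodge the "multiple exit conditions" problem mentioned above; names are illustrative):

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Sum with four independent accumulators: the compiler can rarely
       prove this unroll is safe/profitable on its own, so it gets
       written by hand. */
    static float sum4(const float *x, size_t n)
    {
        float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
        for (size_t i = 0; i < n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }

    int main(void)
    {
        float v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        assert(sum4(v, 8) == 36.f);
        return 0;
    }
    ```

    Note this also changes the order of the floating-point additions, which is another reason a standards-conforming C compiler can't do it for you without permission.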

    The only real reason that modern compilers are competitive with "hand coded assembly language" on processors is that for most code cache misses dominate performance, and as long as the code is similar in size and reads about the same amount of memory it will perform in a similar fashion.
     
  3. Bowie

    Newcomer

    Joined:
    Feb 10, 2002
    Messages:
    63
    Likes Received:
    0
    Can any programmers here answer my previous question on how the SPEs compare in general-purpose code to the MIPS core in the EE? I just want a frame of reference on the leap in performance from last generation to the next.
     
  4. pc999

    Veteran

    Joined:
    Mar 13, 2004
    Messages:
    3,628
    Likes Received:
    31
    Location:
    Portugal
    About Cell I don't know, but an orchestra score/training is hell until you get it right. Thanks.
     
  5. The GameMaster

    Newcomer

    Joined:
    Feb 9, 2005
    Messages:
    109
    Likes Received:
    1
    I would just like to state for the record that "Super Computers" do not make for good "Gaming Computers".

    Super Computers are built for specific purposes and are not for everyday use.
     
  6. randycat99

    Veteran

    Joined:
    Jul 24, 2002
    Messages:
    1,772
    Likes Received:
    12
    Location:
    turn around...
    Yeah, cuz the game development tools and hardware graphics acceleration have been pretty, erm, sparse for historic supercomputers. :roll:
     
  7. flick556

    Newcomer

    Joined:
    May 4, 2003
    Messages:
    163
    Likes Received:
    4
    I can't help but think that when people say general purpose code they are talking about legacy, previously written code. From that perspective I can agree: x86 processors have dedicated hardware features and mature compilers that allow them to execute most any code.

    There are billions of lines of preexisting code, and many developers rely on libraries of preexisting code and compiled libraries as opposed to writing everything anew. I think any new processor design will be faced with this hurdle, and a large burden is placed on compilers and middleware. The reason x86 can execute most any code has as much to do with the years of compiler optimizations as it does with the hardware itself.

    Cell should receive a nice developer following through IBM, Sony, Open Source, and all the console developers. It will take time but high quality compilers and libraries for cell will emerge. I think other processor designers should be threatened by this new design, because flops do mean something.

    Companies do some shady things to make their flop ratings look higher than they really are and some architectures have really low efficiency. That does not mean that flops are no longer important, it just means many flop ratings are deceitful.
     
  8. SedentaryJourney

    Regular

    Joined:
    Mar 13, 2003
    Messages:
    478
    Likes Received:
    28
    GP code? I have the answer!

    Here it is: General purpose code=bad cell code...that is all. :p
     
  9. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Um, from all the documentation available, this isn't true. The SPUs do have limited autonomy but must be configured and set up via the PPE in order to run any code. The SPUs are not Turing complete.

    Aaron Spink
    speaking for myself inc.
     
  10. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    You are making large assumptions about programming efficiency that you can't reasonably make. The efficiency of Cell will likely be reasonably low. The programming model IS quite complex. Structured data access is a lot harder to do than to talk about.

    The primary issue is that anything you do to speed up a Cell-style processor will be equally applicable to a non-Cell-style processor, but not vice versa.

    Aaron Spink
    speaking for myself inc.
     
  11. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    There are people that think Intel and HP got together to design a new CPU and produced something useless. Do you honestly believe them that stupid and incompetent?

    There are people that think IBM, Apple, and Motorola got together to design a new CPU and produced something useless. Do you honestly believe them that stupid and incompetent?

    Just because several big companies get together doesn't mean they will produce something spectacular.

    Cell is an interesting design, but it does have some serious issues and compromises that may be detrimental. To gloss over these and assume that it will all be overcome is folly.

    Aaron Spink
    speaking for myself inc.
     
  12. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Pretty bad since all data access for an SPU has to be DMA'd in.

    The explicit memory management has both pluses and minuses. The primary minus being a complicated programming model and an inefficient use of storage space.
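    The standard answer to that complexity is double buffering: while the DMA engine fills one local-store buffer, the SPU computes on the other. A toy sketch of the pattern in plain C — memcpy stands in for a real DMA transfer (on Cell this would be MFC DMA commands plus tag waits), and CHUNK, process_chunk, and stream are illustrative names, not from any SDK:

    ```c
    #include <assert.h>
    #include <string.h>

    #define CHUNK 4

    static void process_chunk(int *buf)        /* toy kernel: double each element */
    {
        for (int i = 0; i < CHUNK; i++) buf[i] *= 2;
    }

    static void stream(int *main_mem, int nchunks)
    {
        int ls[2][CHUNK];                      /* two "local store" buffers */
        memcpy(ls[0], main_mem, sizeof ls[0]); /* prime the first transfer */
        for (int c = 0; c < nchunks; c++) {
            int cur = c & 1, nxt = cur ^ 1;
            if (c + 1 < nchunks)               /* start "DMA" of the next chunk... */
                memcpy(ls[nxt], main_mem + (c + 1) * CHUNK, sizeof ls[nxt]);
            process_chunk(ls[cur]);            /* ...while computing on this one */
            memcpy(main_mem + c * CHUNK, ls[cur], sizeof ls[cur]);
        }
    }

    int main(void)
    {
        int d[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        stream(d, 2);
        assert(d[0] == 2 && d[3] == 8 && d[7] == 16);
        return 0;
    }
    ```

    The storage-space cost Aaron mentions is visible even in this toy: two buffers are resident in "local store" to process one stream, before any shared data or code is accounted for.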

    Aaron Spink
    speaking for myself inc.
     
  13. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Which is why there have been some very catastrophic bridge failures and collapses over the years. Bridges involve some complicated physics and physical interactions with the world. You have to make sure that simple things like 20 MPH winds won't cause oscillations which result in the bridge tearing itself apart, as has been captured on film.

    You're pretty much describing DAXPY workloads, which just about any processor can churn through. The problem is that most real workloads may have sequences that will act like DAXPY workloads, but those sequences are surrounded by much more complex code.
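    For reference, DAXPY (y = a*x + y) really is as simple as a kernel gets — one multiply-add per element, perfectly predictable access, no branches beyond the loop — which is why it's a poor proxy for whole programs:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* The textbook streaming kernel: trivially vectorisable, trivially
       prefetchable.  Real workloads wrap stretches like this in far
       messier control flow and data access. */
    static void daxpy(size_t n, double a, const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        double x[3] = {1, 2, 3}, y[3] = {10, 10, 10};
        daxpy(3, 2.0, x, y);
        assert(y[0] == 12 && y[1] == 14 && y[2] == 16);
        return 0;
    }
    ```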

    You are assuming that the hardware will be suited for the codes which may or may not be correct. You can't just make an array inversion into not an array inversion.

    Cell has a complex programming model with no dynamic data access between the processing elements, nor direct access to data storage, requiring programmers to jump through myriad hoops just to get data into the SPU. It will also require private copies in each SPU of any data that is used by more than one SPU. It doesn't allow an efficient method of sharing data structures, nor does it allow more than one SPU to update a data structure in an efficient and programmer-friendly manner.

    Aaron Spink
    speaking for myself inc.
     
  14. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Yes, in the same way that XCPU can have infinite threads. The reality is that you get 2 general purpose threads and 7 attached media processors.

    Aaron Spink
    speaking for myself inc.
     
  15. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    I think you might be overestimating the amount of number crunching involved. You might want to take a look at some of the HPC applications out there. Even in the most physics-heavy HPC codes, there is a surprising amount of what could not be described as number crunching code.

    Aaron Spink
    speaking for myself inc.
     
  16. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Then you haven't been paying attention...

    Then by your definition, a streaming processor is useless because it processes ephemeral vapor by passing it through various processing elements to be spit into the ether.

    SPEs have NO access to external memory. All data into and out of an SPE must be moved explicitly by a DMA engine. There is no method for access to external memory from within an SPE program except by configuring a DMA descriptor and loading it into the DMA engine.

    The SPEs ARE streaming processors, being practically the very definition of a stream processor. They have severely limited integer, logical, and control flow capabilities. To describe them as general purpose mischaracterises them and does an injustice to their designed intent.

    Aaron Spink
    speaking for myself inc.
     
  17. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    I believe you need to take a couple courses in computer architecture and maybe study up on the technical papers. Until then, you might not want to try to correct other people who have a little better grasp on the technical issues.

    Aaron Spink
    speaking for myself inc.
     
  18. London Geezer

    Legend Subscriber

    Joined:
    Apr 13, 2002
    Messages:
    24,151
    Likes Received:
    10,297
    Wow Aaron, the last 9 posts were yours!! Would be nice to have a "merge" option on these boards.
     
  19. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Actually, dual threading supports at most a 100% performance advantage. Real-world workload performance will, however, vary. But there are certain classes of operations, mostly surrounding the traversal of linked lists, which will achieve a 100% performance increase with dual threading.
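    The linked-list case is worth spelling out: traversal is one long dependency chain, so each cache miss stalls the core completely, and a second hardware thread can run its own chain in those stall cycles. A minimal illustration of the chain (every load of `n->next` must complete before the next iteration's load can even issue):

    ```c
    #include <assert.h>
    #include <stddef.h>

    struct node { int value; struct node *next; };

    /* Pointer chasing: the address of each load comes from the previous
       load, so there is no instruction-level parallelism for the core
       to exploit on its own — which is exactly where SMT can approach
       its theoretical 100% gain by overlapping two such chains. */
    static int sum_list(const struct node *n)
    {
        int s = 0;
        while (n) {
            s += n->value;
            n = n->next;
        }
        return s;
    }

    int main(void)
    {
        struct node c = {3, NULL}, b = {2, &c}, a = {1, &b};
        assert(sum_list(&a) == 6);
        return 0;
    }
    ```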

    And Tera supported 128 threads per core. Where are they now?

    Aaron Spink
    speaking for myself inc.
     
  20. Brimstone

    Brimstone B3D Shockwave Rider
    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    1,835
    Likes Received:
    11
    Aaron, do you have an opinion/insight on the impact Transmeta's LongRun2 technology will have on a multicore design like CELL? It's unlike the single fat-core design of the Crusoe processor.
     
