The SPE as general purpose processor

Shifty Geezer said:
Thread title : The SPE as general purpose processor

Part of that discussion is how well the SPE can handle branchy code. Barbarian points out it could be very good in some cases.

It's you who are talking about Cell as a whole, without specifying whether you mean the whole processor or just the SPEs.

I can't argue this. I don't know what the SPE's branch prediction is like. But Barbarian has said the SPE can handle code that relies on branch prediction as well as a general-purpose CPU.

Now this is the real debate, and one you need to take up with Barbarian. Give your reasons why you think SPEs can't handle branch prediction as fast as a P4, and he gives his reasons why he thinks they can. Or give evidence of the performance of a software branch predictor versus a hardware branch predictor. Just saying 'SPEs are no good at this' without giving reasons or evidence isn't contributing intelligently to the debate.

I am not going to repeat the same thing that is said on most of the technical sites:

arstechnica said:
"Don't bother suggesting that the PS3 can use its SPEs for branch-intensive code, because the SPEs lack branch prediction entirely"

This is what Ars Technica says, for example. If Barbarian thinks otherwise, then he should provide some proof instead of just wild guessing.

And I am not going to argue this further, since people more technical than me have already stated what I said.

That's the discussion of this thread - how good are SPEs at running different types of code? Don't go saying 'that's a waste of SPE power' as an argument against software branch prediction, because that's not talking about how well SPEs execute different types of code.

How well they perform is directly related to how efficiently they can work.
 
supervegeta said:
I am not going to repeat the same thing that is said on most of the technical sites:

This is what Ars Technica says, for example. If Barbarian thinks otherwise, then he should provide some proof instead of just wild guessing.
That Arse does not have a clue.

 
supervegeta said:
This is what Ars Technica says, for example. If Barbarian thinks otherwise, then he should provide some proof instead of just wild guessing.
For the SPEs the compiler does the branch prediction, not the processor itself. But they can still use branch prediction.
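
In practice that means the programmer or the compiler marks which way a branch is expected to go. As a minimal sketch of what that can look like at the source level (assuming a GCC-based SPU toolchain; whether the hint actually becomes an SPU branch-hint instruction is up to the compiler, not the hardware):

#include <stdio.h>

/* GCC's __builtin_expect tells the compiler which outcome to assume; on the
   SPE it is then the compiler's job to lay out the code and, if the toolchain
   supports it, place a branch hint accordingly. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

static int process(int value)
{
    if (UNLIKELY(value < 0))   /* rare error path: assume not taken */
        return -1;
    return value * 2;          /* common path: laid out as fall-through */
}

int main(void)
{
    printf("%d\n", process(21));   /* prints 42 */
    return 0;
}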
 
rounin said:
LOL. Arstechnica is "proof" now. :LOL:

Ha ha, wow. The SPEs lacking branch prediction is a fact, and that is my point.

As for the credibility of Ars Technica, it surely has more credibility than some anonymous random poster on an internet board.
 
supervegeta said:
Not with the same efficiency.
That depends entirely on the compiler. Theoretically it can be almost perfect; in practice it probably gets all the loops right and most of the conditionals, which is a bit worse than hardware prediction. And while the compiler will improve over time, it will never get it right all the time. That's what ERP was saying.
 
DiGuru said:
That depends entirely on the compiler. Theoretically it can be almost perfect; in practice it probably gets all the loops right and most of the conditionals, which is a bit worse than hardware prediction. And while the compiler will improve over time, it will never get it right all the time. That's what ERP was saying.

Then it is not exactly that easy, but I will leave it to more technical people like 3dilettante to argue this.
 
The software branch prediction in the SPE relies on the branch hint appearing at least 11 cycles before the branch instruction it is supposed to act on, and only one hint can be active at a time. This is OK for loops, but not very useful for AI calculations with a high conditional-branch rate (if you need to insert 11 stall cycles just to set up the branch hint, you are actually better off eating a 50%-probability 18-cycle branch mispredict penalty instead).

As for the compiler inserting hints etc.: do not overestimate the power of compiler technology. Intel did just that, and it led them straight into the Itanium disaster.
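
To put rough numbers on the first point, using the figures above (real cycle counts depend on the surrounding code): just guessing on a 50/50 branch costs about 0.5 × 18 = 9 cycles on average, which is already less than the roughly 11 cycles of lead time the hint needs. The hint only pays off when the compiler can hide that lead time behind other useful work.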
 
supervegeta said:
This is what Ars Technica says, for example. If Barbarian thinks otherwise, then he should provide some proof instead of just wild guessing.

Right... I think a dev who is working with the PS3, like Barbarian, is doing a little bit more than wild guessing.
 
arjan de lumens said:
As for the compiler inserting hints etc.: do not overestimate the power of compiler technology. Intel did just that, and it led them straight into the Itanium disaster.
True, it has some similarities with Itanium. For example, in some cases it's most efficient to execute the code in both branches (letting the dual-issue pipelines do the work) and execute the branch late, to give the branch prediction time to take effect and thereby avoid penalties.

It's probably true that it will take some time before compilers can do this efficiently.
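
To make the "compute both sides and select" idea concrete, here is a rough scalar C sketch (no SPU intrinsics; the clamp example is made up purely for illustration). Later posts show the same idea with the SPE's actual select instruction.

#include <stdio.h>

/* Branchy version: something has to predict which way each 'if' goes. */
static int clamp_branchy(int x, int lo, int hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* Branch-free version: both "sides" are computed and a mask picks the
   result, so there is nothing to mispredict. */
static int clamp_select(int x, int lo, int hi)
{
    int below = -(x < lo);                 /* all-ones if x < lo, else 0 */
    int above = -(x > hi);                 /* all-ones if x > hi, else 0 */
    int r = (below & lo) | (~below & x);   /* pick lo or x */
    return (above & hi) | (~above & r);    /* pick hi or r */
}

int main(void)
{
    printf("%d %d\n", clamp_branchy(42, 0, 10), clamp_select(42, 0, 10)); /* 10 10 */
    return 0;
}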
 
Shifty Geezer said:
That wasn't clear to me. Being good at integer math doesn't really settle the question of whether SPEs are good at 'integer workloads' either, as it doesn't take into account the other non-math functions, which is the crux of the discussion, though I think it took a bit of a detour there with someone doubting the SPE's int-math performance.

It's one of those subjects that really can only be settled with benchmarks, I think. Take a useful real-world routine whose integer performance you care about on a normal processor and port it verbatim to the SPE, maybe with some memory management beyond pure on-demand fetching - there's a software cache solution IIRC that could be employed. Then see how well the SPE copes with branching and random memory access compared with the PPE, and see whether it is fast enough to be usable or so slow as to be a 'last resort'.

The SPE isn't bad at branching - provided you are not branching outside of the SPE's local memory, in which case you end up having to load code from main memory - the equivalent of a cache miss on the PPE (actually a lot worse, because the code cannot be executed while it is being loaded). Most critical code sections are tight loops, so this is usually not a problem. A program which is properly coded to run within the SPE's local memory will probably run faster than on the PPE, because the programmer can guarantee through the coding that there is no access to main memory.

For random memory access from main memory, again the SPE can do this efficiently using gather-scatter list DMA if it involves accessing data in reasonably large blocks, especially if it can process the previous block while the next block is being transferred.

What the SPE can't do well is handle large pointer-based structures and stacks. Large data structures like stacks have to live in main memory, and while the SPE can efficiently load blocks of data into local memory for processing, it won't be efficient at loading or testing individual bytes or words in such a structure if the rest of the block doesn't need to be processed. So for large data structures the SPE will be handicapped. This will affect sort, search and collision detection algorithms. It may, however, be possible to rewrite sort or search algorithms to process smaller parts of the data structure in parallel, or to include collision detection checks while carrying out the geometric processing of objects for animation, so that the object does not need to be loaded again to test for collisions.
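
Something like the following is what "process one block while fetching the next" looks like in practice - a rough double-buffering sketch assuming the Cell SDK's spu_mfcio.h MFC calls (the post talks about list DMA; this only shows the plain-DMA double-buffering half, and process_block() and the sizes are made-up placeholders):

#include <spu_mfcio.h>

#define BLOCK_SIZE 16384                      /* 16 KB per DMA transfer */

static char buf[2][BLOCK_SIZE] __attribute__((aligned(128)));

extern void process_block(char *data, unsigned size);   /* hypothetical */

void stream_blocks(unsigned long long ea, unsigned num_blocks)
{
    unsigned cur = 0;

    if (num_blocks == 0)
        return;

    /* Kick off the first transfer into buffer 0 (DMA tag 0). */
    mfc_get(buf[cur], ea, BLOCK_SIZE, cur, 0, 0);

    for (unsigned i = 0; i < num_blocks; i++) {
        unsigned next = cur ^ 1;

        /* Start fetching block i+1 into the other buffer before touching block i. */
        if (i + 1 < num_blocks)
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * BLOCK_SIZE,
                    BLOCK_SIZE, next, 0, 0);

        /* Wait only for the buffer we are about to process. */
        mfc_write_tag_mask(1u << cur);
        mfc_read_tag_status_all();

        process_block(buf[cur], BLOCK_SIZE);   /* overlaps with the next DMA */

        cur = next;
    }
}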
 
Crossbar said:
True, it has some similarities with Itanium. For example, in some cases it's most efficient to execute the code in both branches (letting the dual-issue pipelines do the work) and execute the branch late, to give the branch prediction time to take effect and thereby avoid penalties.

It's probably true that it will take some time before compilers can do this efficiently.

Is this handled with the 3-way select statement? How close together can these be in code?

I'm wondering if this could be used to avoid branch penalties with a lot of nested loops. There would be a cost in executing a lot of ultimately useless ops though, no?
 
arjan de lumens said:
The software branch prediction in the SPE relies on the branch hint appearing at least 11 cycles before the branch instruction it is supposed to act on, and only one hint can be active at a time. This is OK for loops, but not very useful for AI calculations with a high conditional-branch rate (if you need to insert 11 stall cycles just to set up the branch hint, you are actually better off eating a 50%-probability 18-cycle branch mispredict penalty instead).
As for the compiler inserting hints etc.: do not overestimate the power of compiler technology. Intel did just that, and it led them straight into the Itanium disaster.

Yes, the compiler will help, but it won't do magic. Like I said, even taking a 50% chance of an 18-cycle penalty is not THAT bad. A P4's hardware predictor just minimizes that chance, but when it does mispredict the pipeline bubble can be more than 35 cycles, so there is still quite a penalty, just not as often.
On the other hand, certain types of branches can be converted to a SELECT instruction (which computes both sides and selects the result), or replaced by appropriate combinations of MIN, MAX, AND, OR etc. which compute the same result as the branch. Some of these are converted very successfully by the compiler, which is quite handy.
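
A per-element max is one of the easy cases. Here is a minimal hand-written sketch using the SPU C intrinsics from spu_intrinsics.h (assuming that interface; in simple cases like this the compiler can generate equivalent code from a plain conditional on its own):

#include <spu_intrinsics.h>

/* Computes out[i] = (a[i] > b[i]) ? a[i] : b[i] with no branch at all. */
vector float vec_max(vector float a, vector float b)
{
    /* All-ones per element where a > b, zeros elsewhere. */
    vector unsigned int gt = spu_cmpgt(a, b);

    /* Both "sides" already exist (a and b); spu_sel picks per bit:
       mask bit 0 -> first operand, mask bit 1 -> second operand. */
    return spu_sel(b, a, gt);
}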
 
scificube said:
Is this handled with the 3-way select statement? How close together can these be in code?

I'm wondering if this could be used to avoid branch penalties with a lot of nested loops. There would be a cost in executing a lot of ultimately useless ops though, no?
The select instructions can probably sit right next to each other (if there is a case where that would make sense), but I guess you are wondering whether it may cause any stalls. That I don't know; I have not seen any information about it.

I found these guidelines by Barry Minor (probably posted elsewhere in this forum as well - apologies if that is the case and someone gets upset), but they seem to address your question and belong in the context of this thread.

In general it is best to code the SPEs using a few simple rules:

1) Try to avoid function calls (inline)
2) Try to remove branches by using spu_select instructions
3) Unroll loops to produce large blocks of independent ops
4) Interleave blocks of unrelated code to reduce dependencies

You have 128 registers and a 6 cycle dual issue pipeline so make those registers work for you. Branches produce walls that stop the compiler from moving code around and therefore scheduling it well. Computing both sides of a branch and then using spu_select to choose the correct answer may seem like a waste of compute cycles but you're more than likely going to get the other half of the branch for free as the additional ops just fill the pipeline bubbles.

Follow those 4 rules and the first compile of your SPE code will more than likely have a Cycles per Instruction (CPI) of less than 1.0.
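
As a toy illustration of rules 2 and 3 (assuming spu_intrinsics.h again; the clamp operation and the unroll factor of four are made up for the example):

#include <spu_intrinsics.h>

/* Clamp n vectors of floats to a threshold: x[i] = min(x[i], limit). */
void clamp_array(vector float *x, unsigned n, float limit)
{
    vector float lim = spu_splats(limit);
    unsigned i;

    /* Unrolled by four: the four iterations are independent, so their
       compare/select chains can overlap in the dual-issue pipeline. */
    for (i = 0; i + 4 <= n; i += 4) {
        vector float a = x[i + 0];
        vector float b = x[i + 1];
        vector float c = x[i + 2];
        vector float d = x[i + 3];

        x[i + 0] = spu_sel(a, lim, spu_cmpgt(a, lim));  /* branch-free min */
        x[i + 1] = spu_sel(b, lim, spu_cmpgt(b, lim));
        x[i + 2] = spu_sel(c, lim, spu_cmpgt(c, lim));
        x[i + 3] = spu_sel(d, lim, spu_cmpgt(d, lim));
    }

    for (; i < n; i++)                                  /* leftover vectors */
        x[i] = spu_sel(x[i], lim, spu_cmpgt(x[i], lim));
}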
 