The SPE as general purpose processor

No probs, thanks for your thoughts!

I was only wondering if there was a limitation on using select instructions in close proximity to one another in code, for example like how you can't safely put a branch hint within 11 instructions of another branch hint.

I also wanted to know if compilers would attempt to dual-issue the instructions from at least two of the paths of the select instruction, as I thought that is what you implied by saying "letting dual issue work." Or at least, is this a possible target for compiler optimization?

Ultimately, I was just curious as to whether the select instr. could cover the instances where branches are tightly wound, at the cost of some extra ops being executed, whereas the branch hint would cover instances where control statements are not so tightly packed and this penalty could be avoided.
 
Actually the select instruction is not a "3-way" instruction, it's just a kind of 3-operand instruction: basically a conditional assignment which chooses from 2 sources and uses a bitmask as a "condition pattern".

[Image: diagram of the SPU select instruction]
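A rough scalar sketch of that behavior in C (the function name is mine; the actual SPU instruction, selb, does the same thing across full 128-bit registers):

Code:
/* Bitwise select: each result bit comes from b where the mask bit is 1,
   and from a where it is 0. */
unsigned int select_bits(unsigned int a, unsigned int b, unsigned int mask)
{
    return (a & ~mask) | (b & mask);
}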


When you use the select instruction you never actually do a branch, so the rules of the branch hint instruction do not apply.

In my post where I referred to the Itanium, I did not have the select instruction in mind; it was the general branch case. The select instruction does have similarities with the Itanium's predication, but it has been some time since I studied the Itanium instruction set - it's not that hot anymore. ;)
 
scificube said:
I'm wondering if this could be used to avoid branch penalties with a lot of nested loops. There would be a cost in executing a lot of ultimately useless ops though, no?
Loops aren't usually a type of problem where you'd need to worry about the cost of their 'end' branches. However, if you have a tight loop with a ton of tiny nested loops inside it, you should be rethinking your algorithm first before worrying about the branching speed of end-of-loop conditionals. And that goes equally for any CPU you'd run it on, not just the SPE.

The purpose of predication (what 'select' is used for) is to remove actual branch decisions that are not really necessary but which the compiler generates by default. E.g., to give the simplest (most obvious) example:
"if (a > d) a = b; else a = c;" can be resolved by the select instruction, rather than resorting to costly branches (again, this will tend to be faster via predication on any modern CPU, no matter how good its branch predictor is).

Of course this example is a trivial case, and certain compilers are smart enough to sometimes recognize these automatically. A much more interesting case of optimization is removing branches inside loops - anyone that's ever done serious loop optimization will tell you that branching inside a loop basically kills any attempt to optimize properly.
If you can use predication in those places, you will still get a speedup even if your CPU has zero branch penalty (hence why many of us used predication on the VUs, even though branches weren't costing us anything there, and we didn't have an actual select instruction for it).
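To illustrate the loop case (plain C again, function name mine): replacing a data-dependent branch in the body with a mask leaves the loop with straight-line control flow, so it can be unrolled and software-pipelined freely:

Code:
/* Predicated loop body: "if (x[i] < 0) x[i] = 0;" without a branch. */
void clamp_negatives(int *x, int n)
{
    for (int i = 0; i < n; i++) {
        int mask = -(x[i] < 0);  /* predicate as an all-ones/all-zeros mask */
        x[i] &= ~mask;           /* select 0 where negative, keep x[i] else */
    }
}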
 
arjan de lumens said:
The software branch prediction in the SPE relies on the branch hint appearing at least 11 cycles before the branch instruction it is supposed to act on, and only one hint can be active at a time. This is OK for loops but not very useful for high-conditional-branch-rate AI calculations (if you need to insert 11 stale cycles just to do the branch hint, you are actually better off eating a 50%-probability 18-cycle branch mispredict penalty instead.)
At least you have a choice: if you have some useful work to do after you know the branch target, you can insert it ... on most general purpose processors, all you can do with code like this is accept that you are screwed whatever you do.
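To sketch that idea in C (a structural illustration, not verified SPE code; as far as I know spu-gcc can turn a __builtin_expect hint into a branch hint when it can schedule it early enough): resolve the condition as early as possible, then fill the hint's lead time with branch-independent work:

Code:
static int taken_path(int v) { return v * v - 7; }  /* stand-in for real work */

int process(const int *data, int n)
{
    int sum = 0, acc = 0;
    for (int i = 0; i < n; i++) {
        int take = data[i] > 0;  /* branch condition resolved early          */
        acc += data[i] * 3;      /* independent work covers the hint's delay */
        if (__builtin_expect(take, 1))
            sum += taken_path(data[i]);
    }
    return sum + acc;
}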
 
MfA said:
At least you have a choice: if you have some useful work to do after you know the branch target, you can insert it ... on most general purpose processors, all you can do with code like this is accept that you are screwed whatever you do.

Most general purpose processors will have a real BTB that supplies the instructions at the branch target, which means that your branch bubble will be at most 1 cycle.

You're right that it is possible to hide the delay by having enough useful work. But it's like having an eleven-cycle branch delay slot: if you're out of work, you're screwed.

Cheers
 
Gubbi said:
Most general purpose processors will have a real BTB that supplies the instructions at the branch target, which means that your branch bubble will be at most 1 cycle.
BTBs are fine for deeply nested loops ... when the coherency between branches is from one frame to another you can completely forget about hitting in the BTB though. Static branch prediction has the same overhead when it gets it right.
 
MfA said:
BTBs are fine for deeply nested loops ... when the coherency between branches is from one frame to another you can completely forget about hitting in the BTB though. Static branch prediction has the same overhead when it gets it right.

The only advantage a software BTB has over a hardware one is in the case of indirect, computed branches; these are mostly found in *big* switch statements (like switch-based interpreters) and virtual function calls (with poor coherence). I don't think that is a workload that suits the SPEs well anyway.

BTW. I think you're confusing the BTB with the BHT in your above statement. The BTB only deals with the destination of a branch, not whether it's taken or not.

Cheers
 
To an extent, but either will only be filled with relevant information if there is some short-range coherency. Decision trees are a worst-case scenario, period; they will flush these buffers straight down the john.
 
Gubbi said:
The only advantage a software BTB has over a hardware one is in the case of indirect, computed branches; these are mostly found in *big* switch statements (like switch-based interpreters) and virtual function calls (with poor coherence). I don't think that is a workload that suits the SPEs well anyway.
Cheers

Not necessarily. If you're running some collision detection code on the SPU, you might need to do a 'virtual call' to the proper collision routine for a primitive pair, i.e. box-box, box-sphere, etc. Software prediction could be quite handy in a case like this. There are other ways to do this of course, but it's good to know there's a cheap and easy option.
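A hypothetical sketch of that 'virtual call' (all names invented): a table of function pointers indexed by the two primitive types, which leaves a single computed branch per pair for a software hint to target:

Code:
typedef enum { SHAPE_BOX, SHAPE_SPHERE, SHAPE_COUNT } ShapeType;

typedef struct { ShapeType type; float data[8]; } Shape;
typedef int (*CollideFn)(const Shape *a, const Shape *b);

static int box_box(const Shape *a, const Shape *b)       { return 0; /* ... */ }
static int box_sphere(const Shape *a, const Shape *b)    { return 0; /* ... */ }
static int sphere_box(const Shape *a, const Shape *b)    { return box_sphere(b, a); }
static int sphere_sphere(const Shape *a, const Shape *b) { return 0; /* ... */ }

/* One computed branch per pair instead of a chain of type tests. */
static const CollideFn dispatch[SHAPE_COUNT][SHAPE_COUNT] = {
    { box_box,    box_sphere    },
    { sphere_box, sphere_sphere },
};

int collide(const Shape *a, const Shape *b)
{
    return dispatch[a->type][b->type](a, b);
}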

One thing that I find annoying on the SPU is that I need to predict even unconditional branches. This is an area where hardware predictors definitely have an edge.
 
That is not about prediction, it's about caching ... if the branch hasn't been taken recently enough to still be in the BTB, prediction won't save you. One thing is certain if you rely on a BTB ... every branch will cause a pipeline stall at least once. Split branches are actually a pretty nice thing to have: why have a pipeline stall at all when you have the time to tell the processor ahead of time where the jump will go?

AFAIR most desktop processors can handle some amount of function pointer indirection, so a single virtual call isn't relevant for performance (don't have indirection 10 levels deep of course).
 
Ha I was Right!!!

The fact that Sony is using an SPE for most of the OS functionality proves I was right about the SPE having good general purpose performance. OS driven API code is not floating point intensive!!!

Try to rebut that 3dilettante! :D :D :D

Of course the PPE will be needed to handle interrupts, but that is not a general purpose code issue.

I love being vindicated! :D
 
Edge said:
The fact that Sony is using an SPE for most of the OS functionality proves I was right about the SPE having good general purpose performance. OS driven API code is not floating point intensive!!!

Try to rebut that 3dilettante!
It might not be good at it. It could be adequate. It all depends on what the OS is doing and how much effort the SPE has to go to to do it. If it's managing no more than a 66 MHz 486 in OS performance, it won't be any good ;) The reason to have an SPE take the job rather than the PPE could be purely a matter of not wanting to drop a heavy thread onto the PPE and limit its availability for non-OS tasks.
 
Shifty Geezer said:
It might not be good at it. It could be adequate. It all depends on what the OS is doing and how much effort the SPE has to go to to do it. If it's managing no more than a 66 MHz 486 in OS performance, it won't be any good ;) The reason to have an SPE take the job rather than the PPE could be purely a matter of not wanting to drop a heavy thread onto the PPE and limit its availability for non-OS tasks.

I agree, in that one could argue it's easier (less overhead to the game code) to interleave data transfers as opposed to interleaving threaded execution.

The nice thing about the SPE and its isolated, localized execution is that any workload that is execution-heavy but light on data transfer will have far less impact on a game's main loop than if the code ran on the PPE.

The way I see it, it's more like a 486 running at 3.2 GHz. :D
 
Edge said:
The fact that Sony is using an SPE for most of the OS functionality proves I was right about the SPE having good general purpose performance. OS driven API code is not floating point intensive!!!

What little I've read about the reserved SPE (translated from Japanese by Babelfish, no less) didn't indicate the mix of functionality being offloaded.

I'm not sure it was mentioned that the SPE would be tasked with handling API issues, only that the OS would reserve the SPE.

Is there another source that gives more detail on what is being run on the SPE?

Try to rebut that 3dilettante! :D :D :D

I can't. I don't have the information currently about what exactly the OS does with the reserved SPE.

There are some good reasons to reserve the SPE for the OS, since that would allow it to be securely mapped to memory considered sensitive to the OS.

It would also avoid the awkward situation where a game sucks up 7 SPEs and leaves nothing for the OS when something important like networking and encryption (compute intensive, and also security-conscious) comes along.

Of course the PPE will be needed to handle interrupts, but that is not a general purpose code issue.

Until there is more information, I'm reluctant to assume that this is all the PPE will be doing for the OS. Something tells me it does quite a bit more.

I love being vindicated! :D

I'm happy for you, though not yet convinced.

edit:

The reserved SPE seems to indicate that the PPE can be a significant bottleneck to the overall performance of CELL.

It has already been demonstrated in the game server demo that if the PPE is bogged down too much, performance plateaus with four SPEs.
If anything can be done to keep the PPE on task as much as possible, it would probably be a win overall.

The reserved SPE doesn't strike me as a particularly elegant way of doing things, but taking any load off the PPE is better than holding up the entire CELL.
The price of an SPE idling when the OS demands are low is much less than having all 7 SPEs waiting for the PPE to catch up.
Whether an SPE that leverages 128 128-bit vector registers, SIMD operations, a long (unpredicted) pipeline, and local store is the best fit for API code (which often contents itself with 16 or fewer registers of 32 bits or less) is probably irrelevant if it can keep CELL as a whole functional.
I have my doubts that this is what is being offloaded onto the SPE, however.
 
3dilettante said:
I have my doubts that this is what is being offloaded onto the SPE, however.

I agree a compromise is being made; at the very least we know that OSes have to handle interrupts and the SPE cannot do that. The consensus in that other thread, though, is that a single SPE is overkill for 'typical' OS services, on the assumption that 5 percent of a typical processor's time goes to OS services. My guess would be that if Sony wants to offer a higher level of OS functionality, like gameplay video encoding or video chat during gameplay, then the capabilities of a 3.2 GHz SPE will be more fully realized.

It's also my assumption that we will not get a better understanding of the balance of code between the PPE and SPE to settle this argument.
 
Edge said:
I agree a compromise is being made; at the very least we know that OSes have to handle interrupts and the SPE cannot do that. The consensus in that other thread, though, is that a single SPE is overkill for 'typical' OS services, on the assumption that 5 percent of a typical processor's time goes to OS services. My guess would be that if Sony wants to offer a higher level of OS functionality, like gameplay video encoding or video chat during gameplay, then the capabilities of a 3.2 GHz SPE will be more fully realized.

My suspicion is that the reserved SPE is primarily there for Digital Rights Management rather than for performance reasons, although it could handle sound, de/compression, and PS2 emulation as well, taking the load of these off the PPE. In that context, it may be completely locked away from the user, along with access to the "OS" code that handles these.
 
SPM said:
My suspicion is that the reserved SPE is primarily there for Digital Rights Management rather than for performance reasons, although it could handle sound, de/compression, and PS2 emulation as well, taking the load of these off the PPE. In that context, it may be completely locked away from the user, along with access to the "OS" code that handles these.

I was just thinking about this possibility. The SPE's local store is much more secure than main memory, and if the SPE runs isolated any internal data is wiped on completion.

There's much less chance for some clever hacking to get at the data stored there, though that wouldn't cure the problem that the elements like keys for the DRM are stored elsewhere when not in use. (possibly in a "trusted computing module" of some sort)

In addition, if Sony decides to change some or part of the DRM scheme or its keys, the programmability of the SPE allows it to adapt in midstream.
 
Let's look at large trees you would want to traverse. Instead of having a single process do all of that and jump all over memory, you can partition the tree. Start at the top, and instead of going all the way down for each lookup, process the top levels for all of them and make a list of follow-up actions, sorted by branch. Then stream those to the next set of SPEs, which each load their specific partition into local memory.

Which would almost certainly speed things up on general purpose processors as well. It's just that the penalty/improvement is bigger for SPEs.
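A rough single-threaded sketch of that first pass (all names invented; a real version would sort the follow-up list by subtree and DMA each batch to the SPE holding that partition):

Code:
#define TOP_LEVELS 4  /* depth of the shared top of the tree */

typedef struct Node { int key; struct Node *left, *right; } Node;
typedef struct { int query; Node *subtree; } FollowUp;

/* Phase 1: walk only the top levels for every query and record which
   subtree (partition) each one falls into. Returns the count written. */
int route_queries(Node *root, const int *queries, int n, FollowUp *out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        Node *node = root;
        for (int level = 0; level < TOP_LEVELS && node; level++)
            node = (queries[i] < node->key) ? node->left : node->right;
        if (node) {
            out[m].query   = queries[i];
            out[m].subtree = node;
            m++;
        }
    }
    return m;  /* next: group by subtree and stream each group onward */
}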
 