The SPE as general purpose processor

Jaws said:
Your original 22.4 Ginst/sec calculation for 7 SPUs: was it integer MATH or not? You made no assertion of it being an average in your initial post, therefore indicating a peak.

Also, re the average: your subsequent post referred to a higher number as being an average? Well, that sounds contradictory now...

And finally, you've taken this long to come clean!

I don't want to narrow down the instructions to just those that do integer math, as we are talking about general purpose performance.

You're right, I did say peak earlier, and even my average can be called into question, but I want to lean away from peak, because I don't want people to mistake my claim for something as meaningless as a MIPS rate, which in the past was often obtained by executing only the fastest instruction over and over again. If we are discussing general purpose code, then you have to consider a mix of different instructions.

I guess the real answer is to run a series of general purpose benchmarks on CELL's SPEs and see how they compare to other general purpose processors. Other than that, any claims made here can be debated to the end of time.

This long to come clean? Yesterday I went cross-country skiing all day in Manning Park, an incredibly beautiful area of British Columbia, Canada (a 3-hour drive from Vancouver), with hundreds of square miles of mountains and trees. It's nice to have a life away from this board every now and then. :) My muscles are sore now.
 
Edge said:
...
This long to come clean? Yesterday I went cross-country skiing all day in Manning Park, an incredibly beautiful area of British Columbia, Canada (a 3-hour drive from Vancouver), with hundreds of square miles of mountains and trees. It's nice to have a life away from this board every now and then. :) My muscles are sore now.

I mean you had an earlier opportunity to clarify but you didn't. Anyway, that sounds very scenic!

And finally for the record,

Edge said:
...
*maximum* 22.4 billion integer instructions per second throughput!!!!!...

...is actually wrong because the *maximum* would be 44.8 Ginst/sec for 7 SPUs! ...and yes, this is just for *any* integer instruction, it's also equivalent to a MIPS number for 7 SPUs... just for the record...
 
Jaws said:
...is actually wrong because the *maximum* would be 44.8 Ginst/sec for 7 SPUs! ...and yes, this is just for *any* integer instruction, it's also equivalent to a MIPS number for 7 SPUs... just for the record...

Just trying to remember, but was the R3000 used in the PS2's EE dual issue?
 
Edge said:
Just trying to remember, but was the R3000 used in the PS2's EE dual issue?

I'm guessing it would be, purely because a single issue processor would be very inefficient...
 
It is really hard to find information on the R3000, but it was introduced at 25 MHz, and I find ratings of 20 MIPS for that, hinting at single issue, so a ~300 MHz R3000 in the PS2's EE would be rated at roughly 300 MIPS, compared to CELL's SPE total of ~44 BIPS (7 SPEs x dual issue x 3.2 GHz = 44.8 billion instructions/sec peak). Even each individual SPE would be roughly on the order of 20 times more powerful at running general purpose code, all the while having far greater localization of data with 128x128-bit registers and 256 KB of SRAM, versus the R3000's 32x32-bit registers (16 times less register space than an SPE!!!), 64 KB i-cache, and 64 KB d-cache. So easily more than 20 times more powerful. The R3000 in the PS2 runs the bulk of all general purpose processing, so using that comparison, each SPE looks to be a monster at general purpose processing, especially since each SPE does not have the OS as an overhead to its processing, which the R3000 on the PS2 has to deal with.
 
Edge said:
While your general purpose processor has choked on its floating point workload, CELL has finished long ago, and has an excess of cycles to be used for integer work.

That's the price of being a general purpose processor, lower peak numbers for the sake of more consistent overall performance.

Specialization is always a way to get much more efficiency and performance, though it comes at a price of reduced performance outside of the target workload.

Altogether, CELL is a general purpose processor, thanks to the PPE and SPEs working together. An SPE in isolation is not general purpose.

It has a rich set of integer instructions that can handle ANY integer workload of a typical general purpose processor. You have seven of them to do the work, all with excellent localized resources: 128x128-bit registers, 256 KB of SRAM, and a dedicated DMA engine to handle your external memory loads/stores. All running at 3.2 GHz.

Being an essentially-universal computation machine does not make it general purpose.

A lot of the peak capability only works if the SPE's vector processing abilities are utilized well, and that's a pretty special workload. The register set, local store, and internal pathways are optimized for vector loads and high data parallelism. The DMA engine's characteristics assume that there's a lot of batching and streaming going on.
That is not a general case.
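
(For what it's worth, here is roughly what feeding an SPE with data looks like using the MFC DMA intrinsics from spu_mfcio.h. The buffer size and function name below are made up for the sketch; the point is that data arrives as explicitly sized, tagged block transfers rather than through ordinary loads.)

Code:
#include <spu_mfcio.h>

/* Rough sketch: pull one block from main memory into local store and wait for it.
   CHUNK, buffer and fetch_block are illustrative names, not from the thread. */
#define CHUNK 4096
static volatile unsigned char buffer[CHUNK] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea)
{
    unsigned int tag = 0;                  /* DMA tag group used to track this transfer      */
    mfc_get(buffer, ea, CHUNK, tag, 0, 0); /* queue a 4 KB get from effective address ea     */
    mfc_write_tag_mask(1 << tag);          /* select which tag(s) to wait on                 */
    mfc_read_tag_status_all();             /* stall until the queued transfer completes      */
}

Everything about that interface (explicit sizes, alignment, tag groups) assumes you know ahead of time what data you need in bulk, which is the batching/streaming assumption being discussed.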

The SPE's are not going to be running word processors, but games.

Then they have a specialized purpose and are not general.

It's a processor to meet the integer and floating point needs of a game. Sure, you have isolated one issue that will run better on some other processors, but the solution may require a different approach, or run on the PPE. CELL is never about the SPE's alone, but the synergy of a chip made of different components with different strengths. Just like a GPU is added to a PC to enhance its overall abilities.

This thread concerns the SPE's use as a general purpose processor, not all of CELL. CELL as a whole is general purpose with an emphasis on compute-intensive workloads.
 
3dilettante said:
Then they have a specialized purpose and are not general.

IBM called them Synergistic Processing Elements to reflect their nature. The architecture and the broad instruction set back this up. The SPE's instruction set is not lacking in integer math operations, logical operations, and flow control operations, all the instructions necessary for general purpose computing.

They are powerful at floating point, but that does not mean they are weak at general purpose work. Just because you have one strength does not make everything else a weakness. The large register set is as useful for holding lots of integer values as it is for holding floating point values, and the same goes for the local store. C compilers allow general C code to run on the SPE's. There are no limits to the type of C code you can use, so that's proof that the SPE's are more than capable of handling general purpose code.

Synergistic Processing Elements - specialized and yet also general purpose. IBM/Toshiba/Sony themselves make this claim. They are the designers.
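
(As a rough illustration of the "general C code compiles for the SPE" point: plain scalar C like the snippet below builds for the SPU with the Cell SDK's cross-compiler, e.g. something along the lines of spu-gcc -O2 checksum.c. The file name and function are made up for the example; whether such code runs fast is exactly what gets debated further down the thread.)

Code:
#include <stdio.h>

/* Ordinary scalar C: no vector types, no intrinsics.
   Illustrative only; nothing here is SPE-specific. */
static unsigned int checksum(const unsigned char *data, unsigned int len)
{
    unsigned int sum = 0;
    unsigned int i;
    for (i = 0; i < len; i++)
        sum = (sum << 1) ^ data[i];   /* plain integer ops in a data-dependent loop */
    return sum;
}

int main(void)
{
    unsigned char buf[64];
    unsigned int i;
    for (i = 0; i < 64; i++)
        buf[i] = (unsigned char)i;
    printf("checksum: %u\n", checksum(buf, 64));
    return 0;
}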
 
Edge said:
The SPE's instruction set is not lacking in integer math operations, logical operations, and flow control operations, all the instructions necessary for general purpose computing.

Word-aligned word-sized load and store would be nice for one. ;)
 
aaaaa00 said:
Word-aligned word-sized load and store would be nice for one. ;)
Sure, but going from byte aligned to word aligned didn't make CPUs less general purpose IMO. The SPEs are just taking one more step.
 
It is really hard to find information on the R3000, but it was introduced at 25 MHz, and I find ratings of 20 MIPS for that, hinting at single issue
The R3000 is single issue, but the EE core is an R5900, which is dual issue (and a bunch of other things the R3000 isn't, such as being natively 64-bit).
 
Crossbar said:
Sure, but going from byte aligned to word aligned didn't make CPUs less general purpose IMO.

Word as in machine word. Machine words on the SPE are 4 bytes. All instructions are 4 bytes long. Normal non-vector operands are 4 bytes as well.

But the only load/store instructions are for 16 byte vectors, which makes scalar operations somewhat annoying (especially scalar stores).

Are you aware of any widespread general purpose CPU where the only load/store instructions are for vector operands?

The SPEs are just taking one more step.

Yeah, one more step backwards. ZING! ;) (Yes, I am kidding.)
 
aaaaa00 said:
But the only load/store instructions are for 16 byte vectors, which makes scalar operations somewhat annoying (especially scalar stores).
Scalar operations are somewhat annoying, period. Thank the greatness of pure vertical SIMD and IBM's preferred slot paradigm nonsense.
It's only slightly better than doing scalar ops with AltiVec or SSE.

Anyway, we don't need word-sized L/S access though; 128-bit with word-sized mask & rotate could work just fine on a local store arch (well, masking is required; I'd add rotate only because of the inconvenience of the ISA not having component access).
 
Fafalada said:
Anyway, we don't need word-sized L/S access though,

Right of course, never said it was impossible, just annoying. ;)

Fafalada said:
128-bit with word-sized mask & rotate could work just fine on a local store arch (well, masking is required; I'd add rotate only because of the inconvenience of the ISA not having component access).

Plus scalar store requires read-modify-write. Annoying. :)
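
(To make the read-modify-write point concrete, here is a minimal sketch of a 32-bit scalar store written with the SPU C intrinsics from spu_intrinsics.h; the helper name and the casts are illustrative, and a compiler emits an equivalent sequence for plain scalar C.)

Code:
#include <spu_intrinsics.h>

/* Sketch: store one 32-bit word into local store.
   The ISA only has 16-byte loads/stores, so the scalar store becomes
   a read-modify-write of the enclosing quadword. */
static void store_word(unsigned int *addr, unsigned int value)
{
    vec_uint4 *qaddr = (vec_uint4 *)((unsigned int)addr & ~15u); /* enclosing 16-byte quadword */
    unsigned int slot = ((unsigned int)addr >> 2) & 3u;          /* which of the 4 word slots  */
    vec_uint4 q = *qaddr;                                        /* read                       */
    q = spu_insert(value, q, slot);                              /* modify one 32-bit slot     */
    *qaddr = q;                                                  /* write the whole quadword   */
}

Several instructions where a conventional CPU needs a single store, which is where the bloat in scalar-heavy code comes from.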
 
aaaaa00 said:
Word as in machine word. Machine words on the SPE are 4 bytes. All instructions are 4 bytes long. Normal non-vector operands are 4 bytes as well.

But the only load/store instructions are for 16 byte vectors, which makes scalar operations somewhat annoying (especially scalar stores).

Are you aware of any widespread general purpose CPU where the only load/store instructions are for vector operands?
No, I am not aware of any GP CPU like that, but my point was that there are not a lot of people complaining about the fact that many CPUs today require variables to be word aligned for efficient memory access, which could be perceived as annoying when working with byte-sized data; the SPEs need the data to be aligned at 16 bytes. Same same, but different. ;)
 
Edge said:
C compilers allow general C code to run on the SPE's. There are no limits to the type of C code you can use, so that's proof that the SPE's are more than capable of handling general purpose code.

Capable of running general C code, perhaps. Capable of running general C code at anything approaching acceptable performance levels in most cases, absolutely not.

Run a Perl interpreter on an SPE, run a compiler, run a spreadsheet, run any open source program written in C. Performance will suck in 9 out of 10 cases.

Edge said:
Synergistic Processing Elements - specialized and yet also general purpose. IBM/Toshiba/Sony themselves make this claim. They are the designers.

The manufacturers putting a positive spin on their product... How surprising..

Cheers
 
Crossbar said:
No, I am not aware of any GP CPU like that, but my point was that there are not a lot of people complaining about the fact that many CPUs today require variables to be word aligned for efficient memory access, which could be perceived as annoying when working with byte-sized data; the SPEs need the data to be aligned at 16 bytes. Same same, but different. ;)

Not the same thing.

Even though modern general purpose CPUs require alignment for efficient access, this alignment requirement is the same size as the machine word.

This is not true of the SPE: the memory alignment requirement is 4x the size of the machine word (there are no scalar 16-byte operations), and hence is evidence in favor of the SPE's lineage being a vector-oriented co-processor, and not a general purpose CPU.
 
aaaaa00 said:
Scalar store requires read-modify-write. Annoying.
Well, that's for the hw people to mull over. The VUs were implemented with 128-bit memory granularity too, but they allow load/store of 32-bit components with masking.
Anyway, I still say the real problem is not the load/store mechanism, it's the whole preferred-slot shenanigans. If you had component access and broadcasting, scalar L/S's wouldn't bloat your code the way they do now, and scalar ops would actually be easy.

Gubbi said:
Run a Perl interpreter on an SPE, run a compiler, run a spreadsheet, run any open source program written in C. Performance will suck in 9 out of 10 cases.
Define suck. :p Besides, the SPE may not be the only CPU core you could say this about.
 
Fafalada said:
Define suck. :p Besides, the SPE may not be the only CPU core you could say this about.

Let's say 1/5th the speed of the PPE. :)

That would make it less than 1/10th the speed of a state-of-the-art x86 CPU at these tasks.

Cheers
 
A search for "Cell SPE instruction timings" turned up http://cag.csail.mit.edu/crg/papers/eichenberger05cell.pdf, which has some info on SPE timings and scheduling rules.

Some highlights:
  • The SPE requires a branch hint to appear at least 11 cycles before an actual taken branch to avoid branch mispredicts (the mispredict penalty is 18 cycles, which is only barely better than a Northwood Pentium4). It also cannot queue up hints, so it cannot actually sustain more than 1 taken branch every 11 cycles (for comparison, the Pentium4 can sustain a rate of 1 branch per clock cycle). A sketch of the usual workaround appears after this post.
  • The integer arithmetic and the FP arithmetic are issued into the same pipe. The other pipe only handles load/store and branches. If you are not vectorizing your code, you can only achieve 1 scalar operation per clock.
  • The integer instructions and load/store have quite high latencies compared to e.g. a similarly-clocked Pentium4 (SPE: add=2 cycles, P4: add=0.5 cycles, SPE: load=6 cycles, P4: load=2 cycles).
The SPE has been optimized first and foremost for large vector operations, and for those it does a good job. For tasks that cannot be coerced into the large-vector-operation paradigm, I would however fully expect, say, a Dual-Core A64 to easily keep up with even a full set of 8 SPEs.
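
(Following up the branch-hint bullet above: a common SPE idiom is to avoid short data-dependent branches altogether with compare-and-select, which always costs the same few cycles regardless of the data. A minimal sketch with the SPU C intrinsics; the function name is made up.)

Code:
#include <spu_intrinsics.h>

/* Sketch: clamp each element to an upper limit without a branch.
   A mispredicted branch costs ~18 cycles and hints must be issued
   ~11 cycles early, so a select is usually the cheaper option. */
vec_int4 clamp_to_limit(vec_int4 values, vec_int4 limit)
{
    vec_uint4 over = spu_cmpgt(values, limit); /* per-element mask: value > limit  */
    return spu_sel(values, limit, over);       /* take limit where the mask is set */
}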
 
arjan de lumens said:
The integer instructions and load/store have quite high latencies compared to e.g. a similarly-clocked Pentium4 (SPE: add=2 cycles, P4: add=0.5 cycles, SPE: load=6 cycles, P4: load=2 cycles).

Those are P4 Northwood numbers; Prescott is worse, but still far better than an SPE.

Cheers
 