The SPE as a general-purpose processor

I would also recommend that everybody read the five documents (well, at least the first three) about compilation on Cell, which you can find at http://www.research.ibm.com/cell/

Clearly, together with the ISA analysis, the SPEs are designed for single-precision FP vector calculation. Any other usage, like integers, is a workaround of the floating-point vector capabilities. You can find that something like a = b[i+1]+c[i+2] can be extremely dangerous for performance (by the way, I think this was already discussed by DeanoC and aaaaa00 or ERP a long time ago).

There are no magic tricks in Cell; it is the same technology we find in other top-of-the-line products. It has just been tuned for a very specific usage, sacrificing other areas. Integer operations are significantly slower on the SPEs than on any other similar core (Xenon, P4 or AMD64). However, FP vector calculation is very strong. It is a very interesting trade-off.

The SPEs can execute almost everything. But nobody has said it can be done efficiently. What's more, IBM is saying it cannot be in many cases.
 
DarkRage said:
I would also recommend that everybody read the five documents (well, at least the first three) about compilation on Cell, which you can find at http://www.research.ibm.com/cell/

Clearly, together with the ISA analysis, the SPEs are designed for single-precision FP vector calculation. Any other usage, like integers, is a workaround of the floating-point vector capabilities. You can find that something like a = b[i+1]+c[i+2] can be extremely dangerous for performance (by the way, I think this was already discussed by DeanoC and aaaaa00 or ERP a long time ago).


Erm, the SPEs have the same number of ALUs as FPUs...
 
DarkRage said:
Clearly, together with the ISA analysis, the SPEs are designed for single-precision FP vector calculation. Any other usage, like integers, is a workaround of the floating-point vector capabilities.

Erm, did you even read the documentation? Try reading section 5 of the SPU Instruction Set Architecture Manual and tell me that integer operations are a "workaround".

DarkRage said:
Integer operations are significantly slower on the SPEs than on any other similar core (Xenon, P4 or AMD64).


You can't get much faster than single-cycle throughput of an instruction. Unless one of those other processors has a >128-bit integer datapath that I don't know about, the SPEs are just as fast for the majority of integer operations.
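To make that concrete, here's a quick Python sketch (purely illustrative; the hardware does this in a single cycle) of what one 128-bit SIMD integer add instruction accomplishes across its four 32-bit lanes:

```python
# Simulate a 128-bit SIMD add as four parallel 32-bit lanes,
# roughly how a single SPE vector add-word instruction operates.
def simd_add_4x32(a, b):
    MASK = 0xFFFFFFFF  # each lane wraps modulo 2**32
    return [(x + y) & MASK for x, y in zip(a, b)]

# One "instruction" worth of work: four independent adds at once.
print(simd_add_4x32([1, 2, 3, 0xFFFFFFFF], [10, 20, 30, 1]))
# -> [11, 22, 33, 0]  (the last lane wraps around)
```

The point being: one instruction per cycle on a 128-bit datapath already means four 32-bit integer results per cycle.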
 
add n to (x) said:
You can't get much faster than single cycle throughput of an instruction. Unless one of those other processors has a >128-bit integer datapath that I don't know about, the SPEs are just as fast for the majority of integer operations.

Not only that, but you have seven of them at 3.2 GHz each. Anyone who thinks the SPEs are not integer monsters, along with the generally accepted "floating-point monsters", has not read the documentation on the CELL SPEs, like you said. You're looking at a *maximum* of 22.4 billion integer instructions per second throughput!
 
Linux runs on a lot of processors that don't support pre-emptive multitasking in hardware. It helps when the processor does, but it's not a requirement. And neither are the other things mentioned. And I think it fits the bill of a current, full-fledged OS pretty well.

So, while the PPE might run a very small micro-kernel of at most a few kB in size to handle dispatching, interrupting and page switching, everything else (32+ MB) can run on an SPE. And while that SPE wouldn't run the whole OS, most other platforms use dedicated hardware and/or other small processors for dedicated things as well.

Like, the processor in your keyboard. Would we want the main CPU to spend time handling that as well?
 
Edge said:
Not only that, but you have seven of them at 3.2 GHz each. Anyone who thinks the SPEs are not integer monsters, along with the generally accepted "floating-point monsters", has not read the documentation on the CELL SPEs, like you said. You're looking at a *maximum* of 22.4 billion integer instructions per second throughput!

I would like to know how many integer operations Xenon can do instead... does anyone know that?
 
Btw, there are very many computations even such dedicated hardware as an GPU has to do that are totally integer. Indexes, for a start. You cannot build ANY kind of processor that cannot handle those.
 
danteye said:
I would like to know how many integer operations Xenon can do instead... does anyone know that?
Well, Xenon has 2 integer/fixed-point units if I'm not mistaken, and AltiVec can also do integers if I remember correctly. Since not much is known about the XeCPU's AltiVec, it's hard to say how many GOPS the XeCPU does, but there were some slides from a presentation on the 360 architecture that rated each core of the XeCPU at 6400 MIPS.
 
danteye said:
I would like to know how many integer operations Xenon can do instead... does anyone know that?

If we assume one integer instruction per cycle, then 9.6 billion integer instructions per second. I think the cores on the 360 CPU, being more complex (dual integer units per core), probably average more like 1.2 to 1.4 integer instructions per cycle, giving 11.5 to 13.44 billion integer instructions per second.

The two rates I just gave are quite meaningless for figuring out overall throughput, but give an idea of localized performance.

My original point was simply that the SPEs are no weaklings at integer performance, and with good programming they can outperform the 360 CPU in that area, especially considering the PPE on CELL adds an extra 3.2 to 4.48 (if a 1.4 instructions-per-cycle max is considered) billion instructions per second, bringing the previous 22 billion total to 25 to 27 billion total integer instructions per second.

Hopefully this dispels the myth that CELL is not good at integer work!
 
danteye said:
I would like to know how many integer operations Xenon can do instead... does anyone know that?

7 SPUs * 1 int instruction per core * 3.2 GHz ~ 22.4 Ginst/sec

Also add the PPE. I believe the PPE can't dual-issue 2 integer instructions per cycle, so:

1 * 1 int instruction per core * 3.2 GHz ~ 3.2 Ginst/sec

Cell ~ 22.4 + 3.2 ~ 25.6 Ginst/sec (integer)

XeCPU ~ 3*3.2 ~ 9.6 Ginst/sec (integer)
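Those back-of-the-envelope peaks can be sanity-checked with a few lines of Python (theoretical peak figures only; real code won't sustain one integer instruction per core per cycle):

```python
# Peak integer instruction throughput, per the estimates above.
CLOCK_GHZ = 3.2

def peak_ginst(cores, instr_per_cycle=1):
    """Billions of integer instructions per second, best case."""
    return cores * instr_per_cycle * CLOCK_GHZ

spe_total = peak_ginst(7)            # seven SPUs -> ~22.4 Ginst/s
cell_total = spe_total + peak_ginst(1)  # plus the PPE -> ~25.6 Ginst/s
xecpu_total = peak_ginst(3)          # three Xenon cores -> ~9.6 Ginst/s
print(cell_total, xecpu_total)
```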
 
Edge said:
...I think the cores on the 360 CPU, being more complex (dual integer units per core), probably average more like 1.2 to 1.4 integer instructions per cycle, giving 11.5 to 13.44 billion integer instructions per second.
...

I don't think the PPE can dual-issue 2 integer instructions per cycle. The peak would still be 1 int inst/cycle.
 
When people say "integer" they mean "everything that is not a flop".

The most important integer operations are conditionals, branches, loads, and stores. Not integer math and bitwise operators.

So to be clear, just counting the number of integer math instructions that an SPE can execute per second doesn't give you a clear picture of how good or bad an SPE is at integer operations.
 
aaaaa00 said:
When people say "integer" they mean "everything that is not a flop".

Not always.

The most important integer operations are conditionals, branches, loads, and stores. Not integer math and bitwise operators.

They're all important depending on what you're doing. Being explicit about what you're referring to is the best course...

So to be clear, just counting the number of integer math instructions that an SPE can execute per second doesn't give you a clear picture of how good or bad an SPE is at integer operations.

We've had this discussion before and I agree it's confusing. But I think integer and FP are self-evident. Just be explicit about what you're referring to...
 
aaaaa00 said:
When people say "integer" they mean "everything that is not a flop".

The most important integer operations are conditionals, branches, loads, and stores. Not integer math and bitwise operators.

Obviously, if you are talking about integer instructions, you have to include integer math and bitwise operators. Fine by me to include conditionals, branches, loads and stores, as those are actually handled on the secondary pipeline in the SPE. Even if some of those have multi-cycle execution times, you still have the main pipeline available doing work.

Also note, we are not discussing operations per second; integer math with 8-bit values in a 128-bit register would be 16 operations per cycle, or 358 billion operations per second on a seven-SPE CELL chip running at 3.2 GHz. I have to check, but I forget whether the SPEs have an 8-bit parallel operation on those 128-bit registers. Maybe 16-bit instead?
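For what it's worth, here's the lane arithmetic for the different packing widths (hypothetical peak rates, assuming one SIMD integer instruction per cycle on all seven SPEs; whether the hardware actually supports a given width is a separate question):

```python
# Peak SIMD integer operation rates by lane width.
REGISTER_BITS = 128
CLOCK_GHZ = 3.2
SPES = 7

def peak_gops(lane_bits):
    """Billions of integer ops/sec at one SIMD instruction per cycle."""
    lanes = REGISTER_BITS // lane_bits
    return lanes * SPES * CLOCK_GHZ

print(peak_gops(8))   # 16 lanes -> ~358.4 GOPS
print(peak_gops(16))  #  8 lanes -> ~179.2 GOPS
print(peak_gops(32))  #  4 lanes -> ~89.6 GOPS
```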
 
Edge said:
If we assume one integer instruction per cycle, then 9.6 billion integer instructions per second. I think the cores on the 360 CPU, being more complex (dual integer units per core), probably average more like 1.2 to 1.4 integer instructions per cycle, giving 11.5 to 13.44 billion integer instructions per second.

The two rates I just gave are quite meaningless for figuring out overall throughput, but give an idea of localized performance.

My original point was simply that the SPEs are no weaklings at integer performance, and with good programming they can outperform the 360 CPU in that area, especially considering the PPE on CELL adds an extra 3.2 to 4.48 (if a 1.4 instructions-per-cycle max is considered) billion instructions per second, bringing the previous 22 billion total to 25 to 27 billion total integer instructions per second.

Hopefully this dispels the myth that CELL is not good at integer work!


In fact, I asked this because I always read about the strength of Xenon over Cell in integer operations, but I didn't know exactly how many integer operations Xenon could do!
 
danteye said:
In fact, I asked this because I always read about the strength of Xenon over Cell in integer operations, but I didn't know exactly how many integer operations Xenon could do!

But I've also read on IBM.com that every SPE can do 4 × 32-bit operations per cycle; that means ~90 GOPS for seven SPEs at 3.2 GHz. Is that right?

And I've read that every SPE can do 4 single-precision floating-point operations per cycle, which means 12.8 GFLOPS per SPE, but this result is in contrast with the 25 GFLOPS of the IBM documents...

How do you explain this?
 
danteye said:
How do you explain this?
I think it's important to distinguish between instructions and operations in these kinds of discussions, as people sometimes confuse the two; one person might mean one thing without saying so explicitly, while another thinks they mean the other, and so on.

Each core in the Xenon CPU does two instructions per cycle. Peak! It will likely be (much) less in reality. Each of these instructions can be ONE of: an integer op (math, bit manipulation, etc.), branching/load/store, float math, or VMX math.

So, you can't issue two VMX instructions per clock, or two float math instructions, etc. But you could have one VMX math op and one load/store. There might be other restrictions that apply as well; I haven't read any detailed technical docs on this subject, and this information might be restricted access anyway...

Now, a VMX instruction might be multiple operations all in one single instruction, such as multiplying 3.141593 with four different 32-bit numbers packed into one 128-bit register or somesuch; this is called SIMD, which stands for single instruction, multiple data. So the operations count might be higher than two per cycle, but the instruction count won't ever be higher than 2.

As for Cell, it was long said the PPE core also had dual-issue capabilities, and Jaws thinks that might not be the case; well, who knows for sure really. :) In any case, each SPE has one instruction pipe that exclusively deals with floats and one that exclusively deals with integers/everything else. I think these co-issue as well, but it might be one at a time only. In any case, the instruction-versus-operation distinction applies here as well. Cell SPEs have integer SIMD instructions; I don't know if Xenon/VMX does, but it's likely the case. After all, good ol' x86 has had it since the late 90s, when MMX appeared on the scene with much thunder and little else. ;)
 
Edge said:
Obviously, if you are talking about integer instructions, you have to include integer math and bitwise operators. Fine by me to include conditionals, branches, loads and stores, as those are actually handled on the secondary pipeline in the SPE. Even if some of those have multi-cycle execution times, you still have the main pipeline available doing work.

aaaaa00 is right.

To repeat: Back in the day, floating-point operations were expensive multi-cycle operations carried out by off-chip (and expensive) co-processors. Back then it made sense to look at the performance of these bolt-on devices in isolation; hence people talked about floating-point performance. The main CPU would do the program control flow, calculating addresses for loading and storing values, etc. - all work done on integer registers. Therefore integer performance was used to describe the performance of the main CPU.

Back then FP operations were so expensive (both high latency and low throughput) as to render almost all other house-keeping tasks (integer) of a compute heavy program irrelevant.

So back then it made sense to characterize a system by its floating point and integer performance. Floating point performance being decisive for the overall system performance.

Fast forwarding from the 80s:

Then these co-processors got integrated onto the CPU dies... Then they got pipelined.

The huge increase in throughput (from 1 every 50th cycle to 1 every cycle) and the massive reduction in FP op latency (from 50+ cycles to 3-5) now meant that the remaining "integer" performance of the system started to be more and more important.

Modern CPUs added short vector support (SIMD) in their FPUs. These do not only handle arithmetic with floating point numbers but also arithmetic with various bit-width integers.

While the arithmetic part of the CPU has enjoyed a massive increase in throughput, the remaining part, the one described as "integer" in the old-fashioned nomenclature, has not. Simply because it is a lot harder.

The main CPU still has to do all the program flow, the calculation of addresses, loading and storing values. Program flow is inherently sequential in nature, load/store is increasingly limited by bandwidth and, more important, latency.

Today it would be better to characterize a CPU by:
1. Arithmetic (integer or floating point)
2. Program control flow (branch resolution, predication etc.)
3. load/store (bandwidth, latency and number of transactions/cycle)

If you absolutely insist on describing a modern CPU by "floating point" and "integer" performance, you should *not* count integer arithmetic towards integer performance, since that would tell you nothing about 2) and 3) for a particular CPU.


Using the above to characterize modern CPUs:
Xenon (1 core): 1 - very high, 2 - high, 3 - high
PPE: 1 - very high, 2 - high, 3 - high
SPE: 1 - very high, 2 - low, 3 - mixed*
Merom/Conroe: 1 - very high, 2 - very high, 3 - very high
A64: 1 - high, 2 - very high, 3 - high to very high

(*mixed: very high bandwidth, somewhat low latency in local store, high latency out of LS, with an archaic (1960s) memory model.)

Cheers
Gubbi
 
Guden Oden said:
I think it's important to distinguish between instructions and operations in these kinds of discussions, as people sometimes confuse the two; one person might mean one thing without saying so explicitly, while another thinks they mean the other, and so on.

Each core in the Xenon CPU does two instructions per cycle. Peak! It will likely be (much) less in reality. Each of these instructions can be ONE of: an integer op (math, bit manipulation, etc.), branching/load/store, float math, or VMX math.

So, you can't issue two VMX instructions per clock, or two float math instructions, etc. But you could have one VMX math op and one load/store. There might be other restrictions that apply as well; I haven't read any detailed technical docs on this subject, and this information might be restricted access anyway...

Now, a VMX instruction might be multiple operations all in one single instruction, such as multiplying 3.141593 with four different 32-bit numbers packed into one 128-bit register or somesuch; this is called SIMD, which stands for single instruction, multiple data. So the operations count might be higher than two per cycle, but the instruction count won't ever be higher than 2.

As for Cell, it was long said the PPE core also had dual-issue capabilities, and Jaws thinks that might not be the case; well, who knows for sure really. :) In any case, each SPE has one instruction pipe that exclusively deals with floats and one that exclusively deals with integers/everything else. I think these co-issue as well, but it might be one at a time only. In any case, the instruction-versus-operation distinction applies here as well. Cell SPEs have integer SIMD instructions; I don't know if Xenon/VMX does, but it's likely the case. After all, good ol' x86 has had it since the late 90s, when MMX appeared on the scene with much thunder and little else. ;)


OK, I understand!! Thank you very much!

Last question: how do you explain the fact that the IBM documents say an SPE can do 4 flops per cycle, which means 13 GFLOPS instead of 25? Maybe every SPE can do 2 instructions per cycle?
 
danteye said:
OK, I understand!! Thank you very much!

Last question: how do you explain the fact that the IBM documents say an SPE can do 4 flops per cycle, which means 13 GFLOPS instead of 25? Maybe every SPE can do 2 instructions per cycle?

It depends on how you count. An SPE can do a 4-way fused multiply-add every cycle, which is 4 muls and 4 adds, or 8 ops. Some count the mul-add as one op, and then you only get 4.
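In other words, the two bookkeeping conventions work out like this (assuming a 4-way FMA per cycle at 3.2 GHz, as above):

```python
# Counting SPE flops under the two FMA conventions.
CLOCK_GHZ = 3.2
FMA_WIDTH = 4  # 4-way SIMD fused multiply-add per cycle

# Each FMA lane is 1 mul + 1 add: count it as 2 flops...
gflops_fma_as_two = FMA_WIDTH * 2 * CLOCK_GHZ  # ~25.6 GFLOPS
# ...or count the fused mul-add as a single op.
gflops_fma_as_one = FMA_WIDTH * 1 * CLOCK_GHZ  # ~12.8 GFLOPS
print(gflops_fma_as_two, gflops_fma_as_one)
```

Same hardware, same cycle; only the accounting differs, which is where the "13 vs 25 GFLOPS" confusion comes from.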

Cheers
 