The SPE as general purpose processor

danteye · Mar 10, 2006

Gubbi said:
It depends on how you count. An SPE can do a 4-way fused multiply-add every cycle, which is 4 muls and 4 adds, or 8 ops. Some count the mul-add as one op, and then you only get 4.

Cheers

OK, I understand! Thank you!
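The two counting conventions Gubbi describes can be put into a few lines of arithmetic. This is only a sketch: the 4-way width comes from the quoted post, while the 3.2 GHz clock is an assumption picked for illustration and is not stated in this thread.

```python
# Op-counting sketch for one 4-way fused multiply-add (FMA) per cycle.
LANES = 4                   # 4-way SIMD FMA, per the quoted post
CLOCK_HZ = 3_200_000_000    # ASSUMED clock rate, for illustration only


def ops_per_cycle(count_fma_as_two: bool) -> int:
    """Ops per cycle for one 4-way FMA under either counting convention."""
    per_lane = 2 if count_fma_as_two else 1  # mul + add, or one fused op
    return LANES * per_lane


def ops_per_second(count_fma_as_two: bool, clock_hz: int = CLOCK_HZ) -> int:
    """Peak ops per second for a single unit issuing one FMA every cycle."""
    return ops_per_cycle(count_fma_as_two) * clock_hz


if __name__ == "__main__":
    print("mul and add counted separately:", ops_per_cycle(True), "ops/cycle")
    print("fused mul-add counted once:    ", ops_per_cycle(False), "ops/cycle")
```

Either way the hardware does the same work; the factor of two is purely a bookkeeping choice, which is why quoted peak figures can differ.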

Shifty Geezer · Mar 10, 2006

Edge said:
25 to 27 billion total integer instructions per second.

Hopefully this dispels the myth that CELL is not good at integer work!

Nope. I have here a processor that runs at 3 GHz, with 32 'ADD' units that add, can only add, and that's all this processor does. It performs 96 Billion integer instructions per second. Is it good at integer work?

Others have explained well the entirety of Integer work's demands, but I felt the obvious illustration worth adding to show the flaw in your conclusion.

j^aws · Mar 10, 2006

Shifty Geezer said:
Nope. I have here a processor that runs at 3 GHz, with 32 'ADD' units that add, can only add, and that's all this processor does. It performs 96 Billion integer instructions per second. Is it good at integer work?
...

You have a processor with a crippled ISA.

SPUs and PPEs have a full ISA which should be obvious.

Edge's point was clearly that SPUs didn't suck at integer MATH.

Gubbi · Mar 10, 2006

Jaws said:
Edge's point was clearly that SPUs didn't suck at integer MATH.

Not clear at all.

Only his last post detailed integer arithmetic.

Before that he just talked about the SPUs not having inferior integer performance.

Cheers

j^aws · Mar 10, 2006

Gubbi said:
Not clear at all.

Only his last post detailed integer arithmetic.

Before that he just talked about the SPUs not having inferior integer performance.

Cheers

Well the fact that he mentioned integer "instruction" was a dead giveaway as referring to math instructions. Maybe I've seen this too often on this forum for it to be obvious...

Gubbi · Mar 10, 2006

Jaws said:
Well the fact that he mentioned integer "instruction" was a dead giveaway as referring to math instructions. Maybe I've seen this too often on this forum for it to be obvious...

Could very well be that he meant arithmetic; my mistake then. But I was certainly misled by him mentioning "integer performance" and "integer work" in the above posts, which traditionally have a different meaning (see above) when you discuss integer vs floating point.

Cheers

DarkRage · Mar 10, 2006

Come on, you are counting integer operations within the vector capabilities as integer general purpose capabilities.

They are not.

Again, something as common as a = b[i+1]+c[i+2] is described by IBM as needing a significant number of operations, together with performance penalties, and you are using the full vector for it, with the other three 32-bit "ALUs" in the vector doing useless work.

j^aws · Mar 10, 2006

DarkRage said:
Come on, you are counting integer operations within the vector capabilities as integer general purpose capabilities.

They are not.

Again, something as common as a = b[i+1]+c[i+2] is described by IBM as needing a significant number of operations, together with performance penalties, and you are using the full vector for it, with the other three 32-bit "ALUs" in the vector doing useless work.

Well what do you expect? SPUs are unified scalar/vector processors. If it's doing scalar work, then it can't be doing vector work and vice versa... this is why you have so many on a die...

add n to (x) · Mar 10, 2006

Ok, so you're talking about regular C-like syntax, such as addressing a scalar array of values.

Now it's true that the SPEs can only load and store quadwords (128 bits) at a time, so accessing individual scalar values will introduce some overhead:

Loading a scalar value from local store will cause the entire quadword that contains that value to be loaded, and a rotation in order to put the value into the "preferred slot" within the register (so that's one additional instruction).

Writing to a scalar value in local store will introduce a couple of extra instructions, since the quadword you're writing to has to be loaded, the value inserted into the correct place within the quadword and then written back to memory (that's an additional two instructions).

Now three extra instructions isn't exactly what I'd call "significant". And with the large (128) register file, general variables like indices, loop counters etc. aren't flushed to memory very often.
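The overhead described in this post can be modelled with a toy memory that only moves whole quadwords. This is a sketch under stated assumptions, not SPE code: the `LocalStore` class, its method names and the one-instruction-per-step counting are all hypothetical, and words are assumed to be 4-byte aligned so they never straddle a quadword.

```python
QUADWORD = 16  # bytes; the only load/store width this toy model allows


class LocalStore:
    """Hypothetical quadword-only memory that counts instructions."""

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)
        self.instructions = 0

    def _load_qw(self, addr: int):
        base = addr & ~(QUADWORD - 1)          # containing quadword
        self.instructions += 1                 # quadword load
        return base, bytearray(self.mem[base:base + QUADWORD])

    def load_word(self, addr: int) -> int:
        if addr % 4:
            raise ValueError("model assumes 4-byte aligned words")
        base, qw = self._load_qw(addr)
        shift = addr - base
        rotated = qw[shift:] + qw[:shift]      # rotate word into bytes 0..3
        self.instructions += 1                 # the one extra instruction
        return int.from_bytes(rotated[:4], "big")

    def store_word(self, addr: int, value: int) -> None:
        if addr % 4:
            raise ValueError("model assumes 4-byte aligned words")
        base, qw = self._load_qw(addr)         # extra: read old quadword
        off = addr - base
        qw[off:off + 4] = (value & 0xFFFFFFFF).to_bytes(4, "big")
        self.instructions += 1                 # extra: insert the word
        self.mem[base:base + QUADWORD] = qw
        self.instructions += 1                 # the store itself


if __name__ == "__main__":
    ls = LocalStore(64)
    ls.store_word(20, 0xDEADBEEF)
    print(hex(ls.load_word(20)), "after", ls.instructions, "instructions")
```

In this model a scalar load costs two counted steps instead of one and a scalar store three instead of one, which matches the three extra instructions tallied above.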

Gubbi · Mar 10, 2006

add n to (x) said:
Now three extra instructions isn't exactly what I'd call "significant". And with the large (128) register file, general variables like indices, loop counters etc. aren't flushed to memory very often.

Fundamentally it is kind of pointless to argue whether or not a SPU can be a stand alone/general purpose CPU. It is not intended to work like that, it is very clearly intended to crunch through regular sized chunks of data at great speed, and it is very good at that.

While it can act (almost; it only has a slave MMU and cannot set up its own translation tables) as a fully fledged CPU, it is deficient (feature-wise and therefore performance-wise) in a whole bunch of ways that keep it from ever being useful as a GP CPU.

I think you're right when you say that converting scalar stores into read-modify-writes is insignificant, I too consider this to be the least of its deficiencies. The archaic memory model (complete with lack of automatic memory coherence) is what ultimately renders the SPU braindead as a general purpose CPU.

Cheers

PeterT · Mar 10, 2006

Has there actually been a consensus in the earlier thread about what "general purpose" is?
And if so, are there any general purpose tasks that really require all that much performance?

I agree with your points in principle for a very specific definition of "general purpose" - I just don't see how that kind of work is very relevant at all when discussing SPEs in a console setting. So, this thread's main point of contention may be interesting technically, but it has little bearing on PS3.

Shifty Geezer · Mar 10, 2006

Jaws said:
Edge's point was clearly that SPUs didn't suck at integer MATH.

That wasn't clear to me. Being good at integer math doesn't really settle the question of whether SPEs are good at 'integer workloads' either, as it doesn't take into account the other non-maths functions, which is the crux of the discussion, though I think it took a bit of a detour there with someone doubting the SPEs' int-math performance.

It's one of those subjects that really can only be solved with benchmarks, I think. Take a useful real-world routine that you want the integer performance for from a normal processor and port it verbatim for SPE, with maybe some memory management beyond pure on-demand fetching. There's a software cache solution IIRC that could be employed. Then see how well SPE copes with branching and random memory accessing compared with the PPE, and see if it is fast enough to be useable or so slow as to be a 'last resort'.

add n to (x) · Mar 10, 2006

Is the SPE a "general-purpose-processor" in the same sense that the PPE or Xenon CPU or a P4 or whatever is? No, of course not. Those were designed to simplify the writing of code for them by providing features like MMU, caches etc. Some of them implement more advanced features such as OOOE & branch-prediction in order to speed things up even more. The SPEs were designed to give maximum performance on a particular set of workloads with the minimum number of transistors. But they can also run "general-purpose-code" (whatever that is) to a certain extent. Yes, you have to manage the local store yourself instead of relying on cache, but for a lot of tasks they're surprisingly quick even given pretty poor C code (in SPE terms anyway). Calling them braindead is unjustified in my opinion.

I'll stop now before I get myself into trouble

Shifty Geezer · Mar 10, 2006

PeterT said:
Has there actually been a consensus in the earlier thread about what "general purpose" is?

Not that I've seen. Perhaps it's just a case of which non-maths processing counts? Bitwise transformations, comparisons and load/stores, which all have nothing to do with adding and multiplying numbers (and all combinations thereof), count as general purpose computing.

So I guess a processor workload can be divided into
Floating Point Math - Ability to Add+Mul etc. decimal values
Integer Math - Ability to Add+Mul etc. integer values
General Purpose - Everything else

'Integer Performance' is maybe a catchall for everything not Floating Point Math related. AFAIK the term was coined by MS in response to Cell, was it not? Has it been used before then? They certainly went to no effort to define the term!

Perhaps an example of 'general purpose' performance would be a bubble sort? That's all load/store/compare. Write a bubble sort to sort 500 Kb of data (greater than LS) in PPE and SPE and see how they perform. Would that be a fair comparison?
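To make the "all load/store/compare" claim concrete, here is a minimal bubble sort that tallies those three kinds of operation. It is a sketch of the proposed test, not a benchmark: the `OpCounts` type and the choice to count two loads per comparison and two stores per swap are assumptions of this illustration, and it says nothing about how either the PPE or an SPE would actually perform.

```python
from dataclasses import dataclass


@dataclass
class OpCounts:
    """Hypothetical tally of the operation kinds named in the post."""
    loads: int = 0
    compares: int = 0
    stores: int = 0


def bubble_sort_counted(data):
    """Return a sorted copy of data plus the operations it took."""
    a = list(data)
    counts = OpCounts()
    for end in range(len(a) - 1, 0, -1):
        swapped = False
        for i in range(end):
            left, right = a[i], a[i + 1]
            counts.loads += 2          # both neighbours are read
            counts.compares += 1
            if left > right:
                a[i], a[i + 1] = right, left
                counts.stores += 2     # both neighbours are written back
                swapped = True
        if not swapped:                # already sorted: stop early
            break
    return a, counts


if __name__ == "__main__":
    result, counts = bubble_sort_counted([3, 1, 2])
    print(result, counts)
```

No arithmetic beyond index increments appears anywhere in the inner loop, which is the property that makes it a candidate "general purpose" test.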

deathkiller · Mar 10, 2006

Shifty Geezer said:
Write a bubble sort to sort 500 Kb of data (greater than LS) in PPE and SPE and see how they perform. Would that be a fair comparison?

No, because the PPE has a 512 KB L2 cache.

Off-topic: I have found the TRE Demo Movie http://www.kevinevans.net/ibmcell/tre_demo_movie.html old?

ERP · Mar 10, 2006

Shifty Geezer said:
'Integer Performance' is maybe a catchall for everything not Floating Point Math related. AFAIK the term was coined by MS in response to Cell, was it not? Has it been used before then? They certainly went to no effort to define the term!

Perhaps an example of 'general purpose' performance would be a bubble sort? That's all load/store/compare. Write a bubble sort to sort 500 Kb of data (greater than LS) in PPE and SPE and see how they perform. Would that be a fair comparison?

Integer Performance usually refers to performance in non-FP situations, and it long predates MS as a term. In the old days integer math was the limiting factor; these days it's basically free.

A bubble sort would be a trivial test. But you really need to be running a large application to get a good picture; a lot of general performance is dictated by cache architecture and the processor's ability to hide load/store and instruction latencies. The majority of an application's code does nothing more than shuffle data, and a lot of the data is generally not ideally structured for the cache. The execution time becomes dominated by the cache misses. This is why Intel and AMD have invested so much of their R&D in improving their caches. It's OK to say things like "well, restructure the data so it's more cache friendly", but in the real world it's often not practical.
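The point about badly structured data can be illustrated with a toy direct-mapped cache that only counts misses. Everything here is an assumption made for the sketch: the 64-line, 128-byte-line geometry, the function name and the two access patterns are hypothetical and do not describe any particular processor.

```python
def count_misses(addresses, num_lines=64, line_bytes=128):
    """Count misses in a toy direct-mapped cache (hypothetical geometry)."""
    tags = [None] * num_lines
    misses = 0
    for addr in addresses:
        line = addr // line_bytes      # which memory line holds this byte
        slot = line % num_lines        # the single slot it may occupy
        if tags[slot] != line:
            tags[slot] = line          # evict whatever was there
            misses += 1
    return misses


WORD = 4
N = 4096
# Cache-friendly: consecutive 4-byte words, 32 of them per 128-byte line.
sequential = [i * WORD for i in range(N)]
# Cache-hostile: the same number of accesses, each landing on a new line.
scattered = [i * 128 for i in range(N)]

if __name__ == "__main__":
    print("sequential misses:", count_misses(sequential))
    print("scattered misses: ", count_misses(scattered))
```

Both patterns issue the same number of loads, yet in this model the scattered one misses on every access, so its run time would be set by memory latency rather than by any instruction throughput figure.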

j^aws · Mar 10, 2006

Shifty Geezer said:
That wasn't clear to me.

Well, let me explain why it should've been. Firstly, he was talking about integer instructions per second, which implied maths. Secondly, he only used 1 instruction per cycle per SPU, even though the SPUs are dual issue. Thirdly, I even subsequently derived these numbers... I've probably seen it too many times for it to be obvious though...

...Being good at integer math doesn't really settle the question of whether SPEs are good at 'integer workloads' either, as it doesn't take into account the other non-maths functions, which is the crux of the discussion, though I think it took a bit of a detour there with someone doubting the SPEs' int-math performance.
...

I think, to avoid confusion, integer instructions should mean integer maths, especially on a 3D site. And FP instructions for FP maths. I think Gubbi summarised it well earlier.

It's one of those subjects that really can only be solved with benchmarks, I think. Take a useful real-world routine that you want the integer performance for from a normal processor and port it verbatim for SPE, with maybe some memory management beyond pure on-demand fetching. There's a software cache solution IIRC that could be employed. Then see how well SPE copes with branching and random memory accessing compared with the PPE, and see if it is fast enough to be useable or so slow as to be a 'last resort'.

Devs will be experimenting with different algorithms and what works best. Hopefully in the next few years, we'll see the fruits of that labour...

Shifty Geezer · Mar 10, 2006

Jaws said:
Devs will be experimenting with different algorithms and what works best. Hopefully in the next few years, we'll see the fruits of that labour...

For sure, the SPEs' general purpose computing performance isn't really an issue, because that's not what they're going to be used for. It'd still be nice to have a comparison with a conventional processor just to know how well SPEs can cope, and how much of a reason there is for a PPE over another SPE or two, which is the point of this thread after all!

Edge · Mar 10, 2006

Shifty Geezer said:
Nope. I have here a processor that runs at 3 GHz, with 32 'ADD' units that add, can only add, and that's all this processor does. It performs 96 Billion integer instructions per second. Is it good at integer work?

Others have explained well the entirety of Integer work's demands, but I felt the obvious illustration worth adding to show the flaw in your conclusion.

You're just being silly. I can drag all kinds of specialized processors in for the sake of endless argument.

You stick with your 32 ADD unit processor for your next console, and I will stick with CELL.

Shifty Geezer · Mar 10, 2006

Edge said:
You're just being silly.

Yep. Just saying that all the Int operations in the world don't mean good performance if they're not meaningful, useful operations. A processor capable of a trillion int ops per second is not a good int performer if only 20,000 of those ops can load data into registers where they are needed.

I was mixing Int Performance with General Processing though, so if you meant it as just maths, and we know Cell has a full Int maths ISA, your reasoning was fair and I just didn't follow the plot very well!
