The SPE as general purpose processor

Shifty Geezer said:
Yep :p. Just saying that all the Int operations in the world don't mean good performance if they're not meaningful, useful operations. A processor capable of a trillion int ops per second is not a good int performer if only 20,000 of those ops can load data into the registers where they are needed.

I was mixing Int Performance with General Processing though, so if you meant it as just maths, and we know Cell has a full Int maths ISA, your reasoning was fair and I just didn't follow the plot very well!

I made it quite clear I was not talking about operations per second, and I did bring up a good point: dual pipelines help keep up the number of instructions executed per cycle.

If you have something to contribute, then please do.
 
I always thought "general purpose" was the collection of tasks for which there is no consensus on the best way to do things.

A word processor is general purpose because no one has ever defined a single superior way of doing word processing.
Same with operating systems, unfortunately.


Graphics is special purpose because you have primitives and a well-defined pipeline of tasks.
Same with signal analysis, there are well-defined algorithms and transformations that are obviously the way to go.

A general purpose processor would be able to run tasks that can be programmed in any number of ways and run most of them at least reasonably well. Unfortunately, this will always come at the price of lower overall performance.

I think the SPEs don't quite fit because their design constraints can seriously impact how they perform on code and algorithms outside a pretty well-defined range.
They can run general-purpose code, but not without jumping through hoops or running slowly.
 
3dilettante said:
They can run general-purpose code, but not without jumping through hoops or running slowly.

Jumping through hoops? Can you define that?

Running slowly? The SPE's are anything but slow, so I really would like to hear your reasoning behind that.
 
Edge said:
Jumping through hoops? Can you define that?

Running slowly? The SPE's are anything but slow, so I really would like to hear your reasoning behind that.

I guess what I mean by jumping through hoops is having to write or design code with properties that have no algorithmic reason to exist, for the sake of acceptable performance. All processors have gotchas like that, though the SPEs have more, being geared towards a specific kind of work.

Instruction ordering is pretty important in SPE code; otherwise its limited ability to dual-issue instructions is wasted, costing potentially a significant fraction of its peak execution rate.

The local store is of fixed size and exists as its own memory space, meaning tasks must stream well and fit within a certain footprint. This is something the programmer must always keep in mind, even though algorithmically it shouldn't matter. All processors have such wrinkles; they are just more pronounced and more visible in the software model of the SPE.
Coherency must be explicitly maintained, and future Cell variants with a larger local store will need a recompile to make use of it.
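As a rough illustration of the streaming pattern the local store imposes, here is a plain Python sketch (the function names and tile size are made up for illustration; real SPE code would use asynchronous mfc_get/mfc_put DMA intrinsics instead of list slicing):

```python
# A sketch (not SPE code) of the double-buffered streaming pattern the
# fixed-size local store forces on the programmer: the working set is
# tiled to fit a small buffer, and "DMA" transfers (plain slicing here)
# alternate between two buffers so transfer and compute can overlap.

LOCAL_STORE_TILE = 4096  # hypothetical tile size, in elements

def stream_process(data, fn):
    """Apply fn to every element, one local-store-sized tile at a time."""
    out = []
    buffers = [None, None]
    # Prime buffer 0 with the first tile ("mfc_get" on real hardware).
    buffers[0] = data[0:LOCAL_STORE_TILE]
    i, cur = 0, 0
    while i < len(data):
        nxt = 1 - cur
        # Kick off the "transfer" of the next tile before computing on
        # the current one; on an SPE this is an asynchronous DMA.
        buffers[nxt] = data[i + LOCAL_STORE_TILE : i + 2 * LOCAL_STORE_TILE]
        out.extend(fn(x) for x in buffers[cur])
        i += LOCAL_STORE_TILE
        cur = nxt
    return out
```

The point is that none of this tiling logic exists for algorithmic reasons; it exists purely so the working set fits the architecture.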

Memory accesses perform best when they pull in batches of data; that's the sweet spot for the EIB and DMA engines. Some tasks don't really lend themselves to that, which means either a lot of performance goes to waste or the programmer has to get really creative in structuring code, for the sole reason of catering to the architecture.

SPEs don't have complex branch prediction hardware, though branch hints can be inserted into the code. Branch prediction hardware is still on average better than static prediction, and the SPE pipeline is a long one.

Most processors with such a long pipeline would have a robust predictor in place, but in the case of the SPEs, designers figured that the more specialized target applications would not benefit from the hardware costs incurred. Nobody is going to want to run branchy code on an SPE. It can run it, but that workload isn't the target for the design.

There are tasks that are inherently branchy and difficult to predict at compile time, and the SPE will always run far below its peak in such situations. Since the SPE is not targeted at such workloads, this really isn't much of a problem.
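To make the branch problem concrete, here is a sketch in ordinary Python of the standard workaround: compute both outcomes and pick one with a mask, which is what the SPU's selb instruction does across a whole 128-bit register. The function names are made up for illustration:

```python
# Branchless select: instead of an unpredictable if/else, build an
# all-ones or all-zeros mask and merge bits from both candidates.
# This mirrors the SPU's selb instruction, which the compiler or
# programmer uses to keep branches out of the long pipeline.

def clamp_branchy(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_branchless(x, lo, hi):
    # -1 (all ones) if the condition holds, 0 otherwise: a bit mask.
    below = -(x < lo)
    above = -(x > hi)
    # selb-style merge: take bits from one source or the other per mask.
    x = (lo & below) | (x & ~below)
    x = (hi & above) | (x & ~above)
    return x
```

Both functions compute the same result; the branchless one trades a little extra arithmetic for a pipeline that never has to guess.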

Many compute-intensive tasks out there run very well on an SPE, but there are still tasks that exceed the bounds of the SPE's comfort zone, which means performance suffers. Future variants will do better.

Is the SPE general purpose in that it can run (almost) any kind of code? I'd say no, being general purpose would have sacrificed a lot of peak capability, and the designers had a specific target in mind.

I'd say an SPE is pretty much a universal computing machine because it can process anything another machine can, but that doesn't mean it does everything well.
 
Yes, if you don't code properly for the SPE's they can run at a fraction of their overall performance, but we are not discussing programmer efficiencies, which may or may not be present.

You can get vastly different performance out of any processor depending on the capabilities of the programmer.
 
Is there any data in already published benchmarks of Cell that suggest integer (or non-floating point scalar workload, in the context of this thread) performance?
 
Shifty Geezer said:
'Integer Performance' is maybe a catch-all for everything not related to Floating Point Math. AFAIK the term was coined by MS in response to Cell, was it not? Has it been used before then? They certainly went to no effort to define the term!

I remember that term being used heavily in print magazines in the mid-'90s, back when the PowerPC vs Pentium wars really got heated. Especially when it was the PPC604 vs the Pentium Pro, a lot of people were being educated about int vs float performance.
 
Edge said:
Yes, if you don't code properly for the SPE's they can run at a fraction of their overall performance, but we are not discussing programmer efficiencies, which may or may not be present.

You can get vastly different performance out of any processor depending on the capabilities of the programmer.

Yes, but the SPEs are much more finicky than most general purpose processors. That's just the way they are targeted.

Even the designers have pointed out the target applications, which are a large but relatively specialized range of workloads.
No amount of programmer skill is going to convert a branchy workload with unpredictable memory accesses and a working set larger than local store into something that the SPE can run well.

You could run a lot of general purpose code on the SPEs, but nobody in their right mind would do it.

That's the distinction I make. A processor can be UNIVERSAL in its computing ability, but its quirks can make it so it isn't general.
 
one said:
Is there any data in already published benchmarks of Cell that suggest integer (or non-floating point scalar workload, in the context of this thread) performance?
Cryptography is a non-floating-point vector workload:

Algorithm                      Baseline    One SPE     Speedup
AES ECB encrypt, 128-bit key   1.03 Gbps   2.06 Gbps   2.0x
AES ECB decrypt, 128-bit key   1.04 Gbps   1.5 Gbps    1.4x
TDES ECB encrypt               0.13 Gbps   0.17 Gbps   1.3x
DES ECB encrypt                0.43 Gbps   0.49 Gbps   1.1x
SHA-1                          0.9 Gbps    2.12 Gbps   2.3x

Apart from that, you have unoptimized (using a software cache for all data) SPEC OMP benchmark results. I don't know how many of them (if any) are non-floating point scalar workloads (1x = PPU):


[chart: per-benchmark speedups over the PPU]


310.wupwise_m and 311.wupwise_l: quantum chromodynamics
312.swim_m and 313.swim_l: shallow water modeling
314.mgrid_m and 315.mgrid_l: multi-grid solver in a 3D potential field
316.applu_m and 317.applu_l: parabolic/elliptic partial differential equations
318.galgel_m: fluid dynamics analysis of oscillatory instability
330.art_m and 331.art_l: neural network simulation of adaptive resonance theory
320.equake_m and 321.equake_l: finite element simulation of earthquake modeling
332.ammp_m: computational chemistry
328.fma3d_m and 329.fma3d_l: finite-element crash simulation
324.apsi_m and 325.apsi_l: temperature, wind, and pollutant distribution problems
326.gafort_m and 327.gafort_l: genetic algorithm code
 
3dilettante said:
No amount of programmer skill is going to convert a branchy workload with unpredictable memory accesses and a working set larger than local store into something that the SPE can run well.

While your general purpose processor has choked on its floating point workload, CELL has finished long ago and has plenty of cycles left over for integer work. It has a rich set of integer instructions that can handle ANY integer workload of a typical general purpose processor. You have seven of them to do the work, all with excellent localized resources: 128 registers of 128 bits each, 256 KB of SRAM, and a dedicated DMA engine to handle your external memory loads/stores, all running at 3.2 GHz.

The SPE's are not going to be running word processors, but games. It's a processor to meet the integer and floating point needs of a game. Sure, you have isolated one issue that will run better on some other processors, but the solution may require a different approach, or it could run on the PPE. CELL was never about the SPE's alone, but about the synergy of a chip of different components with different strengths, just like a GPU is added to a PC to enhance its overall abilities.
 
3dilettante said:
No amount of programmer skill is going to convert a branchy workload with unpredictable memory accesses and a working set larger than local store into something that the SPE can run well.

Hmm, that's a very broad and general statement. I think what can or can't be recast into algorithms suited to an SPE is a very open question, and it depends entirely on programmer skill.
 
Jaws said:
Well let me explain why it should've. Firstly he was talking about integer instructions per second which implied maths.
How does "instructions" imply math? If he meant that, he should have said arithmetic somewhere in there.

Branching, loading, storing, etc. are all instructions, and they'll average well over one cycle each on an SPU unless your code can be parallelized to hide latency and is very predictable (e.g. a heavily repeated loop). I'd venture that more often than not this isn't the case.
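The "parallelized to hide latency" case can be sketched in plain Python (illustrative only; the payoff shows up on pipelined hardware, not in an interpreter). Splitting a reduction across independent accumulators breaks the serial dependency chain so multi-cycle instruction latencies can overlap:

```python
# Why a heavily repeated loop can approach the issue rate while serial
# code cannot: each add in the serial version waits for the previous
# one, while four independent chains keep the pipeline busy.

def dot_serial(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y        # each add depends on the previous result
    return acc

def dot_unrolled4(a, b):
    # Four independent accumulator chains; on real hardware these
    # overlap, hiding the multiply-add latency.
    acc = [0.0, 0.0, 0.0, 0.0]
    for i in range(0, len(a) - 3, 4):
        for k in range(4):
            acc[k] += a[i + k] * b[i + k]
    # Handle leftover elements, then combine the partial sums.
    for i in range(len(a) - len(a) % 4, len(a)):
        acc[0] += a[i] * b[i]
    return sum(acc)
```

Both produce the same dot product; the unrolled form is the shape compilers and hand-tuners aim for on a long-pipeline in-order core like the SPU.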
 
Mintmaster said:
How does "instructions" imply math? If he meant that, he should have said arithmetic somewhere in there.
...

Firstly he was talking about integer instructions per second which implied maths. Secondly, he only used 1 instruction per cycle per SPU, even though the SPUs are dual issue. Thirdly I even subsequently derived the number for 22.4 Ginst/sec (integer, SPU). Fourthly, purely looking at Ginst/sec for the PPE and SPU with dual issuing would result in double the number.

If you still don't get it then that's fine but we've discussed this already.
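For reference, the arithmetic behind those figures, assuming one instruction per cycle per SPU at the 3.2 GHz clock mentioned elsewhere in the thread:

```python
# 7 SPUs x 3.2 GHz x 1 instruction/cycle = 22.4 Ginst/s; full dual
# issue would double that. These are peak figures, not averages.
SPUS = 7
CLOCK_GHZ = 3.2
peak_single_issue = SPUS * CLOCK_GHZ      # Ginst/s
peak_dual_issue = peak_single_issue * 2   # Ginst/s
```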
 
Jaws said:
Firstly he was talking about integer instructions per second which implied maths.
To fill in the hole in my knowledge, what are general program instructions called such as comparisons, loops, branches, load/stores etc? I don't think I'm the only one wrongly thinking these were being classed under integer.
 
Shifty Geezer said:
To fill in the hole in my knowledge, what are general program instructions called such as comparisons, loops, branches, load/stores etc? I don't think I'm the only one wrongly thinking these were being classed under integer.

Comparisons I'd call logical, anyway. Loops and branching, maybe control.
 
Shifty Geezer said:
To fill in the hole in my knowledge, what are general program instructions called such as comparisons, loops, branches, load/stores etc? I don't think I'm the only one wrongly thinking these were being classed under integer.

I believe it was already answered in the thread, hence the confusion. Gubbi's large post explained it well. Using the term without context or being explicit seems to be the problem.
 
Jaws said:
Firstly he was talking about integer instructions per second which implied maths. Secondly, he only used 1 instruction per cycle per SPU, even though the SPUs are dual issue. Thirdly I even subsequently derived the number for 22.4 Ginst/sec (integer, SPU). Fourthly, purely looking at Ginst/sec for the PPE and SPU with dual issuing would result in double the number.

Well, I meant any instruction that is not floating point related. On most processors, the integer instructions are the ones that run on the integer unit, but the SPE is an integer/floating point unit. I was also talking about the AVERAGE instruction rate for a TYPICAL 'integer' program. My guess is the instruction rate would be roughly 1 instruction per cycle, and that is taking the dual issue into account. Some 'integer' instructions are multi-cycle, so that affects the AVERAGE.

As far as I know any instruction that does not operate on floating point data, has always been considered an integer instruction, because it ran on integer units. Now the SPE being a sort of hybrid makes this a bit confusing. I can see why IBM wants to use the word synergy when describing those processors. Synergistic Processing Elements.
 
Instead of defining what is "general purpose", why not just state that the SPE's can run any program that does not deal with operating system specific duties, like interrupts and global memory management, because they lack the hardware to do those things.

The SPE's have a rich instruction set like any modern day processor, and so can run any program that could run on any other machine. What would stop them? Performance might be the issue, but that can vary a great deal depending on the algorithm used and the capabilities of the programmer.

Some of the claims being made here are like stating that an integer unit cannot produce floating point results because it lacks the hardware, yet there are routines that generate fractional results on integer-only processors. Sure, they take many cycles to produce their results, but it can be done.
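As a minimal sketch of that claim, here is fixed-point arithmetic in plain Python: fractional results from nothing but integer multiplies, divides, and shifts. The 16.16 format and function names are purely illustrative:

```python
# Fixed-point 16.16 arithmetic: values are integers scaled by 2**16,
# so fractional math reduces to integer ops plus shifts, the classic
# way to get "floating point" results on integer-only hardware.

FRAC_BITS = 16
ONE = 1 << FRAC_BITS

def to_fixed(x):
    return int(round(x * ONE))

def from_fixed(f):
    return f / ONE

def fx_mul(a, b):
    # The raw product carries 2 * FRAC_BITS of fraction; shift one
    # scale factor back out.
    return (a * b) >> FRAC_BITS

def fx_div(a, b):
    # Pre-shift the dividend so the quotient keeps FRAC_BITS of fraction.
    return (a << FRAC_BITS) // b
```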

Anyway, the whole operating system question is a non-issue, as CELL has a processor to deal with that. So the real question is: can the SPE's run every other non-OS-specific task? The answer is a resounding YES. The SPE's have already proven themselves to be powerful rendering engines (IBM's Terrain Render fly-by demo), and yet have no specific rendering hardware.

Specialized and also general purpose. Another synergy.
 
Edge said:
Well, I meant any instruction that is not floating point related. On most processors, the integer instructions are the ones that run on the integer unit, but the SPE is an integer/floating point unit. I was also talking about the AVERAGE instruction rate for a TYPICAL 'integer' program. My guess is the instruction rate would be roughly 1 instruction per cycle, and that is taking the dual issue into account. Some 'integer' instructions are multi-cycle, so that affects the AVERAGE.

Your original 22.4 Ginst/sec calculation for 7 SPUs: was it integer MATH or not? You made no mention of it being an average in your initial post, therefore implying a peak.

Also, regarding the average: your subsequent post referred to a higher number being an average? That sounds contradictory now...

And finally, you've taken this long to come clean!
 