The SPE as general purpose processor

3dilettante said:
The registers, data path, and execution model are set up to get peak performance from a vectorized workload. That is not a general case.
While the point about SPE itself is valid, the way you worded this is just silly. Remind me - what was the last CPU where the execution model was set up to get peak performance in the general case?
If we used this as the definition, we haven't been using general purpose CPUs for over 20 years now - and quite possibly never (if we really wanted to nitpick).
 
Fafalada said:
While the point about SPE itself is valid, the way you worded this is just silly. Remind me - what was the last CPU where the execution model was set up to get peak performance in the general case?
If we used this as the definition, we haven't been using general purpose CPUs for over 20 years now - and quite possibly never (if we really wanted to nitpick).

now that you mention it, Faf, the last 'optimized for the general case' (read unoptimised) cpu from the x86 line i seem to remember was the 386. ..or maybe the 286. anyways, the 486 already showed some decisive optimisations ; )
 
The SPE's are very powerful processors that can run just about any code, short of hosting a modern OS. The performance level will depend on the skills of the programmers. One of the main benefits of the SPE's is that there are seven of them in the PS3's CELL.

Andy Sites, Producer of Untold Legends: Dark Kingdom:
For instance, he stated that the physics system of the game might be handled by one processor, the calculations behind the interactive water would be handled by another, and the audio would run on a third. The same could be said for the particle system, which will probably run on its own processor.

Four different jobs being handled in their own programming space. You can call that specialized, or general purpose, but in the end the label does not matter. Again, four different jobs being done, and this is a first generation game; it will only get better from here. Dino himself said that the second or third generation games will run on the SPE's, and he said the SPE's are fast as hell. He's a developer, and should know what he is talking about.
 
Shifty Geezer said:
2^128. That's actually quite a big number. Useful if you're writing a program to count all the atoms in the universe.

Are you sure about that? There are estimated to be 10^80 atoms in the universe which is WAY more than 2^128. Of course this is OT. :p
 
You're right. I think the rough-and-ready conversion of powers of 2 to powers of ten is about three-tenths of the exponent, so 2^128 ~ 10^38 or thereabouts. It'll certainly fall far short of the 10^70 or 10^80 needed to count the universe's atoms. Sony really dropped the ball on this one :cry:
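
As a quick sanity check (just a throwaway sketch in C, nothing Cell-specific), the comparison is easiest in log10 space, since neither number fits in a native integer type:

Code:
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* 2^128 expressed as a power of ten: 128 * log10(2) ~ 38.5 */
    double counter_digits = 128.0 * log10(2.0);

    printf("2^128 ~ 10^%.1f\n", counter_digits);
    printf("atoms ~ 10^80 (rough estimate)\n");
    printf("short by a factor of about 10^%.1f\n", 80.0 - counter_digits);
    return 0;
}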
 
Fafalada said:
While the point about SPE itself is valid, the way you worded this is just silly. Remind me - what was the last CPU where the execution model was set up to get peak performance in the general case?
If we used this as the definition, we haven't been using general purpose CPUs for over 20 years now - and quite possibly never (if we really wanted to nitpick).

Perhaps a better wording would have been that a heavily vectorized workload belongs to a well-defined minority of work types, and that processors on the more general side of the general/special purpose continuum usually have to target a much wider range of workloads.

Unless heavily vectorized, no workload on an SPE will ever get anything close to peak performance. General processors get much closer to peak without making such a large underlying assumption about the task. The cost of course is that their peaks are way lower than that of an SPE or other more specialized core.

As a more specialized processor, the SPE has the luxury of shaving off most of the tasks more general processors can't ignore. The SPE either can't run such tasks, or it does so with such a performance deficit that nobody would want it to.
 
3dilettante said:
Perhaps a better wording would ...

You have not provided any new info for quite a few posts now. You're just repeating the same opinion over and over again. I'm not going to bother repeating what I think again, just to have you follow through and repeat the same thing again.

Let me know if you have any facts to back up your opinion, and please post them.

With the SPE's being synergistic processors, one could argue you can't make a clear argument for either side, as their strength lies in both, and somewhere in between.
 
3dilettante said:
Unless heavily vectorised, no workload on an SPE will ever get anything close to peak performance. General processors get much closer to peak without making such a large underlying assumption about the task. The cost of course is that their peaks are way lower than that of an SPE or other more specialised core.

The reverse is actually true. For the type of applications the SPE is designed for - media streaming, vector processing and parallel processing type applications - the SPE can get much closer to its peak performance than a general purpose CPU has a hope in hell of doing. The SPEs are designed to run these in the SPE's local memory, so no external memory latency, no cache misses, no bus contention, etc.
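
To make the "no external memory latency" point concrete, here's a rough sketch of the usual SPU-side double-buffering pattern using the MFC calls from spu_mfcio.h (the chunk size, the process_chunk() kernel and the effective-address layout are made up for illustration): the DMA for the next chunk is kicked off before computing on the current one, so main memory latency hides behind the math instead of stalling it.

Code:
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 4096  /* bytes per DMA transfer; illustrative only */

/* Two local-store buffers, 128-byte aligned for full-speed DMA. */
static volatile uint8_t buf[2][CHUNK] __attribute__((aligned(128)));

/* Hypothetical compute kernel that works entirely out of local store. */
extern void process_chunk(volatile uint8_t *data, unsigned size);

void stream_from_main_memory(uint64_t ea, unsigned nchunks)
{
    unsigned cur = 0;
    unsigned i;

    /* Prime the pipeline: start fetching chunk 0 (DMA tag = buffer index). */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (i = 0; i < nchunks; i++) {
        unsigned next = cur ^ 1;

        /* Kick off the DMA for the next chunk before touching this one. */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

        /* Wait only on the current buffer's tag, then compute on it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);

        cur = next;
    }
}

Results can be streamed back out the same way with mfc_put.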
 
SPM said:
The reverse is actually true. For the type of applications the SPE is designed for - media streaming, vector processing and parallel processing type applications - the SPE can get much closer to its peak performance than a general purpose CPU has a hope in hell of doing.

I think most of us agree. The SPE is superfast at tasks it was designed to do. General purpose computing (which is the subject of the thread) not being one of those tasks.

Cheers
Gubbi
 
SPM said:
The reverse is actually true. For the type of applications the SPE is designed for - media streaming, vector processing and parallel processing type applications - the SPE can get much closer to its peak performance than a general purpose CPU has a hope in hell of doing. The SPEs are designed to run these in the SPE's local memory, so no external memory latency, no cache misses, no bus contention, etc.

I should have separated the two sentences so that it made more sense.
I was saying that SPEs get their peak performance from vectorized loads.

The second sentence was supposed to state that general purpose processors have a more uniform performance profile that gets them closer to peak on software that is not so heavily vectorized, though their peaks are much, much lower.
 
Gubbi said:
I think most of us agree. The SPE is superfast at tasks it was designed to do. General purpose computing (which is the subject of the thread) not being one of those tasks.

Cheers
Gubbi
Agreed.

But then again, more than 95% of the code of any general purpose task is non-critical. It doesn't matter very much (if at all) how efficiently it is executed. Just that it is, within a reasonable amount of time, often measured in milliseconds.

But nobody is preventing you from rewriting the few small critical parts to take full advantage of the vector processing, like a much faster sorting (grouping) or filtering algorithm. Which is actually a benefit.

So, while a general purpose processor is the proverbial jack of all trades, a specialized one like the SPE is an apprentice where it isn't critical, and a master where it is.

The general purpose part of just about any current processor is "fast enough". It needs boosts like SSE to be able to improve its overall speed. SPE's just take that to the next level.
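
As a concrete example of that kind of small, targeted rewrite (a sketch using SSE intrinsics; the alignment and multiple-of-4 assumptions are only there to keep it short), here is the same inner loop written both ways:

Code:
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar version: the shape most "non-critical" code has. */
void scale_scalar(float *data, int n, float k)
{
    int i;
    for (i = 0; i < n; i++)
        data[i] *= k;
}

/* Vectorized version of the same loop, four floats per iteration.
 * Assumes n is a multiple of 4 and data is 16-byte aligned,
 * purely to keep the sketch short. */
void scale_sse(float *data, int n, float k)
{
    int i;
    __m128 vk = _mm_set1_ps(k);
    for (i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);
        _mm_store_ps(data + i, _mm_mul_ps(v, vk));
    }
}

Whether the vector version actually wins depends on the data layout and how hot the loop is, but this is the scale of change involved: a handful of lines, not a rewrite of the program.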

;)
 
3dilettante said:
I should have separated the two sentences so that it made more sense.
I was saying that SPEs get their peak performance from vectorized loads.

The second sentence was supposed to state that general purpose processors have a more uniform performance profile that gets them closer to peak on software that is not so heavily vectorized, though their peaks are much, much lower.
A general purpose processor can, by its very nature, never achieve anything resembling peak performance, as at all times most of its units/transistors are idle. But that would depend on your definition of "peak performance". You'll never get the best bang for your transistors.
 
Btw, we might have a slight miscommunication between "instructions per second/clock" and "workload done per second". I readily agree that general processors are very good at the first, whatever the workload. They spend most of their transistors nowadays to get that IPS/IPC as high as possible: OOO, branch prediction, large caches, etc. But that says nothing about the total amount of work you can do in that time. Or about the efficiency of the whole processor, if you look at the transistors needed to perform that task.
 
DiGuru said:
Btw, we might have a slight miscommunication between "instructions per second/clock" and "workload done per second". I readily agree that general processors are very good at the first, whatever the workload. They spend most of their transistors nowadays to get that IPS/IPC as high as possible: OOO, branch prediction, large caches, etc. But that says nothing about the total amount of work you can do in that time. Or about the efficiency of the whole processor, if you look at the transistors needed to perform that task.

I was leaning more towards the IPC side of the definition, where more general processors chew through instructions that, for the most part, are not designed to match a given workload's characteristics.

It's not very elegant, since a ton of general work could just as easily be expressed as a small number of more specialized instructions, but performance is more uniform with a chip that can churn generic code in many situations versus a chip that excels in a more specialized task and does poorly elsewhere.

That's why in my mind "general purpose" code is the category of workloads where there isn't an underlying assumption about parallelism or data format.

In well-defined workloads like physics and graphics, there is a strong consensus about what is the best set of primitives and operations to be performed, which means operations and instructions can be tailored to match them.

Then there's the complicating factor that modern general purpose CPUs have added SSE and Altivec units, though they haven't supplanted the IPC driven part of the cores.
 
Consider this:

Let's start with a highly specialized chip such as a GPU. GPUs can actually do a lot (most) of the general purpose work; they have to, to be able to function. While they might calculate texture elements (pixels) as floats, they do that for things like AA; the old ones didn't, and simply calculated the position of the pixel within the grid. And as any GPU has to be able to address memory locations to function, they can all do that, along with most other general purpose logic and integer math. They just don't expose that general capability through their API (programming interface).

And while SPE's aren't wired up (literally) to execute certain tasks like writing to the page table, there is no other reason why they cannot do such a task. GPU's, on the other hand, only offer a limited set of possibilities: they are (at least seen from outside the chip/driver) limited in the things you can make them do programmatically. SPE's aren't, as far as we know. They can do bit slicing and stuff. In other words: they can address and manipulate all the available bits in the memory space of the machine in any way conceivable. Which is the most basic way to determine if they are "general purpose".
 
DiGuru said:
They can do bit slicing and stuff. In other words: they can address and manipulate all the available bits in the memory space of the machine in any way conceivable. Which is the most basic way to determine if they are "general purpose".

I agree, the SPE's are not crippled in working on "general purpose" problems, so the question becomes how their performance compares to modern "general purpose" processors. I don't think there is an easy answer. The SPE's have lots of registers, lots of local memory, lots of external bandwidth, and a high clock rate, which goes a long way toward showing they are no slouches at any code they will run. The skill of the programmers and the algorithms used will have a huge effect on this as well.

A 100 MHz 486 is a general purpose processor, but one SPE would run circles around that chip on any problem, largely because of the huge clock rate difference.

The answer to how fast the SPE's compare to a modern "general purpose" processor will vary greatly depending on the problem and the number of SPE's used.
 
DiGuru said:
And while SPE's aren't wired up (literally) to execute certain tasks like writing to the page table, there is no other reason why they cannot do such a task. GPU's, on the other hand, only offer a limited set of possibilities: they are (at least seen from outside the chip/driver) limited in the things you can make them do programmatically. SPE's aren't, as far as we know. They can do bit slicing and stuff. In other words: they can address and manipulate all the available bits in the memory space of the machine in any way conceivable. Which is the most basic way to determine if they are "general purpose".

That makes them universal, not general purpose (so long as you ignore things like software permissions and interrupts).

An 8-bit embedded processor from a coffee maker could be made to do any operation you want within its memory space with enough work. That doesn't mean it's magically general purpose.
 
I've tried thinking of what specialized processor out there has this combination:

Group 1
128x128-bit registers (2 KB worth)
256 KB local store memory

Group 2
3.2 GHz clock rate

I can't think of a single specialized processor that has this level of localized memory plus such an extremely high clock rate. Some specialized processors have more local store (some TI DSPs have large secondary caches, but less localized memory closer to the processor), but none that I know of have a greater register bank (per processor), and I'm hard pressed to find a single specialized processor that operates much above 1 GHz.

Because no other specialized processor has this combination, which lends itself to high performance (either specialized or general purpose), the SPE's are unique. Synergistic unique.

The SPE's can run a variety of code, and even though the fastest x86 processor might be faster in some areas, you have SEVEN SPE's on PS3's CELL to break up the workload, if the workload lends itself to that.
 
Edge said:
none that I know of have a greater register bank (per processor)
Well, the itanic has 128 64-bit integer registers and 128 82-bit FP registers I believe... Or that's what I seem to recall reading ages ago anyway, I could well be wrong on this point. Then again, isn't itanic perpetually stuck in 2-ish GHz territory?
 
Guden Oden said:
Well, the itanic has 128 64-bit integer registers and 128 82-bit FP registers I believe... Or that's what I seem to recall reading ages ago anyway, I could well be wrong on this point. Then again, isn't itanic perpetually stuck in 2-ish GHz territory?

Itanium is actually an interesting point of comparison for the SPE.

In-Order core.
256 KB L2
Extremely reliant on compiler-scheduled parallelism for good performance.
High potential FP throughput
 