The SPE as general purpose processor

randycat99 said:
The Itanium comes from the era of shorter pipeline design, prior to the P4 age, no?


Merced, the first Itanium design, was disclosed around the time of the Pentium Pro's introduction.

This was likely quite a bit before the beginnings of the Pentium 4 project.

At the time, it was feared that the complexities of Out of Order chips would prove to be unsustainable. They were apparently at least a decade early in their assessment.
 
3dilettante said:
That makes them universal, not general purpose (so long as you ignore things like software permissions and interrupts).

An 8-bit embedded processor from a coffee maker could be made to do any operation you want within its memory space with enough work. That doesn't mean it's magically general purpose.
So, your single distinction between the two is the workload average? But in that case, none of the x86/PPC general purpose processors in use nowadays would qualify, because you can speed all of them up by rewriting critical sections of your code to use more specialized units like SSE, or even the floating-point units. It's just that not many developers do so, as long as their code also has to run on all the other processors of that kind, even the ones that don't support those possibilities.

No, when we go by that definition, the processor in a coffee maker would be the most general purpose one, if you look at the code it executes. And it would be very efficient if you look at the number of transistors it uses to do all that.

The moment you start implementing things like floating point in hardware is where it becomes harder to draw the line: does that make for a more averaged workload, or does it mean that more units/transistors are spent on specialized tasks to improve the IPC? And will all those transistors be used when running programs that were written years ago, for quite different processors?

And OOO is only beneficial in the transistor budget if you have a fixed instruction set and have to run programs not written to take full advantage of the processor. So, in my opinion, OOO and the like are more of a stopgap "solution", spending a large number of transistors to speed up such a badly balanced workload by a small amount.

If you are in a position to come up with a new design that doesn't have to run all that obsolete code just a bit faster, you can do much better and spend those transistors on things that offer a much improved overall speed. You could even put more processors on that die for the same number of transistors, all with about the same average speed for code that uses all those possibilities to the max.

Which is my definition of a good general purpose processor.

;)
 
DiGuru said:
So, your single distinction between the two is the workload average? But in that case, none of the x86/PPC general purpose processors in use nowadays would qualify, because you can speed all of them up by rewriting critical sections of your code to use more specialized units like SSE, or even the floating-point units. It's just that not many developers do so, as long as their code also has to run on all the other processors of that kind, even the ones that don't support those possibilities.

My distinction is made in the context of the target workloads and the chip's performance on various tasks relative to contemporary designs of similar complexity.

My checklist:
1) What did the designers want this to run on?
The designers have already stated that they want computationally intensive work like graphics, physics, and video offloaded onto the SPEs.

2) What assumptions about the workload does the architecture make in its design features?
The SPE model works best on workloads that can stream or can be batched to fit in local store. Local store doesn't make it either special or general purpose, but it hints that there is a strong preference for a certain kind of memory use.

The register width is only used if the data is vectorized, and non-vector math ops will only yield a fraction of the SPE's capability. Vectorizable workloads are a pretty special group.

The SPE has no branch prediction, which, when coupled with its long pipeline, pretty much assumes its workload is either not branchy or easily predicted at compile time. Its restricted ability to handle multiple branch hints within a small number of cycles means that this chip has a very strong preference for non-branching code.


3) Compared to other designs that use the same number of transistors, what is the performance that results?
For vectorized compute-intensive tasks the SPE blows things away without question.
Outside of this realm, performance is probably on par with general-purpose chips of earlier process generations. Without solid benchmark numbers on a wider set of software, I can't be sure.
If an SPE can perform better than a Pentium III on SPECint, for example, I'd lean towards calling it more general purpose. However, given the transistors and aggressive design spent on an SPE, getting worse-than-last-generation performance outside of its chosen apps means the chip is more likely destined to be specialized.

4) What and how many tasks can it simply not do, either because of implementation restrictions or architectural limits?

The SPE is designed to work on single tasks within a separate memory space, with the rest of CELL maintaining its simple environment.
The SPE can't service a lot of interrupts, and with its hefty penalties on context switching, it really shouldn't anyway. General processors can't make that assumption.
The SPE does not handle software permissions, period. It can't run an OS, and it can't access system data.
It could never run a system on its own.
The general purpose PPE can, even with its lesser performance.

All these factors make me think that the SPE is special purpose. It is versatile and powerful, but not entirely free of the baggage that makes it very clear it has a special set of jobs to do.

That doesn't matter, of course, because in conjunction with the other system elements, CELL as a whole is general purpose.

Processors that are considered general purpose are permitted to have specialized hardware within them, since they still devote resources to general execution.

I've already said that CELL as a whole would in my mind be a general purpose processor, just that an SPE in isolation would not be.

Without the PPE and the internal interconnect, there is a huge number of programs that would not run on CELL, because the SPE either cannot run them at all or runs them only at the speed of chips from previous hardware generations.

No, when we go by that definition, the processor in a coffee maker would be the most general purpose one, if you look at the code it executes. And it would be very efficient if you look at the number of transistors it uses to do all that.

The embedded processor would be universal in its computational ability, not general purpose. This is the distinction I am trying to make. Almost any programmable chip can be made to run any algorithm that another chip can run.

Any universal computational machine can run an algorithm that another machine can run.
If being able to run any software made a chip general purpose, then general purpose and special purpose would mean the same thing for everything but non-programmable ASICs.

In the case of the coffee maker, it can be made to manipulate everything in its limited memory space as demanded by any appropriately configured algorithm. If space concerns could be alleviated, any software could be compiled to run on it.
Problem is, it can't do a lot of that work well.

You don't want it to spell-check a Word document, even though in theory you could hack together a complicated software monstrosity to do it.

If nobody in their right mind will use a given chip outside of a given set of specialized tasks, it doesn't matter if it in theory could be made to do so.
 
3dilettante said:
In the case of the coffee maker, it can be made to manipulate everything in its limited memory space as demanded by any appropriately configured algorithm. If space concerns could be alleviated, any software could be compiled to run on it.
Problem is, it can't do a lot of that work well.

With the main difference being that the SPU is about a thousand times more powerful than your coffee maker embedded CPU.
 
3dilettante said:
You don't want it to spell-check a Word document, even though in theory you could hack together a complicated software monstrosity to do it.
That wouldn't be a very demanding thing to do. Just about any processor can do that, as long as it can access the memory required. I'm pretty sure I could get an Atmel ATmega microcontroller at 8 MHz to do that well enough.

Which is just the point about general processing: just about all of it is non-critical. And while you can speed it up by reducing the time each instruction takes, look at the difference between an Athlon and a Pentium 4: the Athlon blows the Pentium away, even while running at half the speed, with a worse branch-prediction and OOO implementation, less execution units and a much worse IPC rate.

Using specialized units to do stuff, be it SSE units or other processors on the same chip, has turned out to be a MUCH better way to increase the workload done per second. That's why they all go that way nowadays. The reign of the best IPC is over.
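To make that concrete, here is a toy comparison: the same multiply-add loop written as plain scalar C and with SSE intrinsics, which work on four floats per instruction. (The array names are made up for the example, the arrays are assumed 16-byte aligned, and the length is assumed to be a multiple of four.)

```c
/* Toy illustration of the point about specialized units: the same
 * element-wise multiply-add as scalar code and as SSE intrinsics. */
#include <xmmintrin.h>

void madd_scalar(float *y, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a[i] * b[i] + y[i];      /* one element per iteration */
}

void madd_sse(float *y, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {    /* four elements per iteration */
        __m128 va = _mm_load_ps(a + i); /* requires 16-byte aligned data */
        __m128 vb = _mm_load_ps(b + i);
        __m128 vy = _mm_load_ps(y + i);
        _mm_store_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vb), vy));
    }
}
```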

And even if it wasn't: how do you calculate the IPC between different processor architectures? You need to look at the workload done per second to be able to come up with any meaningful numbers.

If nobody in their right mind will use a given chip outside of a given set of specialized tasks, it doesn't matter if it in theory could be made to do so.
Yes, but as Edge said: does it matter if that processor is more than powerful enough in any case? It gets the job done. And if you want more performance, you have more of them to use.

That is another distinction: you're not running out of time for other tasks. You can have other cores run that other task, or optimize the critical parts of the code to run an order of magnitude faster. You're not limited to a single instruction path.


But I do agree that SPEs aren't designed and wired up to do it all. But they could, if you put in the work.

;)
 
DiGuru said:
That wouldn't be a very demanding thing to do. Just about any processor can do that, as long as it can access the memory required. I'm pretty sure I could get an Atmel ATmega microcontroller at 8 MHz to do that well enough.

My point is that being universally capable doesn't make a processor general purpose. Nobody would claim the microcontroller is general purpose, even though it could be made to run the same tasks as a Pentium 4.

The main argument is that an SPE can do (almost) anything that a processor like the PPE can do, so it must be general purpose.

edit:
I tried to illustrate through this example that just having the possibility of running almost everything doesn't make a chip general purpose.


Which is just the point about general processing: just about all of it is non-critical.

A lot of general processing isn't compute-intensive, but it is still critical; it is either I/O- or system-limited. A lot of it is unglamorous data juggling, context switching, running through shared libraries, and synchronization.

For current general purpose processors, these parts are not bottlenecks because the chips are adept enough at processing that other elements of the system become bottlenecks.

If you change the chip to one that is not as capable, then a lot of non-critical tasks become critical.

The SPE alone simply is not designed to be such a processor. Its targets are computationally intensive. They must be subdivided into convenient batches, and it works best if it can run its task uninterrupted from its local store for the entire duration. Its memory accesses must be highly predictable, and a huge portion of its performance within this limited scope is dependent on a workload that is vectorizable.

I don't get what's so controversial about that statement.
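For the curious, here is roughly what that batched, run-from-local-store pattern looks like in code. This is a minimal sketch assuming the Cell SDK's MFC intrinsics from spu_mfcio.h; the chunk size, the process_chunk() stand-in, and the requirement that the input be a whole number of chunks are all just assumptions for illustration.

```c
/* Double-buffered streaming on an SPE: pull fixed-size batches into local
 * store with DMA, and process one buffer while the next transfer is in
 * flight. Assumes total is a multiple of CHUNK. */
#include <spu_mfcio.h>

#define CHUNK 16384                         /* bytes per batch, fits in local store */

static char buf[2][CHUNK] __attribute__((aligned(128)));

static void process_chunk(char *data, unsigned int size)
{
    /* stand-in for the real per-batch work */
    for (unsigned int i = 0; i < size; ++i)
        data[i] ^= 0xFF;
}

void stream_process(unsigned long long ea, unsigned int total)
{
    unsigned int tag[2] = { 0, 1 };
    int cur = 0;
    unsigned int offset = 0;

    /* kick off the first transfer */
    mfc_get(buf[cur], ea, CHUNK, tag[cur], 0, 0);

    while (offset < total) {
        int next = cur ^ 1;
        unsigned int next_off = offset + CHUNK;

        /* start fetching the next batch before touching the current one */
        if (next_off < total)
            mfc_get(buf[next], ea + next_off, CHUNK, tag[next], 0, 0);

        /* wait only for the current buffer's DMA, then work on it */
        mfc_write_tag_mask(1 << tag[cur]);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);

        offset = next_off;
        cur = next;
    }
}
```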

An SPE would be wasted or dysfunctional if it were used to run an OS, a web browser, a game, a hardware driver, or a spreadsheet program.

It could be made to run parts of them, but not their entirety.

And while you can speed it up by reducing the time each instruction takes, look at the difference between an Athlon and a Pentium 4: the Athlon blows the Pentium away, even while running at half the speed, with a worse branch-prediction and OOO implementation, fewer execution units and a much worse IPC rate.

Depending on which Athlon and which Pentium 4 you are talking about, the performance was either worse, equivalent, or better.

The Athlon's branch predictor is very close to the Pentium 4's, and there are situations where it is better.

The Athlon's OOO implementation is not as aggressive as the P4's but this doesn't make it inferior. The Athlon is also a far wider design internally than the Pentium 4, so there is no Pentium advantage there.

Using specialized units to do stuff, be it SSE units or other processors on the same chip, has turned out to be a MUCH better way to increase the workload done per second. That's why they all go that way nowadays. The reign of the best IPC is over.

Putting an SSE unit on a Pentium did not magically make the Pentium special-purpose. The Pentium as a whole is still a fully-featured and highly capable general purpose processor.

On the other hand, being put on the Pentium did not magically make the SSE unit general purpose. The general purpose nature of the greater whole cannot be transferred back down to a specialized component.

I've already stated that CELL is general purpose, because the SPE is just part of a larger whole.

And even if it wasn't: how do you calculate the IPC between different processor architectures? You need to look at the workload done per second to be able to come up with any meaningful numbers.

You are measuring throughput, which is just one consideration. Latency, flexibility, and capability are other considerations, though not the only ones.

There is a vast swath of tasks the SPE
1) cannot do (software permissions, various interrupts, run the privileged software that handles memory mapping)
2) cannot do very well (anything needing a context switch, tons of branches, a lot of synchronization)
3) would be a waste of a lot of specialized hardware to have it do (take user input)

But it doesn't matter in the end, because CELL as a whole is capable of handling those tasks.

Yes, but as Edge said: does it matter if that processor is more than powerful enough in any case? It gets the job done. And if you want more performance, you have more of them to use.

That is a property of CELL, not the SPE. The SPE is not all there is to CELL; if there were only an SPE, it would be considered some special-purpose wannabe funky vector processor, and its shortcomings would be painfully obvious.

The SPE's design assumes there's a CELL built around it for it to work in a general environment.

That is another distinction: you're not running out of time for other tasks. You can have other cores run that other task, or optimize the critical parts of the code to run an order of magnitude faster. You're not limited to a single instruction path.

That is a property of CELL, not the SPE in isolation.

But I do agree that SPEs aren't designed and wired up to do it all. But they could, if you put in the work.

;)

They aren't wired up to run some of the most common software in existence.
I don't want to sound like a broken record, but if there's a huge number of things it can't run, then it doesn't sound very general purpose.
 
AI is a task that typically involves branch-heavy code, and the SPE totally lacks branch prediction, so whatever optimization you make, the AI will always run better on a general purpose processor than on an SPE.
 
supervegeta said:
AI is a task that typically involves branch-heavy code, and the SPE totally lacks branch prediction, so whatever optimization you make, the AI will always run better on a general purpose processor than on an SPE.

First, there is software-based branch prediction on the SPU. If you have enough work to place the prediction early enough, the branch will be free.
Second, even if one SPU is slower than a general purpose processor at the same frequency, the way SPUs are designed makes it trivial to extend the processing to 8 SPUs, and since SPUs scale very well you'll get close to an 8x speedup.
Good luck doing that on a general purpose processor, assuming you can even find one with 8 cores, which presently doesn't exist.
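As a rough sketch of what spreading work over 8 SPUs looks like from the PPE side, assuming the libspe2 API (spe_context_create / spe_program_load / spe_context_run); the embedded SPU program handle, the work_slice struct, and the even division of the data are all just assumptions for illustration:

```c
/* PPE-side sketch: one pthread per SPE context, each running the same SPU
 * kernel over its own contiguous slice of the data set. */
#include <libspe2.h>
#include <pthread.h>

#define NUM_SPES 8

extern spe_program_handle_t spu_kernel;     /* hypothetical embedded SPU image */

typedef struct {
    unsigned long long ea;                  /* effective address of this slice */
    unsigned int size;                      /* bytes in this slice */
} work_slice_t;

static void *run_one_spe(void *arg)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);

    spe_program_load(ctx, &spu_kernel);
    spe_context_run(ctx, &entry, 0, arg, NULL, NULL);   /* blocks until the SPU exits */
    spe_context_destroy(ctx);
    return NULL;
}

void process_on_all_spes(unsigned long long data_ea, unsigned int total)
{
    pthread_t thread[NUM_SPES];
    work_slice_t slice[NUM_SPES];
    unsigned int chunk = total / NUM_SPES;  /* assumes total divides evenly */

    for (int i = 0; i < NUM_SPES; ++i) {
        slice[i].ea   = data_ea + (unsigned long long)i * chunk;
        slice[i].size = chunk;
        pthread_create(&thread[i], NULL, run_one_spe, &slice[i]);
    }
    for (int i = 0; i < NUM_SPES; ++i)
        pthread_join(thread[i], NULL);
}
```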
 
Barbarian said:
First, there is software-based branch prediction on the SPU. If you have enough work to place the prediction early enough, the branch will be free.

It will never be as good as the branch prediction of a modern general purpose processor.

Second, even if one SPU is slower than a general purpose processor at the same frequency, the way SPUs are designed makes it trivial to extend the processing to 8 SPUs, and since SPUs scale very well you'll get close to an 8x speedup.
Good luck doing that on a general purpose processor, assuming you can even find one with 8 cores, which presently doesn't exist.

Wow, if you need that many SPEs running together just to match a single general purpose processor, you are just wasting all the SPE power on a task that would run a lot better on the PPE; that is a very idiotic thing to do.
 
supervegeta said:
Wow, if you need that many SPEs running together just to match a single general purpose processor, you are just wasting all the SPE power on a task that would run a lot better on the PPE; that is a very idiotic thing to do.
If comparing Cell's branch-heavy code processing to a conventional processor like the P4, if Cell can match the P4 then it can't be considered weak at general processing, no? The fact it uses SPEs rather than a branch predictor doesn't change its capacity to match the P4 in branchy code (if it can). So in Cell you have a processor that can handle branchy code as well as a P4, and vector code an order of magnitude faster. I don't see that that's something to complain about!

And yes, you're right, using all your SPEs to run branchy code is a waste of their potential. But only if what you're processing can't be dealt with as vector ops with branching. If there's no other way round the problem, at least you know using Cell isn't going to be any slower than any other processor at the same task, which is what this comparison is about, Cell versus 'general purpose processors' (assuming Barbarian's comments on the SPEs' branch performance are correct).
 
Shifty Geezer said:
If comparing Cell's branch-heavy code processing to a conventional processor like the P4, if Cell can match the P4 then it can't be considered weak at general processing, no? The fact it uses SPEs rather than a branch predictor doesn't change its capacity to match the P4 in branchy code (if it can). So in Cell you have a processor that can handle branchy code as well as a P4, and vector code an order of magnitude faster. I don't see that that's something to complain about!

And yes, you're right, using all your SPEs to run branchy code is a waste of their potential. But only if what you're processing can't be dealt with as vector ops with branching. If there's no other way round the problem, at least you know using Cell isn't going to be any slower than any other processor at the same task, which is what this comparison is about, Cell versus 'general purpose processors' (assuming Barbarian's comments on the SPEs' branch performance are correct).

Don't take this offensively, but I prefer proof to wild assumptions.
 
Shifty Geezer said:
If comparing Cell's branch-heavy code processing to a conventional processor like the P4, if Cell can match the P4 then it can't be considered weak at general processing, no?

It depends where you are executing the code. If you are executing it on the PPE, which has branch prediction, even if it is not as good as the branch prediction of a P4, it can be considered not so weak; but if you are executing it on the SPE, where it runs with much worse efficiency, to me it is weaker.



The fact it uses SPEs rather than a branch predictor doesn't change its capacity to match the P4 in branchy code (if it can). So in Cell you have a processor that can handle branchy code as well as a P4, and vector code an order of magnitude faster. I don't see that that's something to complain about!

Read again what I said before:

"if you need so much spe running together just to match a single general purpose processor, you are just wasting all the spe power for running a task that would run a lot better on the PPE , this is a very idiotic thing to do."

Why waste the SPEs executing a task that would run a lot better on the PPE?


And yes, you're right, using all your SPEs to run branchy code is a waste of their potential. But only if what you're processing can't be dealt with as vector ops with branching. If there's no other way round the problem, at least you know using Cell isn't going to be any slower than any other processor at the same task, which is what this comparison is about, Cell versus 'general purpose processors' (assuming Barbarian's comments on the SPEs' branch performance are correct).

But the Cell doesn't have infinite power, so if you use all the power of the SPEs just to compensate for the inefficiency in running this task, you have little power left to execute other tasks. That makes it weak compared to other processors that don't need to use all their power just to execute this type of task, and have a lot more power left over to do other things.

And again, you have a PowerPC core that has some branch prediction, so you have a way to solve the problem; no need to waste the SPE power.
 
supervegeta said:
It will never be as good as the branch prediction of a modern general purpose processor.

Wow, if you need that many SPEs running together just to match a single general purpose processor, you are just wasting all the SPE power on a task that would run a lot better on the PPE; that is a very idiotic thing to do.

Actually, software branch prediction has some advantages. For example, if the code places the prediction early enough, the branch will always have zero cost, while with a hardware predictor there are no guarantees. This fact, combined with predictable memory latency, makes the SPU very deterministic, which is a very good thing for scheduling parallel tasks.
And I think people overestimate the branch cost; on the SPU it's 18 cycles maximum. How many cycles do you think it is on the P4, with its rumoured 35+ stage pipeline? Branch prediction or not, if the algorithm jumps all over the place you're screwed.
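As a minimal sketch of what I mean by placing the prediction early, assuming spu-gcc maps the generic __builtin_expect hint onto the SPU's hint-for-branch instruction (the function and the rare/common split are made up for illustration):

```c
/* If the compiler knows which way a branch usually goes, it can place the
 * hint-for-branch well ahead of the branch, so the common path costs no
 * branch penalty; only the rare path eats the flush. */
static inline int clamp_to_limit(int value, int limit)
{
    /* Tell the compiler the overflow case is rare, so the fall-through
     * path gets hinted and the common case is effectively free. */
    if (__builtin_expect(value > limit, 0))
        return limit;                     /* rare, mispredicted path */
    return value;                         /* hinted, common path */
}
```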

And regarding using 8 SPUs to match a P4 - you say it's a dumb waste of resources, so how come running a 4 GHz P4 to type in Word is not a waste?!
My argument was that the SPU is quite capable in general processing and even if it falls short here or there, you've got 8 of them to compensate. Whether you can use them better is all a matter of time, resources and need.

As for using the PPU for general purpose tasks, yes, by all means, that's why it's there. On the other hand, I've seen examples where a single SPU performs better than the PPU on legacy code, i.e. general purpose code, that was just modified to run within the constraints of the SPU's local store; this alone gave an improvement of 1.2 to 1.5 times. Now, the real shocker was when the same code was optimized specifically for the SPU: it gained a 40x speedup compared to the PPU!
 
scificube said:
It's not a waste if the PPE is overburdened already and you have SPEs just sitting around...

To me it is still a waste, because you could use those SPEs to add more graphical detail instead of running inefficient code.
 
Barbarian said:
Actually, software branch prediction has some advantages. For example, if the code places the prediction early enough, the branch will always have zero cost, while with a hardware predictor there are no guarantees. This fact, combined with predictable memory latency, makes the SPU very deterministic, which is a very good thing for scheduling parallel tasks.
And I think people overestimate the branch cost; on the SPU it's 18 cycles maximum. How many cycles do you think it is on the P4, with its rumoured 35+ stage pipeline? Branch prediction or not, if the algorithm jumps all over the place you're screwed.

I think other users have already explained why branch-heavy code can't run on the SPE with the same performance as a P4 processor.


And regarding using 8 SPUs to match a P4 - you say it's a dumb waste of resources, so how come running a 4 GHz P4 to type in Word is not a waste?!

With the difference that while you are using a fraction of the P4's power, you are using all the power of the SPEs.


My argument was that the SPU is quite capable in general processing and even if it falls short here or there, you've got 8 of them to compensate. Whether you can use them better is all a matter of time, resources and need.

As for using the PPU for general purpose tasks, yes, by all means, that's why it's there. On the other hand, I've seen examples where a single SPU performs better than the PPU on legacy code, i.e. general purpose code, that was just modified to run within the constraints of the SPU's local store; this alone gave an improvement of 1.2 to 1.5 times. Now, the real shocker was when the same code was optimized specifically for the SPU: it gained a 40x speedup compared to the PPU!

Then yes, it is all about efficiency.
 
supervegeta said:
To me it is still a waste, because you could use those SPEs to add more graphical detail instead of running inefficient code.
You've totally missed the plot. The thread is about what is possible, and comparing SPU and Cell branch performance with 'normal' processors. If you need to run branchy code, Cell maybe isn't any worse than a P4. If you don't need to run branchy code, you can run vector math on Cell. If you use a P4 instead and you need branchy code, you're alright. If you don't need branchy code, all that branch prediction logic and those resources are a waste of die space, sat around doing nothing.

If you put a Cell in a computer, you can run really fast vector code and good branchy code. If you put a P4 in a computer, you can run good branchy code and moderate vector code. That's what Barbarian's saying.

As to how you choose to use Cell's resources, obviously you'd try to play to its strengths as much as possible. But this topic is about what the chip is capable of. Barbarian is saying that Cell's penalties in running branchy code, which to date have been regarded as high and as making Cell less effective in that regard than conventional processors, can be minimised, producing an effective branch-capable processor, with the SPEs handling branch prediction in software.
 
Shifty Geezer said:
You've totally missed the plot. The thread is about what is possible, and comparing SPU and Cell branch performance with 'normal' processors. If you need to run branchy code, Cell maybe isn't any worse than a P4. If you don't need to run branchy code, you can run vector math on Cell. If you use a P4 instead and you need branchy code, you're alright. If you don't need branchy code, all that branch prediction logic and those resources are a waste of die space, sat around doing nothing.

If you put a Cell in a computer, you can run really fast vector code and good branchy code. If you put a P4 in a computer, you can run good branchy code and moderate vector code. That's what Barbarian's saying.

As to how you choose to use Cell's resources, obviously you'd try to play to its strengths as much as possible. But this topic is about what the chip is capable of. Barbarian is saying that Cell's penalties in running branchy code, which to date have been regarded as high and as making Cell less effective in that regard than conventional processors, can be minimised, producing an effective branch-capable processor, with the SPEs handling branch prediction in software.

You are totally missing the point of my post again, so let me repeat what I said in the other post, and if you don't understand, then I give up.

If you talk about Cell as a whole processor, you are talking about the PPE plus the SPEs; if you talk about the SPEs alone, it is a totally different thing.

Now, the SPEs can execute branch-heavy code, but with very low efficiency compared to a general purpose processor and compared to the PPE. That is the point, and if you think otherwise, provide some proof.
 
supervegeta said:
If you talk about Cell as a whole processor, you are talking about the PPE plus the SPEs; if you talk about the SPEs alone, it is a totally different thing.
Thread title: The SPE as general purpose processor

Part of that discussion is how well the SPE can handle branchy code. Barbarian points out it could be very good in some cases.

Now, the SPEs can execute branch-heavy code, but with very low efficiency compared to a general purpose processor and compared to the PPE.
I can't argue with this. I don't know what the SPE's software branch prediction is like. But Barbarian has said the SPE can execute branch-heavy code as well as a general purpose CPU.

Now this is the real debate, and one you need to take up with Barbarian. Give your reasons why you think SPEs can't handle branchy code as fast as a P4, and he gives his reasons why he thinks they can. Or give evidence of the performance of a software branch predictor versus a hardware branch predictor. Just saying 'SPEs are no good at this' without giving reasons or evidence isn't contributing intelligently to the debate.

That's the discussion of this thread - how good are SPEs at running different types of code? Don't go saying 'that's a waste of SPE power' as an argument against software branch prediction, because that's not talking about how well SPEs execute different types of code.
 