randycat99
Veteran
The Itanium comes from the era of shorter pipeline design, prior to the P4 age, no?
randycat99 said: The Itanium comes from the era of shorter pipeline design, prior to the P4 age, no?
So, your single distinction between the two is the workload average? But in that case, none of the x86/PPC general purpose processors in use nowadays qualify for that, because you can speed all of those up by rewriting critical sections of your code to use more specialized units like SSE, or even the floating point units. It's just that not many developers do so, as long as their code also has to run on all the other processors of that kind, even the ones that don't support those possibilities.

3dilettante said: That makes them universal, not general purpose (so long as you ignore things like software permissions and interrupts).
An 8-bit embedded processor from a coffee maker could be made to do any operation you want within its memory space with enough work. That doesn't mean it's magically general purpose.
DiGuru said: So, your single distinction between the two is the workload average? But in that case, none of the x86/PPC general purpose processors in use nowadays qualify for that, because you can speed all of those up by rewriting critical sections of your code to use more specialized units like SSE, or even the floating point units. It's just that not many developers do so, as long as their code also has to run on all the other processors of that kind, even the ones that don't support those possibilities.
No, when we go by that definition, the processor in a coffee maker would be the most general purpose one, if you look at the code it executes. And it would be very efficient if you look at the number of transistors it uses to do all that.
3dilettante said: In the case of the coffee maker, it can be made to manipulate everything in its limited memory space as demanded by any appropriately configured algorithm. If space concerns could be alleviated, any software could be compiled to run on it.
Problem is, it can't do a lot of that work well.
That wouldn't be a very demanding thing to do. Just about any processor can do that, as long as it can access the memory required. I'm pretty sure I could get an Atmel ATmega microcontroller at 8 MHz to do that well enough.

3dilettante said: You don't want it to spell check a Word document, even though in theory you could hack a complicated software monstrosity to do it.
Yes, but as Edge said: does it matter if that processor is much more than powerful enough in any case? It gets the job done. And if you want more performance, you have more of them to use.

If nobody in their right mind will use a given chip outside of a given set of specialized tasks, it doesn't matter if it in theory could be made to do so.
DiGuru said: That wouldn't be a very demanding thing to do. Just about any processor can do that, as long as it can access the memory required. I'm pretty sure I could get an Atmel ATmega microcontroller at 8 MHz to do that well enough.
Which is just the point about general processing: just about all of it is non-critical.
And while you can speed it up by reducing the time each instruction takes, look at the difference between an Athlon and a Pentium 4: the Athlon blows the Pentium away even while running at half the clock speed, with a worse branch-prediction and OOO implementation and fewer execution units, because it gets much more useful work done per cycle.
Using specialized units to do the work, be it SSE units or other processors on the same chip, has turned out to be a MUCH better way to increase the workload done per second. That's why they all go that way nowadays. The reign of the best IPC is over.
And even if it wasn't: how do you calculate the IPC between different processor architectures? You need to look at the workload done per second to be able to come up with any meaningful numbers.
Yes, but as Edge said: does it matter if that processor is much more than powerful enough in any case? It gets the job done. And if you want more performance, you have more of them to use.
That is another distinction: you're not running out of time for other tasks. You can have other cores run that other task, or optimize the critical parts of the code to run an order of magnitude faster. You're not limited to a single instruction path.
But I do agree that SPEs aren't designed and wired up to do it all. But they could, if you made them.
supervegeta said: AI is a task that typically requires branchy code, and the SPEs totally lack any branch prediction, so whatever optimization you make, the AI will always run better on a general purpose processor than on an SPE.
Barbarian said: First, there is software-based branch prediction on the SPU. If you have enough work to place the prediction early enough, the branch will be free.
Second, even if one SPU is slower than a general purpose processor at the same frequency, the way SPUs are designed makes it trivial to extend the processing to 8 SPUs, and since SPUs scale very well you'll get close to an 8x speedup.
Good luck doing that on a general purpose processor, assuming you could even find one with 8 cores, which presently doesn't exist.
If comparing Cell's branch-heavy code processing to a conventional processor like the P4, if Cell can match the P4 then it can't be considered weak at general processing, no? The fact it uses SPEs rather than a branch predictor doesn't change its capacity to match the P4 in branchy code (if it can). So in Cell you have a processor that can handle branchy code as well as a P4, and vector code an order of magnitude faster. I don't see that that's something to complain about!
Shifty Geezer said: If comparing Cell's branch-heavy code processing to a conventional processor like the P4, if Cell can match the P4 then it can't be considered weak at general processing, no? The fact it uses SPEs rather than a branch predictor doesn't change its capacity to match the P4 in branchy code (if it can). So in Cell you have a processor that can handle branchy code as well as a P4, and vector code an order of magnitude faster. I don't see that that's something to complain about!
And yes, you're right, using all your SPEs to run branchy code is a waste of their potential. But only if what you're processing can't be dealt with as vector ops with branching. If there's no other way round the problem, at least you know using Cell isn't going to be any slower than any other processor at the same task, which is what this comparison is about, Cell versus 'general purpose processors' (assuming Barbarian's comments on SPE branch performance are correct).
supervegeta said: It will never be as good as the branch prediction of a modern general purpose processor.
Wow, if you need so many SPEs running together just to match a single general purpose processor, you are just wasting all the SPE power running a task that would run a lot better on the PPE; this is a very idiotic thing to do.
scificube said: It's not a waste if the PPE is overburdened already and you have SPEs just sitting around...
Barbarian said: Actually, software branch prediction has some advantages. For example, if the code predicts the branch early enough, it will always have zero cost, while with a hardware predictor there are no guarantees. This fact, combined with predictable memory latency, makes the SPU very deterministic, which is a very good thing for scheduling parallel tasks.
And I think people overestimate the branch cost; on the SPU it's 18 cycles maximum. How many cycles do you think it is on the P4, with its rumoured 35+ stage pipeline? Branch prediction or not, if the algorithm jumps all over the place you're screwed.
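When the branch cost matters this much, the usual trick on an in-order, predictor-less core like the SPU is to remove the branch altogether: compute both sides and select with a mask. The sketch below shows the idiom in plain scalar C as an illustration (real SPU code would use SIMD select intrinsics; nothing here is an actual SPU intrinsic).

```c
#include <stdint.h>

/* branchy version: each `if` is a potential pipeline flush */
int32_t clamp_branchy(int32_t x, int32_t lo, int32_t hi)
{
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

/* branchless version: turn each comparison into an all-ones/all-zeros
 * mask and blend -- the same select idiom SPU SIMD code relies on */
int32_t clamp_branchless(int32_t x, int32_t lo, int32_t hi)
{
    int32_t m_lo = -(int32_t)(x < lo);     /* -1 (all ones) if x < lo, else 0 */
    x = (lo & m_lo) | (x & ~m_lo);
    int32_t m_hi = -(int32_t)(x > hi);     /* -1 (all ones) if x > hi, else 0 */
    x = (hi & m_hi) | (x & ~m_hi);
    return x;
}
```

The branchless form does slightly more arithmetic but takes the same fixed number of cycles regardless of the data, which is exactly the determinism Barbarian describes.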
And regarding using 8 SPUs to match a P4: you say it's a dumb waste of resources, so how come running a 4 GHz P4 to type in Word is not a waste?!
My argument was that the SPU is quite capable in general processing and even if it falls short here or there, you've got 8 of them to compensate. Whether you can use them better is all a matter of time, resources and need.
As for using the PPU for general purpose tasks, yes, by all means, that's why it's there. On the other hand, I've seen examples where a single SPU performs better than the PPU on legacy code, i.e. general purpose code that was just modified to run within the constraints of the SPU's local store; this gave an improvement of 1.2 to 1.5 times. Now, the real shocker was when the same code was optimized specifically for the SPU: it gained a 40x speedup compared to the PPU!
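The "modified to run within the constraints of the local store" restructuring amounts to streaming a large buffer through a small resident tile. Here is a minimal host-side sketch of that pattern, under loose assumptions: `memcpy` stands in for the DMA transfers an SPU would issue, and the tile size is illustrative (a real SPU has 256 KB of local store to budget for code and data together).

```c
#include <string.h>

#define LS_CHUNK 4096   /* pretend local-store tile, in elements */

/* process one resident tile: everything it touches is "local",
 * so access latency is fixed and predictable */
static void process_tile(float *tile, int n)
{
    for (int i = 0; i < n; i++)
        tile[i] = tile[i] * 2.0f + 1.0f;
}

/* stream a large buffer through the small tile */
void process_streamed(float *big, int n)
{
    static float tile[LS_CHUNK];
    for (int off = 0; off < n; off += LS_CHUNK) {
        int len = (n - off < LS_CHUNK) ? (n - off) : LS_CHUNK;
        memcpy(tile, big + off, len * sizeof(float));  /* "DMA in"  */
        process_tile(tile, len);
        memcpy(big + off, tile, len * sizeof(float));  /* "DMA out" */
    }
}
```

The further SPU-specific optimization Barbarian mentions (the 40x case) would then overlap the "DMA in" of the next tile with processing of the current one (double buffering) and vectorize `process_tile`; this sketch only shows the tiling step.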
You've totally missed the plot. The thread is about what is possible, comparing SPU and Cell branch performance with 'normal' processors. If you need to run branchy code, Cell maybe isn't any worse than a P4. If you don't need to run branchy code, you can run vector math on Cell. If you use a P4 instead and you need branchy code, you're alright. If you don't need branchy code, all that branch prediction logic and those resources are a waste of die space, sat around doing nothing.

supervegeta said: To me it is still a waste, because you could use those SPEs to add more graphical detail instead of running inefficient code.
Shifty Geezer said: You've totally missed the plot. The thread is about what is possible, comparing SPU and Cell branch performance with 'normal' processors. If you need to run branchy code, Cell maybe isn't any worse than a P4. If you don't need to run branchy code, you can run vector math on Cell. If you use a P4 instead and you need branchy code, you're alright. If you don't need branchy code, all that branch prediction logic and those resources are a waste of die space, sat around doing nothing.
If you put a Cell in a computer, you can run really fast vector code and good branchy code. If you put a P4 in a computer, you can run good branchy code and moderate vector code. That's what Barbarian's saying.
As to how you choose to use Cell's resources, obviously you'd try to play to its strengths as much as possible. But this topic is about what the chip is capable of. Barbarian is saying that Cell's penalties in running branchy code, which to date have been regarded as high and render Cell a less effective processor in that regard than conventional processors, can be minimised, producing an effective branch-capable processor, with SPEs being capable of handling branch prediction in software.
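"Branch prediction in software" here means the code, not the hardware, states which way a branch will go ahead of time; the SPU exposes this as a branch-hint instruction (`hbr`) that the compiler schedules early. The closest widely available analogue is the GCC/Clang `__builtin_expect` builtin, sketched below; the macro names and the 0xFF "error byte" scenario are illustrative inventions, not anything from the thread.

```c
#include <stddef.h>

/* hint macros in the style used by many C codebases; they tell the
 * compiler which outcome to treat as the fall-through (predicted) path */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* count rare 0xFF bytes; the hot loop is hinted so the common case
 * (no error byte) runs straight through without a taken branch */
size_t count_errors(const unsigned char *buf, size_t n)
{
    size_t errs = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(buf[i] == 0xFF))   /* hinted: almost never taken */
            errs++;
    }
    return errs;
}
```

The hint changes only performance, never the result: a wrongly hinted branch still executes correctly, it just pays the flush cost, which is the same trade-off Barbarian describes for an early-vs-late SPU branch hint.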
Thread title: The SPE as general purpose processor

supervegeta said: If you talk of Cell as a whole processor you are talking about the PPE plus the SPEs; if you talk about the SPEs alone it is a totally different thing.
I can't argue this. I don't know what the SPE's branch prediction code is like. But Barbarian has said the SPE can execute branch prediction code as well as a general purpose CPU.

Now, the SPEs can execute branch prediction code, but with a very low efficiency compared to a general purpose processor and compared to the PPE.