Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

Fafalada said:
If only... :(
Frankly I find the exact opposite to be true for me - just about everything I dislike about Cell (as well as that other console CPU) is clearly IBM's influence/design.

True about the AOS vs SOA SIMD.

But the memory model is closer to the VUs than to a "real" CPU (DMA + local memory).

Cheers
Gubbi
 
aaronspink said:
There are a significant number of jobs that the SPU cannot do. The SPU's are subordinate to the PPE and rely on the PPE for a lot of the non-computational function.
I think to help this debate you ought to list a few and have other devs see whether they agree or not. At the moment you seem to be saying (at least that's what I'm hearing) that the SPEs can't do some things, but you don't seem to be saying (at least I'm not hearing it) what they can't do, and more importantly the relevance of not being able to do those things. E.g., as has been said, though an SPE can't boot itself into action, why would it need to? Save those transistors for other things.
 
aaronspink said:
The 8080 had multi-tasking added on by people. It doesn't make it suited for the work though. What you are effectively doing is writing a program that has several code sequences that loop back to a DO list when a given code sequence is done. You don't really need a kernel to do this.
Agreed, but for anything else you need some form of kernel to isolate tasks and ensure they do not interfere with each other. Cell is most likely going to use an exokernel structure on the SPUs, which consists of a very primitive resource multiplexer with all other abstractions generated in the application itself.
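For illustration, here's a minimal sketch of the kind of kernel-less "DO list" dispatcher Aaron describes, as it might look on an SPU. Everything here (task_t, the queue layout, the task bodies, the PPE filling the ring) is a hypothetical illustration, not a Cell SDK API:

[code]
#include <stdint.h>

/* Hypothetical task descriptor the PPE deposits in local store. */
typedef struct {
    uint32_t kind;   /* selects which code sequence to run */
    uint32_t arg;    /* LS offset of that sequence's data  */
} task_t;

#define QUEUE_LEN 16
volatile task_t   queue[QUEUE_LEN];  /* ring buffer, written by the PPE */
volatile uint32_t head, tail;        /* tail is advanced by the PPE     */

extern void run_physics(uint32_t arg);  /* hypothetical task bodies */
extern void run_audio(uint32_t arg);

int main(void)
{
    for (;;) {                  /* every code sequence loops back here */
        while (head == tail)    /* no kernel: spin until work arrives  */
            ;
        uint32_t kind = queue[head % QUEUE_LEN].kind;
        uint32_t arg  = queue[head % QUEUE_LEN].arg;
        head++;
        switch (kind) {         /* dispatch the next code sequence */
        case 0: run_physics(arg); break;
        case 1: run_audio(arg);   break;
        }
    }
}
[/code]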

aaronspink said:
There are a significant number of jobs that the SPU cannot do. The SPU's are subordinate to the PPE and rely on the PPE for a lot of the non-computational function.
They can do quite a lot aside from handling interrupts, it would appear. They can do computation and basic branching, correct? The branching is not as limited as people seem to believe either; consider a P4 with its very long pipeline (a good number of stages existing just to drive data across the chip) and a huge branch predictor that has to deal with x86 complexities, versus an SPU with an 18-cycle penalty if a hint goes wrong. The hints often won't go wrong if you think about loops (unrolling eliminates the need for hints entirely where possible; or you can branch with a hint, and the only time it misses is the final iteration where you exit the loop).
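To make the loop case concrete, a minimal sketch, assuming an spu-gcc-style compiler that turns __builtin_expect into the SPU's hint-for-branch instructions (the function and data are made up):

[code]
#define LIKELY(x) __builtin_expect(!!(x), 1)

float sum(const float *a, int n)   /* assumes n >= 1 */
{
    float s = 0.0f;
    int i = 0;
    do {
        s += a[i++];
    } while (LIKELY(i < n));  /* hinted taken: only the final, exiting
                                 iteration pays the mispredict penalty */
    return s;
}
[/code]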

aaronspink said:
The SPUs are glorified SIMD pipelines. They lack a significant amount of functionality in an effort to reduce their size to the point that so many could be put on. The question is if the trade off is worthwhile or merely a stopgap.
They aren't intended to be full cores, so stop making out as if they are; the PPE is an integral part, and its job is to handle interrupts and manage the EIB (which does require a more general-purpose approach, as you would suffer immensely if you had to wait X cycles every time you wished to enter an interrupt handler). Why should an SPU care about an interrupt when its job is to do the task given to it? These tasks often involve SIMD, and as the RISC philosophy states: optimise the common case. On a games console vector workloads are common, and the same holds in a lot of other areas where Cell is going to be applied: signal processing, image analysis, and 3D geometry are all vector streams.

aaronspink said:
I do believe that I've made my point. I'm fairly confident that I understand the technicalities at least as well as anyone else on this board.
You are coming from your 'holy' software perspective and preaching about hardware. Every processor in existence has tradeoffs - period. Stop trying to apply principles you have learnt from writing compilers, where there are few tradeoffs and you can aim for an optimal solution, to designing chips, where optimal is relative to the task. Ask yourself: what is general purpose computation? What is general purpose code? General is only relative to the application.
Another thing with the viewpoint you have taken is that you neglect the natural ability of hardware to operate concurrently. Given we can no longer get gains by operating sequentially, we are having to switch to using this facet to our advantage. It will cause some headaches, but hardware designers have managed it, so why can't software developers? This is a paradigm shift that is going to happen.


I confess you have highlighted some key points though - FFT benchmarks are not going to highlight the branching performance of Cell, just its pure throughput, which is obviously highly impressive. People here do take these examples out of context and somehow translate amazing performance here into amazing in-game performance, and that remains to be seen. However, given Cell is most definitely a RISC architecture and the mantra of Amdahl's Law was adhered to throughout (even at the gate level), either IBM and Sony were just plain dumb when they did all this (unlikely) or they did exactly as it says and had some clue about what they were doing (very likely, and probably a very large clue - MS's decisions also reflect this).

kryton
speaking for myself inc.
 
Kryton said:
They aren't intended to be full cores, so stop making out as if they are.

Erhh, Aaron was the one that said the SPUs aren't first class citizens (so to speak).

Kryton said:
You are coming from your 'holy' software perspective and preaching about hardware.

;)

Cheers
 
I have some questions; maybe someone is kind enough to answer them:

Looking at the self-multi-tasking slides, does anyone know if the distributed kernel could allow code on one SPU to execute on data from another SPU?

A claim made is that this scheme would allow programmers to get around the limiting size of the LS. I don't see how this is true if code and the data it must execute on are still restricted to a single LS. Basically, I don't know what "overlap of memory fetch" really means...but if it means an SPU can use another's LS somehow, then the LS size restriction is lifted...otherwise I'm not seeing it yet.

A clear drawback to this approach is increased latency, no?

If tasks can yield to one another, why is pre-emptive scheduling not possible?

What is to stop one from yielding anytime they like in a thread, and secondly from selecting whatever thread they like to execute next via whatever scheduling mechanism one could craft? I mean, with a couple of semaphores and a yield function one could craft a multi-level feedback queue...as long as they had space to save all the state information of the threads and whatnot...err...I think you can, at least :) Anyway...what gives?
 
scificube said:
If tasks can yield to one another, why is pre-emptive scheduling not possible?

What is to stop one from yielding anytime they like in a thread, and secondly from selecting whatever thread they like to execute next via whatever scheduling mechanism one could craft? I mean, with a couple of semaphores and a yield function one could craft a multi-level feedback queue...as long as they had space to save all the state information of the threads and whatnot...err...I think you can, at least :) Anyway...what gives?

Because multitasking relying on tasks yielding is co-operative multitasking, not pre-emptive.

Pre-emptive multitasking is when a supervisor preempts a running task to switch to another for whatever reason (e.g. time-slice spent).
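A minimal sketch of the difference, in plain C (hypothetical names, not a Cell API): a co-operative scheduler only regains control when a task hands it back, so a single task that never yields starves everything else.

[code]
typedef void (*task_fn)(void);

static task_fn run_list[4];   /* tasks registered elsewhere */
static int     ntasks;

void cooperative_scheduler(void)
{
    for (;;)
        for (int i = 0; i < ntasks; i++)
            run_list[i]();    /* one "slice": the task must return
                                 (i.e. yield) or nothing else ever runs */
}

/* Pre-emption would instead need a timer interrupt that can suspend a
 * task mid-execution, save its state, and resume another, regardless
 * of whether the task co-operates. */
[/code]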

Cheers
Gubbi
 
^^^^Thanks!

Last question and I'll leave you guys alone...at least for a little while :)

Do SPUs have HW support for quanta, i.e. does a kernel on an SPU have a mechanism to reclaim the SPU? Without this, preemptive scheduling would seem impossible, as the kernel must rely on an executing thread to yield to it...which may never happen. I'm lost as to why STI would do this.
 
MrWibble said:
Which is odd considering you're just outputting a stream of FUD, some of which flies in the face of the actual real world experience of those on this board who are programming the thing on a daily basis.

I don't believe that I'm outputting a stream of FUD. But then again, if I'm outputting a stream of FUD, many on this board are outputting a stream of propaganda.

You are confusing (I feel deliberately, though it might just be stupidity) the *choice* of a HW designer to place certain functions in the PPE instead of an SPU with design decisions made out of necessity.

It was not choice. There was a deliberate need on the part of the architects to reduce the size of the SPU in order to fit a large number on the chip, in order to meet some marketing-created target flops number. The architects and implementors would ideally, in both my view and I'm fairly certain theirs, have replicated the PPE as many times as required to hit the targeted flops number. The issue with this was the die size that would have resulted.

This led to designing a secondary attached processor with a limited functionality set. But don't confuse this with choice. There are a lot of significant compromises in the SPU design that are there in order to achieve the flops performance target.



Some of the things you mention aren't even absent from the current SPUs anyway... memory access?? My SPUs can access memory just fine, thank you very much.

Fine, why don't you program your SPU to walk a linked list stored over 512KB of memory? Did I mention that the walk will end up pseudo-random? Have fun. Let me know how that performance is going.

You are living in a dream world if you think that your SPU can access memory. The closest approximation to accessing memory is setting up a descriptor, initiating the start of a DMA transaction, waiting for the DMA transaction to complete, and then reading some datum from the LS. That isn't memory access.
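For the curious, a minimal sketch of what that sequence looks like in code. mfc_get, mfc_write_tag_mask, and mfc_read_tag_status_all are Cell SDK functions from spu_mfcio.h; the node layout and addresses are hypothetical. Every pointer chase through main memory pays this full round trip:

[code]
#include <stdint.h>
#include <spu_mfcio.h>

/* Hypothetical 16-byte list node living in main memory. */
typedef struct {
    uint64_t next_ea;   /* effective address of the next node */
    int32_t  value;
    int32_t  pad;
} node_t;

static node_t node __attribute__((aligned(16)));  /* LS staging buffer */

int walk_list(uint64_t ea)
{
    int sum = 0;
    while (ea != 0) {
        mfc_get(&node, ea, sizeof(node), 0, 0, 0);  /* queue the DMA    */
        mfc_write_tag_mask(1 << 0);                 /* select tag 0     */
        mfc_read_tag_status_all();                  /* stall until done */
        sum += node.value;                          /* only now: the datum */
        ea = node.next_ea;                          /* chase the pointer */
    }
    return sum;
}
[/code]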

I might as well suggest that the X360 CPU is crippled and not a "real CPU" because its memory interface is on a separate chip (the GPU) rather than being integrated.

You might, but you'd be three sheets to the wind and jumping the shark.

If the XGPU was not present I'm sure XCPU could have a memory interface, and if the PPE was not present then SPUs could be given more responsibility.

If the SPUs had been given more capability, they would be significantly bigger and fewer would fit on a die. A fully featured SPU would look somewhat like the PPE.

Certainly in practice the machine can work exactly as described by Sony's slides - the SPUs can entirely self-manage without intervention. So whatever you're claiming can be disproven simply by observation - my code certainly hasn't stopped working just because you've said it isn't possible, and runtime performance is not bound by reliance on the PPE.

They cannot self-manage without intervention. They can run long segmented programs, but they cannot self-manage. They rely on the PPE to herd them along and switch out programs when demand so requires.

And I never said your code wasn't possible; it's just a long segmented program with branching. Any notion that it is really a multi-tasking kernel is a figment of your imagination.

Aaron Spink
speaking for myself inc.
 
Shifty Geezer said:
I think to help this debate you ought to list a few and have other devs see whether they agree or not. At the moment you seem to be saying (at least that's what I'm hearing) that the SPEs can't do some things, but you don't seem to be saying (at least I'm not hearing it) what they can't do, and more importantly the relevance of not being able to do those things. E.g., as has been said, though an SPE can't boot itself into action, why would it need to? Save those transistors for other things.

I've stated the limitations of the SPEs on a variety of occasions in this thread; I'll not repeat myself for fear of being called someone who is trying to bury others in posts and spread FUD.

Suffice to say, the SPU pretty much only has the capability of running linear code sequences. It cannot maintain itself, it cannot maintain the system.

Aaron Spink
speaking for myself inc.
 
Kryton said:
The hints often won't go wrong if you think about loops (unrolling eliminates the need for hints entirely where possible; or you can branch with a hint, and the only time it misses is the final iteration where you exit the loop).

If hints often didn't go wrong, there wouldn't be so many branch predictors in the world. Loop unrolling causes code expansion, which can actually slow down programs.


They aren't intended to be full cores, so stop making out as if they are,

I believe that has been my point from the beginning. They aren't full, complete computational entities.


You are coming from your 'holy' software perspective and preaching about hardware.

The amount of pure software I've written over the years is probably in the range of 100K lines.

OTOH, I've written on the order of 1000K lines of HDL. I've worked on several processors, from the consumer space (with an attached media processor, back in, oh, '97) to server processors (of several different architectures).

Every processor in existence has tradeoffs - period. Stop trying to apply principles you have learnt from writing compilers, where there are few tradeoffs and you can aim for an optimal solution, to designing chips, where optimal is relative to the task.

I'm well aware of tradeoffs. I've stated as much. I would also posit that you don't know jack about compilers if you think there aren't significant tradeoffs involved in compiler design.

Another thing with the viewpoint you have taken is that you neglect the natural ability of hardware to operate concurrently.

I'm well aware of the ability of hardware to operate concurrently. People gave up on sequential instruction processors in the late '80s. Vector supers date farther back than that. Threaded machines date to almost the same timeframe as vectors. I believe what you would like to refer to is SOFTWARE operating concurrently.

Given we can no longer get gains by operating sequentially, we are having to switch to using this facet to our advantage. It will cause some headaches, but hardware designers have managed it, so why can't software developers? This is a paradigm shift that is going to happen.

Because software designers are stupid. They have enough trouble wrapping their heads around memory models that all hardware has eventually had to go back to strict coherence. I have no problem with telling software engineers to stop being lazy. The question is how many actually will, and how screwed up will the end product be? You will have some very bright coders who will manage the transition, but they will be a fraction of the software people out there (and likely a small fraction). The added complexities of CELL certainly don't help this.

This isn't a new problem. Parallel hardware has existed for over three decades without a lot to show for it on the software side. It is still a very, very specialized skill set within the software community, with a lot of complexities.



However, given Cell is most definitely a RISC architecture and the mantra of Amdahl's Law was adhered to throughout (even at the gate level), either IBM and Sony were just plain dumb when they did all this (unlikely) or they did exactly as it says and had some clue about what they were doing (very likely, and probably a very large clue - MS's decisions also reflect this).

Amdahl's Law has little to nothing to do with the hardware in this case, and certainly nothing at the gate level.

MS's decisions reflect a much more programmer-centric view of the world. CELL reflects a much more hardware-centric view of the world.
 
scificube said:
Looking at the self-multi-tasking slides, does anyone know if the distributed kernel could allow code on one SPU to execute on data from another SPU?

Not really. Within a code stream you are still restricted to the data and code within your own SPU's LS. It is possible, if things are set up correctly, to DMA from one SPU's LS to another's, but this has significant latency overheads and is not dynamic.
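A minimal sketch of the setup half of that, assuming the libspe2 API on the PPE side (spe_ls_area_get is a real libspe2 call; the hand-off of the address to the other SPU, e.g. via mailbox, is left out and hypothetical):

[code]
#include <stdint.h>
#include <libspe2.h>

/* PPE side: find the effective address at which an SPE's local store
 * is mapped, so it can be handed to a second SPE. */
uint64_t ls_effective_address(spe_context_ptr_t spe)
{
    void *ls = spe_ls_area_get(spe);   /* LS mapped into the EA space */
    return (uint64_t)(uintptr_t)ls;
}

/* SPU side (conceptually): given peer_ls_ea from the PPE,
 *     mfc_get(local_buf, peer_ls_ea + offset, size, tag, 0, 0);
 * pulls bytes straight out of the other SPU's LS across the EIB. */
[/code]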

The reality is there really isn't much you can do to get around the limitations of the LS design.

Aaron Spink
speaking for myself inc.
 
scificube said:
Do SPUs have HW support for quanta, i.e. does a kernel on an SPU have a mechanism to reclaim the SPU? Without this, preemptive scheduling would seem impossible, as the kernel must rely on an executing thread to yield to it...which may never happen. I'm lost as to why STI would do this.

The SPUs don't directly have support. It would be possible to make this work via the PPE, though. The PPE could take a timer interrupt and then, if required, stop execution on the SPU, context switch it, and start another context. The overhead of doing this, because of the way the SPU was designed, would be significant. In general, most programs are going to have to dedicate a code sequence per SPE and leave it there for the duration of its execution.

Aaron Spink
speaking for myself inc.
 
There was a deliberate need on the part of the architects to reduce the size of the SPU in order to fit a large number on the chip, in order to meet some marketing-created target flops number.

So the Cell chip was designed by the Sony marketing department, and every white/grey and black paper written on it is just PR people covering up this "scandal"?
 
FUD

aaronspink said:
I've stated the limitations of the SPEs on a variety of occasions in this thread; I'll not repeat myself for fear of being called someone who is trying to bury others in posts and spread FUD.

If you know what you are saying, then I agree with others that FUD is a good characterization of your postings.

http://en.wikipedia.org/wiki/FUD

"Fear, uncertainty and doubt (FUD) is a sales or marketing strategy of disseminating negative but vague or inaccurate information on a competitor's product."

Suffice to say, the SPU pretty much only has the capability of running linear code sequences. It cannot maintain itself, it cannot maintain the system.

It can only run linear code sequences and not maintain itself or the system? Why talk in such silly terms, my friend? The number one purpose of the SPE, like a GPU, is to act as a task accelerator. A GPU, like the Synergistic Processors of CELL, enables efficient off-loading and acceleration of processing-intensive tasks (such as transform and lighting) from the primary control unit (the PPE, in CELL's case). Would you prefer the GPU to have full self-initialization and management functions? Would you feel that is an efficient use of silicon/heat/power/cost?

I am sorry to say your complaint about the SPE is as sensible as complaining that a jet plane cannot take a corner at the Nürburgring as fast as a car. Different machines for different purposes, my friend. This is a simple lesson to learn.
 
aaronspink said:
It was not choice. There was a deliberate need on the part of the architects to reduce the size of the SPU in order to fit a large number on the chip, in order to meet some marketing-created target flops number.
Only that is a choice. That's the whole principle of these new processors: choosing to shrink the core by removing complexity in order to fit more on.
The architects and implementors would ideally, in both my view and I'm fairly certain theirs, have replicated the PPE as many times as required to hit the targeted flops number.
Only that wouldn't be possible without an ENORMOUS chip. A few large cores were considered, and the compromise was made.
This led to designing a secondary attached processor with a limited functionality set. But don't confuse this with choice. There are a lot of significant compromises in the SPU design that are there in order to achieve the flops performance target.
Which, again, was the choice they made. How do we get more float performance for our future float-heavy tasks? Cram on lots of small, maths-strong cores.

Suffice to say, the SPU pretty much only has the capability of running linear code sequences.
But that's not true. You don't need a branch predictor to branch, and having to go through DMA doesn't mean you can't access memory. It seems to me that because SPEs don't work the way the PPE does, you rate them as non-entities. You just have to do things differently, with of course some compromises in performance in areas the SPEs aren't intended to be strong, such as branch prediction.
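A minimal sketch of "doing things differently": streaming a data set far larger than the LS through two small buffers, so the DMA for the next block overlaps computation on the current one. mfc_get and the tag-wait calls are Cell SDK functions from spu_mfcio.h; process(), CHUNK, and the addresses are hypothetical:

[code]
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 4096
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *data, unsigned size);   /* hypothetical */

void stream(uint64_t ea, unsigned nchunks)
{
    unsigned cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prime buffer 0 */
    for (unsigned i = 0; i < nchunks; i++) {
        unsigned nxt = cur ^ 1;
        if (i + 1 < nchunks)                       /* fetch ahead */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1u << cur);             /* wait for current */
        mfc_read_tag_status_all();
        process(buf[cur], CHUNK);                  /* compute overlaps DMA */
        cur = nxt;
    }
}
[/code]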

As I see it, SPEs are no less able than the simple processors of yesteryear that did everything required of them.
 
-tkf- said:
So the Cell chip was designed by the Sony marketing department, and every white/grey and black paper written on it is just PR people covering up this "scandal"?

Toshiba introduced the SPEs.
 
The architects and implementors would ideally, in both my view and I'm fairly certain theirs, have replicated the PPE as many times as required to hit the targeted flops number.

But why replicate the PPE at all? aaronspink, you should be advocating replicating the PPC 970 (aka the G5) as many times as possible until you hit your FLOPS target.

You seem quite determined to downplay the SPEs as much as possible and advocate the PPE, but you would be far better off advocating the PPC 970.

I notice in the other thread you are equally vocal against the RSX architecture. You are certainly exhibiting a certain level of bias here, in favor of the Xbox 360 over the PS3 components.

I think it's safe to say, no matter your experience, we can take your opinion with a certain level of scepticism as to your motives.

Like others have shared here, the SPEs are specialized processors meant to deal with large sets of data that require massive computation, and are not designed to deal with the main loop of a game, which the PPE handles.

You say they are difficult to program, but I don't really hear many complaints from developers about this. Sure, there have been a few, but nothing major.
 
aaronspink said:
The SPUs don't directly have support. It would be possible to make this work via the PPE, though. The PPE could take a timer interrupt and then, if required, stop execution on the SPU, context switch it, and start another context. The overhead of doing this, because of the way the SPU was designed, would be significant. In general, most programs are going to have to dedicate a code sequence per SPE and leave it there for the duration of its execution.

Aaron Spink
speaking for myself inc.

I think that the SPUs have enough support - in the CBE Architecture books there is the ability to interrupt the SPU internally on a timer event, without any PPU help.
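A minimal sketch of using that for a time quantum, assuming the SPU language-extension intrinsics (spu_write_decrementer / spu_read_decrementer from spu_intrinsics.h); the quantum length and the check-between-tasks policy are made up. The decrementer counts down and wraps past zero, so a value above the starting budget means the quantum expired:

[code]
#include <spu_intrinsics.h>

#define QUANTUM 100000u              /* hypothetical tick budget */

void start_quantum(void)
{
    spu_write_decrementer(QUANTUM);  /* counts down toward zero */
}

int quantum_expired(void)
{
    /* Past zero the counter wraps to a huge unsigned value; a resident
     * scheduler could poll this between tasks and force a context save
     * when it fires. */
    return spu_read_decrementer() > QUANTUM;
}
[/code]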

Maybe the best way to think of the SPEs is as 'user mode' processors... They aren't usually given access to TLB/page mechanisms (if you wanted to, you could, as everything in the system is memory mapped, including the TLBs) - in the same way that applications aren't given such access under Windows.

For a fixed console design I think the processing capability is important - at the end of the day, the best programmers will tune their applications to the platform.
For general-purpose use, things are way more murky - after all, on a P4 or K8, how often is the CPU running flat out under the average Windows application load? :)
 