Does Cell Have Any Advantages Over XCPU Other Than FLOPS?

weaksauce said:
Can somebody explain the "local storage" the SPEs have? Is it better/worse, slower/faster than a regular L2 cache?
The "local storage" is a fixed chunk of fast memory that the SPE has to itself. This memory is the only memory that the SPE can actually access through its normal load/store instructions. If you want to access memory outside the SPE, then you need to initiate a DMA transfer, which has an enormous latency (~1000 cycles).

The main advantage of the "local storage" is that it is somewhat smaller (in die area) and faster than a traditional L2 cache of similar capacity.

The main disadvantage is that the DMA mechanism greatly complicates programming of the SPE - unlike an L2 cache miss in a standard processor, the DMA cannot be handled in a manner that is transparent to the programmer.

Proper management of the local store and the DMA mechanism can result in tremendous performance boosts for some (but far from all) applications - if the programmer is skilled enough and is able/willing to spend a serious amount of time optimizing for it.
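
To make the DMA mechanism concrete, here is a minimal SPE-side sketch using the Cell SDK's spu_mfcio.h intrinsics; the buffer size, tag value, and function name are illustrative only, not something from the thread:

Code:
/* Minimal SPE-side sketch (assumes the Cell SDK's spu_mfcio.h intrinsics):
   pull one chunk of main memory into local store via DMA, then block until
   the transfer completes. Names and sizes are illustrative only. */
#include <spu_mfcio.h>

#define CHUNK_SIZE 16384   /* 16 KB - the largest single DMA transfer */

static char local_buf[CHUNK_SIZE] __attribute__((aligned(128)));

void fetch_chunk(unsigned long long ea)   /* ea = effective address in main memory */
{
    const unsigned int tag = 0;           /* DMA tag group (0..31) */

    /* Kick off the transfer: local store <- main memory. */
    mfc_get(local_buf, ea, CHUNK_SIZE, tag, 0, 0);

    /* Block until everything in this tag group has completed -
       this is where the hundreds of cycles of latency are paid. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* local_buf can now be used with ordinary load/store instructions. */
}

The point is that the transfer is entirely the programmer's problem: nothing happens until the mfc_get() is issued, and nothing guarantees the data is present until the tag status is read back.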
 
aaronspink said:
There is the theory of specialization. It pretty much says that you either do it or you don't, but you don't straddle the fence.

Oh bullshit. Any "theory of specialization" is going to be a subset of the work done by Wolpert and Macready on the NFLT, with which I'm quite well acquainted. In any application of the NFLT - as is often the case in the biological models I work with - it's readily apparent to anyone who comprehends the theory that while a generalized, universal optimization strategy doesn't exist, in the real world you need to look at how the strategy maps onto the underlying landscape (whatever that may be: an abstract fitness landscape, or the landscape that abstractly defines the computational requirements of gaming).

It doesn't state that you "don't straddle the fence"; it states that you find the strategy that best optimizes for your abstract landscape, whatever that is. It doesn't dictate a binary, all-or-nothing approach to design as you so adamantly posit; rather, as many others are attempting to point out, a more hybrid design is likely to be closest to optimal in the plurality of cases, over the plurality of landscapes.

And it's for this reason that GPUs are built the way they are and outperform CPUs, it's for this reason that DSPs exist and outperform CPUs. It's for this reason that SPUs and Cell exist and kick the living shit out of your EV7 on the tasks they were designed for; each is suited to its particular niche.

It just so happens that Cell's requirements differ from those of the EV7 or any other CPU that follows the generalized x86 evolution. And when you stop and actually think about it, it's pretty clear that Cell's architecture pretty much kicks all ass on the landscapes it'll be operating on.
 
arjan de lumens said:
Proper management of the local store and the DMA mechanism can result in tremendous performance boosts for some (but far from all) applications - if the programmer is skilled enough and is able/willing to spend a serious amount of time optimizing for it.

and the counter-thesis of the above:

if the programmer is unskilled and/or unwilling to spend time analyzing his code's memory access patterns, he will 'seamlessly' take performance hits on the order of multiple hundreds of cycles on a cache-based architecture.


moral of the story - nothing works auto-magically, not in the CPU industry, nor anywhere else.
 
The difference being that if you are careless and run out of L2 cache space on a traditional CPU, you merely experience performance degradation; if you are careless and run out of "local storage" space in an SPE, your program fails altogether.
 
arjan de lumens said:
if you are careless and run out of "local storage" space in an SPE, your program fails altogether.
careless CELL programmers should change jobs.. ;)
 
SPE performance efficiency per sq. mm

The claim that the SPEs don't fit as much streaming float performance per square millimetre as a GPU doesn't hold up against the numbers:

SPE (single vector processor):
25.6 Gflops ...
from 14.5 sq. mm = 1.77 Gflops/sq. mm
from 21M transistors = 1.22 Gflops/million transistors

Xenos (as vector processor, no eDRAM):
192 Gflops ...
from ~200 (?) sq. mm = 0.96 Gflops/sq. mm
from 232M transistors = 0.83 Gflops/million transistors
 
arjan de lumens said:
The difference being that if you are careless and run out of L2 cache space on a traditional CPU, you merely experience performance degradation; if you are careless and run out of "local storage" space in an SPE, your program fails altogether.

my experience in the field inclines me to think that the average programmer is far more likely to address the issue of his code failing than the issue of his code performing abysmally ; ) or, put in other words, code running on SPUs will either be written sanely or it will not run at all. code written on a vanilla cached architecture has all the liberty to be as sub-par as it comes.

which reiterates that original 'moral' i brought up - nothing works auto-magically, and behaving as if things actually did is self-delusion on the part of the programmer - in this regard, for one reason or another, caches have played a historically negative role *shrug*
 
arjan de lumens said:
Proper management of the local store and the DMA mechanism can result in tremendous performance boosts for some (but far from all) applications - if the programmer is skilled enough and is able/willing to spend a serious amount of time optimizing for it.

IBM seems to think this is well suited to game programming as per one of their articles:

One consideration when using DMA to span the effective memory space of an SPE is that it imposes significant latency. With that in mind, prefetching (also known as double-buffering or multibuffering, particularly in game programming) becomes an important technique. If, while buffer N is being processed, buffer N-1 is being written out, and then buffer N+1 is being read in, the processor can execute continuously, even if the time required to transfer the data is a substantial fraction (up to half) of the time it takes to perform an operation.

from this very interesting page here:
http://www-128.ibm.com/developerworks/power/library/pa-fpfunleashing/
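
As a rough sketch of the double-buffering idea in that excerpt (again using the spu_mfcio.h intrinsics; the chunk size and the process() work function are hypothetical), while buffer N is being processed, buffer N+1 is already in flight:

Code:
/* Double-buffering sketch for an SPE (assumes spu_mfcio.h). While the
   current buffer is processed, the next chunk is already being fetched,
   hiding most of the DMA latency. Names and sizes are illustrative. */
#include <spu_mfcio.h>

#define CHUNK 16384
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process(char *data, unsigned int size);   /* hypothetical work function */

void stream(unsigned long long ea, unsigned int nchunks)
{
    unsigned int cur = 0;

    /* Prime the pipeline: start fetching chunk 0. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned int i = 0; i < nchunks; i++) {
        unsigned int next = cur ^ 1;

        /* Start fetching chunk i+1 into the other buffer, if there is one. */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        /* Wait only for the buffer we are about to work on. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);

        cur = next;
    }
}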
 
Crazyace said:
I think that the SPUs have enough support - in the CBE_Architecture books there is the ability to interrupt the SPU internally on a timer event, without any PPU help.

Maybe the best way to think of the SPEs is as 'user mode' processors... They usually aren't given access to the TLB/page mechanisms (if you wanted to, you could, though, as everything in the system is memory mapped, including the TLBs..), in the same way that applications aren't given such access under Windows.

For a fixed console design I think the processing capability is important - at the end of the day the best programmers will tune their applications to the platform.
For general-purpose use things are way more murky - after all, on a P4 or K8, how often is the CPU running full out under the average Windows application load? :)

The question is whether an SPU can interrupt execution on another SPU and perform a context switch to another thread of execution. From the sound of things HW support for this does not appear to be there.
 
arjan de lumens said:
The difference being that if you are careless and run out of L2 cache space on a traditional CPU, you merely experience performance degradation; if you are careless and run out of "local storage" space in an SPE, your program fails altogether.

I am curious why you feel it is possible to run out of local storage on an SPE, and under what circumstances, given the multi-tier memory architecture with its very large register file and managed loads between main memory and the local store. Do you feel this would be due to poor management by the developer, or to an unpredictable event?
 
ihamoitc2005 said:
SPE (single vector processor):
25.6 Gflops ...
from 14.5 sq. mm = 1.77 Gflops/sq. mm
from 21M transistors = 1.22 Gflops/million transistors

Xenos (as vector processor, no eDRAM):
192 Gflops ...
from ~200 (?) sq. mm = 0.96 Gflops/sq. mm
from 232M transistors = 0.83 Gflops/million transistors
You should also consider the fact that a GPU is more likely than CELL to reach its theoretical throughput, with considerably less effort (by developers).
 
Why, and at what cost?

scificube said:
The question is whether an SPU can interrupt execution on another SPU and perform a context switch to another thread of execution. From the sound of things HW support for this does not appear to be there.

I would like to understand why you feel this is important, and what you feel the overall performance cost of not having this feature is.
 
Edge said:
IBM seems to think this is well suited to game programming as per one of their articles:



from this very interesting page here:
http://www-128.ibm.com/developerworks/power/library/pa-fpfunleashing/

If you have an idea in advance of what your memory access pattern will look like, you can achieve great performance. Otherwise you are in trouble. The CELL will reward/penalize you in this regard much more than a 'traditional' CPU.
 
Maybe

nAo said:
You should also consider the fact that a GPU is more likely than CELL to reach its theoretical throughput, with considerably less effort (by developers).

I was only showing that in terms of actual vector calculation capability, the SPE has ~85% more per unit die area (1.77 vs. 0.96 Gflops/sq. mm) and ~45% more per transistor (1.22 vs. 0.83 Gflops/million transistors) than the Xenos GPU.

Therefore Aaron Spink's proposal of Xenos instead of SPEs for the vector processor array is a very poor choice in terms of size, cost, heat, and power, as well as raw capability. Clearly the small, cheap, fast, efficient SPE offers much superior performance for such tasks, with the added advantages of much more flexibility and range of use, as well as superior control for isolating a specific SPE for a specific task.

However, maybe you are correct that Xenos is easier to program as a vector processor, I do not know. If so, then choosing Xenos for vector coprocessor duty instead of SPEs means accepting a big disadvantage in hardware capability for a possible gain in programming ease. If it is easier to program, then perhaps the hardware capability disadvantage can be reduced, no?
 
ihamoitc2005 said:
I would like to understand why you feel this is important, and what you feel the overall performance cost of not having this feature is.

With respect to the PS3's needs...it would seem for the most part inconsequential.

With respect to how well Cell maps to the other tasks one generally comes across, this is more important.

Personally I am trying to see whether the claim that Cell could handle multiple OSes simultaneously was pure marketing tripe, or whether there is something the consensus here is missing. Without preemptive scheduling via the kernel, I would say most OSes would have considerable trouble running on an SPU, and without preemptive scheduling an OS is either rather dated or really isn't a serious contender. It is an assumption, perhaps a wrong one, but I assume the SPUs were meant to have a hand in this, as the PPE alone running multiple OSes seems impractical to me given that it will also have to handle the wealth of more unpredictably branchy code out there.

I wish to know whether the SPU can in fact operate fully independently or not. OSes seem to me the only real place where true preemptive scheduling is needed, so that's where I focus. I admit, though... I'm inexperienced, so perhaps that's wrong.

As this discussion is no longer about how Cell stacks up against Xenon, and is at this point focused solely on the merits (or lack thereof) of Cell's design, I thought I could take the opportunity to learn something from those farther along in the game than I am.

Hope that's ok.
 
ihamoitc2005 said:
I am curious why you feel it is possible to run out of local storage on an SPE, and under what circumstances, given the multi-tier memory architecture with its very large register file and managed loads between main memory and the local store. Do you feel this would be due to poor management by the developer, or to an unpredictable event?
Bad (faulty) management more than unpredictable events. This should never be an issue with "correct" (in the sense of functioning correctly/not having bugs) programs running on it, but writing "correct" code for the SPE generally requires a fair bit of attention.
 
ihamoitc2005 said:
If so, then choosing Xenos for vector coprocessor duty instead of SPEs means accepting a big disadvantage in hardware capability for a possible gain in programming ease. If it is easier to program, then perhaps the hardware capability disadvantage can be reduced, no?
Yes, even though a direct comparison can't be made, because Xenos (or RSX) can be much better at some things (texture mapping, rasterizing, etc.) than CELL, just as CELL can be much better than a GPU at some other tasks.
A simple example: with an SPE one can randomly walk over an octree structure easily, without mishinting a branch, at full rate. Try to do that with a GPU.. :)
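
For what it's worth, here is a hypothetical sketch of the kind of walk being described, with the whole tree resident in local store; every step is an ordinary local-store load, so an arbitrarily unpredictable path costs no more than a predictable one (the node layout and traversal are invented for illustration):

Code:
/* Hypothetical sketch: random walk over an octree kept entirely in the
   SPE's local store. Every node access is a plain load from local store,
   so there is no cache to miss, no matter how unpredictable the path.
   Node layout and traversal are invented for illustration. */
#include <stdint.h>

typedef struct {
    int32_t  child[8];     /* index of each child, -1 if absent */
    uint32_t payload;      /* whatever data the node carries    */
} OctreeNode;

static OctreeNode nodes[4096];   /* flat array of nodes in local store */

/* Descend from the root, picking a child from the low 3 bits of `path`
   at each level, until we fall off the tree. */
uint32_t walk(uint32_t path)
{
    int32_t  idx  = 0;     /* start at the root */
    uint32_t last = 0;

    while (idx >= 0) {
        last = nodes[idx].payload;
        idx  = nodes[idx].child[path & 7];
        path >>= 3;
    }
    return last;
}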
 