weaksauce said: Can somebody explain the "local storage" the SPEs have? Is it better/worse, slower/faster than regular L2 cache?
The "local storage" is a fixed chunk of fast memory that the SPE has to itself. This memory is the only memory that the SPE can actually access through its normal load/store instructions. If you want to access memory outside the SPE, then you need to initiate a DMA transfer, which has an enormous latency (~1000 cycles).
aaronspink said: There is the theory of specialization. It pretty much says that you either do it or don't do it, but don't straddle the fence.
arjan de lumens said: Proper management of the local store and the DMA mechanism can result in tremendous performance boosts for some (but far from all) applications - if the programmer is skilled enough and is able/willing to spend a serious amount of time optimizing for it.
arjan de lumens said: if you are careless and run out of "local storage" space in an SPE, your program fails altogether.
careless CELL programmers should change job..
nAo said:careless CELL programmers should change job..
The SPEs don't fit as much streaming float performance per square millimetre as a GPU.
arjan de lumens said:The difference being that if you are careless and run out of L2 cache space on a traditional CPU, you merely experience performance degradation; if you are careless and run out of "local storage" space in an SPE, your program fails altogether.
arjan de lumens said: Proper management of the local store and the DMA mechanism can result in tremendous performance boosts for some (but far from all) applications - if the programmer is skilled enough and is able/willing to spend a serious amount of time optimizing for it.
One consideration when using DMA to span the effective memory space of an SPE is that it imposes significant latency. With that in mind, prefetching (also known as double-buffering or multibuffering, particularly in game programming) becomes an important technique. If, while buffer N is being processed, buffer N-1 is being written out and buffer N+1 is being read in, the processor can execute continuously, even if the time required to transfer the data is a substantial fraction (up to half) of the time it takes to perform an operation.
Crazyace said: I think that the SPUs have enough support - in the CBE_Architecture books there is the ability to interrupt the SPU internally on a timer event, without any PPU help.
Maybe the best way to think of the SPEs is as 'User mode' processors... They aren't given access to TLB/Page mechanisms usually - (if you wanted to you could though, as everything in the system is memory mapped, including the TLBs..) in the same way as applications aren't given access under Windows.
For a fixed console design I think the processing capability is important - At the end of the day the best programmers will tune their applications to the platform.
For general purpose things are way more murky - after all, on a P4 or K8, how often is the CPU running full out with the average Windows application load?
arjan de lumens said:The difference being that if you are careless and run out of L2 cache space on a traditional CPU, you merely experience performance degradation; if you are careless and run out of "local storage" space in an SPE, your program fails altogether.
ihamoitc2005 said:
SPE (single vector processor):
25.6 Gflops ...
from 14.5 Sq. mm = 1.77 Gflops/Sq. mm
from 21m transistors = 1.22 Gflops/million transistors
Xenos (as vector processor, no edram):
192 Gflops ...
from ~200 (?) Sq. mm = 0.96 Gflops/Sq. mm
from 232m transistors = 0.83 Gflops/million transistors
You should also consider the fact that a GPU is more likely to reach its theoretical throughput than CELL, with considerably less effort (by developers).
scificube said: The question is whether an SPU can interrupt execution on another SPU and perform a context switch to another thread of execution. From the sound of things HW support for this does not appear to be there.
Edge said:IBM seems to think this is well suited to game programming as per one of their articles:
from this very interesting page here:
http://www-128.ibm.com/developerworks/power/library/pa-fpfunleashing/
nAo said: You should also consider the fact that a GPU is more likely to reach its theoretical throughput than CELL with considerably less effort (by developers)
ihamoitc2005 said: I would like to understand why you feel this is important and what you feel is the overall performance cost of not having this feature.
ihamoitc2005 said: I am curious why you feel it is possible to run out of local storage for the SPE, and under what circumstances, given a multi-tier memory architecture with a very large register file and managed loads to and from the local store between memory and the SPE. Do you feel this would be due to bad management effort by the developer or an unpredictable event?
Bad (faulty) management more than unpredictable events. This should never be an issue with "correct" (in the sense of functioning correctly/not having bugs) programs running on it, but writing "correct" code for the SPE generally requires a fair bit of attention.
ihamoitc2005 said: If so, then the choice of Xenos for vector coprocessor duty instead of SPEs is despite a big disadvantage from a hardware capability standpoint, perhaps for a gain in programming ease. If easier to program, then perhaps the hardware capability disadvantage can be decreased, no?
Yes, even though a direct comparison can't be made because Xenos (or RSX) can be much better at some things (texture mapping, rasterizing, etc..) than CELL, just as CELL can be much better than a GPU at some other tasks.