Cell specifications released

Because the latency
and instruction overhead associated with DMA transfers exceeds that of the
latency of servicing a cache miss, this approach achieves an advantage only if
the DMA transfer size is sufficiently large and is sufficiently predictable (that is,
DMA can be issued before data is needed).

Could someone elaborate on the implications of this? Would this be practical for game programming?
 
:oops: There's quite a bit of information there

of which I don't think I really understand 1%

but will read anyway, 'cos there could be worse ways to spend a day at work :D
 
Can anyone actually understand this? Maybe faf, deano, or nao could translate it into layman's terms. It might take a while; who has the time to read 319 pages anyway, and that's just the first PDF!
 
seismologist said:
Could someone elaborate on the implications of this? Would this be practical for game programming?

Of course. You typically use a double buffer approach, e.g. if you have to transform a bunch of vertices, you DMA in data on one buffer while processing the other. That's the way you do it on PS2 VU1.
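To make the double-buffer idea concrete, here is a minimal sketch in plain C. It is only a simulation: `dma_get`, `dma_put` and `dma_wait` are hypothetical stand-ins built on `memcpy`, not the real MFC calls, and `CHUNK`, `Vertex` and `transform_stream` are names made up for the illustration. The point is just the ordering: start fetching the next chunk into one buffer, then process the chunk that has already arrived in the other.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* vertices per transfer; hypothetical, real code would use far more */

typedef struct { float x, y, z, w; } Vertex;

/* Hypothetical stand-in for an asynchronous DMA read into local memory.
   Real hardware would only start the transfer; memcpy finishes at once. */
static void dma_get(Vertex *local, const Vertex *remote, size_t count) {
    memcpy(local, remote, count * sizeof *local);
}

/* Hypothetical stand-in for a DMA write back to main memory. */
static void dma_put(Vertex *remote, const Vertex *local, size_t count) {
    memcpy(remote, local, count * sizeof *local);
}

/* Hypothetical stand-in for waiting on outstanding transfers.
   Nothing to wait for in this simulation. */
static void dma_wait(void) {}

/* The "work": scale x, y, z and leave w alone. */
static void process_chunk(Vertex *v, size_t count, float scale) {
    for (size_t i = 0; i < count; ++i) {
        v[i].x *= scale;
        v[i].y *= scale;
        v[i].z *= scale;
    }
}

void transform_stream(const Vertex *src, Vertex *dst, size_t n, float scale) {
    Vertex buf[2][CHUNK];   /* the two local buffers */
    size_t cur = 0;         /* index of the buffer being processed */

    assert(src != NULL && dst != NULL);
    if (n == 0) return;

    /* Prime the pipeline: fetch the first chunk. */
    dma_get(buf[cur], src, n < CHUNK ? n : CHUNK);

    for (size_t off = 0; off < n; off += CHUNK) {
        size_t count = (n - off < CHUNK) ? n - off : CHUNK;
        size_t next_off = off + CHUNK;

        dma_wait();  /* buf[cur] must have arrived before we touch it */

        /* Kick off the fetch of the next chunk into the other buffer.
           On real hardware the earlier write-back from that buffer must
           also have completed before it is overwritten. */
        if (next_off < n) {
            size_t next_count = (n - next_off < CHUNK) ? n - next_off : CHUNK;
            dma_get(buf[cur ^ 1], src + next_off, next_count);
        }

        process_chunk(buf[cur], count, scale);  /* overlaps with the fetch */
        dma_put(dst + off, buf[cur], count);

        cur ^= 1;  /* swap buffers */
    }
}
```

Calling `transform_stream(in, out, 10, 2.0f)` walks ten vertices through in chunks of four, four and two. With real asynchronous DMA the fetch of chunk k+1 would run while chunk k is being processed, which is where the latency gets hidden, provided the chunks are large and predictable enough.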
 
version said:
spe's SUMB instruction a monster 24 ops/cycle

I see why you're saying that, since the instruction requires 24 summation operations, but is that actually what it means? I'm not sure how to read the notation in the ISA document. The instruction operands table for sumb shows 8 sums of 3 additions each. Are we supposed to assume it does all of that in one clock cycle, or that each row in that table takes a clock cycle, so the entire sumb instruction takes 8 clock cycles? I haven't read any of the documentation, so forgive me if I'm asking something completely ridiculous that is cleared up somewhere earlier in the docs.
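For what the table seems to describe, here is a scalar C model of sumb. It is a sketch based on my reading of the ISA document, so check the byte placement against it: I am assuming the four bytes in each word of RB are summed into the upper halfword of the matching word of RT, the four bytes of RA into the lower halfword, and that bytes are treated as unsigned. The function name `sumb_model` is made up. The model only shows the arithmetic (8 sums, 3 additions each, hence 24) and says nothing about how many cycles the hardware takes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Registers are modelled as 16 bytes, byte 0 first (big-endian view).
   rt receives 8 halfwords: rt[2*w] is the upper halfword of word w,
   rt[2*w + 1] the lower one. Placement is an assumption, see above. */
void sumb_model(const uint8_t ra[16], const uint8_t rb[16], uint16_t rt[8]) {
    assert(ra != NULL && rb != NULL && rt != NULL);

    for (int w = 0; w < 4; ++w) {
        const uint8_t *a = ra + 4 * w;
        const uint8_t *b = rb + 4 * w;

        /* Three additions per sum; the maximum is 4 * 255 = 1020,
           which fits comfortably in 16 bits. */
        rt[2 * w]     = (uint16_t)(b[0] + b[1] + b[2] + b[3]);
        rt[2 * w + 1] = (uint16_t)(a[0] + a[1] + a[2] + a[3]);
    }
}
```

The loop here runs four times in software, but the rows of the table are independent of each other, so nothing in the definition forces a hardware implementation to do them one after another.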
 
Description of microarchitecture seems completely missing, let alone Latency/throughput information ... very incomplete.
 
MfA said:
Description of microarchitecture seems completely missing, let alone Latency/throughput information ... very incomplete.
You're right. However, IBM does explicitly mention in the documents that the specifications are for the overall "Cell Broadband Engine" architecture, and are not implementation specific. This looks to be consistent with their documentation structure for the PowerPC architecture, where they had 3 volumes defining the overall architecture, and then implementation-specific "user's manuals" for individual parts like the PPC 750 and 970. So there should be a "PS3 Cell" user's manual somewhere down the line. Something else to look forward to, I guess.
 
Throughput is 1 cycle for everything except double precision.
Latencies: float ops are 6 cycles, conversions to/from float 7, complex integer ops (madds etc.) 7, simple integer and logic ops 2, shift and shuffle 4, load/store 6, branch mispredict 18. Doubles are 13 cycles, non-pipelined for the first 6.
Floats and integers go in pipe 0; load/store, branches and shuffles go in pipe 1.
All in all, the pipes seem quite well balanced. Select bits in 2 cycles is quite nice. Every branch that can be converted to a conditional move should be.
A branch hint must be issued quite a lot of cycles in advance (13, I believe), even for unconditional jumps. There is a possibility to stall until the hint arrives, rather than filling with tons of nops.
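As an illustration of turning a branch into a select, here is a scalar C sketch. `select_bits` models a bitwise select (mask bit 0 takes the bit from `a`, mask bit 1 takes it from `b`), and `cmp_gt_mask` models a compare that produces an all-ones or all-zeros mask. Both names, and `clamp_branchless`, are made up for the example; real SPE code would use the vector instructions or compiler intrinsics instead.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise select: where mask has a 0 bit take a, where it has a 1 bit take b. */
static uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask) {
    return (a & ~mask) | (b & mask);
}

/* Compare to mask: all ones if x > y, otherwise all zeros.
   (x > y) is 0 or 1; negating it as unsigned gives 0 or 0xFFFFFFFF. */
static uint32_t cmp_gt_mask(int32_t x, int32_t y) {
    return -(uint32_t)(x > y);
}

/* Clamp x into [lo, hi] with two compare+select pairs and no branches
   in the data path. The casts back to int32_t assume the usual
   two's-complement behaviour of common compilers. */
int32_t clamp_branchless(int32_t x, int32_t lo, int32_t hi) {
    assert(lo <= hi);  /* precondition of the sketch */

    uint32_t below = cmp_gt_mask(lo, x);              /* x < lo ? */
    uint32_t r = select_bits((uint32_t)x, (uint32_t)lo, below);

    uint32_t above = cmp_gt_mask((int32_t)r, hi);     /* r > hi ? */
    r = select_bits(r, (uint32_t)hi, above);

    return (int32_t)r;
}
```

The branchy version would be `if (x < lo) x = lo; else if (x > hi) x = hi;`. With an 18-cycle mispredict penalty and a 2-cycle select, always computing both outcomes and selecting is the cheaper bet whenever the condition is hard to predict.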
 
Interesting how the calibre of posting keeps away the troublemakers. That was what was intended for here. The brainiacs discuss and the lookers-in think 'blimey, they're all smart, I feel intimidated and will post in GAF instead' and then go off and post 'Revolution IS da SUXXORZ!!11!'

Not that this is at all on topic for the thread. :oops:
 
Panajev2001a said:
Let's hope it stays. I do not see the need for regular GAF/B3D discussion being brought to a forum where STI engineers are taking time to read the posts and answer questions.

Interesting.... Is H_Peter_Hofstee in that IBM forum really who he is supposed to be?

Peter Hofstee = Dutch = one of the founders of "Cell" ...
 