Here are my other notes from the SPU talk (the Cell talk is tomorrow). Sorry they're a little disorganized; they were scribbled on an envelope.
One of the goals of the SPU was obviously simplicity. The local store is not a cache, so there are no misses, no tags, and no backing store. Likewise there are no complex instructions (though I guess the definition of complex is relative): there is no divide, and multiply-add and permute seemed to be the most complex instructions offered. The philosophy was that every time something complex came up, they asked themselves whether it was better to add it, or to keep the SPU simple and pack more SPUs on a chip.
The DMA was presented as a big deal. It supports scatter/gather, etc. DMA can be overlapped with computation by using S/W multithreading on a single SPU (run one compute thread while another is waiting for a DMA, and so on). DMA accesses are up to 16 KB each.
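The classic way to get that DMA/compute overlap is double buffering: fetch the next chunk while computing on the current one. Here's a minimal sketch; `dma_get` and `dma_wait` are hypothetical stand-ins for the real DMA start/wait calls (simulated here with a plain `memcpy`), and the chunk size is an arbitrary choice under the 16 KB transfer limit mentioned above.

```c
#include <string.h>

#define CHUNK 4096  /* bytes per transfer; the SPU's DMA limit was 16 KB */

/* Hypothetical stand-ins for the real DMA-start/DMA-wait calls.
 * Here the "DMA" is just a synchronous memcpy for illustration. */
static void dma_get(char *dst, const char *src, int n) { memcpy(dst, src, n); }
static void dma_wait(void) { /* would block until the transfer completes */ }

static int process(const char *buf, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += (unsigned char)buf[i];
    return sum;
}

/* Double buffering: start fetching chunk i+1 while computing on chunk i. */
int sum_stream(const char *src, int total) {
    static char buf[2][CHUNK];
    int nchunks = total / CHUNK, sum = 0;

    dma_get(buf[0], src, CHUNK);            /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        if (i + 1 < nchunks)                /* kick off the next transfer */
            dma_get(buf[(i + 1) & 1], src + (i + 1) * CHUNK, CHUNK);
        dma_wait();                         /* make sure chunk i has landed */
        sum += process(buf[i & 1], CHUNK);  /* compute overlaps the DMA */
    }
    return sum;
}
```

With a real asynchronous DMA engine, the `process` call on buffer i runs concurrently with the transfer into the other buffer, which is the overlap the talk was describing.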
Some definition clarifications: SPE referred to the combination of the SPU and its DMA unit.
Most SPU instructions take three 128-bit input operands.
A single 128-entry, 128-bit-wide register file is shared for fixed- and floating-point values.
The GFLOPS rating followed from simple math on the 4-way SIMD multiply-add: 2 ops (multiply + add) x 4-way SIMD x 4 GHz = 32 GFLOPS per SPU. 8 SPUs per BE (yes, the Cell was explicitly referred to as the Broadband Engine) gives 256 GFLOPS total.
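That peak-rate arithmetic can be captured in a tiny helper (my own, not from the slides); the only convention to remember is that a fused multiply-add counts as 2 floating-point ops per SIMD lane.

```c
/* Peak throughput in GFLOPS: clock (GHz) x SIMD lanes x ops issued per
 * lane per cycle. A fused multiply-add counts as 2 ops per lane. */
static double peak_gflops(double clock_ghz, int simd_lanes, int ops_per_lane) {
    return clock_ghz * simd_lanes * ops_per_lane;
}
/* peak_gflops(4.0, 4, 2) matches the 32 GFLOPS-per-SPU figure from the
 * talk, and 8 of those gives the 256 GFLOPS total. */
```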
Branch mispredicts cost 18 cycles, so branches have to be carefully managed in S/W. A mux instruction is used to avoid branches: compute both sides of an if-then-else and select the result, instead of branching around one side.
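The mux/select trick works the same way in portable C. A sketch of the idea (this is the general bitmask-select technique, not the SPU's actual `selb` encoding): build an all-ones or all-zeros mask from the condition, then blend the two precomputed results with AND/OR, so there is no branch to mispredict.

```c
#include <stdint.h>

/* Branchless select in the spirit of the SPU's mux/select instruction:
 * mask is all-ones when cond is true, all-zeros otherwise. Both inputs
 * a and b have already been computed; we just pick one. */
static int32_t select_mux(int32_t cond, int32_t a, int32_t b) {
    int32_t mask = -(cond != 0);        /* 0xFFFFFFFF or 0x00000000 */
    return (a & mask) | (b & ~mask);
}
```

So instead of `if (x < 0) r = -x; else r = x;` you would write `r = select_mux(x < 0, -x, x);` and pay for both sides but never for a mispredict.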
The load/store unit has a 6-cycle latency for accesses to the local store.
It was presented as a middle ground between a CPU and a GPU.
Interestingly enough, all the power numbers being quoted were for the example of a single-precision transformation + lighting benchmark. They claimed achieving 1.4 IPC for this. The loop was unrolled 4 times to hide the 6-cycle load latency.
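Unrolling by 4 hides load latency because it gives the in-order pipeline four independent operations in flight at once. A scalar sketch of the pattern (hypothetical dot-product loop, not the actual T&L benchmark code) with four separate accumulators so no iteration waits on the previous one's result:

```c
/* Unrolled-by-4 accumulation: four independent accumulators keep four
 * loads and multiply-adds in flight, so a 6-cycle load latency can be
 * hidden on an in-order machine. Assumes n is a multiple of 4. */
float dot4(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];      /* these four chains are independent, */
        s1 += a[i + 1] * b[i + 1];  /* so the hardware can overlap their  */
        s2 += a[i + 2] * b[i + 2];  /* loads instead of stalling          */
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* combine the partial sums at the end */
}
```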
The SPU is dual issue, but it is completely in-order; there is no register renaming or reordering of anything.
Circuits are about 20% dynamic logic, 80% static logic.
Another interesting factoid: the interconnect between SPUs is set up as a ring, so adjacent SPUs can pass data between their 256 KB local stores. In this way the SPUs can be set up as a simple pipeline.
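Conceptually that pipeline looks like a chain of stages, each one transforming a block and handing it to its ring neighbor. A toy single-threaded sketch (the stage functions here are made up for illustration; on the real hardware each hop would be a local-store-to-local-store DMA over the ring):

```c
/* A toy software pipeline: each "SPU" applies its stage to a block and
 * hands it to the next stage in line. */
typedef void (*stage_fn)(int *block, int n);

static void scale2(int *b, int n) { for (int i = 0; i < n; i++) b[i] *= 2; }
static void add1(int *b, int n)   { for (int i = 0; i < n; i++) b[i] += 1; }

void run_pipeline(stage_fn *stages, int nstages, int *block, int n) {
    for (int s = 0; s < nstages; s++)
        stages[s](block, n);   /* on the Cell this hand-off would be a
                                  local-store to local-store DMA */
}
```

The win on the real chip is that all the stages run concurrently on different blocks, with the ring carrying blocks between neighbors instead of everything round-tripping through main memory.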
That's it for now; I'll take more notes in tomorrow's BE presentation.