ISSCC 2005

Update from San Francisco :D

First post is on SPU, next will be on overall CELL.

Presentations haven't happened yet, but here is some stuff from the conference proceedings (which anyone can buy as of this morning):

In the SPU paper they can't seem to make up their minds about the name. It's called an SPU (streaming processor unit) and also an SPE (synergistic processor element), and then in the overall CELL paper the 8 little boxes in the block diagram are labelled SXU. Seriously, I didn't make that second one up. I think the last one actually refers to the interconnect mechanism to the rest of the chip.

The core area of one SPU/SPE (of which there are 8 on the chip) is 2.5 x 5.81 mm² in 90nm.

Each SPU has 256KB local SRAM which is not part of system address space (referred to as "untranslated, unguarded and non-coherent"). There is a DMA unit per SPU to manage background transfers to/from system memory space (with MMU). There can be up to 16 pending DMA requests, each of up to 16KB.
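To make those DMA limits concrete, here is a rough sketch in Python (not SPU code) of how a large transfer might be split under them. It assumes the per-request cap means 16 kilobytes and that the queue depth is 16, as quoted above; the names `split_transfer` and `batches` are made up for illustration and are not part of any real Cell API.

```python
MAX_REQUEST_BYTES = 16 * 1024   # per-request cap quoted above (assumed to mean 16 KB)
MAX_PENDING = 16                # pending requests per SPU quoted above


def split_transfer(total_bytes):
    """Return a list of (offset, size) chunks, each at most MAX_REQUEST_BYTES."""
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_REQUEST_BYTES, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


def batches(chunks):
    """Group chunks into batches that each fit in the pending-request queue."""
    return [chunks[i:i + MAX_PENDING] for i in range(0, len(chunks), MAX_PENDING)]


# Filling the whole 256 KB local store takes exactly one full queue of requests.
chunks = split_transfer(256 * 1024)
print(len(chunks), len(batches(chunks)))
```

Under these assumptions, 16 requests of 16 KB add up to exactly the 256 KB local store, so one full queue could in principle refill all of it.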

Each SPU has 128 128bit registers. The text says there are both seven and eight execution units per SPU (doesn't anyone proofread their papers anymore? :) ). There are fixed and floating point units, permute, some other stuff. Ask if you want details.

All data fetch and branch prediction is managed in software, i.e. you have to explicitly prefetch what you want when you want it, and for branches it mentions that "efficient S/W" manages branches by replacing branches with bitwise select instructions, arranging common case code to be inline, and inserting branch hint instructions.
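As an illustration of the "replace branches with bitwise select" idea, here is a small Python sketch on 32-bit unsigned values. It is only a model of the technique, not SPU intrinsics: `select_bits`, `greater_mask` and `branchless_max` are hypothetical names, and on real hardware the mask would come from a compare instruction rather than from Python's conditional expression.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit unsigned words


def select_bits(a, b, mask):
    """Take bits from b where mask is 1, and from a where mask is 0."""
    return ((a & ~mask) | (b & mask)) & MASK32


def greater_mask(x, y):
    """All-ones if x > y, else all-zeros (stand-in for a hardware compare result)."""
    return MASK32 if x > y else 0


def branchless_max(x, y):
    # Same result as "if x > y: return x, else: return y", but the choice is
    # made by masking both candidates instead of jumping to different code.
    return select_bits(y, x, greater_mask(x, y))


print(branchless_max(3, 7), branchless_max(9, 2))
```

The point is that both candidate results are computed and the compare result only picks between them, so there is no branch for the hardware to mispredict.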

They claim the SPU/SPE is programmable in C/C++ with intrinsics.

Clock rate ranges from 2-5 GHz over a voltage range of 0.9-1.3V, with power ranging from 1-11W.
 
At around 10 watts per SPU (at 4GHz+), I'd doubt we'll see 32 of them on one IC this or next year...
 
Ok, now for the CELL itself. All of the following info is from now publicly available conference proceedings:

- 8 SPU's/SPE's
- 1 64-bit PPU, dual threaded Power microprocessor (also referred to as PPE for power processor element and also a PXU because they can't seem to make up their minds what they want to call anything)
- Dual XDR channels for memory
- whole chip is 234M transistors, 200+ mm2 in 90nm

PPU/PPE/etc. has L1 and L2 of unknown size. L2 looks really big though (I'm guessing it's 512KB, but it might be 1MB).

I'll try to describe the block diagram in words. Use your imagination :)

There is an interconnect block called the Element Interconnect Bus (EIB). On one side of this are the 8 SPU's hanging off it, each through its own load store/DMA unit. On the other side are the PPU (with its L1 and L2), the dual XDR connection, and two non-coherent I/O interfaces (not sure what they do exactly).

The 8 SPU's can have a total of 128 DMA transfers outstanding.

256Gflops total (no, not 1Tflops). Peak number of course (i.e. take clock rate and multiply by # of floating point units).
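For anyone who wants to see how a peak number like that falls out, here is a back-of-the-envelope sketch in Python. The inputs are assumptions, not figures from the proceedings: single-precision floats (so a 128-bit register holds 4 of them), a multiply-add counted as 2 flops per lane per cycle, a 4 GHz clock, and only the 8 SPUs counted. It is just one combination that happens to land on 256.

```python
REGISTER_BITS = 128   # SPU register width from the paper
FLOAT_BITS = 32       # assumption: single-precision floats


def peak_gflops(num_spus, lanes, flops_per_lane_per_cycle, clock_ghz):
    """Peak rate = units x SIMD lanes x flops per lane per cycle x clock (GHz)."""
    return num_spus * lanes * flops_per_lane_per_cycle * clock_ghz


lanes = REGISTER_BITS // FLOAT_BITS   # 4 floats per 128-bit register

# Assumed: 8 SPUs, multiply-add counted as 2 flops, 4 GHz clock.
print(peak_gflops(8, lanes, 2, 4.0))
```

Change any one of the assumed inputs (clock, flops per cycle) and the peak scales linearly with it, which is why it is only a peak figure.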
 
SiBoy said:
Each SPU has 256KB local SRAM which is not part of system address space (referred to as "untranslated, unguarded and non-coherent"). There is a DMA unit per SPU to manage background transfers to/from system memory space (with MMU). There can be up to 16 pending DMA requests, each of up to 16KB.
So it uses an IOMMU to translate memory mappings for the SPU's DMA transfers. Is there any info on how many of these translations the IOMMU can serve per cycle (i.e. how many DMA transfers can be started per cycle)?

SiBoy said:
Each SPU has 128 128bit registers. The text says there are both seven and eight execution units per SPU (doesn't anyone proofread their papers anymore? :) ). There are fixed and floating point units, permute, some other stuff. Ask if you want details.
The confusion about the number of exec units is probably similar to the confusion with the Altivec units in the PPC 970: the CPU diagrams show all kinds of logical units, when in fact there are only two, the permute and a big fat SIMD ALU.

Does it say if the SPUs are single or dual issue ?

Cheers
Gubbi
 
Even if you make it 8 Watts per SPU, you'd certainly need exotic cooling, presumably even more so at .65
 
PiNkY said:
Even if you make it 8 Watts per SPU, you'd certainly need exotic cooling, presumably even more so at .65

David Wang speculated 4W/SPU @ 4GHz. Add the PPE, the caches, the SPU switch fabric and the bus interface, -> total power is likely to be in the 50-70W range for the entire thing.

Cheers
Gubbi
 
Gubbi said:
So it uses an IOMMU to translate memory mappings for the SPU's DMA transfers. Is there any info on how many of these translations the IOMMU can serve per cycle (i.e. how many DMA transfers can be started per cycle)?

Only one started per cycle, 16 can be outstanding at any one time.

Gubbi said:
The confusion about the number of exec units is probably similar to the confusion with the Altivec units in the PPC 970: the CPU diagrams show all kinds of logical units, when in fact there are only two, the permute and a big fat SIMD ALU.

Does it say if the SPUs are single or dual issue ?

Dual issue, there is an even and odd execution pipe. Exec units are split over the two (i.e. LS unit is in odd pipe, etc.).
 
SiBoy said:
Dual issue, there is an even and odd execution pipe. Exec units are split over the two (i.e. LS unit is in odd pipe, etc.).
That's very PS2 VUesque.. ;)
 
David Wang speculated 4W/SPU @ 4GHz. Add the PPE, the caches, the SPU switch fabric and the bus interface, -> total power is likely to be in the 50-70W range for the entire thing.

I was referring to McFly's and my own post above. Though, while the increase is definitely superlinear, 4 Watts at 4+ GHz seems a bit low when they quote 11 Watts at 5.2 GHz.
 
Any info on whether the Power core is clocked at the same rate as the APUs? There's been a lot of speculation about the clock that the core is running at...

Thanks again Siboy :)
 
PiNkY said:
David Wang speculated 4W/SPU @ 4GHz. Add the PPE, the caches, the SPU switch fabric and the bus interface, -> total power is likely to be in the 50-70W range for the entire thing.

I was referring to McFly's and my own post above. Though, while the increase is definitely superlinear, 4 Watts at 4+ GHz seems a bit low when they quote 11 Watts at 5.2 GHz.

If 5.2GHz is the absolute maximum they can get out of it, then the 4W to 11W difference seems normal.

Fredi
 
Dual issue, there is an even and odd execution pipe. Exec units are split over the two (i.e. LS unit is in odd pipe, etc.).
PS2 legacy, definitely :)
And good to hear it too - having separate pipe for housekeeping is a godsend for tight loops.
 
hey69 said:
The moment I have it, somebody needs to give me an address where I can upload :devilish:

I did a little page where I have collected a few cool (read: free plus practically no restrictions) file upload sites.

Free Upload/Hosting Sites

I will update that page whenever I find some new good sites or when the existing sites change their stuff.
 
SiBoy said:
Gubbi said:
So it uses an IOMMU to translate memory mappings for the SPU's DMA transfers. Is there any info on how many of these translations the IOMMU can serve per cycle (i.e. how many DMA transfers can be started per cycle)?

Only one started per cycle, 16 can be outstanding at any one time.

Sorry, I meant for the entire chip (all SPUs). Can it start eight per cycle ? or just one ?

If it's eight there's a fair bit of logic going into the DMA/switch engine.

Cheers
Gubbi
 