Here are my other notes from the SPU talk (the Cell talk is tomorrow). Sorry they're a little disorganized; they were scribbled on an envelope.
One of the goals of the SPU was obviously simplicity. The local store is not a cache, so there are no misses, no tags, and no backing store. Likewise there are no complex instructions (though I guess the definition of complex is relative): there is no divide, and multiply-add and permute seemed to be the most complex instructions offered. The philosophy was that every time something complex came up, they asked themselves whether it was better to add it, or to keep the SPU simple and pack more SPUs on a chip.
The DMA was presented as a big deal. It supports scatter/gather, etc. DMA can be overlapped with computation by using S/W multithreading on a single SPU (run one compute thread while another is waiting for a DMA, and so on). DMA accesses are up to 16 KB each.
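The classic way to get that DMA/compute overlap is double buffering: fetch the next chunk while computing on the current one. Here's a minimal sketch; `dma_get` and `dma_wait` are hypothetical stand-ins for the real DMA start/wait calls (simulated here with a plain `memcpy`), and the chunk size is an arbitrary choice under the 16 KB transfer limit mentioned above.

```c
#include <string.h>

#define CHUNK 4096  /* bytes per transfer; the SPU's DMA limit was 16 KB */

/* Hypothetical stand-ins for the real DMA-start/DMA-wait calls.
 * Here the "DMA" is just a synchronous memcpy for illustration. */
static void dma_get(char *dst, const char *src, int n) { memcpy(dst, src, n); }
static void dma_wait(void) { /* would block until the transfer completes */ }

static int process(const char *buf, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) sum += (unsigned char)buf[i];
    return sum;
}

/* Double buffering: start fetching chunk i+1 while computing on chunk i. */
int sum_stream(const char *src, int total) {
    static char buf[2][CHUNK];
    int nchunks = total / CHUNK, sum = 0;

    dma_get(buf[0], src, CHUNK);            /* prime the pipeline */
    for (int i = 0; i < nchunks; i++) {
        if (i + 1 < nchunks)                /* kick off the next transfer */
            dma_get(buf[(i + 1) & 1], src + (i + 1) * CHUNK, CHUNK);
        dma_wait();                         /* make sure chunk i has landed */
        sum += process(buf[i & 1], CHUNK);  /* compute overlaps the DMA */
    }
    return sum;
}
```

With a real asynchronous DMA engine, the `process` call on buffer i runs concurrently with the transfer into the other buffer, which is the overlap the talk was describing.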
Some definition clarifications: SPE referred to the combination of the SPU and its DMA unit.
Most SPU instructions take three 128-bit input operands.
A single 128-entry, 128-bit-wide register file is shared for fixed- and floating-point values.
The GFLOPS rating followed from simple math on the 4-way SIMD multiply-add: 2 ops (multiply + add) x 4-way SIMD x 4 GHz = 32 GFLOPS per SPU. 8 SPUs per BE (yes, the Cell was explicitly referred to as the Broadband Engine) gives 256 GFLOPS total.
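That peak-rate arithmetic can be captured in a tiny helper (my own, not from the slides); the only convention to remember is that a fused multiply-add counts as 2 floating-point ops per SIMD lane.

```c
/* Peak throughput in GFLOPS: clock (GHz) x SIMD lanes x ops issued per
 * lane per cycle. A fused multiply-add counts as 2 ops per lane. */
static double peak_gflops(double clock_ghz, int simd_lanes, int ops_per_lane) {
    return clock_ghz * simd_lanes * ops_per_lane;
}
/* peak_gflops(4.0, 4, 2) matches the 32 GFLOPS-per-SPU figure from the
 * talk, and 8 of those gives the 256 GFLOPS total. */
```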
Branch mispredicts cost 18 cycles, so branches have to be carefully managed in S/W. A mux instruction is used to avoid branches: compute both sides of an if-then-else and select the result, instead of branching around one side.
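The mux/select trick works the same way in portable C. A sketch of the idea (this is the general bitmask-select technique, not the SPU's actual `selb` encoding): build an all-ones or all-zeros mask from the condition, then blend the two precomputed results with AND/OR, so there is no branch to mispredict.

```c
#include <stdint.h>

/* Branchless select in the spirit of the SPU's mux/select instruction:
 * mask is all-ones when cond is true, all-zeros otherwise. Both inputs
 * a and b have already been computed; we just pick one. */
static int32_t select_mux(int32_t cond, int32_t a, int32_t b) {
    int32_t mask = -(cond != 0);        /* 0xFFFFFFFF or 0x00000000 */
    return (a & mask) | (b & ~mask);
}
```

So instead of `if (x < 0) r = -x; else r = x;` you would write `r = select_mux(x < 0, -x, x);` and pay for both sides but never for a mispredict.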
The load/store unit has a 6-cycle latency for accesses to the local store.
It was presented as a middle ground between a CPU and a GPU.
Interestingly enough, all the power numbers being quoted were for the example of a single-precision transformation + lighting benchmark. They claimed achieving 1.4 IPC for this. The loop was unrolled 4 times to hide the 6-cycle load latency.
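Unrolling by 4 hides load latency because it gives the in-order pipeline four independent operations in flight at once. A scalar sketch of the pattern (hypothetical dot-product loop, not the actual T&L benchmark code) with four separate accumulators so no iteration waits on the previous one's result:

```c
/* Unrolled-by-4 accumulation: four independent accumulators keep four
 * loads and multiply-adds in flight, so a 6-cycle load latency can be
 * hidden on an in-order machine. Assumes n is a multiple of 4. */
float dot4(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];      /* these four chains are independent, */
        s1 += a[i + 1] * b[i + 1];  /* so the hardware can overlap their  */
        s2 += a[i + 2] * b[i + 2];  /* loads instead of stalling          */
        s3 += a[i + 3] * b[i + 3];
    }
    return (s0 + s1) + (s2 + s3);   /* combine the partial sums at the end */
}
```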
The SPU is dual issue, but it is completely in-order; there is no register renaming or reordering of anything.
Circuits are about 20% dynamic logic, 80% static logic.
Another interesting factoid: the interconnect between SPUs is set up as a ring, so adjacent SPUs can pass data between their 256 KB local stores. In this way the SPUs can be set up as a simple pipeline.
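Conceptually that pipeline looks like a chain of stages, each one transforming a block and handing it to its ring neighbor. A toy single-threaded sketch (the stage functions here are made up for illustration; on the real hardware each hop would be a local-store-to-local-store DMA over the ring):

```c
/* A toy software pipeline: each "SPU" applies its stage to a block and
 * hands it to the next stage in line. */
typedef void (*stage_fn)(int *block, int n);

static void scale2(int *b, int n) { for (int i = 0; i < n; i++) b[i] *= 2; }
static void add1(int *b, int n)   { for (int i = 0; i < n; i++) b[i] += 1; }

void run_pipeline(stage_fn *stages, int nstages, int *block, int n) {
    for (int s = 0; s < nstages; s++)
        stages[s](block, n);   /* on the Cell this hand-off would be a
                                  local-store to local-store DMA */
}
```

The win on the real chip is that all the stages run concurrently on different blocks, with the ring carrying blocks between neighbors instead of everything round-tripping through main memory.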
That's it for now; I'll take more notes in tomorrow's BE presentation.