ISSCC 2005

AutomatedMech said:
figure2.gif

figure3.gif


IBM's voltage/clock scaling chart for the PowerPC 970FX. The PowerPC 970FX is made in the same fab as CELL, and CELL uses the 970 core as its CPU. This chart is IBM's admission that the PowerPC 970FX burns 100 watts at 2.5 GHz.

I hit the bull's eye on CELL's clock speed and FLOPS rating: 64 GFLOPS @ 1 GHz. To be honest, even I was shocked that I got it right years in advance.

Right now, CELL is designed for 1 GHz operation and has been tested successfully up to 1.4 GHz, according to SCEI's released material. The final clock speed depends on how much of a loss per unit Kutaragi Ken is willing to take on the hardware.

2 GHz = 40 watts, and the synergistic CPUs run at 4 GHz :)
 
AutomatedMech said:
and CELL uses 970 core as its CPU.

No, it doesn't. The Cell PPC core is dual-issue. The 970 is considerably wider, as well as MUCH heavier in logic; even with the L1 caches stripped, it would take up a much bigger die area than is allotted for the PPC core. Read Hannibal's article on Cell if you're confused on this issue...

The rest of your post is just pure nonsense (and lies).
 
AutomatedMech said:
I hit the bull's eye on CELL's clock speed and FLOPS rating: 64 GFLOPS @ 1 GHz. To be honest, even I was shocked that I got it right years in advance.

LOL. Who is this kook? Obviously DeadMeat needs to be banned again.
 
AutomatedMech said:
A 4 GHz CPU does not exist; Intel's CEO even bowed down before an audience to apologize for its cancellation.
Your trolling is not welcome. Stay out of this thread, create your own topics and get those locked.
 
Troll? This is analysis of new CELL information. If predictions are coming true then I might mention it.

Am I the only one concerned about inflated performance? As noted, SCEI is claiming they can fit 16 floats in a 16-byte-wide register and compute them in parallel for 256 GFLOPS (with maybe some vertex compression). The KutaragiFLOP™ definition is interesting but should not be used, for the sake of fair comparison. Not as bad as NVIDIA FLOPS, though.
 
The overall CELL paper this morning was a little disappointing from a disclosure standpoint. They stuck pretty much to the written paper which I already posted some info on.

The package is a 42.5x42.5mm flip chip BGA. There is going to be a paper at ECTC later this year which is supposed to go into more of the packaging side.

90nm SOI process with 8 layers of metal (copper interconnect).

They mentioned 20% of the power was due to leakage and another 20% due to clock tree power, but wouldn't give the absolute numbers (which tells you something...).

The device taped out in January 2004, about 10 months after the high-level architecture was completed.

The EIB (bus interconnect) contains four 128-bit rings with a 64-bit tag. No clue on the actual configuration. The EIB runs at HALF the PPE/SPE clock rate (so 2 GHz-ish). The EIB can move 96 bytes total per cycle.

Everything connected to the EIB (I listed this earlier) can individually move 16 bytes per cycle into/out of the EIB, except for the FlexIO interface, which can move twice this much. When they say "per cycle" I'm not sure if they're referring to the EIB half-rate cycle or the PPE/SPE full-rate cycle.
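
To put rough numbers on the per-cycle figures above, here is a small sketch. It assumes a 4 GHz PPE/SPE clock (implied by the "2GHz-ish" half-rate EIB, not an official figure) and that the 96 bytes per cycle total is counted in EIB cycles. Since it is unclear which clock "per cycle" refers to for the individual ports, both readings are computed. All names are made up for illustration.

```python
# Rough bandwidth arithmetic from the per-cycle figures in the post.
# ASSUMPTIONS (not confirmed by the paper): 4 GHz core clock, and the
# 96 bytes/cycle EIB total is counted in half-rate EIB cycles.
CORE_CLOCK_HZ = 4.0e9
EIB_CLOCK_HZ = CORE_CLOCK_HZ / 2  # EIB runs at half the PPE/SPE clock

EIB_BYTES_PER_CYCLE = 96
PORT_BYTES_PER_CYCLE = 16
FLEXIO_BYTES_PER_CYCLE = 2 * PORT_BYTES_PER_CYCLE  # "twice this much"


def gb_per_s(bytes_per_cycle, clock_hz):
    """Convert bytes per cycle at a given clock into GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_cycle * clock_hz / 1e9


eib_total = gb_per_s(EIB_BYTES_PER_CYCLE, EIB_CLOCK_HZ)

# The two possible readings of "per cycle" for a single port.
port_if_eib_cycle = gb_per_s(PORT_BYTES_PER_CYCLE, EIB_CLOCK_HZ)
port_if_core_cycle = gb_per_s(PORT_BYTES_PER_CYCLE, CORE_CLOCK_HZ)
flexio_if_eib_cycle = gb_per_s(FLEXIO_BYTES_PER_CYCLE, EIB_CLOCK_HZ)
flexio_if_core_cycle = gb_per_s(FLEXIO_BYTES_PER_CYCLE, CORE_CLOCK_HZ)

print(f"EIB total:            {eib_total:.0f} GB/s")
print(f"Port (EIB cycle):     {port_if_eib_cycle:.0f} GB/s")
print(f"Port (core cycle):    {port_if_core_cycle:.0f} GB/s")
print(f"FlexIO (EIB cycle):   {flexio_if_eib_cycle:.0f} GB/s")
print(f"FlexIO (core cycle):  {flexio_if_core_cycle:.0f} GB/s")
```

The factor of two between the two readings is the whole ambiguity: the per-port numbers double if "per cycle" means the full-rate clock.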

The PPE and SPE, local 256KB SRAMs, etc. are all on one clock network (same frequency, sorry AutomatedMech). The EIB is another clock network, and the external memory interface a third.

Incidentally, the SPU/SPE paper referred to 3 other papers submitted to the IEEE Symposium on VLSI Circuits for June '05 (describing physical design details, the fixed point unit of the SPE and the floating point unit of the SPE).

One last detail from the SPE/SPU talk I forgot to list:

The even pipeline contains the simple fixed point and SP float instructions, shifts/rotates, integer multiply-acc, byte operations (pop count, absolute differences, byte average, byte sum).

The odd pipeline contains the permute, load/store, channel read/write (built-in blocking message passing interface supported by 3 instructions: channel read, channel write and read channel capacity) and branch instructions.
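
The even/odd split above decides which instruction pairs could go out together. A minimal sketch, assuming the SPE can start one even-pipe and one odd-pipe instruction in the same cycle (the usual meaning of a two-pipe design, but not spelled out in these notes) and ignoring any ordering or alignment rules. The class names are illustrative labels, not ISA mnemonics.

```python
# Which pipe each instruction class goes to, per the talk notes.
# The keys are hypothetical labels chosen for this sketch.
EVEN = "even"
ODD = "odd"

PIPE = {
    "fixed_point": EVEN,
    "sp_float": EVEN,
    "shift_rotate": EVEN,
    "int_mul_acc": EVEN,
    "byte_op": EVEN,
    "permute": ODD,
    "load_store": ODD,
    "channel": ODD,
    "branch": ODD,
}


def can_pair(a, b):
    """True if classes a and b use different pipes.

    ASSUMPTION: one instruction per pipe per cycle, no other restrictions.
    """
    return PIPE[a] != PIPE[b]


print(can_pair("sp_float", "load_store"))   # float math alongside a load
print(can_pair("fixed_point", "sp_float"))  # both compete for the even pipe
```

Under these assumptions, arithmetic pairs with loads, permutes and branches, but integer and floating-point arithmetic cannot issue together because both sit in the even pipe.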
 
'The even pipeline contains the simple fixed point and SP float instructions, shifts/rotates, integer multiply-acc, byte operations (pop count, absolute differences, byte average, byte sum).'

Hmm, I'm disappointed, man. It can't run integer and floating point in parallel.
 
Didn't it ever occur to you?

AutomatedMech said:
Troll? This is analysis of new CELL information. If predictions are coming true then I might mention it.

Am I the only one concerned about inflated performance? As noted, SCEI is claiming they can fit 16 floats in a 16-byte-wide register and compute them in parallel for 256 GFLOPS (with maybe some vertex compression). The KutaragiFLOP™ definition is interesting but should not be used, for the sake of fair comparison. Not as bad as NVIDIA FLOPS, though.

Didn't it ever occur to you that the 16-wide SIMD (across a 128-bit register) could be referring to integer SIMD?

That's my guess (since the PS2 had the same support).

mtm
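
The lane arithmetic behind that guess can be sketched quickly. The register width (128 bits) comes from the thread; the peak-FLOPS part assumes 8 SPEs and counts a multiply-accumulate as 2 floating point operations, neither of which is stated in this thread, so treat it as a rough check rather than an official figure.

```python
REGISTER_BITS = 128  # 16-byte register, as discussed in the thread


def lanes(element_bits):
    """Number of SIMD lanes of the given element width in one register."""
    return REGISTER_BITS // element_bits


def peak_gflops(spes, clock_ghz, flops_per_lane=2):
    """Peak single-precision GFLOPS.

    ASSUMPTIONS: every SPE issues one 4-lane multiply-accumulate per cycle,
    and a multiply-accumulate counts as 2 FLOPs per lane.
    """
    return spes * lanes(32) * flops_per_lane * clock_ghz


for bits in (8, 16, 32):
    print(f"{bits}-bit elements: {lanes(bits)} lanes")

# ASSUMPTION: 8 SPEs; the clock values are the ones mentioned in the thread.
print(peak_gflops(spes=8, clock_ghz=1.0), "GFLOPS at 1 GHz")
print(peak_gflops(spes=8, clock_ghz=4.0), "GFLOPS at 4 GHz")
```

Only 8-bit elements give 16 lanes in a 128-bit register, while 32-bit floats give 4, and under the stated assumptions 4 float lanes are already enough to reach the 64 GFLOPS @ 1 GHz and 256 GFLOPS figures quoted earlier.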
 
There are "simple fixed point operations"?
I could only imagine that for a very limited range you would have a 32-bit mantissa; other than that, why use it? :oops:
 
nAo said:
What about SPE instruction latency/throughput? :?:

Here are pipeline depth and instruction latency for each. I'll use the notation {E/O,A,B} where E/O indicates even or odd pipe, A is the unit pipeline depth, B the instruction latency.

word arithmetic, logicals, count leading zeros, selects, compares {E,2,2}

word shifts and rotates {E,3,4}

SP floating point multiply-accumulate {E,6,6}

integer multiply accumulate {E,7,7}

byte pop count, absolute sum of differences, byte avg, byte sum {E,3,4}

permute {O,3,4}

load/store from 256KB SRAM {O,6,6}

channel read/write {O,5,6}

branches {O,3,4} (mispredict is 18 cycles like I said earlier)
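
To see what these latencies mean for real code, here is a small sketch that puts the list above in a table and does two simple estimates. It assumes a dependent instruction cannot start until its producer's full latency has elapsed, and that each pipe accepts one new instruction per cycle; both are simplifying assumptions, and the operation names are illustrative labels.

```python
# (pipe, unit pipeline depth, instruction latency), copied from the list above.
SPE_OPS = {
    "word_arith_logical": ("E", 2, 2),
    "word_shift_rotate": ("E", 3, 4),
    "sp_float_mul_acc": ("E", 6, 6),
    "int_mul_acc": ("E", 7, 7),
    "byte_ops": ("E", 3, 4),
    "permute": ("O", 3, 4),
    "load_store": ("O", 6, 6),
    "channel": ("O", 5, 6),
    "branch": ("O", 3, 4),
}
BRANCH_MISPREDICT_CYCLES = 18


def dependent_chain_cycles(op, count):
    """Cycles for `count` ops where each one needs the previous result."""
    _, _, latency = SPE_OPS[op]
    return count * latency


def independent_ops_to_fill_pipe(op):
    """Independent ops needed in flight to issue one per cycle without stalls.

    ASSUMPTION: the pipe is fully pipelined at one issue per cycle.
    """
    _, _, latency = SPE_OPS[op]
    return latency


print(dependent_chain_cycles("sp_float_mul_acc", 4))
print(independent_ops_to_fill_pipe("sp_float_mul_acc"))
print(independent_ops_to_fill_pipe("int_mul_acc"))
```

In other words, under these assumptions a loop needs about six independent floating point multiply-accumulates in flight (for example via unrolling) to keep the even pipe busy.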
 
SiBoy said:
nAo said:
What about SPE instruction latency/throughput? :?:

Here are pipeline depth and instruction latency for each. I'll use the notation {E/O,A,B} where E/O indicates even or odd pipe, A is the unit pipeline depth, B the instruction latency.

word arithmetic, logicals, count leading zeros, selects, compares {E,2,2}

word shifts and rotates {E,3,4}

SP floating point multiply-accumulate {E,6,6}

integer multiply accumulate {E,7,7}


byte pop count, absolute sum of differences, byte avg, byte sum {E,3,4}

permute {O,3,4}

load/store from 256KB SRAM {O,6,6}

channel read/write {O,5,6}

branches {O,3,4} (mispredict is 18 cycles like I said earlier)




6 and 7 cycles??? What the hell, that is a lot. The PS2 VU had a latency of 4.
 