ISSCC 2005

version said:
"DP" = double precision?



20.3 A Double-Precision Multiplier with Fine-Grained Clock-Gating Support for a First-Generation CELL Processor
9:30 AM
J. Kuang(1), T. Buchholtz(2), S. Dance(2), J. Warnock(3), S. Storino(2), D. Wendel(4)

1 - IBM, Austin, TX
2 - IBM, Rochester, MN
3 - IBM, Yorktown Heights, NY
4 - IBM, Böblingen, Germany

A double-precision multiplier for a 90nm SOI CELL processor is presented. Dynamic Booth logic is designed for scalability and with noise, leakage, and pulse-width variation tolerance. Static partial-product compression is implemented with replicated bits for performance. The design supports fine-grained clock gating domains for active power reduction.
 
Did the SCEE paper mention SMT for the PU as well as VMX ?

I think I read just that...

Contains 64-bit Power Architecture™ with VMX that is a dual thread SMT design – views system memory as a 10-way coherent threaded machine

The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.
 
Panajev2001a said:
Did the SCEE paper mention SMT for the PU as well as VMX ?

I think I read just that...

Contains 64-bit Power Architecture™ with VMX that is a dual thread SMT design – views system memory as a 10-way coherent threaded machine

The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

I should probably shut up until I've had the opportunity to fully dig into the data becoming available, but the core would seem more sophisticated than most speculation assumed. If nothing else, it makes it more useful for non-PS3 applications where you might choose to scale back on the SPUs. And of course, it implies robust PS3 performance for "general game code" that isn't amenable to vectorization.
 
Panajev2001a said:
The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.
Why do you think it's OOOe? It's easy to build SMT on an in-order system.

And given that it seems very similar to another processor core that is in-order, that would be my guess.
 
DeanoC said:
Panajev2001a said:
The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.
Why do you think it's OOOe? It's easy to build SMT on an in-order system.

And given that it seems very similar to another processor core that is in-order, that would be my guess.

I thought so because it seems easier to build SMT on top of an OOOe core than an in-order core, since it shares quite a bit of logic with the core's dynamic schedule-and-execute engine. In fact, modern SMT implementations (Pentium 4's HT and POWER5) have both been built on top of OOOe cores, with a 5-10% penalty in extra chip area.
 
Given that Cell is a new design, OOOe won't do much (Edit: well, I guess there could be benefits, since the compiler can't know whether a load/store is served from the cache or from memory). At least as long as there won't be versions with a different number of execution units.


That's one thing I'm quite curious about, though: given that Cell will find its way into low-cost electronics, there will be cut-down versions. What exactly are they going to cut away?

Fewer SPUs? Certainly.

Less SPU SRAM? That would require different versions of those "Apulets", which gets especially funky if you consider their claims of Cells being able to run code on other Cells. Same problem if future versions get more SRAM.

Crippling the PE? Cache size should have no effect; modifying the ISA and/or execution units would require a different Cell OS. That could be doable, but developers will go crazy if there's no compatibility.

Even if you removed all the APUs, there's still a full-blown Power core, which is not exactly something I would consider low-cost.
 
DeanoC said:
Panajev2001a said:
The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.
Why do you think it's OOOe? It's easy to build SMT on an in-order system.

And given that it seems very similar to another processor core that is in-order, that would be my guess.

It is not impossible to do SMT without OOOe, so it is still likely that the two cores are really descendants of the same father ;).
 
The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

It's not OOOe, it's in-order... Also, SMT is typically easier to build into an in-order design as well...
 
archie4oz said:
The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

It's not OOOe, it's in-order... Also, SMT is typically easier to build into an in-order design as well...

I accept the fact it might not be OOOe... I guess I was wishing it to be too much.

Sorry to contradict you, but how can SMT be easier to implement on both in-order and out-of-order cores? You just removed the baseline that defines what "easier" means.

Is it easier to implement SMT for an in-order CPU or for an out-of-order CPU ?

Is your answer "neither" ?
 
Guden Oden said:
Athlon had 16-way L1 caches, at least initially, as I recall. That may have changed in later revisions, though. 2-way, however, seems much too drastic a change to be realistic...

It's two-way set-associative but 8-way interleaved. The 8-way interleaving means that the second cache access will be successful in 7/8 of cases (given a perfectly random access pattern), so on average 1.875 cache accesses per cycle.

Cheers
Gubbi
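
For anyone who wants to check the 1.875 figure, here is a minimal sketch of the arithmetic. It assumes two access attempts per cycle, 8 banks, and each access picking a bank uniformly at random and independently (the "perfectly random access pattern" above). The function names are made up for illustration and do not model any real cache.

```python
from fractions import Fraction
import random


def expected_accesses_per_cycle(banks: int) -> Fraction:
    """Expected completed accesses per cycle with two attempts per cycle.

    Assumption: the first access always succeeds, and the second succeeds
    unless it lands in the same bank as the first (probability 1/banks).
    """
    return 1 + Fraction(banks - 1, banks)


def simulate(banks: int, cycles: int, seed: int = 0) -> float:
    """Monte Carlo check of the same model with uniformly random banks."""
    rng = random.Random(seed)
    completed = 0
    for _ in range(cycles):
        first = rng.randrange(banks)
        second = rng.randrange(banks)
        # A bank conflict lets only one of the two accesses through.
        completed += 1 if first == second else 2
    return completed / cycles


if __name__ == "__main__":
    print(expected_accesses_per_cycle(8))         # 15/8
    print(float(expected_accesses_per_cycle(8)))  # 1.875
    print(simulate(8, 200_000))                   # close to 1.875
```

Real access patterns are not random, so treat this as the idealised average only.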
 
Re: ISSCC 2005 Slides

PZ said:

good post:)

[Slide image: kaigaip039.jpg]



"up to 16 way 128 bit SIMD"

Hmm, 16*8 bit process? Then MIPS performance would be 4 times larger than flops: 1 teraops + 256 gigaflops in parallel?

edit : thx faf
 
hmm, 16*8 byte process? Then MIPS performance is 4 times larger than flops: 1 teraops + 256 gigaflops in parallel?
You mean 16*8 bit process, right?
And yeah, of course it's possible, but how useful are 8-bit operations really in practice? ;)
R5900 can do 16-way 128-bit SIMD also... :p
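
The "4 times" figure in the exchange above is just lane counting, which can be sketched as below. This assumes a 128-bit register and that an 8-bit operation issues at the same rate and is counted the same way per lane as a 32-bit float operation; the 256 gigaflops input is the number quoted in the post, not a measured value, and the function names are hypothetical.

```python
REGISTER_BITS = 128  # SIMD register width discussed in the thread


def lanes(element_bits: int, register_bits: int = REGISTER_BITS) -> int:
    """Number of elements of the given width that fit in one register."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide the register width")
    return register_bits // element_bits


def byte_ops_peak(fp32_peak_gflops: float) -> float:
    """Scale a 32-bit float peak to 8-bit ops by the lane ratio.

    Assumption: same issue rate and same ops-per-lane accounting for
    both element widths, which a real chip may not satisfy.
    """
    return fp32_peak_gflops * lanes(8) / lanes(32)


if __name__ == "__main__":
    print(lanes(8))            # 16
    print(lanes(32))           # 4
    print(byte_ops_peak(256))  # 1024.0
```

So 16 byte lanes against 4 float lanes gives the 4x ratio, and 256 scales to roughly 1 teraops under those assumptions.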
 
Well, it is useful for full-search ME (motion estimation).
I'm not saying it's useless - I happen to do a few things with it too (and there's a reason SIMD ISAs all have them) but I'm not sure I would use it to claim higher GOPS ratings...
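
As a rough illustration of why byte-wide operations suit full-search motion estimation: the inner loop is a sum of absolute differences (SAD) over 8-bit pixels, so a 128-bit register could cover 16 pixels per operation. The sketch below is plain scalar Python with hypothetical function names, assuming square blocks and a frame stored as a list of rows; it is not taken from any codec or from Cell code.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def full_search(ref, cur_block, top, left, radius):
    """Try every displacement within +/-radius and keep the lowest SAD.

    Returns (dy, dx, sad) for the best match, or None if no candidate
    position fits inside the reference frame.
    """
    n = len(cur_block)  # assumes an n x n block
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > height or x + n > width:
                continue  # candidate block would fall outside the frame
            candidate = [row[x:x + n] for row in ref[y:y + n]]
            cost = sad(candidate, cur_block)
            if best is None or cost < best[2]:
                best = (dy, dx, cost)
    return best


if __name__ == "__main__":
    # 8x8 reference frame with unique pixel values 0..63.
    ref = [[y * 8 + x for x in range(8)] for y in range(8)]
    # Take the 2x2 block at row 3, column 4 as the block to find.
    cur = [row[4:6] for row in ref[3:5]]
    print(full_search(ref, cur, top=2, left=2, radius=2))  # (1, 2, 0)
```

Every candidate costs one SAD over the whole block, which is the part that maps naturally onto 16 byte lanes.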
 
Is anyone expecting more than one 'Cell' in the PS3?

It seems sensible that the PS3 will have one cell and a beasty GPU, aggregating sub-1TF peak performance.

edit: oops, I just saw the thread on number of cells in PS3. Sorry.
 