ISSCC 2005

Quaz51 · Feb 7, 2005

version said:
"DP" = double precision?

20.3 A Double-Precision Multiplier with Fine-Grained Clock-Gating Support for a First-Generation CELL Processor
9:30 AM
J. Kuang(1), T. Buchholtz(2), S. Dance(2), J. Warnock(3), S. Storino(2), D. Wendel(4)

1 - IBM, Austin, TX
2 - IBM, Rochester, MN
3 - IBM, Yorktown Heights, NY
4 - IBM, Böblingen, Germany

A double-precision multiplier for a 90nm SOI CELL processor is presented. Dynamic Booth logic is designed for scalability and with noise, leakage, and pulse-width variation tolerance. Static partial-product compression is implemented with replicated bits for performance. The design supports fine-grained clock gating domains for active power reduction.

Panajev2001a · Feb 7, 2005

Did the SCEE paper mention SMT for the PU as well as VMX ?

I think I read just that...

Contains 64-bit Power Architecture™ with VMX that is a dual thread SMT design – views system memory as a 10-way coherent threaded machine

The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

PiNkY · Feb 7, 2005

Athlon had 16-way L1 caches at least initially as I recall. That may have changed in later revisions though. 2-way seems much too drastic a change to be realistic, however...

Constructing level 1 caches with 16-way set associativity does not make much sense given their absolute size. Athlon64's L2 cache, however, is 16-way associative...

www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/23932.pdf

Entropy · Feb 7, 2005

Panajev2001a said:
Did the SCEE paper mention SMT for the PU as well as VMX ?

I think I read just that...

Contains 64-bit Power Architecture™ with VMX that is a dual thread SMT design – views system memory as a 10-way coherent threaded machine

The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

I should probably shut up until I've had the opportunity to fully dig into the data becoming available, but the core would seem more sophisticated than most speculation assumed. If nothing else, it makes it more useful for non-PS3 applications where you might choose to scale back on the SPUs. And of course, it implies robust PS3 performance for "general game code" that isn't amenable to vectorization.

nAo · Feb 7, 2005

..hey guys..local SPU ram is NOT a cache..

DeanoC · Feb 7, 2005

Panajev2001a said:
The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

Why do you think it's OOOe? It's easy to build SMT on an in-order system.

And given it seems very similar to another processor core that's in-order, that would be my guess.

Panajev2001a · Feb 8, 2005

DeanoC said:
Panajev2001a said:

The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

Why do you think it's OOOe? It's easy to build SMT on an in-order system.

And given it seems very similar to another processor core that's in-order, that would be my guess.

I thought so because it seems easier to build SMT on top of an OOOe core than on an in-order core, since SMT shares quite a bit of logic with the core's dynamic schedule-and-execute engine. In fact, modern SMT implementations (Pentium 4's HT and POWER5) have both been built on top of OOOe cores, with a 5-10% penalty in extra chip area.

Npl · Feb 8, 2005

Given that Cell is a new design, OOOe won't do much (Edit: Well, I guess there could be benefits, since the compiler can't know if a load/store is done from the cache or from memory). At least as long as there won't be versions with a different number of execution units.

That's one thing I'm quite curious about, though: given that Cell will find its way into low-cost electronics, there will be cut-down versions. What exactly are they going to cut away?

Fewer SPUs? Certainly.

Less SPU SRAM? That would require different versions of those "Apulets", which gets especially funky if you consider their claims of Cells being able to run code on other Cells. Same problem if future versions get more SRAM.

Crippling the PE? Cache size should have no effect; modifying the ISA and/or execution units would require a different Cell OS. Could be doable, but developers will go crazy if there's no compatibility.

Even if you removed all APUs, there's still a full-blown Power core, not exactly something that I would consider low-cost.

Panajev2001a · Feb 8, 2005

DeanoC said:
Panajev2001a said:

The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

Why do you think it's OOOe? It's easy to build SMT on an in-order system.

And given it seems very similar to another processor core that's in-order, that would be my guess.

It is not impossible to do SMT without OOOe, so it is still likely that the two cores are really descendants of the same father.

archie4oz · Feb 8, 2005

The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

It's not OOOe, it's in-order... Also, SMT is typically easier to build into an in-order design as well...

Panajev2001a · Feb 8, 2005

archie4oz said:
The core is also OOOe (easier to build SMT on top of a OOOe processor core), has 2 nice layers of cache and a VMX unit: this is quite FAST.

It's not OOOe, it's in-order... Also, SMT is typically easier to build into an in-order design as well...

I accept the fact it might not be OOOe... I guess I was wishing it to be too much.

Sorry to contradict you, but how can it be easier to implement SMT on both in-order and out-of-order cores? You just removed the BASE figure against which you define what "easier" is.

Is it easier to implement SMT for an in-order CPU or for an out-of-order CPU ?

Is your answer "neither" ?

Gubbi · Feb 8, 2005

Guden Oden said:
Athlon had 16-way L1 caches at least initially as I recall. That may have changed in later revisions though. 2-way seems much too drastic a change to be realistic, however...

It's two-way set associative but 8-way interleaved. The 8-way interleaving means that the second cache access will be successful in 7/8 of cases (given a perfectly random access pattern), so on average 1.875 cache accesses / cycle.

Cheers
Gubbi
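The 1.875 figure in the post above can be checked with a few lines of arithmetic. This is only a sketch under the post's own assumptions: two access attempts per cycle, 8 banks, and a perfectly random, independent bank choice for each access. The function name is made up for illustration.

```python
from fractions import Fraction


def expected_accesses_per_cycle(banks: int) -> Fraction:
    """Expected completed accesses per cycle for two attempts on a banked cache.

    Assumption: the first access always completes; the second completes
    unless it lands in the same bank as the first (uniform random banks).
    """
    p_second_ok = Fraction(banks - 1, banks)  # second access hits a different bank
    return 1 + p_second_ok


print(float(expected_accesses_per_cycle(8)))  # 1.875
```

With real access patterns the bank choices are not independent, so the achieved rate can sit above or below this number.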

Gubbi · Feb 8, 2005

archie4oz said:
... Also, SMT is typically easier to build into an in-order design as well...

Seriously doubt that.

Cheers
Gubbi

PZ · Feb 8, 2005

ISSCC 2005 Slides

Looks like there are some of the slides up:

http://pc.watch.impress.co.jp/docs/2005/0208/kaigai153.htm

version · Feb 8, 2005

Re: ISSCC 2005 Slides

PZ said:
Looks like there are some of the slides up:

http://pc.watch.impress.co.jp/docs/2005/0208/kaigai153.htm

good post

"up to 16 way 128 bit SIMD"

hmm, 16*8 bit process? Then mips performance is 4 times larger than flops: 1 teraops + 256 gigaflops in parallel?

edit : thx faf

Fafalada · Feb 8, 2005

hmm, 16*8 byte process? Then mips performance is 4 times larger than flops: 1 teraops + 256 gigaflops in parallel?

You mean 16*8 bit process, right?
And yeah, of course it's possible, but how useful are 8-bit operations really in practice?

R5900 can do 16way 128bit SIMD also...
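The "4 times" ratio being discussed falls straight out of the lane counts. A rough sketch, using only the 128-bit width and the 256 gigaflops figure quoted in this thread; the big assumption (and the one questioned above) is that an 8-bit operation issues at the same per-lane rate as a 32-bit float operation. All names here are illustrative.

```python
REGISTER_BITS = 128  # SIMD register width quoted on the slide


def lanes(element_bits: int) -> int:
    """Number of elements of the given width that fit in one register."""
    return REGISTER_BITS // element_bits


fp32_lanes = lanes(32)            # 4 single-precision floats per register
int8_lanes = lanes(8)             # 16 bytes per register ("16-way")
ratio = int8_lanes // fp32_lanes  # 4x more elements per instruction

peak_gflops = 256                 # figure quoted in the thread
# Assumption: 8-bit ops issue at the same rate as the float ops behind that figure.
peak_8bit_gops = peak_gflops * ratio

print(fp32_lanes, int8_lanes, ratio, peak_8bit_gops)  # 4 16 4 1024
```

So "1 teraops" is just the float peak scaled by the lane ratio, which is why it says little about how useful those operations are.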

MfA · Feb 8, 2005

Well it is useful for full search ME.
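For anyone wondering why byte-wide SIMD matters here: taking ME to mean motion estimation, a full search compares a block of 8-bit pixels against every candidate offset using a sum of absolute differences (SAD), which is exactly the kind of work 16 byte lanes speed up. A minimal scalar sketch in plain Python, with hypothetical function names and a toy frame:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def full_search(ref, cur_block, top, left, radius):
    """Try every offset within +/-radius of (top, left) in the reference frame.

    Returns (best_sad, dy, dx), or None if no candidate fits inside the frame.
    """
    n = len(cur_block)
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + n > len(ref) or x + n > len(ref[0]):
                continue  # candidate block would fall outside the frame
            candidate = [row[x:x + n] for row in ref[y:y + n]]
            cost = sad(candidate, cur_block)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Toy 6x6 reference frame; the "current" block is copied from position (3, 2).
ref = [[r * 7 + c * 13 for c in range(6)] for r in range(6)]
cur = [row[2:4] for row in ref[3:5]]
print(full_search(ref, cur, top=2, left=2, radius=1))  # (0, 1, 0)
```

The inner SAD loop is what a 16-way byte SIMD unit would process 16 pixels at a time.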

Fafalada · Feb 8, 2005

Well it is useful for full search ME.

I'm not saying it's useless - I happen to do a few things with it too (and there's a reason SIMD ISAs all have them) but I'm not sure I would use it to claim higher GOPS ratings...

JF_Aidan_Pryde · Feb 8, 2005

Is anyone expecting more than one 'Cell' in the PS3?

It seems sensible that the PS3 will have one cell and a beasty GPU, aggregating sub-1TF peak performance.

edit: oops, I just saw the thread on number of cells in PS3. Sorry.

psurge · Feb 8, 2005

This slide is confusing to me :
http://pc.watch.impress.co.jp/docs/2005/0208/kaigaip046.jpg

Does this mean 25.6 GB/s to memory and 76.8 GB/s to other chips (GPU, etc...) or am I missing something?
