ISSCC 2005

Hmmm, on second thought, are there different upper and lower instruction sets?
I am pretty sure that's what it is - as mentioned, this is quite like the approach taken on VUs in PS2, and what you said about execution units being split across pipelines pretty much confirms it.
 
Re: hahahahaha

nAo said:
The second pipeline is probably devoted to load/store, DMA queues, branching, etc.
If we factor in even those operations we can inflate the 256 GFlops/s figure ;)
Those instructions can't really be said to be flops now can they... They sound more like ints to me, unless there's a way that I missed learning of to branch to a fractional address or somesuch of course! ;)
 
Re: hahahahaha

Guden Oden said:
Those instructions can't really be said to be flops now can they... They sound more like ints to me, unless there's a way that I missed learning of to branch to a fractional address or somesuch of course! ;)
I left out div or other complex fp instructions (thanks Faf!) :)
 
Re: hahahahaha

AutomatedMech said:
Something is not right. Each CELL APU burns only 1 watt @ 0.9 V at 2 GHz???? 11 watts at 5 GHz??? If IBM had such technology, it could forget about making chips for a living, license that tech to Intel and make billions/year.

Read. Learn. Post. I suggest you repeat the first two:

P ~= CFVV

F ~= V

P ~= CF^3

5/2 = 2.5

2.5 * 2.5 * 2.5 ~= 15.

Most likely they are reaching their minimum functional voltage before they reach 2 GHz, which shifts the results somewhat.
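
As a rough check of the arithmetic above, here is a minimal Python sketch of the same scaling rule. The function name `power_ratio` and the `f_floor` parameter are made up for illustration, and the 2.4 GHz floor is purely an assumed number to show how a minimum-voltage floor pulls the ratio below the pure cubic figure; it is not a value from the ISSCC material.

```python
def power_ratio(f_new, f_old, f_floor=0.0):
    """Ratio P(f_new) / P(f_old) for P ~ C * F * V^2 with V ~ F.

    f_floor is a hypothetical frequency below which the voltage cannot
    be lowered any further (the minimum functional voltage is reached).
    """
    # Voltage tracks frequency, but never drops below the floor.
    v_new = max(f_new, f_floor)
    v_old = max(f_old, f_floor)
    return (f_new * v_new ** 2) / (f_old * v_old ** 2)


cubic = power_ratio(5.0, 2.0)           # pure cubic scaling: 2.5 ** 3
floored = power_ratio(5.0, 2.0, 2.4)    # assumed voltage floor at 2.4 GHz
print(f"cubic: {cubic:.2f}x, with floor: {floored:.2f}x")
```

With no floor the ratio is the plain 2.5 cubed from the post. With a floor above 2 GHz, the 2 GHz point is already running at the minimum voltage, so the step up to 5 GHz costs less than the cubic estimate.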

Aaron Spink
speaking for myself inc.
 
cell0.JPG


What is it? It's not readable.
 
Why is local storage divided into four banks? Can each be individually addressed during a 128-bit load/store, and what does "permute" offer (beyond bit/byte permutations) to justify its large die area?
 
PiNkY said:
Why is local storage divided into four banks? Can each be individually addressed during a 128-bit load/store, and what does "permute" offer (beyond bit/byte permutations) to justify its large die area?

So that you can DMA to/from local storage, all while running code which loads/stores from/to local memory?

Cheers
Gubbi
 
Hmm, this might sound totally stupid (knowledge-wise, I'm really walking on thin ice here...), but wouldn't you simply need a second access port on the memory (along with an arbiter) for simultaneous/interleaved DMA transfers?

P.S.: Shouldn't the 128 GPRs give you some flexibility in manual prefetching/caching anyway...
 
First CELL presentation should start in a few minutes.
I want the paper..I want the paper..I want...or the slides at least! ;)

ciao,
Marco
 
I thought DP was short for double precision?

Anyway, so if the banks are for that purpose, memory is single-ported, I assume? But yeah, even on the VU, with single-ported access, they just arbitrate all DMA requests to wait for the VU.
 
Fafalada said:
Anyway, so if the banks are for that purpose, memory is single-ported, I assume?

Pure speculation on my part: I assume it's pseudo dual-ported, like AMD's K7/K8 (8-way interleaved) level 1 dcache.

Another possibility is that IBM's SRAM macro is 64KB and they just made 4 instances of it.
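
To make the pseudo dual-ported idea concrete, here is a minimal Python sketch of low-order bank interleaving. The four banks come from the thread; the 16-byte interleave granularity and the names `bank_of` and `conflicts` are assumptions for illustration, not anything confirmed about the actual local storage.

```python
BANKS = 4         # from the thread: local storage is split into four banks
LINE_BYTES = 16   # assumption: banks interleave on 128-bit (16-byte) lines


def bank_of(addr):
    """Bank index for a byte address under low-order interleaving."""
    return (addr // LINE_BYTES) % BANKS


def conflicts(cpu_addr, dma_addr):
    """True if a load/store and a DMA access would hit the same bank."""
    return bank_of(cpu_addr) == bank_of(dma_addr)


# Neighbouring 16-byte lines land in different banks, so both accesses
# could proceed in the same cycle on single-ported SRAM.
print(conflicts(0x0000, 0x0010))
# Addresses 64 bytes apart wrap around to the same bank, so one access
# would have to wait for the arbiter.
print(conflicts(0x0000, 0x0040))
```

Under these assumptions, a DMA stream and the running code only collide when they touch the same bank in the same cycle, which is how single-ported banks can look dual-ported most of the time.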

Cheers
Gubbi
 
No nitpicking intended, but I think the K7's as well as the K8's L1 data caches are only 2-way set-associative, though both are pseudo dual-ported...
 
Athlon had 16-way L1 caches, at least initially, as I recall. That may have changed in later revisions though. 2-way, however, seems much too drastic a change to be realistic...
 