CELL Patents (J Kahle): APU, PU, DMAC, Cache interactions?

Panajev2001a said:
160 Gbits per second is only about 20 GB/s, 5.6 GB/s slower than the XDR memory interface the rumors mentioned (the last one rumored was 51.2 GB/s), and also 150 nm technology?

Well, I think 123 MHz and 1300 signal pins are just the test implementation they came up with in mid-2003 for the ISSCC 2004 preview. If they can achieve a higher clock and more pins on it in 2005, it's possible Cell will use it instead of eDRAM.
 
I do not see the technology mapping well to XDR and it might be a technology unrelated to CELL.

XDR needs only 128 data-pins (64 bits bus) and 400 MHz of master clock (for 3.2 GHz on-chip data signalling rates) to achieve 25.6 GB/s of bandwidth.
 
160 Gbits per second is only about 20 GB/s, 5.6 GB/s slower than the XDR memory interface the rumors mentioned (the last one rumored was 51.2 GB/s), and also 150 nm technology?

That's just a trial. Anyway, this was what I was thinking of in regard to the memory implementation, if eDRAM is too much. Of course eDRAM is preferable, if or once they can do it.
 
V3 said:
160 Gbits per second is only about 20 GB/s, 5.6 GB/s slower than the XDR memory interface the rumors mentioned (the last one rumored was 51.2 GB/s), and also 150 nm technology?

That's just a trial. Anyway, this was what I was thinking of in regard to the memory implementation, if eDRAM is too much. Of course eDRAM is preferable, if or once they can do it.

V3, those are 1300 signal traces: wayyy too much for the bandwidth this thing provides.

XDR uses about 128 traces for data (I do not remember off-hand about control, address and clock traces, but I doubt that all three combined would go beyond 64 traces).

XDR pushes, with a master clock of 400 MHz, about 3.2 Gbps per effective data pin (we have two pins per data line as we are using differential signalling).

This solution pushes 160 Gbps using 1300 signal traces.

Say we are using differential signalling for everything including control and address data (not needed, but I want to inflate the numbers for this solution on purpose).

It means that effectively we are only using about 650 traces.

Say that we are using 64-bit address and 64-bit control busses, with a clock that uses 64 traces.

That still leaves 458 data traces.

Say that for some reason we have to halve that number to obtain the real number of effective data traces.

We are left with 229 data lines.

This is still way beyond what XDR needs to beat 160 Gbps of total bandwidth.

Could this be an interface for Redwood? I dunno, but it sure does not look to have any advantage over XDR in my eyes.
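The trace-count argument above can be sanity-checked with a few lines of arithmetic. All figures here are the rumoured numbers from this thread (XDR at 3.2 Gbps per differential pair, the test chip at 160 Gbps over ~1300 traces), not official specs:

```python
# Rough bandwidth-per-trace comparison using the figures quoted in this
# thread (rumoured numbers, not official specs).

def bandwidth_gbps(data_traces, gbps_per_line, differential=True):
    """Aggregate bandwidth for a parallel interface."""
    # Differential signalling uses two traces per effective data line.
    lines = data_traces // 2 if differential else data_traces
    return lines * gbps_per_line

# XDR as described above: 128 data traces -> 64 differential lines
# at 3.2 Gbps each (400 MHz master clock, 3.2 GHz data rate).
xdr_gbps = bandwidth_gbps(128, 3.2)        # 64 * 3.2 = 204.8 Gbps = 25.6 GB/s

# The ISSCC test chip: 160 Gbps aggregate over ~1300 signal traces.
test_gbps, test_traces = 160.0, 1300

print(f"XDR:       {xdr_gbps / 128:.3f} Gbps per trace")          # ~1.6
print(f"Test chip: {test_gbps / test_traces:.3f} Gbps per trace")  # ~0.123
```

On these numbers XDR delivers roughly 13x the bandwidth per trace, which is the core of the objection above.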
 
nAo said:
...
There is no mention of special (hw assisted) thread management on current CELL patents.
...

The Cell patents mention this,

[0065] PU 203 can be, e.g., a standard processor capable of stand-alone processing of data and applications. In operation, PU 203 schedules and orchestrates the processing of data and applications by the APUs.

Does this imply hardware thread management?

nAo said:
...
I believe the 'word' thread is mentioned just here and there in CELL patents but it's never really addressed.
...

These 6 Suzuoki Cell patents never once mention the word 'thread' anywhere in the masses of text describing the Cell architecture! I just did a quick word search in the browser! :p

...I would be extremely grateful if you/anyone could point me to Cell patents referring to the word 'thread'. I've never seen one! :)

nAo said:
Obviously an APU (controlled by a PU) can run multiple threads...

Sorry, silly question, but how is the APU multi-threading obvious?

Also, I'd be grateful on peoples thoughts on these,

1. What is a 'thread' in the context of Cell and how does it differ from an Apulet (software Cell), as a thread != Apulet?

2. Apulets contain both 'data' and 'instructions', IIRC. Isn't the role of the Cell compiler to 'generate' Apulets from code, either vector Apulets or scalar Apulets, optimised for instruction-level parallelism as APUs are SIMD based?

3. How do you think 'dependencies' will be handled for Apulets?

4. Will there be a preferred 'language' to code in to help the compiler 'look' for parallelism, e.g. a functional language like Lisp/Scheme?

Thanks in advance :)
 
The apulet dependencies are relatively trivial to handle (assuming all of them are external to the apulet) once the dependencies are listed.

If you use a model where input and output memory are decoupled and the output of one apulet can be the input of another, then dependencies are trivial to determine.

This is one of the best understood ways to implement parallelism. However you still have to somehow generate these functional blocks, and I believe that will require human intervention.
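A minimal sketch of that decoupled input/output model (all apulet and buffer names here are hypothetical, purely for illustration): once each apulet lists the buffers it reads and writes, the dependency graph and a valid execution order fall out of a topological sort.

```python
from graphlib import TopologicalSorter

# Hypothetical apulets, each declaring the buffers it reads and writes.
# An apulet depends on whichever apulet produces one of its inputs.
apulets = {
    "decode":    {"in": ["stream"],            "out": ["frames"]},
    "transform": {"in": ["frames"],            "out": ["vertices"]},
    "light":     {"in": ["vertices", "lamps"], "out": ["shaded"]},
}

# Map each buffer to the apulet that produces it.
producers = {buf: name for name, a in apulets.items() for buf in a["out"]}

# Build the dependency graph: apulet -> set of apulets it depends on.
graph = {
    name: {producers[buf] for buf in a["in"] if buf in producers}
    for name, a in apulets.items()
}

order = list(TopologicalSorter(graph).static_order())
print(order)  # ['decode', 'transform', 'light']
```

Apulets with no edge between them in this graph could be dispatched to different APUs in parallel, which is why this model is so convenient for scheduling.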

The definition of a thread in this type of context is somewhat amorphous; traditionally you would simply execute each code fragment until it finishes its work for this frame. Having said that, there is nothing to stop you executing multiple tasks simultaneously if the hardware is there to deal with it.

As for Cell I have no idea, but I would err on the simple side of no hardware thread support, if only because of the limited local memory the processing elements have.
 
Jaws said:
...I would be extremely grateful if you/anyone could point me to Cell patents referring to the word 'thread'. I've never seen one! :)
In a private message Vince pointed out to me this:
Multiprocessor system
We still don't know if this stuff is CELL related at all..

nAo said:
Obviously an APU (controlled by a PU) can run multiple threads...

Sorry, silly question, but how is the APU multi-threading obvious?
Once you have some 'master' unit that controls the APUs' state and that can start/stop apulet execution, it is pretty trivial to obtain some kind of multithreading support.
What we really want is to fill every free instruction slot on an APU, not just to switch some thread every X microseconds..

1. What is a 'thread' in the context of Cell and how does it differ from an Apulet (software Cell), as a thread != Apulet?
I see a thread in the CELL environment as a different instance of the same apulet: the same code working on the same shared data + some unique data per thread.

2. Apulets contain both 'data' and 'instructions', IIRC. Isn't the role of the Cell compiler to 'generate' Apulets from code, either vector Apulets or scalar Apulets, optimised for instruction-level parallelism as APUs are SIMD based?
Correct.

3. How do you think 'dependencies' will be handled for Apulets?
I dunno..but there are sandboxes at least..

ciao,
Marco
 
nAo said:
Jaws said:
...I would be extremely grateful if you/anyone could point me to Cell patents referring to the word 'thread'. I've never seen one! :)
In a private message Vince pointed out to me this:
Multiprocessor system
We still don't know if this stuff is CELL related at all..

nAo said:
Obviously an APU (controlled by a PU) can run multiple threads...

Sorry, silly question, but how is the APU multi-threading obvious?
Once you have some 'master' unit that controls the APUs' state and that can start/stop apulet execution, it is pretty trivial to obtain some kind of multithreading support.
What we really want is to fill every free instruction slot on an APU, not just to switch some thread every X microseconds..

1. What is a 'thread' in the context of Cell and how does it differ from an Apulet (software Cell), as a thread != Apulet?
I see a thread in the CELL environment as a different instance of the same apulet: the same code working on the same shared data + some unique data per thread.

2. Apulets contain both 'data' and 'instructions', IIRC. Isn't the role of the Cell compiler to 'generate' Apulets from code, either vector Apulets or scalar Apulets, optimised for instruction-level parallelism as APUs are SIMD based?
Correct.

3. How do you think 'dependencies' will be handled for Apulets?
I dunno..but there are sandboxes at least..

ciao,
Marco

I came across something that I think is what you're hoping for. It's called the SCALE Vector-Thread Architecture, or SCALE VT.

http://www.cag.lcs.mit.edu/scale/scalearch/


http://catfish.csail.mit.edu/scale/papers/vta-isca2004.pdf

http://www.mit.edu/~cbatten/work/scale-churchill-talk.pdf
 
V3, those are 1300 signal traces: wayyy too much for the bandwidth this thing provides.

Well, that's the idea of this technology: making a large number of signal traces possible. The bandwidth may seem underwhelming, but the idea is much like eDRAM, that is, providing wide-bus bandwidth.

This solution pushes 160 Gbps using 1300 signal traces.

There is nothing stopping them from going above 160 Gbps. With speed bumps and further improvement, I see no reason why this tech can't achieve 1000 Gbps and beyond. Its potential bandwidth should be somewhere in between XDR and eDRAM.
 
V3 said:
V3, those are 1300 signal traces: wayyy too much for the bandwidth this thing provides.

Well, that's the idea of this technology: making a large number of signal traces possible. The bandwidth may seem underwhelming, but the idea is much like eDRAM, that is, providing wide-bus bandwidth.

We do not know that this technology makes it so desirable.

I see no evidence and no reason why XDR should not be used as main RAM (it will be) and this memory be used instead.

Traces cost money; 160 Gbps at ~120 MHz with 1300 traces is very underwhelming.

To achieve what XDR can achieve with 256 traces, or with a higher signalling rate (6.4 GHz, which could mean a 400-800 MHz external clock depending on whether you change the PLL settings for on-chip clock multiplication), you would need to push this technology to about 330 MHz.

First, I want you to explain why XDR would not be used for main RAM and how this technology would be cheaper and faster.

I see no reason why this tech can't achieved 1000 Gbps and beyond.

125 GB/s off-chip ?!? With 1300 traces their clock should be about 767 MHz.

That is quite insane for an off-chip bus.

I do, PCB.

They cannot expect me to believe that such a thing would scale almost 10x in performance with off-chip connections that wide.

Wide busses work with e-DRAM as the e-DRAM is embedded with the logic chip.

You have very short wires and are not running on PCB.

Off-chip and as fast as e-DRAM, right ? Does it also raise your stock options magically ?

I see more future in SCE+Toshiba's 65 nm e-DRAM technology (very small DRAM cells) and their 45 nm technology node's related FB e-DRAM (no capacitor, even smaller DRAM cell), which massively reduces e-DRAM area needs.

Also, chip-to-chip interconnect should be covered by Redwood... is this a patent on a Redwood implementation (not likely)?

I think this might not be related to PlayStation 3.
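For what it's worth, the clock figures being argued about follow from simple linear scaling of the demonstrated numbers (an assumption in itself: it presumes aggregate bandwidth scales linearly with bus clock at a fixed trace count):

```python
# Back-of-the-envelope clock scaling from the demonstrated figures
# quoted in this thread: 160 Gbps at ~123 MHz over 1300 traces.

BASE_GBPS = 160.0
BASE_MHZ = 123.0

def clock_needed_mhz(target_gbps):
    # Assumes bandwidth scales linearly with bus clock, traces fixed.
    return BASE_MHZ * target_gbps / BASE_GBPS

# Matching the rumoured 51.2 GB/s (409.6 Gbps) XDR interface:
print(f"{clock_needed_mhz(51.2 * 8):.0f} MHz")   # ~315 MHz ('about 330 MHz')

# Reaching the suggested 1000 Gbps (125 GB/s):
print(f"{clock_needed_mhz(1000):.0f} MHz")       # ~769 MHz ('about 767 MHz')
```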
 
125 GB/s off-chip ?!? With 1300 traces their clock should be about 767 MHz.

That is quite insane for an off-chip bus.

It's not really off-chip in the sense of XDR; the memory chip and the logic chip are bonded together directly. So no nightmare from PCB traces.

They can widen the bus width or increase clock.

Off-chip and as fast as e-DRAM, right ?

Yep. Compared to the PS2's RDRAM & eDRAM solution, it sits pretty well, given the fabrication tech and clock speed of this demonstration.

BTW, this demonstration is not really about that particular figure; it's about the technology to bond the memory and logic chips together.
 
I don't mean to interrupt, but I have a quick question that's more EE-centric, as I'm forgetting a lot. When you're employing a process that mixed-loads DRAM onto a given substrate, what is the manufacturing process this entails? Thanks.
 
Vince said:
I don't mean to interrupt, but I have a quick question that's more EE-centric, as I'm forgetting a lot. When you're employing a process that mixed-loads DRAM onto a given substrate, what is the manufacturing process this entails? Thanks.

You mean e-DRAM on non-SOI silicon and logic on SOI silicon ;) ?

Could this ISSCC presentation relate? I have to read it in full then... argh... no time...
 
Since you guys are talking about eDRAM, here's a new eDRAM patent from IBM,

Structure and System-on-Chip Integration of a Two-Transistor and Two-Capacitor Memory Cell for Trench Technology

Background of Invention

[0001] This invention generally relates to embedded dynamic random access memory, and more particularly to a cell structure formed by two transistors and two capacitors to be used in a system on-chip embedded dynamic random access memory (DRAM).
.....

Abstract

A two-port dynamic random access memory (DRAM) cell consisting of two transistors and two trench capacitors (2T and 2C DRAM cell) connecting two one transistor and one capacitor DRAM cell (1T DRAM cell) is described. The mask data and cross-section of the 2T 2C DRAM and 1T DRAM cells are fully compatible to each other except for the diffusion connection that couples two storage nodes of the two 1T DRAM cells. This allows a one-port memory cell with 1T and 1C DRAM cell and a two-port memory cell with 2T and 2C DRAM cell to be fully integrated, forming a true system-on chip architecture. Alternatively, by halving the capacitor, the random access write cycle time is further reduced, while still maintaining the data retention time. The deep trench process time is also reduced by reducing by one-half the trench depth.
....

I don't think it's as dense as the Toshiba capacitor-less eDRAM, but it may have better latencies?
 
Panajev2001a said:
Could this ISSCC presentation relate? I have to read it in full then... argh... no time...

No. I just would like someone to explain the process of mix-loading a DRAM on a given substrate.
 
Can these apulets be compared to the "microthreads" in Stackless Python? You know, the "dialect" of Python that EVE is programmed in.
 
transistor count

Hello all,

Just have a question for Jaws really....

I noticed that you think the die size for the PS3 chip will be about 290-300mm sq @ 65nm.

After reading an article at the Inq about Intel's Montecito, a 580mm sq / 90nm / 1.72 billion transistor chip, I got thinking.

source : http://www.the-inquirer.net/?article=18345

I wanted to calculate how many transistors the CELL chip would have at 300mm sq @ 65nm.

First I calculated what the area shrink would be from 90nm to 65nm:

90*90 = 8100 (multiplies out the surface area)
65*65 = 4225 (multiplies out the surface area)
8100 / 4225 = 1.9172 (the result is the value I need to divide the die size by.)

So am I correct in saying that if you took a 1.72 billion transistor chip built on a 90nm process (that measures 580mm sq) and shrunk it down to 65nm, you get:

580 * 580 = 336400 (multiplies out the surface area)
336400 / 1.91 = 175464 (divides the surface area by required factor)
square root of 175464 = 418.9mm sq (resulting die size)

So a 1.72 billion transistor chip @ 65nm has a 418.9mm sq die. So now to reduce the transistor count to make it fit on 300mm sq:


418.9 * 418.9 = 175477.21 (multiplies out the surface area of the oversized chip)
300 * 300 = 90000 (multiplies out the surface area of the required die size)
175477.21 / 90000 = 1.95 ( the division required for transistor reduction)

1.72 billion / 1.95 = 882 million

I therefore conclude that, according to your estimates of die size and in accordance with transistor counts from Intel, the Cell chip will contain

around 882 million transistors!!

Have I done something really silly? Are my calculations going in the wrong direction?

Do you agree or disagree?
 
disagree

1.72 billion at 90nm / 580mm2 =~ 1.72 billion at 65nm / 300mm2, or =~ 882 million at 90nm / 300mm2, but not 882 million at 65nm / 300mm2; that's absurd.
 
Re: transistor count

kyetech said:
...
After reading an article at the inq about intels MONTECITO / 580Sqmm / 90nm, 1.72 billion transistor chip, it got me thinking.
...

Firstly, welcome to the forum! :p

Secondly, I have to agree with Quaz51 and disagree with the calculations.

A drop from 90nm to 65nm will give you this scaling factor for unit area,

(90/65)^2 ~ 1.92

so a 580 mm2 die > 580/1.92 = 302 mm2 with 1.72 billion trannies at 65nm

300 mm2 die at 65nm > 1.72 * (300/302) ~ 1.71 billion trannies, and not 882 million.

Your mistake was this,

....
580 * 580 = 336400 (multiplies out the surface area)
336400 / 1.91 = 175464 (divides the surface area by required factor)
square root of 175464 = 418.9mm sq (resulting die size)
....

You did not need to square the area again; 580 mm2 is already an area, so you just divide it once by the area factor! ;)

You could attempt this with a GPU with multiple SIMD units more akin to Cell, e.g. NV40

NV40 has 222 million trannies at approx 300 mm2 die with 130nm. So at 65nm,

(130/65)^2 =4

300/4= 75 mm2 die with 222 million trannies at 65nm

300 mm2 die at 65nm = 222*(300/75) = 888 million trannies!

This is ironically close to your original erroneous calculations using Montecito as a reference! :p

So we seem to have a range for the BE die at 300mm2 on 65nm of ~0.9-1.7 billion trannies... 'cause I like averages, I'll say ~1.3 billion trannies! :p
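The scaling used in this post can be captured in one small helper. This is an idealisation (it assumes transistor density scales with the square of the linear feature-size ratio, which real process shrinks only approximate), but it reproduces both worked examples:

```python
# Ideal die-shrink arithmetic: density scales with the square of the
# linear feature-size ratio. Real shrinks rarely scale this perfectly.

def scale_transistors(transistors, die_mm2, node_from_nm, node_to_nm, target_mm2):
    density_gain = (node_from_nm / node_to_nm) ** 2   # e.g. (90/65)^2 ~ 1.92
    shrunk_mm2 = die_mm2 / density_gain               # same chip at the new node
    return transistors * (target_mm2 / shrunk_mm2)    # fill the target die

# Montecito: 1.72 billion trannies, 580 mm2 at 90nm -> 300 mm2 at 65nm
montecito = scale_transistors(1.72e9, 580, 90, 65, 300)   # ~1.71 billion

# NV40: 222 million trannies, ~300 mm2 at 130nm -> 300 mm2 at 65nm
nv40 = scale_transistors(222e6, 300, 130, 65, 300)        # 888 million

print(f"{montecito / 1e9:.2f} billion, {nv40 / 1e6:.0f} million")
```

Note the area is divided by the density factor exactly once; squaring it again (and square-rooting later) is the slip the thread is arguing about.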
 
Before I make an explanation I want you to visualise this analogy.

If you halve the res of your monitor from 1000x1000 to 500x500, you divide each axis of your monitor by 2... but the resulting number of pixels is divided by four. This is an important idea as to why you can't just divide 580mm sq by 1.91.

580 represents just one side of the square, it does not represent the area.
1.91 represents the reduction of total surface area, not the division of an axis.

Therefore, if you think that 90 / 65 = 1.3846, then you will see that EACH AXIS of the SQUARE IS REDUCED by 1.3846, giving a TOTAL AREA DECREASE of 1.91.

580 (x axis) / 1.3846 = 418.9x
580 (y axis) / 1.3846 = 418.9y

or you could say...

580 * 580 = 336400 (i.e. the actual area, similar to the number of pixels on the monitor)
336400 / 1.91 = 175464 (divides the surface area by required factor)
square root of 175464 = 418.9mm sq (resulting die size)

This is saying: divide the surface area by the factor by which the area needs reducing.


Let's now look at what you are doing... 580 / 1.91 = 303mm sq

90nm (the size of a single transistor in one axis) / 1.91 = 47.1
90nm (the size of a single transistor in one axis) / 1.3846 = 65


If you scale each axis by 1.91 (which is what you are doing), then you actually achieve an area shrink factor of 3.648, which is wrong; it's actually 1.91 (nearly two), hence Intel's ability to double their transistor count with each process, not quadruple it.



Do you see what I mean?

I hope this image I made also helps.

mathsproof.jpg
 