PlayStation III Architecture

Peppermonkey · Jan 15, 2003

Wow, this thread became way too technical for my tastes.

I find it funny that this kind of a discussion is happening when we really don't know much at all about the PS3. Most of this looks like speculation... and speculation of a console that won't come out for another couple years probably won't produce the correct answer.

Panajev2001a · Jan 15, 2003

well CELL should start entering mass-production by early 2004... and that means that the chip must tape out a good 6 months before it ( approx. ) which means that by end of 2003 the design must be pretty much finalized...

Specs have to be finalized even earlier as you will need to worry about implementing those specs into silicon at the projected clock speed and this will take a looooooong time

Peppermonkey · Jan 15, 2003

The problem that I already see in this architecture is that it doesn't seem that it would be easy to program for. Developers already complain about the PS2, and this seems much more complex.

Is CELL supposed to do everything through a Network connection? Or is it just a normal modular chip kinda like what the Voodoo 5 had?

I'm not sure I grasp what exactly the PS3 is supposed to do.

zurich · Jan 15, 2003

Hehe yeah, would a tech-savvy guy care to do a bullet point breakdown of what's known about the PS3? Kind of break it into armchair engineer terms?

Panajev2001a · Jan 15, 2003

I agree that this beast would be impossible or close to impossible to handle in ASM or something of similar level of HW abstraction... we will need a powerful OS and high level APIs and a nice Shading Language to code in.

Possibly a good OS and a solution similar to RenderWare should be made the standard by Sony for PS3 developers for a while, until the platform is widespread and developers can afford to do lower-level optimization ( some will still try )... similar to PSX days...

Sony, IBM and Toshiba will have to worry about Software R&D as well and it seems they are...

CELL doesn't do EVERYTHING through the network... what CELL does is provide a modular platform with a certain ISA kept constant through all CELL-based devices... what is executed are packets or software "Cells" with instructions and data ( the data and what to do with that data travel together )... each software "Cell" has a unique global ID and it can be executed by any APU ( Attached Processing Unit ) on the network the CELL device is connected to, including any of its own APUs...
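To make the software "Cell" idea a bit more concrete, here is a rough Python sketch of a packet that carries its program and data together, has a unique ID, and can be handed to any APU. Everything in it is hypothetical illustration: the names `SoftwareCell`, `APU` and `dispatch`, the counter used for IDs, and the round-robin scheduling are assumptions, not anything taken from the patent.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, Dict, List, Sequence, Tuple

_id_source = count(1)  # stand-in for a real global ID allocator (assumption)


@dataclass
class SoftwareCell:
    """Hypothetical packet: the program and the data it works on travel together."""
    program: Callable[[Sequence[float]], float]
    data: Sequence[float]
    global_id: int = 0

    def __post_init__(self) -> None:
        if self.global_id == 0:
            self.global_id = next(_id_source)


@dataclass
class APU:
    """Hypothetical processing unit; local or remote makes no difference here."""
    name: str

    def execute(self, cell: SoftwareCell) -> float:
        return cell.program(cell.data)


def dispatch(cells: List[SoftwareCell], apus: List[APU]) -> Dict[int, Tuple[str, float]]:
    """Round-robin the cells over the APUs; any APU can run any cell (shared ISA)."""
    results = {}
    for i, cell in enumerate(cells):
        apu = apus[i % len(apus)]
        results[cell.global_id] = (apu.name, apu.execute(cell))
    return results


if __name__ == "__main__":
    apus = [APU("local-0"), APU("remote-0")]
    cells = [SoftwareCell(sum, [1.0, 2.0, 3.0]), SoftwareCell(max, [4.0, 9.0])]
    print(dispatch(cells, apus))
```

The point of the sketch is only that the scheduler never needs to know where an APU lives, because every APU understands the same instructions.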

The Instruction Set Architecture is constant across all APUs ( if a CELL 2.0 arrives it will probably be backward compatible with the current ISA... pretty much like the basic IA-32 ISA has stayed with us since the 80386 )...

wazoo · Jan 15, 2003

Peppermonkey said:
Wow, this thread became way too technical for my tastes.

I find it funny that this kind of a discussion is happening when we really don't know much at all about the PS3. Most of this looks like speculation... and speculation of a console that won't come out for another couple years probably won't produce the correct answer.

We are 3 full years (at least in Europe) from its launch and we are already talking about it. In 2004, the hype will make the PS2 launch look like a non-event.

PiNkY · Jan 16, 2003

Well, Sony/IBM/Toshiba announced "tape out" for the "Cell" chip in late summer 2002 (late August, if I remember correctly). If that is the chip to power the PS3, two and a half years for ramp-up seems to be an awfully long time. I guess we might see that architecture show up before then in other appliances (which ones elude me, but anything requiring upper-grade DSPs, like high-end receivers, set-top boxes and DVD players, might be a good guess). On the other hand, I guess the C-Net tape-out statement might just have been a misinterpretation...

marconelly! · Jan 16, 2003

The same thing was said about GS's 512 bit bus before. Unfortunately, it wasn't enough to sustain a decent rendering performance.

So what WAS decent rendering performance back then when it was released? In the same price class, of course. Was there anything really better, so that it would give the rights to complain?

V3 · Jan 16, 2003

The same thing was said about GS's 512 bit bus before. Unfortunately, it wasn't enough to sustain a decent rendering performance.

??? what not enough to sustain decent rendering performance ?

128 byte * 4 GHz = 512 Gbyte/s. Is it enough to feed 32 VUs???

Half of that would be enough to feed 32 APUs.

Presuming a VU cache miss rate of 10%

Those APUs don't have any cache, just local memory.
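For anyone who wants to play with the bandwidth numbers above, here is a small Python sketch of the arithmetic. It assumes one full-width transfer per clock and an even split across 32 APUs; both are simplifying assumptions, and the function name is made up for illustration.

```python
def bus_bandwidth_gb_s(width_bytes: int, clock_ghz: float) -> float:
    """Peak bandwidth in GB/s, assuming one full-width transfer per clock."""
    return width_bytes * clock_ghz


NUM_APUS = 32  # figure used in the discussion above

total = bus_bandwidth_gb_s(128, 4.0)   # 128 bytes wide at 4 GHz
per_apu = total / NUM_APUS             # even split (assumption)
half_per_apu = (total / 2) / NUM_APUS  # the "half of that" case

print(f"total: {total} GB/s")
print(f"per APU: {per_apu} GB/s")
print(f"per APU at half bandwidth: {half_per_apu} GB/s")
```

At half the bandwidth each APU would still see 8 GB/s, which is 2 bytes per cycle at 4 GHz; whether that is "enough" depends on how much work stays inside each APU's local memory.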

Gubbi · Jan 16, 2003

Fafalada said:
Gubbi,

I disagree about physics though. Obviously it calls for different algorithms, but from what research I did on the subject, matrix factorization in particular can be adapted fairly well to parallel approaches.
And if I could make a vectorized version of Cholesky factorization and a linear solver that can be processed in isolated parts, working entirely on the scratch pad principle, I'm sure a lot better can be done too.

Applications in supercomputer arrays have always been an attractive research topic, I think...
http://www.computer.org/tpds/td1997/l0502abs.htm

Naive sparse matrix solvers have a communication cost that scales with p^3 log^3(p), where p is the number of nodes. With 128KB nodes you'll need a good number of nodes. The above paper reduces this communication complexity to p^1.5 log^1.5(p); it does this by doing more (redundant) calculations on each node, which reduces per-node performance to 40% (50 MFLOPS to 20 MFLOPS in their case; old paper) of a single-thread, single-node solution.

So you have nonlinear communication costs, reduced performance and the added complexity of programming a message passing solution.

It is still good performance (lots of oomph), but I'm just emphasizing that you get nowhere near peak performance. I actually think that the message passing that is required is more of a problem (steep learning curve for PS3 developers).
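To get a feel for how the two communication costs quoted above diverge, here is a rough Python sketch. It drops all constant factors, assumes base-2 logarithms (the base is not stated above), and the aggregate-throughput helper simply multiplies the node count by the 40% per-node figure while ignoring time spent communicating, so treat the output as an illustration of scaling only.

```python
import math


def naive_comm_cost(p: int) -> float:
    """Communication cost ~ p^3 * log^3(p), constants dropped (log base 2 assumed)."""
    return (p * math.log2(p)) ** 3


def improved_comm_cost(p: int) -> float:
    """Communication cost ~ p^1.5 * log^1.5(p), constants dropped."""
    return (p * math.log2(p)) ** 1.5


def aggregate_mflops(p: int, single_node_mflops: float = 50.0,
                     efficiency: float = 0.4) -> float:
    """Upper bound on useful throughput; ignores communication time entirely."""
    return p * single_node_mflops * efficiency


for p in (4, 16, 64, 256):
    ratio = naive_comm_cost(p) / improved_comm_cost(p)
    print(f"p={p:4d}  naive/improved={ratio:14.1f}  "
          f"aggregate<={aggregate_mflops(p):9.1f} MFLOPS")
```

Even with the better exponent the communication term grows faster than the node count, which is the nonlinear cost being described.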

Cheers
Gubbi

Panajev2001a · Jan 16, 2003

I posted this yesterday, but nobody commented on it... although I think it is more than a little interesting... it asks the question "are we sure each APU simply has 2 standard Vector Units ( Integer and FP SIMD VUs ) ?" If you read what I quote from that post of mine below, it should be clear why I asked myself this question in the first place...

Panajev2001a said:
The patent seems to suggest that the PE will be clocked at around 1 GHz. So for each PE that has a full 8 APUs, we'll have a peak of 32 billion 32-bit operations per second. That is definitely fast. Compare to a 2 GHz P4 which, using SSE, is capable of just 8 billion 32-bit operations.

I disagree with this assessment...

OK, we know that the APUs can each perform SIMD operations, and if we keep the PS2 VUs' model each APU can do ( pipelined ) 4 MADDs/cycle ( FMAC, fused multiply-add ), and that is 8 FP ops/cycle per APU...

From the patent:

[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed. In a preferred embodiment, local memory 406 contains 128 kilobytes of storage, and the capacity of registers 410 is 128 x 128 bits. Floating point units 412 preferably operate at a speed of 32 billion floating point operations per second (32 GFLOPS), and integer units 414 preferably operate at a speed of 32 billion operations per second (32 GOPS).

Each APU is rated indeed at 32 GFLOPS...

And since we know each APU can do a max of 8 FP ops/cycle...

8 FP ops/cycle * 4 GHz = 32 GFLOPS

And this is for each APU: suggested speed is indeed 4 GHz
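The same peak-rate arithmetic, written out as a small Python sketch so the two estimates in this post can be compared side by side. It assumes the PS2-style model described above (4 FP units, each retiring one multiply-add, counted as 2 FP ops, per cycle) and takes the 8-APUs-per-PE figure from the quoted post; the function names are invented for illustration.

```python
def peak_gflops(fp_units: int, ops_per_unit_per_cycle: int, clock_ghz: float) -> float:
    """Peak rate in GFLOPS: units * ops per unit per cycle * clock in GHz."""
    return fp_units * ops_per_unit_per_cycle * clock_ghz


def implied_clock_ghz(target_gflops: float, fp_units: int,
                      ops_per_unit_per_cycle: int) -> float:
    """Clock needed to reach a target peak rate with the given units."""
    return target_gflops / (fp_units * ops_per_unit_per_cycle)


# One APU: 4 FMACs, a multiply-add counted as 2 FP ops (assumption from the post).
apu_gflops = peak_gflops(4, 2, 4.0)
# One PE with 8 APUs, as in the quoted estimate.
pe_gflops = 8 * apu_gflops
# Clock implied by the patent's 32 GFLOPS per APU under the same model.
clock = implied_clock_ghz(32.0, 4, 2)
# The quoted 1 GHz estimate: 8 APUs * 4 lanes * 1 op per lane per cycle.
quoted_pe_gops = 8 * peak_gflops(4, 1, 1.0)

print(apu_gflops, pe_gflops, clock, quoted_pe_gops)
```

Under these assumptions the 1 GHz reading gives 32 billion ops for a whole PE, while the patent's 32 GFLOPS per APU only works out at 4 GHz.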

Quoting again the quote I just posted I have to disagree that these are "simple" VUs like PS2's ones... first of all we haven't been presented with the 4 FMACs structures and one or two FDIVs... the only thing we know is that we have four FP Units: for all we know each could pack an FDIV, for all we know each could be an EFU-like unit...

Another thing: if the four FP Units were indeed 4 FMACs tied together and being able to work as a SIMD unit only ( no independent operation allowed and only support for 4-way parallel SIMD operations ) how would we explain THIS ( here is the quote I was presenting again as I said a few lines above ):

[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed.

Look at the underscored portion of the text...

"[...]a greater or lesser number of floating points units 512 and integer units 414 can be employed [...]"

And we also know that the "ISA is constant across all APUs"... even if we change the number of FP Units, no changes to the ISA or changes to the code should be planned...

How could this work in a standard SIMD VU architecture ?

To me, the workarounds in Instruction decoding and Control Unit operation, to make sure a 4-way MADD SIMD instruction is performed with 2 FMACs or even 1 FMAC as if we had 4 FMACs, would have a certain degree of complexity involved...

What we would need, to have a quasi-optimal solution, would be the FP Units to be able to work in two modes: independent mode and SIMD mode ( all together )...

Impossible ?

Uhm... but I thought I saw that before... somewhere, it must have been a super-computer with an insane budget... BEEEP!!! WRONG!!! We saw it in the EE: as you can quickly check, the Integer Units of the RISC core in the EE were two separate 64-bit IUs, but they could work as a single 128-bit VU, and this is quite close to what I think is going on with the APU's Integer Units and FP Units... it is indeed "proven" and already "pioneered" technology, present in a consumer chip for quite some time ( the EE )...

One way that comes to mind for "another" approach, in which the ISA fixes that each APU is basically made of two standard SIMD VUs and we can still vary the number of FP and Integer Units without sacrificing program compatibility, is this:

in each chip that uses more or fewer Execution Units than the standard 4 tied FMACs, the instruction gets micro-coded ( think if you had to perform a 4-way SIMD MADD with a single FMAC... you would loop it ~4 times through the FMAC, each time working on a different field of the 128-bit vectors )...
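The micro-coding idea can be sketched in a few lines of Python: the same 4-wide multiply-add "instruction" gives the same result whether the chip has 4, 2 or 1 FMAC, and only the number of passes through the hardware changes. This is a toy model under that assumption, with a made-up function name; it says nothing about how a real APU decoder would do it.

```python
from typing import List, Sequence, Tuple


def simd_madd(a: Sequence[float], b: Sequence[float], c: Sequence[float],
              num_fmacs: int = 4) -> Tuple[List[float], int]:
    """Compute a*b + c lane by lane using num_fmacs physical units.

    Returns the result vector and the number of passes through the FMACs.
    """
    if num_fmacs < 1:
        raise ValueError("need at least one FMAC")
    if not (len(a) == len(b) == len(c)):
        raise ValueError("operands must have the same number of lanes")

    lanes = len(a)
    result = [0.0] * lanes
    passes = 0
    # Each pass handles as many lanes as there are physical FMACs.
    for start in range(0, lanes, num_fmacs):
        for lane in range(start, min(start + num_fmacs, lanes)):
            result[lane] = a[lane] * b[lane] + c[lane]
        passes += 1
    return result, passes


if __name__ == "__main__":
    a, b, c = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [1.0, 1.0, 1.0, 1.0]
    for units in (4, 2, 1):
        print(units, simd_madd(a, b, c, num_fmacs=units))
```

The program sees one instruction either way, so code compatibility is kept; the cost shows up as extra cycles on the chips with fewer units.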

Or we could have 4 APUs with one FMAC each do the operation while working in parallel, but that would be quite a waste...

After all the patent says...

The APUs preferably are single instruction, multiple data (SIMD) processors.

And that compared with

this

[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed.

and

this

These processors also preferably all have the same ISA and perform processing in accordance with the same instruction set.

tells me something is a bit unclear in this patent...

I still have some other comments, but I wanted to get these off my chest first...

I have fixed some of the grammar and wording...

V3 · Jan 17, 2003

Quoting again the quote I just posted I have to disagree that these are "simple" VUs like PS2's ones... first of all we haven't been presented with the 4 FMACs structures and one or two FDIVs... the only thing we know is that we have four FP Units: for all we know each could pack an FDIV, for all we know each could be an EFU-like unit...

Maybe each FP unit in the APUs is capable of 4 FMACs, or is a more complete solution than just a single FMAC. But it's still up in the air, IMO.

Panajev2001a · Jan 17, 2003

thanks for reading and replying V3

V3 · Jan 17, 2003

Though in my opinion 4 GHz would be the way to go for Sony. But that's kind of hard to achieve with 100nm technology. From some poster here, they are moving to 65nm tech for 2005. I find that hard to believe, but for mass production, this chip would only be possible on such a process.

Panajev2001a · Jan 17, 2003

V3, IBM's 0.10um SOI process was licensed by Sony a good while ago already... since then Sony and Toshiba announced a new 65 nm process... in 2 years they can either go with 0.10um, take a bit of a loss because of the chip size but enjoy a gorgeous YIELD rate, and then move to 65 nm and lower once they get good yields there too, or try to start mass-producing CELL with 65 nm technology at the end of 2004... they will choose the safest bet ( considering shortages == VERY bad )

V3 · Jan 17, 2003

Do you know that 0.1u tech will give a 65nm gate length? Are you sure Sony and Toshiba announced 65nm tech, and that it was not the gate length they stated?

I estimate this chip on 0.1u to be around 300mm2; that's just a bit on the big side for cheap consoles, IMO.
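As a back-of-the-envelope check on what a process shrink would do to that 300mm2 estimate, here is a Python sketch using ideal scaling, where area goes with the square of the feature size. That is a strong assumption (real designs rarely shrink this cleanly, and pads, analog blocks and wiring scale worse), and the function name is made up for illustration.

```python
def scaled_area_mm2(area_mm2: float, from_nm: float, to_nm: float) -> float:
    """Ideal die-area scaling: area changes with the square of the feature size."""
    return area_mm2 * (to_nm / from_nm) ** 2


estimate_100nm = 300.0  # the 0.1u estimate from the post above
at_65nm = scaled_area_mm2(estimate_100nm, 100.0, 65.0)
# Same rule applied to a 130 nm -> 65 nm shrink gives exactly one quarter.
quarter = scaled_area_mm2(1.0, 130.0, 65.0)

print(f"300 mm2 at 100 nm -> about {at_65nm:.1f} mm2 at 65 nm (ideal scaling)")
print(f"130 nm -> 65 nm area factor: {quarter}")
```

The 130 nm to 65 nm factor of one quarter matches the "one-fourth the size" wording in the Toshiba and Sony announcement quoted later in the thread.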

Panajev2001a · Jan 17, 2003

I will do some research on both announcements... I was almost sure that 65 nm was the process and not the gate length...

After all Sony and Toshiba passed it as a breakthrough THEY achieved... it would not be good for them to pass IBM's tech as their own, as if they INVENTED it

Vince · Jan 17, 2003

V3 said:
From some poster here, they are moving to 65nm tech for 2005

<raises hand> That was me... my name is Vince... I've been posting here for almost 4 years. I know the difference between "gate length" and a CMOS process size... thanks

I find that hard to believe

I find it hard to believe that you have more than perhaps 50 excitable cells functioning right now. Do you think I post BS just for the hell of it or purposely lie? Do you have any idea how much people questioning stuff like this pisses me off?

Are you too damn lazy to open Google and search for <"65nm" "Sony" "Toshiba"> and see the following:

Toshiba, Sony Reveal Advanced Semiconductor Process Technologies

65-nanometer process technology will create small, powerful System LSIs
TOKYO, December 3, 2002 -- Toshiba Corporation and Sony Corporation today announced the world's first 65-nanometer (nm) CMOS process technology for embedded DRAM system LSIs -- a major breakthrough in process technology for highly advanced, compact, single-chip system LSIs that will be only one-fourth the size of current devices while offering higher levels of performance and functionality.

The move to ubiquitous computing -- total connectivity at all times -- relies on high-performance equipment. These in turn require advanced SoC (system on chip) LSIs integrating ultra-high performance transistors and embedded high-density DRAM. In such devices, size and performance levels are directly related to process technology: finer lithography results in smaller devices that offer higher levels of performance. The new process technology announced by Toshiba and Sony and integration to a new level that allows bandwidths to be scaled up and the maximization of system performance.

Current system LSI devices on the market are produced with 130 nanometer process technologies. Toshiba, the recognized industry leader in advanced process technology, is the only company with mass production technology for 90nm process embedded DRAM system LSI, a technology it is currently deploying and that will meet ever increasing demand for more and more compact devices.

The new SoC technologies for 65nm process generation include: 1) a high-performance transistor with the world's fastest switching speed; 2) the world's smallest cell for embedded DRAM; and 3) the world's smallest cell for embedded SRAM.

The new process technology is the result of joint development of Toshiba Corporation and Sony Corporation of 90nm and 65nm CMOS process technology that was initiated in May 2001. Full details will be presented at the December 9 - 11 International Electron Devices Meeting (IEDM) in San Francisco.

Outline of new technology

1) High-performance transistor with 30nm gate length:
Transistors in this technology have high nitrogen concentration plasma nitrided oxide-gate dielectrics to suppress gate leakage current. This optimization reduces leakage current approximately 50 times more efficiently than conventional SiO2 film and allows formation of an oxide with an effective thickness of only 1nm. Furthermore, Ni silicide is applied in the gate electrodes and source/drain regions to attain low resistance and to reduce junction leakage current. Shallow extension formation optimizing ultra-low energy ion implantation, spike RTA and offset spacer process successfully suppresses the short channel effect of MOSFET and achieves superior roll-off characteristics. An excellent switching speed of 0.72psec for NMOSFET and 1.41psec for PMOSFET at 0.85V (Ioff=100nA/um), were obtained. Currently available Hi-NA193-nm lithography with alternating phase shift mask and slimming process provides 30nm gate lengths.

2) Embedded DRAM cell:
High-speed data processing requires a single-chip solution integrating a microprocessor and embedded large volume memory. Toshiba is the only semiconductor vendor able to offer commercial trench-capacitor DRAM technology for 90nm-generation DRAM-embedded System LSI. Toshiba and Sony have utilized 65nm process technology to fabricate an embedded DRAM with a cell size of 0.11um2, the world's smallest, which will allow DRAM with a capacity of more than 256Mbit to be integrated on a single chip.

3) Embedded SRAM cell:
SRAM is sometimes used as cache memory in SoC systems. The Hi-NA193-nm lithography with alternating phase shift mask and the slimming process combined with the non-slimming trim mask process will achieve the world's smallest embedded SRAM cell in the 65nm generation, with an area of only 0.6um2.

4) 180nm Multi layer wiring:
In order to reduce the chip size, it is important to reduce the pitch of the first metal of the lowest layer. The new technology has a 180nm pitch, a 75% shrink from the 90nm generation. To reduce wiring propagation delay and power dissipation, a low-k dielectric material is adopted. The target effective dielectric constant of the interlayer dielectric is around 2.7.

http://www.neoseeker.com/news/articles/headlines/Hardware/2167/

65 nm embedded DRAM

4 December 2002

Toshiba and Sony have announced the world's first 65-nanometer (nm) CMOS process technology for embedded DRAM system LSIs -- a major breakthrough in process technology for highly advanced, compact, single-chip system LSIs that will be only one-fourth the size of current devices while offering higher levels of performance and functionality.

The move to ubiquitous computing -- total connectivity at all times -- relies on high-performance equipment. These in turn require advanced SoC (system on chip) LSIs integrating ultra-high performance transistors and embedded high-density DRAM. In such devices, size and performance levels are directly related to process technology: finer lithography results in smaller devices that offer higher levels of performance. The new process technology announced by Toshiba and Sony and integration to a new level that allows bandwidths to be scaled up and the maximization of system performance

http://www.iee.org/OnComms/CompForum/Forum_News.cfm?ObjectID=A1242A49-8612-438A-A4B4FDB0D640C255

Toshiba & Sony - announced first 65nm process for embedded memories

Tuesday, December 03, 2002

Toshiba and Sony announced the world's first 65-nm CMOS process technology for embedded memories.

The process technology will enable single-chip devices, said to be one-fourth the size of current embedded chips in the market.

The process also enables a 30-nm transistor with the world's fastest switching speeds, as well as the world's smallest cell for embedded DRAM and SRAM.

Toshiba and Sony have utilized 65-nm process to fabricate an embedded DRAM with a cell size of 0.11um2, which will enable a 256-megabit memory to be integrated on a single chip. It also fabricated the world's smallest embedded SRAM cell of only 0.6um2.

The technology will bring the market towards what the companies call "ubiquitous computing," that is, total connectivity at all times, according to Toshiba and Sony.

http://www.simmtester.com/page/news/shownews.asp?num=5080

But for mass production, this chip would only be possible on such process.

I'm not even going to touch this one right now...

Although the idea that they're going to fab a chip on a lithography technology that's 2 generations old after throwing in almost half a Billion dollars with IBM to develop advanced processes strikes me as ignorant... hm.. maybe it's just me <bangs head on wall>

JF_Aidan_Pryde · Jan 17, 2003

LSI = ??? (Large scale integration on dictionary.com)

Panajev2001a · Jan 17, 2003

Although the idea that they're going to fab a chip on a lithography technology that's 2 generations old after throwing in almost half a Billion dollars with IBM to develop advanced processes strikes me as ignorant... hm.. maybe it's just me

no Vince, that is not just you... my thought was about yields, and whether it would be better for Sony to have GREAT YIELDS or to save more money with the latest manufacturing process... I am sure they will choose the best option... maybe by PS3's launch the 65 nm process is going to be mature enough...
