CELL Patents (J Kahle): APU, PU, DMAC, Cache interactions?

Kazuaki Yazawa and his Thermal Engineering Group at Sony have been working on the Cell project since 2001. Looking at their other projects, they seem to have an awesome track record.
 
PC-Engine said:
You could make it look like a power amplifier for car use and add a disc tray; however, the exposed fins will get VERY hot and I don't think SONY is ready for lawsuits from consumers getting burned by accidentally touching them.

A case like this will not be cheap. Of course you could replace the other parts with plastic but then the cooling efficiency goes down.

http://www.hushtest.de/hushshop/shop/shop_img/big_img/silber_kl_3d.jpg

Yeah...that's the type of case I was talking about. An integrated case / heatpipe / fin design. Apparently those Hush PCs run up to 2.8 GHz P4s without any fans whatsoever! 8) Those external fins must be cool enough not to get your fingers burnt, otherwise they wouldn't be on sale I suppose. And that aluminium case sure doesn't look cheap but I do like that minimalist, industrial look! 8)

Vince said:
Kazuaki Yazawa and his Thermal Engineering Group at Sony have been working on the Cell project since 2001. Looking at their other projects, they seem to have an awesome track record.

Thanks for the info...apparently he also has this patent under his name,

Source: Heat dissipating structure for an electronic device

heat-case1.jpg


and

heat-case2.jpg


Heat dissipating structure for an electronic device

Abstract

A heat dissipating structure for an electronic device includes a heat source and a heat dissipating member. The heat dissipating member has an inner wall, an outer wall, and a plurality of partition walls. The inner wall receives heat transfer from the heat source. The outer wall opposes the inner wall at a distance. The partition walls connect the inner wall and the outer wall, and together with the inner wall and outer wall define a plurality of through-holes which are approximately the same shape and are aligned at roughly regular intervals along the inner wall or the outer wall. The through-holes are arranged along the vertical direction that allows most effective utilization of gravitational influence, and are open to the outside at the upper and lower ends thereof.

That case definitely fits the mould of a console case / desktop case. It has a highly conductive honeycomb wall design. It also mentions external fins, and the second embodiment looks like a cylindrical chimney! If Kazuaki Yazawa is indeed working on Cell, then his patents could be applied to PS3. Imagine a smoking, chimney-shaped PS3! :oops:
 
In a console you are free to spread a heatsink over the entire area of the unit, unlike a PC where it would block PCI slots etc., so your effective surface area for heat dissipation is much greater.

With regards to the Hush PC: the whole chassis is aluminum, not just the fins. That's a lot of surface area. Also, the 2.8GHz version is a lot bigger than the mini-ITX version, which only goes up to about 1GHz.
 
For what it's worth:

http://bizns.nikkeibp.co.jp/cgi-bin...H_KIJIID=98468&NSH_CHTML=asiabiztech.html

Old article, but relevant. (Cut out parts to improve relevancy)

April 3, 2000 (TOKYO) -- What's inside the PlayStation2?

Nikkei Electronics disassembled a PlayStation2, which has been attracting attention as a new technology driver alternative to personal computers, on the first day of its release, with cooperation from five specialists in a variety of technical fields.

The specialists who "dissected" the game console consisted of engineers specializing in designing audiovisual equipment, personal computers, interconnection and packaging for mainframe computers, heat transfer and printed circuit boards (PCBs). The specialists analyzed the internal structure of the PlayStation2 from the viewpoints of heat dissipation, electro-magnetic interference (EMI) and cost reduction.

Mainboard is Shielded by Metallic Plate

PlayStation2's body is made of plastics. A metallic body would have been the first choice to prevent EMI. PlayStation2, however, adopts ABS resin for its body.

Sony occasionally develops technologies to commercialize a new product after having decided its exterior. It is likely that the company developed the PlayStation2 with the first priority being its design and texture.

The specialists' interest was centered on PlayStation2's EMI countermeasures, and they got the answer after removing the plastic lid. The upper and lower sides of the mainboard are shielded by metallic plates, and a lot of bumps on the plates contact the ground surface of the PCB. The shielding is almost the highest level ever, said a PC design engineer.

Also, shield plates are applied to the inside of the plastic body. To sum up, PlayStation2 has a plastic body, but it is double clad in metallic plates.

Large Heat Sink

The specialists found a large heat sink when they took the mainboard out of the body, after removing the shield plate. (Right: inside the shield plate, the heat sink covers the mainboard. The heat sink weighs 370g, which is relatively heavy.) The heat sink plays the leading role in heat dissipation. It has 129 bumps to boost the cooling effect, as well as slits through which cooling air is conducted.

The heat sink has a special solid construction cast in aluminum. It was formed in a mold featuring so complicated a design that only a product with marketability of millions or tens of millions of units could afford to use it, said the specialist in heat transfer technologies.

To absorb heat, heat pipes are incorporated at the locations where the Emotion Engine and Graphics Synthesizer microchips are mounted. To top it all, a high-performance, heat-conductive silicon rubber sheet is laid between the heat pipes and the PCB. "Metallic feelers of coarse grains, which were blended with other materials to raise heat conduction, were found at cross sections of the silicon rubber. The silicon rubber is likely to be specially developed for PlayStation2," the heat transfer specialist said.

DVD-ROM Device Has No Electric Connection

An electric power source unit and a DVD-ROM device were disassembled. The power source unit is a conventional type with a voltage output of 8.5 VDC at no load. A DC-to-DC converter installed at the back side of the PCB generates the various voltages necessary.

SCE developed the DVD-ROM device on its own for use in PlayStation2. It incorporates a two-wavelength laser pickup device for playback of CD and DVD disks.

No components were found in the DVD-ROM unit, except an IC microchip installed on a board inside the device, which was likely to be for servo control.

Components mounted on the back side of the mainboard were supposed to be an analog-to-digital converter and a DSP circuit. The DVD-ROM device does not need any cooling because components causing heat have been shifted to the mainboard featuring the unique cooling system.

To summarize:

- Two big metallic plates that sandwich the PCB
- Large, custom-built heat sink
- Heat pipes for critical ICs
- Special silicon rubber to wick away heat
- Power systems moved to other areas to facilitate cooling.

Personally, I think cooling the PS3 will be relatively trivial for Sony engineers. I'm more worried about decent compilers than heat sinks.

(EDIT: And what does any of this have to do with the original topic?! *shrug*)
 
IIRC the old 8-bit SEGA Master System and NES had EMI shielding similar to PS2. Also heatpipes filled with coolant were used to cool the chips in DC. Anyone know the power dissipation of the first EEs and GSs?
 
More offtopic :p, but interesting food for thought on the 1-teraflop performance rant: a chip of 230 Gflops @ 350MHz and a petaflop supercomputer project...

http://news.com.com/2100-7337_3-5322558.html
(some of the links are mine)

Japan designers shoot for supercomputer on a chip

Published: August 24, 2004, 1:42 PM PDT
By Michael Kanellos Staff Writer, CNET News.com

PALO ALTO, Calif.--Chip designers at Japan's RIKEN say you can get a lot done by specializing.

RIKEN, an anglicized acronym for Japan's Research Institute of Physical and Chemical Research, described on Tuesday the MDGrape 3, a processor it thinks will become the cornerstone of a computer capable of operating at a petaflop, or a quadrillion operations per second--far faster than the 36-trillion-operations-per-second supercomputers of today.

Samples of the chip, which was designed for life sciences research, can now perform 230 gigaflops, or 230 billion operations per second, while running at 350MHz, better than standard general-purpose chips. In a worst-case scenario, the chip performs 160 gigaflops at 250MHz, said Makoto Tanji, a researcher with RIKEN's high-performance computing group. Tanji spoke at the Hot Chips conference taking place at Stanford University.

The computational power comes, he said, because the chip is specialized for workloads that involve numerous, similar calculations on a comparatively small set of data. This sort of workload is common in the life sciences and bio-nanotechnology field, where researchers need to examine, for example, how a single protein interacts with thousands of different molecules. Consequently, the chip and the computers based on it can be directly compared with general purpose supercomputers only in a limited field, but the processor excels there.

"We can obtain about a 100 times better performance through specialization. The number of operations are more limited on a general purpose computer," Tanji said. For the MDGrape 3 to shine, "the amount of computation must be much larger than the data," he added.

The University of Tokyo initiated the MDGrape project 15 years ago to develop a chip for astrophysics. RIKEN, which is one of the world's largest biosciences institutes, has worked over the last several years to extend the chip's architecture to life sciences and molecular dynamics because the range of applications is wider, Tanji explained. The group will create computers based on the chip for its Protein 3000 project to determine the characteristics of 3,000 proteins. Those machines should appear sometime in 2007.

Commercial systems using the MDGrape 2, which can churn at 16 gigaflops and runs at 100MHz, are currently on the market, Tanji said. Work on the MDGrape 3, also known as the Protein Explorer, began in 2002, and the chip should start to be used to run applications in 2006.

Research also continues at the University of Tokyo to develop a quasi general purpose chip capable of 1 teraflop, or a trillion calculations a second. IBM and the University of Texas have a similar teraflop-on-a-chip project.

Architecturally, the MDGrape 3 differs substantially from most other chips. It comes with 20 pipelines for calculations, the equivalent of an assembly line for a processor. Commercial chips typically have one or two. The chip also features what RIKEN calls a broadcast memory architecture, where data is force-fed to the different pipelines simultaneously. Parallelization, a design convention that aims to cut down on redundant or parallel calculations, is optimized in the chip's design.

Despite the differences from other chips, the MDGrape 3 is built on the 130-nanometer process, a manufacturing convention that has been in place for the past few years.

The enhancements lead to huge advantages over general purpose processors. Tanji said the 350MHz Grape 3 can provide a gigaflop of computing power for $15, compared with $400 per gigaflop for a Pentium 4, $640 per gigaflop for the chips inside IBM's Blue Gene/L and a whopping $4,000 per gigaflop from NEC's Earth Simulator, currently the world's most powerful supercomputer.

In terms of power consumption, the 350MHz MDGrape 3 consumes 14 watts of power, or 0.1 watts per gigaflop. A 3GHz Pentium 4 runs at 82 watts, or 14 watts per gigaflop, he said. The Blue Gene/L chip and Earth Simulator come in at 6 and 128 watts, he said.

RIKEN is also designing the computer that will house the MDGrape 3. Twelve chips will fit on a board, while two boards will fit into a 2U-high box (3.5 inches). The chips are all connected to each other through an 81-bit bus, and the boards are connected to the rest of the computer through PCI Express.

The petaflop computer will consist of 6,144 processors on 512 boards clustered together. In all, the system will fit into 32 boxes that will stand on 19-inch pedestals.

"It is very small," Tanji said.
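The article's headline numbers are easy to sanity-check. Here's a quick Python scribble; note the per-pipeline breakdown at the end is my own back-of-envelope derivation, not an official spec:

```python
# Back-of-envelope checks on the MDGrape 3 figures in the article above
# (peak rate, clock, pipeline count and cluster sizes are the article's;
# the per-pipeline figure is my own derivation).

peak_flops = 230e9       # 230 Gflops, best case
clock_hz = 350e6         # 350 MHz
pipelines = 20

ops_per_cycle = peak_flops / clock_hz            # chip-wide ops per cycle
print(round(ops_per_cycle))                      # ~657 ops per cycle
print(round(ops_per_cycle / pipelines))          # ~33 per pipeline

# And the petaflop cluster: 12 chips/board x 512 boards.
chips = 12 * 512
print(chips)                                     # 6144, as stated
print(chips * peak_flops / 1e15)                 # ~1.41 petaflops
```

So the chip is doing roughly 33 ops per pipeline per cycle, which shows just how wide those specialised pipelines are, and the full cluster lands comfortably past the 1-petaflop goal.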
 
nondescript said:
......
Personally, I think cooling the PS3 will be relatively trivial for Sony engineers. I'm more worried about decent compilers than heat sinks.

(EDIT: And what does any of this have to do with the original topic?! *shrug*)

Cooling may be trivial, but I doubt it would be for a BE approaching anything near 4 GHz! ;)

I agree about the compiler, and I've mentioned in this thread and others that it would need to be the smartest component in the whole Cell architecture.

There was an earlier brief discussion on the compiler in this thread with Panajev. It will prolly be some type of VLIW compiler, and maybe even a JIT-type compiler to maximise mapping of code to Cell hardware, similar to how Java works?

Btw, when was the last time you saw a thread remain completely on topic! :D
 
one said:
More offtopic :p, but interesting food for thought on the 1-teraflop performance rant: a chip of 230 Gflops @ 350MHz and a petaflop supercomputer project...
....

OMG, Cell is pwned by a chip named after a grape! :oops:

Apparently the birds are singing ~256 Gflops...but maybe those birds might wanna clarify if that's the PS3 CPU, or one Cell (PE), or the BE, or just more info really, hint hint! ;)

Depends how they measure their flops; specialised chips can reach high flops, like this 3Dlabs Wildcat claiming 700 Gflops (350 per chip),

http://www.3dlabs.com/press/detail.asp?ref=71&prreg=17

Also this point was raised earlier in the thread,

Jaws said:
passerby said:
My current guess:
Each APU is just in the sub-1GHz regions - maybe even much lower. How they claim high GF numbers - well we'll wait and see. After all we don't know much about what an APU is - just lots of papers about how APUs deploy and work together, but nothing technical on an APU itself.

Sub-1 GHz is quite low ... is that for 32 APUs?

Yeah...we've kinda postulated that with an APU at 4GHz we'd get 8 flops per cycle, but this hasn't really been confirmed. What if they could do 16 or 32 flops per cycle? Then we wouldn't need to aim as high as 4 GHz! 8)

The Cell patents mention in the preferred embodiment that each APU, with its 4 floating-point SIMD units, is 32 Gflops. We've assumed they're 4 FMACs giving 8 ops/cycle >>> 4 GHz... what if they're something different? Then we wouldn't need a higher clock for the same flops.

These cores will be designed to churn 16 operations per clock cycle each, for a total of 64 operations per clock cycle. The prototype chip is expected to operate at 500MHz, which means its internal clock should complete 500 million cycles per second. That adds up to about 32 billion operations per second, theoretically

Here's another way to get 32 Gflops per core, a coincidence :?:

Source from one of your links...
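The quoted figure does check out as straight arithmetic (4 cores at 16 ops/cycle each, per the excerpt):

```python
# Sanity check on the excerpt: 4 cores x 16 ops/cycle at 500 MHz.
cores = 4
ops_per_cycle = 16
clock_hz = 500e6

total = cores * ops_per_cycle * clock_hz
print(total / 1e9)    # 32.0 -- the "about 32 billion operations per second"
```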
 
Hee, back to this again! :D

I think me and Jaws are in the same 'guessing camp'. (Right? :oops: )

APU clockspeeds relatively unimpressive, but something very specialized to account for the GF claims.
 
passerby said:
Hee, back to this again! :D

I think me and Jaws are in the same 'guessing camp'. (Right? :oops: )

APU clockspeeds relatively unimpressive, but something very specialized to account for the GF claims.

Trying to play 'devil's avocado' here! :p ...But yeah, the concept is SIMD FPUs capable of more than 2 ops per cycle per unit, so that a higher clock isn't required and therefore the cooling requirement isn't so intense. I don't think anything like that currently exists, so it would have to be a new type of design :?
 
This info would shift some of my assumptions. Firstly, I'm not changing my opinion on beastly die sizes for the BE and GPU, ~300mm2, and then a quick drop to 45 nm. The assumption I'm changing now, IMO, is that the PS3 chipsets released to the general public would be based on the 45 nm process, and hence the launch date of the PS3 will be tied to mass production of that process being ready. A release date of Q1 2006 would then be more likely than Q4 2005 if they can get 45 nm off the ground. Chipsets produced on 65 nm would likely find their way into Cell workstations and PS3 TOOL devkits and possibly other devices.

Launching at 45 nm would have many advantages in realising the BE and GPU. IIRC, the capacitor-less eDRAM from Toshiba was designed for the 45nm process and could be employed for large amounts of eDRAM on the BE and GPU (64 + 64 MB) at reasonable die sizes. It would also mean less power consumption, higher clocks would be feasible, and a BE at 2 GHz+ may not be such a pipe dream!

I was under the assumption, perhaps an incorrect one, that 45 nm chips would not be possible for mass production until Q1 2007 at the soonest.
 
Megadrive1988 said:
.....
I was under the assumption, perhaps an incorrect one, that 45 nm chips would not be possible for mass production until Q1 2007 at the soonest.

Where did you get Q1 2007? Well this was from Toshiba's press release,

Sony and Toshiba to Develop 45-nanometer Process Technologies For Next Generation System LSI

12 February, 2004

Sony Corporation
Toshiba Corporation

Tokyo -- Sony Corporation and Toshiba Corporation today announced that they would collaborate in the development of highly advanced 45-nanometer (nm) process and design technologies for next-generation system LSI. Under the terms of an agreement, the two companies will take their successful development of 65nm process technologies to the next level, with positive results expected in 2005.

Sony and Toshiba have worked together to pioneer IC process technology since May 2001, in a collaboration that has resulted in co-development of cutting-edge 65nm design process that will soon be applied to the sample products. The companies have decided to build on this achievement and to apply the design know-how and cutting-edge technologies gained from developing the 65nm process to next generation 45nm process technology.

Sony and Toshiba signed the joint development agreement in Tokyo and it calls for completion of the project by late 2005, with the ultimate goal of being first to market with 45nm know-how. The project will have a budget of 20-billion yen, to be shared by both companies, and approximately 150 engineers from the two companies are expected to work on the project at Toshiba’s Advanced Microelectronics Center in Yokohama, Japan and Oita Operations in Kyushu island of Japan.

Continued advances in digitization fuel demand for the ability to access, process, save and enjoy increasingly rich data sources. That in turn is driving demand for system LSI that combines increased miniaturization with enhanced functionality, faster operating speeds and lower power consumption. Sony and Toshiba will position themselves in the vanguard of meeting these demands through the industry-leading development and deployment of 45nm design technologies.

http://www.toshiba.co.jp/about/press/2004_02/pr1201.htm

Also, below is a recent link that mentions some of the technical problems Toshiba have overcome for 45nm,

http://www.toshiba.co.jp/about/press/2004_06/pr1601.htm
 
More guesses from me. :devilish:
Maybe I can claim bragging rights when they finally reveal the thing, 3-6 months later. :devilish: :devilish:

guess #1
Unimpressive clockspeeds. But each APU has a built-in hardwired component that performs graphics-related tasks, such as clipping, culling, etc. Integrated in such a way that allows counting it as adding to the GFs, similar to how graphics cards count their GFs. Even if this guess is wrong, at the very least such a component exists in one form or other in the final product, since even the PSP carries something of the sort.


guess #2
1 vector processor, executes instructions on 4 sets of 4x32-bit registers in a cycle. Which means that the APU must be able to transfer 512 bits of data to the registers from APU local memory/cache really, really fast to be useful (512-bit wide bus for APU local mem?). Not all code will benefit from this. But if a large portion (?%) of executed gaming code can be constructed into an SIMD operation of this form, it will be useful. Of course, to be really useful it is highly desirable that the 'CPU core' of the APU can interact well with this VP, so that complicated instructions can be more easily executed. BTW, with 4x32-bit registers, we can store all vertices of a triangle plus its normal. (Obviously I speak as one who is an outsider to the discipline of 3D graphics. :oops: ) This APU will be able to hit 32GF at 1 GHz. Of course, if STI can guarantee really high mem <-> register speed and a more aggressive VP, they may be able to deliver the numbers at 500MHz....
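Running guess #2 through the numbers (my assumption here: each 32-bit lane does a fused multiply-add, i.e. 2 flops per cycle; the guess doesn't say):

```python
# Working out guess #2's figures (2 flops/lane is my assumption).
sets = 4
lanes_per_set = 4            # 4 x 32-bit registers per set
flops_per_lane = 2           # mul + add
clock_hz = 1e9

bits_per_cycle = sets * lanes_per_set * 32
gflops = sets * lanes_per_set * flops_per_lane * clock_hz / 1e9

print(bits_per_cycle)   # 512 -- hence the suggested 512-bit local-mem bus
print(gflops)           # 32.0 at 1 GHz, as claimed
```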


OK, time to stop babbling.
 
passerby said:
1 vector processor, executes instructions on 4 sets of 4x32-bit registers in a cycle...Not all code will benefit from this. But if a large portion (?%) of executed gaming code can be constructed into an SIMD operation of this form, it will be useful.
It wouldn't.
Matrix SIMD like that would be massively redundant most of the time, and the narrow scope of problems that would benefit from it just doesn't justify it - even if APUs were limited to nothing but graphics processing.

If you really wanted to increase APU throughput, maybe you could try multithreading them, with multiple vector execution units(say 4, like you suggested). Don't know what that would do for complexity of each APU though.
 
Fafalada said:
passerby said:
1 vector processor, executes instructions on 4 sets of 4x32-bit registers in a cycle...Not all code will benefit from this. But if a large portion (?%) of executed gaming code can be constructed into an SIMD operation of this form, it will be useful.
It wouldn't.
Matrix SIMD like that would be massively redundant most of the time, and the narrow scope of problems that would benefit from it just doesn't justify it - even if APUs were limited to nothing but graphics processing.

If you really wanted to increase APU throughput, maybe you could try multithreading them, with multiple vector execution units(say 4, like you suggested). Don't know what that would do for complexity of each APU though.

Okay, done some digging as things don't seem to be adding up...

APU.jpg


Source: Processing modules for computer architecture for broadband networks


APU-BW.jpg


[0070] APU 402 further includes bus 404 for transmitting applications and data to and from the APU. In a preferred embodiment, this bus is 1,024 bits wide. APU 402 further includes internal busses 408, 420 and 418. In a preferred embodiment, bus 408 has a width of 256 bits and provides communications between local memory 406 and registers 410. Busses 420 and 418 provide communications between, respectively, registers 410 and floating point units 412, and registers 410 and integer units 414. In a preferred embodiment, the width of busses 418 and 420 from registers 410 to the floating point or integer units is 384 bits, and the width of busses 418 and 420 from the floating point or integer units to registers 410 is 128 bits. The larger width of these busses from registers 410 to the floating point or integer units than from these units to registers 410 accommodates the larger data flow from registers 410 during processing. A maximum of three words are needed for each calculation. The result of each calculation, however, normally is only one word.

Okay...384 bits feeding the APU (3 words) and the output of the APU is 128 bits (1 word)....and remember the Floating and Integer units cannot work concurrently...

So... if the SIMD units in the APU work on 32-bit chunks, that's a maximum of 12 inputs and 4 outputs per cycle per APU.

If we focus on the FPUs, we've previously assumed there would be 4 FMACs, each doing a MADD (mul+add) instruction (2 ops), so a total of 8 ops per cycle per APU. Now looking at the above, if we're to have a different type of instruction (3 ops), say a D-M-ADD (division, multiplication and addition), and a new type of FPU, an F-D-M-AC, that could execute it, then we'd get 12 ops per cycle per APU...

then in order to achieve 32 Gflops per APU,

each APU speed = 32/12 ~ 2.7 GHz and not 4 GHz!

So a 2.7 GHz BE would achieve 1 TFlop...?
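To spell out the arithmetic behind that (the 3-op unit is pure speculation, as above; the 32 GF/APU and 32-APU figures are from the patents):

```python
# The numbers behind the hypothetical 3-op-per-unit idea.
fpus = 4
ops_per_fpu = 3               # speculative "D-M-ADD" instruction
target_flops = 32e9           # 32 GF per APU, from the patents

ops_per_cycle = fpus * ops_per_fpu            # 12
clock_hz = target_flops / ops_per_cycle

apus = 32                                     # per BE in the patents
be_flops = apus * target_flops

print(ops_per_cycle)              # 12
print(round(clock_hz / 1e9, 2))   # ~2.67 GHz rather than 4 GHz
print(be_flops / 1e12)            # 1.024 -- just past 1 Tflop for the BE
```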


Any thoughts on this?
 
FMACs are assumed because they are actually useful operations.

What do you intend to accomplish with an FDMAD?

Also an FMAD is

Ans = Arg1*Arg2+Arg3

at 128 bits/argument I count 384 bits of input and 128 bits of output... Hmmm, a lot like your diagram.
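That bit accounting, spelled out (the 4x32-bit vector layout is the one assumed throughout this thread):

```python
# FMAC bit accounting: Ans = Arg1*Arg2 + Arg3 over a vector of
# four 32-bit floats.
lanes = 4
bits_per_lane = 32
args_in = 3           # Arg1, Arg2, Arg3
args_out = 1          # Ans

bits_in = args_in * lanes * bits_per_lane
bits_out = args_out * lanes * bits_per_lane

print(bits_in)    # 384 -- matches the register->FPU bus in the patent
print(bits_out)   # 128 -- matches the FPU->register bus
```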
 
Thanks everyone for the very useful points raised. I'm still trying to find a good explanation as to how we can deliver 32GF on an APU without crazy clockspeeds - which seems antithetical to the CELL idea itself. Isn't it a fundamental idea that CELL can deliver high performance without the need for crazy clockspeeds and monstrous CPU designs? That we can deploy many relatively 'simple' CPUs that collaborate to deliver very great performance.

But there it is, in the patent paper itself that Jaws referenced: 384 bits in. It's good, keeping the FPUs well fed, but we are still stuck with the 4GHz explanation.

Unless, as Faf suggested, there are multiple 4xFPUs. Suppose there are 4, and each has a 384-bit bus into it. I only wonder if the 256-bit connection between the registers and local mem is enough to feed them? If VP code has the property of having data stay in the registers *just long enough*, it should suffice.
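Running Faf's multiple-unit idea through the numbers (my assumptions: four 4-wide FMAC units, 2 flops per lane; none of this is from the patent):

```python
# If an APU had 4 four-wide FMAC units, as suggested above.
units = 4
lanes = 4
flops_per_lane = 2          # fused multiply-add
target_flops = 32e9         # 32 GF per APU

flops_per_cycle = units * lanes * flops_per_lane   # 32
clock_hz = target_flops / flops_per_cycle

print(flops_per_cycle)      # 32
print(clock_hz / 1e9)       # 1.0 -- 32 GF at just 1 GHz
```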

As for whether such a thing is possible, I think there is a reasonable chance. After all, the EE is, simply put, a CPU + 10 FPU implementation (right? can't recall). 6 years later, is it too much to expect a CPU + ~16-20 FPU? Of course the important question is how well this team is integrated to work together. Think about how difficult it is to get useful work out of VP0 on the EE.

Out of guesses now. Visit this thread again at CELL's unveiling.
 
passerby said:
Isn't it a fundamental idea that CELL can deliver high performance without the need for crazy clockspeeds and monstrous CPU designs?

No, the fundamental idea behind Cell is to provide one single architecture with a common make-up and instruction set that scales from rock bottom to cutting-edge high-end performance...

The broadband engine will be a monstrous chip no matter what, that much is certain. What we're arguing over (:lol:) is exactly HOW monstrous it will be! :)
 