PlayStation III Architecture

marconelly!:

It also said that it's connected via a switched 1024-bit bus. It would be pretty silly to have main memory on such an incredibly fast bus, but make it too small to store everything needed. My guess is that this is more akin to a centralized video memory, or cache area. There is no reason you couldn't have local cache on a per-processor basis, but set up a centralized L2 cache. This would be especially useful if the 64MB cache area is used for information that can be computed in parallel. Say, for example, that you have multiple processors computing lighting vectors for Monte Carlo raytracing for global illumination. You can't simply work on a small area of the scene at a given time without knowing what's happening in other areas of the scene. In this way, each processor needs access to the entire scene's lighting calculations that have been accomplished so far. Thus, each processor could get the relevant information it needs from the central 64MB L2 cache, perform its computations in the 128KB L1, and send the data back to the 64MB L2. You'd need it to be switched to avoid race conditions caused by each processor reading and writing to the memory.
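To make that read/compute/write-back pattern concrete, here's a quick pthreads sketch. It has nothing to do with the actual hardware; all the names and sizes are made up, and the mutex just stands in for the switched bus serializing access to the shared pool:

/* Minimal sketch of the read/compute/write-back pattern described above,
 * using POSIX threads as a stand-in for the hardware. All names and sizes
 * (shared_lighting, LOCAL_SAMPLES, NUM_WORKERS, ...) are hypothetical. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define SCENE_SAMPLES 4096      /* stands in for the shared 64MB "L2" area   */
#define LOCAL_SAMPLES 256       /* stands in for a 128KB per-processor store */
#define NUM_WORKERS   4

static float shared_lighting[SCENE_SAMPLES];      /* central, shared memory  */
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long id = (long)arg;
    float local_buf[LOCAL_SAMPLES];               /* private "L1" working set */

    for (int base = id * LOCAL_SAMPLES; base < SCENE_SAMPLES;
         base += NUM_WORKERS * LOCAL_SAMPLES) {

        /* 1. pull the relevant slice out of the shared area */
        pthread_mutex_lock(&shared_lock);
        memcpy(local_buf, &shared_lighting[base], sizeof(local_buf));
        pthread_mutex_unlock(&shared_lock);

        /* 2. compute locally (placeholder for a real lighting estimate) */
        for (int i = 0; i < LOCAL_SAMPLES; i++)
            local_buf[i] += 0.25f * (float)id;

        /* 3. send the results back to the shared area */
        pthread_mutex_lock(&shared_lock);
        memcpy(&shared_lighting[base], local_buf, sizeof(local_buf));
        pthread_mutex_unlock(&shared_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(t[i], NULL);
    printf("sample 0 = %f\n", shared_lighting[0]);
    return 0;
}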

I personally am pretty confident they'll get the ratio of memory to cache correct, to a certain extent. As you've already mentioned, having a 64MB segment for main memory would be pretty silly, and Sony's engineers aren't that dumb. I'm more interested in how the heck they plan to have a switched 1024-bit memory bus between cells. They certainly must have some talented engineers on staff.

Nite_Hawk
 
I see what you mean, and it's in agreement with the article. 64MB is probably going to be some sort of cache shared among (4?) cells. It won't be 64MB per cell, contrary to what someone mentioned earlier. The actual main memory will be a separate entity.
 
1ST PICTURE?
[attached image: ps3.jpg]

:LOL:
 
marconelly! said:
It goes on to mention there will be four of the cells there. Never does it say 64MB is per cell.

Rapid speculation:

So if the cells, which are analogous to CPUs(?), share the memory, how will this work?

One cell could be allocated to doing transforms on graphics in that memory, and it gets the lion's share of the memory?

While another cell, which is allocated to AI functionality, doesn't bother allocating much of that memory for its workload?
 
He said that the computers are made of cells, each one containing a CPU, which will probably be a PowerPC, and eight APUs (vector processors), each with 128K of memory.

The guess is: one cell = PowerPC + 8 APUs?

That's like an overblown PS2 EE.

It will run at 4GHz, producing a not inconsiderable 256Gflops, with the cells connected to the central 64MB memory through a switched 1024 bit bus.

4 GHz producing 256 GFLOPS for each cell is definitely doable, considering each cell has 8 vector processors.

Now, this central memory arrangement is similar to the one in that IBM cellular article. That 1024-bit bus to the central memory would, I assume, be shared by several cells.

At this moment I don't think each cell will have a 1024-bit path into the central memory. My guess would be that each cell has a 128- or 256-bit bus to the central memory. If it's 256-bit, then there should be 4 cells, which would give 1 TFLOPS.

1 TFLOPS is not 1000 times the PS2. Maybe the new GS, with all its new features, will make up the difference.
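Quick back-of-the-envelope with the article's numbers (256 GFLOPS per cell, four cells) against the EE's well-known 6.2 GFLOPS peak; my arithmetic, not anything from the article:

\[
4 \times 256\ \text{GFLOPS} \approx 1\ \text{TFLOPS},
\qquad
\frac{1000\ \text{GFLOPS}}{6.2\ \text{GFLOPS}} \approx 160\times
\]

So on peak numbers alone it's more like 160x the EE, not 1000x.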

I haven't gone through the patent yet. I'll save it for later.

What happened to the B3D forum's professional integrity? Since when do people around here trust Inquirer articles?

Btw, that article pretty clearly says that 64MB RAM is for all the cells to share. That sounds like a total amount of memory to me, which would be amazingly low for a 2005 device.

The Inquirer's information comes from a patent, so it's not too bad :) Yes, that 64 MB is for several cells to share; this is not a surprise, since that old IBM article stated the advantages of SMP.
 
bryanb said:
So if the cells, which are analogous to CPUs(?), share the memory, how will this work?

One cell could be allocated to doing transforms on graphics in that memory, and it gets the lion's share of the memory?
While another cell, which is allocated to AI functionality, doesn't bother allocating much of that memory for its workload?

I aim to stay away from this, but isn't this difficulty exactly what GRID-based computing is about? I should hope that any implementation of a hardware design as elegant as this would find better solutions to problems than the weak and obvious idea of dividing tasks up as you stated.

If the abstraction is there, who knows, or cares, what the underlying architecture is.

1 TFLOPS is not 1000 times the PS2. Maybe the new GS, with all its new features, will make up the difference.

Not to sound like I'm kissing Sony's ass, but do you have any idea how much 'power' there is in a sustainable TFLOP of computing? You sound like the guys in the 3D forum asking why not 4 or 6 TCUs per pipeline, as if bigger numbers are always better.

Much more important than the big numbers is that there is a substantial amount of onboard memory with extremely low latency and high bandwidth. This is what Diefendorff (sp) talked about in his paper on the future of dynamic media and computing. Add the Yellowstone announcement, which could bring 30Gb/sec of system-level bandwidth on top of the internal bandwidth, and you have a machine that could be quite formidable.
 
Vince said:
1 TFLOPS is not 1000 times the PS2. Maybe the new GS, with all its new features, will make up the difference.

Not to sound like I'm kissing Sony's ass, but do you have any idea how much 'power' there is in a sustainable TFLOP of computing? You sound like the guys in the 3D forum asking why not 4 or 6 TCUs per pipeline - not a good thing.

So the PS2 only manages 1 GFLOP? Because that's what you get when you divide 1 TFLOP by 1,000...

THAT's what he means. It doesn't quite add up mathematically.
 
Tagrineth said:
So the PS2 only manages 1 GFLOP? Because that's what you get when you divide 1 TFLOP by 1,000...

THAT's what he means. It doesn't quite add up mathematically.

How dumb. The comment was talking about the performance of the console as a whole. I have the comment made by Okamoto, and even had the slides which showed this posted.

Not every part of the console has to be 1000X the performance, just the aggregate.

Beyond this talk of he said, she said - sweetie, do you just want to argue? I mean, a TFLOP is roughly 1/30th the computing performance of the world's top-ranked supercomputer. It's about 1/10th that of the most advanced ASCI series the DoD and DoE use - all in a single chip. The costs of an ASCI machine run into the millions of dollars for electricity alone. And they're going to put this in a frickin' game console.

So, to make a comment like "It's not 1000X, so I can bash IBM/SCE/Toshiba and feel good about myself" is a bit retarded.
 
Not to sound like I'm kissing Sony's ass;

But you do sound like that :)

Anyway I went through the patent.

[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed. In a preferred embodiment, local memory 406 contains 128 kilobytes of storage, and the capacity of registers 410 is 128×128 bits. Floating point units 412 preferably operate at a speed of 32 billion floating point operations per second (32 GFLOPS), and integer units 414 preferably operate at a speed of 32 billion operations per second (32 GOPS).

That's how the Inquirer got 4 GHz and 256 GFLOPS. But that's not final, though.
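Presumably they worked backwards from [0068] roughly like this. The two flops per cycle per unit, i.e. one multiply-add, is my assumption; at one flop per cycle you'd need 8 GHz instead:

\[
\frac{32\ \text{GFLOPS per APU}}{4\ \text{FP units} \times 2\ \text{flops/cycle}} = 4\ \text{GHz},
\qquad
8\ \text{APUs} \times 32\ \text{GFLOPS} = 256\ \text{GFLOPS per PE}
\]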

[0071] FIGS. 5-10 further illustrate the modular structure of the processors of the members of network 104. For example, as shown in FIG. 5, a processor may comprise a single PE 502. As discussed above, this PE typically comprises a PU, DMAC and eight APUs. Each APU includes local storage (LS). On the other hand, a processor may comprise the structure of visualizer (VS) 505. As shown in FIG. 5, VS 505 comprises PU 512, DMAC 514 and four APUs, namely, APU 516, APU 518, APU 520 and APU 522. The space within the chip package normally occupied by the other four APUs of a PE is occupied in this case by pixel engine 508, image cache 510 and cathode ray tube controller (CRTC) 504. Depending upon the speed of communications required for PE 502 or VS 505, optical interface 506 also may be included on the chip package.

It also seems to be flexible. Those Attached Processor Units can be varied.
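If you squint, the PE/VS split in [0071] boils down to something like this. The struct and field names are mine; only the composition comes from the patent:

/* Rough sketch of the two chip configurations in [0071]. */
#include <stdint.h>

struct apu {                      /* per [0068]: 128KB LS, 128 x 128-bit regs,
                                     4 FP units, 4 integer units              */
    uint8_t local_store[128 * 1024];
};

struct processor_element {        /* "PE 502"                                  */
    /* PU and DMAC omitted; this just captures the APU complement */
    struct apu apus[8];
};

struct visualizer {               /* "VS 505"                                  */
    /* PU and DMAC omitted; the space of the other four APUs is taken by a
       pixel engine, an image cache and a CRTC (plus an optional optical
       interface), which are not modelled here */
    struct apu apus[4];
};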

[0137] Other dedicated structures can be established among a group of APUs and their associated sandboxes for processing other types of data. For example, as shown in FIG. 27, a dedicated group of APUs, e.g., APUs 2702, 2708 and 2714, can be established for performing geometric transformations upon three dimensional objects to generate two dimensional display lists. These two dimensional display lists can be further processed (rendered) by other APUs to generate pixel data. To perform this processing, sandboxes are dedicated to APUs 2702, 2708 and 2414 for storing the three dimensional objects and the display lists resulting from the processing of these objects. For example, source sandboxes 2704, 2710 and 2716 are dedicated to storing the three dimensional objects processed by, respectively, APU 2702, APU 2708 and APU 2714. In a similar manner, destination sandboxes 2706, 2712 and 2718 are dedicated to storing the display lists resulting from the processing of these three dimensional objects by, respectively, APU 2702, APU 2708 and APU 2714.

A glimpse of the PS3.
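And a toy version of the dedicated pipeline in [0137]: a transform APU reads 3D objects from its source sandbox, writes a 2D display list to its destination sandbox, and a second stage "renders" the list. Everything here (types, math, names) is invented for illustration:

#include <stdio.h>
#include <stddef.h>

struct object3d    { float x, y, z; };
struct displaylist { float sx, sy; };              /* screen-space result */

/* stage 1: geometry transform done by a dedicated APU, sandbox -> sandbox */
static void transform_apu(const struct object3d *src, struct displaylist *dst,
                          size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i].sx = src[i].x / (src[i].z + 1.0f);  /* trivial projection */
        dst[i].sy = src[i].y / (src[i].z + 1.0f);
    }
}

/* stage 2: another APU turns the display list into pixel data */
static void render_apu(const struct displaylist *dl, size_t n, float *pixels)
{
    for (size_t i = 0; i < n; i++)
        pixels[i] = dl[i].sx + dl[i].sy;           /* placeholder raster step */
}

int main(void)
{
    struct object3d    src[2] = { {1, 2, 3}, {4, 5, 6} };  /* source sandbox */
    struct displaylist dst[2];                             /* dest sandbox   */
    float pixels[2];

    transform_apu(src, dst, 2);
    render_apu(dst, 2, pixels);
    printf("pixel[0] = %f\n", pixels[0]);
    return 0;
}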
 
Um, ok.. whatever V3.

Closest thing I've ever heard of would have to be the P10 architecture; you can dynamically form 'pipelines' for specific tasks, using the PU as an arbitrator of individual APUs (which are said to be SIMD-like) operating on data in their sandboxes. I'm not so clear on the 'Software Cell' part...
 
using the PU as an arbitrator of individual APUs (which are said to be SIMD-like) operating on data in their sandboxes.

According to that patent, the PU would have the same ISA. The APUs are more flexible.

- I'm not so clear on the 'Software Cell' part...

Check this part out.

[0119] The present invention also provides a new programming model for the processors of system 101. This programming model employs software cells 102. These cells can be transmitted to any processor on network 104 for processing. This new programming model also utilizes the unique modular architecture of system 101 and the processors of system 101.

[0120] Software cells are processed directly by the APUs from the APU's local storage. The APUs do not directly operate on any data or programs in the DRAM. Data and programs in the DRAM are read into the APU's local storage before the APU processes these data and programs. The APU's local storage, therefore, includes a program counter, stack and other software elements for executing these programs. The PU controls the APUs by issuing direct memory access (DMA) commands to the DMAC.

[0121] The structure of software cells 102 is illustrated in FIG. 23. As shown in this figure, a software cell, e.g., software cell 2302, contains routing information section 2304 and body 2306. The information contained in routing information section 2304 is dependent upon the protocol of network 104. Routing information section 2304 contains header 2308, destination ID 2310, source ID 2312 and reply ID 2314. The destination ID includes a network address. Under the TCP/IP protocol, e.g., the network address is an Internet protocol (IP) address. Destination ID 2310 further includes the identity of the PE and APU to which the cell should be transmitted for processing. Source ID 2314 contains a network address and identifies the PE and APU from which the cell originated to enable the destination PE and APU to obtain additional information regarding the cell if necessary. Reply ID 2314 contains a network address and identifies the PE and APU to which queries regarding the cell, and the result of processing of the cell, should be directed.
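Roughly, a software cell per [0121] would look something like this. The field names follow the patent's description loosely and the widths are my guesses; the comment about local storage paraphrases [0120]:

#include <stdint.h>
#include <stddef.h>

struct routing_info {
    uint32_t header;
    uint32_t destination_id;   /* network address + target PE and APU */
    uint32_t source_id;        /* where the cell came from            */
    uint32_t reply_id;         /* where queries/results should go     */
};

struct software_cell {
    struct routing_info routing;
    /* The body carries the program and/or data. Per [0120] an APU never
       touches DRAM directly: the PU issues DMA commands to the DMAC so the
       cell's contents are staged into the APU's 128KB local storage first,
       along with a program counter, stack, etc. */
    size_t  body_len;
    uint8_t body[];            /* flexible array member for the payload */
};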
 
Ok, but what exactly dictates or decides what data/code/etc. resides within the "body" of the software cell?

I picked up on the global identification system, but I'm still curious how someone would code for this. It's definitely cool, though. I'm also curious as to how far this distributed computing will go - as far as the initial speculation suggested? And how much of this flexibility will be seen by developers; basically, how low-level they will code.
 
Let's say the PS3's Cell or Cells have 1 TFLOPS of computational power, and let's assume that the PS3 is significantly more efficient than the PS2. The PS2 has 6.2 GFLOPS peak but is fairly inefficient; I have heard comments that the PS2 sustains as little as 1 GFLOPS. Could it be that, if the PS3's 1 TFLOPS is going to be sustained performance, the PS3 could be closer to 1000 times the PS2's performance than 1 TFLOPS vs. 6.2 GFLOPS might suggest at first thought?
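Spelling that out (the ~1 GFLOPS sustained figure for the PS2 is the hearsay number above, not a measurement, and a fully sustained TFLOP is a big if):

\[
\frac{1000\ \text{GFLOPS (sustained?)}}{\sim 1\ \text{GFLOPS (PS2 sustained)}} \approx 1000\times,
\qquad
\frac{1000\ \text{GFLOPS}}{6.2\ \text{GFLOPS (PS2 peak)}} \approx 160\times
\]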
 
Ok, but what exactly dictates or decides what data/code/etc. resides within the "body" of the software cell?

I think it depends on the app. This one looks at streaming MPEG.

[0135] FIG. 26B illustrates the steps for processing streaming MPEG data by this dedicated pipeline. In step 2630, APU 2508, which processes the network apulet, receives in its local storage TCP/IP data packets from network 104. In step 2632, APU 2508 processes these TCP/IP data packets and assembles the data within these packets into software cells 102. In step 2634, APU 2508 examines header 2320 (FIG. 23) of the software cells to determine whether the cells contain MPEG data. If a cell does not contain MPEG data, then, in step 2636, APU 2508 transmits the cell to a general purpose sandbox designated within DRAM 2518 for processing other data by other APUs not included within the dedicated pipeline. APU 2508 also notifies PU 2504 of this transmission.

In this example, one APU examines the header for MPEG data. So the APU doesn't know what's in the software cell at first. The determination of what goes into a software cell is therefore not that important, as long as the header describes what it is.
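In code, that dispatch step from [0135] would look something like this. The enum values and function names are invented; only the branch logic (keep MPEG cells in the dedicated pipeline, park everything else in a general-purpose sandbox and notify the PU) comes from the patent:

#include <stdio.h>

enum payload_type { PAYLOAD_MPEG, PAYLOAD_OTHER };

struct cell_header { enum payload_type type; int dest_apu; };

static void dispatch(const struct cell_header *h)
{
    if (h->type == PAYLOAD_MPEG) {
        /* stay in the dedicated pipeline: forward to a decoding APU */
        printf("MPEG cell -> decoding APU %d\n", h->dest_apu);
    } else {
        /* step 2636: hand off to a general-purpose sandbox, notify the PU */
        printf("non-MPEG cell -> general purpose sandbox (PU notified)\n");
    }
}

int main(void)
{
    struct cell_header a = { PAYLOAD_MPEG,  3 };
    struct cell_header b = { PAYLOAD_OTHER, 0 };
    dispatch(&a);
    dispatch(&b);
    return 0;
}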
 
Another possibility (of many possibilities) is that the graphics processor, Graphics Synth 3, has its own 1 TFLOP 4-cell chip for geometry & lighting calculations.

Also, it's very likely, almost guaranteed, that Graphics Synth 3 will be completely floating-point throughout its pipelines, like the current ATi R300 and Nvidia NV30, thus adding more performance to the equation, even if Graphics Synth 3 does not have its own Cell for geometry & lighting.
 
Another possibility (of many possibilities) is that the graphics processor, Graphics Synth 3, has its own 1 TFLOP 4-cell chip for geometry & lighting calculations.

The possibility is that one of the Processor Elements would have some of its APUs replaced by a pixel engine, as in the VS described above. So they might have a one-chip solution, which would be good for cost.
 