PS3 Visualizer

* Hiroshige Goto's Weekly Overseas News *
The true shape of "Cell", the heart of SCEI's PlayStation 3
- Performance of 1 TFLOPS


- The overwhelmingly powerful Cell processor

"Cell", the CPU of the PlayStation 3 due in 2005, is presumed to carry 32 sub-processors on a single chip and, with its 256 arithmetic units, to be able to operate on up to 1,024 32-bit data items in parallel. Its peak floating-point performance will probably reach 1 TFLOPS (tera floating-point operations per second), on a par with the supercomputers of a little while ago. In arithmetic performance it is thought to far surpass x86 CPUs of the same period, such as Intel's "Nehalem". However, the chip's transistor count will probably also rise sharply, from several hundred million to the 1 billion level.

The PlayStation 3's graphics chip is probably also based on Cell technology. The Cell's external memory is assumed to be Rambus's "Yellowstone" DRAM, and the chip-to-chip interconnect is assumed to use Rambus's "Redwood" technology. In addition, the I/O chip (excluding networking) is presumed to carry a compatibility chip that integrates the PlayStation 2's core chipset, the "Emotion Engine" and the "Graphics Synthesizer", into one chip.

Sony Computer Entertainment (SCEI) has not yet revealed any technical outline of the next-generation PlayStation. However, several of the patents concerning Cell that SCEI has filed in Japan and the United States have begun to be published, and from their contents the outlines of the PlayStation 3 and of the Cell network surrounding it are becoming clear.

According to the patent documents, Cell is an extremely scalable architecture. As has been said before, it can cover a wide range beyond the PlayStation 3, from PDAs to servers. The architecture also allows Cells of different configurations to be built for each use.

Because of that, it is difficult to pin down the configuration of the PlayStation 3's Cell completely from the patent documents. But if the configuration that the patent treats as ideal, in other words its preferred embodiment, is assumed to be the one for the PlayStation 3, a good many things fit together. So here we will make that assumption and try to infer the structure of the PlayStation 3.

- The Cell architecture, like a matryoshka doll

The defining hardware feature of the Cell processor's architecture is "nesting". Like a matryoshka doll, the processors are contained one inside another in a layered structure.

The basis of Cell is a CPU core called the "Processor Element (PE)". This is the smallest unit that can operate on its own. In the preferred example of the Cell processor in the patent documents, four of these PEs are included on one chip. It is quite possible that the PlayStation 3's Cell has this structure too. In terms of ordinary CPUs, you can think of it as a multi-core configuration carrying four CPU cores. Of course, a Cell processor with a single PE is also conceivable, for portable devices and the like. The PEs inside a Cell are connected by a bus called the "Broadband Engine (BE) Bus".

Where Cell differs from an ordinary CPU is that each PE also carries several subordinate processor units called "Attached Processing Units (APUs)". Besides these, the PE contains a "Processing Unit (PU)", which controls the APUs, and a "Direct Memory Access Controller (DMAC)", which handles memory access. The PU controls the APU group using internal commands called "APU Remote Procedure Calls (ARPCs)". In PC terms, it is a thread-parallel CPU in which each APU processes its own thread, and the PU can be seen as the thread scheduling unit.

The number of APUs built into one PE is not fixed. In the patent's example, however, eight APUs per PE is given as preferred, and it is quite possible that the PlayStation 3's Cell has this structure. Configurations with four or six APUs are also conceivable, depending on the device. The PU, the DMAC, and the APUs inside a PE are connected by a bus called the "Local PE bus".

Each APU contains several arithmetic units. In the patent's example, four floating-point units and four integer units are provided, and both are preferably SIMD (Single Instruction, Multiple Data) units. In addition, each APU carries 128 128-bit registers (shared between floating point and integer?) and 128 KB of local memory.

Incidentally, the buses between the arithmetic unit group and the registers are 384 bits wide from the registers to the units and 128 bits wide from the units to the registers. In other words, the four arithmetic units can perform three reads and one write in parallel against the 128-bit registers. This is the same for the floating-point unit group and the integer unit group.

From the fact that the bus width and register width are 128 bits, the APU is evidently assumed to be 128-bit-wide SIMD for both floating point and integer. Typically that means 32 bits × 4; in other words, for floating point it can presumably compute on four single-precision values at once. In short, it is the same type of unit as the SSE2 unit of an x86 CPU or the programmable shader of a GPU.
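To make the "32 bits × 4" reading concrete, here is a minimal Python sketch of one 128-bit register holding four single-precision lanes, with a lane-wise add standing in for a single SIMD instruction. The names `pack_register` and `simd_add` are made up for illustration; nothing here comes from the patent.

```python
import struct

LANES = 4        # assumed: four lanes per register, as the article infers
LANE_BITS = 32   # single-precision float


def pack_register(values):
    """Pack four single-precision floats into one 16-byte (128-bit) value."""
    if len(values) != LANES:
        raise ValueError("expected exactly four lanes")
    return struct.pack("<4f", *values)


def simd_add(reg_a, reg_b):
    """Add two packed registers lane by lane, as one SIMD instruction would."""
    a = struct.unpack("<4f", reg_a)
    b = struct.unpack("<4f", reg_b)
    return struct.pack("<4f", *(x + y for x, y in zip(a, b)))


r1 = pack_register([1.0, 2.0, 3.0, 4.0])
r2 = pack_register([0.5, 0.5, 0.5, 0.5])
register_bits = len(r1) * 8                      # 16 bytes -> 128 bits
result = struct.unpack("<4f", simd_add(r1, r2))  # four sums from one "instruction"
print(register_bits, result)
```

One call to `simd_add` yields four results, which is why the article counts four data items per arithmetic unit per clock.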

[Figures, each also available as a PDF: Processor Element (PE); PlayStation 3 Block Diagram; Attached Processing Unit (APU); PlayStation 3 Main Chip; PlayStation 3 Main Chipset]

- The Cell performs 512 operations in parallel

So if we assume the PlayStation 3's Cell has the same configuration as the patent's preferred example, what does its performance come to? First, let us look at the degree of arithmetic parallelism.

Suppose each Cell consists of four PEs, each PE has eight APUs, each APU has four floating-point units and four integer units, and each arithmetic unit is SIMD and can normally operate on four data items. The figures then come out as follows.

PEs per Cell                                   4
APUs per Cell                                  32
Floating-point units per Cell                  128
Integer units per Cell                         128
32-bit floating-point parallelism per Cell     512
32-bit integer parallelism per Cell            512

The calculation is 4 data × 4 arithmetic units × 8 APUs × 4 PEs = 512. So per clock, the Cell can perform at most 512 32-bit floating-point operations and 512 32-bit integer operations simultaneously.
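The counts above can be reproduced with a few lines of Python. This is only a sketch of the article's arithmetic; the class name `CellConfig` and its field names are invented here, and the defaults are the article's assumed PlayStation 3 layout, not confirmed specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellConfig:
    """Unit counts for one Cell; defaults follow the article's assumed layout."""
    pes_per_cell: int = 4       # Processor Elements on the chip
    apus_per_pe: int = 8        # Attached Processing Units per PE
    fp_units_per_apu: int = 4   # floating-point units per APU
    int_units_per_apu: int = 4  # integer units per APU
    simd_lanes: int = 4         # 32-bit values per 128-bit SIMD operation

    @property
    def apus(self) -> int:
        return self.pes_per_cell * self.apus_per_pe

    @property
    def fp_units(self) -> int:
        return self.apus * self.fp_units_per_apu

    @property
    def int_units(self) -> int:
        return self.apus * self.int_units_per_apu

    @property
    def fp_parallelism(self) -> int:
        return self.fp_units * self.simd_lanes

    @property
    def int_parallelism(self) -> int:
        return self.int_units * self.simd_lanes


ps3 = CellConfig()
print(ps3.apus, ps3.fp_units, ps3.int_units)
print(ps3.fp_parallelism, ps3.int_parallelism)

# A hypothetical single-PE part, of the kind the article suggests for portables.
portable = CellConfig(pes_per_cell=1)
print(portable.fp_parallelism)
```

Changing `pes_per_cell` or `apus_per_pe` shows how the same design scales down, which is the scalability point the article makes.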

Then what operating frequency is the Cell aiming at? According to the patent documents, the Cell's floating-point units preferably deliver 32 GFLOPS. The Japanese patent document can be read as if this were the performance of a single floating-point unit, but the U.S. document uses the plural, which makes it clear that this is 32 GFLOPS for the four units in total. Working backward from that, with four units times four data items giving 16 operations per clock, the Cell's operating frequency is expected to be 2 GHz.

So what performance results from processing 512 data items in parallel at 2 GHz? 512 × 2 GHz gives a floating-point peak of 1 TFLOPS. The same goes for integer operations. Incidentally, SCEI's earlier announcement materials stated that "one Cell achieves TFLOPS-class performance". In other words, that matches the performance of the Cell in the patent's configuration exactly. This too suggests that the PlayStation 3's Cell is close to the patent's preferred configuration example.
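The clock and peak figures follow from simple arithmetic, sketched below in Python. The constant names are illustrative only; the inputs are the article's reading of the patent (32 GFLOPS per APU, four FP units per APU, four lanes per unit, 32 APUs per Cell).

```python
APU_FP_RATE = 32_000_000_000  # 32 GFLOPS for the four FP units of one APU
FP_UNITS_PER_APU = 4
SIMD_LANES = 4                # 32-bit values per 128-bit operation
APUS_PER_CELL = 32            # 4 PEs x 8 APUs

# Operations one APU completes per clock under the article's counting.
ops_per_clock_per_apu = FP_UNITS_PER_APU * SIMD_LANES

# Work backward from the per-APU rate to the clock frequency.
clock_hz = APU_FP_RATE // ops_per_clock_per_apu

# Scale up to the whole chip.
cell_parallelism = APUS_PER_CELL * ops_per_clock_per_apu
peak_flops = cell_parallelism * clock_hz

print(clock_hz / 1e9, "GHz")
print(peak_flops / 1e12, "TFLOPS")
```

The product is 1.024 × 10^12 operations per second, which the article rounds to 1 TFLOPS. Multiplying 32 APUs by 32 GFLOPS gives the same number.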

A surprising number of other things about Cell can be inferred from the patent documents as well. Cell is packed with interesting architecture: the software object structure, the on-chip network interface, the DRAM control that pairs APUs with memory banks, and more. It is close to a parade of unexpected architecture.

But there are also many things that cannot yet be learned from the patents. For example, from the patent documents alone you cannot tell whether the Cell's main memory is embedded DRAM or external DRAM. There are, however, several good reasons to believe that it is external and that Rambus's Yellowstone DRAM is used. We would like to go on reporting on these other aspects of the PlayStation 3's architecture in future.


(May 29, 2003)

[ Reported by Hiroshige Goto ]


This is the translation of the article...

Courtesy of Babelfish...
 
These guys' guess is that each APU has 4 FP Units and 4 Integer Units ( and so far we agree )...

Old view:

The 4 FP Units are used together to do parallel SIMD operation ( 128 bits SIMD )... so each APU would produce 8 FP ops/clock.

32 GFLOPS @ 4 GHz

Their view:

Each FP Unit can do a 128 bits SIMD operation...

Each APU would produce 4*4 = 16 FP ops/clock

32 GFLOPS @ 2 GHz

Even assuming their configuration, it would be 1 GHz that we need...

What do they get wrong?

Each 128-bit SIMD instruction could produce 8 FP ops/clock instead of the 4 FP ops/clock they have in the article...

This is the case for FP MADD instructions, at least...

R1 = ( R2 * R3 ) + R4;

This is done in a single cycle ( pipelined )...

So 32 FP ops/clock and 32 GFLOPS at 1 GHz...
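For anyone who wants to check the numbers, here is a rough Python sketch of the three ways of counting in this post. It assumes the 32 GFLOPS figure is per APU and that a MADD counts as two FP ops per lane; the function name and the tuple layout are made up for illustration.

```python
TARGET_FLOPS = 32_000_000_000  # 32 GFLOPS per APU, the patent figure


def required_clock_hz(units, lanes_per_unit, ops_per_lane):
    """Return (FP ops per clock, clock in Hz needed to hit TARGET_FLOPS)."""
    ops_per_clock = units * lanes_per_unit * ops_per_lane
    return ops_per_clock, TARGET_FLOPS // ops_per_clock


# (FP units, 32-bit lanes per unit, FP ops per lane per instruction)
views = {
    "old view (4 units form one 4-wide SIMD, MADD)": (4, 1, 2),
    "article (each unit 4-wide, 1 op per lane)": (4, 4, 1),
    "article + MADD (each unit 4-wide, 2 ops per lane)": (4, 4, 2),
}

for name, shape in views.items():
    ops, hz = required_clock_hz(*shape)
    print(f"{name}: {ops} FP ops/clock, {hz / 1e9:g} GHz")
```

The three cases come out at 8, 16 and 32 FP ops/clock, so the same 32 GFLOPS needs 4 GHz, 2 GHz and 1 GHz respectively.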
 
I think even the Broadband Engine would have e-DRAM... even the fast Yellowstone DRAM cannot feed such a hungry powerhouse fast enough ( 1 TFLOPS and 1 TOPS )...

Plus the patent specifies e-DRAM... important for bandwidth and latency...

Also, when commenting on the new 65 nm manufacturing process, they talked about ~large ( 32 MB ) amounts of e-DRAM on CPUs... they were mentioning how small their e-DRAM cell is...

In the patent the external RAM is connected to the I/O ASIC ( which would be the EE+GS chip basically ) and not to the Broadband Engine which would use e-DRAM...
 
Panajev2001a:

So basically it's double the FLOPs and OPs per clock, but it's also meant to achieve only 2 GHz now, right?
 
Panajev2001a said:
Each 128-bit SIMD instruction could produce 8 FP ops/clock instead of the 4 FP ops/clock they have in the article...

I don't understand, if they are operating on 8 32bit chunks of data, that's 256 bits, not 128.
 
Each FP Unit can do a 128 bits SIMD operation...

I thought about this before, but the buses to the registers aren't wide enough to accommodate it.

If you have say 16 "pipes", I don't see any reason why each one couldn't work on a separate micro-polygon, presupposing they have the same shader and associated state (textures) associated with them. After all, with the Reyes approach, you aren't rasterizing (lerping parameters across triangles in screenspace) anymore, you are dicing geometry into micro polygons... no?

Yeah, I thought the same as well, but we really have no info on it, just some speculation.
 
Someone enlighten me please:

How does one handle shadows in a Reyes renderer?
How does one handle reflections on non-planar surfaces (environment maps diced to micropolys?)

Cheers
Gubbi
 
Panajev2001a said:
These guys' guess is that each APU has 4 FP Units and 4 Integer Units ( and so far we agree )...

Old view:

The 4 FP Units are used together to do parallel SIMD operation ( 128 bits SIMD )... so each APU would produce 8 FP ops/clock.

32 GFLOPS @ 4 GHz

Their view:

Each FP Unit can do a 128 bits SIMD operation...

Each APU would produce 4*4 = 16 FP ops/clock

that means the new view is more powerful than the old view

32 GFLOPS @ 2 GHz

Even assuming their configuration, it would be 1 GHz that we need...

What do they get wrong?

Each 128-bit SIMD instruction could produce 8 FP ops/clock instead of the 4 FP ops/clock they have in the article...

This is the case for FP MADD instructions, at least...

R1 = ( R2 * R3 ) + R4;

This is done in a single cycle ( pipelined )...

So 32 FP ops/clock and 32 GFLOPS at 1 GHz...
 
i guess this whole *1000x more powerful* crap has more to do with the things PS3 will be able to do that would slow PS2 down like an Xbox trying to do MGS2 particle effects... ( :LOL: sorry couldnt resist)...

i mean, i guess things like displacement mapping will be used in ps3 games, in conjunction with hi res textures and bump mapping all over the place. things like that all together would melt PS2 down...

i dont think we really need *multi-teraflop performance* to achieve that *1000x more powerful* goal.

i dont think that they will give us a 6.4 TFlops machine just because ps2 was 6.4 Gflops just to make us see the *1000x more powerful* thing in numbers....... 1Tflop will be enough for a long time.... without saying that the 6.4Gflop PS2 is not a real world figure anyway...
 
V3 said:
Each FP Unit can do a 128 bits SIMD operation...

I thought about this before, but the buses to the registers aren't wide enough to accommodate it.

?

You have a 128 bits bus ( if you go by the old view ) from the FP Units to the Register File and a 384 bits bus from the register file to the FP Units.

For FP MADD instructions ( 128 bits SIMD ) you need three source operands ( 3 * 128 bits = 384 bits ) and 1 destination operand ( 128 bits )...

I do not see any problem regarding bus width...
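The bus argument can be checked in a couple of lines. This is a minimal Python sketch assuming 128-bit registers and a MADD with three source operands and one destination; the names are made up for illustration.

```python
REGISTER_BITS = 128  # one SIMD register, per the patent figures quoted above


def bus_bits(register_count):
    """Bus width needed to move this many whole registers in one cycle."""
    return register_count * REGISTER_BITS


MADD_SOURCES = 3       # R2, R3 and R4 in R1 = ( R2 * R3 ) + R4
MADD_DESTINATIONS = 1  # R1

read_bus = bus_bits(MADD_SOURCES)        # register file -> FP units
write_bus = bus_bits(MADD_DESTINATIONS)  # FP units -> register file
print(read_bus, write_bus)
```

Those match the 384-bit and 128-bit buses from the article, but only for one 128-bit SIMD MADD per cycle. Four such instructions in flight at once would need four times the width.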
 
i guess this whole *1000x more powerful* crap has more to do with the things PS3 will be able to do that would slow PS2 down like an Xbox trying to do MGS2 particle effects... ( sorry couldnt resist)...

i mean, i guess things like displacement mapping will be used in ps3 games, in conjunction with hi res textures and bump mapping all over the place. things like that all together would melt PS2 down...

i dont think we really need *multi-teraflop performance* to achieve that *1000x more powerful* goal.

i dont think that they will give us a 6.4 TFlops machine just because ps2 was 6.4 Gflops just to make us see the *1000x more powerful* thing in numbers....... 1Tflop will be enough for a long time.... without saying that the 6.4Gflop PS2 is not a real world figure anyway...



wonderful post London-boy - I think you are right on target here. that combined with PS3's increased efficiency over PS2, thanks to all the caches (128k per APU) and eDRAMs placed in the CPU and GPU, and a more thoughtful bus configuration, etc, should help PS3 achieve the now famous "1000x PS2" but it will be an overall improvement. not just 1000x this figure or 1000x that figure.

If PS3 was clocked down / brought down to 6.2 GFLOPs it would still likely wipe the floor with PS2 in real-world performance and especially in things like bump-mapping, highres texturing (procedural stuff too) pixel shading, programmable FX, displacement mapping, FSAA, etc. ...all those things you mentioned.
 
megadrive0088 said:
in real time or raw if its raw that would be very sad because the next gen systems are suppose to surpass 4g polys raw maybe in real time or more


who says next gen systems are supposed to surpass 4 billion polys, raw or not raw?

maybe they will only do 1 billion raw, and 200~300 million with textures, shaders, lighting, FSAA, etc. there is no performance level given by Sony, MS or Nintendo, for next gen systems yet.


I personally expect 5-10 billion raw and 1-2 billion with stuff. AT MOST. but it WILL depend on the limits of the architecture of all the next gen systems. T&L rate, fillrate, bus bandwidth, memory size, latency, compression.

any bottlenecks will limit what next gen systems can push onto the screen to be displayed, in-game play (or realtime cut-scene) just like current systems have limits.

even if next-gen console A can do 100 billion polys raw, if there is a bottleneck in any part of the architecture that limits you to 500 million, that is what you're going to get.

you love wild speculation. most people think that the ps3 can do 300x the raw polygon power of the ps2
 
qwerty2000 said:
you love wild speculation. most people think that the ps3 can do 300x the raw polygon power of the ps2

do u even read other people's posts?

my WONDERFUL ( :LOL: so said megadrive :LOL: ) post made it clear that whatever the numbers, the result will be the same, something around 1000x better than ps2. even if the polygon throughput is not 1000x more than ps2, all things together will make it that fast.

just read my WONDERFUL post :LOL: will u...
 
eh, thats kind of silly, 1000x. would you say PS2 is 1000x better than PS1? and would that mean, that PS3 will be 1,000,000x better than PS1!? and PS4 will be a billion times better than PS1??? and PS5 will be a trillion times better???

PS2 is maybe 1000x better than monochrome Pong :)
 