PlayStation III Architecture

well I could see render farms made with this kind of architecture...

Well, if they can make a render farm out of the GeForce FX, they can definitely make a render farm out of this.

With this kind of performance, I don't think programming will be the main hurdle, but art resources will.

Anyway, from our earlier speculation, do you think this thing has any chance of putting a dent in the PC market ?

Also, regarding the Visualizer, what I am thinking is that those parts, the pixel engine, cache and CRTC, just point to some souped-up GS. So basically each Visualizer is like a GScube; Sony already said the GScube is a visualizer some time ago, if anyone remembers.
 
The Pixel Engines+Image Cache+CRTC being a souped up GS... uhm yes and no...

These 4 pixel pipelines are fully programmable and the main load of processing is going to be done by the APUs and the PU and not in the pixel engine... I do not think we should expect an uber-powerful DX7++ class rasterizer this time :) or do you want to keep all those APUs and PUs idle ? ;)

Also the Image cache is not the whole VRAM since we can still see that there is e-DRAM ( look at the diagram )... maybe that DRAM will store pixel programs, incoming vertex data, textures, etc... and the Image cache will store the front and back buffers and all the CRTC outputs are going to be merged together along the lines of what happened on the GScube...

With this kind of performance, I don't think programming will be the main hurdle, but art resources will.

I agree with you on this one... it will be an issue :(

Anyway, from our earlier speculation, do you think this thing has any chance of putting a dent in the PC market ?

who knows... ? it depends on how widespread the CELL architecture becomes and then how fast it can run Linux ( or a ported cough Windows ;) ) and do everyday computing jobs as a regular workstation would...

If PS3 establishes a large userbase, if CELL finds its way into Sony and Toshiba's consumer-electronics products ( TVs, stereos, etc... ) and into some of IBM's own server lines, we might even see VAIOs running Linux and being powered by CELL ( which could run Windows in emulation mode )...

Putting CELL into PS3 and Toshiba and Sony's other consumer-level products will do a great deal of good because it will create a HUGE userbase for CELL and justify more and more applications and tools being ported to CELL...
 
These 4 pixel pipelines are fully programmable and the main load of processing is going to be done by the APUs and the PU and not in the pixel engine... I do not think we should expect an uber-powerful DX7++ class rasterizer this time, or do you want to keep all those APUs and PUs idle ?

The GS is not even DX7 class. Anyway, what I mean by souped up is just a really fast GS. Remember last time they gave out information on a really fast e-DRAM with more bandwidth than the GS; I think they might use it here. (You got the link ?)

Also the Image cache is not the whole VRAM since we can still see that there is e-DRAM ( look at the diagram )... maybe that DRAM will store pixel programs, incoming vertex data, textures, etc... and the Image cache will store the front and back buffers and all the CRTC outputs are going to be merged together along the lines of what happened on the GScube...

Yeah, the image cache is separate from the central DRAM. It needs to be separate because they might use what I mentioned above.

who knows... ? it depends on how widespread the CELL architecture becomes and then how fast it can run Linux ( or a ported cough Windows ) and do everyday computing jobs as a regular workstation would...

Do you think Linux would be an efficient OS for this thing ?

If PS3 establishes a large userbase, if CELL finds its way into Sony and Toshiba's consumer-electronics products ( TVs, stereos, etc... ) and into some of IBM's own server lines, we might even see VAIOs running Linux and being powered by CELL ( which could run Windows in emulation mode )...

Putting CELL into PS3 and Toshiba and Sony's other consumer-level products will do a great deal of good because it will create a HUGE userbase for CELL and justify more and more applications and tools being ported to CELL...

Yeah, I wonder.
 
would you be surprised to see this CELL architecture ( not the Broadband Engine and the CELL-based 256 GFLOPS rasterizer, but the PEs ) being used in PDAs, TVs and stereos ? There are several parts of the patent that really make me think this architecture was designed to scale across a whole variety of products...

No, I would not be surprised; that's what I was alluding to (I've read the publication numbers and diagrams). Note I said how "soon", not what devices...
 
The GS is not even DX7 class. Anyway, what I mean by souped up is just a really fast GS. Remember last time they gave out information on a really fast e-DRAM with more bandwidth than the GS; I think they might use it here. (You got the link ?)

I know... forgot about HW T&L, etc...

and yes I have the link to it... It is from RWT ( Paul DeMone from ISSCC 2001 )

http://www.realworldtech.com/page.cfm?AID=RWT022001001645

What I remember is several people saying ( including Fafalada, archie and TK-Mai, who works at Sony R&D labs IIRC ) that it was a GS 1.5, a prototype of the GS with better e-DRAM and slight changes to the rendering core... GS 3, like EE 3, was rumored to have a brand new architecture... as we are seeing :)

Still, we might see the basic Pixel Engine interfaces and the e-DRAM being the same... we do not need the old GS triangle set-up logic, etc... the APUs in the new Visualizer can work on that too :)

that e-DRAM should find a place... the author of that article saw it as implementable at .13u... well, Sony has .10um SOI technology from IBM and has also developed a 65nm technology with Toshiba...



Yeah, the image cache is separate from the central DRAM. It needs to be separate because they might use what I mentioned above.

I was thinking about the central DRAM being shared by the special "gfx" ( with pixel engine ) PEs in the Visualizer, and this is different from the DRAM you have in the Broadband Engine...

Do you think Linux would be an efficient OS for this thing ?

IBM is investing a lot of money in Linux R&D, and Linux is heavily rumored to be the one chosen ( by IBM and Sony ) to serve as the basis of the CELL OS...
 
Sorry archie, I misunderstood the point you were making and I apologize...

as for "how soon?"... it depends... we should see CELL coming out of the fabs before PS3 is launched, so it seems quite soon :)

Realistically, aside from network equipment and IBM servers and then PS3, I'd see CELL maybe coming out in Blu-ray players and then spreading to the rest of Sony and Toshiba's products... all IMHO
 
BTW, archie... I re-posted all those paragraphs perhaps to add more to my point, to make sure all the dots were connected also for people who haven't read the patent yet and are following this debate...
 
Thanks for the link Pana :)

Still, we might see the basic Pixel Engine interfaces and the e-DRAM being the same... we do not need the old GS triangle set-up logic, etc... the APUs in the new Visualizer can work on that too

Don't you think triangle setup would be advantageous to have ? Though some devs might not use polygons, most probably will.

that e-DRAM should find a place... the author of that article saw it as implementable at .13u... well, Sony has .10um SOI technology from IBM and has also developed a 65nm technology with Toshiba...

That eDRAM could be suitable for the image cache. How big do you think the image cache for each Visualizer needs to be ? 8 MB enough ?

I was thinking about the central DRAM being shared by the special "gfx" ( with pixel engine ) PEs in the Visualizer, and this is different from the DRAM you have in the Broadband Engine...

The central DRAM should operate in a similar manner for the chip with the Visualizer and the Broadband Engine. They will all be shared, and the PEs or the Visualizer can probably acquire memory from each other.

But the image cache would be separate and dedicated to the Pixel Engine.
 
I don't know if the Visualizer and the Broadband Engine share the same DRAM... gotta study the patents more... it seems to me that each has its own DRAM pool, plus the Visualizer's Pixel Engines have their Image Cache ( I will have to think about the size of it, and it will also depend on this issue of whether the DRAM is shared with the Broadband Engine or not )


As far as triangle set-up goes... well, if not all developers will use it and your "microcode" ( done by the API ) works fast enough, why not do it with APUs and PUs ? :) we will see how it turns out :)
 
Would single CELL-based chips be suitable for embedded processing? Given the close architectural ties between CELL and MIPS processors, maybe it can find uses in embedded appliances. For example, the Sony MiniDisc uses an ARM chip as the processor, possibly to assist with DSP as well. Could Sony replace most of its licensed embedded processors with its own CELL chips? Given that it's very fast and its purpose is general, it would seem that anywhere you need processing power you can shove in a CELL in place of the existing MIPS/ARM/custom solution.

Would the lower cost of R&D for chips and royalties offset the possibly hotter and more expensive to manufacture CELL chips?
 
I don't know if the Visualizer and the Broadband Engine share the same DRAM... gotta study the patents more... it seems to me that each has its own DRAM pool, plus the Visualizer's Pixel Engines have their Image Cache ( I will have to think about the size of it, and it will also depend on this issue of whether the DRAM is shared with the Broadband Engine or not )

Well, two BEs can share the DRAM pool using that switch unit, so if this one is made like that, they could.

[0082] BE 1201 also includes switch unit 1212. Switch unit 1212 enables other APUs on BEs closely coupled to BE 1201 to access DRAM 1204. A second BE, therefore, can be closely coupled to a first BE, and each APU of each BE can address twice the number of memory locations normally accessible to an APU. The direct reading or writing of data from or to the DRAM of a first BE from or to the DRAM of a second BE can occur through a switch unit such as switch unit 1212.
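To make the switch-unit idea concrete, here is a toy sketch; the DRAM size and the address-routing rule are my own assumptions, just to illustrate how closely coupling two BEs could double an APU's addressable range:

```python
# Toy model of patent paragraph [0082] (sizes and routing rule are my own
# assumptions): two closely coupled BEs, each with its own DRAM; an APU's
# effective address range doubles because the upper half of it is routed
# through the switch unit to the neighbouring BE's DRAM.
DRAM_WORDS = 1 << 20   # made-up per-BE DRAM size

class BE:
    def __init__(self):
        self.dram = [0] * DRAM_WORDS
        self.neighbour = None        # the closely coupled second BE

def apu_read(be, addr):
    """An APU on `be` can now address 2 * DRAM_WORDS locations."""
    if addr < DRAM_WORDS:
        return be.dram[addr]                      # local DRAM
    return be.neighbour.dram[addr - DRAM_WORDS]   # via the switch unit

be0, be1 = BE(), BE()
be0.neighbour, be1.neighbour = be1, be0
be1.dram[42] = 0xCAFE
print(hex(apu_read(be0, DRAM_WORDS + 42)))        # reads BE1's DRAM: 0xcafe
```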

As far as triangle set-up goes... well, if not all developers will use it and your "microcode" ( done by the API ) works fast enough, why not do it with APUs and PUs ?

Well, triangle set-up is such a trivial thing, do you really want to microcode it ? It's better to let the APUs do more interesting stuff than triangle setup.
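For context on just how small that per-triangle work is, setup in a half-space rasterizer boils down to a handful of subtracts and multiplies per edge. A minimal illustrative sketch (my own example, not how the GS or the Visualizer actually does it; assumes counter-clockwise winding in y-up coordinates):

```python
# Minimal edge-function triangle setup (illustrative only).

def triangle_setup(v0, v1, v2):
    """Return edge coefficients (A, B, C) per edge so that a point (x, y)
    is inside the triangle when A*x + B*y + C >= 0 for all three edges."""
    def edge(p, q):
        a = p[1] - q[1]                 # A = -(y1 - y0)
        b = q[0] - p[0]                 # B =  (x1 - x0)
        c = -(a * p[0] + b * p[1])      # C chosen so the edge passes through p
        return (a, b, c)
    return [edge(v0, v1), edge(v1, v2), edge(v2, v0)]

def inside(edges, x, y):
    return all(a * x + b * y + c >= 0 for a, b, c in edges)

edges = triangle_setup((0.0, 0.0), (8.0, 0.0), (0.0, 8.0))
print(inside(edges, 2.0, 2.0), inside(edges, 9.0, 9.0))   # True False
```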
 
So, not counting the image cache in the Visualizer or other caches in the Broadband Engine, just eDRAM, it will most likely have 96-128 MB of eDRAM: 64 MB for the BE, plus a separate 32-64 MB for the Visualizer... even if the Visualizer has just 32 MB of eDRAM, 96 MB of eDRAM total is a lot.
Nothing to scoff at :)
 
Just some thoughts...

All the performance numbers quoted in that patent seem to indicate an FP APU to be a simple 4-way single-precision FMAC (+ 1-way fdiv/trigonometric). I wonder how they want to cater to workstation/PC users while supporting only single-precision FP (though it is certainly enough for an Entertainment System).

Reasoning: 128-bit FP registers and their 32 GFLOPS performance numbers ( 2 (FMAC) * 4 (exec. pipelines) * 4 GHz ).
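That arithmetic, spelled out per APU, using exactly the figures quoted above:

```python
# 32 GFLOPS per APU, reconstructed from the numbers in the post above.
register_bits  = 128
lane_bits      = 32                          # single-precision lanes
pipelines      = register_bits // lane_bits  # 4 execution pipelines
flops_per_fmac = 2                           # a fused multiply-add = 2 flops
clock_ghz      = 4
print(flops_per_fmac * pipelines * clock_ghz, "GFLOPS per APU")   # 32
```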

I just flew over the patent, but all the references to using the PPC ISA for the PEs do not make too much sense to me. The patent definitely goes to great lengths promoting the Java idea of a unified abstract high-level ISA (like the JVM). From my understanding this is to be implemented by the PEs (basically CELL-ISA-to-APU-ISA interpreters): every PE knows its own resources, accepts CELL packages, allocates an appropriate amount of its execution resources and memory (sandbox), decodes the CELL's code into the APU's designated memory area and starts its execution (or, to wrap it up, a much simplified JVM-like hardware implementation). Surely an interesting idea, but it has its pitfalls:

- real-world efficiency will probably not be up to par with traditional architectures (though they can claim outrageous GFLOPS numbers, which is good for hype and pleases the fanboys).

- the core problem (and where the scientific breakthroughs are required) will be compilers, as they need to extract parallel code out of (still mostly) inherently sequential algorithms at least an order of magnitude better than the current state of the art (guesstimate) if you want to challenge the traditional workstation market; otherwise I'd say about 90% of your execution resources will sit idle at any given time for GP code (though DSP-friendly stuff, such as Transform & Lighting, will be quite suited to such an architecture).

- this approach will generate lots of computing and memory overhead (for the respective thread handling & synchronisation, even though they are implementing some supporting logic for this in their DRAM controllers (thread-specific page locks))

- interesting times are definitely ahead, as this is the first time in two decades that someone deviates from the RISC way of engineering MPUs on a broad commercial scale. It remains to be seen how the real-world performance relates to traditional consumer MPUs, which will probably sport GFLOPS numbers (as if they meant anything beyond marketing, though) at least an order of magnitude lower than this in the low-cost space. My guess, though, would be more or less equality for 2005, as massively parallel architectures will be heavily dependent on advances in compilers/programming paradigms.
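Going back to the PE-as-interpreter dispatch model described before that list, here is a rough sketch of how it might look; this is my reading of the software-cell idea, not an actual API, and every name and size below is invented:

```python
# Hypothetical sketch of a PE accepting a self-contained "software cell",
# carving out a sandbox, loading it into a free APU and starting execution.
from dataclasses import dataclass, field

@dataclass
class SoftwareCell:
    program: bytes      # code to run (abstract CELL code or APU code)
    data: bytes         # input data travelling with the program
    mem_needed: int     # sandbox size the cell requests

@dataclass
class APU:
    local_store: int    # bytes of local memory
    busy: bool = False

@dataclass
class PE:
    apus: list = field(default_factory=lambda: [APU(128 * 1024) for _ in range(8)])

    def accept(self, cell: SoftwareCell) -> bool:
        """Allocate resources for a cell and 'start' it, if possible."""
        apu = next((a for a in self.apus
                    if not a.busy and a.local_store >= cell.mem_needed), None)
        if apu is None:
            return False      # no free execution resources / sandbox
        # decode the cell's code + data into the APU's memory area, kick it off
        apu.busy = True
        return True

pe = PE()
print(pe.accept(SoftwareCell(b"\x00" * 64, b"\x01" * 256, 4096)))   # True
```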
 
it'd be interesting to know if there are any indications of the maximum number of FP and integer units per APU, and the max number of APUs per PE.
Perhaps it's unlimited, meaning as many as die size/fab technology will allow....
 
Just to give you an established example (Sony marketing numbers; this is a real best-case scenario for T&L operations, others such as vector normalisation are a lot more costly (in a way, inefficient)). Take PS2's VU1: using software pipelining techniques, a perspective transformation can be achieved with a throughput of 7 cycles. This includes 4 MULs (1 flop each), 4 MULAs (1 flop each), 8 MADDAs (2 flops each), 4 MADDs (2 flops each) and 1 DIV (1 flop). This adds up to 33 flops/7 cycles (or 1.414 GFLOPS) compared to a theoretical maximum of 3.2 GFLOPS for VU1. This assumes you can mask all loop-control & LS operations under DIV's 7-cycle latency and that your whole environment is infinitely fast (VIF1, MEM, CPU/VU0, bus), so this also represents a kind of upper limit to what is possible with VU1. To do something useful, you'd also have to perform other operations on your geometry data, like vector normalisations (throughput of 13 cycles on VU1, best case, including 4 MULs, 2 MADDs and 1 RSQRT (1 flop), representing a "sustained" performance of 9 flops/13 cycles, or 0.2 GFLOPS of VU1's theoretical max of 3.2 GFLOPS). So to give a realistic theoretical best case while regarding VU1's surroundings as infinitely powerful, you'll end up with perhaps about 0.7 GFLOPS peak. In a real-world scenario you would probably be quite satisfied sustaining 50-70% of that rate over prolonged periods (>500 ms). Just so you don't misunderstand me, similar calculations apply for nearly all architecture/code cases.
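The flop accounting above, worked through explicitly; the 300 MHz VU1 clock is my assumption, chosen so that the 1.414 GFLOPS figure in the post comes out:

```python
# Reproducing the VU1 numbers. Instruction mix is exactly as listed above.
CLOCK_MHZ = 300   # assumed VU1 clock

# perspective transform loop: (count, flops per instruction)
xform = [(4, 1),   # 4x MUL
         (4, 1),   # 4x MULA
         (8, 2),   # 8x MADDA
         (4, 2),   # 4x MADD
         (1, 1)]   # 1x DIV
xform_flops, xform_cycles = sum(n * f for n, f in xform), 7
print(xform_flops, "flops /", xform_cycles, "cycles ->",
      round(xform_flops / xform_cycles * CLOCK_MHZ / 1000, 3), "GFLOPS")

# vector normalisation: 4x MUL, 2x MADD, 1x RSQRT over 13 cycles
norm_flops, norm_cycles = 4 * 1 + 2 * 2 + 1, 13
print(norm_flops, "flops /", norm_cycles, "cycles ->",
      round(norm_flops / norm_cycles * CLOCK_MHZ / 1000, 3), "GFLOPS")
# Both are far below the 3.2 GFLOPS theoretical peak quoted for VU1.
```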
 
V3 said:
The central DRAM should operate in a similar manner for the chip with the Visualizer and the Broadband Engine. They will all be shared, and the PEs or the Visualizer can probably acquire memory from each other.

But the image cache would be separate and dedicated to the Pixel Engine.

Depends on what you mean by sharing. Data won't be shared implicitly like in a traditional SMP (i.e. no snooping to keep the on-die RAM coherent). Data will have to be explicitly sent from one chip to the other in packets.

The reason the image cache is there is that the central eDRAM isn't exactly suited to fine-grained random access. The eDRAM is divided into blocks, each of which has to be locked by a PU before the APUs under that PU can access it (all arbitration goes through a PU). This will add *a lot* of overhead dealing with packets for texture reads. Also, various texture filtering methods influence texture cache architecture, and I would expect the image caches to reflect this.
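A toy illustration of the overhead being described here; the block size, texture size and the "one lock per block touched" rule are invented numbers, just to show why PU-arbitrated, block-granular access hurts fine-grained texture reads:

```python
# Count PU lock/packet round-trips for random texture fetches when the
# central eDRAM is only accessible in lockable blocks (all numbers invented).
import random

BLOCK_SIZE   = 1024        # bytes per lockable eDRAM block
TEXEL_SIZE   = 4           # 32-bit texel
TEXTURE_SIZE = 1 << 20     # 1 MB texture sitting in central eDRAM

random.seed(0)
lock_requests = 0
for _ in range(10_000):                                  # 10k small fetches
    base = random.randrange(0, TEXTURE_SIZE - 4 * TEXEL_SIZE, TEXEL_SIZE)
    texels = [base + i * TEXEL_SIZE for i in range(4)]   # 4 nearby texels
    lock_requests += len({t // BLOCK_SIZE for t in texels})  # distinct blocks

print(lock_requests, "lock round-trips for 10,000 fetches")
# A small dedicated image cache in front of the pixel engine turns most of
# this arbitration traffic into plain cache hits.
```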

A final comment: the packet/block organization of the eDRAM/main memory lends itself well to virtual texturing, no ?

Cheers
Gubbi

edit: typos, PE->PU
 
BTW, I agree with Pinky: getting good performance (i.e. >25% of peak) out of CELL is going to take black magic. I see the tool chain as the single biggest technological challenge in CELL.

Cheers
Gubbi
 
Pinky, while I completely agree about 'sustained' flops, particularly for SIMD units, being nowhere near their theoretical numbers, I disagree that small loops are anywhere near a best-case scenario, especially for VUs.

Longer, more complex loops are likely to have a higher FLOP rating simply by having more MADDs to do and allowing for more optimal scheduling, particularly with high-latency elementary operations like normalize, which are executed on the lower pipe and typically end up completely hidden in longer loops... :p

A shader requiring normalize would probably do a fair bit more than 13 MADDs of work at the same time, so you could easily see ~2.5 GFLOPS in such a loop.
 
I agree, best case might be a bit misleading, but I still think these numbers are appropriate if optimizing for vertex throughput. But honestly, sustaining an average 2.5 GFLOPS during e.g. T&L for a single frame on VU1 in-game does not seem realistic to me, but feel free to correct me (just spare me the "sitting directly in front of a wall" scenario ;) ).

BTW, completely OT: my deepest regards for your situation on the Korean peninsula; as a German in these times, I wish for and still believe in a similarly peaceful future for your country (not from an economic POV though).
 