PlayStation III Architecture

Imagine a handheld with a cut down CELL chip with one PU and say 2 APUs, together with a single pixel engine. The media for the handheld conforms to the memory stick standard.

Now imagine the PS3, which has the same basic architecture, only with vastly more computing power (and power consumption). You would then be able to run the same game on the PS3 as on the handheld (but at higher resolutions, etc.).

The nice thing is that this is all soooo "possible" :) The architecture is designed to be THIS modular... clock frequency ( one would hope so ;) ) can be changed, the number of PEs in a chip can vary, the number of APUs in a PE can vary, the number of Integer or FP units in an APU can vary, and so can the quantity of e-DRAM... a program written for the CELL architecture ( edit: I should say written for the APU ISA, as those are the basic blocks it seems... with the PE being the CELL ) will work on all CELLs in the network ( I am not sure if compatibility problems could arise with the Visualizer... they won't if they keep the Pixel Engine abstracted from the code... the Visualizer PEs are like normal PEs, just with a variable portion of image cache and a Pixel Engine instead of four of the eight APUs... as long as we do not make assumptions about image cache or other things which might vary in one particular Visualizer implementation [which should be avoided if we want to achieve this "compatibility" effect], this should work [albeit at different speeds] on all Visualizers )...

What I would vary would be the T&L work ( like tessellation of HOS patches )... running on the handheld CELL-based machine we would have a certain LOD selected, and on PS3 we would have much more detailed meshes ( as we would tessellate better )...
 
Why do you, Deadmeat, hate this architecture ?

It MUST be a LOGICAL reason :) Maybe the same one which so accurately predicted every developer on earth jumping ship from PS2, right ?

Seriously, let's see what you had to say...

1. I/O bandwidth limitation - The old saying goes like "Your computer is as fast as the slowest part of your computer", and this is why mainframes with gigabyte I/O continue to blow away mega PC servers in SQL performance with PC-grade I/O. The original EE design suffered from the onchip backbone bus bandwidth bottleneck and the situation has actually worsened with this "Broadband Engine" thing, with all these bandwidth hungry VU2s screaming for data.

The original EE's problems were not bus bandwidth ( FIND me a 300 MHz consumer chip with a 2.4 GB/s FSB and then we will TALK... Pentium 4's FSB debuted at 3.2 GB/s much later, at a much higher [CPU] clock speed and with a nice .18um process [soon becoming .13um]... )

EE's main bus was actually in the same realm as the Direct RDRAM memory ( 3.2 GB/s max and close to 2.4 GB/s in the real world )...

EE's main mistake was a caching architecture lacking a solid L2 ( to avoid memory bus contention [most developers list a larger L2, and possibly larger SPRAM, for the EE's RISC core as one of the most wanted improvements for the PS2 HW] ) and small micro-memories for the VUs ( which would have again lowered the dependence of the VUs on main memory and lengthened how long they could work without getting starved )...

IBM, even well before Sony came into the picture, was trying to design an architecture that solved the next generation problem in computing: efficient DATA MOVEMENT... Their research went DIRECTLY in this direction: processing an element in a certain stream of data was thought to become a much smaller concern than moving the data stream across the processor ( in and out the processor ) as the data set increases more and more...

Of course since they are in partnership with Sony and this is a candidate for PS3's HW ( more than a candidate IMHO :) ) here we go... all Kutaragi's fault ( now you're ADMITTING he DESIGNED it ALL himself then... back to step one... he is again the inventor of CELL... does your head begin to feel dizzy with all these 180 degree turns ? )...

What is the bus bottleneck of the supposed Broadband Engine ?

the PE's bus ? the 1,024-bit bus that is running at probably over 1 GHz ( >128 GB/s [256 GB/s at 2 GHz and 0.5 TB/s at 4 GHz] ) ?

the connection with the embedded e-DRAM ( of which we have 64 MB... good enough to keep part of it as a buffer to stream from external memory, while the rest could run an OS without even coming out of the e-DRAM ) ?

the connection with the external memory ? That connection which would probably be Yellowstone RAMBUS RAM yielding ~12-20 GB/s of bandwidth ?

or is it the support of multiple Fiber Optic Links that is the bandwidth limiting factor ?

or is it the LS ( Local Storage... last I heard we have 128 KB per APU and each APU has one hundred twenty-eight 128-bit registers ) for each APU ?
 
Panajev2001a said:
EE's main mistake was a caching architecture lacking a solid L2 ( to avoid memory bus contention [most developers list a larger L2, and possibly larger SPRAM, for the EE's RISC core as one of the most wanted improvements for the PS2 HW] ) and small micro-memories for the VUs ( which would have again lowered the dependence of the VUs on main memory and lengthened how long they could work without getting starved )...

..<snip>..

or is it the LS ( Local Storage... last I heard we have 128 KB per APU and each APU has one hundred twenty-eight 128-bit registers ) for each APU ?

Caching is one of the things that can boost performance by exploiting the spatial and temporal locality of an unpredictable workload. So one has to wonder why Sony, again, has chosen an architecture without caches. Of course the scratchpad is going to function as an explicit cache, but it's still no substitute.

And that 1024-bit fat pipe from the eDRAM to the PUs is going to be needed.

Imagine what happens on a context switch. One PE, PU with 8 APUs, each with 128KB local memory, that's potentially 1MB of state that has to be saved on a context switch, *ouch*. Or am I completely mistaken ?

Cheers
Gubbi
 
Caching is one of the things that can boost performance by exploiting the spatial and temporal locality of an unpredictable workload. So one has to wonder why Sony, again, has chosen an architecture without caches. Of course the scratchpad is going to function as an explicit cache, but it's still no substitute.

First... I do not understand this... when people say that Sony is helping IBM with CELL together with Toshiba, everybody answers that this is all an IBM project which Sony begged for...

Then suddenly it's all Sony's fault for every questionable decision... :( sigh


This is not directed to you Gubbi...

however let's see again what you said...

Caching is one of the things that can boost performance by exploiting the spatial and temporal locality of an unpredictable workload.

First, we have to remember, as far as PS3 is concerned, that its main purpose will be related to 3D graphics and vector calculations... this involves constant streaming in and out of massive quantities of data, with this data not spending a huge amount of time in the processor being processed over and over ( of course more advanced vertex programs will do more and more work per vertex, reducing the rate at which the data streams in and out )...

We might have good spatial locality as we can organize triangle data in memory in such a way, but temporal locality would not be the major factor we want to take advantage of...

Large buffers/memory pools might be even MORE useful than caches: they can take advantage of spatial locality by prefetching a bit aggressively, and software caching can be done to take advantage of the temporal locality that is offered by the code we are running ( the efficiency of software caching will maybe not be astonishing compared to having an extra dedicated cache, but it can help ).
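To make the "software caching on a local buffer" idea a bit more concrete, here is a minimal sketch in plain C ( the dma_get/dma_wait primitives are hypothetical stand-ins for whatever DMA mechanism the APUs will actually expose, and it assumes the total size is a multiple of the chunk size for brevity ): while one chunk is being processed out of local memory, the next chunk is already being fetched, so spatial locality is exploited by explicit prefetch rather than by a hardware cache.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 4096                      /* bytes per streamed chunk */

/* Hypothetical DMA primitives: stand-ins for whatever get/wait mechanism */
/* the APU will expose; declared here only so the sketch is self-contained. */
void dma_get(void *local_dst, const void *remote_src, size_t n, int tag);
void dma_wait(int tag);

void process_chunk(float *data, size_t n);    /* the actual per-chunk work */

/* Stream 'total' bytes of vertex data through two CHUNK-sized local buffers. */
void stream_vertices(const uint8_t *remote, size_t total)
{
    static float buf[2][CHUNK / sizeof(float)];   /* lives in local storage */
    size_t offset = 0;
    int cur = 0;

    /* prime the pipeline: start fetching the first chunk */
    dma_get(buf[cur], remote, CHUNK, cur);

    while (offset < total) {
        size_t next_off = offset + CHUNK;
        int next = cur ^ 1;

        /* kick off the fetch of the next chunk before touching the current one */
        if (next_off < total)
            dma_get(buf[next], remote + next_off, CHUNK, next);

        dma_wait(cur);                            /* current chunk is now resident */
        process_chunk(buf[cur], CHUNK / sizeof(float));

        offset = next_off;
        cur = next;
    }
}
```

The same pattern was already what good PS2 code did with the VU micro-memories and the SPRAM, just done by hand instead of by a cache controller.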

See Flipper's big Texture Cache vs the Graphics Synthesizer's bigger VRAM ( e-DRAM )... think about heavy render-to-texture operations ( or wanting to save temporary results in a local buffer )... Flipper will have to access the external main RAM while the Graphics Synthesizer will have the VRAM to write to...

I do believe that if the EE's RISC core SPRAM had been 32-64 KB ( preferably 64 KB ), VU0's micro-memories had been 16 KB each and VU1's micro-memories had been 32 KB each, the need for a fat L2 would have been much lower, as a lot of problems could have been avoided by better use of these local buffers.

Managing the buffers as caches is not easy and can lead to poor caching performance, but it can be done ( PS2 developers did so with the VUs and the RISC core's SPRAM )...

Let's look at the memory hierarchy of a single PE ( and let's start from an APU )...

We have 128 KB of Local Storage ( SRAM ) per APU and each APU has 128x128-bit registers; then, if what we want is not in the Local Storage ( 128 KB is 4x the total Instruction+Data micro-memory VU1 had and 8x the SPRAM the RISC core of the EE had ) and not in the registers, we could always look in the LS of other APUs ( I am not 100% sure of this because I have not yet found a decisive enough declaration in the patent regarding it, I will look more for it though ) in the same PE, then we might look at the PU's cache ( the PU should have small but existing L1 caches )...

Then we have the 64 MB of DRAM to look in for the data we need, and that is quite a big space ( Xbox's main RAM is 64 MB :) )... still, the way I see it, part of the DRAM would be used as a prefetch buffer for data so that we have a much higher chance of not having to wait for the external memory to provide us slowly with the wanted piece of data... ( the external memory won't be slow, but it will surely not be as fast as the embedded DRAM )...

Making some quick calculations...

8 APUs/PE * 128 KB/APU * 4 PEs/BE = 4 MB of SRAM used only as local storage ( LS )

8 APUs/PE * 128 registers/APU * 16 bytes/register * 4 PEs/BE = 64 KB of space with registers alone

And then we have 64 MB of fast DRAM whose bus is 1,024 bits wide too...

And then we have external RAM with like 12-20 GB/s of bandwidth we keep and keep on streaming from ( we might have a bit more by 2005 )...

The resources are there... and let's not forget the benefits that local RAM has over caches: you can write and read... bus to main RAM is busy ? Not to worry, you have the RAM right there...
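Just to double-check those figures, a throwaway calculation in plain C ( the per-APU numbers are the "preferred embodiment" values from the patent; everything else is just multiplication ):

```c
#include <stdio.h>

int main(void)
{
    const int apus_per_pe   = 8;
    const int pes_per_be    = 4;
    const int ls_kb_per_apu = 128;     /* Local Storage per APU            */
    const int regs_per_apu  = 128;     /* 128 x 128-bit registers per APU  */
    const int reg_bytes     = 16;      /* 128 bits = 16 bytes per register */

    int ls_total_kb  = apus_per_pe * ls_kb_per_apu * pes_per_be;               /* 4096 KB */
    int reg_total_kb = apus_per_pe * regs_per_apu * reg_bytes * pes_per_be / 1024; /* 64 KB */

    printf("Local Storage total: %d KB (%d MB)\n", ls_total_kb, ls_total_kb / 1024);
    printf("Register file total: %d KB\n", reg_total_kb);
    return 0;
}
```

It prints 4096 KB ( 4 MB ) of Local Storage and 64 KB of register space for a 4-PE Broadband Engine, matching the numbers above.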
 
1. I/O bandwidth limitation - The old saying goes like "Your computer is as fast as the slowest part of your computer", and this is why mainframes with gigabyte I/O continue to blow away mega PC servers in SQL performance with PC-grade I/O.

We don't know anything on I/O bandwidth. There is no reason to not stick gigabyte I/O in there.

Suppose you have 1 million instances of a triangle strip object sharing one static object containing the transform matrix and lighting vector, 512 bytes in size. All triangle strips must access this static object to perform their transform operation and the bandwidth cost is 512 Mbyte/frame * 60 frames/s = 30 GB/s, presuming a static object broadcast. Ouch. Now try to run physics and collision detection calculations between all these objects and the bandwidth problem magnifies to the order of terabytes/s.

That's a lot of instances to illustrate your point, but the hardware will only get faster, so terabytes of bandwidth is only a matter of time.
 
....

The original EE's problems were not bus bandwidth ( FIND me a 300 MHz consumer chip with a 2.4 GB/s FSB and then we will TALK... Pentium 4's FSB debuted at 3.2 GB/s much later, at a much higher [CPU] clock speed and with a nice .18um process [soon becoming .13um]... )
We already went over this, that P4's 3.2 GB/s was all for itself while EE's 2.4 GB/s was shared between half a dozen devices.

What is the bus bottleneck of the supposed Broadband Engine ?
Not enough to feed into and move data out of 8 VUs, that's for sure

the connection with the external memory ? That connection which would probably be Yellowstone RAMBUS RAM yielding ~12-20 GB/s of bandwidth ?
That's how fast CELL will go, 3~4x over current P4 at best. Don't expect any miracles here.

Then suddenly it's all Sony's fault for every questionable decision...
Maybe it is just me, maybe CELL really is a brave new world for everybody else. It is just that I figured out something even better...

Large buffers/memory pools might be even MORE useful than caches: they can take advantage of spatial locality by prefetching a bit aggressively, and software caching can be done to take advantage of the temporal locality that is offered by the code we are running.

The EE's VUs didn't really suffer from the cache issue because they couldn't see the external memory; all they could see was the internal memory built inside the VU, and they worked off it. This is not the case with the CELL VUs, as the patent documents clearly indicate that they can address external memory and use a sandbox technique to isolate one microprocess from another, so cache and external memory bandwidth clearly become a major issue with the CELL VUs.

The days of hardware engineers coming up with their "dream" hardware and expecting coders to swallow it are over; it is now the hardware that must accommodate the needs of coders.
 
Making some quick calculations...

8 APUs/PE * 128 KB/APU * 4 PEs/BE = 4 MB of SRAM used only as local storage ( LS )

8 APUs/PE * 128 registers/APU * 16 bytes/register * 4 PEs/BE = 64 KB of space with registers alone

And then we have 64 MB of fast DRAM whose bus is 1,024 bits wide too...

Are they still targeting 0.01u process for 2005 ?
 
This is not the case with the CELL VUs, as the patent documents clearly indicate that they can address external memory and use a sandbox technique to isolate one microprocess from another, so cache and external memory bandwidth clearly become a major issue with the CELL VUs.

Bandwidth is a concern, but the buses are wide.

The days of hardware engineers coming up with their "dream" hardware and expecting coders to swallow it are over; it is now the hardware that must accommodate the needs of coders.

Not in cases where real time performance is a concern.

Maybe it is just me, maybe CELL really is a brave new world for everybody else. It is just that I figured out something even better...

What brave new world ? The Cell concept isn't new at all. Everyone can figure out a better concept, but going from concept to reality is much harder.
 
Re: ...

DeadmeatGA said:
For those who don't understand why I criticize the CELL architecture, here are my reasons.

Um.. alrighty then.

1. I/O bandwidth limitation - The old saying goes like "Your computer is as fast as the slowest part of your computer", and this is why mainframes with gigabyte I/O continue to blow away mega PC servers in SQL performance with PC-grade I/O. The original EE design suffered from the onchip backbone bus bandwidth bottleneck and the situation has actually worsened with this "Broadband Engine" thing, with all these bandwidth hungry VU2s screaming for data.

Where the hell do you pull this from? Seriously? So far, we know the following: (a) Based on the patent, Cell will have tremendous on-chip bandwidth with ample space. (b) SCE/Toshiba/IBM have licensed Yellowstone.

Going back even to 2000, we find that the talk around Rambus was that its next-generation technology was being sought by SCE, and the numbers being tossed around were 30GB/sec - give or take. Even my buddy Tom Pabst (heh) talked of this (30GB for NGConsole) a year or two ago, after an RDF. If you listen to Rambus, Yellowstone scales well to 50GB/sec and 100GB/sec using a 128-bit bus. I'd reckon ~30GB is a good medium based on past talk around RDFs.

I fail to see where the bandwidth problem is, in fact. In addition to the sheer physical bandwidth, given the flexibility that architectures such as this have, they can turn to other solutions for both bandwidth and storage efficiency.

CELL does nothing to address these problems, and I have given up on my previous CELL-like vision for something better and more logical.

Care to enlighten us on what uber-architecture this is that's not only scalable across a range of products, but commercially viable?

Are they still targeting 0.01u process for 2005 ?

Current talk based on the projections Sony/Toshiba released about a month ago indicates 65-nanometer by 2005/late 2004. IBM and AMD recently talked of the joint development of 45-nm with a target introduction of 2006(?). 65nm seems to be a good bet.

Bandwidth is a concern, but the buses are wide.

Agreed. Bandwidth will always be a concern, although computational resources can be used to alleviate this to an extent - especially on a development platform like a console. SCE's David Brickhill touches upon this in his GDC presentation, in fact.
 
Re: ...

... and I have given up on my previous CELL-like vision for something better and more logical.

Care to enlighten us on what uber-architecture this is that's not only scalable across a range of products, but commercially viable?

The X-Box 2? Nah, just joking.

Sony is not stupid (and IBM, too); they know exactly what was wrong with the PS2 architecture. They are doing everything possible for huge bandwidth in Cell. I'm still amazed by the 1024-bit bus; I wonder how they will implement it. If we had all the exact numbers, we could start to think about the problems of this architecture (especially the bandwidth problem), but as long as we don't know the exact numbers we can only speculate whether it's enough.

Fredi

Update: Just found this at arstechnica:

Since I've been skimming this patent for a while, I will clear some things up:
A "Cell" is the unit of software.
A Processing Element (PE) is the unit of hardware. It contains a control processor (PU) and up to 8 attached processors (APU). It also contains a direct memory access controller (DMAC).
There is 64meg of DRAM attached to each PE. This does not mean a system is limited to just 64meg. Some of the chip designs shown in the patent feature 2 PEs or more, each attached to 64meg of RAM. Also, 64meg is referred to as the "preferred embodiment", so it's highly likely you can hook up as much as you want.
There are a couple of different variations on the basic PE design. One has 8 APUs. Another has 4 APUs and a Raster Engine to do graphics. This is NOT 4 of the APUs being "used" to generate graphics. It is replacing the chip area of 4 APUs with graphics stuff. Most likely "Raster Engine" means "GS3", the successor of PS2's Graphics Synth.
It's unknown how many PE's will be used in the PS3, or if the PS3 will have additional hardware not mentioned in the patent. Be careful with picking any of the examples in the patent and saying "this is PS3".
Inside each PE:
The PU serves as a control processor. It runs some sort of trusted software; think of this as the OS/BIOS. It decides what work the APUs will be given, and when. It will be very much like a traditional microprocessor core. Perhaps PowerPC. Personally, based on some of the text in the patent, I believe it will be MIPS derived. Either way, this is the part that's rather conventional, and will most likely have the typical cache hierarchy, etc.
The APUs in each cell are given work by the PU. They each operate independently. They're assigned software cells referred to as "apulets", mangling the Java applet term. Each APU has a small amount of local memory. This memory is NOT used as a cache. It is a separate scratch memory.
Each APU appears to be very similar to the vector units used in the PS2. They're each capable of applying the same operation to 4 32bit pieces of data. This matches the PS2, and for those of you who aren't familiar, is comparable to SSE. It's important to note that the APUs *only* operate in this way, that they cannot do 4 different operations at once (ie, 4 additions ok, 1 addition 1 multiplication and 2 divisions not ok). One notable difference is that the APU units can do a multiply-accumulate (MAC) in each cycle. This multiplies two numbers, then adds them to a running total, all in the same cycle. This is extremely useful for processing polygons for rendering, or for doing any kind of matrix-based math like, say, physics simulation. For these operations it often nearly doubles performance.
There's also the DMA Controller, which coordinates moving data between the shared global memory (~64meg attached to each chip) and the local memory for each APU. This means that the PU as well as the APUs do not have to be tied up doing repetitive work moving data around.
The patent seems to suggest that the PE will be clocked at around 1GHz. So for each PE that has a full 8 APUs, we'll have a peak of 32 billion 32bit operations per second. That is definitely fast. Compare to a 2GHz P4 which, using SSE, is capable of just 8 billion 32bit operations.
Also, since a PDA is referred to as using a single PE, and a graphics workstation as using 2 PEs or more, PS3 is likely to have performance above this. Any way you slice it, there will be a lot of raw power available. The question is whether real software will be able to utilize this.
Also worth noting is that the "Raster Engine" appears to be capable of operating in parallel. Some of the example designs show multiple PEs, each containing a Raster Engine, and these REs being connected together to drive a single TV signal. Although it's not clear just how they co-operate, it would be safe to think of this as SLI for the Graphics Synth. In particular, one image shows 4 REs working together, not just 2.
While most people are calling this the PS3 patent, the cell architecture is about much more than just PS3. The idea is that it can be used for everything from PDAs to big servers, network routers to digital TVs. That everything that needs a processor will have one, that it will be the same processor, and that they'll all be able to easily share resources over the network. So a lot of stuff in the patent has no direct bearing on PS3.
So that's the hardware. Where's "the Cell"?
The Cell is the software. Each cell is basically a network packet. This packet can contain both data and instructions. The control processors (PU) in each PE coordinate sending these cells around, and scheduling them to run on APUs. This means that a single "program" can actually spread across the network.
An example might make things clear. We're going to imagine we're in the future, where everything uses these processors. Let's say you're watching a live MPEG4 stream on your PDA. Your PDA is using a wireless connection to your desktop computer, your desktop is hooked into your cable modem, which is pulling the data off a server across the net.
Let's say your PDA can't decode the MPEG4 at its full resolution. So the PUs coordinate your desktop doing some of the decoding work, then forwarding the partial results to your PDA. Let's say that the server notices it's sending the same stream to several people in your neighborhood. It coordinates one of the routers on the cable network to do some work splitting the streams out to each viewer, so that it only has to send one stream to your neighborhood.
There's definitely potential to this software model, but it's not at all clear how to actually make such programs easily. One of PS2's problems has been that it was hard to design games that fully utilized the 2 vector units. The cell architecture brings the same problem to a whole new scale, and adds the issue of network latency to it as well.
I think the one thing to take away from the patent is that the architecture is a very bold step. But it has a lot of risk as well, and with the information that's available, it's hard to know how they plan to meet these challenges.

I bolded an interesting part
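To make the "a cell is basically a network packet containing both code and data" idea from that summary a bit more concrete, here is a heavily hypothetical C sketch ( every name in it is made up for illustration; nothing here comes from the patent itself ) of what a software cell and a PU-side dispatch could look like:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software cell: a self-contained packet of program + data   */
/* that a PU can route to whichever APU (local or remote) has capacity.    */
typedef struct {
    uint64_t destination_id;   /* which PE/APU should run this             */
    uint64_t source_id;        /* who sent it / where results go back      */
    size_t   program_size;     /* APU program ("apulet") length in bytes   */
    size_t   data_size;        /* payload length in bytes                  */
    uint8_t  payload[];        /* program followed by data                 */
} software_cell;

/* PU-side pseudo-dispatch: pick an idle APU, DMA the cell's program and   */
/* data into its 128 KB local storage, and start it. All three calls are   */
/* stand-ins for mechanisms the patent only describes abstractly.          */
int  apu_find_idle(void);
void apu_load_local_store(int apu, const uint8_t *src, size_t n);
void apu_start(int apu);

void pu_dispatch(const software_cell *cell)
{
    int apu = apu_find_idle();
    apu_load_local_store(apu, cell->payload,
                         cell->program_size + cell->data_size);
    apu_start(apu);
}
```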
 
Panajev: I agree with you that vertex and pixel shading will fly on CELL.

Where it won't fly is in complex physics and AI code. These will essentially have to be programmed using a message-passing paradigm (with small 128KB nodes), either explicitly with something like MPI or implicitly with autoparallelizing compilers using something like OpenMP as the programming API. A complex physics code with sparse matrices (compressed of course) could *really* use a cache, since it has limited spatial but high temporal locality.

CELL seems like a special purpose network and graphics architecture but with enough brute force to execute general purpose codes at a decent pace.

Cheers
Gubbi
 
Rambus to boost engineering staff for big order
By Therese Poletti
Mercury News


While many semiconductor companies have been doing layoffs or downsizing in other ways, Rambus, the Los Altos designer of memory chip technologies, said that it plans to increase its engineering staff by 25 percent.

The hiring is due to a major contract Rambus recently received from Sony and Toshiba. Earlier this month, Rambus said it had signed a deal to license two new interface technologies, code-named Yellowstone and Redwood, to Sony and Toshiba, for what is believed to be the next-generation Sony PlayStation 3. Rambus declined to comment on product specifics.

The company told analysts in a conference call to discuss its first quarter earnings that it started to add employees and that it will accelerate hiring next quarter. Rambus currently has about 180 employees, most of whom are engineers.

Rambus Chief Executive Geoff Tate said the Sony/Toshiba deal will be worth about $28 million in revenue over the next two to three years.

In the quarter ended Dec. 31, Rambus reported net income of $5.5 million, or 6 cents a share, down from $6.2 million, or 6 cents a share, in the year-ago period. Revenue was up 3 percent to $25.7 million, from $24.9 million a year ago, with $24.3 million of the revenue coming from royalties from licensing of its memory chip designs.

The company, which is involved in several lawsuits involving its patents and a Federal Trade Commission investigation, said that total costs rose in the quarter due to an increase in litigation costs. Costs associated with litigation rose to $4.5 million in the quarter, up from $2.8 million in the prior quarter.

So it's now proven that Sony really uses Yellowstone for the PS3. (Was it proven before?)

So the PS3 team is now: Sony, IBM, Toshiba, Rambus

Anyone with an idea how many people are working on the PS3 right now?

Fredi
 
...

Bandwidth is a concern, but the buses are wide.
The same thing was said about the GS's 512-bit bus before. Unfortunately, it wasn't enough to sustain decent rendering performance.

(a) Based on the patent, Cell will have tremendous on-chip bandwidth with ample space.
128 bytes * 4 GHz = 512 GB/s. Is it enough to feed 32 VUs??? And you are expecting ample space from eDRAM?

(b) SCE/Toshiba/IBM have licensed Yellowstone.
Rambus has become pretty much irrelevant in the computer industry. If Sony chooses to stick with "non-standard" Rambus, that's their choice. But Toshiba will be the sole supplier and Sony will pay through the nose for it.

Going back even to 2000, we find that the talk around Rambus was that its next-generation technology was being sought by SCE, and the numbers being tossed around were 30GB/sec
Rambus always talked big numbers but what they actually delivered was always comparable to contemporary DRAM standard.

I fail to see where the bandwidth problem is, in fact.
Presuming a VU cache miss rate of 10%, you have an 80% I/O bus utilization rate with 8 VUs. Now multiply this number by 4 PEs and you have an I/O utilization rate of 320%. Yep, not even a 128-byte-wide bus is enough to feed all 4 PEs.

they can turn to other solutions for both bandwidth and storage efficiency.
At the burden of developers, of course.
 
Re: ...

DeadmeatGA said:
(a) Based on the patent, Cell will have tremendous on-chip bandwidth with ample space.
128 bytes * 4 GHz = 512 GB/s. Is it enough to feed 32 VUs??? And you are expecting ample space from eDRAM?

Of course it will. That's 16 GB/s of bandwidth per VU, or 2 billion vertices/VU/s (4 billion for dynamically created vertices), more than enough for the computational resources. Pixel shaders are likely going to be long and advanced as well.

Cheers
Gubbi
 
Gubbi,
AI doesn't need to fly that fast locally - I kinda like the idea of a distributed AI model, if it would really work 8) Particularly its use in generating persistent worlds etc.

I disagree about physics though. Obviously it calls for different algorithms, but from what research I did on the subject, matrix factorization in particular can be fairly well adapted to parallel approaches.
And if I could make a vectorized version of Cholesky factorization and a linear solver that can be processed in isolated parts, working entirely on the scratchpad principle, I'm sure a lot better can be done too :)

Applications in supercomputer arrays were always an attractive research topic, I think...
http://www.computer.org/tpds/td1997/l0502abs.htm
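For reference, the textbook unblocked Cholesky factorization looks like this in plain C ( nothing CELL-specific here ); the hard part on an APU would be blocking those inner loops into 128 KB tiles and streaming the panels through the scratchpad:

```c
#include <math.h>

/* In-place Cholesky factorization of an n x n symmetric positive-definite  */
/* matrix stored row-major: on return the lower triangle holds L, A = L*L^T.*/
/* Returns 0 on success, -1 if the matrix is not positive definite.         */
int cholesky(double *a, int n)
{
    for (int j = 0; j < n; j++) {
        double d = a[j * n + j];
        for (int k = 0; k < j; k++)
            d -= a[j * n + k] * a[j * n + k];
        if (d <= 0.0)
            return -1;
        a[j * n + j] = sqrt(d);

        for (int i = j + 1; i < n; i++) {
            double s = a[i * n + j];
            for (int k = 0; k < j; k++)
                s -= a[i * n + k] * a[j * n + k];
            a[i * n + j] = s / a[j * n + j];
        }
    }
    return 0;
}
```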


Deadmeat,
Rambus has become pretty much irrelevant in the computer industry. If Sony chooses to stick with "non-standard" Rambus, that's their choice. But Toshiba will be the sole supplier and Sony will pay through the nose for it.
I guess that's why Intel sticks to Rambus in all their high-end configurations too :) because they are irrelevant.
 
The patent seems to suggest that the PE will be clocked at around 1GHz. So for each PE that has a full 8 APUs, we'll have a peak of 32 billion 32bit operations per second. That is definitely fast. Compare to a 2GHz P4 which, using SSE, is capable of just 8 billion 32bit operations.

I disagree with this assessment...

OK, we know that the APUs can each perform SIMD operations, and if we keep the PS2 VUs' model each APU can do ( pipelined ) 4 MADDs/cycle ( FMAC, fused multiply-add ) and that is 8 FP ops/cycle per APU...

From the patent:

[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed. In a preferred embodiment, local memory 406 contains 128 kilobytes of storage, and the capacity of registers 410 is 128.times.128 bits. Floating point units 412 preferably operate at a speed of 32 billion floating point operations per second (32 GFLOPS), and integer units 414 preferably operate at a speed of 32 billion operations per second (32 GOPS).

Each APU is rated indeed at 32 GFLOPS...

And since we know each APU can do a max of 8 FP ops/cycle...

8 FP ops/cycle * 4 GHz = 32 GFLOPS

And this is for each APU: suggested speed is indeed 4 GHz
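In other words, the per-cycle work of one APU would look like a 4-wide fused multiply-add; each call below does 4 multiplies and 4 adds, i.e. 8 FP ops, and 8 FP ops/cycle * 4e9 cycles/s = 32 GFLOPS. ( A minimal C sketch just to make the counting explicit; the vec4 type is mine, not the patent's. )

```c
typedef struct { float x[4]; } vec4;   /* stand-in for one 128-bit register */

/* One 4-way MADD: acc = acc + a * b  ->  4 muls + 4 adds = 8 FP ops. */
static inline vec4 madd4(vec4 acc, vec4 a, vec4 b)
{
    for (int i = 0; i < 4; i++)
        acc.x[i] += a.x[i] * b.x[i];
    return acc;
}

/* 8 FP ops/cycle * 4e9 cycles/s = 32e9 FLOPS = 32 GFLOPS per APU. */
```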


Quoting again the quote I just posted, I have to disagree that these are "simple" VUs like the PS2's... first of all we haven't been presented with the 4-FMAC structure and one or two FDIVs... the only thing we know is that we have four FP Units: for all we know each could pack an FDIV, for all we know each could be an EFU-like unit...

Another thing: if the four FP Units were indeed 4 FMACs tied together and only able to work as a SIMD unit ( no independent operation allowed and only support for 4-way parallel SIMD operations ), how would we explain THIS ( here is the quote I was presenting again, as I said a few lines above ):

[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed.

Look at the underscored portion of the text...

"[...]a greater or lesser number of floating points units 512 and integer units 414 can be employed [...]"

And we also know that the "ISA is constant across all APUs"... even if we change the number of FP Units, no changes to the ISA or changes to the code should be planned...

How could this work in a standard SIMD VU architecture ?

To me, the workarounds in Instruction decoding and Control Unit operation needed to make sure a 4-way MADD SIMD instruction is performed with 2 FMACs or even 1 FMAC as if we had 4 would involve a certain degree of complexity... while the opposite would not...

What we would need would be the FP Units to be able to work in two modes: independent mode and SIMD mode ( all together )...

Impossible ?

Uhm... but I thought I saw that before... somewhere, it must have been a super-computer with an insane budget... BEEEP!!! WRONG!!! :) We saw it in the EE: as you can quickly check, the Integer Units of the RISC core in the EE were two separate 64-bit IUs, but they could work as a single 128-bit unit, and this is quite close to what I think is going on with the APU's Integer Units and FP Units... it is indeed "proven" and already "pioneered" technology, present in a consumer chip for quite some time ( the EE )...

One of the ways that comes to my mind to do the "other" approach ( the one which fixes in the ISA that each APU is basically made of two standard SIMD VUs, while still letting us vary the number of FP and Integer Units without sacrificing program compatibility ) is that in each chip that uses more or fewer Execution Units the instruction gets micro-coded ( think of having to perform a 4-way SIMD MADD with a single FMAC... you would loop it ~4 times through the FMAC, each time working on a different field of the 128-bit vectors )...

Or we could have 4 APUs with one FMAC each do the operation while working in parallel, but that would be quite a waste...
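Here is a minimal sketch of what that micro-coded fallback would mean ( plain C, purely illustrative; the fmacs parameter stands in for however many FMAC units a given implementation actually has ): the same architectural 4-way MADD is simply stepped through the available units over several internal cycles, so the ISA stays identical while the hardware width varies.

```c
#define VEC_WIDTH 4        /* architectural vector width fixed by the ISA */

/* Execute one architectural 4-way MADD (acc += a * b) on hardware that    */
/* has only 'fmacs' FMAC units: loop over the vector fields in groups of   */
/* 'fmacs'. With fmacs == 4 this takes one pass; with fmacs == 1, four.    */
void madd4_microcoded(float acc[VEC_WIDTH],
                      const float a[VEC_WIDTH],
                      const float b[VEC_WIDTH],
                      int fmacs)
{
    for (int base = 0; base < VEC_WIDTH; base += fmacs)        /* internal cycles */
        for (int lane = base; lane < base + fmacs && lane < VEC_WIDTH; lane++)
            acc[lane] += a[lane] * b[lane];
}
```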

After all the patent says...

The APUs preferably are single instruction, multiple data (SIMD) processors.

And that compared with

this

[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed.


and

this

These processors also preferably all have the same ISA and perform processing in accordance with the same instruction set.

tells me something is a bit unclear in this patent...

I still have some other comments, but I wanted to get these off my chest first...
 
Gubbi, I have one comment...

The 1,024-bit buses ( e-DRAM, feeding the APUs ) are not likely to run at 4 GHz IMHO... 2 GHz sounds like a safer bet to me... as is customary in many architectures, local buses are not running at max speed, but at a fraction of it...

2 GHz * ( 1,024 bits / 8 ) = 256 GB/s

The bus that feeds the APUs ( the one that runs along the PE connecting PU and all the 8 APUs ) is connected to the e-DRAM... ( for now the details of having each DMAC accessing a separate memory bank, etc... are left aside )

Each PE bus ( feeding the 8 APUs and the PU ) can transfer 256 GB/s, so that means the 8 APUs ( in the theoretical situation in which what the PU needs is irrelevant ) each get 256 / 8 = 32 GB/s.

Even if this bus ran at 1 GHz we could still deliver 128 GB/s per bus and that would mean 16 GB/s per APU

Actual transfers will not be so near these max values ( that is why I hope the bus is clocked at 2 GHz, or half the speed of the processing units ), as the DRAM can only service 256 GB/s of bandwidth ( with the DRAM clocked at 2 GHz this would mean only 64 GB/s for each of the four PEs, thinking about having 4 PEs sharing that bandwidth; of course this 64 GB/s is a theoretical max figure )...

So from the DRAM it seems to me that a maximum of 8 GB/s can come to each APU in the current BroadBand engine design...

It could be more, and I could have made some gross mistakes that lowered the figure further... the extra bandwidth in the PE bus can be used by the APUs exchanging data amongst themselves and communicating with the PU, which makes sense...
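Spelling out the arithmetic above in a throwaway C calculation ( the 2 GHz bus clock and the 4-PE sharing of the eDRAM are my assumptions, not confirmed figures ):

```c
#include <stdio.h>

int main(void)
{
    const double bus_bits = 1024.0;
    const double bus_ghz  = 2.0;       /* assumed: half the 4 GHz core clock */
    const int    apus     = 8;
    const int    pes      = 4;

    double pe_bus_gbs   = bus_ghz * (bus_bits / 8.0);   /* 256 GB/s per PE bus          */
    double per_apu_bus  = pe_bus_gbs / apus;             /* 32 GB/s per APU on the bus   */
    double dram_per_pe  = pe_bus_gbs / pes;              /* 64 GB/s per PE from the eDRAM */
    double dram_per_apu = dram_per_pe / apus;             /* 8 GB/s per APU from the eDRAM */

    printf("PE bus: %.0f GB/s, per APU on the bus: %.0f GB/s\n", pe_bus_gbs, per_apu_bus);
    printf("eDRAM per PE: %.0f GB/s, per APU from eDRAM: %.0f GB/s\n", dram_per_pe, dram_per_apu);
    return 0;
}
```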
 
another thing...

We already went over this, that P4's 3.2 GB/s was all for itself while EE's 2.4 GB/s was shared between half a dozen devices.

Still, it would have been easier to put in a big fat L2 or to significantly increase the micro-memories ( and that is where one of their advantages over caches can be seen... )...

Are you saying that if the RISC core had 32-64 KB of SPRAM instead of 16 KB and the VU0 had 16-32 KB of total micro-memory and the VU1 had 32-64 KB of micro-memory the situation would have been the same ?

FSB speed is tough to increase and that is why, in a hybrid-UMA system like the PS2, we have fat ( well, sort of ;) ) local buffers to avoid extreme bus contention...


Not enough to feed into and move data out of 8 VUs, that's for sure

No, that is not sure 8)
 
The same thing was said about the GS's 512-bit bus before. Unfortunately, it wasn't enough to sustain decent rendering performance.

You're referring to the texture bus...

While that does indeed lead to the situation in which we have 2.4 GPixels/s and "only" 1.2 GTexels/s ( bi-linear filtering ), that is not the limiting factor... add the capacity for multi-texturing via loopback and you will see the GS fly quite a bit more, as far fewer Triangle Set-up, T&L and rendering resources would end up being wasted unless we really NEED multi-pass rendering...
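( For reference, those fill-rate figures follow from the commonly quoted GS layout - 16 pixel pipelines at ~150 MHz, with bilinear texturing effectively halving the active pipes - a quick calculation under those assumptions: )

```c
#include <stdio.h>

int main(void)
{
    const double clock_mhz      = 150.0;  /* GS core clock (approx. 147.456 MHz) */
    const int    pixel_pipes    = 16;
    const int    textured_pipes = 8;      /* bilinear texturing halves throughput */

    printf("Untextured fill: %.1f GPixels/s\n", clock_mhz * pixel_pipes    / 1000.0);
    printf("Textured fill:   %.1f GTexels/s\n", clock_mhz * textured_pipes / 1000.0);
    return 0;
}
```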
 