PlayStation III Architecture

PiNkY said:
- the core problem (and where the scientific breakthroughs are required) will be compilers, as they need to extract parallel code out of (still mostly) inherently sequential algorithms at least an order of magnitude better than the current state of the art (guesstimate); otherwise I'd say about 90% of your execution resources will sit idle at any given time for GP code (though DSP-friendly stuff (such as Transform & Lighting) will be quite suited to such an architecture), if you want to challenge the traditional workstation market.

Quite a bit has been done on auto-parallelizing and auto-vectorizing compilers. As to how successful they are - well, YMMV. Auto-vectorization is relatively simple, whereas auto-parallelization has been a thornier problem (generalizing shamelessly). Graphics tasks, as you point out, will do just fine obviously. If the OS likes lightweight threads, it will be relatively simple to get good performance out of the parallel processors. If the problem is to get maximum performance for a particular thread, however, I'd generally suspect that writing the parallel code "by hand" is the way to go. That's not necessarily all that hairy though. We'll see what kind of development tools they come up with. Or you will - I doubt I'll have the time to play around with it. :?
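To make that vectorize-vs-parallelize distinction concrete, here is a minimal C sketch of my own (not anything from the thread): the first loop is the easy case, the second shows the loop-carried dependency that defeats naive auto-parallelization.

```c
/* Easy for an auto-vectorizer: independent iterations, unit stride,
   no cross-iteration dependencies - maps directly onto 4-wide SIMD. */
void scale(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Much harder to auto-parallelize: each iteration depends on the
   previous one, so the loop is inherently sequential as written. */
float recurrence(const float *src, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc = acc * 0.5f + src[i];  /* loop-carried dependency */
    return acc;
}
```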

- interesting times are definitely ahead, as this is the first time in two decades that someone deviates from the RISC way of engineering MPUs on a broad economic scale. It remains to be seen how the real-world performance relates to traditional consumer MPUs, which will probably sport GFLOP numbers (as if they meant anything beyond marketing, though) at least a magnitude lower than this in the low-cost space. My guess though would be more or less parity in 2005, as massively parallel architectures will be heavily dependent on advances in compilers/programming paradigms.

General-purpose parallel computing (let's neglect systolic arrays and similar) has always had the drawback of a greater number of physical chips and greater overall system complexity. System cost has never allowed that approach to contend for the mainstream. When you fit several processors onto a single chip, however, that picture changes. Although it is difficult to reach a high degree of processor utilization on a multiprocessor, it is equally obvious that at some point, investing all the gates that finer lithography allows into making a single thread execute as fast as possible will not be optimal even for general-purpose computing. Intel's Hyper-Threading is an implicit confirmation that so many of their CPU resources are sitting unused even now that it is easy to schedule another thread to execute in parallel.

While "massivly parallel architectures will be heavily dependant on advances in compilers/programming paradigms" is certainly true, you have to remember that a four way design hardly qualifies as massively parallell these days. :) When we get beyond forty thousand, you have more of a case.

When parallel computing will be more cost/performance-efficient than PC-level single processors for general computing is a difficult call to make. It depends too much on what code you want to run. The PC arena will probably be quite resistant to mainstream parallel processing for legacy/OS/licensing reasons. But even there we are likely to see dual-on-a-chip designs within a few years at most, occupying the higher end of the PC cost spectrum at first. It would be a surprise to everyone if Intel had a multi-processor-on-a-chip ready for inclusion in a cheap console anytime soon though, and even when they do, they still have to design within the PC paradigm.

Yup, interesting times are definitely ahead. :)
I wonder how confined this technology will be to consoles, or to what extent IBM/Sony will use it to drive low cost wedges into other markets.

Entropy
 
Pinky, just to make a little note... but VU1 has two FDIV units, and a simple transform can be done in 5 cycles (max)...

Edit: sorry, that sounded quite Mr. Smartypants-like... I am sorry...

The efficiency of SIMD units might be the reason why people like Creative went with scalar FP units for the P10...

Still, I hope that even if we have 128-bit registers, the FP units in the APU are not just arranged as a SIMD unit, but can be allocated more flexibly...

After all, the EE's RISC core has 128-bit GPRs, but it has two scalar ALUs (yes, they can ALSO work as a 128-bit SIMD unit)...
 
This is interesting...


[0070] APU 402 further includes bus 404 for transmitting applications and data to and from the APU. In a preferred embodiment, this bus is 1,024 bits wide. APU 402 further includes internal busses 408, 420 and 418. In a preferred embodiment, bus 408 has a width of 256 bits and provides communications between local memory 406 and registers 410. Busses 420 and 418 provide communications between, respectively, registers 410 and floating point units 412, and registers 410 and integer units 414. In a preferred embodiment, the width of busses 418 and 420 from registers 410 to the floating point or integer units is 384 bits, and the width of busses 418 and 420 from the floating point or integer units to registers 410 is 128 bits. The larger width of these busses from registers 410 to the floating point or integer units than from these units to registers 410 accommodates the larger data flow from registers 410 during processing. A maximum of three words are needed for each calculation. The result of each calculation, however, normally is only one word.

Isn't this interesting? Going to the FP units or integer units from the registers we have a 384-bit bus, and going back to the registers we have a 128-bit bus...
 
There are only two busses going between the FP units and the registers (420) and two between the integer units and the registers (418).

Bus 420 going to the FP units is 384 bits wide, and going back to the registers it is 128 bits (to store the whole 128-bit word back into the destination register)...

I think this should mean each FP unit is probably an FMAC plus an FDIV... if not, I'd say "WHERE is the FDIV?"

There are a few options... either each of the FP units is a 128-bit vector unit with an embedded FDIV, or each FP unit is just that, an FPU with an FMAC and an FDIV; and in theory, if they have been arranged this way, it could mean that they can also be used in non-SIMD fashion (maybe)...

I'd like to learn more about these FP Units...
 
For a 4-way SP FMADD you need up to three 128-bit operands (d <-- a*b+c), so the result bus is 128 bits wide, while the bus to the FP unit is 384 bits wide; nothing fancy there.
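Spelled out as code - a trivial C sketch of my own, not from the patent - the operand arithmetic is just 3 x 128 bits in, 1 x 128 bits out:

```c
/* A 4-way single-precision FMADD: d = a*b + c.
   Three 128-bit source operands (3 x 128 = 384 bits) flow from the
   registers to the FP unit; one 128-bit result flows back. */
typedef struct { float x, y, z, w; } vec4;

vec4 fmadd4(vec4 a, vec4 b, vec4 c)
{
    vec4 d;
    d.x = a.x * b.x + c.x;
    d.y = a.y * b.y + c.y;
    d.z = a.z * b.z + c.z;
    d.w = a.w * b.w + c.w;
    return d;   /* a single 128-bit word back to the register file */
}
```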

You might be right about VU1's second FDIV: it is part of the EFU and supports RSQRT (though there is no EDIV). Suzuoki himself, though, claimed a throughput of 1/13 cycles in one Sony/Toshiba presentation.

Since you only have a single DIV with a throughput of 1/7 cycles, I do not see how you could do a perspective transform with a repeat time of 5?

Entropy, I agree with your statements about migrating multi-core architectures into the low-end consumer/desktop market, but honestly I don't think it will take too long until we see dual-core dies in PCs. I think the sweet spot for the high-end desktop/entry-level workstation market will be dual cores (maybe first quad cores in 3-4 years), while still improving individual cores' single-thread/SMT performance. For example, I am still wondering about the increase in gate complexity for Prescott (though I admit I know almost nil about that chip); there might be a slim chance it actually is a dual-core P4 implementation (much more likely though are increased cache sizes), with Intel employees seeming quite confident they can match whatever AMD brings to the table (in all the interviews I can remember). Another (though poor) indication might be the increase in I/O bandwidth (why that large jump to an 800 MHz FSB?).
 
Pinky said:
I agree, best case might be a bit misleading, but I still think these numbers are appropriate if optimizing for vertex throughput.
Well, I was mostly just trying to point out that the processors themselves aren't that poorly utilized. It's more the external factors that tend to break down utilization efficiency. And to be fair, those are rather unpredictable - a vertex program can be perfectly efficient, but then end up waiting half the time for the rasterizer to draw the triangles. That doesn't mean the FLOP rating is no good; it's just the balance of the pipeline that is the problem. ;)
Besides, we all know FLOPS ratings were never a particularly meaningful measurement of performance anyhow - where did you ever see a math equation consisting solely of the single instruction that just happens to perform the most floating-point operations at the same time? Not to mention most vector math happens on only 3 components anyhow... ;)

In regards to DIV - the EFU also has a reciprocal instruction. An unrolled transform loop can be written with two DIVs and one reciprocal per iteration, totalling 15 clocks per 3 vertices. This would also run well over 2 GFLOPS - although I don't find that a particularly useful example.
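To show the shape of such a loop, here is a plain-C sketch (my illustration, not the actual VU code; the real version would be unrolled VU microcode interleaving FMAC work with the long-latency DIV/reciprocal so the divide unit never stalls the pipeline):

```c
/* Perspective transform: 4x4 matrix multiply per vertex, then a
   divide by w. In VU microcode the 1/cw would come from FDIV or the
   EFU reciprocal, overlapped with the FMAC work of other vertices. */
typedef struct { float x, y, z, w; } vec4;

void transform(vec4 *out, const vec4 *in, const float m[4][4], int n)
{
    for (int i = 0; i < n; i++) {
        vec4 v = in[i];
        float cx = m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.z + m[0][3]*v.w;
        float cy = m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.z + m[1][3]*v.w;
        float cz = m[2][0]*v.x + m[2][1]*v.y + m[2][2]*v.z + m[2][3]*v.w;
        float cw = m[3][0]*v.x + m[3][1]*v.y + m[3][2]*v.z + m[3][3]*v.w;
        float rw = 1.0f / cw;          /* the DIV or reciprocal */
        out[i] = (vec4){ cx*rw, cy*rw, cz*rw, cw };
    }
}
```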

But honestly, sustaining an average 2.5 GFLOPS during e.g. T&L for a single frame on VU1 in-game does not seem realistic to me
That's undoubtedly true - sampling the throughput of any given unit on a per-frame basis will yield numbers WAY low in anything but specially architected tech demos.
A game that is to run at constant fps will probably spend closer to 50% of frame time on T&L on average, and approach 100% only in the rarest of situations where lots of things happen on screen.

BTW, completely OT: my deepest regards for your situation on the Korean peninsula; as a German in these times, I wish for and still believe in a similarly peaceful future for your country (not from an economic POV though).
A bit of a misunderstanding here... I'm actually European - but I work in Korea.
But since you mentioned it: specifically, I'm from Slovenia (the ex-Yugoslavia one), so I'm also quite hopeful these two countries can work out a peaceful solution, unlike some of the things that happened in our ex-one.
 
Thanks Pinky for your comments: good, reasoned points help a discussion :)

About the Pentium 4, here is what we know:

1) it is going to have 100+ million transistors

2) it is going to have 2nd-generation Hyper-Threading support (4 threads? The actual implementation allows such an expansion, as instructions in the trace cache are already marked with a thread ID... we would need some more execution units though ;) )...

The rest (better FPU, fully 32-bit fast ALUs, more L2, etc.) is left to speculation...
 
Depends on what you mean by sharing. Data won't be shared implicitly like in a traditional SMP (i.e. no snooping to keep on-die RAM coherent). Data will have to be explicitly sent from one chip to the other in packets.

Yeah, what I mean is that memory can be shared between chips. So it's not limited to what they have on-chip.

I don't think it will take too long until we see dual-core dies in PCs.

Didn't Intel plan to have 4 cores on a die for its 1-billion-transistor chip?
 
Another failure.

Another engineering disaster coming out of Kutaragi's aging head; he is where I was 4 years ago, but I have since moved on to better ideas.

The only thing that concerns me in the Sony patent filing is the object packet thing, since my approach still relies on object packets for interprocessor communication; there is no other way.... Other than that, the rest of the filing was pure garbage.
 
Re: Another failure.

DeadmeatGA said:
Another engineering disaster coming out of Kutaragi's aging head; he is where I was 4 years ago, but I have since moved on to better ideas.

The only thing that concerns me in the Sony patent filing is the object packet thing, since my approach still relies on object packets for interprocessor communication; there is no other way.... Other than that, the rest of the filing was pure garbage.

And who are you? :eek:
 
And who are you?
He is Deadmeat, a person who was banned from the Gaming Age forums for spilling BS like this over and over and over and OVER. He was always full of all kinds of 'information' that has, without exception, been proven wrong every single time. He is still famous for that, and still gets mentioned every now and then.

Best suggestion would be to ignore him.
 
Re: multicore PC processors.
Sure they will come.
But they have a huge drawback in that they will be, well, two (or four) PC processors on a single die instead of on physically distinct chips. That's it.
That is not how you would choose to go about it if you were designing a multiprocessor architecture from scratch. Not only do they carry the x86 architectural baggage around, but perhaps more seriously, they carry PC "architecture" baggage as well, and they have to implement multiprocessing in a way that is consistent with existing concepts and code.

Basically, PC-style multiprocessing works decently when the processors work on different problems, with distinct datasets.

PS3 style multiprocessing bears with it the promise of something far better, but it remains to be seen how this promise is fulfilled.
There! On topic again. :)

Entropy
 
Well... I know I went through this already in the past...

but, leaving aside the overall power of the machine, which we ALL know is gonna be a beast...

what I'm concerned about is the COOLING SYSTEM a machine of this power (and I mean electrical power here) would need to avoid unhappy customers and fires all around the world...

not only that, but the NOISE level the system would produce... as I said before, my PS2 is pretty bad when you consider that it is supposed to be this cool-living-room-entertainment-system-for-the-family-with-DVD-playback-included (TM)... I mean, I put on Lord of the Rings the other night and I just couldn't help noticing the hum coming from the machine during quiet scenes... and I've got a DTS surround sound system which pretty much masks the effect; otherwise it would be even worse...
 
Entropy said:
PS3 style multiprocessing bears with it the promise of something far better, but it remains to be seen how this promise is fulfilled.
There! On topic again. :)
Entropy

Actually, programming for CELL will be more restrictive than programming for a multiprocessor. Not only do you need a workload that can be parallelized, but it has to parallelize in discrete chunks (cells).

Network processing and media streams seem to fit this very well. But given a general-purpose (i.e. sequential, as in a typical C/C++/Java) program, a CELL processor essentially reduces to one PU with one APU (the PU can only schedule work to one APU at a time) with explicit caching (ugh!). Against a go-for-broke-single-thread-performance CPU it will suck, unless the compiler does some *crazy* stuff.
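To illustrate what "explicit caching" means in practice, a hypothetical C sketch - the dma_get/dma_put/dma_wait names are invented for illustration; the patent does not spell out an API:

```c
#include <stddef.h>

/* Hypothetical DMA primitives for staging data in and out of an APU's
   local memory; invented names, for illustration only. */
void dma_get(void *dst, const void *src, size_t bytes);
void dma_put(void *dst, const void *src, size_t bytes);
void dma_wait(void);

enum { CHUNK = 1024 };

/* Double n floats in main memory, one chunk at a time. Nothing is
   cached for you: the program itself must stage data into local
   memory, compute, and stage the results back out. */
void process(float *main_mem, int n)
{
    static float local[CHUNK];      /* stands in for APU local memory */
    for (int off = 0; off < n; off += CHUNK) {
        int len = (n - off < CHUNK) ? n - off : CHUNK;
        dma_get(local, main_mem + off, len * sizeof(float));
        dma_wait();                 /* no hardware coherency: wait explicitly */
        for (int i = 0; i < len; i++)
            local[i] *= 2.0f;       /* the actual work */
        dma_put(main_mem + off, local, len * sizeof(float));
        dma_wait();
    }
}
```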

Cheers
Gubbi
 
At a recent seminar, Sony President Kunitake Ando commented, "The next-generation PlayStation will be linked to the Memory Stick to form a new platform...Given this momentum, we could soon be surrounded by the Memory Stick."

So is it safe to say that Sony's strategy is to use Cell-based technology, the Memory Stick and some kind of (FireWire?) technology for all their new products (PS3, Vaio, cams, TVs...)?

Just an idea: would it be possible to make a Cell-based TV that does the whole rasterizing of the game directly on the cell (or cells) of the TV (rasterizing optimized for that particular TV)? Say you wanna sell a deluxe TV for hardcore gamers with a huge screen res and quality. This could create high-res versions of every PS3 game (I sure would buy that TV ;) ).

About the Memory Stick: you could have a personal Memory Stick with all your settings for any of your Sony products (game save data, TV channel settings, equalizer settings, ...)

Fredi

PS: Hi everyone. :)
 
to Entropy:

Sure, a new architecture designed to be explicitly parallel and scalable would differ quite a bit from a possible x86 multicore chip, but the "x86 architectural baggage" is IMO becoming less and less of a factor. AMD's 64-bit extensions make all (by then) 16 registers general-purpose; this was probably one of the biggest advantages of RISC machines from a compiler POV. Another example would be the x87 FP coprocessor, which will basically be replaced by SSE1/2 over the next 2 years. IMO Intel has actually done a good job there, integrating a simple version of that feature for compatibility while at the same time forcing the use of SSE for newer performance-critical apps (and licensing it to AMD), so goodbye, stack-based FP. With IBM willing to introduce the 64-bit PPC architecture into the desktop market, I expect Intel will sooner or later go that route as well, hopefully keeping compatibility with AMD's efforts. I am getting way OT... :?
 
To McFly:
I think it is far more likely that Sony is going after Nintendo's handheld market.

Imagine a handheld with a cut-down CELL chip with one PU and, say, 2 APUs, together with a single Pixel Engine. The media for the handheld conforms to the Memory Stick standard.

Now imagine the PS3 which has the same basic architecture, only with vastly more computing power (and power consumption). You would then be able to run the same game on the PS3 as on the handheld (but in higher resolutions etc.)

Would this be competitive with GBA (and next generation GBA)? I believe so.

Cheers
Gubbi
 
Gubbi said:
I think it is far more likely that Sony is going after Nintendo's handheld market.

That was on my list anyway. ;)

I think a GFLOP (say 10) Cell would be more than enough to fight the GBA II.

I'm thinking more and more that Cell will be used in every Sony product very soon (maybe before, or at the same time as, the PS3), whether they need teraflops, gigaflops or less - just to make sure that they can all communicate in some way.

Fredi
 
(off-topic)
Deadmeat! Haven't heard from you in ages! Remember that chip process for a system-on-a-chip design that NEC had, that could have been 200M-250M transistors, that you said MIGHT be best used for a DC2 - what happened to that, and if that was a real technology, could it be applied to the rumored Nintendo+NEC CPU chip? I know most people don't believe a thing you say, but I found some of your posts very interesting.

cheers.
 
...

if that was a real technology, could it be applied to the Nintendo+NEC CPU rumored chip?
1. DC2 was a real technology, straight from the mouth of Yuji Naka. Sega just ran out of the cash needed to bring it out.

2. Nintendo will not bring out a next-gen console; there is no point to it, and Nintendo will focus on what they do best: handhelds.

For those who don't understand why I criticize the CELL architecture, here are my reasons.

1. I/O bandwidth limitation - The old saying goes, "Your computer is only as fast as the slowest part of your computer," and this is why mainframes with gigabyte I/O continue to blow away mega PC servers with PC-grade I/O in SQL performance. The original EE design suffered from an on-chip backbone bus bandwidth bottleneck, and the situation has actually worsened with this "Broadband Engine" thing, with all these bandwidth-hungry VU2s screaming for data.

2. Data dependency - The thinking behind CELL is "Let's turn every object into an individual micro-process and run them independently of each other to boost performance!" This sounds good (it sounded good to me as well 3~4 years ago) until the data dependency problem is introduced. Objects inherently share each other's data, from the top (the container object that hosts the object) to the bottom (the superclass static object shared among all instances) of an object structure hierarchy. To synchronize the data between objects, you need some kind of IPC mechanism, which is generally slow. Even if you come up with a lightning-fast IPC mechanism, you would still not get far.

Suppose you have 1 million instances of a triangle-strip object sharing one static object, 512 bytes in size, containing the transform matrix and lighting vector. All triangle strips must access this static object to perform their transform operation, and the bandwidth cost is 512 MB/frame * 60 frames/s ≈ 30 GB/s, presuming a static-object broadcast. Ouch. Now try to run physics and collision detection calculations between all these objects and the bandwidth problem magnifies to the order of terabytes/s.
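Spelling out that arithmetic (just restating the poster's numbers):

$$10^6\ \text{objects} \times 512\ \text{B} = 512\ \text{MB/frame}, \qquad 512\ \text{MB/frame} \times 60\ \text{frames/s} \approx 30\ \text{GB/s}$$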

To design a workable MPP architecture, you must overcome the above two fundamental problems: the I/O bandwidth bottleneck and the data dependency that prevents calculations from being done in parallel. CELL does nothing to address these problems, and I have given up on my previous CELL-like vision for something better and more logical.
 