Why does each CELL core pack 8 VUs?

I understand the logic of packing 4 VUs into CELL (to handle 4x4 transform matrices), but why is Kutaragi packing 8 VUs into CELL? That would mean that CELL is optimized to handle 8xSomething matrices, but what could those be? Any developer input would be appreciated.
 
Actually they are not VUs like PS2's VUs.

The things in PS3's Cell-based CPU (and GPU) are called APUs. They have 4 FPUs AND (here's the difference) 4 Integer Units each. The PS2 VUs did not have Integer Units, I am fairly certain of that.

That means a single Processing Element will have:

1 PPC core
8 APUs
128 KB of local memory per APU
32 FPUs
32 Integer Units
DMA
and some other stuff

PS3's CPU will have 4, 8 or 16 Processing Elements, so that will mean:

32, 64 or 128 APUs
128, 256 or 512 FPUs
128, 256 or 512 Integer Units


I kinda doubt that the PS3 CPU would actually have 16 PEs though. That would mean 128 APUs, thus 512 FPUs and 512 Integer Units. That sounds way too crazy. Only 1GHz would be needed to reach 1 TeraFlop performance.

I am betting more on 8 PEs, thus 8 PPC cores, 64 APUs, 256 FPUs and 256 Integer Units. Then 2GHz would give us 1 TeraFlop.

Or they could just go with the 4 PEs, thus 4 PPC cores, 32 APUs, 128 FPUs and 128 Integer Units. Then 4GHz for 1 TeraFlop.
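
For anyone who wants to check the numbers, here is a quick back-of-envelope sketch. It assumes each FPU retires one multiply-add per cycle and that a multiply-add counts as 2 floating-point operations; the clock figures above only work out under that assumption, which is mine and not something stated in the patent discussion. The function names are made up for illustration.

```python
# Back-of-envelope peak throughput for the rumored configurations.
FPUS_PER_APU = 4
APUS_PER_PE = 8
# Assumption: one multiply-add per FPU per cycle, counted as 2 flops.
FLOPS_PER_FPU_PER_CYCLE = 2


def fpu_count(pes):
    """Total FPUs for a chip with the given number of Processing Elements."""
    return pes * APUS_PER_PE * FPUS_PER_APU


def peak_gflops(pes, clock_ghz):
    """Theoretical peak in GFLOPS (no stalls, every FPU busy every cycle)."""
    return fpu_count(pes) * FLOPS_PER_FPU_PER_CYCLE * clock_ghz


for pes, ghz in [(16, 1.0), (8, 2.0), (4, 4.0)]:
    print(pes, "PEs,", fpu_count(pes), "FPUs @", ghz, "GHz:",
          peak_gflops(pes, ghz), "GFLOPS")
```

All three configurations come out at 1024 GFLOPS, i.e. roughly 1 TeraFlop. Counting only one operation per FPU per cycle would halve that.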
 
...

PS3's CPU will have 4, 8 or 16 Processing Elements, so that will mean:

32, 64 or 128 APUs
128, 256 or 512 FPUs
128, 256 or 512 Integer Units
So all my lectures about logic gate density and what's possible on a 65 nm process have gone to waste... I need to give up on this "enlightenment" business.
 
DeadmeatGA said:
I understand the logic of packing 4 VUs into CELL (to handle 4x4 transform matrices), but why is Kutaragi packing 8 VUs into CELL? That would mean that CELL is optimized to handle 8xSomething matrices, but what could those be? Any developer input would be appreciated.

A single transform is not going to be split up across APUs. Each APU (in each PE) is working on different vertices (and possibly matrices) in parallel.
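
A rough sketch of what I mean, in plain Python. The batching scheme (round-robin) and all the names are my own illustration, not anything from the patent. The batches are processed one after another here; on the real thing each batch would be handled by a different APU at the same time.

```python
def transform(matrix, vertex):
    """Multiply a 4x4 row-major matrix by a 4-component vertex."""
    return [sum(matrix[r][c] * vertex[c] for c in range(4)) for r in range(4)]


def split_among_apus(vertices, num_apus=8):
    """Deal the vertices out into num_apus batches, round-robin."""
    return [vertices[i::num_apus] for i in range(num_apus)]


def run_apu(matrix, batch):
    """One hypothetical APU: does whole transforms, one vertex at a time."""
    return [transform(matrix, v) for v in batch]


# A simple translation matrix and 16 vertices along the x axis.
translate = [[1, 0, 0, 10],
             [0, 1, 0, 20],
             [0, 0, 1, 30],
             [0, 0, 0, 1]]
vertices = [[float(i), 0.0, 0.0, 1.0] for i in range(16)]

batches = split_among_apus(vertices)
results = [run_apu(translate, batch) for batch in batches]
print(len(batches), "batches of", len(batches[0]), "vertices each")
```

No single matrix multiply is ever spread over several APUs. The number of APUs only decides how many vertices are in flight at once.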

Cheers
Gubbi
 
A team of 8 PPCs, 64 APUs and within those APUs, a total of 256 FPUs (plus 256 Integer Units) would be an awesome thing.

I don't see why we couldn't have this many FPUs on a single chip.

Look at the 3DLabs P10: it's got 76M transistors and for that, you get 200 32-bit SIMD processors spread out for geometry and pixel processing.

A chip that will be between 500 million and 1 billion transistors, such as PS3's CPU, will have MUCH more processing power and more parallelism than the P10 VPU, a year-2002 PC chip.
 
Megadrive1988 said:
A team of 8 PPCs, 64 APUs and within those APUs, a total of 256 FPUs (plus 256 Integer Units) would be an awesome thing. I don't see why we couldn't have this many FPUs on a single chip.

Look at the 3DLabs P10: it's got 76M transistors and for that, you get 200 32-bit SIMD processors spread out for geometry and pixel processing.

A chip that will be between 500 million and 1 billion transistors, such as PS3's CPU, will have MUCH more processing power and more parallelism than the P10 VPU, a year-2002 PC chip.

How does the P10 compare performance-wise to the GPU in a Radeon 9800 Pro? BTW, how do you know CELL will have between 0.5 and 1.0 billion transistors? What is the basis for your assumption?
 
Re: ...

DeadmeatGA said:
So all my lectures about logic gate density and what's possible on a 65 nm process have gone to waste...

If only Sony, IBM and Toshiba had hired you instead of that moronic crack team of IC designers they could have saved SO many billions of $$$.


*G*
 
"How does the P10 compare performance-wise to the GPU in a Radeon 9800 Pro? BTW, how do you know CELL will have between 0.5 and 1.0 billion transistors? What is the basis for your assumption?"


The R300 and R350 used in the Radeon 9700 Pro and Radeon 9800 Pro are much more powerful than the P10: 8 pipes vs. 4 pipes, and PS 2.0+ vs. PS 1.1.

As for EE3/Cell, it's always been assumed that it would be 500M or more transistors, ever since the first Sony announcement of PS3 back in 1999 (the old EE Times article, for instance). I have no idea if PS3's CPU is still slated to be 500M transistors or close to a billion. But I have seen it said here that it might be 700~800M transistors, especially with that 64 MB of eDRAM.
 
Just a guess from my programmer's point of view: 8 vector units might just mean the same thing as having multiple integer / float / load / store / whatever units. You then need a processor capable of dispatching 8 SIMD instructions per cycle, one to each unit, maybe in bundles to make decoding simpler. Probably no easy task; we'll see if they can do it, and how good it is...
 
CELL doesn't work like this.
The standard general-purpose CPU that sits in front of the 8 APUs doesn't decode or dispatch anything.
Each APU runs its own program; each APU decodes and executes its own instructions.
The general-purpose CPU is needed to synchronize and feed the whole APU orchestra.
We don't even know at this moment whether developers will be allowed to run their own code on the CPUs.
I hope that's the case, because there is a lot of code in a game (like the AI code) that could run poorly on an APU architecture, IMHO.
I can just say that all this CELL stuff started to make a lot of sense to me once I started working on the PS2 last May... and now I even love writing microcode for the VUs :)
I don't know if this is the best way to go in the future, but I do know that even the PS2's performance could be boosted a lot with a few tweaks. So this time Sony has the chance to expand on the previous ideas (and looking at the patents you can see that the guys who developed the PS2 tech are the same guys who submitted the CELL patents) and balance them better.
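
To make the "feed and synchronize" split from the start of this post concrete, here is a toy model. Python threads and queues stand in for the APUs and their input buffers; that mapping, the sentinel convention and all the names are assumptions of mine for illustration, not how the real hardware works.

```python
import queue
import threading


def apu_worker(program, inbox, outbox):
    """A hypothetical APU: runs its own program on whatever data it is fed."""
    while True:
        item = inbox.get()
        if item is None:          # sentinel from the host: no more work
            break
        outbox.put(program(item))


def host(programs, work_items):
    """The general-purpose core: it feeds data and synchronizes, nothing more."""
    outbox = queue.Queue()
    inboxes = [queue.Queue() for _ in programs]
    workers = [threading.Thread(target=apu_worker, args=(p, q, outbox))
               for p, q in zip(programs, inboxes)]
    for w in workers:
        w.start()
    for i, item in enumerate(work_items):
        inboxes[i % len(inboxes)].put(item)   # feed
    for q in inboxes:
        q.put(None)
    for w in workers:
        w.join()                              # synchronize
    return sorted(outbox.get() for _ in work_items)


# Two "APUs", each loaded with a different program.
programs = [lambda x: x * 2, lambda x: x + 100]
print(host(programs, [1, 2, 3, 4]))  # prints [2, 6, 102, 104]
```

The host never looks at what the programs do. It only moves data in and waits for the workers to finish.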

ciao,
Marco
 
Deadmeat said:
I understand the logic of packing 4 VUs into CELL (to handle 4x4 transform matrices), but why is Kutaragi packing 8 VUs into CELL?
Last I checked the patent, each APU is suggested to be a 4-way FPU + 4-way INT, which is equivalent to a single souped-up VU.
The combined FPU count of the APUs is irrelevant, since each of them is processing its own code.
I also think a single-cycle 4x4 transform would lead to a lot of waste: the most commonly used operands are still vectors, not to mention how dreadfully wasteful any scalar processing would be in such a setup.

nAo said:
The general-purpose CPU is needed to synchronize and feed the whole APU orchestra.
I was under the impression that APUs may have the capacity to start DMA transfers by themselves; in other words, they can 'feed' themselves. After all, this is the one thing most sorely missing from a VU to make it useful for more "general purpose" algorithms, e.g. this is what VU0 should have had to make it properly usable.
 
Fafalada said:
I was under the impression that APUs may have the capacity to start DMA transfers by themselves; in other words, they can 'feed' themselves.
Is that in the CELL patents? Anyway... it would be really nice.
Even so, I can see cases where it's still better to have an external source that 'drives' the VU.
After all, this is the one thing most sorely missing from a VU to make it useful for more "general purpose" algorithms, e.g. this is what VU0 should have had to make it properly usable.
100% agreed.

ciao,
Marco
 
question: (this could belong in ANY Cell/PS3 thread but I'll ask here)

With the APUs having 4 INTEGER units each, unlike VU0/VU1 in PS2, can the integer units be used at the same time, in parallel, while FP processing is being done by the 4 FPUs?

dumb question perhaps, oh well.
 
Another question: how many FPUs are in each of the PS2's VUs?

I also know that the PS2 has an additional FPU that is part of the MIPS CPU...
 
I don't know if an integer op can be issued together with an FP op. On the APU, the integer and FP units share the same registers; on the VU they don't. Anyway, integer and FP ops are issued and executed at the same time in the VUs, so I can't see why it couldn't be the same on the APUs :)

VU0 and VU1 each have 4 FMAC units and one reciprocal/division unit. VU1 also has an additional unit (the EFU) that is used to perform special instructions (like trigonometric stuff) but that can also be useful for making even the standard perspective transformation loop faster.

ciao,
Marco
 
Extending vector processing beyond 4 wide isn't very useful for games... it will only work for some tasks, and leave large sections of the pipeline unused for others. Most of the time parallelism will be used with multiprocessing, not vector processing.
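
A crude way to put numbers on that, as a sketch only: assume one operand per issued operation and no packing of several small operands into one wide operation. That is a simplification of mine, and the helper name is made up.

```python
def lane_utilization(operand_width, unit_width):
    """Fraction of lanes doing useful work for one operation.

    Assumes a single operand per issue; operands wider than the unit
    take several passes.
    """
    passes = -(-operand_width // unit_width)   # ceiling division
    return operand_width / (passes * unit_width)


operands = [("scalar", 1), ("vec3", 3), ("vec4", 4)]
for unit_width in (4, 8, 16):
    row = ", ".join(f"{name}: {lane_utilization(width, unit_width):.0%}"
                    for name, width in operands)
    print(f"{unit_width}-wide unit -> {row}")
```

Under those assumptions a vec4 operation keeps a 4-wide unit fully busy but only half of an 8-wide one and a quarter of a 16-wide one, and scalar work fares far worse.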
 
Re: ...

Grall said:
If only Sony, IBM and Toshiba had hired you instead of that moronic crack team of IC designers they could have saved SO many billions of $$$.

A crack team of IC designers will not manage the impossible any better than a team of IC designers on crack.
 
MfA,
What's your point?

I'm fairly certain Sony/IBM/Toshiba would not pour billions into something impossible. Deadmeat does not seem convinced, but then again... If one's curious about his intellectual level, just look at his handle. :devilish: :LOL:


*G*
 
The engineers will do the best they can. They aren't working to an exact plan... they make their own.

The patent was not handed down to them on stone tablets from Mount Sinai.
 