Is this IBM Patent the APU for CELL?

j^aws

Veteran
APU-block-h.jpg


Abstract

An improved processor implementation is described in which scalar and vector processing components are merged to reduce complexity. In particular, the implementation includes a scalar-vector register file for storing scalar and vector data, as well as a parallel vector unit comprising functional units that can process vector or scalar instructions as required. A further aspect of the invention provides the ability to disable unused functional units in the parallel vector unit, such as during a scalar operation, to achieve significant power savings.

and

SUMMARY OF THE INVENTION
[0015] It is, therefore, an object of the invention to provide a microprocessor implementation which (1) reduces the overall processor core size such that vector (multimedia) processing and scalar data processing are integrated and share common execution resources, (2) achieves this by merging the capabilities of scalar and SIMD data processing, (3) does not compromise SIMD data processing performance and (4) does not unnecessarily increase power consumption and heat dissipation.

Source: Processor implementation having unified scalar and SIMD datapath

I'm guessing this patent describes the APUs for Cell, as the inventors below have been involved with previous Cell patents.

Inventors: Gschwind, Michael Karl; (Mohegan Lake, NY) ; Hofstee, Harm Peter; (Austin, TX)

With 4 FPUs and 4 FXUs per APU, the design achieves an overall reduction in processor core size, power consumption and heat dissipation with no loss of performance! 8) How many of these bad boys will we eventually get in the Broadband Engine?
 
it seems so. i remember seeing this diagram last year, i believe. if not then very early this year.


edit: Panajev will know for certain 8)
 
I posted this a loooooooot of times and a long time ago for the first time lol.

The way the APU in that patent works is this:

either

1 Vector FP instruction/cycle: peak of 8 FP ops/cycle.

or

1 Vector FX instruction/cycle: peak of 8 FX ops/cycle.

or

1 Scalar FP instruction/cycle: peak of 2 FP ops/cycle.

or

1 Scalar FX instruction/cycle: peak of 2 FX ops/cycle.


As you can see, while processing scalar instructions all the unused units can be shut off for a nice power-saving effect, or they can simply process NOPs.

When I talk about peaks, I mean using MADD-style instructions like:

R1 = (R2 * R3) + R4

which execute pipelined with a throughput of one instruction per cycle (once the pipeline is full, one instruction completes every cycle).
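The peak figures above can be sanity-checked with a toy sketch (Python used purely for illustration; the register names follow the thread's notation, and this is just arithmetic, not a model of the actual hardware):

```python
# Why a fused multiply-add (MADD) counts as two FP operations, and how
# that yields the 2 ops/cycle (scalar) and 8 ops/cycle (vector) peaks.

def madd(r2, r3, r4):
    """R1 = (R2 * R3) + R4 -- one instruction, two operations."""
    return (r2 * r3) + r4

OPS_PER_MADD = 2                    # one multiply + one add

# Scalar case: one MADD instruction per cycle on a single unit.
scalar_peak = 1 * OPS_PER_MADD      # 2 ops/cycle

# Vector case: the same instruction applied across 4 lanes (4 units).
vector_peak = 4 * OPS_PER_MADD      # 8 ops/cycle

print(madd(2.0, 3.0, 4.0))         # 10.0
print(scalar_peak, vector_peak)    # 2 8
```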
 
I always thought concurrent FX and FP execution would be too good to be true. Hell, it's still a monster badass mofo of a processor if realized in the 'preferred implementation'. :)

Now I just have to remind myself NOT to hold my breath until next E3 so I can see what it is they'll actually deliver! All these patents might not even be worth the paper they're printed on... :devilish:
 
Ahh...Already posted then! :LOL:

Well...any patent worth a damn has already been posted...twice! :D

This non-concurrent use of the FPU and FXU is news to me. So strictly speaking, when the original Suzuoki Cell patent mentions the performance of one APU in terms of integer and floating point,

Floating point units 412 preferably operate at a speed of 32 billion floating point operations per second (32 GFLOPS), and integer units 414 preferably operate at a speed of 32 billion operations per second (32 GOPS).

it really means an APU does 32 GFLOPS *and* 0 GOPS, or
0 GFLOPS *and* 32 GOPS, or
anything in between?

So the BE does 1 TFLOP *and* 0 TOPS, or 0 TFLOPS *and* 1 TOPS, and anything in between, but not 1 TFLOP *and* 1 TOPS?

How would one market the spec of the BE, as there's plenty of scope to mislead! ;) The best compromise would be 1/2 TFLOP *and* 1/2 TOPS peak, no?

Silly question, but why can it do 8 vector ops/cycle but only 2 scalar ops/cycle?
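For what it's worth, the patent's per-APU figure lines up with a simple back-of-the-envelope calculation, assuming the 4 GHz clock often mentioned alongside these patents (an assumption on my part, not something this patent states):

```python
# Back-of-the-envelope check of the quoted 32 GFLOPS / 1 TFLOP figures.
# The 4 GHz clock and the 32-APU BE configuration are assumptions taken
# from the surrounding discussion, not from this patent itself.

CLOCK_HZ = 4e9        # assumed 4 GHz APU clock
OPS_PER_CYCLE = 8     # 4 lanes x 2 ops (MADD) per cycle
APUS = 32             # BE configuration discussed in the thread

apu_gflops = CLOCK_HZ * OPS_PER_CYCLE / 1e9
be_tflops = apu_gflops * APUS / 1000

print(apu_gflops)     # 32.0 -> the patent's 32 GFLOPS per APU
print(be_tflops)      # 1.024 -> roughly the "1 TFLOP" BE figure
```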
 
Jaws said:
it really means an APU does 32 GFLOPS *and* 0 GOPS, or
0 GFLOPS *and* 32 GOPS, or
anything in between?

Probably, if the FX and FP units share hardware. Besides, the APU's SPRAM can't deliver enough instructions/data per cycle to feed both of them running simultaneously anyway.

Overall, if there are 32 APUs, I am sure nobody's going to complain about the performance of the thing anyway. ;)
 
I knew this from last year, but to reiterate: PS3 won't be able to use its full floating point performance and its full integer performance at the same time, because the floating point units and integer units within the APUs share hardware/transistors/silicon, among other reasons, correct?
 
1 MADD instruction:

Vec_R1 = (Vec_R2 * Vec_R3) + Vec_R4;

two operations done in a single cycle (pipelined).

This is a scalar instruction, using a single FP or FX unit (the upper 96 bits of the source operands would be ignored if we are dealing with 32-bit FP or FX math).


1 Vector MADD instruction:

Vec_R1.xyzw = (Vec_R2.xyzw * Vec_R3.xyzw) + Vec_R4.xyzw;

This translates roughly into:

Vec_R1.x = (Vec_R2.x * Vec_R3.x) + Vec_R4.x;

Vec_R1.y = (Vec_R2.y * Vec_R3.y) + Vec_R4.y;

Vec_R1.z = (Vec_R2.z * Vec_R3.z) + Vec_R4.z;

Vec_R1.w = (Vec_R2.w * Vec_R3.w) + Vec_R4.w;

Each of these four lane operations is executed by its own FP or FX unit, and each does two operations per cycle (pipelined), producing a total of 8 operations/cycle (pipelined).

Vec_Rx is the name I am using for this example; these vector registers would be 128-bit registers that we can logically subdivide into 32-bit components for this kind of 4-way vector instruction.
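Panajev's lane-by-lane expansion above can be sketched in a few lines (Python used purely for illustration; the Vec_Rx names follow his notation, and tuples stand in for the 128-bit registers):

```python
# Sketch of the 4-way vector MADD described above: a 128-bit register
# treated as four 32-bit lanes (x, y, z, w), with the same MADD applied
# independently to each lane.

def vector_madd(vec_r2, vec_r3, vec_r4):
    """Vec_R1.xyzw = (Vec_R2.xyzw * Vec_R3.xyzw) + Vec_R4.xyzw"""
    return tuple((a * b) + c for a, b, c in zip(vec_r2, vec_r3, vec_r4))

Vec_R2 = (1.0, 2.0, 3.0, 4.0)
Vec_R3 = (5.0, 6.0, 7.0, 8.0)
Vec_R4 = (0.5, 0.5, 0.5, 0.5)

Vec_R1 = vector_madd(Vec_R2, Vec_R3, Vec_R4)
print(Vec_R1)   # (5.5, 12.5, 21.5, 32.5)
```

Four lanes, each doing a multiply and an add, is where the 8 ops/cycle peak comes from, versus 2 ops/cycle when only one lane (the scalar case) is active.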
 
Megadrive1988 said:
I knew this from last year, but to reiterate: PS3 won't be able to use its full floating point performance and its full integer performance at the same time, because the floating point units and integer units within the APUs share hardware/transistors/silicon, among other reasons, correct?

That is what patents like this seem to suggest, and it would make sense power-consumption-wise (you would use less power compared to having two sets of independent FP and FX units).
 
You'd also need twice the number of ports to the register file and scratchpad RAM, or else the units would just starve each other of data. That is costly at best, maybe not even feasible.
 
Pana, thanks for the explanation! :)

I could see this efficiency from a heat dissipation point of view...but I also see at any given time, at best, half the APU logic on the BE being redundant, with all the pipelines full? Is that really an efficient use of silicon?
 
Jaws said:
Pana, thanks for the explanation! :)

I could see this efficiency from a heat dissipation point of view...but I also see at any given time, at best, half the APU logic on the BE being redundant, with all the pipelines full? Is that really an efficient use of silicon?

No, it won't be redundant. Floating point and integer operations happen in the same chunk of logic (the vector ALU/FPU). I'd imagine integers going through the floating point execution units in de-normalized form.

It makes a lot of sense. You very rarely do floating point and integer operations at the same time. As a bonus, you only need one issue port for all data types.

It looks like there are really only two issue ports: the vector ALU/FPU and the load/store unit. Branching is handled as part of the instruction fetch process (as it is in contemporary PPC CPUs like the G3/G4/G4+). This means that a larger percentage of the die goes into execution units instead of scheduling logic.

Cheers
Gubbi
 
Gubbi said:
It makes a lot of sense. You very rarely do floating point and integer operations at the same time. As a bonus, you only need one issue port for all data types.
I don't agree on this... my VU code (mostly inner loops) is full of integer and floating-point operations executing at the same time. Actually, one of the most interesting things about the VUs is the capability to run integer and floating-point code at the same time!

ciao,
Marco
 
Gubbi said:
Jaws said:
Pana, thanks for the explanation! :)

I could see this efficiency from a heat dissipation point of view...but I also see at any given time, at best, half the APU logic on the BE being redundant, with all the pipelines full? Is that really an efficient use of silicon?

No, it won't be redundant. Floating point and integer operations happen in the same chunk of logic (the vector ALU/FPU). I'd imagine integers going through the floating point execution units in de-normalized form.

It makes a lot of sense. You very rarely do floating point and integer operations at the same time. As a bonus, you only need one issue port for all data types.

It looks like there are really only two issue ports: the vector ALU/FPU and the load/store unit. Branching is handled as part of the instruction fetch process (as it is in contemporary PPC CPUs like the G3/G4/G4+). This means that a larger percentage of the die goes into execution units instead of scheduling logic.

Cheers
Gubbi

It would make a lot of sense to share FP and integer execution in the same chunk of ALU logic, but the diagram shows distinct FPU and FXU units?
 
Jaws said:
It would make alot of sense to share the FP and Integer with the same chunk of ALU logic but the diagram shows distinct FPU and FXU units?

The diagram probably isn't a very accurate depiction of the actual hardware implementation.

Anyway, at the speed we might expect these APUs to run, it doesn't matter that one has to alternate between FX and FP; it'll be plenty fast anyway. Actually, that the chip might not be able to execute both at once might even be a good thing, in a way. FX units that just sit there while the game runs exclusively FP code, or the other way around, are just a waste of money, and to expect all games to effectively leverage both at once is probably asking too much.
 
Guden Oden said:
Jaws said:
It would make alot of sense to share the FP and Integer with the same chunk of ALU logic but the diagram shows distinct FPU and FXU units?

The diagram probably isn't a very accurate depiction of the actual hardware implementation.

Anyway, with the speed we might expect these APUs to run at, it doesn't matter that one has to alternate between FX and FP, it'll be plenty fast anyway. Actually, that the chip might not be able to execute both at once might even be a good thing, in a way. FX units that just sit there while the game runs exclusively FP code, or the other way around, is just a waste of money, and to expect all games to effectively leverage both at once is probably to ask for too much.

Maybe this design is intrinsic to how Cell APUlets work, distributing them into packets of FP and FX APUlet code so that all the APUs are fully utilised?
 
Jaws said:
..<snip>..
Maybe this design is intrinsic to how Cell APUlets work, distributing them into packets of FP and FX APUlet code so that all the APUs are fully utilised?

This is how all SIMD extensions in modern CPUs work. SSE1/SSE2, 3DNow! and AltiVec all work this way. So it's nothing new.

Cheers
Gubbi
 
Gubbi said:
Jaws said:
..<snip>..
Maybe this design is intrinsic to how Cell APUlets work, distributing them into packets of FP and FX APUlet code so that all the APUs are fully utilised?

This is how all SIMD extensions in modern CPUs work. SSE1/2, 3Dnow! and Altivec all work this way. So it's nothing new.

Cheers
Gubbi

No, what I meant was that a Cell APUlet would be either purely an FP or purely an FX packet, and when the PU distributes these, at any given time all the APUs are nicely load-balanced, irrespective of the ratio of FP to FX APUlets.
 
No reason why that should be the case; it would just lead to waste. Some sections of FX code, for example, will be short snippets compared to the FP part of an algorithm, or FX and FP might be staggered in a complicated way in some other algorithm. That would mean lots of state changes switching APUlets back and forth if each contained code for only one data type.

There's no overhead when mixing FX and FP, it just doesn't run both in the same clock cycle (probably). There's no reason to separate the two.
 
Can one APUlet be run across multiple APUs or is it exclusive to one APU?

IIRC, APUlets contain both instructions and data, so when they are compiled, I suppose Devs wouldn't need to worry about where they go and the *real* clever work is done by the compiler! :cool:

Is an APUlet running on Cell the equivalent of a Thread running on a conventional processor? For instance, rumours suggest that the Xenon CPU can run 6 simultaneous threads but the BE with 32 APUs could run 32 simultaneous APUlets, no? Can they be compared in this way as processes?
 