Vince,
At the time it was performing 64 operations per cycle at 550MHz.
No silicon details were provided, but it can be interpreted a few ways ( 32-bit / 64-bit ):
4 ops per cycle per unit = 16 units / 2 ops per cycle per unit = 32 units
An APU is composed of 4 or 8 total units = 2 IU + 2 FPU / 4 IU + 4 FPU
So a guess of 2-8 APUs
64 ops per cycle = 32 FP MADD or 32 Integer MADD per cycle.
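A quick sketch of the arithmetic behind that 2-8 APU guess, using only the numbers quoted above ( 64 ops per cycle, 2 or 4 ops per cycle per unit, 4 or 8 units per APU, 550 MHz ). The function and variable names are made up for illustration, nothing here comes from a real spec.

```python
# Back-of-envelope check of the guess above; all names are illustrative only.
OPS_PER_CYCLE = 64   # figure quoted in the post
CLOCK_MHZ = 550      # figure quoted in the post


def apu_count_guesses(ops_per_cycle=OPS_PER_CYCLE):
    """Return {(ops_per_unit, units_per_apu): apu_count} for each reading."""
    guesses = {}
    for ops_per_unit in (4, 2):                  # 4 or 2 ops per cycle per unit
        units = ops_per_cycle // ops_per_unit    # 16 or 32 units
        for units_per_apu in (4, 8):             # 2 IU + 2 FPU, or 4 IU + 4 FPU
            guesses[(ops_per_unit, units_per_apu)] = units // units_per_apu
    return guesses


guesses = apu_count_guesses()
for (ops_per_unit, units_per_apu), apus in sorted(guesses.items()):
    print(f"{ops_per_unit} ops/unit, {units_per_apu} units/APU -> {apus} APUs")

# Peak rate at the quoted clock, in billions of ops per second.
peak_gops = OPS_PER_CYCLE * CLOCK_MHZ / 1000
print(f"peak: {peak_gops} Gops/s")
```

The four readings land between 2 and 8 APUs, which is where the guess comes from.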
This means 8 APUs since, if you look at the APU patent posted several times here, you cannot do Integer ops at the same time as FP ops on the same APU ( the APUs described in that specific IBM patent can only do one of the following per cycle: 1 FP Vector op, 1 Integer Vector op, 1 Scalar FP op or 1 Scalar Integer op, which would make sense [simpler Register file, simpler APU logic, reduced power consumption] ).
This configuration at a 550 MHz clock speed in 2H 2003 ?
Pretty impressive, thanks for the leak ( if real [no offense David, really] ).
You should keep those numbers private, your sources will not like having too much data exposed ( well, in this case, it would be particularly good news, so they might have encouraged you to post it [if this is real] ).
Edit: the performance numbers seem too good to be real... I am really starting to doubt the leak. Also, I do not believe each APU is doing 4 FP MADD or Integer MADD operations per cycle.
BTW, having 2 of "your" kind of APUs is basically the same as having 8 of the "IBM patent" kind of APUs [which is what I think is reflected in Suzuoki's CELL patent] if you really think about it ( performance wise ).
(4 * 2 ops/cycle ) * 8 = 64 ops/cycle
( (4* 2 ops/cycle) * 4 ) * 2 = 64 ops/cycle
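The same comparison as a few lines of Python, purely to check the arithmetic. It assumes, as the two lines above do, that each group of sub-units handles 4 MADDs per cycle and that a MADD counts as 2 ops; the names are mine, not from any patent.

```python
# Illustrative only: compares the two APU readings from the lines above.
OPS_PER_MADD = 2   # a multiply-add counted as 2 ops
LANES = 4          # the "4 *" in the lines above


def apu_ops_per_cycle(groups):
    """Ops per cycle for an APU built from `groups` groups of sub-units."""
    return LANES * OPS_PER_MADD * groups


weak = apu_ops_per_cycle(groups=1)     # "IBM patent" kind of APU
strong = apu_ops_per_cycle(groups=4)   # "your" kind of APU

print(f"8 weak APUs:   {8 * weak} ops/cycle")
print(f"2 strong APUs: {2 * strong} ops/cycle")
```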
Edit 2: Still, your kind of APU might be supported by IBM's APU patent if you do not assume it would be doing FP operations alongside Integer operations ( either 1x4 FP Vector or 1x4 Integer Vector or 1x4 FP Scalar or 1x4 Integer Scalar [I say 1x4 because you have 4 groups of sub-units in the APUs] ).
Your APU would trade in logic ( quite a bit of it ) for performance and would not need 4 GHz to reach 1 TFLOPS ( which I believe will be what the CELL based Broadband Engine and the CELL based Visualizer would achieve if you combine their performance together ).
This would make the chip a bit bigger than I was expecting, and the effort to shrink it using 45 nm SOI ( with capacitor-less e-DRAM cells ) would be even more pressing for SCE ( they could still do it ).
With your kind of strong APUs ( 32 ops/cycle per APU ) you could do this:
1 BE with 2 PEs and 8 APUs per PE running at 1.5 GHz
+
1 VS with 2 PEs with 4 APUs per PE running at 1.25 GHz
= ~1 TFLOPS combined
With the weak APUs ( 8 ops/cycle per APU ) we would need to have 4 PEs in each BE and VS:
8 APUs/PE * 8 ops/cycle per APU * 4 PEs * 3 GHz = 768 GFLOPS
+
4 APUs/PE * 8 ops/cycle per APU * 4 PEs * 1.5 GHz = 192 GFLOPS
= 960 GFLOPS
In both cases we are pretty close to the 1 TFLOPS target.
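For anyone who wants to plug in their own clocks and PE counts, here is a small sketch of the sums above. It treats every op as a floating-point op ( the same simplification made above ), and the helper name peak_gflops is just something I made up.

```python
# Illustrative peak-rate calculator for the BE + VS configurations above.
def peak_gflops(pes, apus_per_pe, ops_per_cycle, clock_ghz):
    """Peak rate in GFLOPS, counting every op as a floating-point op."""
    return pes * apus_per_pe * ops_per_cycle * clock_ghz


# Strong APUs: 32 ops/cycle per APU, 2 PEs per chip.
strong = {
    "BE": peak_gflops(pes=2, apus_per_pe=8, ops_per_cycle=32, clock_ghz=1.5),
    "VS": peak_gflops(pes=2, apus_per_pe=4, ops_per_cycle=32, clock_ghz=1.25),
}

# Weak APUs: 8 ops/cycle per APU, 4 PEs per chip.
weak = {
    "BE": peak_gflops(pes=4, apus_per_pe=8, ops_per_cycle=8, clock_ghz=3.0),
    "VS": peak_gflops(pes=4, apus_per_pe=4, ops_per_cycle=8, clock_ghz=1.5),
}

strong_total = sum(strong.values())
weak_total = sum(weak.values())
print(f"strong APUs: {strong} -> {strong_total} GFLOPS")
print(f"weak APUs:   {weak} -> {weak_total} GFLOPS")
```

With the clocks given above, the strong-APU setup comes out a little over 1 TFLOPS and the weak-APU setup a little under.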
Now, pushing the 65 nm process quite a bit more we could imagine 1 TFLOPS for the Broadband Engine alone.
This could come in two configurations:
a.) 4 PEs ( 32 APUs at 8 ops/cycle each ) at 4 GHz
b.) 4 PEs ( 32 APUs at 32 ops/cycle each ) at 1 GHz
Now, we have to think and decide which one of the two choices is the design that is closer to what SCE's plants could manufacture.
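To see how hard each option pushes the clock, you can turn the formula around and solve for the frequency. This is only a sketch: it takes 1 TFLOPS as 1000 GFLOPS and the function name required_clock_ghz is hypothetical.

```python
# Illustrative only: clock needed for the two 4-PE options above.
TARGET_GFLOPS = 1000   # 1 TFLOPS, taken here as 1000 GFLOPS


def required_clock_ghz(apus, ops_per_cycle, target_gflops=TARGET_GFLOPS):
    """Clock in GHz needed for `apus` APUs to reach the target peak rate."""
    return target_gflops / (apus * ops_per_cycle)


clock_a = required_clock_ghz(apus=32, ops_per_cycle=8)    # option a.)
clock_b = required_clock_ghz(apus=32, ops_per_cycle=32)   # option b.)

print(f"a.) weak APUs need   ~{clock_a:.2f} GHz")
print(f"b.) strong APUs need ~{clock_b:.2f} GHz")
```

So option a.) trades logic for a clock roughly four times higher than option b.), which is the whole manufacturing question.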