The patent seems to suggest that the PE will be clocked at around 1 GHz. So for each PE that has a full 8 APUs, we'll have a peak of 32 billion 32-bit operations per second. That is definitely fast. Compare to a 2 GHz P4 which, using SSE, is capable of just 8 billion 32-bit operations per second.
I disagree with this assessment...
OK, we know that the APUs can each perform SIMD operations, and if we keep the PS2 VUs' model each APU can do ( pipelined ) 4 MADDs/cycle ( FMAC, fused multiply-add ), and that is 8 FP ops/cycle per APU...
From the patent:
[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed. In a preferred embodiment, local memory 406 contains 128 kilobytes of storage, and the capacity of registers 410 is 128 x 128 bits. Floating point units 412 preferably operate at a speed of 32 billion floating point operations per second (32 GFLOPS), and integer units 414 preferably operate at a speed of 32 billion operations per second (32 GOPS).
Each APU is indeed rated at 32 GFLOPS...
And since we know each APU can do a max of 8 FP ops/cycle...
8 FP ops/cycle * 4 GHz = 32 GFLOPS
And this is for each APU: the suggested clock speed is indeed 4 GHz
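For anyone who wants to redo the arithmetic, here is a minimal sketch of it. The 32 GFLOPS rating comes from paragraph [0068]; the four-FMACs-per-APU layout is only my PS2-VU-style assumption, and the function names are made up for illustration:

```python
# Back-of-the-envelope check of the per-APU figures.
# ASSUMPTION: each APU has 4 pipelined FMACs, as in a PS2 VU.
FMACS_PER_APU = 4
FLOPS_PER_MADD = 2          # one multiply plus one add
RATED_GFLOPS_PER_APU = 32   # from patent paragraph [0068]


def flops_per_cycle(fmacs=FMACS_PER_APU):
    """Peak FP operations one APU can retire per clock cycle."""
    return fmacs * FLOPS_PER_MADD


def implied_clock_ghz(rated_gflops=RATED_GFLOPS_PER_APU, fmacs=FMACS_PER_APU):
    """Clock (GHz) needed to reach the rated GFLOPS with the given FMAC count."""
    return rated_gflops / flops_per_cycle(fmacs)


def peak_gflops(clock_ghz, fmacs=FMACS_PER_APU):
    """Peak GFLOPS of one APU at the given clock."""
    return clock_ghz * flops_per_cycle(fmacs)


if __name__ == "__main__":
    print("FP ops/cycle per APU:", flops_per_cycle())
    print("Implied clock (GHz):", implied_clock_ghz())
    print("Per-APU GFLOPS at 1 GHz:", peak_gflops(1.0))
```

Under that assumption a 1 GHz clock would only give 8 GFLOPS per APU, which is why the 32 GFLOPS rating points to 4 GHz instead.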
Going back to the quote I just posted, I have to disagree that these are "simple" VUs like the PS2's... first of all, we haven't been shown a structure of 4 FMACs plus one or two FDIVs... the only thing we know is that we have four FP Units: for all we know each could pack an FDIV, for all we know each could be an EFU-like unit...
Another thing: if the four FP Units were indeed 4 FMACs tied together and able to work as a SIMD unit only ( no independent operation allowed and only support for 4-way parallel SIMD operations ), how would we explain THIS ( here is the quote again, as I said a few lines above ):
[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed.
Look at this portion of the text...
"[...]a greater or lesser number of floating points units 512 and integer units 414 can be employed [...]"
And we also know that the "ISA is constant across all APUs"... even if we change the number of FP Units, no changes to the ISA or changes to the code should be planned...
How could this work in a standard SIMD VU architecture ?
To me, the workarounds in Instruction decoding and Control Unit operation, to make sure a 4-way MADD SIMD instruction is performed with 2 FMACs or even 1 FMAC as if we had 4 FMACs, would have a certain degree of complexity involved...
What we would need, to have a quasi-optimal solution, would be for the FP Units to be able to work in two modes: independent mode and SIMD mode ( all together )...
Impossible ?
Uhm... but I thought I saw that before... somewhere, it must have been a super-computer with an insane budget... BEEEP!!! WRONG!!!
We saw it in the EE: as you can quickly check, the Integer Units of the RISC core in the EE were two separate 64-bit IUs, but they could work as a single 128-bit VU, and this is quite close to what I think is going on with the APU's Integer Units and FP Units... it is indeed "proven" and already "pioneered" technology, present in a consumer chip for quite some time ( the EE )...
One way that comes to my mind for "another" approach, one where the ISA basically fixes each APU as being made of two standard SIMD VUs and where we can still vary the number of FP and Integer Units without sacrificing program compatibility, is this:
in each chip that uses more or fewer Execution Units than the standard 4 tied FMACs, the instruction gets micro-coded ( think of having to perform a 4-way SIMD MADD with a single FMAC... you would loop it ~4 times through the FMAC, each time working on a different field of the 128-bit vectors )...
Or we could have 4 APUs with one FMAC each do the operation while working in parallel, but that would be quite a waste...
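To make the micro-coded idea concrete, here is a rough software sketch of it. This is only my illustration of the looping scheme, not anything the patent describes: the function name, the `num_fmacs` parameter and the pass counter are all hypothetical.

```python
def simd_madd(a, b, c, num_fmacs=4):
    """Emulate one 4-way SIMD MADD (a*b + c per field) on a chip that has
    only `num_fmacs` physical FMACs, by looping over the vector fields.

    Returns (result, passes), where `passes` is how many trips through the
    FMACs the single instruction needed. Hypothetical model: a real FMAC
    fuses the multiply and add with one rounding, plain Python does not.
    """
    if num_fmacs < 1:
        raise ValueError("need at least one FMAC")
    width = len(a)
    if len(b) != width or len(c) != width:
        raise ValueError("operands must have the same number of fields")

    result = [0.0] * width
    passes = 0
    # Each pass handles up to num_fmacs fields of the 128-bit vector.
    for start in range(0, width, num_fmacs):
        for field in range(start, min(start + num_fmacs, width)):
            result[field] = a[field] * b[field] + c[field]
        passes += 1
    return result, passes


if __name__ == "__main__":
    a, b, c = [1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [1.0, 1.0, 1.0, 1.0]
    for n in (4, 2, 1):
        print(n, "FMAC(s):", simd_madd(a, b, c, num_fmacs=n))
```

The program sees the same instruction and the same result on every chip; only the number of passes ( and so the throughput ) changes with the FMAC count.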
After all the patent says...
The APUs preferably are single instruction, multiple data (SIMD) processors.
And that, compared with this:
[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed.
and this:
These processors also preferably all have the same ISA and perform processing in accordance with the same instruction set.
tells me something is a bit unclear in this patent...
I still have some other comments, but I wanted to get these off my chest first...