This patent defines the PE, but it also goes much further than that... it talks about how you make different machines based on this technology inter-operate, and how a standard ISA and other tricks help you do so...
[0068] FIG. 4 illustrates the structure of an APU. APU 402 includes local memory 406, registers 410, four floating point units 412 and four integer units 414. Again, however, depending upon the processing power required, a greater or lesser number of floating points units 512 and integer units 414 can be employed. In a preferred embodiment, local memory 406 contains 128 kilobytes of storage, and the capacity of registers 410 is 128 x 128 bits. Floating point units 412 preferably operate at a speed of 32 billion floating point operations per second (32 GFLOPS), and integer units 414 preferably operate at a speed of 32 billion operations per second (32 GOPS).
This is the nicest part of the whole thing so far IMHO... it pretty much establishes that the number of functional units in each APU (integer units, for example), the clock speed of the beast, the number of APUs, the number of PEs... none of it matters...
what does matter is that the ISA is constant, the same in all APUs, and that the PEs interact with the APUs using standardized packets that can be processed by ANY APU on any CELL-based device
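To make that concrete, here is a toy sketch of the idea. Everything in it is made up for illustration: the `Packet` layout, the two-opcode "ISA" and the `APU` class are my own assumptions, not anything the patent specifies. The only point is that the same packet gives the same answer on any device that speaks the same ISA, whatever its unit count or clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical standardized packet: a program in the shared ISA plus its data."""
    program: tuple  # sequence of (opcode, operand) pairs
    data: int


@dataclass
class APU:
    """Hypothetical APU model. Unit count and clock differ per device,
    but neither one affects the result of running a packet."""
    int_units: int
    clock_mhz: int

    def run(self, packet):
        # Interpret the packet with the (made-up) common instruction set.
        acc = packet.data
        for opcode, operand in packet.program:
            if opcode == "ADD":
                acc += operand
            elif opcode == "MUL":
                acc *= operand
            else:
                raise ValueError(f"unknown opcode {opcode!r}")
        return acc


packet = Packet(program=(("ADD", 3), ("MUL", 2)), data=5)
pda_apu = APU(int_units=1, clock_mhz=200)      # small, slow device
console_apu = APU(int_units=4, clock_mhz=4000)  # big, fast device

print(pda_apu.run(packet))      # 16
print(console_apu.run(packet))  # 16
```

The hardware parameters only change how fast the answer arrives, which this sketch does not model at all.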
btw, here is the most recent patent (filed the same day as the other one, but with minor differences it seems)...
http://makeashorterlink.com/?K5C122D03
as for what may be a picture of the EE3, look at this...
http://makeashorterlink.com/?W2D162D03
Another interesting claim is the following:
[0020] In another aspect, the present invention provides an absolute timer for the processing of tasks. This absolute timer is independent of the frequency of the clocks employed by the APUs for the processing of applications and data. Applications are written based upon the time period for tasks defined by the absolute timer. If the frequency of the clocks employed by the APUs increases because of, e.g., enhancements to the APUs, the time period for a given task as defined by the absolute timer remains the same. This scheme enables the implementation of enhanced processing times by newer versions of the APUs without disabling these newer APUs from processing older applications written for the slower processing times of older APUs.
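A rough sketch of how I read [0020]: the task gets a fixed time budget in absolute time, and a faster APU just finishes its work earlier inside that budget. The function name, the numbers and the assumption that the APU simply idles for the rest of the period are mine, not the patent's.

```python
def busy_and_idle_us(work_cycles, clock_mhz, budget_us):
    """Split an absolute time budget into busy and idle time for one task.

    clock_mhz is numerically the number of cycles per microsecond.
    Assumption: once the work is done, the APU idles until the budget expires.
    """
    busy_us = work_cycles / clock_mhz
    if busy_us > budget_us:
        raise ValueError("APU is too slow for this time budget")
    return busy_us, budget_us - busy_us


# Same task (1000 cycles of work), same 2 microsecond budget, two APU generations.
old_busy, old_idle = busy_and_idle_us(1000, clock_mhz=500, budget_us=2.0)
new_busy, new_idle = busy_and_idle_us(1000, clock_mhz=1000, budget_us=2.0)

print(old_busy, old_idle)  # 2.0 0.0
print(new_busy, new_idle)  # 1.0 1.0
```

Either way the task's period ends at the same absolute time, so a program written against the budget never notices the faster clock.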
[0021] The present invention also provides an alternative scheme to permit newer APUs having faster processing speeds to process older applications written for the slower processing speeds of older APUs. In this alternative scheme, the particular instructions or microcode employed by the APUs in processing these older applications are analyzed during processing for problems in the coordination of the APUs' parallel processing created by the enhanced speeds. "No operation" ("NOOP") instructions are inserted into the instructions executed by some of these APUs to maintain the sequential completion of processing by the APUs expected by the program. By inserting these NOOPs into these instructions, the correct timing for the APUs' execution of all instructions are maintained.
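The NOOP trick in [0021] can be pictured with a very crude model. Assume (my assumption, for illustration only) that the new APU is an integer number of times faster than the old one and that every instruction takes one cycle. Then padding each real instruction with `speedup - 1` NOOPs keeps every instruction in the same wall-clock slot it had on the old hardware. The patent describes analyzing the instructions during processing and padding only some APUs; this sketch pads everything.

```python
def pad_with_noops(instructions, speedup):
    """Pad an instruction stream so it takes as long on a faster APU.

    Toy model: one cycle per instruction, and the new APU runs exactly
    `speedup` times faster than the APU the program was written for.
    """
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    padded = []
    for instruction in instructions:
        padded.append(instruction)
        padded.extend(["NOOP"] * (speedup - 1))
    return padded


old_program = ["LOAD", "ADD", "STORE"]
new_program = pad_with_noops(old_program, speedup=2)
print(new_program)  # ['LOAD', 'NOOP', 'ADD', 'NOOP', 'STORE', 'NOOP']
```

Six cycles at twice the clock take as long as three cycles at the old clock, so the other APUs see results arrive when the program expects them.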
Again, this highlights the fact that speed and execution units can vary, but the applications should still run (most of them, naturally, since not all CELL devices should have the Visualizer... I wonder if the ISA of the Visualizer is the same in all the CELL-based devices)... a program written following the generic CELL specs closely should run fine on a CELL PDA (how to fit it? fewer execution units, fewer APUs, fewer PEs, a slower clock frequency than, say, PS3's Broadband Engine), on a CELL-equipped TV or on a CELL-based microwave oven...
Here is a picture that should help to understand this idea...
http://makeashorterlink.com/?M2E131D03
Btw, the description of the local memory of the APU makes it seem like it is not a regular cache but more like a scratch-pad SRAM, like the SPRAM in the EE's RISC core or the micro-memories in the VUs...
it could be a simple L1 cache after all, since they talk about coherency protocols, but it might be used like the e-DRAM is, as a local buffer and not a cache...
Advantage? You can read from it as well as directly WRITE into it... local RAM gives you more flexibility than a cache... you can do caching in software with a local RAM pool (done on the VUs, on the GS, etc...), but you can also use it to do some work locally while the e-DRAM is being used by another APU...
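For anyone who hasn't done "caching in software" on a scratch-pad, here is a minimal sketch of what I mean. It is a toy direct-mapped cache built on a local RAM pool, with a plain Python list standing in for the e-DRAM. The class name, line size and replacement policy are all my own choices for illustration, not how any real CELL, VU or GS code does it.

```python
class ScratchpadCache:
    """Toy direct-mapped software cache living in a local RAM pool."""

    def __init__(self, main_memory, num_lines, line_size):
        self.main = main_memory      # stand-in for the e-DRAM
        self.num_lines = num_lines
        self.line_size = line_size
        # The local RAM pool: the program owns it and could also use it directly.
        self.local = [[0] * line_size for _ in range(num_lines)]
        self.tags = [None] * num_lines  # which main-memory line sits in each slot
        self.misses = 0

    def read(self, addr):
        line_no, offset = divmod(addr, self.line_size)
        slot = line_no % self.num_lines
        if self.tags[slot] != line_no:
            # Miss: software copies the whole line from main memory into local RAM.
            base = line_no * self.line_size
            self.local[slot] = self.main[base:base + self.line_size]
            self.tags[slot] = line_no
            self.misses += 1
        return self.local[slot][offset]


edram = list(range(64))
cache = ScratchpadCache(edram, num_lines=4, line_size=8)
print(cache.read(10), cache.read(11), cache.misses)  # 10 11 1
```

The nice part is that the policy is just code: the same pool can be a cache for one task and a plain working buffer for the next, which a hardware-managed cache does not let you choose.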