DaveBaumann said:
It is likely that the link between the two chips is going to be quite fast IMHO.
Well, I would hope so. But I raise the point – this can equally be so on any other closed system, can it not?
Also, we have on each chip a nice amount of e-DRAM or fast external DRAM the chip is connected to and we can use that to buffer data sent between the two chips.
Again, this applies to anything that’s designed in that fashion.
The point is that there is still the possibility of load balancing as an APU can be configured to run almost any kind of job: if the Broadband Engine is at almost peak utilization it might off-load some work to APUs that are free on the Visualizer and then get the result back.
Yup, load balancing can occur in a DX system as well; it already does in OTHER AREAS (re. the old Mitsubishi IMPAC-GE geometry processor). DX10 makes this quite feasible too: if a unified shader model is employed, then under low fragment processor usage the graphics processor can be executing both geometry and fragment work, whereas under heavy fragment processing some of the geometry work could be shifted to the CPU, leaving more of the graphics processor's shader ALU cycles for fragment processing (the texture lookup in the Vertex Shader may be a fly in the ointment to a certain extent, but that can probably be circumvented).
Interesting development for DirectX Next I have to admit.
The link between two chips can be fast in any system; this is of course not a CELL-exclusive feature.
The idea with CELL is that, technology willing, you could keep adding APUs and then PEs and balance the work around pretty nicely.
Apulets ( which, as explained in Suzuoki's patent, contain both program and data ) also have Routing Information ( Destination ID, Reply ID and Source ID, and yes, there is provision for those to contain IP addresses ) and a global ID that allows you to identify that packet anywhere.
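To make the packet layout a bit more concrete, here is a rough sketch in Python. The patent only tells us that an Apulet carries program, data, three routing IDs and a global ID; the field names, the string/bytes types and the use of a random UUID as the global ID are all my own assumptions for illustration, not the actual format.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Apulet:
    """Hypothetical Apulet layout; names and types are illustrative only."""
    destination_id: str  # where the packet should be processed (could be an IP address)
    reply_id: str        # where the result should be sent
    source_id: str       # where the packet came from
    program: bytes       # code to run on an APU
    data: bytes          # data the program works on
    # Assumption: a random UUID stands in for the network-wide unique ID.
    global_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def make_apulet(dest: str, reply: str, src: str, program: bytes, data: bytes) -> Apulet:
    """Build an Apulet with a freshly generated global ID."""
    return Apulet(dest, reply, src, program, data)


first = make_apulet("10.0.0.2", "10.0.0.1", "10.0.0.1", b"\x01\x02", b"vertices")
second = make_apulet("10.0.0.3", "10.0.0.1", "10.0.0.1", b"\x01\x02", b"pixels")
```

Since program and data travel together with the routing fields, whoever receives the packet has everything needed to run it and to send the result back.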
Apulets can migrate across PEs, across chips, or across computers on a LAN or a bigger network.
As soon as two CELL systems ( might be four PEs in the same chip, two chips in the same system or two devices in the same room, etc... ) are aware of each other they can trade information and share work.
Depending on the latency requirements of your application, you might or might not want to send your Apulets outside of your device to be processed by another CELL system on the network.
Same thing happens between PE and PE.
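The "send it out or keep it local" decision could be sketched like this in Python. This is just my illustration of the idea: the `Target` type, the `pick_target` function and the latency numbers are invented, and a real system would have to measure or estimate these figures somehow.

```python
from typing import List, NamedTuple, Optional


class Target(NamedTuple):
    """A place an Apulet could run (hypothetical description)."""
    name: str
    round_trip_ms: float  # assumed estimate of send + process + reply overhead
    free_apus: int        # APUs currently idle on that target


def pick_target(latency_budget_ms: float, targets: List[Target]) -> Optional[Target]:
    """Return the closest target with a free APU that fits the latency budget."""
    candidates = [
        t for t in targets
        if t.free_apus > 0 and t.round_trip_ms <= latency_budget_ms
    ]
    # No candidate means the work has to wait for a local APU instead.
    return min(candidates, key=lambda t: t.round_trip_ms, default=None)


# Made-up numbers: the local PE is busy, the other chip and a LAN device are not.
targets = [
    Target("local PE", 0.001, 0),
    Target("other chip", 0.01, 2),
    Target("LAN device", 5.0, 8),
]
choice = pick_target(1.0, targets)
```

With a tight budget only nearby hardware qualifies; with a loose one, a device across the network becomes fair game.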
Apulets ( let's limit ourselves to the ones that need only 1 APU at a time ) can be executed/processed by any APU on any CELL system; whether you as a programmer take advantage of that or not is not CELL's fault.
This sums up the concept in the best way possible ( from the CELL patent by Masakazu Suzuoki ):
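Here is a toy Python sketch of what "any APU can take the job" buys you: a greedy balancer that hands each single-APU job to whichever APU currently has the least work, regardless of which chip it sits on. The APU names, the cost numbers and the greedy policy are my assumptions; nothing here is taken from the patent.

```python
from typing import Dict, List, Tuple


def dispatch(
    jobs: List[Tuple[str, int]], apus: List[str]
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Assign each (job_id, cost) pair to the least-loaded APU so far.

    Returns the job -> APU assignment and the final load per APU.
    Ties go to the APU listed first.
    """
    load = {apu: 0 for apu in apus}
    assignment = {}
    for job_id, cost in jobs:
        apu = min(load, key=load.get)  # least-loaded APU, on any chip
        load[apu] += cost
        assignment[job_id] = apu
    return assignment, load


# Hypothetical setup: one APU on the Broadband Engine, one on the Visualizer.
jobs = [("a", 4), ("b", 2), ("c", 2), ("d", 1)]
assignment, load = dispatch(jobs, ["BE.apu0", "VIS.apu0"])
```

The balancer never asks what kind of chip an APU belongs to, which is the whole point: work simply flows to wherever there is spare capacity.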
A computer architecture and programming model for high speed processing over broadband networks are provided.
The architecture employs a consistent modular structure, a common computing module and uniform software cells.
The common computing module includes a control processor, a plurality of processing units, a plurality of local memories from which the processing units process programs, a direct memory access controller and a shared main memory.
A synchronized system and method for the coordinated reading and writing of data to and from the shared main memory by the processing units also are provided.
A hardware sandbox structure is provided for security against the corruption of data among the programs being processed by the processing units.
The uniform software cells contain both data and applications and are structured for processing by any of the processors of the network. Each software cell is uniquely identified on the network.
A system and method for creating a dedicated pipeline for processing streaming data also are provided.
CELL was designed to be highly modular and scalable and the Apulet was created to fit this idea.
CELL achieves power through concurrency, by adding more and more processing units in parallel while avoiding the problems a single-threaded processor would have with fetching, decoding and issuing instructions for so many internal execution units.
In a lot of ways CELL is not new per se; the fundamental ideas present in it have been talked about for a while.
A lot of what is in CELL is a series of technological and architectural tricks that make it all work together.
Transistor-wise and testing-wise this is not an inefficient method.