I wanted to add something regarding DeanoC's example about different nodes transferring very large amounts of vertices and wasting incredible amounts of bandwidth.
I think that when configuring a parallel system you always have to think about how much bandwidth the nodes have to share and how many nodes you can add before the performance boost becomes rather insignificant.
How much data you have to "constantly" pass back and forth depends on the programmers, not on the ISA or the processor implementation ( well, to a certain degree at least ).
The problem is that, for example, with each APU working on 100 MVertices/s and each vertex being 10 bytes in size, each APU could need about 1 GB/s ( roughly 8 Gbps ) of bandwidth just to stream that data around.
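Just to make that back-of-the-envelope arithmetic explicit ( the vertex rate and vertex size are only the example figures above, nothing measured ), a tiny C sketch:

    #include <stdio.h>

    int main(void)
    {
        const double vertices_per_sec = 100e6;  /* 100 MVertices/s, example figure */
        const double bytes_per_vertex = 10.0;   /* 10 bytes per vertex, example    */

        double bytes_per_sec = vertices_per_sec * bytes_per_vertex;  /* = 1e9 B/s  */
        double gbits_per_sec = bytes_per_sec * 8.0 / 1e9;            /* ~8 Gbps    */

        printf("per-APU streaming bandwidth: %.1f GB/s ( %.1f Gbps )\n",
               bytes_per_sec / 1e9, gbits_per_sec);
        return 0;
    }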
Perhaps increasing the bandwidth between the CELL nodes is not feasible in the project ( depending on the costs, you keep increasing the bandwidth of the network until the network is not the bottleneck anymore and hopefully you pick up some performance along the way ).
But is the proposed set-up ( "Transform Node, Projection Node, etc." ) an optimal one, or a worst-case scenario meant to show what high inter-node throughput demands can do to a distributed system ?
If network bandwidth is the limit but local processing power has not hit its sweet spot yet, then, where possible ( CELL is not here to solve all of humanity's computing problems... at least not until each APU is a Quantum Processor [ seriously, as an aside, the mere fact that algorithms with execution times of O( x^y ) or similar would become feasible on a Quantum Processor really excites me: the benefits of storing 0 and 1 at the same time ] ), maybe you should try to maximize the computing workload of each node and minimize the data you transfer ( re-calculate and do not transmit ).
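A very rough sketch of that "re-calculate instead of transmit" decision ( all the numbers and function names here are hypothetical, only meant to make the trade-off concrete ):

    #include <stdio.h>

    /* Hypothetical cost model: is it cheaper to send a result over the
       interconnect or to recompute it locally on the receiving node ?  */
    int should_recompute(double result_bytes,    /* size of the result          */
                         double link_bw_bytes_s, /* usable link bandwidth       */
                         double link_latency_s,  /* one-way latency             */
                         double recompute_flops, /* work needed to redo it      */
                         double node_flops_s)    /* node's sustained FLOP rate  */
    {
        double transfer_time  = link_latency_s + result_bytes / link_bw_bytes_s;
        double recompute_time = recompute_flops / node_flops_s;
        return recompute_time < transfer_time;
    }

    int main(void)
    {
        /* Example: 1 MB result, 1 GB/s link, 10 us latency,
           5 MFLOP to recompute, 25 GFLOP/s node.              */
        if (should_recompute(1e6, 1e9, 10e-6, 5e6, 25e9))
            printf("re-calculate locally\n");
        else
            printf("transmit the result\n");
        return 0;
    }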
As far as geometry processing goes: I doubt that each APU would be able to push 100 MVertices with a decently complex shader running ( maybe tiling the scene intelligently between the nodes might be a good idea ? ).
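By "tiling" I mean something like giving each node a region of the screen, so a vertex only has to travel to the node that owns the tile it projects into ( a hypothetical sketch, not how CELL actually distributes work; triangles that span tile borders would still need to be duplicated ):

    #include <stdio.h>

    #define TILES_X 4
    #define TILES_Y 2   /* 4x2 grid -> 8 nodes / APUs, one tile each */

    /* Map a post-projection screen position to the node owning that tile. */
    int node_for_vertex(float screen_x, float screen_y,
                        int screen_w, int screen_h)
    {
        int tx = (int)(screen_x * TILES_X / screen_w);
        int ty = (int)(screen_y * TILES_Y / screen_h);
        if (tx >= TILES_X) tx = TILES_X - 1;   /* clamp the right/bottom edges */
        if (ty >= TILES_Y) ty = TILES_Y - 1;
        return ty * TILES_X + tx;
    }

    int main(void)
    {
        printf("vertex at ( 500, 300 ) on a 640x480 screen goes to node %d\n",
               node_for_vertex(500.0f, 300.0f, 640, 480));
        return 0;
    }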
Maybe a good idea is to spend each node's processing power on compressing the geometry data further ( like Fafalada is saying ).
The problem, as you said as well, is the software: how we share the computational load between the nodes.
The idea is "how many black boxes can we connect together to distribute the work and gain performance ?". The design of the software has to follow that choice as well as the technological limitations of the interconnects ( network bandwidth and latency ).
The root of the problem, IMHO, is not really the node itself: it can be a black box with infinite computing power and internal bandwidth for all we care ( at this step of our analysis process ).
Panajev2001a said:
[0125] Implementation section 2332 contains the cell's core information. This information includes DMA command list 2334, programs 2336 and data 2338. Programs 2336 contain the programs to be run by the APUs (called "apulets"), e.g., APU programs 2360 and 2362, and data 2338 contain the data to be processed with these programs. DMA command list 2334 contains a series of DMA commands needed to start the programs. These DMA commands include DMA commands 2340, 2350, 2355 and 2358. The PU issues these DMA commands to the DMAC.
[0126] DMA command 2340 includes VID 2342. VID 2342 is the virtual ID of an APU which is mapped to a physical ID when the DMA commands are issued. DMA command 2340 also includes load command 2344 and address 2346. Load command 2344 directs the APU to read particular information from the DRAM into local storage. Address 2346 provides the virtual address in the DRAM containing this information. The information can be, e.g., programs from programs section 2336, data from data section 2338 or other data. Finally, DMA command 2340 includes local storage address 2348. This address identifies the address in local storage where the information should be loaded. DMA commands 2350 contain similar information. Other DMA commands are also possible.
[0127] DMA command list 2334 also includes a series of kick commands, e.g., kick commands 2355 and 2358. Kick commands are commands issued by a PU to an APU to initiate the processing of a cell. DMA kick command 2355 includes virtual APU ID 2352, kick command 2354 and program counter 2356. Virtual APU ID 2352 identifies the APU to be kicked, kick command 2354 provides the relevant kick command and program counter 2356 provides the address for the program counter for executing the program. DMA kick command 2358 provides similar information for the same APU or another APU.
[0128] As noted, the PUs treat the APUs as independent processors, not co-processors. To control processing by the APUs, therefore, the PU uses commands analogous to remote procedure calls. These commands are designated "APU Remote Procedure Calls" (ARPCs). A PU implements an ARPC by issuing a series of DMA commands to the DMAC. The DMAC loads the APU program and its associated stack frame into the local storage of an APU. The PU then issues an initial kick to the APU to execute the APU Program.
[0129] FIG. 24 illustrates the steps of an ARPC for executing an apulet. The steps performed by the PU in initiating processing of the apulet by a designated APU are shown in the first portion 2402 of FIG. 24, and the steps performed by the designated APU in processing the apulet are shown in the second portion 2404 of FIG. 24.
[0130] In step 2410, the PU evaluates the apulet and then designates an APU for processing the apulet. In step 2412, the PU allocates space in the DRAM for executing the apulet by issuing a DMA command to the DMAC to set memory access keys for the necessary sandbox or sandboxes. In step 2414, the PU enables an interrupt request for the designated APU to signal completion of the apulet. In step 2418, the PU issues a DMA command to the DMAC to load the apulet from the DRAM to the local storage of the APU. In step 2420, the DMA command is executed, and the apulet is read from the DRAM to the APU's local storage. In step 2422, the PU issues a DMA command to the DMAC to load the stack frame associated with the apulet from the DRAM to the APU's local storage. In step 2423, the DMA command is executed, and the stack frame is read from the DRAM to the APU's local storage. In step 2424, the PU issues a DMA command for the DMAC to assign a key to the APU to allow the APU to read and write data from and to the hardware sandbox or sandboxes designated in step 2412. In step 2426, the DMAC updates the key control table (KTAB) with the key assigned to the APU. In step 2428, the PU issues a DMA command "kick" to the APU to start processing of the program. Other DMA commands may be issued by the PU in the execution of a particular ARPC depending upon the particular apulet.
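Reading [0128]-[0130], the PU side of an ARPC could be sketched roughly like this ( every type and function below is invented just to mirror the patent's step numbers, it is not a real API ):

    #include <stdio.h>

    /* Hypothetical stand-in for the patent's cell/apulet description;
       the field names only echo [0125]-[0127].                           */
    typedef struct {
        int      vid;        /* virtual APU ID ( 2342 )                   */
        unsigned dram_addr;  /* virtual DRAM address of program/data      */
        unsigned ls_addr;    /* target address in the APU's Local Storage */
        unsigned entry_pc;   /* program counter for the kick ( 2356 )     */
    } apulet_desc;

    /* Each helper just logs the DMA command it would hand to the DMAC. */
    static void dma_set_keys(const apulet_desc *a)     { printf("2412: set sandbox keys for VID %d\n", a->vid); }
    static void enable_irq(const apulet_desc *a)       { printf("2414: enable completion IRQ for VID %d\n", a->vid); }
    static void dma_load_program(const apulet_desc *a) { printf("2418: load apulet DRAM 0x%x -> LS 0x%x\n", a->dram_addr, a->ls_addr); }
    static void dma_load_stack(const apulet_desc *a)   { printf("2422: load the stack frame into the LS\n"); (void)a; }
    static void dma_assign_key(const apulet_desc *a)   { printf("2424: assign sandbox key, update the KTAB\n"); (void)a; }
    static void dma_kick(const apulet_desc *a)         { printf("2428: kick VID %d, PC = 0x%x\n", a->vid, a->entry_pc); }

    /* The PU side of an ARPC, in the order of steps 2410-2428. */
    static void pu_issue_arpc(const apulet_desc *a)
    {
        dma_set_keys(a);       /* 2412 ( the APU itself was picked in 2410 ) */
        enable_irq(a);         /* 2414 */
        dma_load_program(a);   /* 2418 / 2420 */
        dma_load_stack(a);     /* 2422 / 2423 */
        dma_assign_key(a);     /* 2424 / 2426 */
        dma_kick(a);           /* 2428 */
    }

    int main(void)
    {
        apulet_desc a = { 3, 0x100000, 0x0, 0x80 };
        pu_issue_arpc(&a);
        return 0;
    }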
An APU cannot use the DMAC to ask for data from a remote CELL device: the APU has its own LS plus the shared DRAM the PE's DMAC is connected to, and IMHO that is all an APU can access and see.
The way I see it, an Apulet contains in its data field: the data itself, the Program Counter or PC setting, a "Virtual" Address ( to locate the data in the shared DRAM of the system; since the APU sees the DRAM partitioned into local sandboxes, I would say that the address is relative to the local sandbox ), etc...
APUs cannot execute code or work on data outside of their Local Storage ( LS ), so they need to load into their LS ( after backing up their current context ) the instructions and data that the Apulet contains.
When an Apulet is received it is stored in the shared DRAM first, until its content is DMA'ed into the right APU's LS.
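To picture that last point ( again with hypothetical names and sizes, just to illustrate the "everything goes through the sandbox and the LS" view ):

    #include <stdio.h>
    #include <string.h>

    #define LS_SIZE      (128 * 1024)         /* assumed Local Storage size        */
    #define SANDBOX_SIZE (1024 * 1024)        /* assumed size of one DRAM sandbox  */

    static unsigned char shared_dram[4 * SANDBOX_SIZE]; /* shared DRAM, 4 sandboxes */
    static unsigned char local_storage[LS_SIZE];        /* one APU's LS             */

    /* The APU only sees addresses relative to its own sandbox; the DMAC
       turns them into absolute DRAM addresses.                            */
    static unsigned char *sandbox_ptr(int sandbox_id, unsigned rel_addr)
    {
        return &shared_dram[sandbox_id * SANDBOX_SIZE + rel_addr];
    }

    /* The "load" part of an Apulet: copy program + data from the sandbox
       into the LS; only after this copy can the APU execute or touch it.  */
    static void dma_to_ls(int sandbox_id, unsigned rel_addr,
                          unsigned ls_addr, unsigned size)
    {
        memcpy(&local_storage[ls_addr], sandbox_ptr(sandbox_id, rel_addr), size);
    }

    int main(void)
    {
        strcpy((char *)sandbox_ptr(2, 0x100), "apulet program + data");
        dma_to_ls(2, 0x100, 0x0, 32);       /* stage the Apulet into the LS */
        printf("LS now holds: %s\n", (char *)local_storage);
        return 0;
    }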