PS3 distributed computing without internet limitations question

Okay, I've been wanting to ask this for a few days now, even though it's been discussed before; I've probably asked or mentioned it myself in various threads. But it's been on my mind and I feel it should be discussed: how feasible is it for PlayStation 3s to share processing and rendering resources if they are connected together LOCALLY, not over the internet at all, with its bandwidth and latency limitations?
What if we have 2, 3, 4 or more PS3s connected together via link cable, like we have had with the PS1, Saturn, Dreamcast, PS2, Xbox, etc. for multiplayer games where each player has their own screen... but not for multiplayer gaming in this case, because now we want to combine the processing and graphics rendering power of at least several PS3s. Or if not a link cable, then whatever is the fastest direct connection: serial, ethernet, whatever. All of our sharing PS3s are on a table, linked up, with no internet delays to worry about. We want to use all the resources of all the PS3s to provide more simulation and rendering performance for a game.

We have PS3s so close together, almost like the PS2 chipsets in the GSCube. Well, maybe not THAT close, but not far away like across the internet.


With internet limitations removed in my scenario, what possibilities does this open up for realtime rendering/gaming with the combined power of, say, 4 to 16 PS3s connected directly?


One more thing... there was an article or post from 2003 that gave an example of 4 PS3s connected in the same room: 1 PS3 would dish out work to the 3 other PS3s. I've been looking for it as I put this little thread together but haven't found it; maybe someone will remember where that is (I'm sure it got posted here on B3D) and post it here.
 
Well, if massively multiplayer games are possible - where a multitude of player positions and states are exchanged rapidly - then I see no reason why stuff with a low sampling rate, like AI, physics, sound, and maybe even distant geometry, shouldn't be distributable, even over relatively slow, high-latency DSL connections.
The actual rendering and high-precision geometry transformations, on the other hand, would be ill suited to distributed computing, even with a high-speed optical fibre connection.
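
Just to put rough numbers on that intuition (all figures invented for illustration), a quick back-of-the-envelope check of what fits through a DSL-class uplink:

Code:
#include <cstdio>

struct Workload { const char* name; double bytes_per_update; double updates_per_sec; };

int main() {
    const double dsl_bytes_per_sec = 16 * 1024;            // ~128 kbit/s uplink
    const Workload loads[] = {
        { "AI decisions",       256,             10 },     // a few hundred bytes, 10 Hz
        { "physics snapshots",  512,             20 },
        { "distant geometry",   8 * 1024,         1 },     // coarse LOD, rarely refreshed
        { "per-frame vertices", 1024 * 1024,     60 },     // the non-starter
    };
    for (const Workload& w : loads) {
        double need = w.bytes_per_update * w.updates_per_sec;
        std::printf("%-18s needs %10.0f B/s -> %s over DSL\n", w.name, need,
                    need <= dsl_bytes_per_sec ? "fits" : "does NOT fit");
    }
}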
 
I'm going to get some flak from some people, but...

The basic problem with the distributed Cell system (i.e. lots of PS3s chained together) is data locality.

Take a look at the IBM patent that describes each APU executing a DMA transfer. There's your problem: Cell programs can only migrate IF the data they use is local.

Consider a standard vertex transform APU program. It has some local state (in APU-local eDRAM), it has vertex data in PU-local main RAM, and it outputs into PU-local main RAM.
Everything's fine, with the program transforming a vast amount of data ready for processing by the GPU.
Now add another Cell: it has its own PU and its own set of APUs, and you decide to shift the vertex program onto this new Cell.
So it has local state (in APU-local eDRAM), but where is its vertex data? The answer is in the other Cell's main RAM, so you either have to find a place that both Cells can access or copy the data back and forth between Cells as needed.
So we add some system RAM that both Cells can see, and we (perhaps) have the PS3 everybody thinks Sony will release.

Now connect another PS3 over a network: at some point you're going to have to copy source and destination data from the APU. If the source transform data is static you have to copy several MB of transformed vertex data OR several MB of framebuffers.

While it may be possible to make some work distributable, most game work won't be able to migrate off the main memory system. It will be hard enough having to move to a manual memory system (adding DMA calls to get the data you need into APU memory space) for game code, let alone a fully distributed model.
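
To make the "manual memory" point concrete, here's a rough sketch of the pattern (not real Cell code; dma_get/dma_put and the simulated main RAM are stand-ins I made up). The kernel can only touch what it has explicitly pulled into local store, which is exactly why it can't simply be migrated to a Cell whose DMAC can't see that RAM.

Code:
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct Vertex { float x, y, z, w; };

static std::vector<uint8_t> g_main_ram(1 << 20);     // stand-in for PU-local main RAM

// Invented DMA primitives: the only way data crosses between "main RAM" and
// the APU-style local store in this sketch.
void dma_get(void* local_dst, size_t main_src, size_t bytes) {
    std::memcpy(local_dst, g_main_ram.data() + main_src, bytes);
}
void dma_put(size_t main_dst, const void* local_src, size_t bytes) {
    std::memcpy(g_main_ram.data() + main_dst, local_src, bytes);
}

constexpr size_t kBatch = 64;                        // local store is small, so batch

// Transform `count` vertices sitting at `src` in main RAM, writing to `dst`.
// Note the loop never dereferences main RAM directly: get, process, put.
void transform_vertices(size_t src, size_t dst, size_t count) {
    Vertex local_in[kBatch], local_out[kBatch];      // "APU local storage"
    for (size_t base = 0; base < count; base += kBatch) {
        size_t n = (count - base < kBatch) ? count - base : kBatch;
        dma_get(local_in, src + base * sizeof(Vertex), n * sizeof(Vertex));
        for (size_t i = 0; i < n; ++i)               // placeholder transform
            local_out[i] = Vertex{ local_in[i].x * 2.f, local_in[i].y * 2.f,
                                   local_in[i].z * 2.f, 1.f };
        dma_put(dst + base * sizeof(Vertex), local_out, n * sizeof(Vertex));
    }
}

int main() {
    const size_t count = 256, src = 0, dst = count * sizeof(Vertex);
    for (size_t i = 0; i < count; ++i) {
        Vertex v = { float(i), float(i), float(i), 1.f };
        std::memcpy(g_main_ram.data() + src + i * sizeof(Vertex), &v, sizeof v);
    }
    transform_vertices(src, dst, count);
    Vertex out;
    std::memcpy(&out, g_main_ram.data() + dst + 10 * sizeof(Vertex), sizeof out);
    std::printf("vertex 10 after transform: %.1f %.1f %.1f\n", out.x, out.y, out.z);
}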
 
FWIW I agree with Deano: the only types of tasks you can practically retarget at runtime are those that package input, output, and code locally. The output buffers could be used as inputs to other processes, with dependencies listed per process.

I was just recently speculating about an ideal scripting system for a parallel environment, and ended up with this basic design, but as far as I can determine there isn't any existing scripting language that efficiently addresses this issue.

Once data movement starts to take time (i.e. similar amounts of time to the processing itself) you have a really tricky scheduling problem to solve, and you start to run the risk of slowing down the system by adding more processors.

Shared memory with no snoop controller plus local pools is also a really scary thing; I've seen benchmarks (for server-type applications) that show negligible improvements as the number of processors scales, simply because managing the shared memory pool effectively serialises the process. I'm not saying this will be the case with Cell, I'm just saying that predicting how effectively something will scale is not simple.
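
For what it's worth, here is roughly what I mean by a self-contained task package (my own sketch, not an existing API): each task carries its inputs, outputs and dependency list, so a scheduler could ship the whole bundle to whichever processor is free and chain outputs into inputs.

Code:
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct Buffer {
    std::string name;                // e.g. "skinned_verts"
    std::vector<uint8_t> bytes;      // the data travels with the task
};

struct Task {
    std::string name;
    std::vector<Buffer> inputs;      // copied in before the task runs
    std::vector<Buffer> outputs;     // copied out afterwards, usable as someone else's input
    std::vector<std::string> deps;   // names of tasks that must finish first
    std::function<void(std::vector<Buffer>&, std::vector<Buffer>&)> run;
};

// Trivial in-order scheduler: run a task only once its dependencies are done.
// A real one would also weigh the cost of moving the buffers against the cost
// of just running the task locally.
void run_all(std::vector<Task>& tasks) {
    std::vector<std::string> done;
    bool progress = true;
    while (progress) {
        progress = false;
        for (Task& t : tasks) {
            if (std::find(done.begin(), done.end(), t.name) != done.end()) continue;
            bool ready = true;
            for (const std::string& d : t.deps)
                if (std::find(done.begin(), done.end(), d) == done.end()) ready = false;
            if (!ready) continue;
            t.run(t.inputs, t.outputs);
            done.push_back(t.name);
            progress = true;
        }
    }
}

int main() {
    Task gen;                                        // produces a buffer of vertices
    gen.name = "gen";
    gen.outputs.push_back({ "verts", std::vector<uint8_t>(64, 0) });
    gen.run = [](std::vector<Buffer>&, std::vector<Buffer>& out) { out[0].bytes[0] = 42; };

    Task xform;                                      // consumes it, so it depends on "gen"
    xform.name = "xform";
    xform.deps.push_back("gen");
    xform.run = [](std::vector<Buffer>&, std::vector<Buffer>&) { /* would read gen's output */ };

    std::vector<Task> tasks = { gen, xform };
    run_all(tasks);
}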
 
DeanoC said:
I'm going to get some flak from some people, but...

The basic problem with the distributed Cell system (i.e. lots of PS3s chained together) is data locality.

Take a look at the IBM patent that describes each APU executing a DMA transfer. There's your problem: Cell programs can only migrate IF the data they use is local.

Consider a standard vertex transform APU program. It has some local state (in APU-local eDRAM), it has vertex data in PU-local main RAM, and it outputs into PU-local main RAM.
Everything's fine, with the program transforming a vast amount of data ready for processing by the GPU.
Now add another Cell: it has its own PU and its own set of APUs, and you decide to shift the vertex program onto this new Cell.
So it has local state (in APU-local eDRAM), but where is its vertex data? The answer is in the other Cell's main RAM, so you either have to find a place that both Cells can access or copy the data back and forth between Cells as needed.
So we add some system RAM that both Cells can see, and we (perhaps) have the PS3 everybody thinks Sony will release.

Now connect another PS3 over a network: at some point you're going to have to copy source and destination data from the APU. If the source transform data is static you have to copy several MB of transformed vertex data OR several MB of framebuffers.

While it may be possible to make some work distributable, most game work won't be able to migrate off the main memory system. It will be hard enough having to move to a manual memory system (adding DMA calls to get the data you need into APU memory space) for game code, let alone a fully distributed model.

Unless you are distributing the game over several PlayStation 3 machines connected to this awesomely fast network fabric and each of them is running the same game (it would be a neat idea to work on for an MMORPG).

In a non-gaming environment you would have to find other ways to balance the load between the different CELL workstations: if they have enough system RAM they could dedicate part of it to loading up the code (think of distributed compilation), etc.

I agree with you that distributing load for real-time games is not as easy as it might be imagined. There are other problems that are addressed by balancing processing load over clusters of interconnected machines, and in one sense it will be interesting to see how CELL does in that area: the main appeal is that Sony and IBM would present a ready, top-to-bottom solution able to take advantage of a good deal of the FP computing power on offer. You can do everything with Linux and PCs, but picking the right hardware, writing the right programs, and debugging and testing the whole thing is not something that companies want to do.
 
I don't see much coming from it--certainly nothing that will impact your games' performance--over the Internet as we know it now. I could, however, see some effect from "local network" machines plugged together, however they're planning on enabling it: the "Cell TV", multiple PS3s, or whatnot. The main problem, I figure, lies with the developers' ability to tap more than just one PS3 for their games, and Sony's ability to get them to move on that. For general-purpose software or things built to take as much speed as you can throw at them...? I figure those will come around and work well.

As far as the Internet is concerned, CELL concepts may well hold some promise for the future (when the software gets more mature and our connections get faster), but at launch I don't see it doing much more than what we see in PCs now. Perhaps easier to implement--I'm not sure. (And the only question then is whether abusing it would be easier as well?)
 
ERP said:
FWIW I agree with Deano: the only types of tasks you can practically retarget at runtime are those that package input, output, and code locally. The output buffers could be used as inputs to other processes, with dependencies listed per process.

I was just recently speculating about an ideal scripting system for a parallel environment, and ended up with this basic design, but as far as I can determine there isn't any existing scripting language that efficiently addresses this issue.

Once data movement starts to take time (i.e. similar amounts of time to the processing itself) you have a really tricky scheduling problem to solve, and you start to run the risk of slowing down the system by adding more processors.

Shared memory with no snoop controller plus local pools is also a really scary thing; I've seen benchmarks (for server-type applications) that show negligible improvements as the number of processors scales, simply because managing the shared memory pool effectively serialises the process. I'm not saying this will be the case with Cell, I'm just saying that predicting how effectively something will scale is not simple.

There is a reason why CELL does not take the place of BlueGene/L, and IMHO it does not rest simply on better DP FP performance ;).

I see CELL as effective mainly in renderfarms, and in game development as far as distributed processing is concerned (the idea is to maximize productivity by distributing compilation, content generation, etc. across the dev-kits/workstations the developer is using: the classic "hey, why should compilation take 45 minutes if all these other PCs are basically sitting idle?" problem that wants to be solved).

I can understand what you are saying: if we have to transfer too much work between nodes, then transferring data becomes the limiting factor no matter how many systems are connected to the network (network speed matters a lot in these scenarios).

For Pixar, the most precious thing is the code that balances the rendering load across their Sun machines, and the fact that this code runs well and is stable on those Sun workstations.

You could not simply ask them to upgrade the CPUs, even to clearly more powerful ones, and re-write some of their code, as the downtime to learn the new architecture and how to push it properly would be big. If we could offer them a complete solution, as I said, with an OS (the CELL OS) that does its best to abstract load distribution (taking that away from the Pixar+Sun software that was designed to do it), then it would be easier for them to adapt their PRMan setup to the new renderfarm. This is where I see them going with the whole CELL workstation strategy (I see them costing less than $25k).
 
Hmmm, well if realtime rendering is difficult even with incredibly fast optical fiber connections, then just let me stack some PS3s together, connecting them not with cables at all but more like the GSCube, or better yet, SGI visualization systems (RE, RE2, IR, IR2, IR3, etc.) where at least 16 pipes can be connected together (each pipe a set of 3 boards: Geometry Engine board, Raster Manager board, Image Generator board).

BTW, how does SGI's architecture actually work (compared to distributed computing), where you can combine the graphics processing of up to 16 or more SGI systems to focus on one realtime problem/application?
 
The only thing worth talking about for massively multi-processing systems is connections and their bandwidth; it's described in terms of dimensions in the parallel processing literature.

(I may well have got these bits wrong, so I'm very interested if anybody reads the patents differently...)

My reading of Cell is that each APU is 8D-connected (each APU is connected to its PU and every other APU), and PUs are then 4D-connected to the other PUs in their broadband engine (BBE) by a shared memory pool. This is where Sony go quiet and start talking about cyberspace.

Assuming a 1000 Mb/s point-to-point network connection between 2 BBEs, we get a 1D link.

So BBEs are 1D, PUs 4D and APUs 8D. Moving data from APU0 on BBE0 to APU0 on BBE1 involves moving the data 3 times, with the bottleneck being the slowest link (the network link).

Now expand it out; let's try increasing the BBE dimensionality...

Assuming a 1000 Mb/s point-to-point network connection between 16 BBEs, we get a 16D link.

Again there are 3 moves from APU to APU on a different BBE, but your network connection is now effectively 16 times slower...

O.K., let's try a different topology. Let's make BBEs connect in hierarchies.

4 BBEs connect in a star pattern to other groups of 4 BBEs.
To move from APU0 on BBE0 to APU0 on BBE15 (as far away as possible), we have to copy from APU0 to BBE0 RAM, to a centre BBE, from that BBE to BBE15, and then from BBE15 to APU0. 4 moves.

The network connection is 1000 Mb/s shared by 4 BBEs, but you have to travel over 2 different connections.

Check out the Connection Machine 2: it had 65,536 1-bit processors in a 12D hypercube topology. Ran LISP well...
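
To put crude numbers on the topology argument (my own toy model, nothing from the patents): the APU-to-APU rate across boxes is set by the slowest link on the path, divided by however many share it.

Code:
#include <algorithm>
#include <cstdio>

// Effective APU-to-APU rate across boxes: the slowest link on the path wins,
// and a network link shared by `sharers` BBEs (or crossed `hops` times) gets
// divided accordingly. Deliberately crude.
double effective_rate(double local_copy_mb_s,   // APU <-> BBE RAM copy rate
                      double network_mb_s,      // raw point-to-point link rate
                      int sharers, int hops)
{
    double net = network_mb_s / sharers / hops;
    return std::min(local_copy_mb_s, net);
}

int main() {
    const double local = 3000.0;   // invented local copy rate, MB/s
    const double gige  = 125.0;    // 1000 Mb/s is roughly 125 MB/s
    std::printf("2 BBEs, 1 link        : %6.1f MB/s\n", effective_rate(local, gige, 1, 1));
    std::printf("16 BBEs on like links : %6.1f MB/s\n", effective_rate(local, gige, 16, 1));
    std::printf("4x4 star, 2 hops      : %6.1f MB/s\n", effective_rate(local, gige, 4, 2));
}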
 
DeanoC said:
The only thing worth talking about for massively multi-processing systems is connections and their bandwidth; it's described in terms of dimensions in the parallel processing literature.

Well, people working on BlueGene/L and CELL (well, IBM's top-level designers) have already gone on record with "computation will not be the limit, shifting data (data flow) will" with regard to the needs of IBM's flavour of Cellular Computing.

Does that make you feel any better :) ?


P.S.: I am also checking the rest of the post, but wanted to give you a quick reply.
 
Megadrive1988 said:
Hmmm, well if realtime rendering is difficult even with incredibly fast optical fiber connections, then just let me stack some PS3s together, connecting them not with cables at all but more like the GSCube, or better yet, SGI visualization systems (RE, RE2, IR, IR2, IR3, etc.) where at least 16 pipes can be connected together (each pipe a set of 3 boards: Geometry Engine board, Raster Manager board, Image Generator board).

BTW, how does SGI's architecture actually work (compared to distributed computing), where you can combine the graphics processing of up to 16 or more SGI systems to focus on one realtime problem/application?

Most SGIs use a duplicated memory system: each unit has an exact copy of the memory it uses (i.e. you have 16 versions of texture memory, all the same; 3dfx's Voodoo 5 and 6 copied the idea).

This helps for a while, as each unit has fast access to its own memory pool, BUT it just moves the problem up one level. Change a texture and you have to duplicate that change 16 times. In the end your bottleneck becomes the updating of the memory pools.

I.e. each IG has a 1D connection to its memory, but the memory controller has a 1-to-16 replication problem. Anywhere you have bandwidth being multiplied by 16 (or anything higher) becomes a big problem.

That's why SGI couldn't beat NVIDIA/ATI in the end. There was no simple way of scaling to more units; all they could do was make each unit more efficient. ATI/NVIDIA got the jump on them, and that's why top-end SGIs now use 16 parallel ATI cards.
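
A tiny illustration of that replication cost (workload numbers invented): with N identical copies of texture memory, every update has to be written N times, so the required upload bandwidth grows linearly with the number of pipes even though per-pipe read bandwidth stays flat.

Code:
#include <cstdio>

int main() {
    const double texture_updates_mb_per_frame = 8.0;    // hypothetical workload
    const double fps = 60.0;
    const int pipe_counts[] = { 1, 4, 16 };
    for (int pipes : pipe_counts) {
        // Every update must be written into each pipe's private copy.
        double upload_mb_s = texture_updates_mb_per_frame * fps * pipes;
        std::printf("%2d pipes -> %5.0f MB/s of duplicated texture uploads\n",
                    pipes, upload_mb_s);
    }
}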
 
Panajev2001a said:
DeanoC said:
The only thing worth talking about for massively multi-processing systems is connections and their bandwidth; it's described in terms of dimensions in the parallel processing literature.

Well, people working on BlueGene/L and CELL (well, IBM's top-level designers) have already gone on record with "computation will not be the limit, shifting data (data flow) will" with regard to the needs of IBM's flavour of Cellular Computing.

Does that make you feel any better :) ?


P.S.: I am also checking the rest of the post, but wanted to give you a quick reply.

I should hope so; it was originally said when I was playing on my Spectrum, roughly 20 years ago :)

SIMD, MIMD, ASMP, SMP and clusters have all been tried with some success. The Cell approach is certainly interesting (it's technically a hybrid, mixing a bit of everything with a few new ideas to glue it all together).

The multi-processor world holds its breath and hopes...
 
DeanoC said:
The only thing worth talking about for massively multi-processing systems is connections and their bandwidth; it's described in terms of dimensions in the parallel processing literature.

(I may well have got these bits wrong, so I'm very interested if anybody reads the patents differently...)

My reading of Cell is that each APU is 8D-connected (each APU is connected to its PU and every other APU), and PUs are then 4D-connected to the other PUs in their broadband engine (BBE) by a shared memory pool. This is where Sony go quiet and start talking about cyberspace.

Assuming a 1000 Mb/s point-to-point network connection between 2 BBEs, we get a 1D link.

So BBEs are 1D, PUs 4D and APUs 8D. Moving data from APU0 on BBE0 to APU0 on BBE1 involves moving the data 3 times, with the bottleneck being the slowest link (the network link).

Now expand it out; let's try increasing the BBE dimensionality...

Assuming a 1000 Mb/s point-to-point network connection between 16 BBEs, we get a 16D link.

Again there are 3 moves from APU to APU on a different BBE, but your network connection is now effectively 16 times slower...

O.K., let's try a different topology. Let's make BBEs connect in hierarchies.

4 BBEs connect in a star pattern to other groups of 4 BBEs.
To move from APU0 on BBE0 to APU0 on BBE15 (as far away as possible), we have to copy from APU0 to BBE0 RAM, to a centre BBE, from that BBE to BBE15, and then from BBE15 to APU0. 4 moves.

The network connection is 1000 Mb/s shared by 4 BBEs, but you have to travel over 2 different connections.

Check out the Connection Machine 2: it had 65,536 1-bit processors in a 12D hypercube topology. Ran LISP well...


Second quick reply:

I do not know if we can call the APUs 8D-connected, as I do not know if an APU can send data to another APU without using the DMAC; I am pretty sure, according to the various patents, that the only processor that can write at will inside an APU's LS (through APU RPCs) is the PU.

So an APU is connected to the PU and the DMAC: 2D-connected, if you count the DMAC, as it is the way to send data to shared DRAM to be, ahem, shared with other APUs.

Each PU is connected to the DMAC and to 4 other PUs through the BE's bus.

Is the PU 4D-connected or 5D-connected?

OK... I checked: ARPCs are implemented through DMAC commands:

[0128] As noted, the PUs treat the APUs as independent processors, not co-processors. To control processing by the APUs, therefore, the PU uses commands analogous to remote procedure calls. These commands are designated "APU Remote Procedure Calls" (ARPCs). A PU implements an ARPC by issuing a series of DMA commands to the DMAC. The DMAC loads the APU program and its associated stack frame into the local storage of an APU. The PU then issues an initial kick to the APU to execute the APU Program.
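
So an ARPC boils down to a scripted series of DMAC commands plus a kick; something like this in pseudo-code form (function names and signatures invented, not a real API):

Code:
#include <cstdint>
#include <cstdio>

// Stand-ins for the commands the PU would issue to the DMAC (invented names).
void dmac_load_to_local_store(int apu, uint64_t dram_addr, uint32_t ls_addr, uint32_t bytes) {
    std::printf("DMA: APU%d LS 0x%04x <- DRAM 0x%08llx (%u bytes)\n",
                apu, (unsigned)ls_addr, (unsigned long long)dram_addr, (unsigned)bytes);
}
void dmac_kick(int apu, uint32_t program_counter) {
    std::printf("KICK: APU%d starts at LS 0x%04x\n", apu, (unsigned)program_counter);
}

// A PU-side "remote procedure call" onto an APU, following the sequence in [0128]:
// load the APU program, load its stack frame, then kick.
void arpc(int apu, uint64_t program_in_dram, uint32_t program_bytes,
                   uint64_t stack_in_dram,   uint32_t stack_bytes)
{
    const uint32_t ls_program = 0x0000;                  // where in LS the code lands
    const uint32_t ls_stack   = ls_program + program_bytes;
    dmac_load_to_local_store(apu, program_in_dram, ls_program, program_bytes);
    dmac_load_to_local_store(apu, stack_in_dram,   ls_stack,   stack_bytes);
    dmac_kick(apu, ls_program);                          // start execution at the program
}

int main() { arpc(0, 0x10000, 4096, 0x20000, 1024); }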


What I still like a lot about CELL remains the Apulet (an excuse to post it again :D, what can I do, that is what working in routing will do for you):

[0120] The present invention also provides a new programming model for the processors of system 101. This programming model employs software cells 102. These cells can be transmitted to any processor on network 104 for processing. This new programming model also utilizes the unique modular architecture of system 101 and the processors of system 101.

[0121] Software cells are processed directly by the APUs from the APU's local storage. The APUs do not directly operate on any data or programs in the DRAM. Data and programs in the DRAM are read into the APU's local storage before the APU processes these data and programs. The APU's local storage, therefore, includes a program counter, stack and other software elements for executing these programs. The PU controls the APUs by issuing direct memory access (DMA) commands to the DMAC.

[0122] The structure of software cells 102 is illustrated in FIG. 23. As shown in this figure, a software cell, e.g., software cell 2302, contains routing information section 2304 and body 2306. The information contained in routing information section 2304 is dependent upon the protocol of network 104. Routing information section 2304 contains header 2308, destination ID 2310, source ID 2312 and reply ID 2314. The destination ID includes a network address. Under the TCP/IP protocol, e.g., the network address is an Internet protocol (IP) address. Destination ID 2310 further includes the identity of the PE and APU to which the cell should be transmitted for processing. Source ID 2314 contains a network address and identifies the PE and APU from which the cell originated to enable the destination PE and APU to obtain additional information regarding the cell if necessary. Reply ID 2314 contains a network address and identifies the PE and APU to which queries regarding the cell, and the result of processing of the cell, should be directed.

[0123] Cell body 2306 contains information independent of the network's protocol. The exploded portion of FIG. 23 shows the details of cell body 2306. Header 2320 of cell body 2306 identifies the start of the cell body. Cell interface 2322 contains information necessary for the cell's utilization. This information includes global unique ID 2324, required APUs 2326, sandbox size 2328 and previous cell ID 2330.

[0124] Global unique ID 2324 uniquely identifies software cell 2302 throughout network 104. Global unique ID 2324 is generated on the basis of source ID 2312, e.g. the unique identification of a PE or APU within source ID 2312, and the time and date of generation or transmission of software cell 2302. Required APUs 2326 provides the minimum number of APUs required to execute the cell. Sandbox size 2328 provides the amount of protected memory in the required APUs' associated DRAM necessary to execute the cell. Previous cell ID 2330 provides the identity of a previous cell in a group of cells requiring sequential execution, e.g., streaming data.

[0125] Implementation section 2332 contains the cell's core information. This information includes DMA command list 2334, programs 2336 and data 2338. Programs 2336 contain the programs to be run by the APUs (called "apulets"), e.g., APU programs 2360 and 2362, and data 2338 contain the data to be processed with these programs. DMA command list 2334 contains a series of DMA commands needed to start the programs. These DMA commands include DMA commands 2340, 2350, 2355 and 2358. The PU issues these DMA commands to the DMAC.
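
To make the patent text a bit more concrete, here is a rough C-style rendering of what a software cell carries (field names follow FIG. 23 as quoted; the types and layout are my own guess, not the patent's):

Code:
#include <cstdint>
#include <vector>

struct RoutingInfo {                 // format depends on the network protocol (e.g. TCP/IP)
    uint32_t destination_ip;
    uint16_t dest_pe, dest_apu;      // which PE/APU should process the cell
    uint32_t source_ip;              // who created the cell
    uint32_t reply_ip;               // where queries and results should go
};

struct CellInterface {
    uint64_t global_unique_id;       // derived from the source ID plus a timestamp
    uint32_t required_apus;          // minimum APUs needed to execute the cell
    uint32_t sandbox_size;           // protected DRAM needed for execution
    uint64_t previous_cell_id;       // for groups needing sequential execution (streams)
};

struct DmaCommand {
    uint8_t  virtual_apu_id;         // mapped to a physical APU when issued
    uint8_t  opcode;                 // load or kick
    uint64_t dram_address;           // where the payload sits in shared DRAM
    uint32_t local_store_address;    // where in the APU's LS it should land
};

struct SoftwareCell {
    RoutingInfo             routing;
    CellInterface           cell_interface;
    std::vector<DmaCommand> dma_command_list;   // how to stage and start the apulets
    std::vector<uint8_t>    programs;           // the apulets themselves
    std::vector<uint8_t>    data;               // the data they operate on
};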
 
DeanoC said:
Panajev2001a said:
DeanoC said:
The only thing worth talking about for massively multi-processing systems is connections and their bandwidth; it's described in terms of dimensions in the parallel processing literature.

Well, people working on BlueGene/L and CELL (well, IBM's top-level designers) have already gone on record with "computation will not be the limit, shifting data (data flow) will" with regard to the needs of IBM's flavour of Cellular Computing.

Does that make you feel any better :) ?


P.S.: I am also checking the rest of the post, but wanted to give you a quick reply.

I should hope so; it was originally said when I was playing on my Spectrum, roughly 20 years ago :)

SIMD, MIMD, ASMP, SMP and clusters have all been tried with some success. The Cell approach is certainly interesting (it's technically a hybrid, mixing a bit of everything with a few new ideas to glue it all together).

The multi-processor world holds its breath and hopes...

Does the hope all rely on interconnect (on-chip busses, network links, etc.) speed?

I do not expect CELL to scale to BlueGene/L levels, not until the interconnects are fast enough (after all, if you look at the Broadband Engine and at BlueGene/L, you can see the PU as one of the PowerPC cores and the 8 APUs as the other PowerPC core in the node... we can always imagine each PE containing only 1 or 2 APUs ;) ).

Also, we do not know yet how much CELL will progress as an ISA and what tricks the system builders will find to scale CELL-based machines up: I am sorry, but the day Intel unveiled McKinley/Itanium 2 I was not expecting 256-512 CPU Altix systems :).

If you really wanted to be mean about scaling CELL I would question your 1 Gbps connection: is that the best you can do? You do not want me to link-aggregate (Layer 2, not incredibly expensive... routing 10 Gbps will be, though) 5-10 links on yo assssssss :).
 
Most SGIs use a duplicated memory system: each unit has an exact copy of the memory it uses (i.e. you have 16 versions of texture memory, all the same; 3dfx's Voodoo 5 and 6 copied the idea).

This helps for a while, as each unit has fast access to its own memory pool, BUT it just moves the problem up one level. Change a texture and you have to duplicate that change 16 times. In the end your bottleneck becomes the updating of the memory pools.

I.e. each IG has a 1D connection to its memory, but the memory controller has a 1-to-16 replication problem. Anywhere you have bandwidth being multiplied by 16 (or anything higher) becomes a big problem.

That's why SGI couldn't beat NVIDIA/ATI in the end. There was no simple way of scaling to more units; all they could do was make each unit more efficient. ATI/NVIDIA got the jump on them, and that's why top-end SGIs now use 16 parallel ATI cards.


Much thanks for that explanation, DeanoC 8)

It also helped ATI that they had many of SGI's best engineers :devilish:
 
You're probably right about the APUs not being connected to each other except via the DMAC (effectively a manual memory crossbar). If anything, that makes things worse...

What I don't see with the Apulet idea is the data. Where and how is the APU getting its data? Is it contained in the Apulet, with no DMA? If so, it's portable and can easily be shifted off onto other Cells; but if it issues DMA then it is virtually no more portable than a direct memory access.

The command to the DMAC would have to use a global address (IP:PU:Address) to work transparently across a network, and then you hit the performance problem. If an APU can ask for data from a remote memory pool then its performance will be next to nothing.
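
Something like this is what I mean by a global address (purely hypothetical, just to make the point concrete):

Code:
#include <cstdint>

// A DMAC command that worked transparently across a network would need to
// carry machine, PU and local address, and any access that resolves to a
// remote machine pays network latency and bandwidth.
struct GlobalAddress {
    uint32_t ip;        // which machine on the network
    uint8_t  pu;        // which PU / memory pool inside that machine
    uint64_t address;   // offset within that pool
};

bool is_remote(const GlobalAddress& a, uint32_t local_ip) {
    return a.ip != local_ip;    // if true, expect orders-of-magnitude slower access
}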
 
[0125] Implementation section 2332 contains the cell's core information. This information includes DMA command list 2334, programs 2336 and data 2338. Programs 2336 contain the programs to be run by the APUs (called "apulets"), e.g., APU programs 2360 and 2362, and data 2338 contain the data to be processed with these programs. DMA command list 2334 contains a series of DMA commands needed to start the programs. These DMA commands include DMA commands 2340, 2350, 2355 and 2358. The PU issues these DMA commands to the DMAC.

[0126] DMA command 2340 includes VID 2342. VID 2342 is the virtual ID of an APU which is mapped to a physical ID when the DMA commands are issued. DMA command 2340 also includes load command 2344 and address 2346. Load command 2344 directs the APU to read particular information from the DRAM into local storage. Address 2346 provides the virtual address in the DRAM containing this information. The information can be, e.g., programs from programs section 2336, data from data section 2338 or other data. Finally, DMA command 2340 includes local storage address 2348. This address identifies the address in local storage where the information should be loaded. DMA commands 2350 contain similar information. Other DMA commands are also possible.

[0127] DMA command list 2334 also includes a series of kick commands, e.g., kick commands 2355 and 2358. Kick commands are commands issued by a PU to an APU to initiate the processing of a cell. DMA kick command 2355 includes virtual APU ID 2352, kick command 2354 and program counter 2356. Virtual APU ID 2352 identifies the APU to be kicked, kick command 2354 provides the relevant kick command and program counter 2356 provides the address for the program counter for executing the program. DMA kick command 2358 provides similar information for the same APU or another APU.

[0128] As noted, the PUs treat the APUs as independent processors, not co-processors. To control processing by the APUs, therefore, the PU uses commands analogous to remote procedure calls. These commands are designated "APU Remote Procedure Calls" (ARPCs). A PU implements an ARPC by issuing a series of DMA commands to the DMAC. The DMAC loads the APU program and its associated stack frame into the local storage of an APU. The PU then issues an initial kick to the APU to execute the APU Program.

[0129] FIG. 24 illustrates the steps of an ARPC for executing an apulet. The steps performed by the PU in initiating processing of the apulet by a designated APU are shown in the first portion 2402 of FIG. 24, and the steps performed by the designated APU in processing the apulet are shown in the second portion 2404 of FIG. 24.

[0130] In step 2410, the PU evaluates the apulet and then designates an APU for processing the apulet. In step 2412, the PU allocates space in the DRAM for executing the apulet by issuing a DMA command to the DMAC to set memory access keys for the necessary sandbox or sandboxes. In step 2414, the PU enables an interrupt request for the designated APU to signal completion of the apulet. In step 2418, the PU issues a DMA command to the DMAC to load the apulet from the DRAM to the local storage of the APU. In step 2420, the DMA command is executed, and the apulet is read from the DRAM to the APU's local storage. In step 2422, the PU issues a DMA command to the DMAC to load the stack frame associated with the apulet from the DRAM to the APU's local storage. In step 2423, the DMA command is executed, and the stack frame is read from the DRAM to the APU's local storage. In step 2424, the PU issues a DMA command for the DMAC to assign a key to the APU to allow the APU to read and write data from and to the hardware sandbox or sandboxes designated in step 2412. In step 2426, the DMAC updates the key control table (KTAB) with the key assigned to the APU. In step 2428, the PU issues a DMA command "kick" to the APU to start processing of the program. Other DMA commands may be issued by the PU in the execution of a particular ARPC depending upon the particular apulet.

An APU cannot use the DMAC to ask for data from a remote CELL device: the APU has its own LS and the shared DRAM that the PE's DMAC is connected to, and IMHO that is all an APU can access and see.

The way I see it, an Apulet contains in its data field: data ( :) ), the Program Counter (PC) setting, a "virtual" address (to locate the data in the shared DRAM of the system; the APU sees DRAM partitioned into local sandboxes, so I would say the address is relative to the local sandbox), etc.

APUs cannot execute code or work on data outside of their Local Storage (LS), and they need to load into their LS (backing up their current context) the instructions and data that the Apulet contains.

When the Apulet is received it is stored in the shared DRAM first, until its contents are DMA'd into the right APU's LS.
 
Panajev2001a said:
If you really wanted to be mean about scaling CELL I would question your 1 Gbps connection: is that the best you can do? You do not want me to link-aggregate (Layer 2, not incredibly expensive... routing 10 Gbps will be, though) 5-10 links on yo assssssss :).

But that's the point: no matter how much bandwidth you have, IF the connection is bad you will flood the bandwidth just moving data around.

You can have local RAM speeds and you will still hit problems... it just occurs at different points.

Example:
Say an APU is a vertex transform node processing the procedural output of another Cell. Let's say the procedural APU can produce 100 million vertices of 10 bytes each per second, so 1 GB per second. The transform node processes at the same speed, so it produces another 1 GB per second.
You've just lost 2 GB/s of bandwidth for 2 APUs. Now assume there are 20 APUs using similar bandwidth, and that's 20 GB/s.

Modern memory bandwidth is only in the region of 30 GB/s... It would be hard to scale much further even with local memory access, let alone across a network connection...

APUs will probably be able to produce data faster than the memory can keep up, and almost certainly faster than network connections.
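
Here's that arithmetic written out with the same numbers (plus a GigE link for comparison):

Code:
#include <cstdio>

int main() {
    const double verts_per_sec  = 100e6;     // 100 million vertices/s per APU
    const double bytes_per_vert = 10.0;
    const double per_apu_gb_s   = verts_per_sec * bytes_per_vert / 1e9;   // 1 GB/s

    const int    apus           = 20;
    const double total_gb_s     = per_apu_gb_s * apus;                    // 20 GB/s

    const double local_mem_gb_s = 30.0;      // "modern memory bandwidth"
    const double gige_gb_s      = 0.125;     // a 1000 Mb/s link is about 0.125 GB/s

    std::printf("per-APU stream : %5.2f GB/s\n", per_apu_gb_s);
    std::printf("%d APUs        : %5.2f GB/s of %.1f GB/s local memory\n",
                apus, total_gb_s, local_mem_gb_s);
    std::printf("one GigE link  : %5.3f GB/s, nowhere near enough\n", gige_gb_s);
}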
 
So basically, realtime distributed computing for rendering just is not going to happen for time-critical, highly detailed action games, any more than raytracing was going to happen on the Ultra 64.
 
DeanoC said:
Panajev2001a said:
If you really wanted to be mean about scaling CELL I would question your 1 Gbps connection: is that the best you can do? You do not want me to link-aggregate (Layer 2, not incredibly expensive... routing 10 Gbps will be, though) 5-10 links on yo assssssss :).

But that's the point: no matter how much bandwidth you have, IF the connection is bad you will flood the bandwidth just moving data around.

You can have local RAM speeds and you will still hit problems... it just occurs at different points.

Example:
Say an APU is a vertex transform node processing the procedural output of another Cell. Let's say the procedural APU can produce 100 million vertices of 10 bytes each per second, so 1 GB per second. The transform node processes at the same speed, so it produces another 1 GB per second.
You've just lost 2 GB/s of bandwidth for 2 APUs. Now assume there are 20 APUs using similar bandwidth, and that's 20 GB/s.

Modern memory bandwidth is only in the region of 30 GB/s... It would be hard to scale much further even with local memory access, let alone across a network connection...

APUs will probably be able to produce data faster than the memory can keep up, and almost certainly faster than network connections.

As always you have good points. The way I see it, the OS could check how many ms it would take to process the Apulet on the closest neighbour on the network and see whether that is less than the time until the next local APU is freed (this Apulet might require many APUs, and the local system might not even have that many, or might not have that many free until the next frame or so); if so, you send the Apulet over the network.
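
Something like this, very roughly (invented names and a made-up cost model):

Code:
#include <cstdio>

struct Apulet {
    double payload_mb;      // code + data + results that must cross the link
    double work_ms;         // estimated processing time once it is on an APU
    int    apus_required;   // minimum APUs the cell asks for
};

// Offload only if shipping the apulet over the link and running it remotely
// beats waiting for enough local APUs to free up.
bool should_offload(const Apulet& a,
                    double link_mb_per_ms,       // ~0.125 for a 1 Gb/s link
                    double ms_until_local_free,  // when enough local APUs are free
                    int    local_apus_free)
{
    double transfer_ms = a.payload_mb / link_mb_per_ms;
    double remote_ms   = transfer_ms + a.work_ms;
    double local_ms    = (local_apus_free >= a.apus_required)
                             ? a.work_ms
                             : ms_until_local_free + a.work_ms;
    return remote_ms < local_ms;
}

int main() {
    Apulet physics{ 2.0, 4.0, 4 };   // 2 MB payload, 4 ms of work, needs 4 APUs
    bool go = should_offload(physics, 0.125, 8.0, 1);
    std::printf("offload the physics apulet? %s\n", go ? "yes" : "no");
    // 2 MB over ~0.125 MB/ms is 16 ms of transfer alone, which is already more
    // than a frame: exactly the bandwidth problem being described.
}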

I know I am oversimplifying things, but I was trying to look at the situation from a higher level perspective.

What kind of connection would you expect to have ?

I think it is also a software issue: in the case you proposed, insane bandwidth is the only possibility... pretty much what you would need in a local system too.

If you had three black boxes and each processed at the same throughput of 100 MVertices/s at 10 bytes per vertex, we are in trouble if we expect to transfer the stream of data from one box to the next.

No matter what is inside each box.

What would you do here ? What would you wish to have ?
 