PS3 distributed computing without internet limitations question.

Gubbi said:
The one big problem with CELL is going to be how to slice and dice your software so that it can run on a given number of PEs/APUs. The number of APUs in a system is fixed, as is the amount of local storage attached to each APU; AFAIK no virtualization is implemented (neither of PEs/APUs nor of storage). So programs and data have to be squeezed into packets with a fixed (small) upper size.

You might enjoy this:

http://makeashorterlink.com/?J1B164038

Overlay Management :).
 
I wanted to add something in regard to DeanoC's example about different nodes transferring quite large numbers of vertices and wasting incredible amounts of bandwidth.

I think that when configuring a parallel system you always have to think about how much bandwidth you have to share between the nodes, and how many nodes you can add before the performance boost becomes rather insignificant.

How much data you have to "constantly" pass back and forth depends on the programmers, not on the ISA or the processor implementation ( well, to a certain degree at least ).

The problem is, for example, each APU working on 100 MVertices/s with each vertex being 10 bytes in size: 100 MVertices/s x 10 bytes/vertex means each APU could need about 1 GB/s ( roughly 8 Gbps ) of bandwidth just to stream that data around.

Perhaps increasing the bandwidth between the CELL nodes is not feasible in the project ( depending on the costs, you keep increasing the bandwidth of the network until the network is not the bottleneck anymore and hopefully you have picked up some performance along the way ).

But is the proposed set-up ( "Transform Node, Projection Node, etc." ) an optimal one, or a worst-case scenario intended to show what high inter-node traffic can do to a distributed system ?

If network bandwidth is the limit, but local processing power has not hit its sweet spot yet, and if it is possible ( CELL is not here to solve all of humanity's computing problems... at least not until each APU is a Quantum Processor [seriously, as an aside, the pure fact that algorithms with execution times of O( x^y ) or similar would become feasible with a Quantum Processor really excites me: the benefits of storing -1, 0 and 1 at the same time :D] ), maybe you should try to maximize the computing workload of each node and minimize the data you transfer ( re-calculate and do not transmit ).
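As a back-of-the-envelope illustration of "re-calculate and do not transmit" (the batch count and sizes below are purely hypothetical; only the 100 MVertices / 10 bytes figures come from the example above):

Code:
#include <stdio.h>

/* Option A: stream every transformed vertex to the next node, every frame.
   Option B: send one 4x4 float matrix per batch and let the receiving node
             re-transform the vertices it already holds locally.            */
int main(void)
{
    const long long vertices     = 100000000LL; /* 100 MVertices, as above          */
    const long long vertex_bytes = 10LL;        /* 10-byte packed vertex            */
    const long long batches      = 10000LL;     /* hypothetical object/batch count  */
    const long long matrix_bytes = 64LL;        /* 16 floats per transform matrix   */

    printf("transmit     : %lld bytes per frame\n", vertices * vertex_bytes); /* ~1 GB   */
    printf("re-calculate : %lld bytes per frame\n", batches  * matrix_bytes); /* ~640 KB */
    return 0;
}

Even with generous assumptions, the matrices add up to a few hundred KB per frame versus roughly a gigabyte of raw vertices.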

As far as geometry processing goes: I doubt that each APU would be able to push 100 MVertices with a decently complex shader running ( maybe tiling the scene intelligently between the nodes might be a good idea ? ).

Maybe a good idea is to spend each node's processing power on compressing geometry data further ( like Fafalada is saying ).

The problem, as you said as well, is the software: how we share the computational load between nodes.

The idea is "how many black boxes can we connect together to distribute performance ?". The design of the software has to follow our choice and the technology limitations of the interconnects as well ( network bandwidth and latency ).

The root of the problem, IMHO, is not really the node itself; it can be a black box with infinite computing power and internal bandwidth for all we care ( at this step of our analysis process ).


Panajev2001a said:
[0125] Implementation section 2332 contains the cell's core information. This information includes DMA command list 2334, programs 2336 and data 2338. Programs 2336 contain the programs to be run by the APUs (called "apulets"), e.g., APU programs 2360 and 2362, and data 2338 contain the data to be processed with these programs. DMA command list 2334 contains a series of DMA commands needed to start the programs. These DMA commands include DMA commands 2340, 2350, 2355 and 2358. The PU issues these DMA commands to the DMAC.

[0126] DMA command 2340 includes VID 2342. VID 2342 is the virtual ID of an APU which is mapped to a physical ID when the DMA commands are issued. DMA command 2340 also includes load command 2344 and address 2346. Load command 2344 directs the APU to read particular information from the DRAM into local storage. Address 2346 provides the virtual address in the DRAM containing this information. The information can be, e.g., programs from programs section 2336, data from data section 2338 or other data. Finally, DMA command 2340 includes local storage address 2348. This address identifies the address in local storage where the information should be loaded. DMA commands 2350 contain similar information. Other DMA commands are also possible.

[0127] DMA command list 2334 also includes a series of kick commands, e.g., kick commands 2355 and 2358. Kick commands are commands issued by a PU to an APU to initiate the processing of a cell. DMA kick command 2355 includes virtual APU ID 2352, kick command 2354 and program counter 2356. Virtual APU ID 2352 identifies the APU to be kicked, kick command 2354 provides the relevant kick command and program counter 2356 provides the address for the program counter for executing the program. DMA kick command 2358 provides similar information for the same APU or another APU.

[0128] As noted, the PUs treat the APUs as independent processors, not co-processors. To control processing by the APUs, therefore, the PU uses commands analogous to remote procedure calls. These commands are designated "APU Remote Procedure Calls" (ARPCs). A PU implements an ARPC by issuing a series of DMA commands to the DMAC. The DMAC loads the APU program and its associated stack frame into the local storage of an APU. The PU then issues an initial kick to the APU to execute the APU Program.

[0129] FIG. 24 illustrates the steps of an ARPC for executing an apulet. The steps performed by the PU in initiating processing of the apulet by a designated APU are shown in the first portion 2402 of FIG. 24, and the steps performed by the designated APU in processing the apulet are shown in the second portion 2404 of FIG. 24.

[0130] In step 2410, the PU evaluates the apulet and then designates an APU for processing the apulet. In step 2412, the PU allocates space in the DRAM for executing the apulet by issuing a DMA command to the DMAC to set memory access keys for the necessary sandbox or sandboxes. In step 2414, the PU enables an interrupt request for the designated APU to signal completion of the apulet. In step 2418, the PU issues a DMA command to the DMAC to load the apulet from the DRAM to the local storage of the APU. In step 2420, the DMA command is executed, and the apulet is read from the DRAM to the APU's local storage. In step 2422, the PU issues a DMA command to the DMAC to load the stack frame associated with the apulet from the DRAM to the APU's local storage. In step 2423, the DMA command is executed, and the stack frame is read from the DRAM to the APU's local storage. In step 2424, the PU issues a DMA command for the DMAC to assign a key to the APU to allow the APU to read and write data from and to the hardware sandbox or sandboxes designated in step 2412. In step 2426, the DMAC updates the key control table (KTAB) with the key assigned to the APU. In step 2428, the PU issues a DMA command "kick" to the APU to start processing of the program. Other DMA commands may be issued by the PU in the execution of a particular ARPC depending upon the particular apulet.

An APU cannot use the DMAC to ask for data from a remote CELL device: the APU has its own LS and the shared DRAM that the PE's DMAC is connected to, and IMHO this is all an APU can access and see.

The way I see it, an Apulet contains in the data field: data ( :) ), the Program Counter or PC setting, a "Virtual" Address ( to locate the data in the shared DRAM of the system; the APU sees the DRAM partitioned into local sandboxes, so I would say that the address is relative to the local sandbox ), etc...

APUs cannot execute code or work on data outside of their Local Storage or LS, and they need to load into their LS ( backing up their current context ) the instructions and data that the Apulet contains.

When an Apulet is received it is stored in the shared DRAM first, until its content is DMA'ed into the right APU's LS.
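To make the ARPC sequence of paragraph [0130] a bit more concrete, here is a rough PU-side sketch; every function name below is made up, only the order of the steps comes from the patent:

Code:
apu = designate_apu(apulet);                      /* step 2410: evaluate apulet, pick an APU  */
dma_set_sandbox_keys(dmac, apulet->sandboxes);    /* step 2412: allocate DRAM sandbox(es)     */
enable_completion_interrupt(apu);                 /* step 2414: interrupt on completion       */
dma_load(dmac, apu, apulet->program, ls_code);    /* steps 2418/2420: apulet code  -> APU LS  */
dma_load(dmac, apu, apulet->stack, ls_stack);     /* steps 2422/2423: stack frame  -> APU LS  */
dma_assign_key(dmac, apu, apulet->sandboxes);     /* steps 2424/2426: key into the KTAB       */
dma_kick(dmac, apu, ls_code);                     /* step 2428: set the PC and start the APU  */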
 
Megadrive1988 said:
all of our sharing PS3s are on a table, linked up. no internet delays to worry about. we want to use all the resources of all the PS3s to provide more simulation and rendering performance for a game.

Judging from the GI article (below), I think Cell's domestic modus communicare will be wireless. It's simple for users. There are no cables to serialize the flow of information. And it will put a physical limit on capacity -- in case geeks get any bright ideas about supercomputing. ;)

Wireless networking is set to form a key part of Sony's grand plan for its next generation hardware, gi.biz has learned, with the company's vision of the future banking heavily on the proliferation of high-speed wireless hotspots in the home and in public places.

It's already well known that the PlayStation Portable will feature hardware supporting the 802.11 wireless networking system, but new details of Sony's future vision reveal that this will not only be used for multiplayer between PSP devices - and explain why the company opted for the more expensive and power-hungry 802.11 standard rather than the seemingly more logical Bluetooth wireless system.

Well-placed sources have informed us that Sony plans to use wireless networking not only for multiplayer between PSP devices, but also to link the PSP with the next-generation home console, PS3, and with wireless internet "hot spots" to enable online multiplayer and internet communication functionality.

Source: gamesindustry.biz
 
nAo said:
Squeak said:
A bit of a tangent maybe, but how many bytes does an average vertex take up in a modern videogame? Sometimes it's quoted at 40 bytes, other sources say as low as 8 bytes (sounds very low to me).
It depends upon the implementation.
On the PS2 my biggest vertex is 15 bytes in size and it has position, normal, a UV mapping, and a couple of vertex colors (which can be interpolated between day time and night time..)
In my case all data is quantized and compressed.
An uncompressed vertex can cost a lot of space... It's easy to have 40-byte vertices..

Fafalada said:
Like nAo says, it's up to implementation.
The smallest vertices with 3 or more vertex attributes that we use are 7 bytes/vertex, and that's still all done with "trivial" compression (scalar/vector quantization, sometimes with delta offsets).
Anyway, once you start going to more exotic schemes (not feasible with current hw, but next-gen, who knows) you can go down to a couple of bits per vertex, and lower...
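For anyone wondering what a quantized vertex along the lines nAo and Fafalada describe could look like, here is a purely hypothetical C layout; the field widths are invented, only the idea of scalar quantization with a per-mesh scale/offset is theirs:

Code:
/* Hypothetical ~10-byte quantized vertex; real layouts differ per game and per mesh. */
struct PackedVertex {
    short         x, y, z;  /* position quantized to 16 bits per axis (per-mesh scale/offset) */
    unsigned char nx, ny;   /* normal stored as two quantized components, z reconstructed     */
    unsigned char u, v;     /* UVs quantized to the mesh's texture window                     */
};
/* Decompression on the VU/APU side: pos = offset + (float)q * scale, one scale/offset per mesh. */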

I guess the real question is, what current scheme is most efficient?
A setup like the PS2's, where all the geometry is clipped, culled and LODed before being sent over the external bus, but which has to use a non-compressed format (actually I don't know if the GS can handle any compression other than strips and fans, like low-precision geometry for the background for example), or the Xbox and GC's way of doing it, where packed, quantized geometry can be sent to the on-die TnL hardware?

The same question (although in a much more complex way) is probably also relevant to distributed rendering architectures like Cell.
 
Squeak said:
I guess the real question is, what current scheme is most efficient?
Even in this case it's up to the hardware implementation.
The perfect architecture would work at peak in every condition, but trade-offs have to be made. Dunno about XBOX and GC internals, but the PS2 is very efficient in this regard. In the common case (triangle strip, a vertex with position, color and mapping) a vertex needs 3 bus cycles to be transferred from the EE to the GS. That's roughly 50 MPoly/s.. I'd say that's pretty balanced with respect to the VU1 transform rate and the GS primitive setup rate; in fact it would be overkill if textures were not transferred on the same bus.. :)
The same question (although in a much more complex way) is probably also relevant to distributed rendering architectures like Cell.
I bet CELL would be much more flexible in this regard. Like you pointed out, the GS can't cope with compressed data.. I believe on a CELL-like architecture this kind of problem can be addressed in exotic ways :)
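For reference, and assuming the commonly quoted ~150 MHz, 64-bit EE-to-GS bus: 3 cycles per vertex is 3 x 8 = 24 bytes per vertex, and 150 MHz / 3 cycles per vertex ≈ 50 M vertices/s at the bus's ~1.2 GB/s peak, which is where the 50 MPoly/s figure above comes from (with long strips, roughly one vertex per triangle).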
 
Panajev2001a said:
Gubbi said:
The one big problem with CELL is going to be how to slice and dice your software so that it can run on a given number of PEs/APUs. The number of APUs in a system is fixed, as is the amount of local storage attached to each APU; AFAIK no virtualization is implemented (neither of PEs/APUs nor of storage). So programs and data have to be squeezed into packets with a fixed (small) upper size.

You might enjoy this:

http://makeashorterlink.com/?J1B164038

Overlay Management :).

This only relates to how you manage the locally attached storage (is it 384KB per APU?).

Generally, while employment of the hybrid system provides high computational performance, it poses significant challenges to the programming model. One such problem relates to the APU. The APU cannot directly address system memory. ...

And it then goes on to describe how to split the local storage RAM between program and data.

It in fact confirms my previous statement, that only the MPU (PE) can talk to main memory.

BTW I love how the Sony hype machine cranks out new buzzwords:
....and a specialized, or "attached" processor unit (APU), such as a Synergistic.TM. processor unit (SPU)....

Synergistic Processor Unit, nice. :)

Cheers
Gubbi
 
Gubbi said:
Panajev2001a said:
Gubbi said:
The one big problem with CELL is going to be how to slice and dice your software so that it can run on a given number of PEs/APUs. The number of APUs in a system is fixed, as is the amount of local storage attached to each APU; AFAIK no virtualization is implemented (neither of PEs/APUs nor of storage). So programs and data have to be squeezed into packets with a fixed (small) upper size.

You might enjoy this:

http://makeashorterlink.com/?J1B164038

Overlay Management :).

This only relates to how you manage the locally attached storage (is it 384KB per APU?).

Generally, while employment of the hybrid system provides high computational performance, it poses significant challenges to the programming model. One such problem relates to the APU. The APU cannot directly address system memory. ...

And it then goes on to describe how to split the local storage RAM between program and data.

It's 128 KB, and it also deals with taking the code and data your code uses and splitting it into modules, so that you can write basically arbitrarily long APU programs.

The Local Storage is Virtualized in a way.

It in fact confirms my previous statement, that only the MPU (PE) can talk to main memory.

The MPU of the PE would be the PU... however, you are correct, the APUs cannot talk directly to the shared DRAM, but they can DMA data in and out without having the PU spoon-feed them.

We do not know enough about the design of the DMAC itself to know how many simultaneous requests it can service.

The Local Storage is the APU's main RAM, btw :).
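As a sketch of what "DMA data in and out without having the PU spoon-feed them" could look like from the APU side (dma_get / dma_wait / process are made-up names, not from the patent), double-buffering inside the 128 KB LS:

Code:
/* Hypothetical APU-side streaming loop: while the APU works on one 16 KB buffer,
   the DMAC fills the other from the shared-DRAM sandbox, hiding transfer latency. */
enum { CHUNK = 16 * 1024 };
char buf[2][CHUNK];                                     /* double buffer inside the LS */

dma_get(buf[0], dram_addr, CHUNK, 0);                   /* prime the first buffer      */
for (i = 0; i < num_chunks; i++) {
    cur = i & 1;
    if (i + 1 < num_chunks)                             /* prefetch the next chunk     */
        dma_get(buf[cur ^ 1], dram_addr + (i + 1) * CHUNK, CHUNK, cur ^ 1);
    dma_wait(cur);                                      /* wait for the current buffer */
    process(buf[cur], CHUNK);                           /* work on local data only     */
}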
 
nAo said:
That one is from IBM..
My bad.


Panajev2001a said:
It's 128 KB, and it also deals with taking the code and data your code uses and splitting it into modules, so that you can write basically arbitrarily long APU programs.

The Local Storage is Virtualized in a way.

So code is DMAed into a predefined (at link time) chunk of local storage on an on-demand basis. This means that if your program size exceeds the available local storage, temporal locality drops to zero? Is this really a good idea?


Panajev2001a said:
The MPU of the PE would be the PU... however, you are correct, the APUs cannot talk directly to the shared DRAM, but they can DMA data in and out without having the PU spoon-feed them.

We do not know enough about the design of the DMAC itself to know how many simultaneous requests it can service.

My guess is one DMA channel per APU, i.e. 32 channels for a 4 PE x 8 APU setup.

Also, the shared DRAM in this case is the on-die EDRAM, right? External DRAM transactions have to be done by the PUs. This effectively means you have to have the entire program and working set in EDRAM for it to be of any use to the APUs.

Cheers
Gubbi
 
Gubbi said:
nAo said:
That one is from IBM..
My bad.


Panajev2001a said:
It's 128 KB, and it also deals with taking the code and data your code uses and splitting it into modules, so that you can write basically arbitrarily long APU programs.

The Local Storage is Virtualized in a way.

So code is DMAed into a predefined (at link time) chunk of local storage on an on-demand basis. This means that if your program size exceeds the available local storage, temporal locality drops to zero? Is this really a good idea?

A root module would stay in the LS and could not be moved/overwritten.

Data that is touched by all modules would stay in the LS, the way I read it, as common data.

Each module would carry its own data-set.

Code and data ( only code if you really wanted to ;) ) would then be uploaded on a per-need basis by the root module, which has hooks for all attached modules.

It is not a perfect solution, but it is better than just giving programmers 128 KB of LS per APU and telling them to fit all their code in there ( and DMA the data like they do with the VUs ) or find their own way to manually update the code.

This way at least abstracts, as much as possible, the job of manually dividing the program into blocks and streaming the blocks in and out of the LS.

You can write your code like you normally would, allowing the APUs to be used for more general-purpose tasks as well.

You can also probably write the code in APU ASM, make it fit the LS, and DMA the data in and out yourself if you want maximum efficiency, but that is going to be up to the programmers IMHO.
 
Squeak said:
or the Xbox and GC's way of doing it, where packed, quantized geometry can be sent to the on-die TnL hardware?
Well, this part the PS2 already does - you are sending compressed geometry to your T&L hw. As for the external interface to the rasterizer - so long as your chip packaging can handle a wide enough bus, I don't really see much difference to having it on die either.
Still, it's a matter of hw implementation like nAo said; even an on-die interface to the rasterizer doesn't guarantee optimal speed by itself.

Gubbi said:
So code is DMAed into a predefined (at link time) chunk of local storage on an on-demand basis. This means that if your program size exceeds the available local storage, temporal locality drops to zero? Is this really a good idea?
It worked pretty well on old PCs with swapping from horrendously slow HDDs.

Also, the shared DRAM in this case is the on-die EDRAM, right? External DRAM transactions have to be done by the PUs.
The general impression I got was that shared Dram was external mem (and furthermore that BE will probably not have eDram at all, if you don't count APU storages).
Either way I doubt DMA would be limited to accessing eDram only either.
 
Fafalada said:
Gubbi said:
So code is DMAed into a predefined (at link time) chunk of local storage on an on-demand basis. This means that if your program size exceeds the available local storage, temporal locality drops to zero? Is this really a good idea?
It worked pretty well on old PCs with swapping from horrendously slow HDDs.
It worked. But I don't think you can say it worked well. The lack of fine-grained memory access is going to result in a hefty bandwidth overhead penalty; of course BE seems to have plenty of bandwidth, so it might not be a problem. The real killer is latency.

I can see developers trading off bandwidth for latency, eg packing multiple tree-nodes into larger chunks (think B-trees).
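A hedged sketch of the "pack multiple tree nodes into larger chunks" idea (the sizes and layout are invented): instead of DMAing one small node per pointer chase, a whole fan-out is laid out in one DMA-friendly block so a single transfer covers several comparisons of the search.

Code:
/* Hypothetical B-tree-style node sized to roughly one DMA transfer: one fetch from
   the shared DRAM brings in many keys, so a lookup needs far fewer round trips
   (latency hits) at the cost of moving more bytes than it strictly uses.          */
#define KEYS_PER_NODE 126                    /* chosen so the node is about 1 KB    */
struct FatNode {
    int      num_keys;                       /* how many keys are actually in use   */
    int      keys[KEYS_PER_NODE];            /* sorted search keys                  */
    unsigned children[KEYS_PER_NODE + 1];    /* DRAM addresses of child fat-nodes   */
};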

Fafalada said:
Also, the shared DRAM in this case is the on-die EDRAM, right? External DRAM transactions have to be done by the PUs.
The general impression I got was that shared Dram was external mem (and furthermore that BE will probably not have eDram at all, if you don't count APU storages).
Either way I doubt DMA would be limited to accessing eDram only either.

Ok, seems fair. But can each APU have multiple outstanding memory transactions (DMA transfers) or are they serialized?

Cheers
Gubbi
 
Gubbi said:
It worked. But I don't think you can say it worked well.
Fair enough, I rose-tinted it a bit :p

The real killer is latency. I can see developers trading off bandwidth for latency, eg packing multiple tree-nodes into larger chunks (think B-trees).
Agreed here. I could also see devs dropping stuff that jumps around memory a lot onto the PUs - provided they will be fast enough.

Ok, seems fair. But can each APU have multiple outstanding memory transactions (DMA transfers) or are they serialized?
This is one thing that I've been greatly wondering about myself too...
 
Panajev2001a said:
Code and data ( only code if you really wanted to ;) ) would then be uploaded on a per-need basis by the root module, which has hooks for all attached modules.

Re-read the patent and thought about this for a bit. The patent mentions that this loading mechanism could be hardware or software, but preferably a program running on an MPU (software on the PU or APU).

This means that the root module is really just a master program that loads the submodules and executes them, probably with multiple sub-module regions and prefetching to hide latency. Kind of like this:

Code:
load_code(region1, module1);               /* start DMAing the first three modules */
load_code(region2, module2);
load_code(region3, module3);
wait_for_region_to_load_and_call(region1); /* run module1 once its DMA completes   */
load_code(region1, module4);               /* refill region1 right away            */
wait_for_region_to_load_and_call(region2); /* run module2                          */
load_code(region2, module5);               /* refill region2                       */
wait_for_region_to_load_and_call(region3); /* run module3                          */
wait_for_region_to_load_and_call(region1); /* run module4                          */
wait_for_region_to_load_and_call(region2); /* run module5                          */

Basically the compiler/linker does all the housekeeping for you.

Cheers
Gubbi
 
nAo said:
Squeak said:
I guess the real question is, what current scheme is most efficient?
Even in this case it's up to the hardware implementation.
The perfect architecture would work at peak in every condition, but trade-offs have to be made. Dunno about XBOX and GC internals, but the PS2 is very efficient in this regard. In the common case (triangle strip, a vertex with position, color and mapping) a vertex needs 3 bus cycles to be transferred from the EE to the GS. That's roughly 50 MPoly/s.. I'd say that's pretty balanced with respect to the VU1 transform rate and the GS primitive setup rate; in fact it would be overkill if textures were not transferred on the same bus.. :)

Fafalada said:
Squeak said:
or the Xbox and GC's way of doing it, where packed, quantized geometry can be sent to the on-die TnL hardware?
Well, this part the PS2 already does - you are sending compressed geometry to your T&L hw. As for the external interface to the rasterizer - so long as your chip packaging can handle a wide enough bus, I don't really see much difference to having it on die either.
Still, it's a matter of hw implementation like nAo said; even an on-die interface to the rasterizer doesn't guarantee optimal speed by itself.

But if you could compress transformed geometry you would have even more bandwidth for textures :), but the relatively slow page-based nature of GS VRAM is probably the main culprit regarding large textures.

The same question (although in a much more complex way) is probably also relevant to distributed rendering architectures like Cell.
I bet CELL would be much more flexible in this regard. Like you pointed out, the GS can't cope with compressed data.. I believe on a CELL-like architecture this kind of problem can be addressed in exotic ways :)

Also, the shared DRAM in this case is the on-die EDRAM, right? External DRAM transactions have to be done by the PUs.
The general impression I got was that shared Dram was external mem (and furthermore that BE will probably not have eDram at all, if you don't count APU storages).
Either way I doubt DMA would be limited to accessing eDram only either.
If Sony is going for really dense geometry, good compression of that is going to be much more important than texture compression, especially if there is no on-die eDRAM. But maybe the tessellation will be left exclusively to the Visualizer, which most certainly will have heaps of eDRAM?
 
Squeak said:
But if you could compress transformed geometry you would have even more bandwidth for textures :), but the relatively slow page-based nature of GS VRAM is probably the main culprit regarding large textures.
The most important thing is balancing. The PS2 architecture has many pitfalls, nevertheless in this case it is well balanced (if one can use PATH3 to transfer textures.. :) )

If Sony is going for really dense geometry, good compression of that is going to be much more important than texture compression, especially if there is no on-die eDRAM. But maybe the tessellation will be left exclusively to the Visualizer, which most certainly will have heaps of eDRAM?
I don't think tessellation will be left exclusively to the Visualizer/Realizer. It would be a mistake...

ciao,
Marco
 
nAo said:
Squeak said:
But if you could compress transformed geometry you would have even more bandwidth for textures :), but the relatively slow page-based nature of GS VRAM is probably the main culprit regarding large textures.
The most important thing is balancing. The PS2 architecture has many pitfalls, nevertheless in this case it is well balanced (if one can use PATH3 to transfer textures.. :) )

But what happens when you use a texture that spans several pages, and bilinear (or worse still trilinear) is turned on? Won't the texture cache have to thrash between the two (or four!) pages for every single texel at the seam, to get correct interpolation?
 
Squeak said:
But what happens when you use a texture that spans several pages, and bilinear (or worse still trilinear) is turned on? Won't the texture cache have to thrash between the two (or four!) pages for every single texel at the seam, to get correct interpolation?
Umh.. at this time I can't see how this question is relevant to the current discussion, but I'm going to reply anyway :)
Mip mapping helps us there, forcing the HW to fetch texels in a way that keeps the texel/pixel ratio < 1, and thus avoiding a lot of page breaks and texture cache thrashing.
Trilinear can be a problem though.. but it's possible to use it just on textures that can completely fit (mipmaps included) in a single page (like a 64x64 texture, which can be very effectively tiled on a ground..)

ciao,
Marco
 
nAo said:
Umh.. at this time I can't see how this question is relevant to the current discussion, but I'm going to reply anyway :)
Well, threads like these have a tendency to spread out into other subjects after a few pages, but I apologise for the blatant use of this thread for satisfaction of personal curiosity. :p
Anyway, thank you for the answer.
 
Pepto-Bismol said:
Megadrive1988 said:
all of our sharing PS3s are on a table, linked up. no internet delays to worry about. we want to use all the resources of all the PS3s to provide more simulation and rendering performance for a game.

Judging from the GI article (below), I think Cell's domestic modus communicare will be wireless. It's simple for users. There are no cables to serialize the flow of information. And it will put a physical limit on capacity -- in case geeks get any bright ideas about supercomputing. ;)

Also... I was under the impression that Apulets / software cells get allocated a certain degree of computing power / time budget, and only if this is expired does the task at hand get distributed to other local APUs. If this exceeds the local APUs' resources then I presume it distributes it to the LAN/WAN as a last resort? It seems to me that LAN/WAN usage will only be applied if coded for from the outset... possible for an MMORPG...

A quote from the original Cell patent,
[0144] In the future, the speed of processing by the APUs will become faster. The time budget established by the absolute timer, however, will remain the same. For example, as shown in FIG. 28, an APU in the future will execute a task in a shorter period and, therefore, will have a longer standby period. Busy period 2808, therefore, is shorter than busy period 2802, and standby period 2810 is longer than standby period 2806. However, since programs are written for processing on the basis of the same time budget established by the absolute timer, coordination of the results of processing among the APUs is maintained. As a result, faster APUs can process programs written for slower APUs without causing conflicts in the times at which the results of this processing are expected.

This also states that future, faster APUs will not run your games any faster. :( I presume this is to keep everything in sync and backwards compatible for the Cell architecture. This implies that you always have to take your target platform into consideration when coding, and daisy-chaining several Cell devices isn't going to 'compute' any faster even though there is a larger pool of resources, unless devs explicitly aim for this :)
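A minimal sketch of the fixed-time-budget idea in paragraph [0144] (the timer and task names are invented): results are only handed over when the absolute-timer budget expires, so a faster APU simply spends more of each period idle instead of getting ahead of the other APUs.

Code:
/* Hypothetical: every task is pegged to the same absolute-timer budget,
   no matter how fast the APU that runs it happens to be.                */
for (;;) {
    deadline = absolute_timer_now() + TASK_BUDGET;  /* fixed budget per task           */
    run_apu_task();                                 /* busy period (2802 / 2808)       */
    while (absolute_timer_now() < deadline)
        ;                                           /* standby period (2806 / 2810) -  */
                                                    /* a faster APU simply idles more  */
    publish_results();                              /* results stay coordinated        */
}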
 