DiGuru said:
Yes. But you're still assuming a global data model with a commonly used data structure, and so your answer is about how an architecture like Cell can do well in one of the worst scenarios we can think of. It can do much better if you use a different data model. One designed with such an architecture in mind.
Um, no, nothing in my previous post makes that assumption. The only assumption made is that a tree data structure is used. Global or non-global is orthogonal, since I said nothing about whether there are multiple copies in XDR or whether the SPUs are performing updates.
The issue is how much of the tree can fit into local memory. If you need to do a tree search and the entire tree cannot fit into local memory, then it has to be paged in from external memory. This holds true whether you consider multiple SPUs or not. Even if you took a tree, divided it into 7 subtrees as a forest, and gave each one to an SPU, you still might have to stream in data.
An example of such a large tree would be a BSP for the static geometry of the level, which allows very efficient world-object collision detection. This is going to exceed the combined memory of all local stores easily.
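If you do end up streaming subtrees like that, one mitigation is to lay the tree out pointer-free. A rough sketch of what I mean, with made-up names (nothing from the Cell SDK): nodes in a flat, depth-first array with child indices, so any subtree occupies a contiguous index range and can be pulled into local store with one sequential DMA rather than a pointer chase through XDR.

```cpp
// Hypothetical pointer-free BSP layout: a contiguous run of nodes (a subtree)
// can be fetched into an SPU local store with a single sequential DMA.
#include <cstdint>
#include <vector>

struct Plane { float nx, ny, nz, d; };          // splitting plane

struct BspNode {
    Plane   split;
    int32_t front;      // index of front child in the node array, -1 if leaf
    int32_t back;       // index of back child, -1 if leaf
    int32_t firstTri;   // first triangle index (leaves only)
    int32_t triCount;   // triangle count (leaves only)
};

// Nodes stored depth-first: any subtree occupies the contiguous index range
// [root, root + subtreeSize[root]), so "stream subtree i" is one linear copy.
struct BspTree {
    std::vector<BspNode> nodes;
    std::vector<int32_t> subtreeSize;   // per-node size of the subtree it roots
};
```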
For object-object collisions, you might want to give each object its own tree (bounding spheres, axis-aligned bounding boxes, or oriented bounding boxes). Now the problem is that for each dynamic object you've got a tree, and ideally you want to group those trees into forests based on locality, and then pack each forest into an SPU's local store for collision detection and physics calculations.
So you need a data structure which efficiently groups objects (and their attached trees and other physics data) by spatial locality *AND* is memory local (so it can be easily fetched into an SPU by a streaming request instead of a gather operation) *AND* is efficiently updatable, so that as objects move, the amortized cost of keeping this data structure in sync is not too expensive.
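To make that concrete, here's roughly the shape I have in mind; just a sketch under my own assumptions (the names, field layout, and fixed 256-object capacity are arbitrary): a grid of locales, each a contiguous block of fixed-stride object records, so a whole locale is one sequential DMA in or out, and moving an object between locales is a single record copy.

```cpp
// Hypothetical locale layout: every object in a cell is one fixed-stride
// record in a contiguous block, so the whole locale streams with one DMA.
#include <cstdint>

struct alignas(16) ObjectRecord {          // 16-byte aligned for clean DMA
    float    pos[4];                       // xyz + bounding-sphere radius in w
    float    vel[4];                       // velocity, padded to 16 bytes
    uint32_t objectId;
    uint32_t sphereTreeOffset;             // where this object's bounding tree lives
    uint32_t pad[2];                       // keep the record a multiple of 16 bytes
};

struct Locale {
    uint32_t     cellX, cellY, cellZ;      // grid coordinates of this locale
    uint32_t     count;                    // live entries in 'records'
    ObjectRecord records[256];             // fixed capacity keeps the block flat
};
```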
<BrainStorm mode on> I think I'd want the PPE to decompose the physics problem like so (a rough PPE-side sketch follows the list):
1) find all spatial locales that will be updated (e.g. because some event is going to trigger something to move or apply a force) (could potentially be parallelized). Locales with no updates may or may not require simulation, depending on whether the developer wants to simulate that which is too far away to see. In other words, if a tree falls in a forest and no one was there, did it happen?
Some people may not like this and may want the world to keep going when they aren't watching.
2) for each spatial locale, do a quick trivial check to see if anything needs to be done (no need to upload data to an SPU and have it do work if the check can be done faster than the transfer); if there are no collisions, proceed directly to the time-step/constraint solver (SPU parallelizable)
3) if collision detection needs to be done, the locale's structures must be streamed to one or more SPUs. The SPUs must compute collisions and feed the results into the constraint solver (depending on how you resolve collisions and interpenetration). Here you may run into a problem if the locale spans multiple SPUs, since they need to communicate this information.
4) the spatial data structures must be updated. Object bounding trees need to be recomputed if an object changed (its bounding sphere/hull could be bigger or smaller), and the spatial data structure which clusters these objects together in space must be updated if an object has moved.
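Here's the rough PPE-side sketch of steps 1-3 I mentioned, reusing the Locale record sketched above; the dirty-check and job-queue calls are placeholders for whatever mechanism you'd actually use, not a real API:

```cpp
// Hypothetical PPE-side dispatch over spatial locales (steps 1-3 above).
#include <cstdint>
#include <vector>

// Placeholders for whatever job/queue mechanism actually ships work to the SPUs.
bool localeIsDirty(const Locale& l);        // step 1: an event touched this locale
void queueIntegrateJob(const Locale& l);    // step 2: no contacts, time-step only
void queueCollisionJob(const Locale& l);    // step 3: stream locale to an SPU

// Step 2's "quick trivial check": a sphere-vs-sphere sweep over the locale's
// records, done on the PPE so a quiet locale never gets shipped to an SPU.
bool anyPairMightCollide(const Locale& l)
{
    for (uint32_t i = 0; i < l.count; ++i)
        for (uint32_t j = i + 1; j < l.count; ++j) {
            const float* a = l.records[i].pos;   // xyz + radius in w
            const float* b = l.records[j].pos;
            float dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
            float r  = a[3] + b[3];
            if (dx * dx + dy * dy + dz * dz <= r * r)
                return true;
        }
    return false;
}

void dispatchPhysicsFrame(const std::vector<Locale>& locales)
{
    for (const Locale& l : locales) {
        if (!localeIsDirty(l))      continue;             // step 1: skip sleeping locales
        if (anyPairMightCollide(l)) queueCollisionJob(l); // step 3: needs narrow phase
        else                        queueIntegrateJob(l); // step 2: solver/time-step only
    }
}
```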
I don't see a way to avoid having a global data structure. Even if all SPUs kept a copy that was local to them (whether in local-store SRAM or XDR), an SPU would still need to be able to determine which SPU "owned" a given spatial locale, so that if an object crossed a spatial boundary, the owning SPU could update its local tree. The cost of doing this may exceed that of just having a single tree for everything, with node-level locks on certain subtrees.
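That said, the "who owns this locale" question doesn't necessarily require a shared structure; any fixed mapping that every SPU and the PPE can compute independently will do. A hypothetical sketch (the SPU count and hash constants are arbitrary):

```cpp
// Hypothetical locale ownership: a deterministic mapping from a locale's grid
// cell to one of N SPUs, so any SPU can work out who owns a neighbouring cell
// without consulting a shared tree.
#include <cstdint>

constexpr uint32_t kNumPhysicsSpus = 6;   // assumption: 6 SPUs doing physics

uint32_t owningSpu(uint32_t cellX, uint32_t cellY, uint32_t cellZ)
{
    // Cheap spatial hash; any fixed scheme works as long as every SPU and the
    // PPE agree on it.
    uint32_t h = cellX * 73856093u ^ cellY * 19349663u ^ cellZ * 83492791u;
    return h % kNumPhysicsSpus;
}

// When an object's new position lands in a cell owned by a different SPU, the
// current owner posts a hand-off message (e.g. through a mailbox or a queue in
// XDR) rather than touching the other SPU's local structures directly.
```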
Let's say you were storing hierarchical bounding spheres per object. A sphere is determined by position and radius, let's say XYZW at 32 bits per sphere. The sphere must also have an internal list of subspheres. For the sake of simplicity, let's say we compute subspheres down to a fixed 4 levels deep (e.g. a binary split: 1+2+4+8+16 = 31 spheres) and store them in a fixed-array addressing scheme. That's 31 * 4 ~ 128 bytes per object of hierarchical bounding volumes in the absolute simplest case. Now assume we have 1,000 objects on the screen at once. That would consume 50% of a single SPU's local storage, without even considering geometry or other physics information (velocity, inertia tensor, etc.)
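For what it's worth, the fixed-array addressing I'm picturing looks something like the following; this is just my reading of the numbers above (32 bits per sphere, a binary split, heap-style child indexing), not a claim about any shipping engine:

```cpp
// Hypothetical fixed-array sphere hierarchy: root plus 4 levels of binary
// subspheres = 31 spheres, each quantized to 32 bits relative to the object's
// root sphere, to hit the ~128 bytes per object budget used in the estimate.
#include <cstdint>

struct PackedSphere { uint8_t x, y, z, r; };     // 32 bits per sphere

struct ObjectSphereTree {
    PackedSphere node[31];                       // 1+2+4+8+16 nodes, 124 bytes
};

static_assert(sizeof(ObjectSphereTree) <= 128, "fits the ~128-byte budget");

// Implicit (pointer-free) addressing: children of node i sit at 2i+1 and 2i+2,
// so the whole tree streams in and out without any gather.
inline int leftChild(int i)  { return 2 * i + 1; }
inline int rightChild(int i) { return 2 * i + 2; }
```

1,000 of these is ~124 KB, which is where the 50%-of-a-256-KB-local-store figure comes from.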
If we consider 1,000 army men, each with a character model that has 10,000 vertices, you're talking 10M vertices. Even if we just use geometry instancing instead, you still need to store position, bone positions, and physics for those 1,000 actors.
Then, if those 1,000 guys were all running at you, you'd need to update 1,000 hierarchical bounding sphere trees, 1,000 sets of bones, and 1,000 sets of physical properties (more, considering that the limbs each have constraint forces).
On top of that, when all is said and done, it has to be written back to XDR (so that the SPU can work on other data), and it must be written in a format which is stream-friendly to write and stream-friendly to read; otherwise you'll be performing scatter writes with DMA.
One possibility, if the PPE is good at scatter writes (I profess ignorance), is that the SPUs stream out a work list of memory locations to update plus the data. The PPE can then batch these into memory-write-friendly bunches and execute the scatter updates.
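Something like this is what I mean by a work list; the names and payload size are made up, but the point is that the SPU side is one linear DMA out, and sorting by destination lets the PPE's writes walk memory mostly sequentially:

```cpp
// Hypothetical write-back work list: an SPU appends fixed-size
// (destination, payload) entries to a sequential buffer instead of issuing
// its own scattered DMAs; the PPE later applies them in address order.
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

struct alignas(16) WriteBackEntry {
    uint64_t destAddr;        // effective address in XDR to patch
    uint32_t size;            // payload bytes actually used (<= 48)
    uint32_t pad;
    uint8_t  payload[48];     // new object state: position, velocity, ...
};

// PPE side: sort by destination so the writes walk memory mostly
// sequentially, then apply them in one pass.
void applyWriteBacks(std::vector<WriteBackEntry>& entries)
{
    std::sort(entries.begin(), entries.end(),
              [](const WriteBackEntry& a, const WriteBackEntry& b) {
                  return a.destAddr < b.destAddr;
              });
    for (const WriteBackEntry& e : entries)
        std::memcpy(reinterpret_cast<void*>(static_cast<uintptr_t>(e.destAddr)),
                    e.payload, e.size);
}
```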
The problem with the SPUs isn't that physics can't be reasonably localized. The problem is that the paucity of their local memory dictates that they must read from and write to external memory, and they are very bad at doing so and at absorbing the latency.