Future console CPUs: will they go back to OoOE, and other questions.

DemoCoder - I do agree with the principles you're outlining, although I tend to think they are more relevant to AI than physics. Example: it's perfectly fine for an AI driver to have been unaware of the truck he just ploughed into, but not so fine for the physics simulation to miss the collision entirely or pick up on it when the driver's car is in the middle of the truck. Anyway, assuming that these two entities are being updated asynchronously, how would a collision between truck and car be handled when the two entities don't even agree on what time it is?

I may be missing something, but it seems to me there are cases where a physical simulation should have perfectly accurate and "correct" local information.
 
The way tasks are scheduled and the way memory locality is organised are very different.

All I'm saying is that supercomputer type clusters that share no memory are good for parallel tasks that share nothing.

If you have tasks that need to share a lot of state, SMP wins. Any shared-none cluster will lose too much performance because of its interconnect.

Your statement "SMP servers with a large number of nodes aren't really used to run parallel applications, they merely run many independent processes at the same time." is just plain wrong, simply because multiple independent processes do not take advantage of what SMP is good at. In fact, if you truly have "many independent processes", just distribute them onto a cluster, and you don't need an SMP.

SMP machines are good at single process, multiple threads with lots of shared state.

Applications exist for each model that play to its strengths. That's why both models exist.
 
In a simulated system, all information is by default completely visible and completely correct in so far as the simulation is capable of representing it.

I disagree. In a simulated system, all that matters are the observers. If a tree falls in a forest and no one is around, it didn't happen. Likewise, events occurring offscreen need not be computed at the same rate, or even at all, unless there is an underlying gameplay mechanic driving them (e.g. an AI monster knocking down trees on the horizon as he comes towards the city).


You are not aware of the state of the food, but the state still exists, and by virtue of whatever this thing called reality is, one second to you is approximately one second to the food that may or may not be cooking too long.

Irrelevant. In a simulation, the offscreen cooking food, for say, 30 minutes, need not be simulated for every time-step between 0 and 30 minutes. If I go down to check the food 30 minutes later, the simulation could replace the uncooked food with a 30-minute-cooked representation without calculating all of the intervening steps. In the vast majority of cases, I would be none the wiser, as a reasonable facsimile of the act of cooking would be good enough. Human beings are not good at telling the difference between a pot roast that has been cooked for 28 minutes and one cooked for 28.5 minutes.

As a simple example of this principle in practice, look at HashLife. Cellular automata are a classic, textbook case of a system in which the state at step N may depend on the previous history of all cells in the grid going back to step 0. So in theory, developing an analytic solution in terms of f(N) may be impossible; you just have to execute all intervening simulation steps -- or do you? It turns out, you don't. You run different parts of the simulation at different speeds, sometimes not even computing some sections at all, in a lazy "call by need" fashion.

Obviously, Life is a simplistic example, but there are models of physics based on cellular automata, and moreover, there is no reason why games have to run physically accurate simulations down to the object level. As supercomputer N-body solvers have shown, one can often work with macroscopic properties, just as one can solve some lighting problems with FFTs or SH.


In a simulated system, if there is no implicit or explicit timekeeping, the entity that represents you could go through dozens of cycles, while for some reason the food object has not. If nothing keeps track of that, you could wait an hour and come back to an ice-cold pot.

The point of my original post is not that "nothing keeps track", just that one does not need to keep track of all microscopic states. In the case of a pot on the stove, one doesn't need a complete simulation of pot and fire and meat, and every particle, and the interaction with fire and thermal conductance. One only needs to remember when the flame was turned on.
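
To make that concrete, here is a minimal sketch (Python, with made-up names and doneness thresholds) of the "remember when the flame was turned on" idea: the pot is never ticked, its state is a pure function of elapsed simulation time, evaluated only when an observer asks.

class Stove:
    """Lazily evaluated pot: no per-tick simulation, just a timestamp."""
    def __init__(self):
        self.flame_on_at = None          # simulation time when the burner was lit

    def light(self, sim_time):
        self.flame_on_at = sim_time

    def doneness(self, sim_time):
        """Compute the cooked state only when an observer looks at the pot."""
        if self.flame_on_at is None:
            return "raw"
        minutes = (sim_time - self.flame_on_at) / 60.0
        if minutes < 10:
            return "rare"
        if minutes < 30:
            return "medium"
        return "done"

# The pot is never updated per frame; we only query it when the player returns.
stove = Stove()
stove.light(sim_time=0.0)
print(stove.doneness(sim_time=30 * 60.0))    # -> "done"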

How this applies to physics is simple, as I have repeated over and over. The complete game physics simulation does not need to be run on the entire reachable world at each step. The individual sway of leaves on trees far from the observer is irrelevant, and all that needs to be simulated is a macroscopic procedural reaction to high-level climate variables.

Only *salient* details need micro-simulation, and then if and only if the intermediate states have far-reaching propagation to the entire world. In fact, the very way human beings detect "wrong" simulations or stuff that seems "not real" is based on mental models of the real world, which are high-level and inaccurate representations that do not depend on integration of fundamental physics over time, but on emergent macroscopic properties modeled by the mental map. Nobody can run an accurate and consistent physics simulation in their head to conclude "hey, this stuff is not in the right place according to my highly accurate mental simulation of particle physics". We reject "unreal" simulations based on disagreement with very coarse-grained and high-level predicted signatures of real-world phenomena.


At some point it suffices to draw artificial barriers in the game world and say that object X can't influence object Y, be it across a level boundary or an artificial space-division scheme. In most cases, it won't even be perceptible to the user.

No algorithm works fine if the very bases of time and space can flit around due to factors not present in the simulated world.

Actually, there are plenty of algorithms that work even if the underlying assumptions are changed. For example, gradient potential pathfinding works, even if the gradient is non-linearly modified at certain steps.
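
As an illustration only (a toy Python sketch with a made-up potential field), gradient pathfinding just steps downhill toward the goal, and it still converges even if values of the field are perturbed between steps:

def downhill_step(field, pos):
    """Move to the neighbouring cell with the lowest potential."""
    x, y = pos
    candidates = [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                  if 0 <= x + dx < len(field) and 0 <= y + dy < len(field[0])]
    return min(candidates, key=lambda p: field[p[0]][p[1]])

# Potential field: Manhattan distance to a goal at (3, 3).
field = [[abs(i - 3) + abs(j - 3) for j in range(4)] for i in range(4)]
pos = (0, 0)
while field[pos[0]][pos[1]] > 0:
    pos = downhill_step(field, pos)
print(pos)    # (3, 3)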

Collision detection doesn't work if one of two objects that should have hit each other is three update cycles behind.

All known collision detection algorithms are approximations and use hierarchical bounding volumes; many use space partitioning, and some even use temporal prediction. Sure, the way you phrased it, you'd have a problem. Let me rephrase it: "collision detection still works even if one of two objects being considered is N update cycles behind in the simulation, as long as that object's position can be predicted not to intersect the spatial partition that the other object is contained in."

In the real world, I can make such predictions just using light cones. I don't need to simulate anything outside my light cone. In the game world, one can make reasonable assumptions about the velocity of objects as well as the lifetime (many objects are effects which disappear) and use this high level information to quickly reject such collisions.
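
A sketch of that rejection test (Python, with hypothetical numbers): if an object's data is N ticks stale, pad its bounds by the farthest it could possibly have moved since its last update; if the padded bounds still don't touch, the collision can be rejected without waiting for the update.

import math

def may_collide(pos_a, r_a, stale_pos_b, r_b, max_speed_b, ticks_stale, dt):
    """Conservative broad-phase test against stale data: inflate B's bounding
    sphere by the farthest it could have travelled since its last update."""
    slack = max_speed_b * ticks_stale * dt
    return math.dist(pos_a, stale_pos_b) <= r_a + r_b + slack

# The pot roast: zero velocity, so even very stale data safely rejects it.
print(may_collide((0.0, 0.0, 0.0), 0.5, (10.0, 0.0, 3.0), 0.3,
                  max_speed_b=0.0, ticks_stale=300, dt=1 / 60))   # False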

I don't have to worry about my pot roast colliding with me, because its velocity is zero, and it is unlikely to fly out of the pot, up the stairway, open my door, and collide with my face.

If you consider a game like Alan Wake, do you really expect such games, even with substantially faster serial CPU performance, to be capable of simulating every nook and cranny, every leaf, every blade of grass, every water droplet in every cloud, at every time step? And not only that, but to consider the possibility that every object can have an effect on every other, be it potential collisions, or electromagnetism or gravity? It's preposterous.

Future games are going to run massive game-world simulations by using a hierarchy of representations and partition schemes; they are simply not going to simulate every object in the world geometry database at every time-step. These simulations will be eminently parallelizable. Sure, you won't be able to stack up 5 miles of dominoes in a straight line, knock them over, and then stand 5 miles away at the end waiting for the last one to fall. But these are *games* we are talking about, not "The Matrix", so some restriction on freedom and simulation is warranted.



Thus, the brouhaha over "serial" physics and AI is one of wrong assumptions and lack of imagination.
 
I don't think you can reject as much information as you seem to be implying, Demo. (Maybe I'm reading too much into what you are saying.)

I'll use the example of a building exploding into multiple pieces as an illustration. When the structure breaks up each piece is given a velocity. These will follow largely parabolic paths. The largest (most massive) pieces will likely reach their destinations (in this regard I think you're right) irrespective of collisions. When equal sized pieces collide, however, their new velocities will have to be resolved. Perhaps you could set up a hierarchy in which chunks of size A will reach their destination and their trajectories can be computed independently. Then chunks of size B will reach their destination, so long as they don't collide with size A. So size B calculations can run independently, so long as they never surpass those of size A in time.

I can't imagine how much storage would be required to save the state of each particle. If it happens that chunk B.004's path intersects chunk A.005's path at t = 0.492s, you'd have to go back to that point in chunk A.005's trajectory to do calculations. I believe the reduced complexity of computation would drive up the required storage space astronomically.

Playing around with finite element programs, the matrices become huge when time is introduced. If you can keep everything in the same time step and calculate the next, you only have to store one state at a time. If not, you'd need to store states from the last unresolved one up to as far as the furthest object in the future.

In regards to your food cooking example, you'd still need to keep an up-to-date state on it. The food will create an aroma that you may or may not smell. If you do, it may influence you to go check on it early. It would seem more likely that the food cooking would just take larger time steps in the state calculations, rather than being calculated quickly and being done with (seeing as you'll be influencing it before it finishes, by turning it off once you smell it and go check on it).
 
I disagree. In a simulated system, all that matters are the observers. If a tree falls in a forest and no one is around, it didn't happen. Likewise, events occurring offscreen need not be computed at the same rate, or even at all, unless there is an underlying gameplay mechanic driving them (e.g. an AI monster knocking down trees on the horizon as he comes towards the city).
AI observers are part of the simulation. If the tree actually goes through the action of falling in the forest within the simulation, then its fall has been calculated and its state stored perfectly within the system's memory. It is up to the simulation to define something as imperfectly perceivable.

As simulated entities that exist on a data level in the same way as the tree and its history, AI agents have the full ability to apprehend every value associated with that tree, unless they have been programmed to not use that source value.

Irrelevant. In a simulation, the offscreen cooking food, for say, 30 minutes, need not be simulated for every time-step between 0 and 30 minutes. If I go down to check the food 30 minutes later, the simulation could replace the uncooked food with a 30-minute-cooked representation without calculating all of the intervening steps. In the vast majority of cases, I would be none the wiser, as a reasonable facsimile of the act of cooking would be good enough. Human beings are not good at telling the difference between a pot roast that has been cooked for 28 minutes and one cooked for 28.5 minutes.
It need not be simulated fully, but it still must appear to the user that it did go through what it should have when the user returns. A stupid example is how Thief 3 handled transitioning out of areas (I don't know if it still does this, I don't have it at the moment). If guards shot arrows at you just as you went through a door that took you somewhere else, those in-flight arrows would not be gone and would still be flying if you came back later.

While you weren't in the area when the arrows were shot, you would expect them not to be waiting for you in mid-air when you returned from your sneaking. Something has to keep track of that, and unless part of the simulation explicitly does so, even if it's just deleting the entities, the consistency of the world is compromised.

As a simple example of this principle in practice, look at HashLife. Cellular automata are a classic, textbook case of a system in which the state at step N may depend on the previous history of all cells in the grid going back to step 0. So in theory, developing an analytic solution in terms of f(N) may be impossible; you just have to execute all intervening simulation steps -- or do you? It turns out, you don't. You run different parts of the simulation at different speeds, sometimes not even computing some sections at all, in a lazy "call by need" fashion.
As long as an interaction with step 5 of a given automaton properly derives the values of step 5, it doesn't matter to me. That's a case of a program explicitly keeping consistency with finite time steps.
My concern was the idea of running everything purely asynchronously in the software implementation of the simulation. In such a case, even when the simulation thinks it is accessing step 5, it may not be accessing or calculating step 5.

Obviously, Life is a simplistic example, but there are models of physics based on cellular automata, and moreover, there is no reason why games have to run physically accurate simulations down to the object level. As supercomputer N-body solvers have shown, one can often work with macroscopic properties, just as one can solve some lighting problems with FFTs or SH.
Precision isn't my concern, I stated earlier that approximations can work fine as long as the error bounds are kept within reason. My concern is removing consistency at a software level. In such a case, even if two objects were called up as needed, they would not be guaranteed to behave as the physical rules of the simulation are defined.

At the low level of implementation, there is always something keeping track. Whether it's a central control loop, some kind of object history/rule set, or another method is of no concern to me. I would note, however, that in the case of active user interaction, it is helpful to run the simulation at some multiple of the rate that the IO buffers are polled.

As long as the given state that coincides with second ten for object A doesn't interact with the state that coincides with second two for object B, it doesn't matter how they arrived at their states. (Unless the state has been defined to stretch across multiple seconds, in which case, it's fine.)

One only needs to remember when the flame was turned on.
You also need to know what time step the observer is in. The state function doesn't work if you don't know how long it's been for you. Simulation time, unless tracked, has no meaning to the software running the simulation. How else is the system to know you (or perhaps a simulated person who was no longer being actively simulated, then reactivated) left the iron on a second ago or an hour ago?

Only *salient* details need micro-simulation, and then if and only if the intermediate states have far-reaching propagation to the entire world. In fact, the very way human beings detect "wrong" simulations or stuff that seems "not real" is based on mental models of the real world, which are high-level and inaccurate representations that do not depend on integration of fundamental physics over time, but on emergent macroscopic properties modeled by the mental map. Nobody can run an accurate and consistent physics simulation in their head to conclude "hey, this stuff is not in the right place according to my highly accurate mental simulation of particle physics". We reject "unreal" simulations based on disagreement with very coarse-grained and high-level predicted signatures of real-world phenomena.
My concern is in the way the implementation is run, not the algorithms above it. If object 1 for whatever reason takes twice as long to process asynchronously as object 2, there must be something that keeps the real-world time differential from screwing with the in-game interactions. It doesn't matter how precise it is, it just has to be consistent.

At some point it suffices to draw artificial barriers in the game world and say that object X can't influence object Y, be it across a level boundary or an artificial space-division scheme. In most cases, it won't even be perceptible to the user.
I'm not against approximation, I'm against that dirty system taking liberties with my sweet, sweet algorithms.

Actually, there are plenty of algorithms that work even if the underlying assumptions are changed. For example, gradient potential pathfinding works, even if the gradient is non-linearly modified at certain steps.
Gradient pathfinding assumes when you move an approximation step that you actually move a step. If the underlying code isn't consistent, you can't even assume that. You can't even assume that the world you're in actually has a path to find.

And there are algorithms that don't. A* will most likely fail if you change the terrain around after it starts.


All known collision detection algorithms are approximations and use hierarchical bounding volumes; many use space partitioning, and some even use temporal prediction. Sure, the way you phrased it, you'd have a problem. Let me rephrase it: "collision detection still works even if one of two objects being considered is N update cycles behind in the simulation, as long as that object's position can be predicted not to intersect the spatial partition that the other object is contained in."
If you know it won't matter, fine. I'm talking about when it does matter, in those cases the system needs to work as it is defined. Purely independent processes will not behave as they should.

In the real world, I can make such predictions just using light cones.
That depends if your world is a room full of bouncing superballs and how long the time slices are in the simulation.
Regardless, my concern is that the system is consistent, not anal.

If you consider a game like Alan Wake, do you really expect such games, even with substantially faster serial CPU performance, to be capable of simulating every nook and cranny, every leaf, every blade of grass, every water droplet in every cloud, at every time step?
No, but I expect that the AI for the baddie with a machine gun not outpace the render and be able to shoot me while all I can see is an intervening brick wall. I was addressing the idea of completely removing synchronization, not handwaving minutae.

Thus, the brouhaha over "serial" physics and AI is one of wrong assumptions and lack of imagination.
If that were ever my concern, I would have said that physics and AI were serial even though in most cases they are not.
I'm saying things like memory accesses to the same structures should be ordered and that one moment follows another, etc.
 
I'll use the example of a building exploding into multiple pieces as an illustration. When the structure breaks up each piece is given a velocity. ...

I would counter that, except in cases of high-speed photography and slow-motion cameras, human beings perceive explosions as a macroscopic blur, like persistence of vision, but I'd say "persistence of semantics" in that all the perceptions that are occurring simultaneously and over such a short period: visuals, sound, et al, blend together in a mostly novel high-level experience.

Thus, I would argue that explosions from bombs on structures can, in the immediate period in which they occur, be modeled by collisionless particle systems obeying some macroscopic variables and effect-based debris. If you want, slap in some large chunks, randomly chosen as debris, which will get full physics treatment. Afterwards, you have the situation of the structure either being statically stable, or unstable and in the process of collapse.



In regards to your food cooking example, you'd still need to keep an up-to-date state on it. The food will create an aroma that you may or may not smell. If you do, it may influence you to go check on it early. It would seem more likely that the food cooking would just take larger time steps in the state calculations, rather than being calculated quickly and being done with (seeing as you'll be influencing it before it finishes, by turning it off once you smell it and go check on it).

All of this behavior can be handled by high level scripts. I did such behavior 15 years ago in online text muds with programmability. Macroscopic concepts like "cooking" something are never going to be handled by simulating physics and chemistry of cooking. We can't even do this TODAY with petaflops of computing power. Just figuring out the exact process by which proteins denature is hard.
 
AI observers are part of the simulation. If the tree actually goes through the action of falling in the forest within the simulation, then its fall has been calculated and its state stored perfectly within the system's memory. It is up to the simulation to define something as imperfectly perceivable.

Of course, and I already gave the example of a monster knocking down trees in the forest. But AI observers need not have a completely accurate and rendered view of the local geometry in their area. I mean, come on, if a tree falls on the horizon, and an AI actor is coming towards me, should he have to execute a "jump" over the tree? I won't see it, and such a waste of resources adds nothing to the gameplay or the believability of the unseen AI actor. The AI actor should work with a gross representation of the world when it is pathfinding and too far away for small-scale obstacles to matter.

As simulated entities that exist on a data level in the same way as the tree and its history, AI agents have the full ability to apprehend every value associated with that tree, unless they have been programmed to not use that source value.

I really don't understand your argument. You're trying to claim that AI in games should be written so that the AI will comprehend each and every property of everything in the world, from tree leaf color and the position of every leaf, to semantic concepts like whether the tree was "big" or not? I mean, we are talking about CPU design in this forum, and as such, we are talking about the near term, what's achievable, not blue-sky "if we had a positronic CPU and Dr Noonian Soong, then..."

I am arguing what's practical and realistic to implement while still maintaining a reasonable air of realism. You are countering that I am wrong because it won't work with some hypothetical and amazing AI that needs to perceive every last bit of foliage state.


As long as an interaction with step 5 of a given automaton properly derives the values of step 5, it doesn't matter to me. That's a case of a program explicitly keeping consistency with finite time steps.

HashLife only derives the local values it needs. You can ask for the value of cell @ x,y at time t. It may only calculate that value, and may not calculate the value of any other cell at time t.

My concern was the idea of running everything purely asynchronously in the software implementation of the simulation. In such a case, even when the simulation thinks it is accessing step 5, it may not be accessing or calculating step 5.

If you're talking about physics absent EM and gravity, then the only relevance of stale data is collision detection. And this is only relevant if you can't rule out a collision. But it is unlikely you're going to be running the collision detection before all threads which are updating salient objects have finished first (e.g. the constraint solving and integration steps). As for rendering, people play online multiplayer FPSes with inaccurate and delayed positions and firing cues to no ill effect, as long as the inaccuracy is only a few frames off. So I see no reason that both player and AI need to have frame-accurate consistency of information.


Gradient pathfinding assumes when you move an approximation step that you actually move a step. If the underlying code isn't consistent, you can't even assume that. You can't even assume that the world you're in actually has a path to find.

No, gradient pathfinding makes no such assumption. In real world biological gradient pathfinding, the actual movement is stochastic. You may take a step, take a step backwards, or take the wrong step. All that matters is that the overall sum total of all movements adds up to progression along the gradient. Biology is not deterministic, but stochastic and error prone. Computer scientists are used to programming under assumptions of determinism and correct functioning of their computations, but that is not the *only* way to compute.


No, but I expect that the AI for the baddie with a machine gun not outpace the render and be able to shoot me while all I can see is an intervening brick wall. I was addressing the idea of completely removing synchronization, not handwaving minutae.

It's far more often the case that AI acts like this because it is given complete omniscience about the world. Sure, there are cases you don't want to occur, but in my experience, these problems are not caused by synchronization issues but by AIs knowing the player's position and being able to see without occlusion. Sure, if your screen is updating at 1hz and the AI is running at 200hz, he may run rings around you and kill you before you see one update.

But ensuring that doesn't happen does not imply the whole world's simulation must be run in lock-step, or that everything must be in a consistent state prior to render.

I'm saying things like memory accesses to the same structures should be ordered and that one moment follows another, etc.

This is not necessarily a requirement of parallel algorithms. Some algorithms work fine with concurrent read and write of stale data. Whether or not serialized access is needed depends on the algorithm. Look up CRCW algorithms and models. Also, saying one moment follows another needs clarification. As I gave in the HashLife example, one moment does not necessarily follow another in strict monotonically increasing step-at-a-time order. In fact, HashLife has profound philosophical ramifications for the nature of time and perceptions of our own universe if you mull it over. HashLife simulates incredibly large universes; in fact, 20 years ago, they were simulating 1-billion-by-1-billion cell grids on ancient computers. In any system with "local" physics where effects propagate at a maximum speed (speed of light), a HashLife-like algorithm can calculate the correct state of a subregion of space at an arbitrary time in the future, even if the universe being simulated is huge. Moreover, the more time steps, the faster it works, leading to the fact that one can, in fact, skip over steps without even calculating them, based on recombinations of memoized partial results.

All that is required is a proper hash function and partition algorithm. Given a big enough lever, one can move mountains, given a good enough hash function, one can simulate the world.
 
I completely agree with Demo.

To clarify it a bit more: when you calculate states asynchronously, of course you do take into account when they happen. You have a start time, as well as a (probable) end time. Every time you calculate that state, you record how long it took. The next time, you look at the time when you finished the last update, and assume that it will take as long as it did the last time. That's the step you have to calculate.

As for determining what (object or AI) state to calculate next, you can start by looking for the one that is oldest. And you give the ones closer to the player a higher priority. So, you calculate them very often and with small steps, while you calculate the ones further away (and/or off-screen) less often. And the ones furthest away and off-screen are calculated very infrequently.

And while doing so you store the movement vector(s) and maximum size in a table (until the next update), and look for possible collisions. When you find one, you give both objects high priority and reprocess them in detail.


Edit: and of course, you try to group objects further away, and apply less rules and/or less detailed rules.
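
A rough sketch of that scheduling policy (Python, with invented object records): pick the states whose last update is oldest, weighted by distance to the player, and spend only the per-frame budget on the winners.

import heapq

def urgency(obj, now, player_pos):
    """Older state and proximity to the player both raise urgency."""
    age = now - obj["last_update"]
    dist = abs(obj["pos"] - player_pos)
    return -(age / (1.0 + dist))           # heapq pops the smallest key first

def update_some(objects, now, player_pos, budget):
    heap = [(urgency(o, now, player_pos), i) for i, o in enumerate(objects)]
    heapq.heapify(heap)
    for _ in range(min(budget, len(heap))):
        _, i = heapq.heappop(heap)
        objects[i]["last_update"] = now    # stand-in for the real state update

objects = [{"pos": 1.0, "last_update": 0.0},     # near the player
           {"pos": 500.0, "last_update": 0.0}]   # far away, updated rarely
update_some(objects, now=10.0, player_pos=0.0, budget=1)
print([o["last_update"] for o in objects])       # [10.0, 0.0]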
 
All I'm saying is that supercomputer type clusters that share no memory are good for parallel tasks that share nothing.

If you have tasks that need to share a lot of state, SMP wins. Any shared-none cluster will lose too much performance because of its interconnect.

Your statement "SMP servers with a large number of nodes aren't really used to run parallel applications, they merely run many independent processes at the same time." is just plain wrong, simply because multiple independent processes do not take advantage of what SMP is good at. In fact, if you truly have "many independent processes", just distribute them onto a cluster, and you don't need an SMP.

SMP machines are good at single process, multiple threads with lots of shared state.

Applications exist for each model that play to its strengths. That's why both models exist.

It is the other way round, surely. SMP machines are poor at single process, multiple threads with lots of shared state. Why? Because the data you are sharing is in different temporal states. The fact that data is physically located in different compute nodes is not a big problem - you have the means to access it over the network, ring-bus or whatever, without much difficulty. In fact, with Linux/Unix, everything is completely network enabled - for example, devices are accessible as files, the display system, X-Windows, is built on a network server-client model, and in the Linux/Unix programming model Inter Process Communication is based on streaming data through pipes rather than through shared memory. This makes streaming everything over the network as easy as doing it on a local node, and compatible with pretty well every piece of Unix/Linux code. In fact, in Linux processes are almost as lightweight as threads, and therefore threads aren't used very often, except for porting non-Linux applications that have been written to use shared memory, e.g. the Java Virtual Machine. So for a supercomputer programmer working in Linux, not having the ability to share local memory over the whole system, like an SMP machine, won't even be missed.

On the other hand, the data running on different SMP cores not sharing the temporal state is a real big problem, because although there are program interfaces you can use to access data on other nodes, there is no program interface that can go back or forward in time to fetch the data you want if processes are out of sync. All you can do is for the faster thread to stop and wait for the others to catch up - very inefficient and difficult to manage if you have several hundred threads. What SMP is good for is running many independent tasks that share little and don't interact much - like what you have in a file, web or database server.

With the OS-scheduled SMP multi-threading model, threads are impossible to keep in sync without a lot of complex wait states, because any other thread that timeshares on a core can slow it down. This is a problem if there is any kind of component of the task that has to be performed in series - for example, if you have to complete adding two matrices together before multiplying the result by a third matrix. That is the stark, simple reason SMP is not used in supercomputing.
 
P.S. By temporal, I mean what state the process is in, in terms of how far along the chain of execution it has got, rather than wall-clock time.
 
the Linux/Unix programming model Inter Process Communication is based on streaming data through pipes rather than through shared memory.

Please describe how you think a Unix pipe is implemented. Then please compare and contrast the performance vs shared memory.
 
Please describe how you think a Unix pipe is implemented. Then please compare and contrast the performance vs shared memory.

Via a pointer to memory.

Also, accessing or transferring data does not normally demand as much performance as the number-crunching part carried out within the parallel processing on the individual compute nodes.
 
Via a pointer to memory.

Please elaborate. A pointer to what memory? Where? Who allocated it? How is it managed? How is access to it synchronized?

Now compare and contrast the performance you can get with IPC via pipes against what you can get with shared memory.
 
Some Unix pipe implementations use shared memory as the underlying implementation; some don't.

As for how to deal with shared memory over slow interconnects, look up how they solved the problem for N-body solutions using Octrees, hash algorithms, and clever parallel orchestration for sharing only data which is essential to each node's computation.

For example, http://www.cs.berkeley.edu/~yelick/cs267/lectures/23/lect23-nbody.ppt gives a high-level overview. The paper A Parallel Hashed Oct Tree Algorithm shows you how to deal with latency, memory hierarchy, locality, and access issues. Running their algorithm on a computing cluster, the overhead of communication adds a mere 2.5% to the total running time of the algorithm.

In their conclusion:

"The raw computing speed of the code on an extremely
irregular, dynamically changing set of particles which
require global data for their update, using a large num-
ber of processors (512), is comparable with the perfor-
mance quoted for much more regular static problems,
which are sometimes identified as the only type of
“scalable” algorithms which obtain good performance
on parallel machines. We hope we have convinced
the reader that even difficult irregular problems are
amenable to parallel computation."

To this I add that Google runs algorithms on billion-node graphs distributed across a clustered filesystem (GFS) using a simple distributed hashtable abstraction (map input to slots + reduce all data collided on the same slot), and they do this in near real time on a very large cluster of commodity machines.
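
In miniature (a Python sketch, nothing Google-specific), that abstraction is just: map every record to a slot, then reduce everything that collided on the same slot.

from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Map each record to (slot, value) pairs; reduce all values per slot."""
    slots = defaultdict(list)
    for record in records:
        for slot, value in map_fn(record):
            slots[slot].append(value)
    return {slot: reduce_fn(values) for slot, values in slots.items()}

# Toy example: out-degree of each node in a tiny graph.
edges = [("a", "b"), ("a", "c"), ("b", "c")]
print(map_reduce(edges, lambda e: [(e[0], 1)], sum))    # {'a': 2, 'b': 1}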

As I continue to say, the problem of dealing with shared data is related to the abstractions one uses to design one's algorithms. If your abstraction is pointers, you are limited to one mode of thinking. If your abstraction is hash addresses, you are limited to a different mode of thinking.

The O(N^2) estimates for physics simulations of N objects that keep getting trotted out assume hopelessly naive implementations; people have designed large-scale physics simulations across very slow distributed clusters which are neither network-latency bound nor scale as N^2.
 
Of course, and I already gave the example of a monster knocking down trees in the forest. But AI observers need not have a completely accurate and rendered view of the local geometry in their area. I mean, come on, if a tree falls on the horizon, and an AI actor is coming towards me, should he have to execute a "jump" over the tree? I won't see it, and such a waste of resources adds nothing to the gameplay or the believability of the unseen AI actor. The AI actor should work with a gross representation of the world when it is pathfinding and too far away for small-scale obstacles to matter.
We're crossing arguments here. I agree with the idea that AI agents don't require perfect or wide-ranging observability.
I was just commenting that the proper way to implement imperfect perception is to filter how data is read by the agent, not to have the actual data state be inconsistent. If a person misreads a map and thinks Albuquerque is on the next left, it doesn't make the city fly off into space, as much as some would like it to.

I really don't understand your argument. You're trying to claim that AI in games should be written so that the AI will comprehend each and every property of everything in the world, from tree leaf color and the position of every leaf, to semantic concepts like whether the tree was "big" or not? I mean, we are talking about CPU design in this forum, and as such, we are talking about the near term, what's achievable, not blue-sky "if we had a positronic CPU and Dr Noonian Soong, then..."

I am arguing what's practical and realistic to implement while still maintaining a reasonable air of realism. You are countering that I am wrong because it won't work with some hypothetical and amazing AI that needs to perceive every last bit of foliage state.
I'm not arguing that AI should have perfect perception. I am stating that in a digital simulation, they will have perfect perception unless the simulation intervenes. Unless the designer can manage an explicitly indeterminate data value in a register or memory location, there is no question as to what a read to that location will produce.

It's the difference between thinking crazy and the world actually going crazy.

HashLife only derives the local values it needs. You can ask for the value of cell @ x,y at time t. It may only calculate that value, and may not calculate the value of any other cell at time t.
The local values are the neighboring cells. The cellular automaton simulation is based on a very simple interaction that is evaluated for every generation. This implementation is just a lot smarter about not repeating redundant work. The data is fully consistent in that no calculation for a given generation pulls data from the results of another generation.

It looks like a kind of pipelining of cell generations, paired with the memoization and canonization. I can't speak with true certainty because I haven't worked through the last code listing entirely yet. I'll try to fully comprehend it and correct myself if I have parts of it backwards.

If you're talking about physics absent EM and gravity, then the only relevance of stale data is collision detection. And this is only relevant if you can't rule out a collision. But it is unlikely you're going to be running the collision detection before all threads which are updating salient objects have finished first (e.g. the constraint solving and integration steps).
I also mean things like decision making, AI communication, and player interaction. If there isn't something that maintains rough consistency, all of these start to mess up. Whenever something perturbs the simulation away from business as usual, it helps if the important parts don't continue as if it were business as usual.

It's not usually fatal to a game if the goofs are minor, and in most cases nobody so sloppily designs a game that the various parts are allowed to be out of phase with IO.

My original point was that some form of synchronization or coordination must be implicitly or explicitly maintained. If there is no synchronization at all, you cannot assume anything about what will be finished before what. You can't have a precondition if there is no boundary to define what is the common before and after.

As for rendering, people play online multiplayer FPSes with inaccurate and delayed positions and firing cues to no ill effect, as long as the inaccuracy is only a few frames off. So I see no reason that both player and AI need to have frame-accurate consistency of information.
If I get lagged as a matter of routine on a local machine running single-player, I'm going back to playing minesweeper. People log off if the lag is too horrible.

Even net-based play has rough boundaries on how far one portion of the engine can outstrip the other. It's more relaxed than maintaining frame by frame accuracy, but it's not like the renderer or physics portions will over time accumulate such a disparity that one will be minutes ahead of the other in the case of a complete lack of synchronization.

Even if laggy, the game is usually smart enough not to create the impression it has data more recent than the last few frames displayed, because it will simply redisplay the same buffer over and over until the rest of the engine has caught up. Besides cosmetic effects, it's no good to render too far ahead in a dynamic simulation anyway.

No, gradient pathfinding makes no such assumption. In real world biological gradient pathfinding, the actual movement is stochastic. You may take a step, take a step backwards, or take the wrong step. All that matters is that the overall sum total of all movements adds up to progression along the gradient. Biology is not deterministic, but stochastic and error prone. Computer scientists are used to programming under assumptions of determinism and correct functioning of their computations, but that is not the *only* way to compute.
I had the wrong objection, you are correct that as long as the approximation process is given enough time, it will reach the goal it was given.

The only question is, if the pathfinding algorithm is told by stale and incorrect data to go to point C, will it still go to point B like I wanted?

Can I have the secret to "garbage in/puppies,kittens, and unicorns out"?

It's far more often the case that AI acts like this because it is given complete omniscience about the world. Sure, there are cases you don't want to occur, but in my experience, these problems are not caused by synchronization issues but by AIs knowing the player's position and being able to see without occlusion. Sure, if your screen is updating at 1hz and the AI is running at 200hz, he may run rings around you and kill you before you see one update.
The case I outlined doesn't happen often because nobody throws out synchronization away completely. AIs don't update irrespective of the rest of the game engine. Since their phases are kept in synch with the rest of the game, they don't get the chance to overshoot the current visible state.

But ensuring that doesn't happen does not imply the whole world's simulation must be run in lock-step, or that everything must be in a consistent state prior to render.
I never said things needed to be in lockstep. I said things shouldn't be allowed to go so far apart that it leads to an end result that should not be allowed in the rules of the simulation.
I don't mind minor inconsistencies if I do not see them all that often.
Rendering inconsistencies are usually only a frame or two in duration, so I don't mind a glitch in one frame in one level.

Beyond that, especially with mismanaged AI or physics, the state of the game-world becomes populated with bad data, and it ceases to be a simulation and becomes a garbage producer.

This is not necessarily a requirement of parallel algorithms. Some algorithms work fine with concurrent read and write of stale data. Whether or not serialized access is needed depends on the algorithm.
I was overly broad in my demand for order. If an algorithm has no issue with indeterminate write order, I have no issue. It's when the algorithm needs it and it isn't delivered that I worry.

Also, saying one moment follows another needs clarification.
As I gave in the HashLife example, one moment does not necessarily follow another in strict monotonically increasing step-at-a-time order.
Yes it does. The rules of the life simulation state that the calculations are applied instantaneously with respect to a given cell generation. That means that the state of a given generation only affects the calculations of the generation that immediately follows it.

There is time, the generations do not interfere with each other's calculations. It's just not our time.

If HashLife didn't do that, it would be wrong, not efficient.

Correctness is after all the largest obstacle to performance.

In fact, HashLife has profound philosophical ramifications for the nature of time and perceptions of our own universe if you mull it over. HashLife simulates incredibly large universes; in fact, 20 years ago, they were simulating 1-billion-by-1-billion cell grids on ancient computers.

Outside time is meaningless to a properly constructed simulation. If someone built a mechanical HashLife simulator that took 10000 years to produce three generations, the time flow would still be three generations to the simulated cells.

It's when outside time concerns filter into the simulation that things go wrong.

In any system with "local" physics where effects propagate at a maximum speed (speed of light), a HashLife-like algorithm can calculate the correct state of a subregion of space at an arbitrary time in the future, even if the universe being simulated is huge. Moreover, the more time steps, the faster it works, leading to the fact that one can, in fact, skip over steps without even calculating them, based on recombinations of memoized partial results.
Memoization speeds things up, it doesn't give a problem an asymptotically negative performance slope.

Don't forget that as far as we know, the world isn't perfectly determinate, at least not in the subatomic realm.

In such a case, random chance is one thing we'll never get an algorithm to do.

All that is required is a proper hash function and partition algorithm. Given a big enough lever, one can move mountains, given a good enough hash function, one can simulate the world.

Aside from the fact that every hash has some set of inputs it sucks at, it's also unlikely we'll find a satisfactory way to properly capture values that aren't fully representable.
With error being inevitable, we could simulate a world, it just won't be ours.
 
I completely agree with Demo.

To clarify it a bit more: when you calculate states asynchronously, of course you do take into account when they happen. You have a start time, as well as a (probable) end time. Every time you calculate that state, you record how long it took. The next time, you look at the time when you finished the last update, and assume that it will take as long as it did the last time. That's the step you have to calculate.

As for determining what (object or AI) state to calculate next, you can start by looking for the one that is oldest. And you give the ones closer to the player a higher priority. So, you calculate them very often and with small steps, while you calculate the ones further away (and/or off-screen) less often. And the ones furthest away and off-screen are calculated very infrequently.

And while doing so you store the movement vector(s) and maximum size in a table (until the next update), and look for possible collisions. When you find one, you give both objects high priority and reprocess them in detail.


Edit: and of course, you try to group objects further away, and apply less rules and/or less detailed rules.

But what is the benefit of calculating things asynchronously, with presumably a long-lived thread for each object? There are two reasons why people do this on a PC:
1) A transparent way of distributing tasks to cores when you are not sure of the number of cores, by spawning lots of threads and letting the OS distribute them between SMP cores.
2) To ensure a number of tasks (including single-threaded tasks) can all respond to user input by timesharing between tasks on the system.

Item 1 doesn't really apply to consoles, since you know the number of cores, although it is valid for multi-core PC games.
Item 2 only applies to processes handling user input and other asynchronous input.

Since the system is tied to frames, what is the benefit of the asynchronous threads (arbitrarily scheduled by the system) over a more controlled scheduling tied to the video frame (or a multiple of video frames for information that needs to be updated less regularly)?

Just think of the video frame as a big timeslice within which a number of similar objects are batch processed in turn. For a PS3, it would work like this:
You assign any control tasks and those that need to respond to user input to the PPE, using asynchronous multitasked threads.
Other tasks are assigned to heaps (gather/scatter lists) and batch processed by the SPEs in turn, switching to the next when the heap is empty. Rather than trying to switch contexts under the control of an automatic scheduler, the SPEs load the code to process one type of object into local store and process all the objects on the associated heap until they are done. Then they load the code associated with the next type of object and process that heap, and so on. You would do the objects that absolutely need to be completed first, and ensure that there aren't too many of these assigned to an SPE, so that it is a certainty that all will be finished within a frame, and leave the ones that can wait until the next frame to later in the frame. An interrupt or signal at the end of each frame (or every few frames) makes the SPE save what it has finished and start again.
This way you minimise the context switching on the SPEs, and you ensure that the things that need to be synchronised are in sync, and the things that can be asynchronous can be deferred to the next frame. If you need the results of one process to feed another, there are no processes waiting for other processes to complete as you might get with asynchronous threads - you either stream directly between SPEs, or run the first batch process first and then run the second batch process using the stored results of the first, or you could run the second batch process on the next frame.

In a game, it seems to me sensible to use the video frame as a kind of a time base to organise things. Also it seems sensible to me to keep context switches on SPEs to a minimum, and the video frame seems to me like the natural unit to do the context switching within.
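
As a rough sketch of that scheme (plain Python standing in for SPE code; the names and the particle update are made up), each worker drains one homogeneous heap of objects per "code load" inside the frame, instead of context switching per object:

def run_frame(heaps, update_fns):
    """heaps: {object_type: [objects]}, update_fns: {object_type: update function}.
    One 'code load' per object type, then the whole batch is drained."""
    for obj_type, heap in heaps.items():
        update = update_fns[obj_type]
        while heap:
            update(heap.pop())

particles = [{"y": 10.0, "vy": 0.0} for _ in range(3)]
heaps = {"particle": list(particles)}            # gather list for this frame
update_fns = {"particle": lambda p: p.update(y=p["y"] + p["vy"] / 60.0,
                                             vy=p["vy"] - 9.8 / 60.0)}
run_frame(heaps, update_fns)
print(particles[0])                              # {'y': 10.0, 'vy': -0.1633...}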
 
Some Unix pipe implementations use shared memory as the underlying implementation; some don't.

That's precisely the answer I was looking for from SPM. I was also looking for the realization that local IPC with shared memory is way faster because it avoids memory copies, which is why it's such a common optimization for pipes.

I was also looking for SPM to realize that an RPC or pipe over the network is thousands and thousands of times slower than a local shared memory RPC call or pipe implementation.

I was also looking for SPM to realize that the underlying implementation of a pipe requires all those "complex wait states" he was complaining about.

SPM claimed and claimed again in this thread that "SMP machines are poor at single process, multiple threads with lots of shared state."

In this, he is just wrong. "Single process (ie shared address space) multiple threads" is SMP's bread and butter.

Demo, I agree that in the future it's quite possible that fewer problems are going to look like this as we work out the algorithms. I won't argue with you about this.

However, it's clear that SPM has a fundamental misunderstanding of what SMP actually is and what its strengths are. He constantly asserts that SMP is tied to OS-scheduled pre-emptive multitasking, which it is not -- SMP describes the arrangement of processors and the memory model, and has nothing to do with the scheduling policy of the OS running on it. The OS running on an SMP machine is free to schedule one thread per CPU and never preempt it, or any other arrangement it desires.
 
Please elaborate. A pointer to what memory? Where? Who allocated it? How is it managed? How is access to it synchronized?

A local pipe is a FIFO structure in local memory with two pointers, one for access to each end of the pipe.
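
For illustration, here is that data structure in toy Python form (no kernel, no blocking, no synchronisation, just the buffer and its two indices):

class Pipe:
    """Toy FIFO: a fixed buffer plus a read index and a write index."""
    def __init__(self, capacity=16):
        self.buf = [None] * capacity
        self.read_idx = 0        # consumer's end
        self.write_idx = 0       # producer's end

    def write(self, item):
        if (self.write_idx + 1) % len(self.buf) == self.read_idx:
            raise BlockingIOError("pipe full: a real writer would block here")
        self.buf[self.write_idx] = item
        self.write_idx = (self.write_idx + 1) % len(self.buf)

    def read(self):
        if self.read_idx == self.write_idx:
            raise BlockingIOError("pipe empty: a real reader would block here")
        item = self.buf[self.read_idx]
        self.read_idx = (self.read_idx + 1) % len(self.buf)
        return item

p = Pipe()
p.write(b"x")
print(p.read())    # b'x'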

Now compare and contrast the performance you can get with IPC via pipes against what you can get with shared memory.

It is very fast and very efficient; the only thing is that the communication has to be serialised. There is no speed issue, since IPC is used to transfer results or, if the results are left in place, to query specific results. It may take a long time to decompose a matrix, but it only takes a fraction of that time to transfer the results, and even less to query specific results or send a specific result through a pipe.

Basically, if you want to do something like random access to an array in order to, say, normalise it, you would do that within the process itself - have the control node send a command to the process to normalise the matrix. Why would you want to do it through another process? You can spawn off child threads with shared memory within the process, but how would this give you a performance advantage over the process doing that itself? The child thread would just timeshare with the parent process. The main reason for spawning off a thread or child process within a process is if you need a thread to stop and wait for a signal without stopping the main process. This is not a performance enhancement, and it is not often necessary in a compute node that receives commands from a control node, and returns results or a request for the next command as soon as it is done.
 
I think you may need to read more on how HashLife actually works. HashLife can and does skip over steps, and it does have an increasing performance slope. Indeed, the first 100 steps may take 1 minute to calculate, the next 1000 steps may take 10 seconds, and the next 1,000,000 steps may be calculated in only 1 second. If you were to display a HashLife simulation in progress, you'd see some regions jumping far forward, and others even appearing to move backwards sometimes. However, in any given region, all of the values are provably correct for the local time of that cell.

HashLife is a spatial partitioning scheme based on the observation that effects may only propagate at the speed of light (in Life, 1 cell per tick). That means that, on any n x n grid, the state of all of the cells on that grid is completely determined by the state of that grid alone (locality), except that in the next tick, the outer 1-cell border on the top, bottom, left, and right will potentially be "corrupted" by propagation from nearby grids. At most, in 1 tick, non-local information can corrupt/penetrate only 1 cell width inwards.

So our perfect predictability in terms of local state becomes (n-1) x (n-1) in the next tick, and (n-2) x (n-2) in the tick after that, and so on. If you stack up these grids, they form a spacetime pyramid, or "light cone", of local predictability. You then take this pyramid and bisect it. Halfway up the pyramid is a grid of n/2 x n/2. Due to the way we have constructed our pyramid, the state of this grid at time t*2 is completely determined by the state of the base of the pyramid at time t (the initial n x n grid). Therefore, to determine the state at time t*2, one only needs to compute a hashcode of the grid at t, and then look up the answer. If it has not been previously calculated, it can be quickly calculated by combining partial results of sub-grids that were calculated. The hashcode itself is recursive: a hashcode is formed by combining the hashes of 4 sub-grids, and so on, until you get to a base case where you utilize Life's transition rules.

As the program runs, it learns, and as more and more configurations of the grid are encountered, its speed only increases. Simulating arbitrarily large grids into the far future, paradoxically, is easier than simulating the near term. And yes, the algorithm does skip over time steps, literally, in a log_2(t) fashion. It can do so because the locality of prediction guarantees that many intermediate steps are not needed in order to calculate the smaller predictable grid at time t*2. The only major downside is the memory consumed in running the algorithm.
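
This isn't the full quadtree-and-canonical-nodes machinery of HashLife, but the core trick can be sketched in a few lines of Python: because effects travel one cell per tick, the interior of a padded block after k steps is a pure function of the block alone, so it can be memoized and looked up instead of recomputed wherever, and whenever, the same configuration appears.

from functools import lru_cache

def step(grid):
    """One generation of Life on a tuple-of-tuples grid (cells outside are dead)."""
    h, w = len(grid), len(grid[0])
    def next_cell(i, j):
        n = sum(grid[a][b] for a in range(max(0, i - 1), min(h, i + 2))
                           for b in range(max(0, j - 1), min(w, j + 2))) - grid[i][j]
        return 1 if (grid[i][j] and n in (2, 3)) or (not grid[i][j] and n == 3) else 0
    return tuple(tuple(next_cell(i, j) for j in range(w)) for i in range(h))

@lru_cache(maxsize=None)
def interior_after(block, steps):
    """Interior of `block` after `steps` generations. With `steps` cells of
    border padding, the result is exact regardless of what lies outside the
    block, and it is cached: a configuration seen anywhere, at any time, is
    never recomputed."""
    for _ in range(steps):
        block = step(block)
    return tuple(row[steps:-steps] for row in block[steps:-steps])

blinker = ((0, 0, 0, 0, 0),
           (0, 0, 1, 0, 0),
           (0, 0, 1, 0, 0),
           (0, 0, 1, 0, 0),
           (0, 0, 0, 0, 0))
print(interior_after(blinker, 1))    # the blinker flipped horizontal; cached for next time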

Think this only applies to Life? Let us impose a speed of light on our game world of C, and let us impose a "planck time" of t (e.g. 100Hz, 1/100th of a second). Let us divide the game world into partitions, with the smallest being C*t in size (if C == 1000 m/s and t = 1/100th s, then C*t = 10 meters). In any partition N x N in size (N being some multiple of C*t), the state of the partition N/2 x N/2 in size at time t*2 is completely determined by the state at t. We can proceed to memoize results just as was done in HashLife, as long as we can design a suitable hash algorithm (as I showed in another message, such an algorithm has already been created for N-body gravitational problems). Physics within the minimal partition size will be run as the base case.

Once this is finished, we can simulate game physics in arbitrarily large worlds and query for the state of any game partition at any large time in the future without necessarily having to recompute the entire world, or the intervening steps. Note, this does not depend on the base-case problem being analytically solvable! (e.g. the 3-body problem is insoluble analytically) Life, and many cellular automata, do not admit analytic solutions.
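
A hand-wavy Python sketch of that idea (the per-partition physics kernel here is a trivial stand-in, and the coarse state must be hashable for the memo table to work):

C, TICK = 1000.0, 1.0 / 100.0        # speed limit and "planck time" from above
MIN_PARTITION = C * TICK             # 10 metres

def simulate_partition(state, ticks):
    """Stand-in for the real per-partition physics: drift each (x, vx) record."""
    return tuple((x + vx * TICK * ticks, vx) for x, vx in state)

memo = {}

def evolve(state, ticks):
    """Memoized partition evolution: identical coarse configurations reuse
    the previously computed outcome instead of re-running the kernel."""
    key = (state, ticks)
    if key not in memo:
        memo[key] = simulate_partition(state, ticks)
    return memo[key]

print(evolve(((0.0, 2.0), (5.0, 0.0)), 100))    # ((2.0, 2.0), (5.0, 0.0))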

The special cases are the following:

1. objects which don't exist completely within one partition or another, but which span two partitions.
2. partitions which include non-linear or non-local inputs: random events, AI controlled characters, human controlled characters

HashLife already handled #2 a long time ago by introducing hash entry invalidation and dirty bit concepts. #1 may or may not be handled by subdividing the object.

In any case, I'm not advocating that HashLife's algorithm itself be used to solve the large-scale simulation problem in games. For one, it burns memory like crazy. What I am trying to show is that monotonically increasing step-by-step time simulations are not necessarily needed to get results, and that lazy evaluation and "call by need" may offer alternatives.

I am also trying to stimulate interest in the idea that regions of the game simulation can be partitioned into local, predictable cones of influence (which may not include AI actors or players) which need not be simulated at every step. In my "pot roast" example, going and looking at the pot roast 30 minutes later and seeing it replaced with a cooked facsimile is the equivalent of a semantic hash table lookup (hashtable.get(roast, 30 minutes) -> return pre-simulated cooked roast). Depending on level design, one can mark whether an object is "local" or has non-local effects (aroma of meat cooking, sounds of boiling), and thus the simulator can decide whether it needs to micro-simulate that partition, or whether it can replace it with a cached pre-computed representation later.


The reason HashLife has philosophical implications for our reality is the implication that what you experience as an arrow of time and a history of states may be completely illusory, and that reality itself does not necessarily "compute" all the states in between. Yes, we can run measurements to measure phenomena at arbitrarily small resolutions, but if the universe works like HashLife, then our perception of having taken 1,000 samples at 1000Hz may simply be an illusion, and our memory of "now" merely contains the perception of having carried out all of those steps, when in fact the universe jumped forward 1 second and our memory reflects a correct and consistent history that never in fact occurred.

People have long questioned the continuum vs the discrete conception of time. However, in both conceptions, a monotonically increasing linear order was always assumed: step T_n depends on step T_(n-1) according to transition rules or the laws of physics. The philosophical debate engendered by HashLife was whether we even need this assumption; maybe the universe can "jump" or "tunnel" between arbitrarily large configurations, as long as all possible observers in the later configuration agree that the later state was an evolutionary consequence of the former. And when one considers Many Worlds, it may even be that the rules are not functional (1:1 and onto, a bijection), but that the universe may branch from state T to an arbitrary number of states X, Y, Z, ... as long as there are an arbitrary number of intermediate transition steps that take T into X. Time, of course, would be merely an illusion, an afterthought mental explanation formed by back-predicting the states that could have led to the current one.

I perceive myself to have had a life history, because information in the state of my brain leads me to believe that the current state (now) could have been the consequence of prior ones (the past). Cause-and-effect extrapolation. The assumption of a continuum, that a continuous stream of time existed between all events in my memory, is just that, an assumption, or an extrapolation.

In any case, I bring it up because HashLife is not just a curiosity. When Bill Gosper invented it, people were already playing with the idea of artificial life and intelligence within these grids, and philosophers began to ask "what if?" our universe amounted to nothing more than a big cellular automaton. If so, what if it evaluates using HashLife techniques? Then many of our most fundamental assumptions are wrong. Sure, sophistry, right? Still, it makes for interesting thought experiments.

In any case, I hope I have at least stimulated some ideas as to the nature of parallelism and predictability, even for simulation logic that does not permit smoothly differentiable or analytic calculation.
 
A local pipe is a FIFO structure in local memory with two pointers, one for access to each end of the pipe.

So, who owns the memory and who can see the memory? If I have two processes, with separate address spaces, how does data put in one end of the pipe get to the other?

You can spawn off child threads with shared memory within the process, but how would this give you a performance advantage over the process doing that itself? - the child thread would just timeshare with the parent process.

This is wrong. The basic unit of scheduling in both NT and Linux is the kernel thread, not the process. If you create two kernel threads within the process, then both threads will run at the same time on 2 CPUs. If your work items are independent, then you can spawn a thread or create a thread pool and work in parallel. Remember, the basic unit of scheduling in both Linux and NT is the kernel thread.
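
A small Python illustration of the point (with the caveat that CPython's GIL serialises pure-Python bytecode, so for CPU-bound work the parallel part usually lives in native code or processes; the scheduling model is what's being illustrated): the threads below live in one process, share its address space, and are each scheduled by the kernel.

from concurrent.futures import ThreadPoolExecutor
import threading

shared = []                       # lives in the single shared address space
lock = threading.Lock()

def work(item):
    with lock:                    # shared state, hence explicit synchronisation
        shared.append(item * item)

with ThreadPoolExecutor(max_workers=2) as pool:
    pool.map(work, range(4))

print(sorted(shared))             # [0, 1, 4, 9]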
 