SPM said:
Of course you haven't encountered the terms "procedural AI" and "neural net type AI" . I have said more than once that these terms are just terms I have coined in this thread to mean very specific things, since a more appropriate term doesn't exist.
The problem is that those terms are already in use. I don't go into an italian restaurant, order a pizza, and then say afterwards that what I meant by pizza was a burlap sack full of used oil filters.
PROCEDURAL CODE
Where every part of the code is reached by branching, or by spawning child threads/processes off a parent thread/process. Procedural code can be single- or multi-threaded, but there is a programmed branch or fork instruction linking the execution of every bit of code to every other, and any executing code must follow a direct thread of execution from the master process.
I like my programs to be able to run every portion of their code.
Aside from the initialization of a thread, there is no requirement that the new threads have anything to do with the main one. It would help if they did, but there is no requirement.
EVENT DRIVEN CODE
Where the code is written in sections which are completely disconnected from each other, are initiated independently of program control, and have no thread of execution connecting the sections of code.
Programming languages that have subsections that can run independently are often functional languages where each function takes data and runs with it in its own little context. Thread-safe code in almost any language can be made to do so.
Event-driven programming is often used for things like GUIs, where individual buttons and selectable areas are given event listeners that allow the program to seamlessly handle user input.
However, under the hood, there is a core process being run in the OS. Something has to take messages from the listeners, poll the buffers, watch the mouse coordinates, etc. There is always a thread of execution somewhere tying them together.
Event listeners are also a performance consideration. They are packages of code that take a nonzero number of cycles in the processor and memory bus to watch out for a non-predictable input. If a thousand AI agents did the same thing, the processor would be sitting there asking "is there an input? no? well I'll check again" more than anything else.
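To make that concrete, here is a minimal sketch (hypothetical names, not any real GUI toolkit's API) of what sits under an "event driven" interface: one ordinary thread of execution drains a message queue and dispatches to whichever handler was registered.

#include <functional>
#include <map>
#include <queue>

enum class EventType { ButtonClick, MouseMove, Quit };
struct Event { EventType type; int x, y; };

int main() {
    // Handlers registered per event type -- the "listeners".
    std::map<EventType, std::function<void(const Event&)>> handlers;
    handlers[EventType::ButtonClick] = [](const Event&) { /* run the button's code */ };
    handlers[EventType::MouseMove]   = [](const Event&) { /* track mouse coordinates */ };

    // Pretend the OS delivered a couple of messages.
    std::queue<Event> pending;
    pending.push({EventType::MouseMove, 10, 20});
    pending.push({EventType::Quit, 0, 0});

    bool running = true;
    while (running) {                          // the hidden core thread of execution
        while (!pending.empty()) {             // drain whatever arrived
            Event e = pending.front();
            pending.pop();
            if (e.type == EventType::Quit) running = false;
            else if (handlers.count(e.type)) handlers[e.type](e);   // dispatch
        }
        // "...is there an input? no? well I'll check again..."
    }
}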
The reason I coined the term "procedural AI" is because the outcome of the AI results are mapped to branches in the code - hence branching to handle the processing of the different AI outcomes is procedural in the same way that branching to different sections of code makes procedural code procedural. Get it?
I'm going to state my definition of procedural code, before I try to tackle yours.
At the most basic high level, a procedural language lets the programmer do this:
function foo(argumentA, argumentB) {
    do something with A and B
    return something
}
elsewhere:
...
foo(argumentA, argumentB)
...that's it
What your definition says is that the program counter's behavior is affected by the data state of the AI objects. In other words, the program behaves differently when the AI are in different states. I'll get back to this later.
Using procedural code is fine for batch programs (and non-realtime RPG type games), but procedural code is a poor way of handling real-time simulations. How would you keep the AI processing in sync with the object animation processing, for example?
Synchronization and resource contention are the biggest problems with multithreading anywhere. Event driven programs don't get around this. Each code segment will still need to arbitrate access to limited resources and buffers.
If one event object needs to pass its results to another object, they will have to synchronize. There's no magical way to get around a data dependency.
Unless there's an absolute need for synchronization, these rules can be relaxed, however. If they aren't working on the exact same data structure at the same time, threads can do whatever they want. A general flag or just code arrangement can keep them in their own little sandboxes.
How would you ensure the AI processing doesn't get out of sync with other synchronous game processes, like animation, that need to be completed before a certain time? How would you get a game written in procedural code to respond to external events like user input?
The same way it's done everywhere else. These aren't new problems, and since we have AI in other games, it's been done before.
The game's operation is divided into phases, the AI has its phase, inputs are polled, and the graphics routines have their run of the processor. If properly done, the graphics commands are fed into a circular buffer, where the GPU will run through it as long as it has commands.
It has to be done like this because having code running willy-nilly accessing memory, input buffers, and doing weird things gives you either bad performance, bad results, or a system crash.
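For illustration, a rough sketch of that phased loop. These are made-up names, not any particular engine's code; each subsystem gets its run of the processor once per tick, and draw commands go into a buffer the GPU drains asynchronously.

struct World { /* objects, input state, draw command buffer... */ };

void poll_input(World&)           { /* copy controller/keyboard buffers */ }
void update_ai(World&)            { /* every agent does its job once */ }
void update_physics(World&)       { /* integrate movement, flag collisions */ }
void submit_draw_commands(World&) { /* append to the GPU's circular buffer */ }

void run_game(World& world, const bool& quit_requested) {
    while (!quit_requested) {     // one iteration == one engine tick
        poll_input(world);
        update_ai(world);
        update_physics(world);
        submit_draw_commands(world);
    }
}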
With procedural code, keeping execution synchronised with other processes, and even between multiple threads within the same procedural code, is difficult, messy, and inefficient.
Having a million event listeners would be insane. Software cannot take any action, process, or even listen, unless it runs through a processor's execution pipeline. If it runs through a processor's execution pipeline, it is part of a thread's execution. This is the same for the SPE as it is for any current silicon microprocessor.
If a program's sections or individual objects are all listening for inputs, they must take up time slices in the core to do so. If there are a million of them, they will have the chip spinning its wheels every time they ask if some rare event has happened yet.
Instead, it is far more efficient to tell every object to do its job once in the update phase. So long as the game's engine tick is small, the user will never notice.
This is why developers have been saying that multi-threaded real-time game programs for the XBox 360 are going to be very difficult to program (even though writing code for use under SMP is transparent). The code is easy to write, but synchronising the threads so they deliver their results in realtime using procedural programming is very complicated.
Anything where data is modified out of lockstep is harder to write. It's complicated and often unpredictable.
If a program were event-driven, the programmer still needs to worry about what would happen if event X happened before or during event Y.
You could get the AI update process to sleep and repeat every so many seconds to update the AI (this uses a timer and an interrupt to wake up, by the way), but this will get out of sync with other synchronous processes, since the execution time of procedural code with conditional branches, multiple threads and variable numbers of forked threads cannot be predicted nor kept constant.
The action of going to sleep, waking up, or peeking at the timer is an operation with a performance cost. Timers are also no guarantee of synchronization, since they rely on something that isn't even on the same chip.
You would have to put in locks and semaphores and make threads wait for other threads/processes in order to synchronise them, but this quickly becomes very complicated and cumbersome to manage, and inefficient, because a lot of tasks will be waiting for others to complete rather than completing as soon as possible. This is why some developers have said that the first generation of games won't fully use the XBox 360's three cores.
This only happens if all the threads are accessing the same areas in memory or the same buffers/ports. Beyond this contention, the threads don't even need to know the others exist.
In procedural code, if you need to detect an event such as user input or a collision between objects and respond to it - such as having a player respond to being punched with appropriate AI, for example - the only way you can do this is to continuously poll a memory location or I/O port for the event. This is a grossly inefficient use of CPU time, especially in a realtime game scenario.
No, that's not what is done. The game cycles to the physics simulation update phase. It then scans through the simulated world, finds which objects have movement vectors that indicate a collision is occurring, and flags that part for further work.
An event listener is what sits there polling forever waiting for something to happen.
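Something like the following sketch is the shape of what's being described - invented structures and a deliberately naive pair scan, not real engine code. The scan happens once per physics phase, and collisions are flagged for a later phase rather than "listened" for.

#include <cstddef>
#include <vector>

struct Object {
    float x, y;            // position
    float vx, vy;          // movement vector
    float radius;
    bool needs_collision_response = false;
};

// Runs once per engine tick, inside the physics phase.
void physics_phase(std::vector<Object>& objects, float dt) {
    for (auto& o : objects) { o.x += o.vx * dt; o.y += o.vy * dt; }

    // Naive O(n^2) pair scan -- real engines partition space first.
    for (std::size_t i = 0; i < objects.size(); ++i) {
        for (std::size_t j = i + 1; j < objects.size(); ++j) {
            float dx = objects[i].x - objects[j].x;
            float dy = objects[i].y - objects[j].y;
            float r  = objects[i].radius + objects[j].radius;
            if (dx * dx + dy * dy < r * r) {
                objects[i].needs_collision_response = true;   // flag for a
                objects[j].needs_collision_response = true;   // later phase
            }
        }
    }
}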
Event driven code is appropriate and necessary, rather than procedural code, for most realtime AI and realtime simulation situations in games.
I'm not aware of who actually does it this way.
This is underlined by the fact that it is nigh impossible to write even a decent non-realtime GUI computer application without most of the code being event driven - procedural code just cannot respond to user input fast enough or efficiently enough in a multi-threaded, multi-tasking environment.
Event-driven programming works for the upper layers of a GUI because it is not a performance-critical part of the system. The user is slow, so updates need only happen within a very coarse update schedule compared to other parts of the program.
Most importantly, there's only one user, who puts in actions one at a time.
The user's input is almost never too fast for the system, it's the other way around.
Also, I am not suggesting the event driven code should be triggered separately for each object as you seem to think - quite the opposite in fact, since I am talking about vector processing of AI and stream processing of object animation - but a timer/interrupt can be used to trigger a process when and if required, rather than having the process wait in memory.
Relying on an interrupt or a timer for something that needs to be done regularly is sloppy and unreliable.
Timers can be inaccurate, and interrupts can be preempted by interrupts of higher priority.
Interrupts are specifically intended for things that cannot be readily predicted or are time critical on the time scale of the electronics (hence the term interrupt).
The cycling of the game engine is completely predictable, and the user isn't that picky.
Kicking something out of memory that needs to be used regularly is also not a good idea.
There is one thing that procedural code for AI is good at, though, which I mentioned earlier as better suited to running on the PPE: the simple top level game logic - e.g. the code that takes you to different levels or into different rooms in a maze. This is not running in realtime, is very simple in terms of AI complexity, and so does not require efficient AI processing.
If it's that simple, it doesn't need to worry where it's run.
On the other hand, if it has to do with making decisions regarding the environment, AI agent state, or strategic decisions, efficiency is paramount.
In any case, if the decisions being made involve more than a trivial amount of interrelationship between state variables, or the behavior is more complex, or the game world is dynamic, or there is planning, then there will be an explosion in the number of states the hardware must process.
I think you still haven't got this point, so I will try to explain again. I didn't say that the code has to be single threaded; I said that if you use procedural AI as I have defined it, you can't use SIMD (or data packing) to process AI objects in parallel.
This is highly dependent on the kind of computation involved. A complex AI state can still use data packing, and very, very simple objects can conceivably be packed as well.
What becomes a problem is if the relationships in the simulation do not line up with the static layout of the code.
Of course you can fork out multiple threads/processes, but unless you follow my rules of processing AI as data and making sure that branches are not conditional on AI state, you can only ever process one object at a time per thread on a "boolean bit at a time basis".
Unless there is some shared memory location being fought over, there is absolutely no reason for multiple threads to stop one another.
This is grossly inefficient for any very intensive logic processing. You have a 32 or 64 bit CPU with 4 vector units and a lot of internal registers which can process a huge amount of data in parallel (4 x boolean true/false flags in 64 bit word format, or 4 x 64 = 256 bits of flags in packed boolean flag format) in the SIMD vector units. However with procedural AI code, where a conditional branch reflects true or false in an AI test, you are using your state of the art SIMD 64 bit processor as a one bit (true/false) processor, because your processor can't do more than one branch decision at a time.
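For what it's worth, the packed-flag idea being described can be sketched with plain 64-bit words; a vector unit would simply widen this to several words per instruction. The flag names are purely illustrative.

#include <cstdint>

int main() {
    // Bit n of each word is one agent's flag, so 64 agents per word.
    std::uint64_t is_hungry  = 0xF0F0F0F0F0F0F0F0ull;
    std::uint64_t sees_food  = 0x00FF00FF00FF00FFull;
    std::uint64_t is_fleeing = 0x0000000000000001ull;

    // One AND per word evaluates the test for all 64 agents at once;
    // branching code would test and branch once per agent instead.
    std::uint64_t should_eat = is_hungry & sees_food & ~is_fleeing;
    (void)should_eat;
}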
Not all AI calculations are simple true/false, or at least they cannot be described compactly in those terms. I'd hazard to say the interesting ones are much more complex.
Also, if you branch just to run code to store the AI results instead of simply computing the AI results and storing them directly, you add a big execution overhead on top of that, because with the long branches typical of procedural AI code you will get cache misses and incur penalties on execution speed.
There is no such thing as a long branch, a branch is a single instruction and its penalty is the length of the pipeline from instruction fetch to branch resolution (for the SPE, this is ~20 cycles, ~20 instructions).
That is the primary penalty of the branch, and it is fixed, regardless of problem size.
Boolean algebra is a branch of mathematics, and it is every bit as precise as mapping branches to code. Also, while procedural AI may be simpler to program for some simple non-realtime AI problems where the AI can be mapped directly to code, for more complex AI - especially for synchronised or time critical code - it is much more complicated to program than event driven coding, which lends itself to the data processing approach to AI that I am suggesting.
It is precise, but it also presumes some serious resource consumption.
Producing identical program behavior without resorting to a jump or conditional jump involves some kind of if-conversion, conditional moves, or instruction select. They transform a control dependency into a data dependency, trading execution bandwidth for improved latency.
Other architectures use these, but very VERY carefully, because the code growth, execution unit consumption, and memory consumption rise very quickly.
If a given bit of code must produce outcomes dependent on 4 different possible states, there is a branched and non-branched way.
Branched:
if (case 1)
    do this
else if (case 2)
    do something
else if (case 3)
    do something to this
else if (case 4)
    do something here

Non-branched:
do case1 and case2 and case3 and case4
keep the valid case (by selective store, masking results, ORing with a storage location)
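As a concrete (and hedged) example of the non-branched pattern, with arbitrary arithmetic standing in for the four cases: every case is computed, and masks built from the selector keep only the valid result.

#include <cstdint>

// 'which' selects the case (1..4); 'x' is the input data.
std::uint32_t nonbranched(std::uint32_t which, std::uint32_t x) {
    std::uint32_t r1 = x + 1;      // do case1
    std::uint32_t r2 = x * 2;      // do case2
    std::uint32_t r3 = x >> 1;     // do case3
    std::uint32_t r4 = x ^ 0xFF;   // do case4

    // Turn each comparison into an all-ones or all-zeroes mask, then OR the
    // surviving result out. Every case executed; only one is kept.
    std::uint32_t m1 = 0u - static_cast<std::uint32_t>(which == 1);
    std::uint32_t m2 = 0u - static_cast<std::uint32_t>(which == 2);
    std::uint32_t m3 = 0u - static_cast<std::uint32_t>(which == 3);
    std::uint32_t m4 = 0u - static_cast<std::uint32_t>(which == 4);
    return (r1 & m1) | (r2 & m2) | (r3 & m3) | (r4 & m4);
}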
Great, right?
Only with branching, it is highly likely that only one or two cases at a time will ever be accessed.
With the non-branched version, the processor must do them all every single time.
What happens if another variable is involved (hopefully just a boolean)?
The branched code gets longer, perhaps doubles in length, but at any time it is likely only one or two cases will be used.
The non-branched version will have to run twice as many cases, using twice as much memory, and twice the number of execution units.
Oops, but there aren't twice as many resource units to handle just a boolean flag.
Branched code trades time against finite resources.
Non-branched code expends finite resources to save time.
For complex AI, the resource demands are large. You can pare them down somewhat, with selective branching of course.
It has been used in the way I say (using branchless operations on boolean data) in very complex simulations such as neural networks and chip simulations.
Chip simulations work well with boolean math because that's what they do. When the program is trying to simulate a world that does not reduce down to boolean terms without some serious elaboration, the amount of work involved in getting useful results goes up.
It is also a certainty that the herd simulations and the AI accelerators we have seen do things the same way, for the very reason that games developers have said that the SPE will be poor at doing (procedural) AI, and because it simply isn't practical to do it any other way.
Whatever "procedural" they were talking about is not your definition of procedural. What they say cannot support your argument unless you clarify how what they are talking about relates to your definition (which only you use).
It is a mistake to assume simply because you don't understand how it is done, that it isn't done by others or that it can't be done. Please take a look at
http://www-ee.eng.hawaii.edu/~msmith/ASICs/HTML/Book/CH13/CH13.4.htm
You've confused the higher-level event-driven simulation for the lower-level implementation.
The event-driven part is the simulation. The actual code being run is plain old procedural.
An event list is created, and events are added to the queue. The program then creates a timer for the simulation, and evaluates events whose time matches the timer. The actual hardware running the program couldn't care less about the time.
It's very conventional past the conceptual behavior of the data objects.
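A minimal sketch of that structure - illustrative names, not code from the linked chapter - shows how the "event driven" simulation boils down to a plain procedural loop over a time-ordered queue.

#include <functional>
#include <queue>
#include <vector>

struct SimEvent {
    double time;                          // simulated time, not wall-clock
    std::function<void()> action;
    bool operator>(const SimEvent& o) const { return time > o.time; }
};

int main() {
    // Earliest-time-first queue: the "event list".
    std::priority_queue<SimEvent, std::vector<SimEvent>,
                        std::greater<SimEvent>> events;
    events.push({1.0, [] { /* gate A switches */ }});
    events.push({0.5, [] { /* input B rises   */ }});

    double now = 0.0;
    while (!events.empty()) {             // plain old thread of execution
        SimEvent e = events.top();
        events.pop();
        now = e.time;                     // advance simulated time
        e.action();                       // evaluate the event
    }
    (void)now;
}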
Read my explanation above. Branching which is dependent on the value of AI variables will prevent SIMD being used for parallelism (i.e. you can only process one-bit-wide boolean values per thread).
If the objects' behavior varies that wildly, they should not be grouped.
If they are grouped, it better be because the performance sacrificed to running code for every contingency is less than the waste from branch overhead. For a narrow design like the SPE, the threshold is lower.
Short branches are branches which are within sections of code that is small enough to be held entirely within the instruction cache or SPE's local store.
A branch is a single instruction or pair of instructions used to indicate possible movement of the program counter to an address that does not immediately follow the last instruction. That's it.
What you might be talking about is called a basic block, the straight line code between branches. Computers like large basic blocks, the more the merrier. Straight-line code is awesome, so long as the work involved in removing branches does not exceed the (constant) branch penalty.
Some basics about instruction caching first. Branch prediction is not the only nor even the most important issue in relation to use of CPU instruction caching. Before processors capable of branch prediction, instruction caches only stored the instructions already executed.
Storing previously accessed data or instructions is all a cache does. What branch prediction did was allow a processor to avoid the massive drop in throughput involved in waiting for a branch to resolve before fetching instructions into the pipeline (from memory or cache). Depending on the length of the pipeline, the penalty was small or large.
Most speed critical code in programs uses short backward loops. The instructions branched to are already in the cache, so there is no penalty for these branches, and there isn't a need to load instructions from main memory at all since they are already in the cache, so it is even more efficient than a correctly predicted long branch.
Code is automatically prefetched during the course of execution. For the SPE, it is batched in on demand.
Small loops can wreak havoc on parallel execution, since the code being repeated uses the same register names over and over, forcing the processor to stall. What is usually done for in-order designs is compiler-driven loop unrolling, which converts the loop into a long stream of instructions with far fewer branches.
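A toy example of what the compiler (or programmer) does, assuming for brevity that the element count divides evenly by four:

// Scales n floats by s.
void scale(float* v, float s, int n) {
    // Rolled loop: one branch per element, the same registers reused every pass.
    // for (int i = 0; i < n; ++i) v[i] *= s;

    // Unrolled by four: one branch per four elements, and four independent
    // calculations an in-order core can overlap.
    for (int i = 0; i < n; i += 4) {
        float a = v[i + 0] * s;
        float b = v[i + 1] * s;
        float c = v[i + 2] * s;
        float d = v[i + 3] * s;
        v[i + 0] = a;  v[i + 1] = b;  v[i + 2] = c;  v[i + 3] = d;
    }
}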
Even with the arrival of branch prediction, this is still the most important aspect of instruction caching, given that tight backward loops can be pretty well be guaranteed in time critical sections of code.
Without hardware capable of speculation and register renaming, a tight loop on a long pipeline leads to unnecessary stalling to avoid false data dependencies. If they can, programmers will always go for longer basic blocks (or in your terms, long branches).
The procedural AI, as implemented in most top level AI code has lots of code between the branches because it contains the code that acts on results.
Chips like this. Cell likes this.
AI code that makes complex decisions uses branches because they avoid a lot of the exploding complexity of mapping out decisions and future states.
This is the reason why many developers have said that the SPE is not good for AI; however, it is not a problem, because this type of AI code is not time critical, does not require intensive processing, and can easily be handled by the PPE.
Consider a bot that must decide between running for health through a kill-zone or stopping and calling for suppressive fire - that decision is neither trivial nor free of time pressure.
For uniform behavior like finite state machines and simple herds, the SPE will do well.
For complex stuff, the SPE suffers just as much or worse than other chips. No silicon does high-level AI well.
As I said before, certain things like simple top level game logic/AI are easier to map to procedural code, and computational efficiency is not an issue because the processing is trivial and not time-critical.
No, that is mostly backwards. Pure number-crunching and tree traversal that the SPE excels at would work best for simple top-level games. Like tic-tac-toe, for example. The logic states are simple, and the optimal solution can be mapped out very quickly as a numeric value. A decision tree can be scanned through very easily.
More complex environments with more agents require all the efficiency that can be mustered, because a similar decision tree could have millions of nodes.
Also, as I explained above, the procedural AI approach is inefficient for compute-intensive AI processing, and complicated to program for realtime AI.
Any software approach to AI is going to be inefficient. Branches are not necessarily worse once the problem grows more complex.
The point here is that you don't need to branch conditionally on the AI results in the AI processing code, since this can be deferred. This allows the AI code to be compact enough to fit into the SPE local store, especially if you write event driven code to handle specific object types and then terminate, rather than your all encompassing procedural AI code which is all linked together by a huge network of branches and forks.
Decisions like:
move forward, so long as the cow's not thirsty, or the wolf is not here, or the grass is green, or the clouds don't look like rain
Compact code is no good if the relationship between AI state, world state, and optimal decision making are more complex than that.
You need to read up on boolean algebra and learn to formulate problems in terms of boolean operations instead of thinking in terms of the procedural mindset. In other words, think about how to organise the boolean logic data structures in the first place and process the boolean data as data, rather than thinking about how to write the code in terms of branching, and only then how to pack data or pick out packed data using masks to convert that into branchless code.
Try to sit down and plot out a conceptual AI agent. Think of all the things it must know, must remember, and must take into consideration.
Say for a cow.
Does it care about how many neighbors it has? That's a state variable.
Does it care how full it is? That's another.
Thirst?
Location?
Height?
Line of sight?
Terrain?
Allergies?
Does it like Bessie after Bessie mooed at Bertha?
What about the 1/1000 chance a wolf will show up?
The 1/200 chance of a bee?
What if the wind is below 20 mph, but above 5, but since it's sunny, the range should be upped?
Decide how interesting you want its behaviors to be, what the acceptable range of values are for each state variable. Decide if the cow tries to plan out its day.
Then model it in a static software context.
Count every memory access and mathematical operation. Decide if a whole slew of math ops is worth avoiding a branch.
I'm sure you can do it. Then decide if the work you went through will even be noticed by the user.
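As a starting point for that exercise, here is one hypothetical way the cow's state could be written down as a static data layout, so the loads and operations per update can actually be counted. Every name and field is an assumption, chosen only to mirror the questions above.

#include <cstdint>

struct Cow {
    // continuous state
    float x, y, height;
    float fullness, thirst;
    float wind_mph;
    // discrete state
    std::uint8_t neighbor_count;
    std::uint8_t terrain_type;
    bool has_line_of_sight;
    bool allergic_to_bees;
    std::int8_t opinion_of_bessie;   // soured since the Bertha incident
    // rare events, rolled once per tick
    bool wolf_appeared;              // ~1/1000 chance
    bool bee_appeared;               // ~1/200 chance
};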
Masks are easy to use and fast though, if you need to pack data.
Masking presumes a certain portion of output data is going to be eliminated.
A non-trivial AI problem will give you a fixed amount of correct output, and a sea of discarded data. The chip will have spent time on the wasted data, regardless of whether it was outnumbered by real data or outnumbered real data 1,000,000,000 to 1.
For example, you may want to store an integer value representing an anger factor, which is incremented or decremented by different amounts depending on the status of various flags (or various tests). This can be done easily by mixing arithmetic and boolean operations - for example, summing each flag's weighting ANDed with the flag's value into the integer AI data representing anger.
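A small sketch of that scheme, with invented flag names and weightings (two's complement assumed, as on any current CPU): each flag becomes an all-ones or all-zeroes mask, the weighting is ANDed with it, and the survivors are summed into the running total.

#include <cstdint>

std::int32_t update_anger(std::int32_t anger,
                          bool was_insulted, bool is_hungry, bool got_banana) {
    std::int32_t m_insulted = -static_cast<std::int32_t>(was_insulted); // 0 or -1
    std::int32_t m_hungry   = -static_cast<std::int32_t>(is_hungry);
    std::int32_t m_banana   = -static_cast<std::int32_t>(got_banana);

    anger += ( 5 & m_insulted);   // +5 if insulted
    anger += ( 2 & m_hungry);     // +2 if hungry
    anger += (-3 & m_banana);     // -3 if given a banana
    return anger;
}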
But what if you want the chimp to only throw bananas if he has anger>8 but only if the bananas are off-green and the bratty kid has brown hair?
As I said, anything you can do by branching, you can do with boolean operations, the only limiting factor is your mindset in approaching the problem. As I said before, the only other limiting factor might be the procedural mindset of some CPU instruction set designers. This is not a major problem on more modern CPUs though.
The limiting factor for branch conversion is the gargantuan amount of wasted execution it involves.
As for speed look at the instruction clock cycles - there is nothing a CPU can do faster than boolean and integer arithmetic instructions.
Branching by comparison has a bigger performance hit, and after having branched you still have to store or manipulate the AI data.
AI's limiting factor is computational complexity, not branching. A kludgy branch is better than 1000 wasted multiplies.
Within a certain range of behaviors, SIMD and some limited branch conversion is a win. For interesting behaviors, not so often.