How could SPEs be used in console RPGs?

3dilettante said:
Apparently we are on different wavelengths. I have never encountered definitions for procedural code or neural networks like what you are using, and they mix up the higher-level conceptual objects with lower-level implementation details.

Of course you haven't encountered the terms "procedural AI" and "neural net type AI". I have said more than once that these are just terms I have coined in this thread to mean very specific things, since more appropriate terms don't exist. If you want to understand what I am talking about and engage in a coherent discussion with me, you will have to read my words in the context I mean.

There are generally accepted and well known definitions for "procedural code" and "procedural programming", and for their opposites "event driven code" and "event driven programming".

PROCEDURAL CODE
Where every part of the code is reached by branching or by spawning child threads/processes from a parent thread/process. Procedural code can be single or multi-threaded, but there is a programmed branch or fork instruction linking the execution of every bit of code to every other; any executing code must follow a direct thread of execution from the master process.

EVENT DRIVEN CODE
Where the code is written in sections that are completely disconnected from each other, are initiated independently of program control, and have no thread of execution connecting them.

The reason I coined the term "procedural AI" is because the outcomes of the AI tests are mapped to branches in the code - hence branching to handle the processing of the different AI outcomes is procedural in the same way that branching to different sections of code makes procedural code procedural. Get it?

The logic state of the AI process is only evaluated between update cycles. This removes the nature of the update code completely from consideration.

Do what every single-threaded AI simulation has ever done. Iterate through the list of objects that store the state variables of the AI agents.

Better yet, multithread the procedural code.

Using procedural code is fine for batch programs (and non-realtime RPG-type games), but procedural code is a poor way of handling real-time simulations. How would you keep the AI processing in sync with the object animation processing, for example? How would you ensure the AI processing doesn't get out of sync with other synchronous game processes like animation that need to be completed before a certain time? How would you get a game written in procedural code to respond to external events like user input?

With procedural code, keeping execution synchronised with other processes - and even between multiple threads within the same procedural code - is difficult, messy, and inefficient. This is why developers have been saying that multi-threaded real-time games for the Xbox 360 are going to be very difficult to program (even though writing code to run under SMP is transparent). The code is easy to write, but synchronising the threads so that they deliver their results in realtime using procedural programming is very complicated.

You could get the AI update process to sleep and wake every so many seconds to update the AI (this uses a timer and an interrupt to wake up, by the way), but this will get out of sync with other synchronous processes, since the execution time of procedural code with conditional branches, multiple threads and variable numbers of forked threads can neither be predicted nor kept constant.

You would have to put in locks and semaphores and make threads wait for other threads in order to synchronise them with other threads/processes, but this quickly becomes very complicated and cumbersome to manage, and inefficient, because a lot of tasks will be waiting for others to complete rather than completing as soon as possible. This is why some developers have said that the first generation of games won't fully use the Xbox 360's three cores.

In procedural code, if you need to detect an event such as user input or a collision between objects and respond to it - such as having a player respond to being punched, for example, with appropriate AI - the only way you can do this is to continuously poll a memory location or I/O port for the event. This is a grossly inefficient use of CPU time, especially in a realtime game scenario.

Event driven code is appropriate and necessary, rather than procedural code, for most realtime AI and realtime simulation situations in games. This is underlined by the fact that it is nigh impossible to write any decent GUI non-realtime computer application without most of the code being event driven - procedural code just cannot respond to user input fast enough or efficiently enough in a multi-threaded multi-tasking environment.

Also, I am not suggesting the event driven code should be triggered separately for each object as you seem to think - quite the opposite in fact, since I am talking about vector processing of AI and stream processing of the animation of objects - but a timer/interrupt can be used to trigger a process when and if required, rather than having the process wait in memory. Unless you are talking about non-realtime batch programs or time-predictable single threaded code, event driven code is a necessity.

There is one thing that procedural code for AI is good at though, which I mentioned earlier as better suited to running on the PPE: the simple top level game logic - eg. the code that takes you to different levels or into different rooms in a maze. This is not running in realtime, is very simple in terms of AI complexity, and so does not require efficient AI processing. Unlike the realtime simulation of complex interaction problems above, simple top level game logic is also conceptually easy to program procedurally, because the AI logic branching takes you straight to the code you need to execute for each level.

Agent state in every form of AI is stored as data. Code is then run using that data. There is no need for a direct mapping of the AI simulation to the underlying code.

Depending on what definition of procedural you are using, there may be no problem at all. I'm still not clear on why you think a procedural AI must be single-threaded or why branches have any effect on threadability at all.

I think you still haven't got this point, so I will try to explain again. I didn't say that the code has to be single threaded; I said that if you use procedural AI as I have defined it, you can't use SIMD (or data packing) to process AI objects in parallel. Of course you can fork out multiple threads/processes, but unless you follow my rules of processing AI as data and making sure that branches are not conditional on AI state, you can only ever process one object at a time per thread, on a "boolean bit at a time" basis. This is grossly inefficient for any very intensive logic processing.

You have a 32 or 64 bit CPU with 4 vector units and a lot of internal registers, which can process a huge amount of data in parallel in the SIMD vector units (4 boolean true/false flags in 64 bit word format, or 4 x 64 = 256 bits of flags in packed boolean flag format). However, with procedural AI code, where a conditional branch reflects true or false in an AI test, you are using your state of the art SIMD 64 bit processor as a one bit (true/false) processor, because your processor can't do more than one branch decision at a time. Also, if you branch just to run code to store the AI results, instead of simply computing the AI results and storing them directly, you add a big execution overhead on top of that: with the long branches typical of procedural AI code you will get a cache miss and incur penalties on execution speed.

The reason you can't do SIMD or packed bit parallelism with procedural AI is simple. If you have, say, four AI objects you are processing simultaneously in the vector units, or flags from 256 units you are processing in parallel in packed bit format, and you use test-and-branch (procedural AI) to process the AI data, then how do you handle different objects being processed by the SIMD instructions branching different ways? Answer - you can't, because the CPU has only one program counter and so can only branch one way per thread; your processor has to be used as a one bit processor for boolean operations, and only one AI object can be processed by each thread. If on the other hand you use test-and-store operations with boolean algebra to manipulate the AI data, and process and store it as data - with as much looping/branching as you like, so long as the branching is not affected by the AI data - then you can use SIMD to process a number of AI objects in parallel. Besides this, there is the efficiency advantage of eliminating the cache misses caused by branching conditionally on the AI. This is especially important for the SPEs (the topic of this discussion thread), because the long branches typical of AI code must be avoided at all costs.
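
To make the test-and-store idea concrete, here is a rough C sketch (the agent fields hunger/threat/fleeing are just invented names). Because no branch depends on the per-agent data, a vectorising compiler can turn the loop into a single SIMD compare across all four lanes; the branching equivalent - if (threat > hunger) flee(); else graze(); - forces a separate branch decision for every agent:

#include <stdint.h>

#define N 4  /* one SIMD register's worth of 32-bit lanes */

void update_flee_flags(const int32_t hunger[N], const int32_t threat[N],
                       int32_t fleeing[N])
{
    for (int i = 0; i < N; i++) {
        /* the comparison result is stored as data, not branched on */
        fleeing[i] = (int32_t)(threat[i] > hunger[i]);
    }
}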

There is one type of AI that is suited to procedural AI code - top level game logic. The reason for this is that the AI processing is extremely trivial and does not require a lot of CPU time. Also, it takes advantage of the one advantage procedural AI has - that the branch takes you to the code to be executed as a consequence of the AI result. While branching simply to run code to store the intermediate results of AI processing is very inefficient, you are going to have to branch at some point at the end of it in order to get to the code that takes the necessary action dependent on the results of the AI processing. With top level game logic, the branching does indeed take you to the code you need to execute to start a particular level or enter a room. With object AI like herd simulation, and complex AI where a large number of boolean flags and complex boolean processing represent the results, you don't want to process the object until all the AI processing is done - at least for that object. For this, procedural AI is not appropriate.

Game simulations run in discrete time steps. The AI is updated after each step (or after some multiple of steps). The simulation would run through the object lists and run the needed methods. This can be done in parallel, and there is no need for an unreliable timer to muck things up.

Read above. The reverse is actually true - timer generated interrupts are reliable and efficient. Timing dependent on code execution cannot be relied on unless you have a single thread with no conditional branching (and even then timing can be indeterminate due to cache misses, OS time sharing, etc.). Locks and semaphores are problematic, require complicated programming effort, and are often a source of problems in programs that use them. The most reliable code, timing-wise, is event driven with a single or fixed number of threads and no or minimal conditional branching other than looping.

Thread-safe code has procedures that can be run independently. Unless a procedure is interrupted in the middle of running through all its branches by something that accesses common variables, the end result will be fine. There is no reason why there needs to be a single piece of procedural code.

Thread safety is not an issue, as thread-safe code can be used with both procedural and event driven code, and with both procedural AI and the "neural net" or "object AI" of the type I am suggesting.

They can be made equivalent in result, but relying solely on boolean operations acting on data means the data model must be more complex. The addition of a few variables leads to a sizeable increase in the amount of (probably wasted) parallel computation.

Blindly computing in parallel and then masking or selecting burns serious resources, since without some kind of smart branching every combination of states must be computed, and then the proper result masked out.

Read above. There is a big difference in performance for complex AI processing.

Boolean algebra is a branch of mathematics, and it is every bit as precise as mapping branches to code. Also, while procedural AI may be simpler to program for some simple non-realtime AI problems where the AI can be mapped directly to code, for more complex AI - especially for synchronised or time critical code - it is much more complicated to program than the event driven coding that lends itself to the data processing approach to AI I am suggesting.

All AI simulations store agent state as data between update cycles. Code itself does not directly map onto a single object, not without self-modification (a big no-no).

Who said anything about code modification? I am talking about treating AI logic data and results as data, not treating program code as data.

I really don't think your definition exists.

Err, try reading my post before answering. What I have said a number of times is that "procedural AI" has no dictionary meaning - it is a term I have coined to mean something specific in my post.

AI is not done the way you say it is

Really? Maybe in last generation RPG game code it hasn't - due to the single threaded nature of last gen games consoles and the primitive level of AI in last gen RPGs. It has been used in the way I say (using branchless operations on boolean data) in very complex simulations such as neural networks and chip simulations. It is also a certainty that the herd simulations and the AI accelerators we have seen do things the same way, for the very reason that games developers have said that the SPE will be poor at doing (procedural) AI, and because it simply isn't practical to do it any other way. The SPE clearly isn't poor at game AI given the selection of the AGEIA PPU as an accelerated AI platform, but it is also clear from what games developers have said that the AGEIA PPU (and SPE) are poor AI platforms if things are done the way games developers have done them in the past.

It is a mistake to assume that, simply because you don't understand how something is done, it isn't done by others or can't be done. Please take a look at
http://www-ee.eng.hawaii.edu/~msmith/ASICs/HTML/Book/CH13/CH13.4.htm
http://www-ee.eng.hawaii.edu/~msmith/ASICs/HTML/Book/CH13/CH13.htm

and branching behavior has no effect on threading or parallel processing.

Read my explanation above. Branching which is dependent on the value of AI variables will prevent SIMD being used for parallelism (i.e. you can only process one-bit-wide boolean values per thread).

You'll have to write out exactly what you think a long branch is.
Branch instructions have a fixed penalty. Large numbers of poorly predicted branches are a performance minus. They have no length.

Short branches are branches within sections of code that are small enough to be held entirely within the instruction cache or the SPE's local store.
Some basics about instruction caching first. Branch prediction is not the only, nor even the most important, issue in relation to CPU instruction caching. Before processors capable of branch prediction, instruction caches only stored the instructions already executed. Most speed critical code in programs uses short backward loops. The instructions branched to are already in the cache, so there is no penalty for these branches, and there is no need to load instructions from main memory at all since they are already in the cache - so this is even more efficient than a correctly predicted long branch. Even with the arrival of branch prediction, this is still the most important aspect of instruction caching, given that tight backward loops can pretty well be guaranteed in time critical sections of code.

If you can make the branches in your time-critical sections of code short enough for the code to fit into the cache or the SPE's local store, then you can effectively avoid cache misses. On the SPE especially, to get efficient processing the code must be executed from the local store and run over many AI objects in sequence, which gives the equivalent of zero cache misses and no instruction-load latency. This is easily achieved by processing the AI data object by object for a large number of AI objects in sequence - storing the results in main memory using the same code loaded into the SPE's local store - and then using another process to handle the actions resulting from the AI processing (eg. the animation of the objects) serially later on. Separating the AI logic from the results-handling code in this manner means you don't have to put the code for handling the results of the AI in between the branches, and so keeps the branches short.
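
As a rough sketch of this separation (all names invented), the AI pass stores a small result word per agent and never branches on agent state, while a separate pass - run later, perhaps on a different core - does the branching to act on the results:

#include <stdint.h>

typedef struct {
    int32_t state[4];   /* packed AI inputs for one agent */
    int32_t result;     /* decision word written by the AI pass */
} Agent;

void ai_pass(Agent *a, int count)        /* compact, branch-light, streamable */
{
    for (int i = 0; i < count; i++) {
        int32_t hungry = (a[i].state[0] > 50);
        int32_t scared = (a[i].state[1] > a[i].state[2]);
        a[i].result = hungry | (scared << 1);   /* stored, not branched on */
    }
}

void action_pass(Agent *a, int count)    /* run serially later on */
{
    for (int i = 0; i < count; i++) {
        switch (a[i].result) {           /* the branching, deferred to here */
        case 1: /* play grazing animation */ break;
        case 2:
        case 3: /* play fleeing animation */ break;
        default: break;
        }
    }
}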

Procedural AI, as implemented in most top level AI code, has lots of code between the branches, because it contains the code that acts on the results. In top level AI this is usually huge, because it represents code for different game levels or environments. As such it contains very long branches which won't fit into the SPE's local store, and so it is not really suitable for running on the SPE. This is the reason why many developers have said that the SPE is not good for AI. However it is not a problem, because this type of AI code is not time critical, does not require intensive processing, and can easily be handled by the PPE.

It doesn't even need to be event driven. That's how it's usually done. However, how the updating is done is highly dependent on how complex the relationships and utility functions are.

Read above about event driven vs procedural code for real-time processing.

Not all AI decisions are solely based on A>B do C.

Did anybody say they are?

Unless the logical code and masking are being run on a particularly complex data model.
Branches have a fixed penalty.
Data expansion and computational complexity start out smaller, but they grow fast.
Adding something like a variable with four possible states will quadruple the amount of data that will need to be masked.

As I said before, certain things like simple top level game logic/AI are easier to map to procedural code, and computational efficiency is not an issue because the processing is trivial and not time-critical. The traditional procedural AI approach you are talking about is probably the best approach for this. However, as several developers have pointed out, this approach is unsuitable for implementing on the SPE, and so a different approach is required. Also, as I explained above, the procedural AI approach is inefficient for compute intensive AI processing, and complicated to program for realtime AI.

This is nothing new, that's how it's done.

The point here is that you don't need to branch conditionally on the AI results in the AI processing code, since this can be deferred. This allows the AI code to be compact enough to fit into the SPE local store, especially if you write event driven code to handle specific object types and then terminate, rather than your all-encompassing procedural AI code which is all linked together by a huge network of branches and forks.


What you neglect to cover is exactly how well the data can be updated when the desired update is based on complex rule and variable relationships. Using selects and masking means a lot of work is done in parallel and then discarded. The wasted work takes a non-trivial amount of time, and it multiplies with every new state combination.

You need to read up on boolean algebra and learn to formulate problems in terms of boolean operations instead of thinking in terms of the procedural mindset. In other words, think about how to organise the boolean logic data structures in the first place and process the boolean data as data, rather than thinking about how to write the code in terms of branching, packing data or picking out packed data using masks, and then thinking about how you would convert that into branchless code. Instead, think about how you would organise the data and process it with branchless boolean operations in the first place.

The compare instruction generates a boolean value which can be stored rather than used for a conditional branch. You can also, for example, take the sign of A-B, which will give you 1 if A>B, -1 if A<B and 0 if A=B, and store this value as a byte or a word, which can save you doing two compare operations. There is also no need to pack boolean data unless it is going to speed up processing by allowing it to be done in parallel (which it will do for very complex operations on boolean tables), since boolean data does not take up much space. Masks are easy to use and fast though, if you need to pack data. Even if you don't pack data, you can do operations on 4 objects at a time just by using the SPE's SIMD vector processing capabilities. If you pack data, you can get up to a 256-fold increase in speed in processing the data, which may be useful if you have some really complex processing such as true neural network simulations, or CPU chip simulations. Instead of using a select branch to set values, you can read boolean, integer or other values directly out of a table using indirection.

Also, not all the AI data needs to be stored in boolean form, and a mixture of boolean and ordinary arithmetic can be used for processing. For example, you may want to store an integer value representing a factor such as anger, which is incremented or decremented by different amounts dependent on the status of various flags (or various tests). This can be done easily by mixing arithmetic and boolean operations - for example, summing into the integer AI datum representing anger each flag's weighting ANDed with the value of the flag.
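
Here is a minimal sketch of those two tricks in C (anger, provoked and tired are invented names):

#include <stdint.h>

/* sign(A-B): 1 if A>B, -1 if A<B, 0 if A==B - one stored result
   instead of two compares and two branches */
static int32_t sign(int32_t x)
{
    return (x > 0) - (x < 0);
}

/* each weight is ANDed with an all-ones/all-zeros mask made from its
   flag, then summed into the anger value - arithmetic mixed with
   boolean operations, no conditional branch on the AI state */
void update_anger(int32_t *anger, int32_t provoked, int32_t tired)
{
    int32_t provoked_mask = -(int32_t)(provoked != 0);  /* 0 or all ones */
    int32_t tired_mask    = -(int32_t)(tired != 0);

    *anger += (5  & provoked_mask);   /* +5 if provoked */
    *anger += (-2 & tired_mask);      /* -2 if tired */
}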

As I said, anything you can do by branching, you can do with boolean operations, the only limiting factor is your mindset in approaching the problem. As I said before, the only other limiting factor might be the procedural mindset of some CPU instruction set designers. This is not a major problem on more modern CPUs though.

As for speed, look at the instruction clock cycles - there is nothing a CPU can do faster than boolean and integer arithmetic instructions. Branching on a comparison has a bigger performance hit, and after having branched you still have to store or manipulate the AI data. Don't forget also that if you can store the AI object's data in the SPE local store, the reading and storage of intermediate data is a lot faster than to and from main memory. The only benefit you get with procedural AI is if code has to be executed as a result of the AI test, rather than just storing or processing AI data. However, except for very trivial AI such as is found in top level overall game logic/AI processing, even this can be done more efficiently - especially on the SPE - by working on an object by object basis, with the object processing code loaded into the SPE's local store and the data for objects of this type loaded and processed one by one in a stream. In this case the code will need to branch depending on the value of the AI data, but since all the instructions and data are in the local store, there is no main memory access on repeated reads and a much smaller hit on branching. What is more, if you have enough space in the SPE's local store to load the next object while processing the last, there is zero time lost waiting to load from and store to main memory. Put all of these together, and you get a massive increase in AI performance.
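
A minimal double-buffering sketch, assuming the Cell SDK's spu_mfcio.h interface (batch size and layout invented for illustration; ea is the effective address of the agent array in main memory, 16-byte aligned):

#include <stdint.h>
#include <spu_mfcio.h>

#define BATCH 1024   /* bytes of agent data per DMA */

static char buf[2][BATCH] __attribute__((aligned(128)));

void stream_agents(uint64_t ea, int batches)
{
    int cur = 0;
    mfc_get(buf[0], ea, BATCH, 0, 0, 0);              /* prime buffer 0 */

    for (int b = 0; b < batches; b++) {
        int nxt = cur ^ 1;
        if (b + 1 < batches) {
            mfc_write_tag_mask(1 << nxt);             /* let buf[nxt]'s old put drain */
            mfc_read_tag_status_all();
            mfc_get(buf[nxt], ea + (uint64_t)(b + 1) * BATCH, BATCH, nxt, 0, 0);
        }

        mfc_write_tag_mask(1 << cur);                 /* wait for the current batch */
        mfc_read_tag_status_all();

        /* ... run the AI code over buf[cur], entirely from local store ... */

        mfc_put(buf[cur], ea + (uint64_t)b * BATCH, BATCH, cur, 0, 0);
        cur = nxt;
    }

    mfc_write_tag_mask(3);                            /* drain the final puts */
    mfc_read_tag_status_all();
}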

That's mixing up the low-level and high-level types of "procedural". They are completely separate things. The SPE will be running programs written like what is seen in C and C++. That's going to be imperative and procedural code.
I am not sure where your earlier definition of "procedural" fits at this level.

As I said before, when I say "procedural AI" it means something very specific. What I call procedural AI is used in high level (as in top level) games logic/AI. It does not mean high level in the sense of being more intelligent or sophisticated, in fact quite the reverse.

The method I am suggesting is "low level" in the sense that the code has to be broken down into a mathematically simpler form that can be translated into boolean algebraic operations on data, and in the sense that the object processing code has to be more specific in order to fit into the SPE's local store (or into cache). However, the AI achieved in this way is not "low level" at all, since the intensity and therefore the complexity of the AI processing can be way in excess of what is possible with the "high level" kind. In most cases it will also add to the top level game logic/AI rather than replace it.
 
pc999 said:
Are you sure about this, because in the site they talk about
http://www.aiseek.com/FAQ.html#7

this doesn't seem like Cell/PPU, which have very different configurations - unless you mean Cell in a very broad sense (one head core plus a few "inferior" ones). Although it should also use some intensive maths, so maybe in that respect it is similar.

Can you share a bit more info on this?

It is incorrect to say that the SPE is not good for branch intensive code. What must be avoided on the SPE is long branches that require a code size that won't fit in the SPE's local store and therefore has to be reloaded. The SPE is poor for branching if the code can't run from the local store. If the code can run from local store, then the SPE is superior to everything else for branching, because it is effectively running everything from very fast, low latency local memory, and there are never any cache misses.

Maximal AI performance on the SPE requires fitting the code in the SPE's local store and stream processing the AI. It also means using vector processing where possible, which means branchless code (at least from the point of view of branching conditionally on the AI data). AI data does not usually occupy much space, so fitting the data and code into the SPE local store isn't usually difficult, provided the AI code is separated from other code.

Those games developers who said the SPE is poor for AI are right. The SPE is poor for AI if the same "procedural AI" code that is used in last gen games is used. However with appropriate algorithms the SPE is far superior in AI performance if you need it.

The SPE is also poor at other things that are better done on the PPE - eg. stack and pointer operations on large data structures stored in main memory. However, for searching through a maze or tree where blocks close to the object concerned can be loaded and searched separately, the SPE can process these very efficiently by loading the blocks into local store for stream processing using scatter/gather DMA.
 
SPM said:
Of course you haven't encountered the terms "procedural AI" and "neural net type AI". I have said more than once that these are just terms I have coined in this thread to mean very specific things, since more appropriate terms don't exist.
The problem is that those terms are already in use. I don't go into an italian restaurant, order a pizza, and then say afterwards that what I meant by pizza was a burlap sack full of used oil filters.

PROCEDURAL CODE
Where every part of the code is reached by branching or by spawning child threads/processes from a parent thread/process. Procedural code can be single or multi-threaded, but there is a programmed branch or fork instruction linking the execution of every bit of code to every other; any executing code must follow a direct thread of execution from the master process.

I like my programs to be able to run every portion of their code.
Aside from the initialization of a thread, there is no requirement that the new threads have anything to do with the main one. It would help if they did, but there is no requirement.

EVENT DRIVEN CODE
Where the code is written in sections that are completely disconnected from each other, are initiated independently of program control, and have no thread of execution connecting them.

Programming languages that have subsections that can run independently are often functional languages where each function takes data and runs with it in its own little context. Thread-safe code in almost any language can be made to do so.

Event-driven programming is often used for things like GUIs, where individual buttons and selectable areas are given event listeners that allow the program to seamlessly handle user input.

However, under the hood, there is a core process being run in the OS. Something has to take messages from the listeners, poll the buffers, watch the mouse coordinates, etc. There is always a thread of execution somewhere tying them together.

Event listeners are also a performance consideration. They are packages of code that take a nonzero number of cycles in the processor and memory bus to watch out for a non-predictable input. If a thousand AI agents did the same thing, the processor would be sitting there asking "is there an input? no? well I'll check again" more than anything else.

The reason I coined the term "procedural AI" is because the outcomes of the AI tests are mapped to branches in the code - hence branching to handle the processing of the different AI outcomes is procedural in the same way that branching to different sections of code makes procedural code procedural. Get it?

I'm going to state my definition of procedural code, before I try to tackle yours.

At the most basic high level, a procedural language lets the programmer do this:

function foo(argumentA, argumentB){
do something with A and B

return something
}

elsewhere:

blah...
foo(argumentA, argumentB)
...that's it

What your definition says is that the program counter's behavior is affected by the data state of the AI objects. In other words, the program behaves differently when the AI objects are in different states. I'll get back to this later.

Using procedural code is fine for batch programs (and non-realtime RPG-type games), but procedural code is a poor way of handling real-time simulations. How would you keep the AI processing in sync with the object animation processing, for example?
Synchronization and resource contention are the biggest problems with multithreading anywhere. Event driven programs don't get around this. Each code segment will still need to arbitrate access to limited resources and buffers.
If one event object needs to pass its results to another object, they will have to synchronize. There's no magical way to get around a data dependency.

Unless there's an absolute need for synchronization, these rules can be relaxed, however. If they aren't working on the exact same data structure at the same time, threads can do whatever they want. A general flag or just code arrangement can keep them in their own little sandboxes.

How would you ensure the AI processing doesn't get out of sync with other synchronous game processes like animation that need to be completed before a certain time? How would you get a game written in procedural code to respond to external events like user input?

The same way it's done everywhere else. These aren't new problems, and since we have AI in other games, it's been done before.

The game's operation is divided into phases: the AI has its phase, inputs are polled, and the graphics routines have their run of the processor. If properly done, the graphics commands are fed into a circular buffer, and the GPU runs through it as long as it has commands.
It has to be done like this, because having code running willy-nilly, accessing memory and input buffers and doing weird things, gives you either bad performance, bad results, or a system crash.

With procedural code, keeping execution synchronised with other processes - and even between multiple threads within the same procedural code - is difficult, messy, and inefficient.
Having a million event listeners would be insane. Software cannot take any action, process, or even listen, unless it runs through a processor's execution pipeline. If it runs through a processor's execution pipeline, it is part of a thread's execution. This is the same for the SPE as it is for any current silicon microprocessor.

If a program's sections or individual objects are all listening for inputs, they must take up time slices in the core to do so. If there are a million of them, they will have the chip spinning its wheels every time they ask if some rare event has happened yet.

Instead, it is far more efficient to tell every object to do its job once in the update phase. So long as the game's engine tick is small, the user will never notice.
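
Sketched with invented names, the conventional loop looks like this - nothing listens, everything simply gets its turn each tick:

#include <stdbool.h>

void poll_input(void);            /* stubs standing in for the engine subsystems */
void update_physics(float dt);
void update_ai(void);
void submit_draw_commands(void);

void game_loop(volatile bool *running)
{
    const float dt = 1.0f / 60.0f;   /* fixed engine tick */
    while (*running) {
        poll_input();                /* inputs read once per tick */
        update_physics(dt);          /* move objects, flag collisions */
        update_ai();                 /* every agent does its job once, here */
        submit_draw_commands();      /* the GPU chews through its buffer */
    }
}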

This is why developers have been saying that multi-threaded real-time games for the Xbox 360 are going to be very difficult to program (even though writing code to run under SMP is transparent). The code is easy to write, but synchronising the threads so that they deliver their results in realtime using procedural programming is very complicated.
Anything where data is modified out of lockstep is harder to write. It's complicated and often unpredictable.

If a program were event-driven, the programmer still needs to worry about what would happen if event X happened before or during event Y.

You could get the AI update process to sleep and wake every so many seconds to update the AI (this uses a timer and an interrupt to wake up, by the way), but this will get out of sync with other synchronous processes, since the execution time of procedural code with conditional branches, multiple threads and variable numbers of forked threads can neither be predicted nor kept constant.
The action of going to sleep, waking up, or peeking at the timer is an operation with a performance cost. Timers are also no guarantee of synchronization, since they rely on something that isn't even on the same chip.

You would have to put in locks and semaphores and make threads wait for other threads in order to synchronise them with other threads/processes, but this quickly becomes very complicated and cumbersome to manage, and inefficient, because a lot of tasks will be waiting for others to complete rather than completing as soon as possible. This is why some developers have said that the first generation of games won't fully use the Xbox 360's three cores.
This only happens if all the threads are accessing the same areas in memory or the same buffers/ports. Beyond this contention, the threads don't even need to know the others exist.

In procedural code, if you need to detect an event such as user input or a collision between objects and respond to it - such as having a player respond to being punched, for example, with appropriate AI - the only way you can do this is to continuously poll a memory location or I/O port for the event. This is a grossly inefficient use of CPU time, especially in a realtime game scenario.

No, that's not what is done. The game cycles to the physics simulation update phase. It then scans through the simulated world, finds which objects have movement vectors that indicate a collision is occurring, and flags that part for further work.

An event listener is what sits there polling forever waiting for something to happen.
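
A rough sketch of that scan (a naive all-pairs sweep; real engines cull with spatial structures first, and all names here are invented):

typedef struct {
    float pos[3], vel[3], radius;
    int   hit;                       /* flagged for the response phase */
} Body;

void collision_phase(Body *b, int n, float dt)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            float dist2 = 0.0f;
            for (int k = 0; k < 3; k++) {
                /* project both bodies forward one tick, take the gap */
                float d = (b[i].pos[k] + b[i].vel[k] * dt)
                        - (b[j].pos[k] + b[j].vel[k] * dt);
                dist2 += d * d;
            }
            float r = b[i].radius + b[j].radius;
            if (dist2 < r * r)
                b[i].hit = b[j].hit = 1;
        }
}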

Event driven code is appropriate and necessary, rather than procedural code, for most realtime AI and realtime simulation situations in games.
I'm not aware of who actually does it this way.

This is underlined by the fact that it is nigh impossible to write any decent GUI non-realtime computer application without most of the code being event driven - procedural code just cannot respond to user input fast enough or efficiently enough in a multi-threaded multi-tasking environment.

Event-driven programming works for the upper layers of a GUI because it is not a performance-critical part of the system. The user is slow, so updates need only happen within a very coarse update schedule compared to other parts of the program.

Most importantly, there's only one user, who puts in actions one at a time.

The user's input is almost never too fast for the system, it's the other way around.

Also, I am not suggesting the event driven code should be triggered separately for each object as you seem to think - quite the opposite in fact, since I am talking about vector processing of AI and stream processing of the animation of objects - but a timer/interrupt can be used to trigger a process when and if required, rather than having the process wait in memory.
Relying on an interrupt or a timer for something that needs to be done regularly is sloppy and unreliable.
Timers can be inaccurate, and interrupts can be preempted by interrupts of higher priority.

Interrupts are specifically intended for things that cannot be readily predicted or are time critical on the time scale of the electronics (hence the term interrupt).

The cycling of the game engine is completely predictable, and the user isn't that picky.

Kicking something out of memory that needs to be used regularly is also not a good idea.


There is one thing that procedural code for AI is good at though, which I mentioned earlier as better suited to running on the PPE: the simple top level game logic - eg. the code that takes you to different levels or into different rooms in a maze. This is not running in realtime, is very simple in terms of AI complexity, and so does not require efficient AI processing.
If it's that simple, it doesn't need to worry where it's run.
On the other hand, if it has to do with making decisions regarding the environment, AI agent state, or strategic decisions, efficiency is paramount.

In any case, if the decisions being made involve more than a trivial amount of interrelationship between state variables, or the behavior is more complex, or the game world is dynamic, or there is planning, then there will be an explosion in the number of states the hardware must process.

I think you still haven't got this point, so I will try to explain again. I didn't say that the code has to be single threaded; I said that if you use procedural AI as I have defined it, you can't use SIMD (or data packing) to process AI objects in parallel.
This is highly dependent on the kind of computation involved. A complex AI state can still use data packing, and very, very simple objects can conceivably be packed as well.
What becomes a problem is if the relationships in the simulation do not line up to the static lineup of the code.

Of course you can fork out multiple threads/processes, but unless you follow my rules of processing AI as data and making sure that branches are not conditional on AI state, you can only ever process one object at a time per thread, on a "boolean bit at a time" basis.
Unless there is some shared memory location being fought over, there is absolutely no reason for multiple threads to stop one another.

This is grossly inefficient for any very intensive logic processing. You have a 32 or 64 bit CPU with 4 vector units and a lot of internal registers, which can process a huge amount of data in parallel in the SIMD vector units (4 boolean true/false flags in 64 bit word format, or 4 x 64 = 256 bits of flags in packed boolean flag format). However, with procedural AI code, where a conditional branch reflects true or false in an AI test, you are using your state of the art SIMD 64 bit processor as a one bit (true/false) processor, because your processor can't do more than one branch decision at a time.
Not all AI calculations are simple true/false, or at least they cannot be described compactly in those terms. I'd hazard to say the interesting ones are much more complex.

Also, if you branch just to run code to store the AI results, instead of simply computing the AI results and storing them directly, you add a big execution overhead on top of that: with the long branches typical of procedural AI code you will get a cache miss and incur penalties on execution speed.



There is no such thing as a long branch; a branch is a single instruction, and its penalty is the length of the pipeline from instruction fetch to branch resolution (for the SPE, this is ~20 cycles, ~20 instructions).

That is the primary penalty of the branch, and it is fixed, regardless of problem size.

Boolean algebra is a branch of mathematics, and it is every bit as precise as mapping branches to code. Also, while procedural AI may be simpler to program for some simple non-realtime AI problems where the AI can be mapped directly to code, for more complex AI - especially for synchronised or time critical code - it is much more complicated to program than the event driven coding that lends itself to the data processing approach to AI I am suggesting.

It is precise, but it also presumes some serious resource consumption.

Producing identical program behavior without resorting to a jump or conditional jump involves some kind of if-conversion, conditional moves, or instruction select. They transform a control dependency into a data dependency, trading execution bandwidth for improved latency.

Other architectures use these, but very VERY carefully, because the code growth, execution unit consumption, and memory consumption rise very quickly.

If a given bit of code must produce outcomes dependent on 4 different possible states, there is a branched and non-branched way.

Branched:
if (case 1)
do this
else if (case 2)
do something
else if (case 3)
do something to this
else if(case 4)
do something here

Non-branched:
do case1 and case2 and case3 and case4
keep the valid case (by selective store, masking results, ORing with a storage location)

Great, right?
Only with branching, it is highly likely that only one or two cases at a time will ever be accessed.
With the non-branched version, the processor must do them all every single time.

What happens if another variable is involved (hopefully just a boolean)?

The branched code gets longer, perhaps doubles in length, but at any time it is likely only one or two cases will be used.

The non-branched version will have to run twice as many cases, using twice as much memory, and twice the number of execution units.

Oops, but there aren't twice as many resource units to handle just a boolean flag.

Branched code trades time against finite resources.
Non-branched code expends finite resources to save time.

For complex AI, the resource demands are large. You can pare them down somewhat, with selective branching of course.
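
The four-case example above, written out both ways in C (toy operations, and state is assumed to be in 0..3):

#include <stdint.h>

int32_t branched(int32_t state, int32_t x)
{
    if (state == 0) return x + 1;       /* only the taken case is computed */
    else if (state == 1) return x * 3;
    else if (state == 2) return x - 7;
    else return x / 2;
}

int32_t branchless(int32_t state, int32_t x)
{
    int32_t c0 = x + 1, c1 = x * 3, c2 = x - 7, c3 = x / 2;   /* all four, every call */
    int32_t m0 = -(int32_t)(state == 0), m1 = -(int32_t)(state == 1);
    int32_t m2 = -(int32_t)(state == 2), m3 = -(int32_t)(state >= 3);
    return (c0 & m0) | (c1 & m1) | (c2 & m2) | (c3 & m3);     /* keep the valid case */
}

Add one more boolean input and the branched version roughly doubles in code size, while the branchless version doubles the work it actually executes.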

It has been used in the way I say (using branchless operations on boolean data) in very complex simulations such as neural networks and chip simulations.
Chip simulations work well with boolean math because that's what they do. When the program is trying to simulate a world that does not reduce down to boolean terms without some serious elaboration, the amount of work involved in getting useful results goes up.

It is also a certainty that the herd simulations and the AI accelerators we have seen do things the same way, for the very reason that games developers have said that the SPE will be poor at doing (procedural) AI, and because it simply isn't practical to do it any other way.
Whatever "procedural" they were talking about is not your definition of procedural. What they say cannot support your argument unless you clarify how what they are talking about relates to your definition (which only you use).

It is a mistake to assume that, simply because you don't understand how something is done, it isn't done by others or can't be done. Please take a look at
http://www-ee.eng.hawaii.edu/~msmith/ASICs/HTML/Book/CH13/CH13.4.htm
You've confused the higher-level event-driven simulation for the lower-level implementation.

The event-driven part is the simulation. The actual code being run is plain old procedural.
An event list is created, and events are added to the queue. The program then creates a timer for the simulation, and evaluates events whose time matches the timer. The actual hardware running the program couldn't care less about the time.

It's very conventional past the conceptual behavior of the data objects.

Read my explanation above. Branching which is dependent on the value of AI variables will prevent SIMD being used for parallelism (i.e. you can only process one-bit-wide boolean values per thread).
If the objects' behavior varies that wildly, they should not be grouped.
If they are grouped, it better be because the performance sacrificed to running code for every contingency is less than the waste from branch overhead. For a narrow design like the SPE, the threshold is lower.


Short branches are branches within sections of code that are small enough to be held entirely within the instruction cache or the SPE's local store.

A branch is a single instruction or pair of instructions used to indicate possible movement of the program counter to an address that does not immediately follow the last instruction. That's it.

What you might be talking about is called a basic block, the straight line code between branches. Computers like large basic blocks, the more the merrier. Straight-line code is awesome, so long as the work involved in removing branches does not exceed the (constant) branch penalty.



Some basics about instruction caching first. Branch prediction is not the only, nor even the most important, issue in relation to CPU instruction caching. Before processors capable of branch prediction, instruction caches only stored the instructions already executed.
Storing previously accessed data or instructions is all a cache does. What branch prediction did was allow a processor to avoid the massive drop in throughput involved in waiting for a branch to resolve before fetching instructions into the pipeline (from memory or cache). Depending on the length of the pipeline, the penalty was small or large.

Most speed critical code in programs uses short backward loops. The instructions branched to are already in the cache, so there is no penalty for these branches, and there is no need to load instructions from main memory at all since they are already in the cache - so this is even more efficient than a correctly predicted long branch.

Code is automatically prefetched during the course of execution. For the SPE, it is batched in on demand.

Small loops can wreak havoc on parallel execution, since the code being repeated uses the same register names over and over, forcing the processor to stall. What is usually done for in-order designs is compiler-driven loop unrolling, which converts the loop into a long stream of instructions with far fewer branches.
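
For example, unrolled by four and given independent accumulators (a hand-written sketch of what the compiler produces):

#include <stdint.h>

int32_t sum(const int32_t *a, int n)    /* n assumed a multiple of 4 */
{
    /* one backward branch per four elements, and four independent
       accumulators so an in-order pipeline isn't serialised on one register */
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;
}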

Even with the arrival of branch prediction, this is still the most important aspect of instruction caching, given that tight backward loops can pretty well be guaranteed in time critical sections of code.

Without hardware capable of speculation and register renaming, a tight loop on a long pipeline leads to unnecessary stalling to avoid false data dependencies. If they can, programmers will always go for longer basic blocks (or in your terms, long branches).

Procedural AI, as implemented in most top level AI code, has lots of code between the branches, because it contains the code that acts on the results.
Chips like this. Cell likes this.
AI code that makes complex decisions uses branches because they avoid a lot of the exploding complexity of mapping out decisions and future states.

This is the reason why many developers have said that the SPE is not good for AI. However it is not a problem, because this type of AI code is not time critical, does not require intensive processing, and can easily be handled by the PPE.

Consider a bot that must decide between running for health through a kill-zone or stopping and calling for suppressive fire.

For uniform behavior like finite state machines and simple herds, the SPE will do well.
For complex stuff, the SPE suffers just as much or worse than other chips. No silicon does high-level AI well.

As I said before, certain things like simple top level game logic/AI are easier to map to procedural code, and computational efficiency is not an issue because the processing is trivial and not time-critical.

No, that is mostly backwards. The pure number-crunching and tree traversal that the SPE excels at would work best for simple top-level games - like tic-tac-toe, for example. The logic states are simple, and the optimal solution can be mapped out very quickly as a numeric value. A decision tree can be scanned through very easily.
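
The whole tic-tac-toe tree is small enough that a toy negamax scan solves it outright:

/* board: 9 cells, +1 for one side, -1 for the other, 0 empty */
static const int lines[8][3] = {
    {0,1,2},{3,4,5},{6,7,8},{0,3,6},{1,4,7},{2,5,8},{0,4,8},{2,4,6}
};

static int winner(const int b[9])        /* +1, -1, or 0 */
{
    for (int i = 0; i < 8; i++) {
        int s = b[lines[i][0]] + b[lines[i][1]] + b[lines[i][2]];
        if (s == 3)  return 1;
        if (s == -3) return -1;
    }
    return 0;
}

static int negamax(int b[9], int player) /* best achievable score for 'player' */
{
    int w = winner(b);
    if (w) return (w == player) ? 1 : -1;

    int best = -2, moved = 0;
    for (int i = 0; i < 9; i++) {
        if (b[i]) continue;
        moved = 1;
        b[i] = player;                   /* try the move... */
        int v = -negamax(b, -player);    /* ...score it from the other side */
        b[i] = 0;                        /* take it back */
        if (v > best) best = v;
    }
    return moved ? best : 0;             /* full board, no winner: a draw */
}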

More complex environments with more agents require all the efficiency that can be mustered, because a similar decision tree could have millions of nodes.

Also, as I explained above, the procedural AI approach is inefficient for compute intensive AI processing, and complicated to program for realtime AI.
Any software approach to AI is going to be inefficient. Branches are not necessarily worse once the problem grows more complex.

The point here is that you don't need to branch conditionally on the AI results in the AI processing code, since this can be deferred. This allows the AI code to be compact enough to fit into the SPE local store, especially if you write event driven code to handle specific object types and then terminate, rather than your all-encompassing procedural AI code which is all linked together by a huge network of branches and forks.

Decisions like:
move forward, so long as the cow's not thirsty, or the wolf is not here, or the grass is green, or the clouds don't look like rain

Compact code is no good if the relationship between AI state, world state, and optimal decision making are more complex than that.

You need to read up on boolean algebra and learn to formulate problems in terms of boolean operations instead of thinking in terms of the procedural mindset. In other words think about how to organise the logic boolean data structures in the first place and process the boolean data as data, rather than thinking about how to write the code in terms of branching and pack data or pick out packed data using masks and then thinking about how you would convert that into branchless code.

Try to sit down and plot out a conceptual AI agent. Think of all the things it must know, must remember, and must take into consideration.

Say for a cow.
Does it care about how many neighbors it has? That's a state variable.
Does it care how full it is? That's another.
Thirst?
Location?
Height?
Line of sight?
Terrain?
Allergies?
Does it like Bessie after Bessie mooed at Bertha?
What about the 1/1000 chance a wolf will show up?
The 1/200 chance of a bee?
What if the wind is below 20 mph, but above 5, but since it's sunny, the range should be upped?

Decide how interesting you want its behaviors to be, what the acceptable range of values are for each state variable. Decide if the cow tries to plan out its day.
Then model it in a static software context (something like the struct sketched below).

Count every memory access and mathematical operation. Decide if a whole slew of math ops is worth avoiding a branch.

I'm sure you can do it. Then decide if the work you went through will even be noticed by the user.
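
Just writing the sheet of state variables down as a struct (fields straight from the list above) shows how quickly it grows:

typedef struct {
    int   neighbours;     /* how many cows nearby */
    int   fullness;       /* stomach contents */
    int   thirst;
    float location[3];
    float height;
    int   in_sight;       /* what it can currently see */
    int   terrain;
    int   allergies;      /* bitmask */
    int   likes_bessie;   /* the Bessie/Bertha grudge */
    float wolf_chance;    /* the 1/1000 wolf */
    float bee_chance;     /* the 1/200 bee */
    float wind_mph;       /* feeds the sunny-day range rule */
    int   sunny;
} CowState;

Every field is another dimension the branchless version must compute across on every update.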

Masks are easy to use and fast though, if you need to pack data.
Masking presumes a certain portion of output data is going to be eliminated.
A non-trivial AI problem will give you a fixed amount of correct output and a sea of discarded data. The chip will have spent time on the wasted data regardless of whether it was outnumbered by real data or outnumbered real data 1,000,000,000 to 1.

For example, you may want to store an integer value representing a factor such as anger, which is incremented or decremented by different amounts dependent on the status of various flags (or various tests). This can be done easily by mixing arithmetic and boolean operations - for example, summing into the integer AI datum representing anger each flag's weighting ANDed with the value of the flag.
But what if you want the chimp to only throw bananas if he has anger>8 but only if the bananas are off-green and the bratty kid has brown hair?

As I said, anything you can do by branching, you can do with boolean operations, the only limiting factor is your mindset in approaching the problem. As I said before, the only other limiting factor might be the procedural mindset of some CPU instruction set designers. This is not a major problem on more modern CPUs though.
The limiting factor for branch conversion is the gargantuan amount of wasted execution it involves.

As for speed, look at the instruction clock cycles - there is nothing a CPU can do faster than boolean and integer arithmetic instructions.


Branching on a comparison has a bigger performance hit, and after having branched you still have to store or manipulate the AI data.

AI's limiting factor is computational complexity, not branching. A kludgy branch is better than 1000 wasted multiplies.

Within a certain range of behaviors, SIMD and some limited branch conversion is a win. For interesting behaviors, not so often.
 