Physics Processor - AGEIA (Cool Demo)

Discussion in 'GPGPU Technology & Programming' started by rwolf, Mar 8, 2005.

  1. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    The more threads you have, the better you can leverage support for multiple processors. As long as the overhead for each thread is small compared to the amount of processing that must be done in each thread, performance should not degrade significantly when run on a single-processor machine either.
     
  2. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    Try it....... with a non-shared-memory model.
    AI is not trivial to do this way; it only looks like it on the surface.

    Also if you launched 1 thread per AI entity, you'd be killed by the thread overhead.

    Sure you can do it but it's hardly trivial.
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Why not? I mean, sure, AI actors may want information as to the location and behavior of other nearby actors, but it's not like they can't work on the previous frame's data (it'd be more consistent that way, anyway).

    Well, when the multithreading is this trivial, you can obviously scale the number of AI entities per thread just as trivially.
     
  4. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,242
    Likes Received:
    615
    Just to state the obvious ... not something you have to worry about on a PC.
     
  5. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    953
    Likes Received:
    51
    Location:
    LA, California
    I also question whether the AI/physics computations are so trivially interlocked. AI seems like one of those things with components you'd want to amortize over multiple frames.
     
  6. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Not really that specialized. Anyone doing multithreaded programming uses these primitives. The problem is, debugging multithreaded programs is fraking hard. Race conditions are usually timing sensitive and hard to reproduce. Deadlocks and stuff like priority inversion can arise easily. Academic languages designed to make it harder to trip over these problems generally suck at performance.

    It's like how using Haskell, you can avoid lots of bugs that can trip up C programmers (segfaults, memory leaks), but Haskell performance sucks.

    Concurrent Clean seems to perform well, but any garbage collected language will tend to impose too much non-determinism on devs who want real-time behavior.
     
  7. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Um, yeah, that's kinda the definition of a specialized language. See, I'm not a computer science major. My work in programming to date has been either hobby-based or for physics simulations. Anyway, I've been doing my parallel programming via MPI so far, and haven't come across any of those terms yet in the reference material I've been using.

    See, this is why I'm really wondering why it shouldn't be possible to do good multithreaded programming without problems. In the programming I do, I've always made heavy use of pointers and the new and delete operators. But I'm also very careful about how I handle them, so that segfaults and memory leaks just aren't a problem.

    All that I do is:
    1. For simple new/delete pairs I simply search the source file and ensure that each new operator is paired with a delete operator.

    2. For more complex systems, I first build a base class that encapsulates the most basic functionality I want to implement. This way I can keep the entire operation of the code in my head. I then test the hell out of this base class, to make absolute certain that there are no memory leaks. Then I just derive this class for whatever more complex functionality I want to make use of and go from there.

    And these work great for me. I don't see why similar things wouldn't help large teams of programmers, too, provided that the company has strict policies on coding and documentation, so that everybody's on the same page when it comes to writing code.

    I mean, why not just do a similar thing for multithreaded code? Either limit yourself to completely independent code, or implement the most basic functionality you want to make use of in a stripped-down base class first, test the hell out of it in a prototype environment, and then just derive this class later when you want to use this functionality.
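
    Something like this minimal sketch is what I have in mind (class names are just for illustration; in modern C++ terms it's essentially RAII):

    #include <cstddef>

    // A stripped-down base class that owns its one allocation. The new[]
    // and delete[] pair live in exactly one place, so once this class is
    // tested, every derived class inherits the tested memory handling.
    class Buffer {
    public:
        explicit Buffer(std::size_t n) : data_(new double[n]), size_(n) {}
        virtual ~Buffer() { delete[] data_; }    // each new[] has its delete[]
        Buffer(const Buffer&) = delete;          // forbid accidental double-delete
        Buffer& operator=(const Buffer&) = delete;
        std::size_t size() const { return size_; }
    protected:
        double* data_;
    private:
        std::size_t size_;
    };

    // Derived classes add behavior without touching allocation at all.
    class Field : public Buffer {
    public:
        explicit Field(std::size_t n) : Buffer(n) {}
        void zero() { for (std::size_t i = 0; i < size(); ++i) data_[i] = 0.0; }
    };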
     
  8. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
    Thanks. I understand where you're coming from...I think.


    I'm still worried about scaling the number of AI instances/physical objects by adding or removing threads at the individual level. You appear to have noted later that you could scale within each thread. I would think that it would be better to scale within each thread and then add more threads to take advantage of cores that would otherwise sit idle. That way you could keep all the cores busy, maintain an even load, and curb overhead as much as possible, no?

    I also wonder if AI and physics can be processed in a set order. An entity will have to react to a barrel being tossed at it; if it dodges away from the barrel and hits a chair, a physical interaction is triggered by the AI entity. Such situations could happen off in the distance or in the local area. An intelligent scheduling mechanism would seem to be necessary then, no?

    I'm unsure about what you imply with similar performance on single-core/processor machines. If you mean the cost of a context switch, I understand. If you mean overall overhead, then the amount should or could be equivalent, but the effect would be different. On a single-core/processor machine the overhead cannot be absorbed (not to imply absorbed completely) by concurrent operation elsewhere on another task. Per-unit-time performance will degrade faster on a single-core/processor machine than on a multi-core one (assuming all cores are equivalent) as more and more context switches are introduced.

    I'm sure I'm far from telling you anything you don't know, and I did say I'm unsure what you're actually talking about. I teh challenged :wink:
     
  9. Ragemare

    Regular

    Joined:
    Apr 8, 2004
    Messages:
    333
    Likes Received:
    7
    Location:
    England
    Considering the first multi-core CPUs will only have two cores, starting a new thread for each instance of an AI would seem inefficient to me.

    Would it be a better idea to create a new thread for each free core, send it a fragment of the data that needs to be processed (in this case the AI objects), and then, upon completing the processes embedded in the AI class, have it send the objects back to the main thread with the updated data?

    The objects sent to each thread needn't even be of the same class, as long as they support a common interface so that they can be executed by the thread's main loop.

    The only problem is that some data will have to be processed before other data, so you will have to categorize your classes depending on which other classes they rely on. So you would send the classes that rely only on data from the last frame first. The problem is you will have to wait until all the threads in each category finish, so time may be wasted if you give one thread too much work to do and it lags behind the rest. In other words, you will have load-balancing issues.

    This is the only method I've seen used that hasn't confused the hell out of me. It's basically the same as handing out tasks among workers (see the sketch below).
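
    A rough sketch of what I mean (using modern C++ threads; Updatable is a made-up stand-in for the common interface):

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // The common interface every object supports, whatever its class.
    struct Updatable {
        virtual void update() = 0;
        virtual ~Updatable() = default;
    };

    // One thread per core, each handed a contiguous fragment of the data.
    // The objects are updated in place, so "sending them back" is just
    // the join at the end.
    void process_frame(std::vector<Updatable*>& objs) {
        unsigned cores = std::max(1u, std::thread::hardware_concurrency());
        std::size_t chunk = (objs.size() + cores - 1) / cores;
        std::vector<std::thread> threads;
        for (unsigned c = 0; c < cores; ++c) {
            std::size_t begin = c * chunk;
            std::size_t end = std::min(objs.size(), begin + chunk);
            if (begin >= end) break;
            threads.emplace_back([&objs, begin, end] {
                for (std::size_t i = begin; i < end; ++i)
                    objs[i]->update();
            });
        }
        for (auto& t : threads) t.join();  // wait for every fragment
    }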
     
  10. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, my assumption is simply that each AI object will take a different amount of time to process, so it's better to have significantly more threads than CPUs to efficiently leverage the CPUs. So, what I would do is some simple testing to check about how many AI instances per thread I need for the thread overhead to be, on average, no more than, oh, 5% of the processing time for that thread. Then just duplicate 'em all.

    Sure, but an AI entity doesn't need the data for the current frame, just the previous frame. If you program in a somewhat realistic reaction-time delay anyway, this is no problem (reaction time could be programmed simply: when an event that the AI object should respond to happens, put a message in the AI object's message queue that states when to react; when the current timestep is past the message's, react to that message and delete it from the queue).

    One drawback to this approach is that you really need to double-buffer all of your AI data to ensure that your parallelism isn't interrupted, but that shouldn't be a huge issue.
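
    A sketch of such a delayed message queue (field names hypothetical; the reaction delay is whatever feels realistic):

    #include <queue>
    #include <vector>

    // A message carries the timestep at which the AI should react to it.
    struct Message { double react_at; int event_id; };
    struct Later {
        bool operator()(const Message& a, const Message& b) const {
            return a.react_at > b.react_at;   // earliest deadline on top
        }
    };

    class AIEntity {
    public:
        // An event happens now; schedule the reaction after the delay.
        void notify(int event_id, double now, double reaction_delay) {
            queue_.push({now + reaction_delay, event_id});
        }
        // Each timestep, react to and discard every message that is due.
        void update(double now) {
            while (!queue_.empty() && queue_.top().react_at <= now) {
                react(queue_.top().event_id);
                queue_.pop();
            }
        }
    private:
        void react(int /*event_id*/) { /* respond to the event */ }
        std::priority_queue<Message, std::vector<Message>, Later> queue_;
    };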

    Right, so you simply use your single-core machine to gauge the performance impact of multiple threads. Then it becomes your worst-case scenario, and if you can ensure that you get in excess of, say, 10 threads most of the time, then leveraging multiple processors becomes trivial.
     
  11. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
    I'm workin on it. When you say "leverage the CPUs," you are referring to having plenty of threads waiting in the ready list to hop on the CPUs/cores and keep them busy. More threads ensure there's something available to that end. Makes sense...err...I think. Upon reflection it certainly appears to be a lot simpler.

    I am thinking simply here (...well that's a given :D) but here's where I'm coming from.

    Objectives:

    Minimize overhead
    Leverage the CPUs/cores

    Implementation:

    A main task is in control of everything.
    The main task runs tests to fill in parameters you have tuned in-house so that it best utilizes the HW.

    The main task launches the physics, AI, sound, garbage-cleaning, etc. threads within itself using library threads the OS does not see, so that it has full control of scheduling. I would imagine the kernel would be aware of, or in direct control of, threads launched on separate cores to leverage their added processing power. It will still happen, though, in a similar manner. More on that in a moment.

    Again, the main task will decide whether to put physics on another core, or some portion of the physics entities, or even particular interactions. The same is true of other aspects where applicable. Consequently this should alter the behavior of the scheduler/scheduling principles.

    I thought about having a separate thread for scheduling, but I didn't like introducing a "middle-man" to do what the main task should be able to do, or what thread-level resources may not be needed to accomplish. I think one could work out the inter-relations that dictate when things should happen and then use existing functions to handle scheduling needs, provided you have the pertinent data in globals within the FDT (wait, sleep, yield-to, etc., or you could define your own).

    Objects would define AI entities and physical objects, from the most simple to the most complex, using inheritance. Each class would have attached functions to define, call for, or wait on interactions (you could move scheduling tasks into the main loop of each respective thread, though).

    Objects would be stored in a dynamically allocated data structure such as an array or record (depending on how you define your objects) of AI entities or physically interactable objects.

    A data structure would store time from the system clock; this is needed for synchronization of events.

    How it works...err...a little clearer:

    There would be two states: Setup and Execution. Setup will be a safe, tested, pre-defined period of time, or at least enough time to schedule all high-priority interactions.

    Execution will be the time until all work is done in the current iteration, at least for high-priority interactions.

    Temporal restrictions are in place to preserve responsiveness, given that if an interaction is low priority it may be ignored for a time without consequence. It's not a luxury; it's a hack that fits the task.

    Setup time should resemble a normal time delay when and if applicable.

    Setup---Micro-scheduling:

    This refers to the actions within each respective thread in determining what should happen next.

    If not defined during the previous Execution cycle, priority will be set by whatever scheduling principles are defined, with things in the local area of the user normally having the greatest priority. Priority could be assigned with a run through the whole storage structure for your objects. A better way would seem to be having arrays store priority values for corresponding entries in the object arrays: for instance, one array each for high, mid, and low priority. This would eliminate a run through the whole array and the need to shuffle unchanged values. Also, a quick run through these smaller arrays would be a small optimization during the Execution cycle. Static or dynamic arrays...don't know. (A sketch of this bucket idea follows below.)
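
    Roughly what I'm picturing for the buckets (all names made up):

    #include <cstddef>
    #include <vector>

    enum Band { HIGH, MID, LOW, NUM_BANDS };

    // Instead of scanning the whole object array and shuffling priority
    // values around, keep one small array of indices per priority band
    // and walk only the band you need during the Execution cycle.
    struct PriorityBuckets {
        std::vector<std::size_t> bucket[NUM_BANDS];

        void assign(std::size_t object_index, Band b) {
            bucket[b].push_back(object_index);
        }

        // Run the given band: high first, lower bands only if time remains.
        template <class Objects, class Fn>
        void run_band(Band b, Objects& objects, Fn&& fn) {
            for (std::size_t i : bucket[b]) fn(objects[i]);
        }
    };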

    Setup---Macro-scheduling:

    After micro-scheduling is done, the threads deemed to have the most priority by the main task, based upon some rules, go first on each core present.

    Execution:

    All interactions are performed, or at least all high-priority ones. Since object functions will define whether they are waiting on something or not during this cycle, the cues for micro-scheduling are handled invisibly. With this, I hope Setup should take relatively very little time. In essence, Setup time is little more than a restart of the execution of interactions, where Setup does no work other than macro-scheduling and provides a normal time delay for reactions where applicable.

    What's the benefit?

    hmm...I get a cookie or a slap in the face. ...OUCH! ....can I still have a cookie? :D

    I think it would reduce the overhead imposed by context switches. If you use library threads or those of your own design, then they should be faster than those the kernel would perform in the OS.

    I think you can take advantage of some things to still leverage the CPUs or cores available. One being you could still put threads on other cores, and given the nature of things you would not need to yield the core often, other than to the OS, so in essence you're sucking the core dry without adding context switches that could be avoided. With this in mind you still have enough scalability to utilize as many CPUs or cores as are available.

    I must say...this stuff is teh hardz! :cry:

    Oh, I get it! You use the single-core machine to test when the overhead is unacceptable, and then use this to determine when you should put other cores to use. Got it. You did mention an actual work-to-overhead relation in that post as well. Oops!
     
  12. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, think about this situation:
    Imagine you have two threads and two CPUs. Computation cannot continue until both threads exit, but each thread takes a different amount of time to finish. So, no matter how they're executed, you're still not making full use of the processor time available. For example, if one thread takes 10 ms and the other 2 ms, the second CPU sits idle for 8 ms: you've used only 12 of the 20 ms of processor time available.

    But if you have lots of threads, chances are no one thread takes much longer than the rest. Better yet, since threads execute serially when there are fewer processors than threads, while one long thread is running, the others will be finishing on the other processors.

    Well, not quite. I'd rather not worry about full scheduling control, just which threads are running at any given time. Wasn't thinking about going that low-level.
     
  13. SlmDnk

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    625
    Likes Received:
    291
    http://www.extremetech.com/article2/0,1558,1777120,00.asp?kc=ETRSS02129TX1K0000532

    Agreed x 1000.

    - - - - - - - - - - - - - - - - - - -

    http://personal.inet.fi/atk/kjh2348fs/ageia_physx.html
     
  14. IgnorancePersonified

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    778
    Likes Received:
    18
    Location:
    Sunny Canberra
    Yeah, that was a good summary. However, I think this de facto "stealth" launch is in itself generating enough hype to do the product justice.

    The linked article could just be a journalist feeling a bit snubbed and upset at not getting the red-carpet/limousine treatment they feel they deserve.
     
  15. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
    True. I'm just anal and I don't like introducing anything that isn't directly working towards completing my tasks...and if I have to do it I want to make those same things pay for my suffering.

    I understand how I wouldn't want stuff sitting around idle or waiting. It was my thinking that this would not happen often and that with a custom scheduling scheme it could be dealt with efficiently.

    It was my thinking that threads handling as much of the AI, physics, etc. as possible individually would rarely, if ever, have nothing to process. Also, by using a bit of a time restriction and some level of prioritization, the waiting that is bound to occur at some level is better handled within a thread than on the ready list, and could at least be handled intelligently vs. in some random fashion by the OS (random...lack the words...the kernel picks who's next).

    If such an imbalance was found during testing, then using more threads would be the optimal thing to do; and since the schedule system would be intelligent, it could do the same in real time as well. Dynamic load balancing, so to speak.

    I missed some stuff earlier that might...maybe...it's not impossible that it could make more sense.

    I would avoid stalling the whole thread by using global flags for individual objects. Flags for "you do" and "done" should suffice. When running through the priority arrays, if an AI object's "you do" flag is set to 1, or to the id of the physical object it is reacting with, then it's obviously waiting and is passed over for the next AI object. This is a one-way street: AIs will say "interact with me," as they should realize when they've been hit or have hit something. Physical objects merely interact when they're told to and say when they're done. Interactions AI-to-AI or physical-object-to-physical-object are handled internally within their respective threads. Public vars of the classes which identify each object will facilitate this. This prevents a single AI entity or physical object from stalling the entire thread, and gives relevant data for micro-scheduling/macro-scheduling during the Execution and Setup cycles. (A sketch with atomic flags follows below.)
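
    With atomics this could look something like the following (field names made up; the checks stay cheap, so nothing stalls the thread):

    #include <atomic>

    struct PhysicalObject {
        std::atomic<int>  you_do{0};   // 0 = idle, else id of the requesting AI
        std::atomic<bool> done{true};  // set once the interaction has run

        // AI thread: "interact with me". Fails (returns false) if the
        // object is already busy, so the AI is just passed over for now.
        bool request(int ai_id) {
            int expected = 0;
            if (!you_do.compare_exchange_strong(expected, ai_id))
                return false;
            done.store(false);
            return true;
        }

        // Physics thread: service any pending request and report done.
        void service() {
            int id = you_do.load();
            if (id != 0) {
                // ... perform the interaction for AI 'id' ...
                you_do.store(0);
                done.store(true);
            }
        }
    };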

    If the FDT gets too bloated, then pipes etc. could be used instead...but that would suck, so this would be a fallback.

    I would put AI, physics, and generally somewhat dependent tasks on separate cores to take advantage of concurrent processing and keep the "waits" to a minimum.

    This would appear to at least approach keeping everyone busy, keeping overhead down, and doing something about wait states at the individual-object level. A lot of deadlock opportunities should be avoided, because the execution of each thread is not inter-dependent with the others, except when the main task steps in to catch one of them up, so to speak. Dependencies are not at the thread level. Race conditions with respect to objects are dealt with by the concurrent processing and the overall scheduling scheme.

    I'm a glutton for punishment...because I NEVER see it coming!

    I was just pondering that a re-think on things may be necessary so really nothing is out of bounds.

    That and as I have no life...why not code more and forget about the illusion anyway :D

    ... :(

    If your "not quite" was about my idea of using lib threads not being faster: I was thinking this because you don't have to copy the FDT etc. during a context switch, which is faster and saves resources.

    ...my head hurts. I'm going to get some ice-cream :wink:
     
  16. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
    I would love to see something too and something oh related to...COST!

    But I am one of the wide-eyed really hoping this thing takes off because of the potential it presents to gaming.

    C'mon, AGEIA...the time is now!
     
  17. TomW

    Newcomer

    Joined:
    Nov 3, 2004
    Messages:
    30
    Likes Received:
    0
    There's no need for more than one AI thread per (virtual) CPU core, since you can have the controller thread set up a lock-free queue of AI tasks (or entities), and the threads just pluck tasks off the list until it's empty. When each thread finds the queue empty it notifies a barrier, and the main controller thread unblocks when all the AI threads have made their notifications. Certainly, assigning a fixed chunk of work to each thread up front doesn't seem to be the best way to do it; rather, have the threads adapt to do whatever work needs doing.
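
    For what it's worth, a minimal sketch of that pattern (here the "lock-free queue" is just an atomic cursor over the frame's task array, and the join doubles as the barrier):

    #include <atomic>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Stand-in for an AI task or entity; the name is illustrative.
    struct AITask { void run() { /* per-entity AI work */ } };

    // Workers pluck tasks off the shared list until it's empty, so one
    // slow task never leaves the other cores idle while work remains.
    void run_ai_frame(std::vector<AITask>& tasks, unsigned num_threads) {
        std::atomic<std::size_t> next{0};
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < num_threads; ++i) {
            workers.emplace_back([&] {
                for (std::size_t j; (j = next.fetch_add(1)) < tasks.size(); )
                    tasks[j].run();
            });
        }
        for (auto& w : workers) w.join();  // controller unblocks here
    }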

    But I note that AI entities aren't actually independent - squad behaviour requires cooperation. This is just one instance where it might get complicated.

    Tom
     
  18. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Yeah, that'd work too. It'd be more complex than just using more threads, though, as you're doing the scheduling instead of allowing the OS to do it.

    Well, if your AI system doesn't work well with a delayed messaging system (which can allow for teamwork, but the inherent delay may start screwing things up), you would simply launch groups of AI entities that are interacting at once.
     
  19. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,242
    Likes Received:
    615
    Communication delay is not a real problem for squad behaviour unless you are simulating a squad of robots which can communicate and react to each other at the speed of light. In most cases a 1/60-second delay should not matter much.
     
  20. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, I was thinking more in terms of coordinated movement. Communication can just include things like, "This is my current position," i.e. stuff that people pick up just by looking. Might be bad to have marines accidentally shoot one another, too.
     