Some thoughts on the physics situation

Discussion in 'GPGPU Technology & Programming' started by JF_Aidan_Pryde, Jun 27, 2006.

  1. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    16x could be, but it's just 1x and not even here yet. Only the PCI version is out so far. And I say "could" because, if it were really done properly, it would need a proprietary bus to the gfx card and require seamless integration with the gfx drivers.
     
  2. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    As long as you don’t want to transfer fully skinned objects 1x PCIe would be fine.

    Additional chipsets could support fast point to point connections between PCIe cards.

    If I understand nVidia right, they want to push the physics data from the physics card to the graphics card over an SLI bridge. This would be hard to beat.
     
  3. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    For doing somewhat correct/realistic physics with the whole gameworld? I severely doubt it. I'm talking about absolutely every vertex here having physics-influenced properties/behaviours, which is what I would consider "correct".

    Physics as good as the "correctness" of today's gfx relative to reality would be the least I'd expect there.
     
  4. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    I don't see why you need the physics meshes to be as high-detail as the visual meshes.
     
  5. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    This doesn't need much more than standard floating-point additions and multiplications. Besides, I believe AGEIA designed its hardware to be generic enough to handle any physical calculation of the present and the future. It's fully programmable, but I very much doubt it has advanced out-of-order execution or threading to deal with long latency instructions.
     
  6. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    In that case, why would you need a PPU?

    EDIT: if we're talking about "realistic" physics, it would have to be high detail. That's what Ageia is advertising, at least. If physics should NOT be that complex, that negates the need for a PPU, since your PC will be able to calculate low-detail physics without problems and without the aid of the PPU; see HL2 or similar (which is what I'd describe as low-detail physics).
     
    #26 _xxx_, Jun 30, 2006
    Last edited by a moderator: Jun 30, 2006
  7. Mate Kovacs

    Newcomer

    Joined:
    Dec 12, 2004
    Messages:
    163
    Likes Received:
    3
    Location:
    Mountain View, CA
    I disagree.
    If it's unable to efficiently handle the kind of data structures * that are needed for today's collision detection algorithms, then the fp add/mul performance isn't going to have any relevance.

    * For which you need so much more than fp add/mul, IMHO.
    EDIT: So it seems to me that it all comes down to how you interpret "much more", which is kind of wishy-washy for an argument to be based on it. :)
     
    #27 Mate Kovacs, Jun 30, 2006
    Last edited by a moderator: Jun 30, 2006
  8. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I'm sorry but you're not making much sense to me. Could you try to be more specific and use facts instead of humble opinions?

    As far as I know, software physics engines are limited by floating-point performance. In particular, well-optimized physics engines are SSE limited. And Core 2 Duo will have two times more SSE execution units, which are twice as wide as a Pentium 4's. So compared to a single-core it's eight times faster. I can't see how this can be improved with 'dedicated' hardware.
    Not really. Any other instruction I can think of which is useful for physics, is already present on CPUs and GPUs. So if the PPU has any exclusive instruction, it won't be used a lot. For GPUs it's instantly obvious to point out the need for a fully pipelined TEX instruction because it's hard to emulate with generic instructions. Now where is this 'obvious' specialized instruction for PPUs?
     
  9. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    Unfortunately I don't have enough physics experience to offer straight facts, but I do have a (pretty well based) opinion.

    AFAIK efficient collision detection and other similar physics algorithms need efficient access to advanced data structures (i.e. at least scatter-gather would be needed) and heavy vector-based floating-point processing power. GPUs do have the heavy floating-point lifting equipment, but fail on the advanced data structures front. It seems that even in D3D10, scatter would be insanely slow. CPUs, on the other hand, are really good at data structures but fall short on the processing power front.

    Another issue with GPUs might be branching granularity, but that is only a gut feel not based on anything solid.

    The AGEIA PPU architecture is significantly different from both GPU and (modern x86) CPU architecture. It's actually eerily similar to the Cell architecture. I feel that this is for a reason. You really can't get by with only streaming writes, which is the GPU credo, and with complex OOO CPUs you just don't have the room for enough parallel threads and vector units.

    PS: Nick, given your assembly programming background, I'm surprised you got this wrong: the Core 2 architecture has twice the per-clock vector power of the P4, not four times. The P4 does one SSE2 vector op per cycle when Fadds and Fmuls are interleaved. Core 2 can do one Fadd and one Fmul each cycle, so twice the power, and you still need a 50/50 add/mul mix. Architecturally I think they have the same number of units, but twice as wide and now on different ports. It's still 4 times as fast in total as a single-core P4, though.
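
    The per-clock arithmetic in the PS above can be written out as a rough sketch. It uses only the issue rates quoted in this post (one 128-bit op per cycle on the P4, one add plus one mul per cycle on Core 2) and assumes single-precision floats; the helper name `flops_per_clock` is made up for illustration, and these are peak figures, not measurements.

```python
# Peak single-precision SSE throughput per clock, using only the figures
# quoted in this thread (assumed peak rates, not measured numbers).

SSE_WIDTH_BITS = 128
FLOAT_BITS = 32
LANES = SSE_WIDTH_BITS // FLOAT_BITS  # 4 floats per 128-bit vector op


def flops_per_clock(vector_ops_per_cycle, cores=1):
    """Peak floating-point operations per clock across all cores."""
    return vector_ops_per_cycle * LANES * cores


# Pentium 4: one 128-bit op per cycle when adds and muls are interleaved.
p4 = flops_per_clock(vector_ops_per_cycle=1)
# Core 2, one core: one add plus one mul per cycle (needs a 50/50 mix).
core2_single = flops_per_clock(vector_ops_per_cycle=2)
# Core 2 Duo: two such cores.
core2_duo = flops_per_clock(vector_ops_per_cycle=2, cores=2)

print(p4, core2_single, core2_duo)  # 4 8 16
print(core2_single / p4)            # 2.0
print(core2_duo / p4)               # 4.0
```

    So under these assumptions the ratio is 2x per core per clock and 4x for the dual-core part, before any clock-frequency difference is taken into account.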
     
  10. Mate Kovacs

    Newcomer

    Joined:
    Dec 12, 2004
    Messages:
    163
    Likes Received:
    3
    Location:
    Mountain View, CA
    Yeah, because they use efficient collision detection algorithms, so the integration code becomes the bottleneck. Which is obviously not the same as getting limited by collision detection done the dumb way (n^2), being unable to handle the data structures needed for an advanced algorithm efficiently.

    In other words, if you can't do collision detection efficiently (being unable to handle the data structures necessary), you'll be slow (practically as well as theoretically) no matter how fast your fp add/mul is.

    GPUs are not really good at handling those darn data structures.
    Some posts from the neighbourhood:
    http://www.beyond3d.com/forum/showthread.php?p=773607#post773607
    http://www.beyond3d.com/forum/showthread.php?p=775117#post775117
    http://www.beyond3d.com/forum/showthread.php?p=773275#post773275

    CPUs are, on the other hand, well-suited to handle them, but they're not designed specifically for physics, so they're not especially good at doing e.g. fp-hungry stuff.

    (Remember, I did not say that you don't need fp add/mul, my point was that you need "much more" than just those, and that fp performance is irrelevant if you don't have the functionality to make use of efficient collision detection algorithms. EDIT: BTW, you don't need fp operations to perform e.g. an AABB sweep test.)
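
    To make the contrast between the n^2 approach and a sweep concrete, here is a minimal sketch of one common reading of an "AABB sweep": a sort-and-sweep broad phase along the x axis. The integer grid coordinates, the box layout and the function names are assumptions for illustration only; the point is that the overlap test uses nothing but comparisons.

```python
from itertools import combinations


def overlaps(a, b):
    """Exact AABB overlap test: comparisons only, no arithmetic."""
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


def brute_force_pairs(boxes):
    """The n^2 baseline: test every pair of boxes."""
    return {(i, j) for i, j in combinations(range(len(boxes)), 2)
            if overlaps(boxes[i], boxes[j])}


def sweep_and_prune_pairs(boxes):
    """Sort by min x, then only test boxes whose x intervals overlap."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0][0])
    active = []  # indices whose x interval may still overlap later boxes
    pairs = set()
    for i in order:
        x_lo = boxes[i][0][0]
        # Drop boxes that end before this one starts.
        active = [j for j in active if boxes[j][0][1] >= x_lo]
        for j in active:
            if overlaps(boxes[i], boxes[j]):
                pairs.add((min(i, j), max(i, j)))
        active.append(i)
    return pairs


# Each box is ((x_lo, x_hi), (y_lo, y_hi), (z_lo, z_hi)) in integer grid units.
boxes = [
    ((0, 4), (0, 4), (0, 4)),
    ((3, 6), (1, 5), (2, 3)),
    ((10, 12), (0, 1), (0, 1)),
    ((11, 15), (0, 2), (0, 2)),
    ((5, 7), (9, 10), (0, 1)),
]
print(sorted(sweep_and_prune_pairs(boxes)))  # [(0, 1), (2, 3)]
```

    Note what the sweep actually needs: a sort, a dynamically changing active list and data-dependent branching. That is the "functionality" side of the argument rather than raw add/mul throughput.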

    I'm not talking about single instructions, I'm talking about functionality. The whole architecture has to be designed with dynamics simulation in mind, to be able to carry out all the computations necessary for a simulation 'tick' on its own (no need for a CPU to detect those darn collisions, etc).

    EDIT: And once more: I don't know if AGEIA's PhysX actually implements those facilities. I'm just saying that even if it does not, that's still no proof that we don't need PPUs that do.
     
    #30 Mate Kovacs, Jun 30, 2006
    Last edited by a moderator: Jun 30, 2006
  11. Mate Kovacs

    Newcomer

    Joined:
    Dec 12, 2004
    Messages:
    163
    Likes Received:
    3
    Location:
    Mountain View, CA
    After you're done with collision detection, all the integration can be done in parallel, so it needs some kind of streaming model. But before that, you need random memory access, branching and stuff like that to utilise efficient collision detection algorithms.

    So it'd be like if you did the collision detection (and response) on a CPU, the integration on a GPU, but you didn't have to send all the data back and forth between them.
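
    The "integration is a stream" half of this can be sketched as follows. This is a hypothetical minimal example using a semi-implicit Euler step; the state layout `(position, velocity, mass)` and the names `integrate_body` and `tick` are assumptions, and the forces are simply taken as given, standing in for whatever the collision detection and response phase produced.

```python
def integrate_body(state, force, dt):
    """Semi-implicit Euler step for one body; depends only on its own data."""
    pos, vel, mass = state
    vel = tuple(v + (f / mass) * dt for v, f in zip(vel, force))
    pos = tuple(p + v * dt for p, v in zip(pos, vel))
    return (pos, vel, mass)


def tick(states, forces, dt):
    """Integration phase: a pure map over bodies, so order does not matter."""
    return [integrate_body(s, f, dt) for s, f in zip(states, forces)]


states = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 2.0),
    ((5.0, 1.0, 0.0), (0.0, 0.0, 0.0), 1.0),
]
# Assumed to be the output of the collision detection/response phase.
forces = [(0.0, -4.0, 0.0), (2.0, 0.0, 0.0)]

new_states = tick(states, forces, dt=0.5)
print(new_states[0][0])  # (0.5, -0.5, 0.0)
print(new_states[1][0])  # (5.5, 1.0, 0.0)
```

    Because each body reads and writes only its own state, the map can be split across any number of lanes or threads, which is why this phase suits a streaming model while the collision phase before it does not.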

    PS: I can't be "much more" specific than that. :)
     
    #31 Mate Kovacs, Jun 30, 2006
    Last edited by a moderator: Jun 30, 2006
  12. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Scatter can be turned into gather by writing out the indices and then performing a gather to do the updates. This will become much easier with Direct3D 10, which supports integer operands and output streams.
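
    One possible reading of the scatter-to-gather conversion described above, as a sketch: write out the destination indices, order them by destination, and let every output slot read the inputs addressed to it. The sort step and the names `scatter_add` and `gather_add` are assumptions for illustration, not something Direct3D 10 prescribes.

```python
def scatter_add(values, dest, n_out):
    """Scatter form: each input writes to an arbitrary output slot."""
    out = [0] * n_out
    for v, d in zip(values, dest):
        out[d] += v
    return out


def gather_add(values, dest, n_out):
    """Gather form: each output slot reads the inputs addressed to it."""
    # Pass 1: write out the input indices, ordered by destination slot.
    by_dest = sorted(range(len(values)), key=lambda i: dest[i])
    # Pass 2: find where each slot's run of inputs starts in that ordering.
    start = [0] * (n_out + 1)
    for d in dest:
        start[d + 1] += 1
    for j in range(n_out):
        start[j + 1] += start[j]
    # Pass 3: every output only reads, so no two outputs share a write target.
    return [sum(values[by_dest[k]] for k in range(start[j], start[j + 1]))
            for j in range(n_out)]


values = [10, 20, 30, 40]
dest = [2, 0, 2, 1]
print(scatter_add(values, dest, 4))  # [20, 40, 40, 0]
print(gather_add(values, dest, 4))   # [20, 40, 40, 0]
```

    The gather form costs extra passes (a sort and a prefix sum) compared with the direct scatter, which is the price paid for never writing to a data-dependent address.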
    Core 2 Duo is going to be a really big step to improve that.
    It's similar to Cell except that its clock frequency is only a fraction of it! A CPU makes up for its lack of high parallelism with a high clock frequency and high efficiency.
    The information that's available now is a bit too limited to determine the exact configuration. But as far as I know a Pentium 4 can process only 64 bits per clock cycle, because it has only 64-bit wide SSE units (requiring a second cycle to start the other half of the instruction), and only one port for both addition and multiplication. From what I can tell, Core 2 Duo has two 128-bit SSE units, each on a different port. It's probably one adder and one multiplier, but with interleaved instructions that's four times faster than a Pentium 4. So that's four times faster for a single-core, eight times faster for a dual-core.

    In practice, dual-core brings a whole lot more than double the processing power. If previously a game used, say, 25% of a single core to do physics processing, then moving that task to the second core of a dual-core allows four times more physics processing. Combined with the wider execution units of Core 2 Duo, that's pretty phenomenal. Add to this that the 25% has been freed up for other tasks (on top of the overall more efficient architecture), and that Direct3D 10 resolves some driver bottlenecks, and doing complex physics on the CPU becomes really interesting!
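
    The budget argument above is simple enough to write down. This sketch assumes physics scales perfectly with the core time and vector throughput it is given, which ignores memory bandwidth and synchronisation; the function name is made up, and the per-core speedups of 2x and 4x are the two per-clock figures argued over in this thread, not measurements.

```python
def physics_budget_multiplier(old_core_share, per_core_speedup):
    """How much more physics work fits when physics gets a whole core."""
    core_share_gain = 1.0 / old_core_share  # e.g. 25% of a core -> 100% of one
    return core_share_gain * per_core_speedup


print(physics_budget_multiplier(0.25, 1.0))  # 4.0  (second core alone)
print(physics_budget_multiplier(0.25, 2.0))  # 8.0  (with a 2x per-core gain)
print(physics_budget_multiplier(0.25, 4.0))  # 16.0 (with a 4x per-core gain)
```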
     
  13. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    I'd hazard a guess that this would be pretty slow. But in essence, yes, it's possible.

    Core 2 Duo is going to be a really big step to improve that.

    Slower, but wider. Anyway, I'm not taking a position one way or the other on the AGEIA PPU. I'm just saying that physics processing wants a distinctly different model from either GPU or CPU, at least in their current form. I don't know the feature plans of GPU makers, but AMD has stated interest in doing asymmetric multicore processors, with some processors being small vector cores.

    AFAIK sufficient information is available about the pipeline configuration. You're right that the P4 has 64-bit SSE units on one port. But the port is used only for issue. The P4 (like the current K8) can schedule a 128-bit add on one cycle and a 128-bit mul on the next.

    Correct, the vector ports are Fadd + mov, Fmul/Fdiv + mov, shuffle + mov.
     
  14. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    This is true for the GPU, but it's going to improve considerably with Shader Model 4.0 and unified architectures.
    True for a Pentium 4, not much longer for a Core 2 Duo.

    My only point is that AGEIA gets a lot of competition from different sides. And my prediction is that it's not going to survive it, because there's nothing unique enough to throw dedicated hardware at. Give CPUs more floating-point power, or GPUs more programmability, and you have perfect physics processors. And that's what's already happening...
     
  15. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    With a PPU you also have to send data back and forth to the CPU (and then to the GPU), just at another stage. And I wouldn't be surprised if AGEIA still did some processing on the CPU.

    So it still seems better to me to either do all the processing on a powerful CPU, or do some of the processing on the CPU and let the GPU handle the rest and immediately start rendering the results (and send them back to the CPU to update the scene).
     
  16. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    So the port can take a 128-bit instruction for each execution unit every two clock cycles, starting a 64-bit operation on each execution unit every cycle? So it can sustain one 128-bit instruction every clock cycle (interleaved)? Interesting! It makes sense of course, to keep all execution units busy. And in this case Core 2 Duo would indeed be exactly two times faster (per clock per core).

    I never owned a Pentium 4... :D
     
    #36 Nick, Jun 30, 2006
    Last edited by a moderator: Jun 30, 2006
  17. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    edit: Checked it over; according to Hans de Vries it's the same on K8 (Athlon 64). And K7, if I'm not mistaken.
     
    #37 stepz, Jun 30, 2006
    Last edited by a moderator: Jun 30, 2006
  18. Mate Kovacs

    Newcomer

    Joined:
    Dec 12, 2004
    Messages:
    163
    Likes Received:
    3
    Location:
    Mountain View, CA
    I mostly agree on this. You can pretty much get away with more fp power (and parallelism) on the CPU or more flexibility on the GPU side, but I still don't think that means dedicated physics HW couldn't be more efficient. If you define efficiency as "how little work you have to do on existing stuff", then any PPU will be 'inefficient' compared to modified CPUs/GPUs, of course.

    Well, I'd argue that it depends on the type of the PPU. :) You could make such a PPU that'd need the whole stuff only once, then only the external forces/thrusts acting at the beginning of each tick, so it wouldn't need to send anything back to the CPU, only explicit queries or notifications would be necessary.
    EDIT: And it'd need a direct link to the GPU, of course.

    Me neither, sadly. :)

    The method using the GPU as well sounds more convincing to me, because the integration is a piece of cake for any stream processing thingy. But I guess you'd need SM4.0 to do it efficiently. I mean, if you have e.g. rigid bodies, then you integrate the positions/orientations in a stream, and besides sending the results back to the CPU, you want to render a model according to each position/orientation pair, but probably a different one for each body. Is this even possible with SM4.0? IIRC it's not possible with the simple geometry instancing present in SM3.0.
     
    #38 Mate Kovacs, Jun 30, 2006
    Last edited by a moderator: Jun 30, 2006
  19. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    I wonder how 'big' of a problem physics is, especially compared to graphics.

    For graphics, in the beginning, it was mostly about texture sampling. Today it has grown to involve vertex and pixel shaders, multiple textures, AA and stencil units. When the first graphics accelerators were released, one could still get by with software renderers; letting the CPU do point sampling was still a feasible fallback. But with the years, the graphics load stacked up. Now it's inconceivable that the CPU can do all the work that it's currently doing plus all the shader, texture and AA work.

    What about physics? Are there a whole bunch of improvements that can be had with additional hardware, as analogous to graphics?

    The improvement in graphics also came from better shader models. We went from flat, gouraud to phong as hardware improved. Is there a similar pattern with physics algorithms?

    Without really understanding the scope of physics, the range of available algorithms, their complexity and how well they map to specialized hardware, I don't think we can really predict if physics hardware has a future.
     
    #39 JF_Aidan_Pryde, Jul 1, 2006
    Last edited by a moderator: Jul 1, 2006
  20. SPM

    Regular

    Joined:
    Dec 18, 2005
    Messages:
    639
    Likes Received:
    16
    Sounds like a good reason to use Cell on a PPU: with the PPE managing the data structures, high bandwidth between the PPE and SPEs on the same chip, and scatter/gather list DMA to feed the SPEs, it should solve the problem of relatively low bandwidth between the PC's CPU and the PPU card. Cell is also considerably more powerful than the AGEIA PPU, and should be a lot cheaper if AGEIA picks up defective Cell chips that are unusable in the PS3 because more than 2 SPEs have failed. I can't see why AGEIA isn't doing this, unless Sony is preventing them.
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.