Multi-threaded PhysX benchmarks - bye bye GPU PhysX.

Discussion in 'PC Gaming' started by brain_stew, Mar 18, 2010.

  1. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,823
    Likes Received:
    162
    Location:
    Seattle, WA
    Indeed.

    That would explain the results that I've seen.

    Floats don't actually make much difference. Whether aligning the matrices makes a difference depends on the matrix size. For small matrices, the difference was maybe 20%. But for large matrix sizes, just aligning them improved performance by 5x-10x even before using SSE! Adding SSE on top of aligning the matrix can improve things anywhere from 2x-3x for large matrix sizes (more improvement for floats, less for doubles).
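    As a minimal sketch of what "aligning the matrices" can mean in practice (assuming a C++17 toolchain; the function name and sizes here are hypothetical, not from the posted benchmark): a 16-byte-aligned allocation lets the compiler emit aligned 128-bit SSE loads and keeps rows from straddling cache lines.

    ```cpp
    #include <cstdio>
    #include <cstdint>
    #include <cstdlib>

    // Hypothetical sketch: return a 16-byte-aligned n x n float matrix so that
    // SSE code (or the auto-vectorizer) can use aligned 128-bit loads/stores.
    float* alloc_matrix(std::size_t n)
    {
        // C++17 std::aligned_alloc: the size must be a multiple of the alignment.
        return static_cast<float*>(std::aligned_alloc(16, n * n * sizeof(float)));
    }

    int main()
    {
        float* m = alloc_matrix(2048);  // 2048x2048 floats = 16 MB, as in the post
        std::printf("aligned: %d\n", (reinterpret_cast<std::uintptr_t>(m) % 16) == 0);
        std::free(m);
        return 0;
    }
    ```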
     
  2. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Interesting; this implies that the compiler is *not* auto-vectorizing and just using scalar SSE. Now I'm curious... can you post the assembly (-S in ICC IIRC) for that function if it's not too huge? :)

    Cool. Yeah that's expected for small matrices as there you're bound by loop overhead and everything fits in cache anyways. When you say "large" matrices, what sizes are you talking?

    Interesting test btw... thanks for posting.
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,823
    Likes Received:
    162
    Location:
    Seattle, WA
    Well, I guess I misspoke. There is little difference when talking about the very small matrix sizes, which is likely because the primary limitation is register pressure. But when you have large matrix sizes, the performance improvement is about 2x for floats, 20% for doubles. So there is a big difference, at least for large matrices. In fact, when I look at the assembly, as you suggested, I see this chunk:

    Code:
    ..B1.55:                        # Preds ..B1.55 ..B1.54
            movups    (%edx,%ecx,4), %xmm2                          #38.32
            mulps     (%ebx,%ecx,4), %xmm2                          #55.5
            movups    16(%edx,%ecx,4), %xmm3                        #38.32
            mulps     16(%ebx,%ecx,4), %xmm3                        #55.5
            addps     %xmm2, %xmm0                                  #55.5
            addl      $8, %ecx                                      #55.5
            addps     %xmm3, %xmm1                                  #55.5
            cmpl      %edi, %ecx                                    #55.5
            jb        ..B1.55       # Prob 99%                      #55.5
            jmp       ..B1.59       # Prob 100%                     #55.5
                                    # LOE eax edx ecx ebx esi edi xmm0 xmm1
    I believe "mulps" is a packed (vector) multiply instruction, which would indicate auto-vectorization.

    Anyway, I think the 2x performance improvement for enabling SSE2 using single-precision on large matrices is pretty tremendous, and really shows how much realistic computational loads can benefit from this very very simple sort of optimization (many scientific codes are entirely limited by how rapidly they can perform large matrix multiplications, for instance).

    Oh, sorry. Up to 2048x2048. So with floats, that requires 32MB, 64MB with doubles (since I'm using two matrices).

    P.S. If you'd like, I can go ahead and post the full .asm. I just didn't want to clutter the post.
     
  4. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    846
    Likes Received:
    245
    Yes, mulss is the scalar version, and it operates on floats (s = single precision); mulpd is the packed double variant.

    The primary reason FP is faster with SSE is throughput: you can get 2/1 to 3/1 for most operations if you do not have dependencies. Ripping apart an x87 code stream, which presents itself as a stack machine to the outside, exhausts a lot of internal resources, leading to 1/3 to 1/10. It essentially has nothing to do with the math operations themselves being faster (1 clock is a non-surpassable barrier :^), and you can assume that the math logic is a shared resource between the x87 and 3DNow/SSE modes.
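    To make the scalar vs. packed distinction concrete, here is a small sketch using SSE intrinsics (assuming an x86 target; `_mm_mul_ss` compiles to mulss, `_mm_mul_ps` to mulps):

    ```cpp
    #include <xmmintrin.h>  // SSE intrinsics
    #include <cstdio>

    int main()
    {
        // _mm_set_ps lists lanes high-to-low, so lane 0 holds 1.0f.
        __m128 a = _mm_set_ps(4.f, 3.f, 2.f, 1.f);
        __m128 b = _mm_set1_ps(10.f);

        float packed[4], scalar[4];
        _mm_storeu_ps(packed, _mm_mul_ps(a, b)); // mulps: all four lanes multiplied
        _mm_storeu_ps(scalar, _mm_mul_ss(a, b)); // mulss: only lane 0 multiplied;
                                                 // upper three lanes copied from a
        std::printf("mulps: %g %g %g %g\n", packed[0], packed[1], packed[2], packed[3]);
        std::printf("mulss: %g %g %g %g\n", scalar[0], scalar[1], scalar[2], scalar[3]);
        return 0;
        // prints:
        // mulps: 10 20 30 40
        // mulss: 10 2 3 4
    }
    ```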
     
    #124 Ethatron, Jul 11, 2010
    Last edited by a moderator: Jul 11, 2010
  5. digitalwanderer

    digitalwanderer Dangerously Mirthful
    Legend

    Joined:
    Feb 19, 2002
    Messages:
    16,932
    Likes Received:
    1,531
    Location:
    Winfield, IN USA
    Best quote of this thread, can I sig it Andy?
     
  6. Mendel

    Mendel Mr. Upgrade
    Veteran

    Joined:
    Nov 28, 2003
    Messages:
    1,350
    Likes Received:
    17
    Location:
    Finland
    So they will include multithreading, but no comment about SSE?
     
  7. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yup the "ps" stands for "packed single [precision]".

    Yeah, it's neat that ICC is actually able to auto-vectorize this, and although it is simple code, there are definitely some computationally expensive chunks of physics code that can benefit from the same sort of optimizations, particularly if the data structures are arranged in SoA format. Definitely shows potential for taking advantage of SIMD, even if it has to be done more explicitly.
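    As an illustration of the SoA point (a hypothetical layout sketch, not PhysX's or bullet's actual data structures): with AoS the fields of one particle are interleaved, while SoA keeps each field contiguous so four consecutive values map straight onto one 128-bit register.

    ```cpp
    #include <cstdio>
    #include <vector>

    // AoS: one particle's fields are interleaved, so loading four consecutive
    // x values into an SSE register requires strided/gather-style access.
    struct ParticleAoS { float x, y, z, mass; };

    // SoA: each field is contiguous; four consecutive x values fill one
    // 128-bit register, which is what auto-vectorizers want to see.
    struct ParticlesSoA {
        std::vector<float> x, y, z, mass;
    };

    int main()
    {
        ParticlesSoA p;
        for (int i = 0; i < 8; ++i) p.x.push_back(float(i));

        // A unit-stride loop over p.x: the kind a compiler turns into addps.
        float sum = 0.f;
        for (float v : p.x) sum += v;
        std::printf("sum = %g\n", sum);  // 0+1+...+7 = 28
        return 0;
    }
    ```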

    Nope that's ok, the chunk you posted demonstrated what the compiler was doing nicely, thanks!

    If you like, but it wasn't meant to be particularly profound :)

    To clear something up though... I'm not arguing that CPU physics is going to be faster than GPU physics (particularly just by flipping some compiler switch), but rather that the CPU implementation of PhysX seems to leave a good chunk of performance on the floor (SSE and scalable multi-threading). NVIDIA's comments about PhysX 3.0 seem to support this, and it's good to see them moving towards a better solution.

    I think that's what David's article and analysis were meant to convey as well, but some people have interpreted it more harshly, so I did want to separate myself from that view: I think the GPU is a reasonable architecture for an increasingly large set of physics computations, but the CPU can definitely do better than what PhysX (2.0?) is doing now. I'll let David speak for himself on what he intended to say though :)
     
  8. Sxotty

    Veteran

    Joined:
    Dec 11, 2002
    Messages:
    4,827
    Likes Received:
    289
    Location:
    PA USA
    Your arguments still don't make sense. Until some other physics implementation is better, castigating Nvidia for not supporting other businesses is silly. Intel bought Havok to sell Intel CPUs. Where were the pitchforks about how they gimped it on GPUs? :)

    Anyway, I am with Longbj: I think we could all agree that it would be nice if Havok did wipe the floor with PhysX; then at least it would provide motivation. I mean, why should Nvidia do anything for CPUs? They don't get paid for PhysX or for CPUs. That is just asking them to do something for nothing/mind share.

    edit:
    The work isn't justified for Nvidia. How would they explain it to their stockholders? They have an obligation, and it isn't to sell lots of CPUs.
     
    #128 Sxotty, Jul 11, 2010
    Last edited by a moderator: Jul 11, 2010
  9. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,823
    Likes Received:
    162
    Location:
    Seattle, WA
    Not entirely. Wider acceptance of PhysX among developers would definitely be a positive point for selling nVidia cards, and higher CPU performance could help devs to decide to make use of it.
     
  10. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Their customers in this case are the game developers who use PhysX, for whom these optimizations hold large value. Sure, NVIDIA could take the stance you describe, but it would not endear them to developers, for obvious reasons.
     
  11. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    15,556
    Likes Received:
    4,458
    I agree with that. Currently, with 4-core CPUs, even with the optimizations, PhysX would most likely still be faster on GPU than CPU for certain tasks. It just wouldn't be nearly as dramatic or marketable, especially if it allows devs to follow in the footsteps of Metro 2033 and make sure all GPU effects can also be done on the CPU. That takes the majority of the thunder right out of PhysX GPU marketing, although there will always be people willing to go the GPU route for the additional speed. For the most part, though, it would get rid of vendor lock-in, something Nvidia is reluctant to remove; only consumer backlash is going to make them change their ways.

    And then there are the 6-core CPUs starting to enter the market, and in the future 8- and 12-core CPUs. With SIMD SSE and a proper multicore implementation, there is the possibility that the lead will dwindle even further or disappear entirely.

    GPUs will obviously also get faster, but I think they're starting to hit the power/heat wall that CPUs have been at for a while. GPUs are just a little luckier in that area, as they appear to be more tolerant of high power/high heat; or at least the limit set for CPUs is significantly lower.

    Regards,
    SB
     
  12. Squilliam

    Squilliam Beyond3d isn't defined yet
    Veteran

    Joined:
    Jan 11, 2008
    Messages:
    3,495
    Likes Received:
    113
    Location:
    New Zealand
    Sig whores... bah! :razz:
     
  13. findhorn_elves

    Newcomer

    Joined:
    Jun 19, 2010
    Messages:
    45
    Likes Received:
    1
    If the game is optimized and well programmed, that has to be taken into consideration as well.
     
  14. Sxotty

    Veteran

    Joined:
    Dec 11, 2002
    Messages:
    4,827
    Likes Received:
    289
    Location:
    PA USA
    This and Andrew's argument have some merit.

    You are saying "more use of PhysX = more cards sold", I think, and he's saying more use = more developers optimizing for NV cards. At least I think. I still think that is a hard sell to a stockholder. I think the better argument is that PhysX will die if it falls behind other physics middleware in CPU support. Hence maybe the update is because Havok or something else will be updated soon.
     
  15. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,823
    Likes Received:
    162
    Location:
    Seattle, WA
    Well, it's definitely a fine line to walk. On the one hand, you want good CPU performance so developers will pick it up. On the other, you want much better GPU performance so that gamers have a reason to consider it when buying a new graphics card. Clearly if Havok was doing no better in terms of CPU performance, nVidia would have no reason to optimize further. But then they risk bad PR (as just happened).

    So yeah, it definitely isn't terribly simple, but at least we can be a little happy that nVidia is making at least some strides to improve CPU performance. The best way to keep them at it would be if Havok were to improve.
     
  16. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Agreed but that gets more into the territory of why NVIDIA (and Intel) are in the physics middleware markets at all... are they there to actually develop great middleware for everyone to use on all platforms or to promote their own hardware? The former is a sustainable strategy while the latter will get eaten in the long run by stuff like bullet. We'll see which strategy NVIDIA chooses going forward.

    On another note out of curiosity I decided to run the bullet benchmarks that were linked from the other forum with slightly different compiler settings to get at the real question about how much the simple "compiler flag SSE2" matters vs pure x87. Thus I disabled bullet's explicit use of SSE entirely and instead just varied the compiler flags between nothing and /arch:SSE2 (in MSVC). Here are the results (on a single core of a Core i7 940), all in milliseconds:

    x87:
    Results for 3000 fall: 19.562317
    Results for 1000 stack: 12.707859
    Results for 136 ragdolls: 10.471497
    Results for 1000 convex: 18.448212
    Results for prim-trimesh: 9.196541
    Results for convex-trimesh: 16.437344
    Results for raytests: 22.529740

    /arch:SSE2
    Results for 3000 fall: 18.083166
    Results for 1000 stack: 12.192101
    Results for 136 ragdolls: 9.896542
    Results for 1000 convex: 15.049071
    Results for prim-trimesh: 8.446912
    Results for convex-trimesh: 13.625685
    Results for raytests: 18.630299

    Definitely some larger gains than I expected for a simple compiler option! In some cases it didn't make a huge difference but 3 of the cases netted 10-20% boosts just from using scalar SSE vs x87. Impressive, but you'd have to actually use SIMD SSE (either via auto-vectorization in ICC or explicit SSE) to see 1.5-2x level gains. I'd be interested in how well ICC would do with bullet (Chalnoth's auto-vectorizer results are encouraging), but I don't have the time to set it up.
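    For reference, rough GCC/Clang analogues of those build configurations (a sketch only; `bench.cpp` is a placeholder filename, and exact flag behavior varies by compiler version):

    ```shell
    # Pure x87 floating point (32-bit builds only):
    g++ -O2 -m32 -mno-sse -mfpmath=387 -o bench_x87 bench.cpp

    # Scalar SSE2 for all float math, roughly MSVC's /arch:SSE2:
    g++ -O2 -msse2 -mfpmath=sse -o bench_sse2 bench.cpp

    # ICC-style vectorization report, to check for packed (mulps) code:
    # icc -O2 -xSSE2 -vec-report2 bench.cpp
    ```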

    I also ran it with /arch:SSE2 and their explicit SSE stuff enabled as well out of curiosity:

    /arch:SSE2 + USE_SSE
    Results for 3000 fall: 17.373240
    Results for 1000 stack: 11.402561
    Results for 136 ragdolls: 9.099781
    Results for 1000 convex: 14.487597
    Results for prim-trimesh: 8.132652
    Results for convex-trimesh: 13.579372
    Results for raytests: 18.672155

    Minor gains over just the compiler option. There is small use of SIMD instructions in this mode but it's mostly opportunistic AoS stuff in the small vector library routines so I'm pretty sure that more could be done with some data structure re-arranging. Thus I don't think 1.5x (50%) improvement or more from pure x87 -> SIMD SSE2 is unreasonable but it would require some work (at least in bullet). I still think it's worth it though :)

    Anyways food for thought. Obviously the results in PhysX could be different.

    More importantly, this response from NVIDIA seems completely reasonable to me and I definitely don't think they are intentionally hurting CPU performance to make the GPU look good. The response of "most people write for console, port it to PC and it runs faster there so we don't look much more at it" is true in my experience. By NVIDIA's admission there's performance left on the floor but I doubt it's due to anything nefarious. Rather, it's just an increasingly dated code-base and game developer apathy. Sounds like they've got a good handle on this for PhysX 3.0 though so that will be good to see.
     
    #136 Andrew Lauritzen, Jul 12, 2010
    Last edited by a moderator: Jul 12, 2010
  17. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
    I have read that PhysX on the console is much better than on the PC. Do you have a link or some sort of talking point that discusses this in more depth? I ask because I've only read posts just like yours saying the exact same thing, but nothing that provides more detail. I would like to read more about it.
     
  18. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Have you read anything in this thread ECH?

    http://arstechnica.com/gaming/news/...cpu-gaming-physics-library-to-spite-intel.ars

    that's nV's reply to dKanter's article, and it says just about everything you need, straight from the horse's mouth.

    Yes it's old, yes it's unoptimized, but you won't notice because you have a lightning fast PC or no ATI card.
     
  19. ECH

    ECH
    Regular

    Joined:
    May 24, 2007
    Messages:
    682
    Likes Received:
    7
    Yes, I've read that article, but it didn't address my question. That article clearly shows that the PhysX API needs to be completely re-coded; I understand that, and it's supposed to happen sometime next year. But what I'd like to know is: by what means is PhysX multithreaded on the console? It's been said more than once that PhysX has a multithreaded/multicore implementation on the consoles. If true, I would like to read more about it.
     
  20. neliz

    neliz GIGABYTE Man
    Veteran

    Joined:
    Mar 30, 2005
    Messages:
    4,904
    Likes Received:
    23
    Location:
    In the know
    Maybe that's in one of the old Ageia interviews? http://interviews.teamxbox.com/xbox/1117/AGEIA-Technologies-Interview/p1

    They were harping on their multithreadedness back then, especially on the consoles.

    edit: ye olde B3D thread discussing this: http://forum.beyond3d.com/showthread.php?t=18897
     