Multi-threaded PhysX benchmarks - bye bye GPU PhysX.

Discussion in 'PC Gaming' started by brain_stew, Mar 18, 2010.

  1. Bouncing Zabaglione Bros.

    Legend

    Joined:
    Jun 24, 2003
    Messages:
    6,363
How about any PhysX implementation on a console? Just because everyone else is also behind on physics, it doesn't mean the poor PC PhysX implementation is good - it's just as bad on the PC as anything else - and it is not to be patted on the back for that.

The only reason for the push on PhysX is that it's one of the few differentiators at a time when Nvidia has been failing to provide good products that can stand without these leveraged non-standard APIs.

    I think the ultimate argument is the new PhysX SDK. If Nvidia is now trumpeting a new version that they say specifically addresses the poor multicore CPU implementation (something that's been fixed on the console implementation for a long time), then it's pretty easy to extrapolate that the current version doesn't do the job.
     
  2. Lonbjerg

    Newcomer

    Joined:
    Jan 8, 2010
    Messages:
    197
PhysX on consoles doesn't come close to GPU PhysX, so what is your point?
     
  3. Bouncing Zabaglione Bros.

    Legend

    Joined:
    Jun 24, 2003
    Messages:
    6,363
    PhysX's good use of multiple CPU cores on consoles is confirmed, just as the use of a single CPU core on a PC has been confirmed.

    What exactly is your point? That if everything else is as crap at using multiple CPU cores for physics on a PC as PhysX is, then PhysX is somehow a great solution because it works well enough (bar massive framerate drops) on GPUs?

Yours is a strawman argument. Everyone is complaining that PhysX doesn't use multicore CPUs properly in CPU mode, and you're asking for anything else that does it better. Crap is still crap, even if the alternatives are crap too.

    If PhysX is so good, why is it only just as bad as everything else on the CPU? Why is Nvidia finally and publicly addressing this specific problem if it's a non-issue as you seem to think?

    As I said higher up, it's time in the life-cycle of PhysX as a marketing feature to become inclusive of all platforms if it wants to survive, expand, and maybe become the industry standard in the future.
     
  4. Lonbjerg

    Newcomer

    Joined:
    Jan 8, 2010
    Messages:
    197
Metro 2033 uses multicore CPU PhysX, so what is your point?
    http://physxinfo.com/news/2447/advanced-physx-in-metro-2033/

    That other devs didn't code for that...is evidence of what?

PhysX (in its current form) isn't locked to a single core, but it's up to the dev to implement that.

The only thing I see evidence for is that you don't know what you are talking about, when you can say this:
    I just proved you wrong...what smoke&mirrors will you post next?

    EDIT:
I guess FluidMark is single-threaded too?!
     
  5. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    12,907
Not only that, but failing to use at the very least SSE is absolutely criminal.

Glad you brought up Metro 2033, as EVERYTHING it does on the GPU it also does on the CPU. There is absolutely nothing done on the GPU that couldn't be done on the CPU. The only limitation is Nvidia purposely crippling the CPU implementation. Without those limitations, there would be nothing to make GPU PhysX appear special in the face of the same effects on the CPU.

Thus the ONLY reasoning for the deliberate crippling of the CPU side (even Ageia had better CPU support in PhysX prior to Nvidia buying it out, and they were also deliberately holding back CPU performance) was to give an artificial reason to buy their cards over the competition's. Any benefit to the consumer was ancillary to this.

How many good GPU-accelerated games are there? I can only think of arguably 3: Mirror's Edge, Batman, and Metro 2033. Except in Metro the CPU can do everything the GPU does, only to a lesser extent. Everything else was pretty much crap and shovelware, with Nvidia desperate to get anyone to do GPU physics in an attempt to push their video cards.

    Personally, I think it's rather telling that most of the best AAA games still prefer to use Havok over PhysX.

Oh, and since you like to constantly ask what Havok has done that can challenge PhysX GPU acceleration, I'd like to see even one title using PhysX on the CPU that's even remotely as good as some of the well-implemented Havok CPU games. There aren't any that I can think of. Probably because no CPU PhysX game has had any physics effects that stand out compared to Havok games, or even company-specific physics engines.

As to FluidMark, it received major development help from Nvidia, which helped specifically code it to hide some glaring weaknesses of the GPU versus the CPU, while making sure those situations were the worst possible for the CPU. Many of the comments and some testers have revealed that with some changes (increased emitters, I believe, was one) you can bring GPU speed down to a snail's pace while CPU speed is virtually unaffected.

    Regards,
    SB
     
  6. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,636
Havok has more than once shown demos that look and react better than things Nvidia claimed you need a GPU to do.
     
  7. Lonbjerg

    Newcomer

    Joined:
    Jan 8, 2010
    Messages:
    197
    "Oddly" the developer seems to disagree with you:
    http://www.pcgameshardware.com/aid,706182/Exclusive-tech-interview-on-Metro-2033/News/

PCGH: What are the visual differences between physics calculated by CPU and GPU (via PhysX, OpenCL or even DX Compute)? Are there any features that players without an Nvidia card will miss? What technical features cannot be realized with the CPU as "physics calculator"?
    Oles Shishkovstov: There are no visible differences as they both operate on ordinary IEEE floating point. The GPU only allows more compute heavy stuff to be simulated because they are an order of magnitude faster in data-parallel algorithms. As for Metro2033 - the game always calculates rigid-body physics on CPU, but cloth physics, soft-body physics, fluid physics and particle physics on whatever the users have (multiple CPU cores or GPU). Users will be able to enable more compute-intensive stuff via in-game option regardless of what hardware they have.

    Would you stop with the lies?
The current PhysX SDK is not limited to single-threaded PhysX...you can (as the developers did for Metro 2033) make it multithreaded

I own a PPU and have done so since 2006, and I have never seen any AGEIA/PPU/PhysX game that used multi-core CPU physics.

    NEVER!

    Care to show me any AGEIA multi-CPU PhysX outside of tech demos?
I even think I linked to both their benchmark (RealityMark) and their PhysX Rocket being single-threaded on this forum (but can't be arsed to dig up that post right now)

    Ah...trying to shift the goalpost.
    How many CPU physics games are out there doing the same?
    Wake me up when the number gets > 0

Old habits die hard...especially when you can reuse old stuff.
    But the trend is showing something else:
    http://bulletphysics.org/wordpress/?p=88

    PhysX is overtaking Havok's leading role, sorry to burst your bubble.


Again, a lot of smoke&mirrors (is that the SOP for this site...you can't prove/disprove something, so you try all sorts of hula-hoops to mask that?)

    Where are the Havok CPU games making CPU PhysX look borked?
    I remember one CPU physics game doing what I see Havok doing today:
    http://www.youtube.com/watch?v=-8z9CP6u5kk

It seems to me that most of your argumentation is flawed/outdated *shrugs*

    What is next?
BC2 having "awesome" physics? :roll:
     
  8. Lonbjerg

    Newcomer

    Joined:
    Jan 8, 2010
    Messages:
    197
    Care to show me these demos...bet I can match them?

But, yeah...demos...what does that remind me of?
    .oO(Oh yeah...)


    Larrabee tech demos :roll:
Didn't you guys push "Project Offset" and its physics on Larrabee over CPU physics?
    Before the whole thing foundered and Larra-gate came...and now "Project Offset" is dead

    Wake me up when you can present an actual game, not PR-demos...

This reminds me of you calling x86 an open standard...or your posts against "Fermi" in supercomputers.
    You should change your sig to "speaking for Intel" *hint-hint*
     
  9. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,340
    I'm not sure why anyone would expect developers that license a physics engine to multi-thread it. I know if I were licensing a physics engine it would be with the goal of mucking with it as little as possible. As indicated above the Metro 2033 developers did some work, but that doesn't seem to be the norm.

    Game developers often heavily modify the general game engine, but I'm guessing physics is different. As with most here though I'm not a professional game developer and have not integrated a physics engine into a game.

    Does anyone that has integrated a physics engine have an opinion on the expectations?
     
  10. hoom

    Veteran

    Joined:
    Sep 23, 2003
    Messages:
    2,403
    Either I have a really bad case of deja-vu or someone pointed out this single-threaded x87 thing ages ago :?:
     
  11. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    693
  12. Chalnoth

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,706
    Location:
    New York, NY
A simple recompilation of a program with the SSE option enabled only uses the barest minimum of what SSE can do to improve performance. Basically, x87 floating point performance is absolute shit, and scalar SSE helps to correct that by replacing the awkward x87 register stack with a flat, directly addressable register file. This won't always fix performance, but can lead to up to a 20% increase or so.

    The real power of SSE, however, comes from making use of its SIMD capabilities, which are best used with dedicated, optimized code. Typically a program would make use of these functions by using optimized libraries (e.g. for linear algebra), though using the Intel compiler may allow for some better improvements from a simple recompilation.
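
    To make the distinction concrete, here's a contrived little sketch (my own illustration, nothing to do with any real physics codebase) of scalar SSE versus packed SIMD SSE using intrinsics. Scalar SSE handles one float per instruction; packed SSE handles four at a time:

    Code:
    #include <xmmintrin.h>  // SSE intrinsics

    // Scalar SSE: one float per instruction. Same amount of work as
    // x87, but with a flat register file instead of a register stack.
    void mul_scalar(const float *a, const float *b, float *out, int n)
    {
      for (int i = 0; i < n; i++)
      {
        __m128 va = _mm_load_ss(&a[i]);  // load a single float
        __m128 vb = _mm_load_ss(&b[i]);
        _mm_store_ss(&out[i], _mm_mul_ss(va, vb));
      }
    }

    // Packed (SIMD) SSE: four floats per instruction.
    // Assumes n is a multiple of 4.
    void mul_packed(const float *a, const float *b, float *out, int n)
    {
      for (int i = 0; i < n; i += 4)
      {
        __m128 va = _mm_loadu_ps(&a[i]);  // load four floats at once
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&out[i], _mm_mul_ps(va, vb));
      }
    }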
     
  13. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    359
    Lonbjerg - I think it would behoove you to try and understand the issues I presented in that article and the implications. Reading about x87 and SSE and vectorization would be a good start.

    Also, try and think about the scenario from a developer's perspective - they want to use PhysX to avoid dealing with physics and get about developing their game. Part of the product of a physics engine is performance.

    DK
     
  14. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    12,907
Did you even read what you quoted? They agree with me completely. In fact, the GPU can only do a subset of the things the CPU is tasked to do. When no GPU is available for PhysX, ALL of it is done on the CPU with nothing left out. The GPU only makes things faster. Considering how crippled PhysX is on the CPU, it's quite interesting that they went this route.

Ever wonder why? Probably not, since you virtually worship PhysX. It is pretty telling that only now, after much community backlash, has Nvidia decided to allow automatic multithreading on the CPU (we'll see how committed they are to it and how well it's implemented), although it's been available for some time on GPU and consoles.

Likewise with CPU SIMD instructions: available on consoles (AltiVec-like SIMD) but not on PC (SSE SIMD). Oh, but that would hurt their attempt to sell GPUs by promoting GPU PhysX as being artificially so much faster than CPU physics. Not to mention hurting their efforts at badmouthing Intel and trying to marginalize the role of CPUs going into the future.

Amusing, but that tells us absolutely nothing. Notice I mentioned AAA devs? Devs with a budget to actually make a good AAA game? Sure, if your project has barely got a budget, a free SDK is going to be very attractive even if it sucks donkey nuts.

    I'm sure if we moved down and sampled devs with very tiny budgets Havok would have virtually 0 share. But again, completely avoids the point that I made.

Additionally, that includes console devs. Consoles, where CPU PhysX actually is fairly decent and does get some development. I'd actually be interested to see how well CPU PhysX would do on PC if the PC version were brought up to par with the console version. Perhaps then it would have something over Havok other than price.

    Oh wait that would further reduce the artificial dominance of GPU PhysX. Not going to happen until there is even more community backlash.

    Seriously you're using THAT to showcase good implementation of CPU physics? It's better than nothing I suppose, but doesn't hold a candle to CoH or even the reduced amount of physics in DoW II. Heck DoW I does more and better and I wouldn't put that near the top of the list.

    Regards,
    SB
     
  15. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,517
    Location:
    British Columbia, Canada
That actually is a much larger difference than I expected. Some of those benchmarks saved multiple milliseconds just by switching to scalar SSE! That's huge for changing a compiler flag, frankly, and there's no excuse for not doing at least that much.

    [Reading a bit more in the linked thread though, it's less clear what was actually done... various source-level flags and such were changed as well so it may have affected other code gen :S. Still, quite the large effect if it is mostly x87 -> scalar SSE!]

    But yeah, the point David was making is that there's also no excuse for not actually using the SSE SIMD instructions (more than just scalar SSE), which could easily buy 1.5x-2x performance. I don't see how anyone can really disagree with that point unless you are really arguing that the work isn't justified for only the 98.55% of people who have machines that support SSE2.

    I'm assuming that everyone in this thread does understand the difference between scalar SSE, SIMD SSE and x87 though, correct?
     
    #116 Andrew Lauritzen, Jul 9, 2010
    Last edited by a moderator: Jul 9, 2010
  16. Chalnoth

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,706
    Location:
    New York, NY
Just for fun, I thought I'd throw together a trivial test to take a look at SSE performance on my AMD Phenom II. The test code is extremely simple: make a couple of arbitrary 4x4 matrices, and multiply them together ten million times. Here is the source:

    Code:
    // mat1 = mat1 * mat2, computed row by row (temp holds the row being built)
    void mul_mat(double **mat1, double **mat2, int sz)
    {
      int ii, jj, kk;
      for (ii = 0; ii < sz; ii++)
      {
        double temp[sz];  // VLA: a compiler extension in C++ (fine with icc/gcc)
        for (jj = 0; jj < sz; jj++)
        {
          temp[jj] = 0.;
          for (kk = 0; kk < sz; kk++)
            temp[jj] += mat1[ii][kk]*mat2[kk][jj];
        }
        // write the finished row back into mat1
        for (jj = 0; jj < sz; jj++)
          mat1[ii][jj] = temp[jj];
      }
    }
    
    int main()
    {
      const int sz = 4;
      const int num = 10000000;
      int ii, jj;
    
      // allocate two sz x sz matrices
      double **mat1 = new double*[sz];
      for (ii = 0; ii < sz; ii++)
        mat1[ii] = new double[sz];
      double **mat2 = new double*[sz];
      for (ii = 0; ii < sz; ii++)
        mat2[ii] = new double[sz];
    
      // fill both with arbitrary values in [0, 1)
      for (ii = 0; ii < sz; ii++)
        for (jj = 0; jj < sz; jj++)
        {
          mat1[ii][jj] = double(ii)/sz*double(jj)/sz;
          mat2[ii][jj] = double(ii)/sz*double(jj)/sz;
        }
    
      // multiply ten million times; the matrices are deliberately never
      // freed, since the process exits immediately afterwards
      for (ii = 0; ii < num; ii++)
        mul_mat(mat1, mat2, sz);
    }
    No inputs, no outputs, no library files. Just some very simple math. Now, if I compile the code with the following command:

    icc test-perf.cpp -o test-perf -O3 -msse2 -m32

    ...the code completes in 0.38 seconds. If, instead, I compile with this command:

    icc test-perf.cpp -o test-perf -O3 -mno-sse -mno-sse2 -m32

...the code takes a whole 11.4 seconds to finish. That's a speedup of 30 times! Now, this is a very artificial scenario, but it just highlights how horrible the x87 floating point unit can be.

    If, by contrast, I compile with the "-m64" option to produce a 64-bit binary, none of the SSE options make any noticeable difference (all complete in about 0.38-0.39 seconds). My suspicion is that the small size of the matrix allows the extra register space available from using the SSE2 instructions to really make a big impact, and a 4x4 matrix is just large enough to overflow the register space for x87, but not with the extra SSE2 registers. If I pick a 3x3 or 2x2 matrix, by contrast, the performance benefit drops to 2x. If I increase to an 8x8 matrix, the performance benefit disappears.

    My strong suspicion is that if one were to properly optimize this simple matrix multiplication code for SSE, the full performance benefit, even for larger matrices, would be closer to 2x. I'm not really sure how to do that, though.
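
    If I had to guess at the shape of it, it would look something like the following (a rough, untested sketch of the 4x4 double case only; SSE2 registers hold two doubles, so each output row becomes two accumulators):

    Code:
    #include <emmintrin.h>  // SSE2 intrinsics

    // out = a * b for 4x4 row-major matrices stored as flat arrays.
    // Each output row is built in two __m128d accumulators:
    // columns 0-1 and columns 2-3.
    void mul_mat4_sse2(const double *a, const double *b, double *out)
    {
      for (int i = 0; i < 4; i++)
      {
        __m128d acc0 = _mm_setzero_pd();
        __m128d acc1 = _mm_setzero_pd();
        for (int k = 0; k < 4; k++)
        {
          __m128d aik = _mm_set1_pd(a[4*i + k]);  // broadcast a[i][k]
          acc0 = _mm_add_pd(acc0, _mm_mul_pd(aik, _mm_loadu_pd(&b[4*k])));
          acc1 = _mm_add_pd(acc1, _mm_mul_pd(aik, _mm_loadu_pd(&b[4*k + 2])));
        }
        _mm_storeu_pd(&out[4*i],     acc0);
        _mm_storeu_pd(&out[4*i + 2], acc1);
      }
    }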
     
  17. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,636
You just buy a license for MKL/IPP/TBB from Intel. MKL/IPP are fully dynamic kernel libraries that support a wide variety of math functions. If you don't want to use the Intel libraries, there are various other ones available that do the same things but with lower performance. Basically, this is what any HPC site does for matmul, etc.: they use libraries that auto-optimize for the various architectures available, and they don't bother re-inventing the wheel poorly.
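
    For example (a sketch, assuming MKL or any other CBLAS-compatible library is installed and linked; the header name varies by vendor), the hand-rolled matmul above collapses into a single call:

    Code:
    #include <mkl_cblas.h>  // or <cblas.h> for other BLAS implementations

    // c = 1.0 * a * b + 0.0 * c for 4x4 row-major matrices.
    // The library dispatches to a kernel tuned for the host CPU.
    void mul_mat4_blas(const double *a, const double *b, double *c)
    {
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  4, 4, 4,     // m, n, k
                  1.0, a, 4,   // alpha, A, lda
                  b, 4,        // B, ldb
                  0.0, c, 4);  // beta, C, ldc
    }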
     
  18. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,517
    Location:
    British Columbia, Canada
    Wow... ok definitely something like heavy register pressure going on, but it does demonstrate potential.

I believe in x86-64 (AMD64) mode all compilers assume SSE2 as a baseline, since all processors implementing that instruction set also support SSE2. In MSVC, /arch:SSE2 has no effect in the 64-bit compiler for the same reason. I imagine, given the results, it's just ignoring the flags and using SSE2 no matter what in 64-bit mode.

    You have to assume that it's auto-vectorizing here too (using SIMD SSE), but still an impressive result. I'm curious what the result is if you use single precision floats? Also you may want to try aligning your matrices in memory... I don't recall how bad it is nowadays but SSE load/stores used to take a hit for unaligned accesses. Of course this doesn't help the x87 results at all :S
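
    (For reference, the alignment part is just a matter of swapping the allocation; a minimal sketch using the usual _mm_malloc/_mm_free helpers from the intrinsics headers:)

    Code:
    #include <xmmintrin.h>  // _mm_malloc / _mm_free

    // Allocate a 4x4 double matrix on a 16-byte boundary so that aligned
    // SSE2 loads/stores (_mm_load_pd / _mm_store_pd) can be used instead
    // of the slower unaligned variants.
    double *alloc_mat4_aligned()
    {
      return static_cast<double*>(_mm_malloc(16 * sizeof(double), 16));
    }

    void free_mat4_aligned(double *m)
    {
      _mm_free(m);
    }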
     
  19. Chalnoth

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,706
    Location:
    New York, NY
Very true. But this is just for an exceedingly simple test, and I honestly don't know how to make use of those libraries (which are available for free under Linux, by the way). My understanding is that they make use of an implementation of blas/lapack for linear algebra routines, which I've resolutely steered clear of because a) I haven't done any performance-intensive linear algebra, and b) the syntax I have seen is horrific.

In any case, one would expect that any halfway decent library would make use of such optimized libraries. It would really only take me a day or so to figure out how to integrate blas/lapack into my own programs. A performance library that relies heavily upon large numbers of simple linear algebra manipulations, as a physics library would, should be built from the ground up to make use of such optimized routines.
     
