Post your code optimizations

Discussion in 'PC Hardware, Software and Displays' started by KimB, May 21, 2006.

  1. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Here are a couple that I've found to be useful:
    Take the following code:
    Code:
    for (i = 0; i < maxvalue; i++)
      p += arr[i];
    We can make use of compiler optimizations to unroll an inner loop by doing the following:
    Code:
    const int stride = 16;
    index = 0;
    for (i = 0; i < maxvalue/stride; i++)
      for (j = 0; j < stride; j++)
        p += arr[index++];
    
    for (i = 0; i < maxvalue % stride; i++)
      p+= arr[index++];
    This resulted in about a factor of two speedup when I tested it on my system.
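    Packaged as a self-contained function (the name `sum_strided` and the int element type are mine, just for illustration), the trick looks like this; the inner loop's fixed trip count is what lets the compiler unroll it:

```cpp
#include <cassert>

static const int STRIDE = 16;

// Sum an array using the strided double loop from the post. The inner loop
// always runs exactly STRIDE times, so the compiler can unroll it fully.
// The second loop mops up the leftover elements when maxvalue isn't a
// multiple of STRIDE.
int sum_strided(const int *arr, int maxvalue)
{
    int p = 0, index = 0;
    for (int i = 0; i < maxvalue / STRIDE; i++)
        for (int j = 0; j < STRIDE; j++)
            p += arr[index++];
    for (int i = 0; i < maxvalue % STRIDE; i++)
        p += arr[index++];
    return p;
}
```

    Whether this actually beats the plain loop depends on the compiler and its settings; modern compilers often unroll the simple version on their own.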

    Another optimization that I made once, when performing one-dimensional integration, was to make use of recursion. Here's the code:
    Code:
    double trapint(function *f, double a, double b, double acc,
                   double f1, double f2)   //f1, f2: function values at a and b
    {
      double f3, I1, I2, h;
      h = b - a;
    
      f3 = f->eval((a+b)/2.0);       //value half-way in between
    
      I1 = 0.5*h*(f1 + f2);          //Integrate using trapezoid rule approximation
      I2 = h/6.0*(f1 + 4.0*f3 + f2); //Integrate using Simpson's rule approximation
    
      if (fabs(I1 - I2) < acc)       //Check for convergence
        return I2;
      else                           //Hasn't converged: sum the integrals over the two halves
        return trapint(f, a, a+h/2.0, acc, f1, f3) + trapint(f, a+h/2.0, b, acc, f3, f2);
        //The function values at the endpoints and midpoint are passed down
        //instead of being recalculated, so each point is evaluated only once
    }
    Making use of the above recursive algorithm sped up my integration by a factor of about 100 over my previous "divide the interval into N steps and sum" setup, while also giving better accuracy. Since 1-D integration is the limiting factor in my code at the moment, I'm really hoping that one day I'll be able to implement a recursive function like this on a GPU...but that's beside the point.
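    Here's a minimal compilable sketch of the same adaptive scheme, using a plain function pointer in place of the `function` class above (the names `trapint_rec`, `integrate`, and `f_sin` are mine):

```cpp
#include <cassert>
#include <cmath>

// Sample integrand for trying the routine out.
static double f_sin(double x) { return std::sin(x); }

// Recursive worker: fa and fb are f(a) and f(b), passed down so that no
// point is ever evaluated twice.
static double trapint_rec(double (*f)(double), double a, double b,
                          double acc, double fa, double fb)
{
    double h  = b - a;
    double fm = f((a + b) / 2.0);               // midpoint value

    double I1 = 0.5 * h * (fa + fb);            // trapezoid estimate
    double I2 = h / 6.0 * (fa + 4.0 * fm + fb); // Simpson estimate

    if (std::fabs(I1 - I2) < acc)               // estimates agree: accept Simpson
        return I2;
    return trapint_rec(f, a, a + h / 2.0, acc, fa, fm)
         + trapint_rec(f, a + h / 2.0, b, acc, fm, fb);
}

// Public entry point: evaluates the endpoints once, then recurses.
double integrate(double (*f)(double), double a, double b, double acc)
{
    return trapint_rec(f, a, b, acc, f(a), f(b));
}
```

    Note that `acc` here is a per-subinterval tolerance, so the total error over many subintervals can be somewhat larger than `acc` itself.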

    Anyway, here's a link on optimization that epicstruggle posted in the Coding style thread. It's got the loop unrolling tip, plus a number of others:
    http://www.cse.msu.edu/~cse471/lecture14_files/v3_document.htm
     
  2. db

    db
    Newcomer

    Joined:
    Mar 31, 2004
    Messages:
    43
    Likes Received:
    4
  3. epicstruggle

    epicstruggle Passenger on Serenity
    Veteran

    Joined:
    Jul 24, 2002
    Messages:
    1,903
    Likes Received:
    45
    Location:
    Object in Space
    I like to make sure that my for loops have a zero comparison:

    So instead of:
    Code:
    for ( i = 0 ; i < arraysize ; i++)
    we use the following:
    Code:
    for ( i = arraysize ; i > 0 ; i-- )
    Depending on the scenario you may see a nice gain in speed. Better ones are out there, but this is a nice simple one.
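    To make the comparison concrete, here is a sketch of both forms (the function names are mine); they compute the same result, and any speed difference comes down to whether the compiler emits a cheaper decrement-and-branch for the compare-against-zero:

```cpp
#include <cassert>

// The usual forward loop.
int sum_forward(const int *arr, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += arr[i];
    return total;
}

// The count-down-to-zero form: the loop test is a compare against zero.
// Note that this also walks memory backwards, which can interact badly
// with hardware prefetchers on some processors.
int sum_backward(const int *arr, int n)
{
    int total = 0;
    for (int i = n; i > 0; i--)
        total += arr[i - 1];
    return total;
}
```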

    epic
     
  4. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    This:
    Code:
    double A[20][1024];
    
    for(i = 0; i < 1024; i++)
        for(j = 0; j < 20; j++)
            t += A[j][0] * B;
    
    Runs three times slower than this (on some processors):
    Code:
    double A[20][1024 + 8];
    
    for(i = 0; i < 1024; i++)
        for(j = 0; j < 20; j++)
            t += A[j][0] * B;
    
    The reason is that the cache uses the lowest bits of the memory address to choose the set where data is stored. Here each row of A is 1024 doubles = 8192 bytes, a power of two, so the twenty elements A[0][0]..A[19][0] are exactly 8192 bytes apart and all map to the same cache set. Although they could easily fit in the cache together, they keep evicting each other, so there are tons of conflict misses and this hurts performance badly. The second version pads each row to 8256 bytes, which spreads the rows over different sets: every used element keeps its own place in the cache and all accesses are cache hits.

    In practice this worst-case situation doesn't occur a lot, and hardware prefetching solves it on most newer processors, but reordering your data can definitely have a significant impact. It definitely doesn't hurt to learn a bit about how your processor's cache works. Some applications spend more than 50% of time waiting for data that wasn't found in the cache.
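    The set-selection arithmetic can be sketched with a toy model. The 8 KB, 4-way, 64-byte-line figures below are an assumed example cache (roughly a Pentium 4 L1): one way spans 32 sets × 64 bytes = 2 KB, and 8192 is an exact multiple of 2048, so every 8192-byte-spaced row lands in the same set and twenty rows fight over four ways.

```cpp
#include <cassert>

// Toy model of set-associative cache indexing: the set index comes from
// the address bits just above the line-offset bits. Real caches may index
// on physical addresses, but the idea is the same.
unsigned cache_set(unsigned long addr, unsigned line_size, unsigned num_sets)
{
    return (unsigned)((addr / line_size) % num_sets);
}
```

    With 32 sets of 64-byte lines, rows 8192 bytes apart all collide in one set, while padded rows of 8256 bytes step through different sets.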
     
  5. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    That is just horrible!

    First of all it's the compiler's task to perform that optimization if it's applicable. Secondly, you can seriously hurt performance this way if you use the index for memory addressing. Some (or even most) processors fetch data ahead, but not backward. So you end up with cache misses that can easily undo any advantage you might get from comparing to zero. Last but not least, never sacrifice readability for performance in a high-level language. If you had to debug or rewrite this loop later it will cost you extra time. Doing a loop backward for an algorithm that is supposed to have a forward loop can get you nasty data dependencies that are hard to track.

    There is a serious risk that non-proficient programmers would pick this up and write their loops this way as a habit, never thinking of possible consequences and writing horrible looking unmaintainable code...

    The only situation in which this would be acceptable is when writing assembly code, since that gives you full control and readability is already non-existent. Plus, newbies wouldn't risk assembly programming.
     
    Tim Murray likes this.
  6. K.I.L.E.R

    K.I.L.E.R Retarded moron
    Veteran

    Joined:
    Jun 17, 2002
    Messages:
    2,952
    Likes Received:
    50
    Location:
    Australia, Melbourne
    Even non-newbies wouldn't risk ASM if they really didn't need it.
    Just ask Tim Sweeney. :)
     
  7. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    Well, it has always been true that 'premature optimization is the root of all evil'. But when you already use the optimal algorithms and you do need better performance, then I'd say go all the way and use assembly instead of writing horrible code in a high-level language.

    Assembly routines should be regarded as black boxes. You write them once, they are thoroughly tested, and you know exactly what they do. And when you need to extend them there's hardly any other option than to rewrite them, which isn't a bad thing. I personally always keep a C++ equivalent of each assembly routine for debugging and to better understand the routine's operation. If you instead try to optimize high-level code this way, that's a much less obvious thing to do, and you end up with unmaintainable code.

    By the way, isn't it Tim Sweeney who wrote the incredible MMX optimized software renderer of the original Unreal? That's still masterful hand-optimized assembly code.
     
  8. Mate Kovacs

    Newcomer

    Joined:
    Dec 12, 2004
    Messages:
    163
    Likes Received:
    3
    Location:
    Mountain View, CA
    Haven't tried this one yet, just some random junk. :D
    Some actual timing data would be nice, tho'.

    Given a laaarge array of (IEEE-754) floats to sort, we can apply a binary transformation on each, then sort them as ints, then apply the inverse transformation. The transformation is as follows: if the sign bit is set, then flip all the other bits. As it turns out, the inverse transformation is the same. :) A possible C++ implementation:
    Code:
    inline void transf(int& i) { i ^= (i >> 31) & ~(1 << 31); }
    It can be utilized like this:
    Code:
    float array[CNT];
    //...
    int *shadow = (int*)array;
    for(int i = 0; i<CNT; i++)
      transf(shadow[i]);
    //sort, then apply again
    
    Just make sure that ints are 4 bytes long. :) The same could be done to doubles with long longs on 64-bit machines.
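    A quick end-to-end check of the trick (the wrapper name `sort_as_ints` is mine; I've used memcpy for the int/float punning to sidestep strict-aliasing issues, and the shift-based mask assumes the usual arithmetic right shift for negative ints):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// If the sign bit is set, flip all the other bits; the map is its own
// inverse. (i >> 31) is all-ones for negative i on the usual
// arithmetic-shift compilers, and 0x7FFFFFFF masks off the sign bit.
inline void transf(std::int32_t &i) { i ^= (i >> 31) & 0x7FFFFFFF; }

// Sort floats by transforming their bit patterns, sorting as signed ints,
// and transforming back.
void sort_as_ints(std::vector<float> &v)
{
    std::vector<std::int32_t> shadow(v.size());
    std::memcpy(shadow.data(), v.data(), v.size() * sizeof(float));
    for (std::int32_t &i : shadow) transf(i);
    std::sort(shadow.begin(), shadow.end());
    for (std::int32_t &i : shadow) transf(i);
    std::memcpy(v.data(), shadow.data(), v.size() * sizeof(float));
}
```

    An ordinary std::sort obviously won't show any payoff; the point is that after the transform the keys are plain integers, so a fast integer radix sort can be dropped in.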
     
    Tim Murray likes this.
  9. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Interesting. I tested the above on an Athlon 64 X2 3800+ and a dual Xeon 1.7GHz, and the Athlon 64 ran the code more than 200 times faster if I didn't initialize the array first! Initializing the array brought the code back to reasonable performance levels on the Xeon, and doubled performance on the Athlon 64.

    Also, contrary to what I was expecting, it was the Xeon that does better when not implementing the optimization (it only loses about 15% performance). The Athlon 64 is about 2.5 times slower without the optimization. With the optimization, the Athlon 64 is about 30% faster than the Xeon.
     
  10. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    There are several other funny tricks like that. For example, to take the integer logarithm of an integer (basically the same as finding the bit position of the most significant bit that is set), you can do the following (for positive numbers only):
    Code:
    inline int ilog2( int p )
    {
      union {
        int i;
        float f;
      } intfloat;
      intfloat.f = (float)p;
      return (intfloat.i >> 23) - 127;
    }
    
    While x86 processors do have an instruction named 'bsr' that does basically the same thing, the code sequence above is usually much faster.
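    One caveat worth adding: the trick is only exact while the int-to-float conversion is exact. A float has a 24-bit significand, so for p above 2^24 the conversion can round p up across a power of two and overshoot the answer by one. A self-contained version for experimenting (union punning is technically a C idiom rather than strict C++, but it is supported by all the mainstream compilers):

```cpp
#include <cassert>

// Integer log2 via the float exponent field: for a positive float, bits
// 30..23 hold the exponent biased by 127, so shifting and subtracting the
// bias yields floor(log2(p)). Exact for 1 <= p <= 2^24.
inline int ilog2(int p)
{
    union { int i; float f; } intfloat;
    intfloat.f = (float)p;
    return (intfloat.i >> 23) - 127;
}
```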

    Another, unrelated trick for this thread: in the somewhat-common case that you need to divide a whole bunch of numbers by the same number, the code sequence
    Code:
    double *arr;
    for(i=0;i<size;i++)
       arr[i] /= divisor;
    
    is often a hell of a lot slower than
    Code:
    double *arr;
    double divisor_rcp = 1.0/divisor;
    for(i=0;i<size;i++)
       arr[i] *= divisor_rcp;
    
    This is pretty much trivial to do for floats; it is also possible - but much harder - to do this conversion of division to multiplication for integers as well (gcc does this if you are dividing by a compile-time constant; otherwise you are on your own).
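    As a sketch (the function name is mine), with the caveat that the two forms aren't bit-identical: 1.0/divisor is itself rounded, so results can differ from true division in the last bit or so.

```cpp
#include <cassert>
#include <cstddef>

// Divide every element by the same divisor, replacing the slow per-element
// divide with one divide up front and a multiply per element.
void scale_by_reciprocal(double *arr, std::size_t n, double divisor)
{
    const double rcp = 1.0 / divisor;
    for (std::size_t i = 0; i < n; i++)
        arr[i] *= rcp;
}
```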

    If you're really going for the last drop of performance: On some processors/compilers, it is possible to make the compiler use the carry flag, albeit in a somewhat roundabout way:
    Code:
    unsigned a,b;
    
    a += b;
    if(a < b) // only true if a+b overflowed, in which case the a+b computation set the carry flag.
       {
       // do the carry-flag-dependent stuff here.
       c = *d++;
       }
    
    (I have seen this code sequence compile to a grand total of 2 instructions on certain CPUs.)
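    Wrapped in a function, the idiom looks like this (the name is mine); a good compiler can turn the comparison into a direct use of the carry flag:

```cpp
#include <cassert>
#include <climits>

// Add b into *a with wraparound; return 1 if the addition carried out of
// the top bit. In unsigned arithmetic, a + b wraps around exactly when the
// wrapped result is smaller than either operand.
unsigned add_carry(unsigned *a, unsigned b)
{
    *a += b;
    return (*a < b) ? 1u : 0u;
}
```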
     
    Tim Murray likes this.
  11. epicstruggle

    epicstruggle Passenger on Serenity
    Veteran

    Joined:
    Jul 24, 2002
    Messages:
    1,903
    Likes Received:
    45
    Location:
    Object in Space
    I've had a different experience with this type of optimization: I've never seen a penalty, and almost always a nice gain in performance. Granted, the last time I did any serious coding was about 4-5 years ago; compilers might have incorporated this optimization into the compiling process since then.

    Out of curiosity, have you given it a try?

    epic
     
  12. arjan de lumens

    Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,274
    Likes Received:
    50
    Location:
    gjethus, Norway
    About the count-to-zero for-loop: if you use the counter as an array index and do not want to step backwards through memory, you can instead start it at a negative value and have it count upwards towards 0, something like this:
    Code:
    int *arr = ....
    
    arr += size;  // yay for pointer arithmetic!
    for(int i=-size;i<0;i++)
        arr[i] = your_data();
    
    It's not something that I would want to be doing in non-performance-critical code, though.
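    Filled out into something compilable (the names are mine), the idea looks like this; the counter still hits the cheap compare-against-zero while memory is walked forwards:

```cpp
#include <cassert>

// Fill arr[0..size-1] with 0, 1, 2, ... while the loop counter runs up
// towards zero. The pointer is moved one past the end, so end[-size] is
// arr[0], end[-size + 1] is arr[1], and so on.
void fill_ascending(int *arr, int size)
{
    int *end = arr + size;        // yay for pointer arithmetic!
    for (int i = -size; i < 0; i++)
        end[i] = i + size;
}
```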
     
  13. epicstruggle

    epicstruggle Passenger on Serenity
    Veteran

    Joined:
    Jul 24, 2002
    Messages:
    1,903
    Likes Received:
    45
    Location:
    Object in Space
    Very true. Depending on the program/environment, optimizing might be a waste of time. Sometimes having an easy-to-read/edit codebase is more important than something hard to understand but very fast. While working on my augmented reality thesis, speed was more important than readability. I was the only one working on the code, so I didn't have to worry about anyone else looking at it, outside of my professor, who was at least 1000x smarter than I am. :)

    epic
     
  14. epicstruggle

    epicstruggle Passenger on Serenity
    Veteran

    Joined:
    Jul 24, 2002
    Messages:
    1,903
    Likes Received:
    45
    Location:
    Object in Space
  15. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    It made zero difference using Visual C++ 2005 on an Athlon 64 X2 4400+.

    I program in C++ and assembly daily, and I strongly advise everyone never to do this sort of thing in a high-level language. It's just not worth it. It affects code manageability and there is no guarantee of improved performance. It's an attempt at controlling the assembly output, but that is totally compiler specific. If you need this level of control, step to assembly completely for the specific function, after all functional development is finished.
     
  16. epicstruggle

    epicstruggle Passenger on Serenity
    Veteran

    Joined:
    Jul 24, 2002
    Messages:
    1,903
    Likes Received:
    45
    Location:
    Object in Space
    I've posted at least 2 different sources that do show an improvement in performance. ;) I guess they are idiots.
     
  17. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    JavaScript is an interpreted language. Any discussion about performance is totally futile. I mean if performance is of any concern then don't choose JavaScript in the first place. It's like giving a crippled person running shoes.
     
  18. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,928
    Likes Received:
    230
    Location:
    Seattle, WA
    Well, that's not such a bad thing, as long as you don't expect them to actually run :)

    Anyway, tested it a couple more times myself and also noticed no difference. This includes doing the "offset and increment to zero" method.

    Edit: There is, however, a difference when you do the loop unrolling method, but it seems to help only if the array I'm incrementing is an integer array. With floating point math, there doesn't seem to be any difference.
     
  19. epicstruggle

    epicstruggle Passenger on Serenity
    Veteran

    Joined:
    Jul 24, 2002
    Messages:
    1,903
    Likes Received:
    45
    Location:
    Object in Space
    We will have to agree to disagree; I've seen some nice gains using it, you may not. Depending on what exactly is being run, we might never have overlapping results. I'll leave it up to everyone to try it for themselves; it's simple enough that it won't take more than a minute or two to drop it in and test it. If you see a gain, great; if not, just change it back. Simple.

    epic
     
  20. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I never said that a performance improvement is not possible. In fact some of the loops I write in assembly use a reversed index, and it makes a tiny difference in those few select situations. Nobody's an idiot if they know how to use it.

    My point is that the effect is so minimal and easily negated that it's not worth affecting the quality of high-level code. It definitely falls into the 'premature optimization is the root of all evil' category. It's very easy to mistype when you're not careful. And when you have to change something in that code a couple months later (or someone else has to), chances are that you'll make a mistake. And the first thing you'll do to debug it is rewriting it as a forward loop. With again a risk of breaking things or at least getting confused. So even though it looks innocent at first, it's evil.

    It's really ironic that high-level languages were invented to abstract machine-dependent code and save us time by making things readable and easy to manage. And now some people expect extra control but wish to stay at the same level. The only solution is to write assembly code. And there's absolutely nothing wrong with that (many C library functions like memcpy use hand-written assembly). But please keep high-level code high-level. It's important for fast development and manageability.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.