Post your code optimizations

KimB

Legend
Here's a couple that I've found to be useful:
Take the following code:
Code:
for (i = 0; i < maxvalue; i++)
  p += arr[i];
We can make use of compiler optimizations to unroll an inner loop by doing the following:
Code:
const int stride = 16;
index = 0;
for (i = 0; i < maxvalue/stride; i++)
  for (j = 0; j < stride; j++)
    p += arr[index++];

for (i = 0; i < maxvalue % stride; i++)
  p+= arr[index++];
This resulted in about a factor of two speedup when I tested it on my system.

Another optimization that I made once, when performing one-dimensional integration, was to make use of recursion. Here's the code:
Code:
double trapint(function *f, double a, double b, double acc)
{
  double f1, f2, f3, I1, I2, h;
  h = b - a;

  f1 = f->eval(a);                 //value at beginning of integration
  f2 = f->eval(b);                 //value at end
  f3 = f->eval((a+b)/2.0);     //half-way inbetween

  I1 = 0.5*h*(f1 + f2);             //Integrate using trapezoid rule approximation
  I2 = h/6.0*(f1 + 4.0*f3 + f2); //Integrate using Simpson's rule approximation

  if (fabs(I1 - I2) < acc)       //Check for convergence
    return I2;
  else                           //Hasn't converged: make the result of this integration equal to the sum of two integrations
    return trapint(f, a, a+h/2.0, acc, f1, f3) + trapint(f, a+h/2.0, b, acc, f3, f2);
    //The above function calls are similar to this function, except that the function values
    //at the beginning and end of the integration region are passed instead of recalculating
}
Making use of the above recursive algorithm for integration sped up my integral by a factor of 100, while maintaining better accuracy, than my previous "divide the integral into N steps and sum" setup. Since 1-D integration is the limiting factor in my code at the moment, I'm really hoping one day that I'll be able to implement a recursive function like this on a GPU...but that's beside the point.

Anyway, here's a link on optimization that epicstruggle posted in the Coding style thread. It's got the loop unrolling tip, plus a number of others:
http://www.cse.msu.edu/~cse471/lecture14_files/v3_document.htm
 
I like to make sure that my for loops have a zero comparison:

So instead of:
Code:
for ( i = 0 ; i < arraysize ; i++)

we use the following:
Code:
for ( i = arraysize ; i > 0 ; i-- )

Depending on the scenario you may see some nice gain in speed optimization. Better ones are out there, but this is a nice simple one.

epic
 
This:
Code:
double A[20][1024];

for(i = 0; i < 1024; i++)
    for(j = 0; j < 20; j++)
        t += A[j][0] * B;
Runs three times slower than this (on some processors):
Code:
double A[20][1024 + 8];

for(i = 0; i < 1024; i++)
    for(j = 0; j < 20; j++)
        t += A[j][0] * B;
The reason is that the cache uses the lowest bits of the memory address to determine where to store the data. So in this situation although all elements of A that are used could fit in the cache, they try to use the same cache position. So there are tons of cache misses and this hurts performance badly. The second version doesn't have this problem. All used elements of A get a position in the cache and so all accesses are cache hits.

In practice this worst-case situation doesn't occur a lot, and hardware prefetching solves it on most newer processors, but reordering your data can definitely have a significant impact. It definitely doesn't hurt to learn a bit about how your processor's cache works. Some applications spend more than 50% of time waiting for data that wasn't found in the cache.
 
epicstruggle said:
I like to make sure that my for loops have a zero comparison:

...

Depending on the scenario you may see some nice gain in speed optimization. Better ones are out there, but this is a nice simple one.
That is just horrible!

First of all it's the compiler's task to perform that optimization if it's applicable. Secondly, you can seriously hurt performance this way if you use the index for memory addressing. Some (or even most) processors fetch data ahead, but not backward. So you end up with cache misses that can easily undo any advantage you might get from comparing to zero. Last but not least, never sacrifice readability for performance in a high-level language. If you had to debug or rewrite this loop later it will cost you extra time. Doing a loop backward for an algorithm that is supposed to have a forward loop can get you nasty data dependencies that are hard to track.

There is a serious risk that non-proficient programmers would pick this up and write their loops this way as a habit, never thinking of possible consequences and writing horrible looking unmaintainable code...

The only situation in which this would be acceptable is when writing assembly code. Since that gives you full control and readability is already non-existant. Plus, newbies wouldn't risk assembly programming.
 
K.I.L.E.R said:
Even non-newbies wouldn't risk ASM if they really didn't need it.
Just ask Tim Sweeney. :)
Well it has always been true that 'premature optimization is the root of all evil'. But when you already use the optimal algorithms and you do need better performance then I'd say go all the way and use assembly instead of writing horrible code in a high-level language.

Assembly routines should be regarded like black boxes. You write them once, they are thoroughly tested, and you know exactly what they do. And when you need to extend them there's hardly any other option then to rewrite them, which isn't a bad thing. And I personally always keep a C++ equivalent of the assembly routine for debugging and to better understand the assembly routine's operation. If you'd try to optimize high-level code that's a less obvious thing to do and you end up with unmaintainable code.

By the way, isn't it Tim Sweeney who wrote the incredible MMX optimized software renderer of the original Unreal? That's still masterful hand-optimized assembly code.
 
Haven't tried this one yet, just some random junk. :D
Some actual timing data would be nice, tho'.

Given a laaarge array of (IEEE-754) floats to sort, we can apply a binary transformation on each, then sort them as ints, then apply the inverse transformation. The transformation is as follows: if the sign bit is set, then flip all the other bits. As it turns out, the inverse transformation is the same. :) A possible C++ implementation:
Code:
 inline void transf(int& i) { i ^= i>>31 & ~(1<<31); }
It can be utilized like this:
Code:
float array[CNT];
//...
int *shadow = (int*)array;
for(int i = 0; i<CNT; i++)
  transf(shadow[i]);
//sort, then apply again
Just make sure that ints are 4 bytes long. :) The same could be done to doubles with long longs on 64-bit machines.
 
Nick said:
This:
Code:
double A[20][1024];

for(i = 0; i < 1024; i++)
    for(j = 0; j < 20; j++)
        t += A[j][0] * B;
Runs three times slower than this (on some processors):
Code:
double A[20][1024 + 8];

for(i = 0; i < 1024; i++)
    for(j = 0; j < 20; j++)
        t += A[j][0] * B;
The reason is that the cache uses the lowest bits of the memory address to determine where to store the data. So in this situation although all elements of A that are used could fit in the cache, they try to use the same cache position. So there are tons of cache misses and this hurts performance badly. The second version doesn't have this problem. All used elements of A get a position in the cache and so all accesses are cache hits.
Interesting. I tested the above on an Athlon 64 X2 3800+ and a dual Xeon 1.7GHz, and the Athlon 64 ran the code more than 200 times faster if I didn't initialize the array first! Initializing the array brought the code back to reasonable performance levels on the Xeon, and doubled performance on the Athlon 64.

Also, contrary to what I was expecting, it was the Xeon that does better when not implementing the optimization (it only loses about 15% performance). The Athlon 64 is about 2.5 times slower without the optimization. With the optimization, the Athlon 64 is about 30% faster than the Xeon.
 
Mate Kovacs said:
Haven't tried this one yet, just some random junk. :D
Some actual timing data would be nice, tho'.

Given a laaarge array of (IEEE-754) floats to sort, we can apply a binary transformation on each, then sort them as ints, then apply the inverse transformation.
There are several other funny tricks like that. For example, for taking the integer logarithm of an integer (basically the same as finding the bit position of the most significant bit that it set, you can do (for positive numbers only):
Code:
inline int ilog2( int p )
   {
   union {
     int i;
     float f;
     } intfloat;
   intfloat.f = (float)p;
   return (intfloat.i >> 23) - 127;
   }
While x86 processors do have an instruction named 'bsr' that does basically the same thing, the code sequence above is usually much faster.

Another, unrelated trick for this thread: in the somewhat-common case that you need to divide a whole bunch of numbers by the same number, the code sequence
Code:
double *arr;
for(i=0;i<size;i++)
   arr[i] /= divisor;
is often a hell of a lot slower than
Code:
double *arr;
double divisor_rcp = 1.0/divisor;
for(i=0;i<size;i++)
   arr[i] *= divisor_rcp;
This is pretty much trivial to do for floats; it is also possible - but much harder - to do this conversion of division to multiplication for integers as well (gcc does this if you are dividing by a compile-time constant; otherwise you are on your own).

If you're really going for the last drop of performance: On some processors/compilers, it is possible to make the compiler use the carry flag, albeit in a somewhat roundabout way:
Code:
unsigned a,b;

a += b;
if(a < b) // only true if a+b overflowed, in which case the a+b computation set the carry flag.
   {
   // do the carry-flag-dependent stuff here.
   c = *d++;
   }
(I have seen this code sequence compile to a grand total of 2 instructions on certain CPUs.)
 
Nick said:
That is just horrible!

First of all it's the compiler's task to perform that optimization if it's applicable. Secondly, you can seriously hurt performance this way if you use the index for memory addressing. Some (or even most) processors fetch data ahead, but not backward. So you end up with cache misses that can easily undo any advantage you might get from comparing to zero. Last but not least, never sacrifice readability for performance in a high-level language. If you had to debug or rewrite this loop later it will cost you extra time. Doing a loop backward for an algorithm that is supposed to have a forward loop can get you nasty data dependencies that are hard to track.

There is a serious risk that non-proficient programmers would pick this up and write their loops this way as a habit, never thinking of possible consequences and writing horrible looking unmaintainable code...

The only situation in which this would be acceptable is when writing assembly code. Since that gives you full control and readability is already non-existant. Plus, newbies wouldn't risk assembly programming.
Ive had a different experience with this type of optimization. Ive never had a penalty and almost always a nice gain in performance. Granted the last time i did any sort of serious coding was about 4-5 years ago, compilers might have incorporated this optimization into the compiling process.

Out of curiousity, have you given it a try?

epic
 
About the count-to-zero for-loop: if you use the counter as an array index and do not want to step backwards through memory, you can instead start it out with a negative value and have it count upwards towards 0, something as follows:
Code:
int *arr = ....

arr += size;  // yay for pointer arithmetic!
for(int i=-size;i<0;i++)
    arr[i] = your_data();
It's not something that I would want to be doing in non-performance-critical code, though.
 
arjan de lumens said:
It's not something that I would want to be doing in non-performance-critical code, though.
Very true. Depending on the program/enviromnent, optimizing might be a waste of time. Sometimes having a easy to read/edit codebase is more important than something hard to understand but very fast. While working on my augmented reality thesis, speed was more important that readibility. I was the only one working on the code, so i didnt have to worry about someone else looking at it, outside of my professor who was at least 1000x smarter than i am. :)

epic
 
epicstruggle said:
Ive had a different experience with this type of optimization. Ive never had a penalty and almost always a nice gain in performance. Granted the last time i did any sort of serious coding was about 4-5 years ago, compilers might have incorporated this optimization into the compiling process.

Out of curiousity, have you given it a try?
It made zero difference using Visual C++ 2005 on an Athlon 64 X2 4400+.

I program in C++ and assembly daily, and I strongly advise everyone to never do this sort of things in a high-level language. It's just not worth it. It affects core manageability and there is no guarantee of improved performance. It's an attempt at controlling the assembly output, but this is totally compiler specific. If you need this level of control, step to assembly completely for the specific function, after all functional development is finished.
 
Nick said:
It made zero difference using Visual C++ 2005 on an Athlon 64 X2 4400+.

I program in C++ and assembly daily, and I strongly advise everyone to never do this sort of things in a high-level language. It's just not worth it. It affects core manageability and there is no guarantee of improved performance. It's an attempt at controlling the assembly output, but this is totally compiler specific. If you need this level of control, step to assembly completely for the specific function, after all functional development is finished.
Ive posted at least 2 different sources that do show an improvement in performance. ;) I guess they are idiots.
 
Nick said:
JavaScript is an interpreted language. Any discussion about performance is totally futile. I mean if performance is of any concern then don't choose JavaScript in the first place. It's like giving a crippled person running shoes.
Well, that's not such a bad thing, as long as you don't expect them to actually run :)

Anyway, tested it a couple more times myself and also noticed no difference. This includes doing the "offset and increment to zero" method.

Edit: There is, however, a difference when you do the loop unrolling method, but it seems to help only if the array I'm incrementing is an integer array. With floating point math, there doesn't seem to be any difference.
 
We will have to agree to disagree, ive seen some nice gains using it. You many not. Depending on what exactly is being run, we might never have overlapping results. I will leave it up to everyone to try it for themselves, it's simple enough that it wont take more than a minute or two to place it and test it. If you see a gain, great, if not, just change it back. Simple.

epic
 
epicstruggle said:
Ive posted at least 2 different sources that do show an improvement in performance. ;) I guess they are idiots.
I never said that a performance improvement is not possible. In fact some of the loops I write in assembly use an inversed index and it makes a tiny difference in these few select situations. Nobody's an idiot if they know how to use it.

My point is that the effect is so minimal and easily negated that it's not worth affecting the quality of high-level code. It definitely falls into the 'premature optimization is the root of all evil' category. It's very easy to mistype when you're not careful. And when you have to change something in that code a couple months later (or someone else has to), chances are that you'll make a mistake. And the first thing you'll do to debug it is rewriting it as a forward loop. With again a risk of breaking things or at least getting confused. So even though it looks innocent at first, it's evil.

It's really ironic that high-level languages were invented to abstract machine-dependent code and save us time by making things readable and easy to manage. And now some people expect extra control but wish to stay at the same level. The only solution is to write assembly code. And there's absolutely nothing wrong with that (many C library functions like memcpy use hand-written assembly). But please keep high-level code high-level. It's important for fast development and manageability.
 
Back
Top