Post your code optimizations

Nick said:
Continuing the topic, here's an optimization I posted at DevMaster.net a few weeks ago: Fast and accurate sine/cosine.
My somewhat cheesy default-answer to fast-and-not-uber-precise transcendental functions is: Lookup Table. If the table gets too big or you need more precision, interpolate between its elements. With quadratic or cubic interpolation, the table can get surprisingly small for a given precision requirement, at least for the more common self-similar functions (sin,cos,exp,log,rcp,sqrt).
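
A minimal sketch of that idea for sine, assuming a 256-entry table, linear interpolation, and an angle already reduced to a fraction of a full turn in [0,1) (the names and sizes here are purely illustrative):
Code:
#include <math.h>

#define TBL_SIZE 256   /* illustrative table size */
static float sin_tbl[TBL_SIZE + 1];   /* one extra entry so interpolation never reads past the end */

void init_sin_tbl(void)
    {
    for (int i = 0; i <= TBL_SIZE; i++)
        sin_tbl[i] = (float)sin(2.0 * M_PI * i / TBL_SIZE);
    }

float lut_sin(float turns)   /* 'turns' assumed already reduced to [0,1) */
    {
    float t = turns * TBL_SIZE;
    int   i = (int)t;          /* table index */
    float f = t - (float)i;    /* position between the two neighbouring entries */
    return sin_tbl[i] + f * (sin_tbl[i + 1] - sin_tbl[i]);
    }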

Your posting also does not address the problem of reducing the input number to the [-pi,+pi] range; here is a quite fast, accurate, but amazingly nasty way it can be done, assuming full IEEE754 compliance (and yes, on my gcc installation the code does indeed break if you remove the 'volatile' keyword!):
Code:
#include <math.h>   /* for M_PI */

#define MAGIC 12582912.0f   /* 1.5 * 2^23 */
float reduce_range( float p )
    {
    float p0 = p * (0.5/M_PI);   /* radians -> fraction of a full turn */
    volatile float p1, p2;
    p1 = p0 + MAGIC;             /* rounds p0 to the nearest integer... */
    p2 = p1 - MAGIC;             /* ...and recovers that integer here */
    p0 -= p2;                    /* fractional part, now in (-0.5,+0.5) */
    return p0 * (2.0*M_PI);      /* back to radians, in (-pi,+pi) */
    }
 
Nice, that's pretty slick :)

It looks like what the code does is that the MAGIC number is represented by a number with a mantissa 100000... (it's 1.5*2^23). Now, add this to a number smaller than MAGIC, and all bits that represent the fractional part of the original number get cut off. Subtract MAGIC from this new number, and you're left with the integer part. Play with that as you will :)

P.S.
Wouldn't using just 2^23 as MAGIC also work? why use 1.5*2^23?

Edit:
Hmm, it looks like the above code actually doesn't quite do what I thought. It doesn't give the integer part, but rather rounds the number to an integer, meaning the "fractional part" has a range of (-0.5, 0.5)
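
For example (numbers picked purely for illustration): with p0 = 3.3, p1 = 12582915.3 rounds to 12582915.0 (floats in [2^23, 2^24) are spaced exactly 1.0 apart), so p2 = 3.0 and p0 - p2 comes out as roughly 0.3.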
 
Chalnoth said:
Nice, that's pretty slick :)

It looks like what the code does is that the MAGIC number is represented by a number with a mantissa 100000... (it's 1.5*2^23). Now, add this to a number smaller than MAGIC, and all bits that represent the fractional part of the original number get cut off. Subtract MAGIC from this new number, and you're left with the integer part. Play with that as you will :)
That's basically the heart of the entire method. I leave it as an exercise to the reader to derive appropriate values of MAGIC for the 'double' and 'long double' floating-point types.
P.S.
Wouldn't using just 2^23 as MAGIC also work? why use 1.5*2^23?
If I use 2^23 directly as MAGIC and then feed in a negative number, then the result of the first addition becomes smaller than 2^23, and in that case there will be one fractional bit left in p1's mantissa that is not chopped off correctly.
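
For example (numbers picked purely for illustration): with MAGIC = 2^23 = 8388608 and p0 = -3.3, the sum p1 is about 8388604.7, which lies below 2^23 where the float spacing is only 0.5, so one fractional bit survives the rounding. With MAGIC = 1.5*2^23 the sum stays inside [2^23, 2^24), where the spacing is exactly 1.0, for all |p0| up to 2^22.
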
Edit:
Hmm, it looks like the above code actually doesn't quite do what I thought. It doesn't give the integer part, but rather rounds the number to an integer.
That's IEEE754 'nearest rounding' that steps into action.
 
Arjan - I have to object to your use of the term 'nasty' :) IMO there is absolutely nothing wrong with that code.

Somewhat OT, but range reduction demonstrates why storing angles in radians sucks. If it weren't for the M_PI constant, your code would also be perfectly exact.

Why not implement sinpi(x) = sin( pi * x )? Then you could store angles x in fixed point and get indexes into lookup tables with bit ops alone...
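
A rough sketch of that suggestion, assuming angles stored as a 32-bit fixed-point fraction of a full circle and a 256-entry table (the names and sizes are purely illustrative):
Code:
#include <stdint.h>

extern float sin_tbl[256];   /* sin(2*pi*i/256), assumed built elsewhere */

/* 'angle' is a 32-bit fixed-point fraction of a full circle:
   0x00000000 = 0 turns, 0x80000000 = half a turn (illustrative format) */
float fixed_sin(uint32_t angle)
    {
    /* range reduction is free: the fixed-point angle wraps around on its own,
       and the table index is just the top 8 bits */
    return sin_tbl[angle >> 24];
    }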

Serge
 
Here's a trick (I believe from David Seal) that I thought was really beautiful when I first saw it. It computes the number of trailing zero bits in a 64bit integer:

Code:
    // magicTable is a precomputed 128-entry lookup table that maps each
    // possible 7-bit index back to the corresponding trailing-zero count
    inline int trailingZeros(unsigned long long v)
    {
        return magicTable[ ((v & -v) * 0x0214639644ed74dfULL) >> 57 ];
    }

Explanation: The magic constant above (call it M) must obey the following:
each of the 64 sequences of 7 consecutive bits in the 70-bit number (M << 6)
must be unique and non-zero.

Now, (v & -v) is a power of 2 (equal to the least significant set bit in v) -
multiplying by (v & -v) therefore corresponds to a left shift.

Since there is a one to one mapping between the number of trailing zeros in v,
(v & -v), and the 7 bit number produced by ((v & -v) * M) >> 57, that 7 bit
number can be used as an index into a lookup table to find the number of
trailing zeroes in v.

All without a single branch :)
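
As a quick usage illustration (values picked here for the example):
Code:
unsigned long long v = 0xA0ULL;   /* binary 10100000 -> five trailing zeros */
int n = trailingZeros(v);         /* n == 5, computed without a single branch */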
 
An optimization that can really pay off if you're building any form of data structure where you'd call malloc()/new a lot for nodes and such, like BSPs, Kd-trees, hash tables etc., is to allocate a large chunk of sequential memory (you can often compute the maximum memory that would ever be needed) and implement your own little memory allocator that just steps forward in that memory chunk. Like for instance:

Code:
static unsigned char *currMem;   /* points into the preallocated chunk */

unsigned char *getMem(const int size){
    unsigned char *mem = currMem;
    currMem += size;
    return mem;
}

In my model class, where I use Kd-trees and hash tables to remove duplicated vertices, using this method doubled the performance.
 
It works best, though, if you have one private area for each fixed-size struct, since you can then implement a free() function without worrying about fragmentation, and free()/get() can still run in constant time. Otherwise, what you've got is a stack analog in the heap, where memory can only be allocated/freed efficiently at the tail end.
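
A minimal sketch of such a fixed-size pool with a free list (all names here are illustrative):
Code:
#include <stddef.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;

static unsigned char *poolMem;     /* steps forward through one private chunk */
static FreeNode *freeList = NULL;  /* freed blocks, chained through their own memory */
static size_t blockSize;           /* the fixed struct size, assumed >= sizeof(FreeNode*) */

void *poolGet(void)
    {
    if (freeList)                  /* reuse a freed block in constant time */
        {
        FreeNode *n = freeList;
        freeList = n->next;
        return n;
        }
    void *mem = poolMem;           /* otherwise just step forward in the chunk */
    poolMem += blockSize;
    return mem;
    }

void poolFree(void *p)
    {
    FreeNode *n = (FreeNode *)p;   /* the freed block becomes the new list head */
    n->next = freeList;
    freeList = n;
    }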
 
psurge said:
Arjan - I have to object to your use of the term 'nasty' :) IMO there is absolutely nothing wrong with that code.
The main reason I called it 'nasty' is that one normally thinks of floating-point numbers as a computer representation of real-numbers, which is a useful abstraction/illusion most of the time (with the most common exception being exact-equality tests). The trick I did violates that illusion - instead of carefully trying to ignore or minimize the difference between true real-numbers and floating-point numbers, it intentionally makes use of the difference to achieve a certain effect. It is indeed a valid and correct way of writing code, but it's one that probably should be accompanied by about one screenful of comments to explain just what the hell it does.
 
arjan de lumens said:
My somewhat cheesy default-answer to fast-and-not-uber-precise transcendental functions is: Lookup Table. If the table gets too big or you need more precision, interpolate between its elements. With quadratic or cubic interpolation, the table can get surprisingly small for a given precision requirement, at least for the more common self-similar functions (sin,cos,exp,log,rcp,sqrt).
I'd be more than interested in fast and reasonably accurate approximations of exp, log and pow...
Code:
#define MAGIC 12582912.0f
float reduce_range( float p )
    {
    float p0 = p * (0.5/M_PI);
    volatile float p1, p2;
    p1 = p0 + MAGIC;
    p2 = p1 - MAGIC;
    p0 -= p2;
    return p0 * (2.0*M_PI);
    }
That's awesome! Without the pi related stuff it's a three instruction fraction function without slow float to int and int to float conversion! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?
 
Nick said:
That's awesome! Without the pi related stuff it's a three instruction fraction function without slow float to int and int to float conversion! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?
Well, I haven't tested this, but since it's doing nothing but floating point math, I wouldn't expect any issues, except that, I believe, inf - inf = nan.
 
Chalnoth said:
Here's a couple that I've found to be useful:
Take the following code:
Code:
for (i = 0; i < maxvalue; i++)
  p += arr[i];
We can make use of compiler optimizations to unroll an inner loop by doing the following:
Code:
const int stride = 16;
index = 0;
for (i = 0; i < maxvalue/stride; i++)
  for (j = 0; j < stride; j++)
    p += arr[index++];

for (i = 0; i < maxvalue % stride; i++)
  p+= arr[index++];
I'd be concerned about the number of increments of the index - you're introducing a lot of data dependencies. Why not try something like...

Code:
pBase = array;
for (i = 0; i < maxvalue/stride; i++)
{
  for (j = 0; j < stride; j++)
  {
    p += pBase[j];
  }

  pBase += stride;
}
...etc..
 
Nick said:
I'd be more than interested in fast and reasonably accurate approximations of exp, log and pow...
exp (base 2; for other bases, premultiply the input by log2(base); a code sketch follows after this list):
  • Convert to some fixed-point format, resulting in an integer part 'i' and a fractional part 'f' (with 'f' being in the range [0,1))
  • Use a LUT or a Taylor series or other curve-fitting to compute exp(f). The result is a number in the range [1,2].
  • Compute 2^i; this is probably best done with some sort of bit-fiddling.
  • Multiply together the two results.
log (base 2; for other bases, post-multiply the result by 1/log2(base)):
  • Pry the input number (which we call 'x') into two parts 'm' and 'e' so that 'm' is in the range 1 to 2 and the relation x = m * 2^e holds. If the input number is given as a floating-point number, you should be able to compute both 'e' and 'm' from the various bitfields of the floating-point number through just plain bit-fiddling.
  • Use a LUT or Taylor series or other curve-fitting to compute log(m) - which will end up in the range [0,1)
  • The result you want is now given by e+log(m).
pow(a,b):
  • If b is a positive integer, then you can use the Exponentiation by squaring algorithm.
  • if b is a negative integer, then you can use the relation pow(a,b) = pow(1/a, -b) and then apply exponentiation-by-squaring.
  • Otherwise you will have to use the relation pow(a,b) = exp(b*log(a))
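
A rough sketch of the base-2 exp recipe above, assuming IEEE754 FP32 and an input small enough that 2^i stays a normal number (the constants and names are mine, purely illustrative):
Code:
#include <math.h>
#include <stdint.h>

float fast_exp2(float x)
    {
    float fi = floorf(x);
    int   i  = (int)fi;                  /* integer part */
    float f  = x - fi;                   /* fractional part in [0,1) */
    /* quadratic fit of 2^f: matches the value and slope at f=0 and the
       value at f=1; the error in between is under about 1% */
    float m = 1.0f + f * (0.6931472f + f * 0.3068528f);
    /* build 2^i by writing (i + bias) straight into the FP32 exponent field */
    union { float f; uint32_t u; } pow2i;
    pow2i.u = (uint32_t)(i + 127) << 23;
    return m * pow2i.f;
    }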

That's awesome! Without the pi related stuff it's a three instruction fraction function without slow float to int and int to float conversion! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?
The MAGIC-based fractional-value method only works correctly for numbers whose absolute value is less than MAGIC/3 (which is presumably not a big problem for sine range reduction, but something you may wish to keep in mind otherwise.)

The method relies critically on p1 being truncated to FP32, with no additional precision kept; this is unachievable in x87 without a memory write/read - the 'volatile' keyword is indeed there to force such a memory write/read cycle - or setting the x87 mode register (very slow), but should actually be pretty easy in something like SSE.

If you feed it INF as an input, it will throw a NaN back at you (which IMO is a perfectly mathematically sensible result, as you're effectively asking for the limit-at-infinity of a periodic function!); I would expect denormals to survive through it, though.
 
psurge said:
Arjan - I have to object to your use of the term 'nasty' :) IMO there is absolutely nothing wrong with that code.
One thing that's nasty about it is the use of volatile to force rounding of intermediate results to 32 bit floats. I wonder whether the compiler actually generates FLD/FSTP for those or does escape analysis and notices that p1 and p2 can't leave the local scope, so just sets the FPU to 32 bit mode.

psurge said:
Explanation: The magic constant above (call it M) must obey the following :
each of the 64 sequences of 7 consecutive bits in the 70 bit number (M << 6)
must be unique and non-zero.
Unique I can understand, but why non-zero? And why 7 consecutive bits instead of 6?
For every integer n you can build a sequence of 2^n + n - 1 bits that contains all 2^n possible n-bit patterns.
 
Xmas said:
One thing that's nasty about it is the use of volatile to force rounding of intermediate results to 32 bit floats. I wonder whether the compiler actually generates FLD/FSTP for those or does escape analysis and notices that p1 and p2 can't leave the local scope, so just sets the FPU to 32 bit mode.
Setting the x87 FPU's mode register is IIRC very slow - this is the main reason why C's ordinary float-to-int conversion sucks balls performance-wise.

It IS possible to avoid the FSTP/FLD sequence, however to do that in C in a way that doesn't depend critically on platform/compiler/optimization-dependent issues, you need additional code to test out various values of MAGIC for what works and what doesn't - also, you may want to go to some lengths to prevent the function from getting inlined (even that can affect actual precision being used if you stick with x87.)
 
Chalnoth said:
Making use of the above recursive algorithm for integration sped up my integral by a factor of 100 over my previous "divide the integral into N steps and sum" setup, while maintaining better accuracy.
I remembered this from Numerical Algorithms ... it's a simplified form of the Romberg Algorithm.

Not really a code optimization ;)

Just to be on topic though ... you can use int->float conversion to do a quick and dirty Exp function.
 
MfA said:
I remembered this from Numerical Algorithms ... it's a simplified form of the Romberg Algorithm.

Not really a code optimization ;)

Just to be on topic though ... you can use int->float conversion to do a quick and dirty Exp function.
Something similar to this, I presume?
Code:
float scale_float( float f, int scale )
   {
   union {
      float f;
      int i;
      } intfloat;
   intfloat.f = f;
   intfloat.i += scale << 23;
   return intfloat.f;
   }
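
As a quick illustration (my example): scale_float(1.0f, 10) returns 1024.0f, since adding 10 to the exponent field multiplies the value by 2^10; it's essentially a cheap ldexpf() that is only valid while the result stays a normal number.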
 
Xmas -
Non-zero because when you input the value v=0, the result of the multiplication with M is zero. The sub-sequences have to be non-zero since you don't want any of your representable powers of 2 resulting in seven zero MSBs when multiplied by M. Since you need 64 unique non-zero sequences, you need 7 bits instead of 6.

Serge
 
arjan de lumens said:
Setting the x87 FPU's mode register is IIRC very slow - this is the main reason why C's ordinary float-to-int conversion sucks balls performance-wise.

It IS possible to avoid the FSTP/FLD sequence, however to do that in C in a way that doesn't depend critically on platform/compiler/optimization-dependent issues, you need additional code to test out various values of MAGIC for what works and what doesn't - also, you may want to go to some lengths to prevent the function from getting inlined (even that can affect actual precision being used if you stick with x87.)

arjan, Xmas:
I totally forgot about this x87 stuff. I would like to retract my earlier hasty comment. A serious question though: I had thought that SSE (1,2,3?) included scalar fp instructions (which encode the operand precision in the instruction) - are those not there yet from a compiler/HW POV?
 
psurge said:
Xmas -
Non-zero because when you input the value v=0, the result of the multiplication with M is zero. The sub-sequences have to be non-zero since you don't want any of your representable powers of 2 resulting in seven zero MSBs when multiplied by M. Since you need 64 unique non-zero sequences, you need 7 bits instead of 6.

Serge
Well, I now understand why the 64 bits of the pseudo-70 (or 69)-bit number must contain no all-zero sequence, because the implicit trailing bits are zero. But there is still no problem at all with ((v & -v) * M) == 0, as long as that result is unique as well.
And because of this, 6-bit sequences are sufficient.
 