Post your code optimizations

Nick said:
Continuing the topic, here's an optimization I posted at DevMaster.net a few weeks ago: Fast and accurate sine/cosine.
My somewhat cheesy default-answer to fast-and-not-uber-precise transcendental functions is: Lookup Table. If the table gets too big or you need more precision, interpolate between its elements. With quadratic or cubic interpolation, the table can get surprisingly small for a given precision requirement, at least for the more common self-similar functions (sin,cos,exp,log,rcp,sqrt).
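
A minimal sketch of that idea for sine, assuming a 256-entry table, linear interpolation, and an angle already reduced to a fraction of a full turn in [0,1) (the names and sizes here are purely illustrative):
Code:
#include <math.h>

#define TBL_SIZE 256   /* illustrative table size */
static float sin_tbl[TBL_SIZE + 1];   /* one extra entry so interpolation never reads past the end */

void init_sin_tbl(void)
    {
    for (int i = 0; i <= TBL_SIZE; i++)
        sin_tbl[i] = (float)sin(2.0 * M_PI * i / TBL_SIZE);
    }

float lut_sin(float turns)   /* 'turns' assumed already reduced to [0,1) */
    {
    float t = turns * TBL_SIZE;
    int   i = (int)t;          /* table index */
    float f = t - (float)i;    /* position between the two neighbouring entries */
    return sin_tbl[i] + f * (sin_tbl[i + 1] - sin_tbl[i]);
    }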

Your posting also does not address the problem of reducing the input number to the [-pi,+pi] range; here is a quite fast, accurate, but amazingly nasty way it can be done, assuming full IEEE754 compliance (and yes, on my gcc installation the code does indeed break if you remove the 'volatile' keyword!):
Code:
#include <math.h>   /* for M_PI */

#define MAGIC 12582912.0f   /* 1.5 * 2^23 */
float reduce_range( float p )
    {
    float p0 = p * (0.5/M_PI);   /* radians -> fraction of a full turn */
    volatile float p1, p2;
    p1 = p0 + MAGIC;             /* rounds p0 to the nearest integer... */
    p2 = p1 - MAGIC;             /* ...and recovers that integer here */
    p0 -= p2;                    /* fractional part, now in (-0.5,+0.5) */
    return p0 * (2.0*M_PI);      /* back to radians, in (-pi,+pi) */
    }
 
Nice, that's pretty slick :)

It looks like what the code does is that the MAGIC number is represented by a number with a mantissa 100000... (it's 1.5*2^23). Now, add this to a number smaller than MAGIC, and all bits that represent the fractional part of the original number get cut off. Subtract MAGIC from this new number, and you're left with the integer part. Play with that as you will :)

P.S.
Wouldn't using just 2^23 as MAGIC also work? why use 1.5*2^23?

Edit:
Hmm, it looks like the above code actually doesn't quite do what I thought. It doesn't give the integer part, but rather rounds the number to an integer, meaning the "fractional part" has a range of (-0.5, 0.5)
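
For example (numbers picked purely for illustration): with p0 = 3.3, p1 = 12582915.3 rounds to 12582915.0 (floats in [2^23, 2^24) are spaced exactly 1.0 apart), so p2 = 3.0 and p0 - p2 comes out as roughly 0.3.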
 
Chalnoth said:
Nice, that's pretty slick :)

It looks like what the code does is that the MAGIC number is represented by a number with a mantissa 100000... (it's 1.5*2^23). Now, add this to a number smaller than MAGIC, and all bits that represent the fractional part of the original number get cut off. Subtract MAGIC from this new number, and you're left with the integer part. Play with that as you will :)
That's basically the heart of the entire method. I leave it as an exercise to the reader to derive appropriate values of MAGIC for the 'double' and 'long double' floating-point types.
P.S.
Wouldn't using just 2^23 as MAGIC also work? why use 1.5*2^23?
If I use 2^23 directly as MAGIC and then feed in a negative number, then the result of the first addition becomes smaller than 2^23, and in that case there will be one fractional bit left in p1's mantissa that is not chopped off correctly.
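
For example (numbers picked purely for illustration): with MAGIC = 2^23 = 8388608 and p0 = -3.3, the sum p1 is about 8388604.7, which lies below 2^23 where the float spacing is only 0.5, so one fractional bit survives the rounding. With MAGIC = 1.5*2^23 the sum stays inside [2^23, 2^24), where the spacing is exactly 1.0, for all |p0| up to 2^22.
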
Edit:
Hmm, it looks like the above code actually doesn't quite do what I thought. It doesn't give the integer part, but rather rounds the number to an integer.
That's IEEE754 'nearest rounding' that steps into action.
 
Arjan - I have to object to your use of the term 'nasty' :) IMO there is absolutely nothing wrong with that code.

Somewhat OT, but range reduction demonstrates why storing angles in radians sucks. If it weren't for the M_PI constant, your code would also be perfectly exact.

Why not implement sinpi(x) = sin( pi * x )? Then you could store angles x in fixed point and get indexes into lookup tables with bit ops alone...
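
A rough sketch of that suggestion, assuming angles stored as a 32-bit fixed-point fraction of a full circle and a 256-entry table (the names and sizes are purely illustrative):
Code:
#include <stdint.h>

extern float sin_tbl[256];   /* sin(2*pi*i/256), assumed built elsewhere */

/* 'angle' is a 32-bit fixed-point fraction of a full circle:
   0x00000000 = 0 turns, 0x80000000 = half a turn (illustrative format) */
float fixed_sin(uint32_t angle)
    {
    /* range reduction is free: the fixed-point angle wraps around on its own,
       and the table index is just the top 8 bits */
    return sin_tbl[angle >> 24];
    }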

Serge
 
Here's a trick (I believe from David Seal) that I thought was really beautiful when I first saw it. It computes the number of trailing zero bits in a 64bit integer:

Code:
    // magicTable is a precomputed 128-entry lookup table that maps each
    // possible 7-bit index back to the corresponding trailing-zero count
    inline int trailingZeros(unsigned long long v)
    {
        return magicTable[ ((v & -v) * 0x0214639644ed74dfULL) >> 57 ];
    }

Explanation: The magic constant above (call it M) must obey the following:
each of the 64 sequences of 7 consecutive bits in the 70-bit number (M << 6)
must be unique and non-zero.

Now, (v & -v) is a power of 2 (equal to the least significant set bit in v) -
multiplying by (v & -v) therefore corresponds to a left shift.

Since there is a one to one mapping between the number of trailing zeros in v,
(v & -v), and the 7 bit number produced by ((v & -v) * M) >> 57, that 7 bit
number can be used as an index into a lookup table to find the number of
trailing zeroes in v.

All without a single branch :)
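
As a quick usage illustration (values picked here for the example):
Code:
unsigned long long v = 0xA0ULL;   /* binary 10100000 -> five trailing zeros */
int n = trailingZeros(v);         /* n == 5, computed without a single branch */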
 
An optimization that can really pay off if you're building any form of data structure where you'd call malloc()/new a lot for nodes and such, like BSPs, Kd-trees, hash tables etc., is to allocate a large chunk of sequential memory (you can often compute the maximum memory that would ever be needed) and implement your own little memory allocator that just steps forward in that memory chunk. Like for instance:

Code:
static unsigned char *currMem;   /* points into the preallocated chunk */

unsigned char *getMem(const int size){
    unsigned char *mem = currMem;
    currMem += size;
    return mem;
}

In my model class, where I use Kd-trees and hash tables to remove duplicated vertices, using this method doubled the performance.
 
It works best, though, if you have one private area for each fixed-size struct, since you can then implement a free() function without worrying about fragmentation, and free()/get() can still run in constant time. Otherwise, what you've got is a stack analog in the heap, where memory can only be allocated/freed efficiently at the tail end.
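
A minimal sketch of such a fixed-size pool with a free list (all names here are illustrative):
Code:
#include <stddef.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;

static unsigned char *poolMem;     /* steps forward through one private chunk */
static FreeNode *freeList = NULL;  /* freed blocks, chained through their own memory */
static size_t blockSize;           /* the fixed struct size, assumed >= sizeof(FreeNode*) */

void *poolGet(void)
    {
    if (freeList)                  /* reuse a freed block in constant time */
        {
        FreeNode *n = freeList;
        freeList = n->next;
        return n;
        }
    void *mem = poolMem;           /* otherwise just step forward in the chunk */
    poolMem += blockSize;
    return mem;
    }

void poolFree(void *p)
    {
    FreeNode *n = (FreeNode *)p;   /* the freed block becomes the new list head */
    n->next = freeList;
    freeList = n;
    }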
 
psurge said:
Arjan - I have to object to your use of the term 'nasty' :) IMO there is absolutely nothing wrong with that code.
The main reason I called it 'nasty' is that one normally thinks of floating-point numbers as a computer representation of real-numbers, which is a useful abstraction/illusion most of the time (with the most common exception being exact-equality tests). The trick I did violates that illusion - instead of carefully trying to ignore or minimize the difference between true real-numbers and floating-point numbers, it intentionally makes use of the difference to achieve a certain effect. It is indeed a valid and correct way of writing code, but it's one that probably should be accompanied by about one screenful of comments to explain just what the hell it does.
 
arjan de lumens said:
My somewhat cheesy default-answer to fast-and-not-uber-precise transcendental functions is: Lookup Table. If the table gets too big or you need more precision, interpolate between its elements. With quadratic or cubic interpolation, the table can get surprisingly small for a given precision requirement, at least for the more common self-similar functions (sin,cos,exp,log,rcp,sqrt).
I'd be more than interested in fast and reasonably accurate approximations of exp, log and pow...
Code:
#define MAGIC 12582912.0f
float reduce_range( float p )
    {
    float p0 = p * (0.5/M_PI);
    volatile float p1, p2;
    p1 = p0 + MAGIC;
    p2 = p1 - MAGIC;
    p0 -= p2;
    return p0 * (2.0*M_PI);
    }
That's awesome! Without the pi related stuff it's a three instruction fraction function without slow float to int and int to float conversion! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?
 
Nick said:
That's awesome! Without the pi related stuff it's a three instruction fraction function without slow float to int and int to float conversion! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?
Well, I haven't tested this, but since it's doing nothing but floating point math, I wouldn't expect any issues, except that, I believe, inf - inf = nan.
 
Chalnoth said:
Here's a couple that I've found to be useful:
Take the following code:
Code:
for (i = 0; i < maxvalue; i++)
  p += arr[i];
We can make use of compiler optimizations to unroll an inner loop by doing the following:
Code:
const int stride = 16;
index = 0;
for (i = 0; i < maxvalue/stride; i++)
  for (j = 0; j < stride; j++)
    p += arr[index++];

for (i = 0; i < maxvalue % stride; i++)
  p+= arr[index++];
I'd be concerned about the number of increments of the index - you're introducing a lot of data dependencies. Why not try something like...

Code:
pBase = array;
for (i = 0; i < maxvalue/stride; i++)
{
  for (j = 0; j < stride; j++)
  {
    p += pBase[j];
  }

  pBase += stride;
}
...etc..
 
Nick said:
I'd be more than interested in fast and reasonably accurate approximations of exp, log and pow...
exp (base 2; for other bases, premultiply the input by log2(base); a code sketch follows after this list):
  • Convert to some fixed-point format, resulting in an integer part 'i' and a fractional part 'f' (with 'f' being in the range [0,1))
  • Use a LUT or a Taylor series or other curve-fitting to compute exp(f). The result is a number in the range [1,2].
  • Compute 2^i; this is probably best done with some sort of bit-fiddling.
  • Multiply together the two results.
log (base 2; for other bases, post-multiply the result by 1/log2(base)):
  • Pry the input number (which we call 'x') into two parts 'm' and 'e' so that 'm' is in the range 1 to 2 and the relation x = m * 2^e holds. If the input number is given as a floating-point number, you should be able to compute both 'e' and 'm' from the various bitfields of the floating-point number through just plain bit-fiddling.
  • Use a LUT or Taylor series or other curve-fitting to compute log(m) - which will end up in the range [0,1)
  • The result you want is now given by e+log(m).
pow(a,b):
  • If b is a positive integer, then you can use the Exponentiation by squaring algorithm.
  • if b is a negative integer, then you can use the relation pow(a,b) = pow(1/a, -b) and then apply exponentiation-by-squaring.
  • Otherwise you will have to use the relation pow(a,b) = exp(b*log(a))
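
A rough sketch of the base-2 exp recipe above, assuming IEEE754 FP32 and an input small enough that 2^i stays a normal number (the constants and names are mine, purely illustrative):
Code:
#include <math.h>
#include <stdint.h>

float fast_exp2(float x)
    {
    float fi = floorf(x);
    int   i  = (int)fi;                  /* integer part */
    float f  = x - fi;                   /* fractional part in [0,1) */
    /* quadratic fit of 2^f: matches the value and slope at f=0 and the
       value at f=1; the error in between is under about 1% */
    float m = 1.0f + f * (0.6931472f + f * 0.3068528f);
    /* build 2^i by writing (i + bias) straight into the FP32 exponent field */
    union { float f; uint32_t u; } pow2i;
    pow2i.u = (uint32_t)(i + 127) << 23;
    return m * pow2i.f;
    }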

That's awesome! Without the pi related stuff it's a three instruction fraction function without slow float to int and int to float conversion! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?
The MAGIC-based fractional-value method only works correctly for numbers whose absolute value is less than MAGIC/3 (which is presumably not a big problem for sine range reduction, but something you may wish to keep in mind otherwise.)

The method relies critically on p1 being truncated to FP32, with no additional precision kept; this is unachievable in x87 without a memory write/read - the 'volatile' keyword is indeed there to force such a memory write/read cycle - or setting the x87 mode register (very slow), but should actually be pretty easy in something like SSE.

If you feed it INF as an input, it will throw a NaN back at you (which IMO is a perfectly mathematically sensible result, as you're effectively asking for the limit-at-infinity of a periodic function!); I would expect denormals to survive through it, though.
 
psurge said:
Arjan - I have to object to your use of the term 'nasty' :) IMO there is absolutely nothing wrong with that code.
One thing that's nasty about it is the use of volatile to force rounding of intermediate results to 32 bit floats. I wonder whether the compiler actually generates FLD/FSTP for those or does escape analysis and notices that p1 and p2 can't leave the local scope, so just sets the FPU to 32 bit mode.

psurge said:
Explanation: The magic constant above (call it M) must obey the following :
each of the 64 sequences of 7 consecutive bits in the 70 bit number (M << 6)
must be unique and non-zero.
Unique I can understand, but why non-zero? And why 7 consecutive bits instead of 6?
For every integer n you can build a sequence of 2^n + n - 1 bits that contains all 2^n possible n-bit patterns.
 
Xmas said:
One thing that's nasty about it is the use of volatile to force rounding of intermediate results to 32 bit floats. I wonder whether the compiler actually generates FLD/FSTP for those or does escape analysis and notices that p1 and p2 can't leave the local scope, so just sets the FPU to 32 bit mode.
Setting the x87 FPU's mode register is IIRC very slow - this is the main reason why C's ordinary float-to-int conversion sucks balls performance-wise.

It IS possible to avoid the FSTP/FLD sequence, however to do that in C in a way that doesn't depend critically on platform/compiler/optimization-dependent issues, you need additional code to test out various values of MAGIC for what works and what doesn't - also, you may want to go to some lengths to prevent the function from getting inlined (even that can affect actual precision being used if you stick with x87.)
 
Chalnoth said:
Making use of the above recursive algorithm for integration sped up my integral by a factor of 100 over my previous "divide the integral into N steps and sum" setup, while maintaining better accuracy.
I remembered this from Numerical Algorithms ... it's a simplified form of the Romberg Algorithm.

Not really a code optimization ;)

Just to be on topic though ... you can use int->float conversion to do a quick and dirty Exp function.
 
MfA said:
I remembered this from Numerical Algorithms ... it's a simplified form of the Romberg Algorithm.

Not really a code optimization ;)

Just to be on topic though ... you can use int->float conversion to do a quick and dirty Exp function.
Something similar to this, I presume?
Code:
float scale_float( float f, int scale )
   {
   union {
      float f;
      int i;
      } intfloat;
   intfloat.f = f;
   intfloat.i += scale << 23;
   return intfloat.f;
   }
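
As a quick illustration (my example): scale_float(1.0f, 10) returns 1024.0f, since adding 10 to the exponent field multiplies the value by 2^10; it's essentially a cheap ldexpf() that is only valid while the result stays a normal number.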
 
Xmas -
Non-zero because when you input the value v=0, the result of the multiplication with M is zero. The sub-sequences have to be non-zero since you don't want any of your representable powers of 2 resulting in seven zero MSBs when multiplied by M. Since you need 64 unique non-zero sequences, you need 7 bits instead of 6.

Serge
 
arjan de lumens said:
Setting the x87 FPU's mode register is IIRC very slow - this is the main reason why C's ordinary float-to-int conversion sucks balls performance-wise.

It IS possible to avoid the FSTP/FLD sequence, however to do that in C in a way that doesn't depend critically on platform/compiler/optimization-dependent issues, you need additional code to test out various values of MAGIC for what works and what doesn't - also, you may want to go to some lengths to prevent the function from getting inlined (even that can affect actual precision being used if you stick with x87.)

arjan, Xmas:
I totally forgot about this x87 stuff. I would like to retract my earlier hasty comment. A serious question though: I had thought that SSE (1,2,3?) included scalar fp instructions (which encode the operand precision in the instruction) - are those not there yet from a compiler/HW POV?
 
psurge said:
Xmas -
Non-zero because when you input the value v=0, the result of the multiplication with M is zero. The sub-sequences have to be non-zero since you don't want any of your representable powers of 2 resulting in seven zero MSBs when multiplied by M. Since you need 64 unique non-zero sequences, you need 7 bits instead of 6.

Serge
Well, I now understand why the 64 bits of the pseudo-70 (or 69)-bit number must contain no all-zero sequence, because the implicit trailing bits are zero. But there is still no problem at all with ((v & -v) * M) == 0, as long as that result is unique as well.
And because of this, 6-bit sequences are sufficient.
 