Continuing the topic, here's an optimization I posted at DevMaster.net a few weeks ago: Fast and accurate sine/cosine.
Nick said:
Continuing the topic, here's an optimization I posted at DevMaster.net a few weeks ago: Fast and accurate sine/cosine.

My somewhat cheesy default-answer to fast-and-not-uber-precise transcendental functions is: Lookup Table. If the table gets too big or you need more precision, interpolate between its elements. With quadratic or cubic interpolation, the table can get surprisingly small for a given precision requirement, at least for the more common self-similar functions (sin, cos, exp, log, rcp, sqrt).
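To make the lookup-table idea concrete, here is a minimal sketch (mine, not from the thread) of a linearly interpolated sine table; the 256-entry size is an arbitrary illustration, and the quadratic or cubic interpolation mentioned above would shrink the table further:

Code:
#include <math.h>

#define TABLE_SIZE 256
static float sin_table[TABLE_SIZE + 1];   /* one extra entry so idx+1 is always valid */

void init_sin_table( void )
{
    int i;
    for (i = 0; i <= TABLE_SIZE; i++)
        sin_table[i] = (float)sin( i * (2.0*M_PI / TABLE_SIZE) );
}

/* x must already be range-reduced to [0, 2*pi) */
float sin_lut( float x )
{
    float pos  = x * (float)(TABLE_SIZE / (2.0*M_PI));
    int   idx  = (int)pos;          /* which table cell we fall into */
    float frac = pos - idx;         /* position within the cell, [0,1) */
    return sin_table[idx] + frac * (sin_table[idx+1] - sin_table[idx]);
}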
#define MAGIC 12582912.0f   /* 1.5 * 2^23 */

float reduce_range( float p )
{
    float p0 = p * (0.5/M_PI);    /* scale so one full period maps to 1.0 */
    volatile float p1, p2;        /* volatile forces rounding to 32-bit float */
    p1 = p0 + MAGIC;              /* fractional bits drop off the mantissa... */
    p2 = p1 - MAGIC;              /* ...leaving p0 rounded to the nearest integer */
    p0 -= p2;                     /* signed fractional part, in [-0.5, 0.5] */
    return p0 * (2.0*M_PI);       /* scale back to radians */
}
That's basically the heart of the entire method. I leave it as an exercise to the reader to derive appropriate values of MAGIC for the 'double' and 'long double' floating-point types.

Chalnoth said:
Nice, that's pretty slick. It looks like what the code does is that the MAGIC number is represented by a number with a mantissa of 100000... (it's 1.5*2^23). Now, add this to a number smaller than MAGIC, and all bits that represent the fractional part of the original number get cut off. Subtract MAGIC from this new number, and you're left with the integer part. Play with that as you will.

P.S. Wouldn't using just 2^23 as MAGIC also work? Why use 1.5*2^23?

If I use 2^23 directly as MAGIC and then feed in a negative number, then the result of the first addition becomes smaller than 2^23, and in that case there will be one fractional bit left in p1's mantissa that is not chopped off correctly.

Chalnoth said:
Edit: Hmm, it looks like the above code actually doesn't quite do what I thought. It doesn't give the integer part, but rather rounds the number to an integer.

That's IEEE 754 'nearest rounding' stepping into action.
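To see the rounding (rather than truncation) behaviour just described, here's a small standalone test of the bare add/subtract step (my sketch, not from the thread):

Code:
#include <stdio.h>

#define MAGIC 12582912.0f   /* 1.5 * 2^23 */

float round_nearest( float x )   /* valid for |x| well below 2^22 */
{
    volatile float t = x + MAGIC;   /* fractional bits rounded away here */
    return t - MAGIC;
}

int main( void )
{
    printf( "%f\n", round_nearest( 2.7f ) );    /* 3.0: nearest, not truncation */
    printf( "%f\n", round_nearest( -1.3f ) );   /* -1.0 */
    printf( "%f\n", round_nearest( 2.5f ) );    /* 2.0: IEEE ties round to even */
    return 0;
}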
/* Returns the number of trailing zero bits of v (v != 0): v & -v isolates
   the lowest set bit, the multiply shifts the magic constant by that bit's
   index, and the top 7 bits of the product index a precomputed table. */
inline int trailingZeros(unsigned long long v)
{
    return magicTable[ ((v & -v) * 0x0214639644ed74dfULL) >> 57 ];
}
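The magicTable itself isn't reproduced in the thread; a minimal way to build it (my sketch, relying on the property psurge explains further down, that each power of two multiplied by the constant yields a unique non-zero 7-bit top field) would be:

Code:
static int magicTable[128];   /* indexed by the top 7 bits of the product */

void buildMagicTable( void )
{
    int k;
    for (k = 0; k < 64; k++)
        magicTable[ ((1ULL << k) * 0x0214639644ed74dfULL) >> 57 ] = k;
}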
/* Trivial bump allocator: hand out memory by advancing a pointer into a
   preallocated pool. There is no per-allocation free(). */
static unsigned char *currMem;

unsigned char *getMem(const int size)
{
    unsigned char *mem = currMem;
    currMem += size;
    return mem;
}
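For context, a usage sketch (mine, not from the thread): since getMem only bumps a pointer, 'freeing' is just resetting the pointer to the start of the pool. Note that as written it does no alignment; real uses would typically round size up to a multiple of, say, 16 bytes:

Code:
static unsigned char pool[1 << 20];   /* backing arena; size is illustrative */

void resetMem( void )
{
    currMem = pool;   /* frees everything allocated so far in one go */
}

void example( void )
{
    resetMem();
    float *verts = (float *)getMem( 256 * sizeof(float) );
    verts[0] = 1.0f;
}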
psurge said:
Arjan - I have to object to your use of the term 'nasty'. IMO there is absolutely nothing wrong with that code.

The main reason I called it 'nasty' is that one normally thinks of floating-point numbers as a computer representation of real numbers, which is a useful abstraction/illusion most of the time (the most common exception being exact-equality tests). The trick I did violates that illusion: instead of carefully trying to ignore or minimize the difference between true real numbers and floating-point numbers, it intentionally makes use of the difference to achieve a certain effect. It is indeed a valid and correct way of writing code, but it's one that should probably be accompanied by about one screenful of comments to explain just what the hell it does.
arjan de lumens said:
My somewhat cheesy default-answer to fast-and-not-uber-precise transcendental functions is: Lookup Table. If the table gets too big or you need more precision, interpolate between its elements. With quadratic or cubic interpolation, the table can get surprisingly small for a given precision requirement, at least for the more common self-similar functions (sin, cos, exp, log, rcp, sqrt).

I'd be more than interested in fast and reasonably accurate approximations of exp, log and pow...
Code:
#define MAGIC 12582912.0f

float reduce_range( float p )
{
    float p0 = p * (0.5/M_PI);
    volatile float p1, p2;
    p1 = p0 + MAGIC;
    p2 = p1 - MAGIC;
    p0 -= p2;
    return p0 * (2.0*M_PI);
}

That's awesome! Without the pi-related stuff it's a three-instruction fraction function, with no slow float-to-int and int-to-float conversions! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?
Nick said:
That's awesome! Without the pi-related stuff it's a three-instruction fraction function, with no slow float-to-int and int-to-float conversions! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?

Well, I haven't tested this, but since it's doing nothing but floating-point math, I wouldn't expect any issues, except that, I believe, inf - inf = nan.
Chalnoth said:
Here's a couple that I've found to be useful:

Take the following code:

Code:
for (i = 0; i < maxvalue; i++)
    p += arr[i];

We can make use of compiler optimizations to unroll an inner loop by doing the following:

Code:
const int stride = 16;
index = 0;
for (i = 0; i < maxvalue/stride; i++)
    for (j = 0; j < stride; j++)
        p += arr[index++];
for (i = 0; i < maxvalue % stride; i++)
    p += arr[index++];

I'd be concerned about the number of increments of the index - you're introducing a lot of data dependencies. Why not try something like...

Code:
pBase = arr;
for (i = 0; i < maxvalue/stride; i++)
{
    for (j = 0; j < stride; j++)
    {
        p += pBase[j];
    }
    pBase += stride;
}
...etc.
Nick said:
I'd be more than interested in fast and reasonably accurate approximations of exp, log and pow...

exp (base 2; for other bases, premultiply the input by log2(base)):
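The body of this exp post isn't preserved in this capture of the thread. Purely as an illustration of the kind of approach discussed here (a small table for the fractional part, plus direct manipulation of the float's exponent field for the integer part, as in scale_float further down), here is a base-2 sketch; the names, table size and linear interpolation are my assumptions:

Code:
#include <math.h>

#define EXP2_SIZE 64
static float exp2_table[EXP2_SIZE + 1];   /* 2^f for f = i/EXP2_SIZE */

void init_exp2_table( void )
{
    int i;
    for (i = 0; i <= EXP2_SIZE; i++)
        exp2_table[i] = (float)pow( 2.0, (double)i / EXP2_SIZE );
}

float fast_exp2( float x )   /* 2^x = 2^n * 2^f, n = floor(x), f in [0,1) */
{
    int   n    = (int)floorf( x );
    float f    = x - n;
    float pos  = f * EXP2_SIZE;
    int   idx  = (int)pos;
    float frac = pos - idx;
    float m    = exp2_table[idx] + frac * (exp2_table[idx+1] - exp2_table[idx]);
    union { float f; int i; } u;   /* multiply by 2^n via the exponent field */
    u.f = m;
    u.i += n << 23;
    return u.f;
}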
Nick said:
That's awesome! Without the pi-related stuff it's a three-instruction fraction function, with no slow float-to-int and int-to-float conversions! Does it have any drawbacks? I assume it behaves funky with denormals and infinity?

The MAGIC-based fractional-value method only works correctly for numbers whose absolute value is less than MAGIC/3 (which is presumably not a big problem for sine range reduction, but something you may wish to keep in mind otherwise).
psurge said:
Arjan - I have to object to your use of the term 'nasty'. IMO there is absolutely nothing wrong with that code.

One thing that's nasty about it is the use of volatile to force rounding of intermediate results to 32-bit floats. I wonder whether the compiler actually generates FLD/FSTP for those, or does escape analysis and notices that p1 and p2 can't leave the local scope, so just sets the FPU to 32-bit mode.
psurge said:
Explanation: The magic constant above (call it M) must obey the following: each of the 64 sequences of 7 consecutive bits in the 70-bit number (M << 6) must be unique and non-zero.

Unique I can understand, but why non-zero? And why 7 consecutive bits instead of 6?
Xmas said:
One thing that's nasty about it is the use of volatile to force rounding of intermediate results to 32-bit floats. I wonder whether the compiler actually generates FLD/FSTP for those, or does escape analysis and notices that p1 and p2 can't leave the local scope, so just sets the FPU to 32-bit mode.

Setting the x87 FPU's mode register is IIRC very slow - this is the main reason why C's ordinary float-to-int conversion sucks balls performance-wise.
Chalnoth said:
Making use of the above recursive algorithm for integration sped up my integral by a factor of 100, while maintaining better accuracy, than my previous "divide the integral into N steps and sum" setup.

I remembered this from Numerical Algorithms ... it's a simplified form of the Romberg Algorithm.
MfA said:
I remembered this from Numerical Algorithms ... it's a simplified form of the Romberg Algorithm.

Something similar to this, I presume?
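The integration code being discussed isn't included in this capture of the thread. To illustrate the general idea of such recursive schemes (keep splitting an interval until the refined estimate stops changing), here is a sketch of adaptive recursive Simpson integration, which is in the same family but not necessarily the exact algorithm referred to above; all names are mine:

Code:
#include <math.h>

/* Recurse until halving the interval changes the estimate by less than eps,
   or the depth limit is hit. For brevity this re-evaluates f at shared
   points; a tuned version would pass endpoint values down the recursion. */
static double adapt( double (*f)(double), double a, double b,
                     double whole, double eps, int depth )
{
    double m  = 0.5 * (a + b);
    double lo = (f(a) + 4.0*f(0.5*(a + m)) + f(m)) * (m - a) / 6.0;
    double hi = (f(m) + 4.0*f(0.5*(m + b)) + f(b)) * (b - m) / 6.0;
    if (depth <= 0 || fabs(lo + hi - whole) < eps)
        return lo + hi;
    return adapt(f, a, m, lo, 0.5*eps, depth - 1)
         + adapt(f, m, b, hi, 0.5*eps, depth - 1);
}

double integrate( double (*f)(double), double a, double b, double eps )
{
    double whole = (f(a) + 4.0*f(0.5*(a + b)) + f(b)) * (b - a) / 6.0;
    return adapt(f, a, b, whole, eps, 30);
}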
Not really a code optimization
Just to be on topic though ... you can use int->float conversion to do a quick and dirty Exp function.
float scale_float( float f, int scale )
{
    union {
        float f;
        int i;
    } intfloat;

    intfloat.f = f;
    intfloat.i += scale << 23;   /* add 'scale' to the exponent field: f * 2^scale */
    return intfloat.f;
}
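A quick usage note (my examples, not from the thread): since the exponent field of an IEEE 754 float starts at bit 23, adding scale there multiplies by 2^scale with a single integer add, so scale_float(1.0f, n) is already a quick and dirty integer exp2. It breaks down if the result overflows the exponent field or lands in the denormals, and a negative scale shifted left is strictly speaking undefined behaviour in C, though it works on common compilers:

Code:
void scale_float_examples( void )
{
    float a = scale_float( 3.0f, 4 );    /* 48.0f: 3 * 2^4 */
    float b = scale_float( 1.0f, 10 );   /* 1024.0f: a quick integer 2^10 */
    (void)a; (void)b;
}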
arjan de lumens said:
Setting the x87 FPU's mode register is IIRC very slow - this is the main reason why C's ordinary float-to-int conversion sucks balls performance-wise.
It IS possible to avoid the FSTP/FLD sequence. However, to do that in C in a way that doesn't depend critically on platform, compiler, or optimization settings, you need additional code to test out various values of MAGIC for what works and what doesn't. Also, you may want to go to some lengths to prevent the function from getting inlined (even that can affect the actual precision being used if you stick with x87).
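A sketch of the kind of self-test described here (the function name and test values are mine): it checks at run time whether a candidate MAGIC really rounds through float precision, rather than being absorbed by a wider x87 temporary:

Code:
#include <math.h>

int magic_works( float magic )
{
    static const float tests[] = { 2.7f, -1.3f, 0.3f, 1048576.25f };
    int i;
    for (i = 0; i < 4; i++)
    {
        volatile float t = tests[i] + magic;   /* should round to an integer here */
        float r = t - magic;
        if (r != rintf( tests[i] ))            /* compare against round-to-nearest */
            return 0;
    }
    return 1;
}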
psurge said:
Xmas - Non-zero because when you input the value v=0, the result of the multiplication with M is zero. The sub-sequences have to be non-zero since you don't want any of your representable powers of 2 resulting in 7 zero MSBs when multiplied by M. Since you need 64 unique non-zero sequences, you need 7 bits instead of 6.

Serge

Well, I now understand why the 64 bits of the pseudo-70 (or 69)-bit number need to have no sequence of zeros, because the implicit last bits are zero. But still there is no problem at all with ((v & -v) * M) == 0, as long as that result is unique as well.