View Full Version : Sqrt() (Not actually a 3D question)
Øyvind Østlund (http://www.noteme.com)
I am making a 3D game with some firends and we are trying not to use Sqrt as often but some times we are have to use it. We sat down a night to try to optimze it in C++/ASM (not that accurate to make it faster) but had no luck. The best we could do was still slower then the one in VC++. Anyone else tryed this before...any luck?
arjan de lumens
14-Feb-2004, 21:51
2 ways to beat the standard sqrt(): Use a lookup table, with interpolation between elements. Just don't make the table too big, or else you get a cache miss with each sqrt, which is more expensive than calling sqrt(). use the SSE 'RSQRTPS' instruction (which gives you 4 reciprocal square roots in ~4-7 cycles; requires inline assembly if you are working in C/C/++). If you need the square root rather than its reciprocal, post-multiply with the original number; don't use 'SQRTPS' which, unlike RSQRTPS, is dog slow.
Here is some example code, that, using lookup tables, processes an array of floats, computing the reciprocal square root (~14bit precision) for each of them at 7 cycles per float (measured on an Athlon), entirely written in C:
#include <math.h>
unsigned int rsq_table[8192];
unsigned int exp_table[512];
void prepare_tables()
{
int i;
union { int i; float f; } intfloat;
for(i=0;i<4096;i++)
{
intfloat.f = 1.0/sqrt( (i+4096.0)/4096.0);
rsq_table[i+4096] = (intfloat.i & 0x7FFFFF) >> 7;
intfloat.f = 1.0/sqrt( (i+4096.0)/2048.0);
rsq_table[i] = (intfloat.i & 0x7FFFFF) >> 7;
}
for(i=0;i<256;i++)
exp_table[i] = 0x5E800000-(((i*0x800000)>>1) & 0x3F800000);
}
void array_rsqrt(const float *in, float *out, int count)
{
int i;
int P = out - in;
const float *limit = in + count;
do {
int *V1 = (int *)in;
unsigned int *V2 = (unsigned int *)in;
int exp = exp_table[ (*V2 - 0x800800) >> 23 ];
int mant = rsq_table[ (*V1 >> 11) & 0x1FFF ] << 7;
V1[P] = mant | exp;
in++;
}
while(in < limit);
}
To understand what the hell happens in this code, you probably need a *very* good understanding of how floating-point numbers work .....
Thanks for the last warning....I totally missed on that question in my computer sciece class at Enginnering School before X-Mas, but now when I have my book beside me I think I can manage it. Thanks for the very good answer. And the code to.
I will post again if I have furuther questions...
ØØ
darkblu
15-Feb-2004, 09:38
2 ways to beat the standard sqrt(): Use a lookup table, with interpolation between elements. Just don't make the table too big, or else you get a cache miss with each sqrt, which is more expensive than calling sqrt().
with a tabe size of 1kB (256 float entries) and a bit more cpu cycles (employing newton-raphson instead of linear interpolation), one can obtain rsqrt of full float precision in about 6 multiplies.
With only 256 entries?? Hmmm...how would you do that? Do you have any examples. Can't say that I understand how Newton Rapshons could be used with a look up table at all.
Sorry was a bit drunk when I replied my first time yesterday, so I havn't had the time to test the first code before now. But am I doing something wrong? I am getting worse test results then for the Sqrt() function. Here is my test code. Using count as 1.
prepare_tables();
float inn = 0;
float* pinn = &inn;
float ut = 0;
float* put = &ut;
Time1 = GetTickCount();
for(inn = 0; inn < 999999; inn += 0.1){
array_rsqrt(pinn, put, 1);
}
Time2 = GetTickCount();
cout << "array_rsqrt test time: " << (Time2 - Time1) << endl;
Sorry for my Norwegian variable names. And please don't laugh if I am doing something really wrong. Havn't done C++ that long...:)
I have the function as inline, shouldn't it be that? And BTW in your function you are not initializing i. Is that safe?
and here are the test results for the Square functions that I have tried to make and have found. One of them are from the Quake 3 engine, and even that one is slower. I must be doing something really wrong.
Sqrtf test time: 610
Sqrtl test time: 609
Sqr2 test time: 4047
Sqr3 test time: 1047
Sqr4 test time: 2125
Sqr6 test time: 563
SqrQ test time: 640
Sqr7 test time: 594
Sqr8 test time: 3843
array_rsqrt test time: 703
Sqrt test time: 344
your test time is the array_rsqrt test
Here's a piece of code that approximates 1 / sqrt(x) pretty well. To get the sqrt(x) you simply have to remember that x * (1 / sqrt(x)) = sqrt(x).
inline float rsqrtf(float v){
float v_half = v * 0.5f;
long i = *(long *) &v;
i = 0x5f3759df - (i >> 1);
v = *(float *) &i;
return v * (1.5f - v_half * v * v);
}
Here's a piece of code that approximates 1 / sqrt(x) pretty well. To get the sqrt(x) you simply have to remember that x * (1 / sqrt(x)) = sqrt(x).
inline float rsqrtf(float v){
float v_half = v * 0.5f;
long i = *(long *) &v;
i = 0x5f3759df - (i >> 1);
v = *(float *) &i;
return v * (1.5f - v_half * v * v);
}
Thanks, but that code snipet is from the Quace 3 engine...all ready tested that one. I loved that trick
i = 0x5f3759df - (i >> 1);
but I can't get this one to go faster either. And the really confusing thing is that all the functions are going even slower then the Sqrt() if I am using optiomization for speed in the VC++6 compiler...:(
arjan de lumens
15-Feb-2004, 17:50
Sorry was a bit drunk when I replied my first time yesterday, so I havn't had the time to test the first code before now. But am I doing something wrong? I am getting worse test results then for the Sqrt() function. Here is my test code. Using count as 1.
prepare_tables();
float inn = 0;
float* pinn = &inn;
float ut = 0;
float* put = &ut;
Time1 = GetTickCount();
for(inn = 0; inn < 999999; inn += 0.1){
array_rsqrt(pinn, put, 1);
}
Time2 = GetTickCount();
cout << "array_rsqrt test time: " << (Time2 - Time1) << endl;
Sorry for my Norwegian variable names. And please don't laugh if I am doing something really wrong. Havn't done C++ that long...:)
[Edit] I have the function as inline, shouldn't it be that? And BTW in your function you are not initializing i. Is that safe?
The 'i' variable is unused in the entire function, so it shouldn't be anything to worry about. And I would recommend batching up multiple numbers to do square root on if you are going to try my routine or RSQRTPS; calling it with a length of 1 will force the float to memory for each pass, causing an STLF (store-to-load forwarding) stall, and incur a bunch of overhead with loop and pointer setup, adding a bunch of cycles. In particular, I suspect that the STLF stall on certain recent processors (are you using a Pentium4?) is THE reason why neither my code nor the Quake3 code can beat the sqrt() function when doing just 1 number.
I am using a AMD MP prosessor 1900+. So maybe you are right. I don't know much about STLF. So I am just nodding and listening..:D...but it doesn't mather that much. I can live with the Sqrt() function at the moment. It is probably not going to be critical anyway. It was just that we started trying to beet it in ASM but couldn't do it, so it ruined all our sleep...:D...
but I can't get this one to go faster either.
Really? In my testing this has been a good deal faster, like twice the speed of sqrt().
What compiler are you using? I am using VC++6. And have used "optimize for max speed" on it. And it is about the same speed.
Simon F
16-Feb-2004, 08:51
Here's a piece of code that approximates 1 / sqrt(x) pretty well. To get the sqrt(x) you simply have to remember that x * (1 / sqrt(x)) = sqrt(x).
Just to add to Humus's post: if you're doing 3D graphics you probably don't even want sqrt but 1/sqrt, and so these sort of routines are a big winner.
I used a variation of the 1/sqrt code from "Graphics Gems" (the code's available on the net) in the original SGL library's lighting functions because it was much faster than the Pentium's native sqrt+divide!!
I was alsow thinking about that. Thats why I didn't multiply it with X at all when I tested it. But thanks for commenting on that. I have ordered that book BTW...:D But it hasn't got here yet. Takes so much time to get here to Notway...:(
What compiler are you using? I am using VC++6. And have used "optimize for max speed" on it. And it is about the same speed.
Using MSVC 6.0 too. It was quite some time since I made the test, but I sure did get a good deal better performance with it over sqrt(), especially over 1 / sqrtf(). It has been my experience all the time that it is faster, even recently in my cloth demo I saw a few fps increase when I used it. But I suppose it could differ between AMD and Intel CPUs.
Simon F
16-Feb-2004, 13:01
I was alsow thinking about that. Thats why I didn't multiply it with X at all when I tested it. But thanks for commenting on that. I have ordered that book BTW...:D But it hasn't got here yet. Takes so much time to get here to Notway...:(
Did you mean you've ordered "Graphic Gems"? Do you know there's actually 5 of them! There are several square root/ Inverse Square root algorithms in volumes 1, 3, and 5, but the one I was refering to is in volume 5.
Was writing before I was reading there...I have ordered gameprogramming gems. 2 and 3, number 4 is not out yet. Hav't looked at graphics gems yet.
ØØ
arjan de lumens
18-Feb-2004, 15:35
One thing I just thought of: are you sure that the VC6 sqrt() function call doesn't get dead-code-eliminated in your tests? If you don't actually use the result of sqrt() for something, the compiler may very well detect this and just remove the sqrt() call altogether as part of the optimizations it does on your code, screwing up your benchmarks.
Ohhh....didn't think of that....maybe I should have a look at it. And try to use a smaller loop, but outputting the answer in every iteration....
Thanks
ElDonAntonio
10-Mar-2004, 07:14
This may be dumb, but are you compiling in release?
What are you doing that causes Sqrt to be a bottleneck? Are you sure it is?
I can't think of any kind of game that would need to do so many sqrts...
This may be dumb, but are you compiling in release?
I did it both ways to test....
What are you doing that causes Sqrt to be a bottleneck? Are you sure it is?
I can't think of any kind of game that would need to do so many sqrts...
You "non all post reader"..:D..just kidding with you...but I will quote my self for you...)
...but it doesn't mather that much. I can live with the Sqrt() function at the moment. It is probably not going to be critical anyway. It was just that we started trying to beet it in ASM but couldn't do it, so it ruined all our sleep......
inline float rsqrtf(float v){
float v_half = v * 0.5f;
long i = *(long *) &v;
i = 0x5f3759df - (i >> 1);
v = *(float *) &i;
return v * (1.5f - v_half * v * v);
}
The 'correct' number to use in the fourth line is 5F400000h. This will result in rsqrtf(1) == 1. There's lots of interesting information about this approximation (and others) in this thread at flipCode: arcus cosinus (http://www.flipcode.com/cgi-bin/msg.cgi?showThread=00002548&forum=3dtheory&id==-1#msgmark_63). If you need a better approximation, one Newton-Rhapson iteration makes is quite precise.
To get the sqrt(x) you simply have to remember that x * (1 / sqrt(x)) = sqrt(x).
Unless x = 0... A little while ago my application's performance was suddenly halved. After a lot of searching, I found out that 0 * Inf = NaN. And doing further computations with NaN is extremely slow. Here's a fool-proof SSE implementation:
inline float fsqrt(float x)
{
__asm
{
rsqrtss xmm0, x
rcpss xmm0, xmm0
movss x, xmm0
}
return x;
}
Although rcpss is an approximation instruction as well, I found out that it doesn't noticably reduce total precision of the function.
An alternative fool-proof implementation would be something like:
inline float fsqrt(float x)
{
static float zero[4] = { 0, 0, 0, 0 };
__asm
{
movaps xmm0, x
movaps xmm1, zero
cmpsseq xmm1, xmm0
addss xmm0, xmm1
rsqrtss xmm0, xmm0
mulss xmm0, x
movss x, xmm0
}
return x;
}
(Could be mistakes here, I'm not on a PC with Intel docs and/or SSE/compiler to test it, so just from the top of my head).
Anyway, the basic idea is to add 1 (result from the compare operation) to x if x == 0, so that you will not get 1/0, but 1/1.
The mul afterwards will make sure that the result is still 0 in this case (0*1/1 == 0).
It should be slightly more accurate than the 1/(1/sqrt), and speedwise it should be okay, since the rcp operation is relatively slow compared to cmp, add and mul. On some CPUs it may actually be faster, I don't know.
How about:
inline float fsqrt(float x)
{
static int big = 7F7FFFFFh;
__asm
{
rsqrtss xmm0, x
minss xmm0, [big]
mulss xmm0, x
movss x, xmm0
}
return x;
}
if every bit of precision is welcome...
Ah, that's even better. I overlooked that instruction.
This is definitely preferred over the rcp then. Probably more precise AND faster.
vBulletin® v3.8.6, Copyright ©2000-2013, Jelsoft Enterprises Ltd.