PDA

View Full Version : Sqrt() (Not actually a 3D question)


NoteMe
14-Feb-2004, 21:18
Øyvind Østlund (http://www.noteme.com)

I am making a 3D game with some firends and we are trying not to use Sqrt as often but some times we are have to use it. We sat down a night to try to optimze it in C++/ASM (not that accurate to make it faster) but had no luck. The best we could do was still slower then the one in VC++. Anyone else tryed this before...any luck?

arjan de lumens
14-Feb-2004, 21:51
2 ways to beat the standard sqrt(): Use a lookup table, with interpolation between elements. Just don't make the table too big, or else you get a cache miss with each sqrt, which is more expensive than calling sqrt(). use the SSE 'RSQRTPS' instruction (which gives you 4 reciprocal square roots in ~4-7 cycles; requires inline assembly if you are working in C/C/++). If you need the square root rather than its reciprocal, post-multiply with the original number; don't use 'SQRTPS' which, unlike RSQRTPS, is dog slow.
Here is some example code, that, using lookup tables, processes an array of floats, computing the reciprocal square root (~14bit precision) for each of them at 7 cycles per float (measured on an Athlon), entirely written in C:

#include <math.h>

unsigned int rsq_table[8192];
unsigned int exp_table[512];

void prepare_tables()
{
int i;
union { int i; float f; } intfloat;
for(i=0;i<4096;i++)
{
intfloat.f = 1.0/sqrt( (i+4096.0)/4096.0);
rsq_table[i+4096] = (intfloat.i & 0x7FFFFF) >> 7;
intfloat.f = 1.0/sqrt( (i+4096.0)/2048.0);
rsq_table[i] = (intfloat.i & 0x7FFFFF) >> 7;
}
for(i=0;i<256;i++)
exp_table[i] = 0x5E800000-(((i*0x800000)>>1) & 0x3F800000);
}

void array_rsqrt(const float *in, float *out, int count)
{
int i;
int P = out - in;
const float *limit = in + count;

do {
int *V1 = (int *)in;
unsigned int *V2 = (unsigned int *)in;
int exp = exp_table[ (*V2 - 0x800800) >> 23 ];
int mant = rsq_table[ (*V1 >> 11) & 0x1FFF ] << 7;
V1[P] = mant | exp;
in++;
}
while(in < limit);
}

To understand what the hell happens in this code, you probably need a *very* good understanding of how floating-point numbers work .....

NoteMe
14-Feb-2004, 21:57
Thanks for the last warning....I totally missed on that question in my computer sciece class at Enginnering School before X-Mas, but now when I have my book beside me I think I can manage it. Thanks for the very good answer. And the code to.

I will post again if I have furuther questions...


ØØ

darkblu
15-Feb-2004, 09:38
2 ways to beat the standard sqrt(): Use a lookup table, with interpolation between elements. Just don't make the table too big, or else you get a cache miss with each sqrt, which is more expensive than calling sqrt().

with a tabe size of 1kB (256 float entries) and a bit more cpu cycles (employing newton-raphson instead of linear interpolation), one can obtain rsqrt of full float precision in about 6 multiplies.

NoteMe
15-Feb-2004, 11:19
With only 256 entries?? Hmmm...how would you do that? Do you have any examples. Can't say that I understand how Newton Rapshons could be used with a look up table at all.

NoteMe
15-Feb-2004, 12:06
Sorry was a bit drunk when I replied my first time yesterday, so I havn't had the time to test the first code before now. But am I doing something wrong? I am getting worse test results then for the Sqrt() function. Here is my test code. Using count as 1.


prepare_tables();
float inn = 0;
float* pinn = &inn;
float ut = 0;
float* put = &ut;
Time1 = GetTickCount();
for(inn = 0; inn < 999999; inn += 0.1){
array_rsqrt(pinn, put, 1);
}
Time2 = GetTickCount();

cout << "array_rsqrt test time: " << (Time2 - Time1) << endl;


Sorry for my Norwegian variable names. And please don't laugh if I am doing something really wrong. Havn't done C++ that long...:)



I have the function as inline, shouldn't it be that? And BTW in your function you are not initializing i. Is that safe?


and here are the test results for the Square functions that I have tried to make and have found. One of them are from the Quake 3 engine, and even that one is slower. I must be doing something really wrong.

Sqrtf test time: 610
Sqrtl test time: 609
Sqr2 test time: 4047
Sqr3 test time: 1047
Sqr4 test time: 2125
Sqr6 test time: 563
SqrQ test time: 640
Sqr7 test time: 594
Sqr8 test time: 3843
array_rsqrt test time: 703
Sqrt test time: 344

your test time is the array_rsqrt test

Humus
15-Feb-2004, 15:07
Here's a piece of code that approximates 1 / sqrt(x) pretty well. To get the sqrt(x) you simply have to remember that x * (1 / sqrt(x)) = sqrt(x).


inline float rsqrtf(float v){
float v_half = v * 0.5f;
long i = *(long *) &v;
i = 0x5f3759df - (i >> 1);
v = *(float *) &i;
return v * (1.5f - v_half * v * v);
}

NoteMe
15-Feb-2004, 15:21
Here's a piece of code that approximates 1 / sqrt(x) pretty well. To get the sqrt(x) you simply have to remember that x * (1 / sqrt(x)) = sqrt(x).


inline float rsqrtf(float v){
float v_half = v * 0.5f;
long i = *(long *) &v;
i = 0x5f3759df - (i >> 1);
v = *(float *) &i;
return v * (1.5f - v_half * v * v);
}



Thanks, but that code snipet is from the Quace 3 engine...all ready tested that one. I loved that trick

i = 0x5f3759df - (i >> 1);

but I can't get this one to go faster either. And the really confusing thing is that all the functions are going even slower then the Sqrt() if I am using optiomization for speed in the VC++6 compiler...:(

arjan de lumens
15-Feb-2004, 17:50
Sorry was a bit drunk when I replied my first time yesterday, so I havn't had the time to test the first code before now. But am I doing something wrong? I am getting worse test results then for the Sqrt() function. Here is my test code. Using count as 1.


prepare_tables();
float inn = 0;
float* pinn = &inn;
float ut = 0;
float* put = &ut;
Time1 = GetTickCount();
for(inn = 0; inn < 999999; inn += 0.1){
array_rsqrt(pinn, put, 1);
}
Time2 = GetTickCount();

cout << "array_rsqrt test time: " << (Time2 - Time1) << endl;


Sorry for my Norwegian variable names. And please don't laugh if I am doing something really wrong. Havn't done C++ that long...:)



[Edit] I have the function as inline, shouldn't it be that? And BTW in your function you are not initializing i. Is that safe?

The 'i' variable is unused in the entire function, so it shouldn't be anything to worry about. And I would recommend batching up multiple numbers to do square root on if you are going to try my routine or RSQRTPS; calling it with a length of 1 will force the float to memory for each pass, causing an STLF (store-to-load forwarding) stall, and incur a bunch of overhead with loop and pointer setup, adding a bunch of cycles. In particular, I suspect that the STLF stall on certain recent processors (are you using a Pentium4?) is THE reason why neither my code nor the Quake3 code can beat the sqrt() function when doing just 1 number.

NoteMe
15-Feb-2004, 18:03
I am using a AMD MP prosessor 1900+. So maybe you are right. I don't know much about STLF. So I am just nodding and listening..:D...but it doesn't mather that much. I can live with the Sqrt() function at the moment. It is probably not going to be critical anyway. It was just that we started trying to beet it in ASM but couldn't do it, so it ruined all our sleep...:D...

Humus
15-Feb-2004, 21:40
but I can't get this one to go faster either.

Really? In my testing this has been a good deal faster, like twice the speed of sqrt().

NoteMe
15-Feb-2004, 21:43
What compiler are you using? I am using VC++6. And have used "optimize for max speed" on it. And it is about the same speed.

Simon F
16-Feb-2004, 08:51
Here's a piece of code that approximates 1 / sqrt(x) pretty well. To get the sqrt(x) you simply have to remember that x * (1 / sqrt(x)) = sqrt(x).

Just to add to Humus's post: if you're doing 3D graphics you probably don't even want sqrt but 1/sqrt, and so these sort of routines are a big winner.

I used a variation of the 1/sqrt code from "Graphics Gems" (the code's available on the net) in the original SGL library's lighting functions because it was much faster than the Pentium's native sqrt+divide!!

NoteMe
16-Feb-2004, 11:40
I was alsow thinking about that. Thats why I didn't multiply it with X at all when I tested it. But thanks for commenting on that. I have ordered that book BTW...:D But it hasn't got here yet. Takes so much time to get here to Notway...:(

Humus
16-Feb-2004, 12:56
What compiler are you using? I am using VC++6. And have used "optimize for max speed" on it. And it is about the same speed.

Using MSVC 6.0 too. It was quite some time since I made the test, but I sure did get a good deal better performance with it over sqrt(), especially over 1 / sqrtf(). It has been my experience all the time that it is faster, even recently in my cloth demo I saw a few fps increase when I used it. But I suppose it could differ between AMD and Intel CPUs.

Simon F
16-Feb-2004, 13:01
I was alsow thinking about that. Thats why I didn't multiply it with X at all when I tested it. But thanks for commenting on that. I have ordered that book BTW...:D But it hasn't got here yet. Takes so much time to get here to Notway...:(
Did you mean you've ordered "Graphic Gems"? Do you know there's actually 5 of them! There are several square root/ Inverse Square root algorithms in volumes 1, 3, and 5, but the one I was refering to is in volume 5.

NoteMe
16-Feb-2004, 14:13
Was writing before I was reading there...I have ordered gameprogramming gems. 2 and 3, number 4 is not out yet. Hav't looked at graphics gems yet.

ØØ

arjan de lumens
18-Feb-2004, 15:35
One thing I just thought of: are you sure that the VC6 sqrt() function call doesn't get dead-code-eliminated in your tests? If you don't actually use the result of sqrt() for something, the compiler may very well detect this and just remove the sqrt() call altogether as part of the optimizations it does on your code, screwing up your benchmarks.

NoteMe
20-Feb-2004, 13:40
Ohhh....didn't think of that....maybe I should have a look at it. And try to use a smaller loop, but outputting the answer in every iteration....

Thanks

ElDonAntonio
10-Mar-2004, 07:14
This may be dumb, but are you compiling in release?

ector
10-Mar-2004, 10:06
What are you doing that causes Sqrt to be a bottleneck? Are you sure it is?
I can't think of any kind of game that would need to do so many sqrts...

NoteMe
13-Mar-2004, 16:22
This may be dumb, but are you compiling in release?


I did it both ways to test....

NoteMe
13-Mar-2004, 16:23
What are you doing that causes Sqrt to be a bottleneck? Are you sure it is?
I can't think of any kind of game that would need to do so many sqrts...


You "non all post reader"..:D..just kidding with you...but I will quote my self for you...)


...but it doesn't mather that much. I can live with the Sqrt() function at the moment. It is probably not going to be critical anyway. It was just that we started trying to beet it in ASM but couldn't do it, so it ruined all our sleep......

Nick
15-Mar-2004, 14:08
inline float rsqrtf(float v){
float v_half = v * 0.5f;
long i = *(long *) &v;
i = 0x5f3759df - (i >> 1);
v = *(float *) &i;
return v * (1.5f - v_half * v * v);
}

The 'correct' number to use in the fourth line is 5F400000h. This will result in rsqrtf(1) == 1. There's lots of interesting information about this approximation (and others) in this thread at flipCode: arcus cosinus (http://www.flipcode.com/cgi-bin/msg.cgi?showThread=00002548&forum=3dtheory&id==-1#msgmark_63). If you need a better approximation, one Newton-Rhapson iteration makes is quite precise.
To get the sqrt(x) you simply have to remember that x * (1 / sqrt(x)) = sqrt(x).
Unless x = 0... A little while ago my application's performance was suddenly halved. After a lot of searching, I found out that 0 * Inf = NaN. And doing further computations with NaN is extremely slow. Here's a fool-proof SSE implementation:

inline float fsqrt(float x)
{
__asm
{
rsqrtss xmm0, x
rcpss xmm0, xmm0
movss x, xmm0
}

return x;
}

Although rcpss is an approximation instruction as well, I found out that it doesn't noticably reduce total precision of the function.

Scali
15-Mar-2004, 15:59
An alternative fool-proof implementation would be something like:


inline float fsqrt(float x)
{
static float zero[4] = { 0, 0, 0, 0 };

__asm
{
movaps xmm0, x
movaps xmm1, zero
cmpsseq xmm1, xmm0
addss xmm0, xmm1
rsqrtss xmm0, xmm0
mulss xmm0, x
movss x, xmm0
}

return x;
}


(Could be mistakes here, I'm not on a PC with Intel docs and/or SSE/compiler to test it, so just from the top of my head).
Anyway, the basic idea is to add 1 (result from the compare operation) to x if x == 0, so that you will not get 1/0, but 1/1.
The mul afterwards will make sure that the result is still 0 in this case (0*1/1 == 0).
It should be slightly more accurate than the 1/(1/sqrt), and speedwise it should be okay, since the rcp operation is relatively slow compared to cmp, add and mul. On some CPUs it may actually be faster, I don't know.

Nick
15-Mar-2004, 16:38
How about:

inline float fsqrt(float x)
{
static int big = 7F7FFFFFh;

__asm
{
rsqrtss xmm0, x
minss xmm0, [big]
mulss xmm0, x
movss x, xmm0
}

return x;
}

if every bit of precision is welcome...

Scali
15-Mar-2004, 19:33
Ah, that's even better. I overlooked that instruction.
This is definitely preferred over the rcp then. Probably more precise AND faster.