Compiler optimization of quaternion multiplication

rpg.314

This is my code for performing quaternion multiplication. I know it's old hat, but my question is from the POV of compiler optimization. The results match a known library nicely. :) In my convention, the first three components are the vector part and the last is the scalar part.

#include <iostream>
#include <cmath>
#include <pmmintrin.h>
#include <Eigen/Geometry>
using namespace std;
using namespace Eigen;

float __attribute__((aligned(16))) printBuf[4];

// Returns (v[p], v[q], v[r], v[s]).
#define vec4f_swizzle(v,p,q,r,s) (_mm_shuffle_ps( (v),(v), ((s)<<6|(r)<<4|(q)<<2|(p))))

const unsigned int zero     = 0;
const unsigned int flipSign = 0x80000000u; // sign bit of an IEEE float

// Flips the sign of the last (scalar) lane only.
const __m128 quat_mask = _mm_setr_ps(*(float*)&zero,
                                     *(float*)&zero,
                                     *(float*)&zero,
                                     *(float*)&flipSign);

void print(__m128 a)
{
    _mm_store_ps(printBuf, a);
    cout << printBuf[0] << ' '
         << printBuf[1] << ' '
         << printBuf[2] << ' '
         << printBuf[3] << endl;
}

// Quaternion product in (x,y,z,w) layout:
//   result.w = a.w*b.w - dot(a.v, b.v)
//   result.v = a.w*b.v + b.w*a.v + cross(a.v, b.v)
__m128 quat_mul(__m128 a, __m128 b)
{
    __m128 swiz1 = vec4f_swizzle(b,3,3,3,3); // (b.w, b.w, b.w, b.w)
    __m128 swiz2 = vec4f_swizzle(a,2,0,1,0); // (a.z, a.x, a.y, a.x)
    __m128 swiz3 = vec4f_swizzle(b,1,2,0,0); // (b.y, b.z, b.x, b.x)
    __m128 swiz4 = vec4f_swizzle(a,3,3,3,1); // (a.w, a.w, a.w, a.y)
    __m128 swiz5 = vec4f_swizzle(b,0,1,2,1); // (b.x, b.y, b.z, b.y)
    __m128 swiz6 = vec4f_swizzle(a,1,2,0,2); // (a.y, a.z, a.x, a.z)
    __m128 swiz7 = vec4f_swizzle(b,2,0,1,2); // (b.z, b.x, b.y, b.z)
    __m128 mul4  = _mm_mul_ps(swiz6, swiz7);
    __m128 mul3  = _mm_mul_ps(swiz4, swiz5);
    __m128 mul2  = _mm_mul_ps(swiz2, swiz3);
    __m128 mul1  = _mm_mul_ps(a, swiz1);
    __m128 flip1 = _mm_xor_ps(mul4, quat_mask); // negate the w lane of mul4
    __m128 flip2 = _mm_xor_ps(mul3, quat_mask); // negate the w lane of mul3
    __m128 retVal  = _mm_sub_ps(mul1, mul2);
    __m128 retVal2 = _mm_add_ps(flip1, flip2);
    return _mm_add_ps(retVal, retVal2);
}

int main()
{
    Quaternionf a(1,2,3,4);     // Eigen's constructor order is (w,x,y,z)
    Quaternionf b(11,12,13,14);
    __m128 a1, b1;
    a1 = _mm_setr_ps(2.0f, 3.0f, 4.0f, 1.0f);     // same values in (x,y,z,w) layout
    b1 = _mm_setr_ps(12.0f, 13.0f, 14.0f, 11.0f);
    print(a1);
    print(b1);
    print(quat_mul(a1, b1));
    cout << "quat\n";
    asm("#start"); // markers to locate Eigen's product in the generated assembly
    Quaternionf c = a * b;
    asm("#end");

    cout << c.x() << ' '
         << c.y() << ' '
         << c.z() << ' '
         << c.w() << "\n";
    cout << "test\n";
    cout << a.x() << ' '
         << a.y() << ' '
         << a.z() << ' '
         << a.w() << "\n";
    return 0;
}
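For reference, this is the scalar math the shuffles above implement (a sketch in my (x,y,z,w) convention, not part of the compiled program):

void quat_mul_ref(const float a[4], const float b[4], float out[4])
{
    out[0] = a[3]*b[0] + b[3]*a[0] + a[1]*b[2] - a[2]*b[1]; // x
    out[1] = a[3]*b[1] + b[3]*a[1] + a[2]*b[0] - a[0]*b[2]; // y
    out[2] = a[3]*b[2] + b[3]*a[2] + a[0]*b[1] - a[1]*b[0]; // z
    out[3] = a[3]*b[3] - a[0]*b[0] - a[1]*b[1] - a[2]*b[2]; // w = a.w*b.w - dot(a.v, b.v)
}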

Compiled with gcc 4.3 using

g++ quat_mul.cpp -msse3 -O3 -I ./eigen-2.0.0 -march=native

on my 1.6 GHz Athlon (K8) CPU, which supports SSE3.

The assembly for the core routine is

00000000004009e0 <_Z8quat_mulU8__vectorfS_>:
4009e0: 0f 28 f9 movaps %xmm1,%xmm7
4009e3: 0f 28 f1 movaps %xmm1,%xmm6
4009e6: 0f 28 d8 movaps %xmm0,%xmm3
4009e9: 0f 28 e9 movaps %xmm1,%xmm5
4009ec: 0f 28 d0 movaps %xmm0,%xmm2
4009ef: 0f c6 f9 ff shufps $0xff,%xmm1,%xmm7
4009f3: 0f c6 f1 09 shufps $0x9,%xmm1,%xmm6
4009f7: 0f c6 e9 64 shufps $0x64,%xmm1,%xmm5
4009fb: 0f 28 e0 movaps %xmm0,%xmm4
4009fe: 0f c6 d8 7f shufps $0x7f,%xmm0,%xmm3
400a02: 0f c6 d0 89 shufps $0x89,%xmm0,%xmm2
400a06: 0f c6 c9 92 shufps $0x92,%xmm1,%xmm1
400a0a: 0f c6 e0 12 shufps $0x12,%xmm0,%xmm4
400a0e: 0f 59 dd mulps %xmm5,%xmm3
400a11: 0f 59 d1 mulps %xmm1,%xmm2
400a14: 0f 28 0d 45 0f 20 00 movaps 0x200f45(%rip),%xmm1 # 601960 <_ZL9quat_mask>
400a1b: 0f 59 e6 mulps %xmm6,%xmm4
400a1e: 0f 59 c7 mulps %xmm7,%xmm0
400a21: 0f 57 d1 xorps %xmm1,%xmm2
400a24: 0f 57 d9 xorps %xmm1,%xmm3
400a27: 0f 5c c4 subps %xmm4,%xmm0
400a2a: 0f 58 d3 addps %xmm3,%xmm2
400a2d: 0f 58 c2 addps %xmm2,%xmm0
400a30: c3 retq
400a31: 66 66 66 66 66 66 2e nopw %cs:0x0(%rax,%rax,1)
400a38: 0f 1f 84 00 00 00 00
400a3f: 00

My questions are:

1) I used a lot of variables in the mul routine, and yet the compiler used only 8 registers. Are compilers really that good at register allocation and lifetime analysis?

2) If yes to 1), should one use lots of intermediate, if fleeting, variables in core routines to help the compiler optimize?

3) The instructions seem to be nicely scheduled. Is that because I ordered my C code that way, or something else?

4) I noticed that the SSE assembly goes the other way around, i.e. the destination register appears on the right-hand side of an instruction. Is that right?

5) Any other tips for writing optimized SSE code in general, or for improving this code in particular?

Comments/criticisms/suggestions welcome.
 
1) I used a lot of variables in the mul routine, and yet the compiler used only 8 registers. Are compilers really that good at register allocation and lifetime analysis?
Your code is very simple from a register-allocation point of view. Variables which are written once, used once shortly afterwards, and then never used again are handled without much trouble; even a simple linear-scan allocator does well on such code. Variables with a long lifetime pose far more problems. BTW, the next iteration of gcc (4.4) will sport a significantly improved register allocator.
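To see the difference (a made-up sketch, not code from this thread):

__m128 dot_accumulate(const __m128* a, const __m128* b, int n)
{
    __m128 acc = _mm_setzero_ps();          // long-lived: occupies a register
                                            // (or spills) for the whole loop
    for (int i = 0; i < n; ++i) {
        __m128 t = _mm_mul_ps(a[i], b[i]);  // short-lived: dead after the add,
        acc = _mm_add_ps(acc, t);           // so its register is reusable at once
    }
    return acc;
}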
2) If yes to 1), should one use lots of intermediate, if fleeting, variables in core routines to help the compiler optimize?
Not necessarily; compilers are good enough at other optimizations like global common-subexpression elimination, so you don't need to stuff every step of your code into temporaries. Qualifying pointers correctly with 'const' or 'restrict' helps a compiler much more, for example.
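A minimal sketch of the 'restrict' point (g++ spells the qualifier '__restrict'; scale4 is a made-up example):

// Without __restrict, gcc must assume a store through 'out' could change
// what 'a' points to, forcing it to reload a[i] and inhibiting vectorization.
void scale4(float* __restrict out, const float* __restrict a, float s)
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] * s;
}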
3) The instructions seem to be nicely scheduled. Is that because I ordered my C code that way, or something else?
GCC schedules instructions, but it's not that good at it. In this case the schedule could be better (for example by alternating the movaps and shufps instructions), but it would hardly make any difference on today's processors.
4) I noticed that the SSE assembly goes the other way around, i.e. the destination register appears on the right-hand side of an instruction. Is that right?
You are seeing x86 assembly in the AT&T format, which is <op> <source>, <dest>. If you want to see the output in the more familiar Intel format, disassemble the .o by passing the '-M intel' option to objdump.
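For example, for the routine above:

objdump -d -M intel quat_mul.o

prints the subps from your dump as "subps xmm0,xmm4" (destination first) instead of the AT&T "subps %xmm4,%xmm0".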
5) Any other tips for writing optimized SSE code in general, or for improving this code in particular?
Pure arithmetic code such as the code you posted is not a problem for a modern optimizer; I'd focus much more on how you deal with memory if I were you. For example, flagging pointers to constant data as 'const' or qualifying pointers to distinct objects with 'restrict' can yield significantly better code, so use them as much as you can in pointer declarations and even more so in function prototypes. GCC doesn't have a link-time optimizer yet, so unless you feed it all the files at once it cannot know that a function declared in another unit doesn't touch the memory you pass it, unless you const-qualify the pointer to that memory.
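To make that last point concrete (log_quat is a made-up function for illustration):

// Declared here, defined in another file that gcc cannot see into.
void log_quat(const float* q); // const: the callee is not supposed to write through q

float twice_sum(const float* q)
{
    float s = q[0] + q[1];  // first read
    log_quat(q);            // opaque call into another translation unit
    return s + q[0] + q[1]; // with the const promise above, gcc may reuse the
                            // values it already loaded instead of reloading them
}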
 
1) I used a lot of variables in the mul routine, and yet the compiler used only 8 registers. Are compilers really that good at register allocation and lifetime analysis?
That is also because there are only 8 SSE registers, xmm0...xmm7, in 32-bit mode.
The x86-64 architecture added 8 more SSE registers, xmm8...xmm15.
 
I compiled for x86-64 only. Still, it seems odd that even in the register-heavy code I have written, I haven't seen more than xmm5 or xmm6 used. It must be something I am doing wrong (or maybe gcc).
 
I compiled for x86-64 only. Still, it seems odd that even in the register-heavy code I have written, I haven't seen more than xmm5 or xmm6 used. It must be something I am doing wrong (or maybe gcc).
It probably means that no more than 6 or 7 variables are live at once. Also take into account that GCC will always try to use the first 8 XMM registers for the most-used variables, because using the extra 8 XMM registers in x86-64 mode requires an extra byte per instruction (the REX prefix), which lowers code density and can also lower pre-decode bandwidth on Core 2 Duo.
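The cost is visible in the encodings; the first line below is taken from the dump above, the second is a hand-assembled sketch of the same instruction on the high registers:

0f 59 dd        mulps %xmm5,%xmm3      (3 bytes)
45 0f 59 dd     mulps %xmm13,%xmm11    (4 bytes: the REX prefix 0x45 extends both register fields)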
 
Aha, that would explain a lot of things. I haven't seen any register beyond xmm7 used so far, even though I have written some pretty register-intensive code. I still wonder why gcc does that; the 8 extra registers are meant to be used. All this does is push the pressure onto the internal registers that the CPU manages via register renaming.
 
Thank you guys a lot!
This thread and your answers helped me a lot!

PS: I registered just to say thanks.
 
If you are running Linux, Valgrind is a powerful debugging tool which also provides pipeline information like cache misses, branch prediction effectiveness, etc.
Just wanted to second this. The "cachegrind" and "callgrind" subtools are very useful.
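For the program above (built without -o, so the binary is a.out), that would be, for example:

valgrind --tool=cachegrind --branch-sim=yes ./a.out
valgrind --tool=callgrind ./a.out

The reports are then read with the cg_annotate and callgrind_annotate scripts.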
 