integer dot product

rpg.314

Veteran
Hi,

There's an app I am working on that needs fast integer dot products. My CPU supports SSE3 but not SSSE3 :cry:. Right now I am multiplying the vectors and then implementing the "horizontal add" with 2 shuffles and 2 adds. The dot products I need are actually 3 wide instead of the usual 4 wide; the fourth component is guaranteed to be zero before I take the dot product. I am unable to see any way to use this simplification to reduce the op count.

If anyone can suggest a faster way of doing this, I would be very grateful. It's a frequently called function (inlined, of course) in the innermost loops, and any optimization will really help.

Mods, if this question is inappropriate for this forum, feel free to move it.

Thanks in advance.
 
SSSE3 is of little use here. The PHADDD instruction is actually slower than a series of shuffles and adds (unless you can use all the resulting components).

You can only achieve a really good speedup by parallelizing things into structure-of-array (SoA) form. Instead of doing one dot product at a time, do four of them, one in each component of the vector. A floating-point dot product then looks like this:
Code:
movaps xmm0, x0
movaps xmm1, y0
movaps xmm2, z0
mulps xmm0, x1
mulps xmm1, y1
mulps xmm2, z1
addps xmm0, xmm1
addps xmm0, xmm2   // Results in xmm0
This uses all four components in each instruction, resulting in maximum efficiency. On a Core 2 or Phenom this can execute in just 3 cycles (throughput). That's one dot product in 0.75 cycles. ;)

So try hard to rewrite your algorithm in SoA form and it should pay off. What exactly are you working on?
 
There's an app I am working on that needs fast integer dot products. My CPU supports SSE3 but not SSSE3 :cry:. Right now I am multiplying the vectors and then implementing the "horizontal add" with 2 shuffles and 2 adds. The dot products I need are actually 3 wide instead of the usual 4 wide; the fourth component is guaranteed to be zero before I take the dot product. I am unable to see any way to use this simplification to reduce the op count.
If I understand this correctly, you want a replacement for PHADDD, right? Since you are already using SSE instructions, I suppose your data is already 16-byte aligned, and perhaps you can do 4 of those dot products together as Nick suggested (albeit with integer instructions). What I was wondering is exactly which instruction you are using for the multiply?
 
So try hard to rewrite your algorithm in SoA form and it should pay off. What exactly are you working on?

Thanks a lot for pointing that out. I knew SoA arrangement tends to be better than AoS arrangement for vector math, but it didn't occur to me for this. Don't know why.

I am working on BFS (on a random graph), tailored to my specific needs and the special case (of my data).

If I understand this correctly, you want a replacement for PHADDD, right? Since you are already using SSE instructions, I suppose your data is already 16-byte aligned, and perhaps you can do 4 of those dot products together as Nick suggested (albeit with integer instructions). What I was wondering is exactly which instruction you are using for the multiply?

I didn't really know about that instruction. I just looked at the vector intrinsics available for SSE, SSE2, and SSE3 CPUs. There was no integer dot product instruction available, so I tried to do the best I could, and I came up with the solution I wrote about in my first post.

Yes, my data is 16-byte aligned. I am using GCC intrinsics. I am a little hesitant to touch raw assembly, btw. Now that you are asking about it, I wonder whether the vector intrinsics are good enough? I hope they are. In case they aren't, I would like to know why.
 
I didn't really know about that instruction. I just looked at the vector intrinsics available for SSE, SSE2, and SSE3 CPUs. There was no integer dot product instruction available, so I tried to do the best I could, and I came up with the solution I wrote about in my first post.
I see... GCC x86 intrinsics map 1-to-1 to the various instructions, so it's just as if you were writing assembly with them; there would be no benefit in dropping below them.
Yes, my data is 16-byte aligned. I am using GCC intrinsics. I am a little hesitant to touch raw assembly, btw. Now that you are asking about it, I wonder whether the vector intrinsics are good enough? I hope they are. In case they aren't, I would like to know why.
Have you measured a significant performance difference with this approach compared to plain scalar code? If you really need a significant speedup, going for an SoA (or at least interleaved) arrangement, which lets you do 4 DPs using truly vectorized code, would be the way to go. Horizontal/intra-vector operations are always a kludge; in many cases they're not even worth it over plain scalar code.
What would be great is if you could post the code you are dealing with and possibly a disassembly of what GCC is producing.
 
Have you measured a significant performance difference with this approach compared to plain scalar code? If you really need a significant speedup, going for an SoA (or at least interleaved) arrangement, which lets you do 4 DPs using truly vectorized code, would be the way to go. Horizontal/intra-vector operations are always a kludge; in many cases they're not even worth it over plain scalar code.

Well, since the idea of SoA was pointed out, I reworked some things into SoA form here. It has worked out really nicely. I can now see your point about horizontal ops; indeed, they are a pain. However, AoS is typically the first idea I get when I am trying to vectorize code. :rolleyes:
 