Hi,
I'm working on an app that needs fast integer dot products. My CPU supports SSE3 but not SSSE3. Right now I multiply the two vectors element-wise and then implement the "horizontal add" with 2 shuffles and 2 adds. The dot products I need are actually 3 wide instead of the usual 4 wide, and the fourth component is guaranteed to be zero before I take the dot product. I can't see any way to use this simplification to reduce the op count.
If anyone can suggest a faster way of doing this, I would be very grateful. It's a frequently called function (inlined, of course) in the innermost loops, so any optimization will really help.
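For reference, the multiply-then-horizontal-add approach described above could look roughly like the sketch below. It assumes 32-bit signed components, which the post does not actually specify, and the function names are hypothetical. SSE2 has no packed 32-bit low multiply (`_mm_mullo_epi32` is SSE4.1), so the multiply is emulated with `_mm_mul_epu32` here; that emulation is part of the assumption, not necessarily what the real code does.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: low 32 bits of each lane-wise product, SSE2 only.
   The low 32 bits are the same for signed and unsigned operands. */
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    /* 64-bit products of lanes 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* 64-bit products of lanes 1 and 3 (shift them down into lanes 0 and 2) */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* Gather the low halves of each product and interleave them back in order */
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0)));
}

/* Horizontal sum of four 32-bit lanes: 2 shuffles and 2 adds. */
static inline int32_t hsum_epi32(__m128i v)
{
    /* Swap the 64-bit halves, then add: lane 0 = v0+v2, lane 1 = v1+v3 */
    __m128i s = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Swap adjacent lanes, then add: lane 0 = v0+v1+v2+v3 */
    __m128i t = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(t);
}

/* Hypothetical 3-wide dot product; lane 3 of both inputs is assumed to be zero. */
static inline int32_t dot3_epi32(__m128i a, __m128i b)
{
    return hsum_epi32(mullo_epi32_sse2(a, b));
}
```

With `a = _mm_set_epi32(0, 3, 2, 1)` and `b = _mm_set_epi32(0, 6, 5, 4)`, `dot3_epi32(a, b)` computes 1*4 + 2*5 + 3*6. Note that the zero fourth lane does not remove any step in this form, which is exactly the problem I'm asking about.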
Mods, if this question is inappropriate for this forum, feel free to move it.
Thanks in advance.