In your specific case additions are enough; in the general case you would use fmadds.
In 4 cycles you can do anywhere from 1 to 4 dot4s; how many things you process at the same time is up to you. I usually work with at least 4 entities (vertices, edges, whatever I need to process) at the same time.
Even if you decide to do just one dot product in 4 cycles, you will probably end up with less latency on the result, so the CPU can schedule an instruction that uses it sooner rather than later.
Another advantage of working this way is that the cost of your dot products scales linearly with the width of your vectors, so 4 dot3 take 3 cycles and 4 dot2 take 2 cycles.
Overall you have higher throughput and less latency.
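To make the layout concrete, here is a minimal sketch in plain C rather than real VMX intrinsics. The `vec4` struct stands in for one vector register, and `vec4_madd` and `dot_soa` are hypothetical names made up for this illustration. It assumes the data is stored structure-of-arrays, so each `vec4` holds one component (x, y, z or w) of four different vectors, and each loop iteration corresponds to one multiply-add instruction.

```c
#include <stddef.h>

/* Stand-in for one 128-bit vector register holding four floats. */
typedef struct { float lane[4]; } vec4;

/* Emulates a per-lane multiply-add: a * b + c. */
vec4 vec4_madd(vec4 a, vec4 b, vec4 c) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

/*
 * Four dot products at once, structure-of-arrays layout.
 * a[k] holds component k of four different vectors, and b[k] likewise.
 * width is the number of components (2 for dot2, 3 for dot3, 4 for dot4),
 * so the work is one multiply-add per component.
 * Lane i of the result is the dot product of the i-th pair of vectors.
 */
vec4 dot_soa(const vec4 *a, const vec4 *b, size_t width) {
    vec4 acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (size_t k = 0; k < width; ++k)
        acc = vec4_madd(a[k], b[k], acc);
    return acc;
}
```

Calling `dot_soa(a, b, 3)` on the same arrays gives four dot3 results with three multiply-adds, which is where the linear scaling with vector width comes from.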
Thanks guys! Good ideas, but I probably should have fleshed out what I was doing a bit more to show how I get those -1 to 0 values. Plus, maybe you guys will have a better way of doing what I need to do.
Imagine an array of positive numbers, let's say with these numbers, the last of which is always infinity:
{ 1000, 5700, 9350, infinity }
I have another number, say 7675, and I want to know which slot it falls just under, and I need that slot number to use later on. It used to be figured out with a branch, which of course is bad. Instead I did this, all with vmx instructions to make sure everything stays in the vmx unit registers:
1) splat the target number across another vector so we have { 7675, 7675, 7675, 7675 }
2) subtract this new vector from the original one, giving: { -6675, -1975, 1675, huge +num }
3) do a max on that vector with -1, giving: { -1, -1, 1675, huge +num }
4) do a min on that vector with 0, giving: { -1, -1, 0, 0 }
5) dot that vector with itself, giving: 2
The value I need is 2, which I use elsewhere for other things. So, the above... dumb or not dumb? Or maybe there is a better way to do this? Oh ya, I inline the above into a function and call it unrolled in groups of 4, so that the compiler can make great use of the 128 vmx registers and schedule things much better.
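For anyone who wants to follow the five steps in code, here is a rough sketch in plain C that emulates the vector ops lane by lane instead of using real vmx intrinsics. The `vec4` struct and every function name here are made up for illustration. It assumes the bounds are ascending with infinity in the last lane, and that each difference is either zero or at least 1 in magnitude (as with the whole numbers above), since a difference strictly between -1 and 0 would clamp to a fraction rather than -1 and throw off the count.

```c
#include <math.h>   /* INFINITY, for building the bounds vector */
#include <stddef.h>

/* Stand-in for one 128-bit vector register holding four floats. */
typedef struct { float lane[4]; } vec4;

vec4 vec4_splat(float x) {
    vec4 r = {{x, x, x, x}};
    return r;
}

vec4 vec4_sub(vec4 a, vec4 b) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] - b.lane[i];
    return r;
}

vec4 vec4_max(vec4 a, vec4 b) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] > b.lane[i] ? a.lane[i] : b.lane[i];
    return r;
}

vec4 vec4_min(vec4 a, vec4 b) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] < b.lane[i] ? a.lane[i] : b.lane[i];
    return r;
}

float vec4_dot4(vec4 a, vec4 b) {
    float sum = 0.0f;
    for (size_t i = 0; i < 4; ++i) sum += a.lane[i] * b.lane[i];
    return sum;
}

/* Returns the index of the first bound that is >= target. */
int find_slot(vec4 bounds, float target) {
    vec4 t = vec4_splat(target);            /* 1) splat the target      */
    vec4 d = vec4_sub(bounds, t);           /* 2) bounds - target       */
    d = vec4_max(d, vec4_splat(-1.0f));     /* 3) clamp below at -1     */
    d = vec4_min(d, vec4_splat(0.0f));      /* 4) clamp above at 0      */
    return (int)vec4_dot4(d, d);            /* 5) count the -1 lanes    */
}
```

With bounds { 1000, 5700, 9350, INFINITY } and a target of 7675 this returns 2, matching the steps above. A target exactly equal to a bound gives a difference of 0 for that lane, so it lands in that bound's slot rather than the next one.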