What are the advantages of a hardware dot product instruction?

I have read several times that Xenon implements a dedicated dot product instruction in its cores and, as has also been said, that this may have been one of the most difficult things for IBM to fit into the design.
I have also read many times that the dot product is one of the most used vector operations nowadays and that it is useful for things like physics and graphics calculations.
Well, my question is: what advantages could this instruction give Xenon over Cell, if Cell does not have a dedicated hardware instruction for it? Only a saving in cycles?
And moreover, in a graphics engine, which tasks could be left to the VMX units thanks to this instruction? Would this mean more polys and more crowded games, as can be seen for example in Kameo or N3?
 
Courtesy of Jaws who brought up an old thread, this posting shows the palaver one has to go through in implementing fast dot-products without a DP instruction:

http://www.beyond3d.com/forum/showpost.php?p=477372&postcount=16

The second approach there runs in 4 cycles, which is nice and fast for 4 dot-products (i.e. 4 pairs of Vec4s are DP'd). It's fairly standard practice as far as I can tell.

Unfortunately the code required to encode the four pairs of source vectors into the starting "8 vector format" isn't given. I don't know how many instructions/cycles that would take.
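
To make the "4 dot-products in 4 instructions" idea concrete, here is a minimal sketch in plain C that models a 4-wide SIMD register as a struct. It is not the code from the linked post: the type `vec4` and the helpers `v_mul`, `v_madd` and `dot4_soa` are hypothetical names, and the layout (one register per component, eight input registers in total) is only an assumption about what the "8 vector format" looks like.

```c
#include <stddef.h>

/* Scalar model of a 4-wide SIMD register (hypothetical type). */
typedef struct { float lane[4]; } vec4;

/* Lane-wise multiply: stands in for one SIMD instruction. */
static vec4 v_mul(vec4 a, vec4 b) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

/* Lane-wise multiply-add (a * b + c): also stands in for one instruction. */
static vec4 v_madd(vec4 a, vec4 b, vec4 c) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    return r;
}

/* Four dot products at once. ax holds the x components of four
   different vectors A0..A3, ay their y components, and so on.
   Lane i of the result is dot(Ai, Bi). */
static vec4 dot4_soa(vec4 ax, vec4 ay, vec4 az, vec4 aw,
                     vec4 bx, vec4 by, vec4 bz, vec4 bw) {
    vec4 r = v_mul(ax, bx);   /* 1: x products       */
    r = v_madd(ay, by, r);    /* 2: add y products   */
    r = v_madd(az, bz, r);    /* 3: add z products   */
    r = v_madd(aw, bw, r);    /* 4: add w products   */
    return r;
}
```

Note that no lane ever needs data from another lane here, which is why the kernel is so cheap. The cost is pushed into getting the data into this layout in the first place.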

The point of Xenon's VMX pipeline is that the programmer doesn't have to encode the source vectors like this, and still gets the benefit of fast DP.

Anyway, I'm interested in this topic so it would be great to get some detailed discussion of these approaches :smile:

Jawed
 
If I remember correctly, Xenon can perform a DP per cycle, right?

If that's the case, then Xenon's DP calculations will actually be quite a bit faster than those on a processor without a DP instruction, even with Jaws' approaches?
 
No, it can't do a DP per cycle. That'd require a massive ALU array. The ALUs still process each of the adds and multiplies serially. I think the DP command does save some cycles on Load and Store and stuff though.
 
LightHeaven said:
If I remember correctly, Xenon can perform a DP per cycle, right?

If that's the case, then Xenon's DP calculations will actually be quite a bit faster than those on a processor without a DP instruction, even with Jaws' approaches?


Is that even feasible, really? To have a single-cycle dot product instruction? Seems like you'd need quite a few transistors and it'd stress the pipeline (you'd be hard pressed to do that on a 20+ stage pipeline, it seems) -- it takes ~4 instructions to do a DP, and I can't imagine fitting those ~4 instructions' worth of data manipulation in a single cycle without a lot of coercion and extra transistors.

If that is the case then maybe I just don't know what I'm talking about (which is quite possible), but it seems to me that'd not be an easy task (especially on a pipeline this long -- is a single-cycle DP instruction feasible on a CPU like XeCPU?). I thought the point of a DP instruction was ease of use and programmer friendliness, not necessarily a huge speed boost...
 
Shifty Geezer said:
No, it can't do a DP per cycle. That'd require a massive ALU array. The ALUs still process each of the adds and multiplies serially. I think the DP command does save some cycles on Load and Store and stuff though.

It can issue one DP/cycle.
Latency is >> 1cy.
 
Shifty Geezer said:
Why can't one use a DotProduct() function and have the compiler produce the low level code? :???:
As I understand, reordering vectors to make groups of dot products faster requires knowledge on the part of the programmer. It's too complex for a compiler to sort out.
 
Shifty Geezer said:
Why can't one use a DotProduct() function and have the compiler produce the low level code? :???:
I'm sure you can.

The issue I was pointing to was the encoding of 8 vectors into an "interleaved 8-vector DP format". That consumes more cycles, so your 4-cycle DP for four Vec4s is no longer an average of 1 DP per clock.
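
As a rough illustration of that encoding step, the sketch below models it as a 4x4 transpose from "one register per vector" to "one register per component", plus a helper that folds the encoding cost into the per-DP average. The names `vec4`, `aos_to_soa` and `avg_cycles_per_dp` are hypothetical, and the encode cost is left as a parameter because the thread does not give a figure for it.

```c
/* Scalar model of a 4-wide SIMD register (hypothetical type). */
typedef struct { float lane[4]; } vec4;

/* in[v] is vector v stored as x,y,z,w.
   out[c] collects component c of all four vectors: a 4x4 transpose. */
static void aos_to_soa(const vec4 in[4], vec4 out[4]) {
    for (int c = 0; c < 4; ++c)
        for (int v = 0; v < 4; ++v)
            out[c].lane[v] = in[v].lane[c];
}

/* Average cycles per dot product for a batch of four, once the
   cost of encoding the inputs is added to the kernel itself.
   Both cycle counts are inputs, not measured values. */
static double avg_cycles_per_dp(unsigned kernel_cycles, unsigned encode_cycles) {
    return (kernel_cycles + encode_cycles) / 4.0;
}
```

For example, `avg_cycles_per_dp(4, 0)` gives 1.0, the ideal case where the data is already laid out correctly, while an assumed 16 cycles of encoding for the two sets of four vectors gives `avg_cycles_per_dp(4, 16)` = 5.0. The 16 is purely illustrative.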

Jawed
 
ERP said:
It can issue one DP/cycle.
Latency is >> 1cy.
But it doesn't calculate (do) the DP in one cycle. From the time you issue the command several cycles must pass before you get your answer, right? But then what if you issue a DP every cycle? :???:
 
Shifty Geezer said:
But it doesn't calculate (do) the DP in one cycle. From the time you issue the command several cycles must pass before you get your answer, right? But then what if you issue a DP every cycle? :???:

You can issue as many as you want, at 1-cycle intervals.
Yes, the result takes >>1cy to get back.

That's what the numbers I gave mean.
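
The difference between issue rate and latency can be put into a small model. This is only a sketch of an idealised, fully pipelined unit; the function names are hypothetical, and the latency is a parameter because the thread only says it is much greater than one cycle.

```c
/* Cycles until the last of n independent operations has produced
   its result, when one operation can be issued every cycle and each
   result appears `latency` cycles after issue. */
static unsigned long pipelined_cycles(unsigned long n, unsigned long latency) {
    if (n == 0) return 0;
    return latency + (n - 1);
}

/* Same unit, but every operation needs the previous result, so
   nothing overlaps and the full latency is paid each time. */
static unsigned long dependent_cycles(unsigned long n, unsigned long latency) {
    return n * latency;
}
```

With an assumed latency of 14 cycles (an illustrative number, not a Xenon specification), `pipelined_cycles(100, 14)` is 113, which approaches one DP per cycle, whereas `dependent_cycles(100, 14)` is 1400. So "one DP per cycle" holds only when there is enough independent work to keep the pipeline full.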
 
It's difficult/expensive to implement a horizontal add, which is required for a DP instruction with 1-cycle throughput. Usually, we'd just transpose our vectors/matrices and do 4 DPs in '4' cycles, which is essentially the same. The XeCPU would have the advantage if you need to do fewer than 4 DPs, I suppose.
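
For comparison, here is a sketch of what a single dot product costs when the two vectors stay in the ordinary x,y,z,w layout and there is no DP instruction. After the lane-wise multiply, the four products sit in different lanes, so they have to be moved across lanes before they can be added: that is the horizontal add. The model is plain C with hypothetical helper names (`v_mul`, `v_add`, `v_rotate`, `dot_aos`); each helper stands in for one SIMD instruction.

```c
/* Scalar model of a 4-wide SIMD register (hypothetical type). */
typedef struct { float lane[4]; } vec4;

static vec4 v_mul(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

static vec4 v_add(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Rotate lanes left by n: lane i of the result is lane (i + n) % 4. */
static vec4 v_rotate(vec4 a, int n) {
    vec4 r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[(i + n) % 4];
    return r;
}

/* One dot product of two xyzw vectors: 1 multiply, then a
   horizontal add built from 2 rotates and 2 adds. */
static float dot_aos(vec4 a, vec4 b) {
    vec4 p = v_mul(a, b);           /* ax*bx, ay*by, az*bz, aw*bw   */
    p = v_add(p, v_rotate(p, 2));   /* lane 0 now holds p0 + p2     */
    p = v_add(p, v_rotate(p, 1));   /* lane 0 now holds the full sum */
    return p.lane[0];
}
```

In this model that is five dependent operations for one result, against four operations for four results in the transposed form, which is where a dedicated DP instruction would help when only one or two dot products are needed.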
 
Bobbler said:
Is that even feasible, really? To have a single-cycle dot product instruction? Seems like you'd need quite a few transistors and it'd stress the pipeline (you'd be hard pressed to do that on a 20+ stage pipeline, it seems) -- it takes ~4 instructions to do a DP, and I can't imagine fitting those ~4 instructions' worth of data manipulation in a single cycle without a lot of coercion and extra transistors.

If that is the case then maybe I just don't know what I'm talking about (which is quite possible), but it seems to me that'd not be an easy task (especially on a pipeline this long -- is a single-cycle DP instruction feasible on a CPU like XeCPU?). I thought the point of a DP instruction was ease of use and programmer friendliness, not necessarily a huge speed boost...
I really don't know, actually. I don't remember very well, but I guess this is what MS claimed...

BTW, I know it's not the most reliable source out there, but I'm pretty sure that in Major Nelson's comparison between PS3 and 360, he states Xenon can do 1 dot per cycle...
I spent a lot of time trying to figure out how they managed to do this, but I guess that they most probably didn't :p
 
Jawed said:
The issue I was pointing to was the encoding of 8 vectors into an "interleaved 8-vector DP format".
Unless we are talking about temporarily generated results, you would arrange your data into optimal structures in advance - it's something you HAVE to do for any SIMD anyhow, it's just the exact layout that varies.

To be fair, temporary results not being in the correct orientation is an issue I've encountered on every SIMD I have ever used save for one; it's just less frequent if you have an explicit DP instruction.

Fresh said:
The XeCPU would have the advantage if you need to do fewer than 4 DPs, I suppose.
Well, the issue is really that the optimal processing element in SoA SIMD is a 4x4 quad, not a 4x1 vector. So when you're designing your data structures, you get certain restrictions on how data can be placed.
 