Can the R300 issue a 4-component vector operation in its PS?

I remember sireric posting in the following thread: http://www.beyond3d.com/forum/viewtopic.php?t=2622&highlight=. It described how the R300 pixel shader can issue a 3-component vector instruction, a scalar instruction, and a texture operation per cycle (VLIW core). Would the R300 hypothetically be able to issue a 4-component vector op and a texture operation? I am guessing that the pixel program unit contains 4 FMACs, which can be set to either 4-way or 3-way SIMD, leaving 1 FMAC available as a scalar unit in 3-way SIMD mode. This would allow the processor to execute a dot4 operation every cycle, as well as a texture operation.
 
Re: Can the R300 issue a 4-component vector operation in its

Luminescent said:
I remember sireric posting in the following thread: http://www.beyond3d.com/forum/viewtopic.php?t=2622&highlight=. It described how the R300 pixel shader can issue a 3-component vector instruction, a scalar instruction, and a texture operation per cycle (VLIW core). Would the R300 hypothetically be able to issue a 4-component vector op and a texture operation?
Yes, the R300 can perform a 4-component vector operation per cycle in conjunction with a texture op.
 
Thanks, opengl; I was just wondering. The R300 PS seems to be very flexible and capable. I wonder if the NV30 will be able to keep up with this nifty setup.
 
Just for kicks, the 9700 Pro would be able to do a 2-component vector, a scalar, and a texture operation, right? This would definitely be disadvantageous, since it would leave 1 FMAD completely idle, but I was wondering whether the pixel shader VLIW format is flexible enough to issue any arbitrary vector, along with a scalar or texture op.
 
Well, doing 2 FMADs or 3 FMADs doesn't really matter -- you can just ignore the 3rd result. But if you want to do a dot2, yes, there's a way to do that. But you can't have 4 scalars (4 different operations per cycle per pixel). Just a 1-, 2-, or 3-component vector and a scalar, or a single 4-component operation (i.e. following DX7/8 generalized co-issue), with a texture op.
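The co-issue rule described above can be written down as a tiny check. This is only a sketch of the rule as stated in this thread, not of the actual R300 instruction encoding: it assumes four FMAD lanes shared between one vector op and one optional scalar op, with the texture op living in its own slot. The names `FMAD_LANES` and `fits_in_one_cycle` are made up for illustration.

```python
FMAD_LANES = 4  # assumption: four FMAD lanes, as guessed in the opening post


def fits_in_one_cycle(vector_components, has_scalar_op):
    """Return True if one vector op plus an optional scalar op fit in a cycle.

    Simplified model: a 1-, 2-, or 3-component vector op may be paired with
    a scalar op, while a 4-component op uses every lane by itself. A texture
    op is assumed to sit in a separate slot, so it is not counted here.
    """
    if not 0 <= vector_components <= FMAD_LANES:
        return False
    lanes_needed = vector_components + (1 if has_scalar_op else 0)
    return lanes_needed <= FMAD_LANES
```

So `fits_in_one_cycle(3, True)` and `fits_in_one_cycle(4, False)` pass, while `fits_in_one_cycle(4, True)` does not.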
 
Thank you, sireric, for the clarification.

On a second note, what does the R300 do when it is faced with a 3- or 4-component vector operation and a complex math operation, such as an rsq or rcp? Is there any sort of special-purpose unit (in the R300 pixel program processor) which could operate in parallel with the FMAC array? With a 3-component vector operation there is an FMAC available for scalar ops, but would it be possible to execute an rcp or rsq operation, for something like anisotropic lighting of pixels?
 
Yes, the scalar op can operate on rcp or sqrt while the vector op is operating on 3 vector ops. When doing a 4 vector op, the scalar could still operate on one of the components in parallel, but the output is defined by the PS's as being that of the DOT4.

rcp, sqrt and multiple other functions are complex, but are executed in one cycle, even with input mods.

Later
 
Humus, I believe rsq is the reciprocal-square root command while sqrt is plain square root.

What I don't understand is how a Fmad unit could execute an sqrt function or rcp with only multiply-add operations available to it.
 
Luminescent said:
What I don't understand is how a Fmad unit could execute an sqrt function or rcp with only multiply-add operations available to it.
Dunno about R300 internals, but all you need to perform an rcp is an FMAD unit, some precalculated tables, and a couple of iterations (instructions can be artificially delayed so that they all appear to take the same execution time).

ciao,
Marco
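As a rough illustration of the "FMAD unit, tables, and a couple of iterations" idea, here is a minimal sketch of a reciprocal built from a seed table and Newton-Raphson refinement. It says nothing about how the R300 actually does it; the table size (`TABLE_BITS`), the midpoint seeding, and the function name `rcp` are all assumptions picked for the example, and only positive inputs are handled.

```python
import math

TABLE_BITS = 5  # hypothetical seed table size: 32 entries
TABLE_SIZE = 1 << TABLE_BITS

# Seed table: reciprocal of the midpoint of each mantissa interval in [1, 2).
RCP_TABLE = [1.0 / (1.0 + (i + 0.5) / TABLE_SIZE) for i in range(TABLE_SIZE)]


def rcp(x, iterations=2):
    """Approximate 1/x for x > 0 with a table seed and Newton-Raphson steps."""
    m, e = math.frexp(x)      # x = m * 2**e with m in [0.5, 1)
    m *= 2.0                  # renormalize so that m is in [1, 2)
    e -= 1
    index = min(int((m - 1.0) * TABLE_SIZE), TABLE_SIZE - 1)
    y = RCP_TABLE[index]      # coarse estimate of 1/m
    for _ in range(iterations):
        # Newton-Raphson step for f(y) = 1/y - m: a multiply-add, then a multiply.
        y = y * (2.0 - m * y)
    return math.ldexp(y, -e)  # 1/x = (1/m) * 2**(-e)
```

Each iteration roughly squares the relative error of the estimate, so a better seed table trades directly against the number of iterations needed.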
 
Luminescent said:
Humus, I believe rsq is the reciprocal-square root command while sqrt is plain square root.

Yeah, but no sqrt instruction is available in the fragment shader, just like division isn't available. Reciprocals and reciprocal square roots are available though as they AFAIK are easier to implement in hardware, and actually are more useful. You can do sqrt with rsq and division with rcp just by adding another multiplication.
Anyway, if the hardware could do a sqrt in one cycle one would think there would be a sqrt instruction in the fragment shader extension. Thus I think sireric really meant the rsq instruction.
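The "one more multiplication" point can be shown in a few lines. This is just a sketch: `rsq` and `rcp` below are plain Python stand-ins for the shader instructions (not any real API), and the inputs are assumed to be positive.

```python
import math


def rsq(x):
    """Stand-in for the shader rsq instruction: 1 / sqrt(x), for x > 0."""
    return 1.0 / math.sqrt(x)


def rcp(x):
    """Stand-in for the shader rcp instruction: 1 / x, for x != 0."""
    return 1.0 / x


def sqrt_via_rsq(x):
    # x * (1 / sqrt(x)) = sqrt(x): one rsq plus one multiply.
    return x * rsq(x)


def div_via_rcp(a, b):
    # a / b = a * (1 / b): one rcp plus one multiply.
    return a * rcp(b)
```

For example, `sqrt_via_rsq(9.0)` comes out at about 3 and `div_via_rcp(6.0, 4.0)` at 1.5.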
 
Isn't an rsq instruction just a pipelined sqrt and rcp? I would have thought that the sqrt function would be available to the developer. I guess you would just take the rcp of the rsq result for the square root :D.
 
Humus said:
Yeah, but no sqrt instruction is available in the fragment shader, just like division isn't available. Reciprocals and reciprocal square roots are available though as they AFAIK are easier to implement in hardware, and actually are more useful. You can do sqrt with rsq and division with rcp just by adding another multiplication.
Anyway, if the hardware could do a sqrt in one cycle one would think there would be a sqrt instruction in the fragment shader extension. Thus I think sireric really meant the rsq instruction.

Well, if it supports a logarithmic/exponential-based power algorithm, then doing a rsq or sqrt (Squirt!) would be trivial. That is, provided that the hardware is capable of doing a power operation each clock, it can trivially do either of the square root ops each clock just by executing pow(x, 0.5) or pow(x, -0.5).

And if the hardware can do a power operation each clock, supporting rsq or sqrt via dedicated transistors would be a waste...
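To make the power-based route concrete, here is a minimal sketch that builds pow from a base-2 logarithm and exponential and then derives both square-root ops from it. It assumes x > 0 and uses Python's full-precision `math.log2`, so it shows the identity only, not the precision a hardware log/exp unit would deliver. The function names are made up for the example.

```python
import math


def pow_via_log_exp(x, p):
    """Compute x**p as 2**(p * log2(x)); only valid for x > 0."""
    return 2.0 ** (p * math.log2(x))


def sqrt_via_pow(x):
    return pow_via_log_exp(x, 0.5)


def rsq_via_pow(x):
    return pow_via_log_exp(x, -0.5)
```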

Speaking of which...does anybody know if there's another good way to do square roots on computer hardware without using the exponential/logarithmic method?
 
Logarithm/exponent-based pow may or may not suffer from limited precision, limiting its usefulness for sqrt/rsq approximation.

Reciprocal square root can be done easily with a lookup table and a couple of Newton iterations. AMD did this with 3DNow! - IIRC it gave about 14 bits of precision from the LUT alone, and 24 bits after 1 full iteration. For plain square root, multiply with the original number afterwards.

LUTs are also useful for other operations (exp, log, rcp, sin, cos, etc).
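The lookup-table-plus-Newton approach for reciprocal square root might look roughly like the sketch below. It is not the 3DNow! or R300 implementation: the 64-entry table, the midpoint seeding, and the names are assumptions for illustration, and the bit counts quoted above are not reproduced by it.

```python
import math

TABLE_BITS = 6                      # hypothetical: 64 table entries
TABLE_SIZE = 1 << TABLE_BITS
STEP = 3.0 / TABLE_SIZE             # table covers mantissas in [1, 4)

# Seed table: 1/sqrt of the midpoint of each interval.
RSQ_TABLE = [1.0 / math.sqrt(1.0 + (i + 0.5) * STEP) for i in range(TABLE_SIZE)]


def rsq(x, iterations=1):
    """Approximate 1/sqrt(x) for x > 0 with a LUT seed and Newton steps."""
    m, e = math.frexp(x)            # x = m * 2**e with m in [0.5, 1)
    m *= 2.0                        # m in [1, 2)
    e -= 1
    if e % 2:                       # odd exponent: move one factor of 2 into m
        m *= 2.0                    # m in [2, 4), e is now even
        e -= 1
    index = min(int((m - 1.0) / STEP), TABLE_SIZE - 1)
    y = RSQ_TABLE[index]            # coarse estimate of 1/sqrt(m)
    for _ in range(iterations):
        # Newton step for f(y) = 1/y**2 - m, using only multiplies and adds.
        y = y * (1.5 - 0.5 * m * y * y)
    return math.ldexp(y, -(e // 2))  # 1/sqrt(x) = (1/sqrt(m)) * 2**(-e/2)


def sqrt_from_rsq(x, iterations=1):
    """Plain square root: multiply by the original number afterwards."""
    return x * rsq(x, iterations)
```

With a larger table the seed gets better and fewer iterations are needed, which is the trade-off a hardware design would tune.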
 
Humus said:
Luminescent said:
Humus, I believe rsq is the reciprocal-square root command while sqrt is plain square root.

Yeah, but no sqrt instruction is available in the fragment shader, just like division isn't available. Reciprocals and reciprocal square roots are available though as they AFAIK are easier to implement in hardware, and actually are more useful. You can do sqrt with rsq and division with rcp just by adding another multiplication.
Anyway, if the hardware could do a sqrt in one cycle one would think there would be a sqrt instruction in the fragment shader extension. Thus I think sireric really meant the rsq instruction.

Could be it was reciprocal sqrt -- too lazy to go back and check. I remember implementing both rcp and sqrt somewhere -- the sqrt isn't very hard in float. You just need to do n/2 for an even exponent, and n/2 with sqrt(2) scaling for odd exponents. Then you need to compute the sqrt of the already normalized mantissa (you can do two look-ups, one scaled by sqrt(2) and the other not, if you don't want to do a full multiply). For reasonable precision (let's say 18~20 bits), it's pretty easy to do a dual table lookup and do a point/slope approximation. For higher precision, some form of Newton-Raphson is usually used, though there are multiple variations that exist (simple, second order, etc...).
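A minimal sketch of the exponent-halving plus point/slope scheme described above, assuming two hypothetical 64-entry tables (one pre-scaled by sqrt(2) for odd exponents) that each store a value and a slope per interval. The table size and names are made up for illustration, only x > 0 is handled, and the result is far less precise than the 18~20 bits mentioned.

```python
import math

TABLE_BITS = 6            # hypothetical table size: 64 intervals over [1, 2)
N = 1 << TABLE_BITS


def _build_table(scale):
    """Per-interval (point, slope) pairs for scale * sqrt(m), m in [1, 2)."""
    table = []
    for i in range(N):
        lo = 1.0 + i / N
        hi = 1.0 + (i + 1) / N
        f_lo = scale * math.sqrt(lo)
        f_hi = scale * math.sqrt(hi)
        table.append((f_lo, (f_hi - f_lo) * N))   # slope across the interval
    return table


TABLE_EVEN = _build_table(1.0)             # used when the exponent is even
TABLE_ODD = _build_table(math.sqrt(2.0))   # sqrt(2) scaling baked in for odd


def approx_sqrt(x):
    """Approximate sqrt(x) for x > 0 via exponent halving and point/slope."""
    m, e = math.frexp(x)      # x = m * 2**e with m in [0.5, 1)
    m *= 2.0                  # m in [1, 2)
    e -= 1
    table = TABLE_ODD if e % 2 else TABLE_EVEN
    half_e = e // 2           # floor division: (e - 1) / 2 when e is odd
    index = min(int((m - 1.0) * N), N - 1)
    point, slope = table[index]
    frac = (m - 1.0) - index / N          # offset within the interval
    return math.ldexp(point + slope * frac, half_e)
```

Picking between the two tables replaces the full multiply by sqrt(2); the only arithmetic left per lookup is one multiply-add.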
 
Sireric wrote:
Yes, the scalar op can operate on rcp or sqrt while the vector op is operating on 3 vector ops. When doing a 4 vector op, the scalar could still operate on one of the components in parallel, but the output is defined by the PS's as being that of the DOT4.

How is this possible, to execute two operations simultaneously on the same fmad unit (for the vector 4 operations with a scalar)? A scalar and a vector op at the same time on the same fpu?
 
Luminescent said:
I guess you would just take the rcp of the rsq result for the square root :D.

It's better to take advantage of this simple mathematical fact:
x * (1 / sqrt(x)) = sqrt(x)
 
Luminescent said:
Sireric wrote:
Yes, the scalar op can operate on rcp or sqrt while the vector op is operating on 3 vector ops. When doing a 4 vector op, the scalar could still operate on one of the components in parallel, but the output is defined by the PS's as being that of the DOT4.

How is this possible, to execute two operations simultaneously on the same fmad unit (for the vector 4 operations with a scalar)? A scalar and a vector op at the same time on the same fpu?

There are 4 FMAD units, three reserved for vector units, 1 for scalar units. However, the scalar unit can kick in to give you 4 vector ops (dot4). Now, beyond the FMAD, the scalar unit has a bunch of other units, including all the exotic functions (inv, log, exp, etc...), which can operate in parallel with the MAD. Those don't share the FMAD since it could not meet our timing requirements mixed with LUTs. So, a simplified MAD was merged in to perform table lookups.
 