Jawed: Here's a little shader I made for you that runs at ~1200MPixels/s on my G80...
Please note that this is with the 101.41 drivers, because those are the only ones exposing the MUL (at least under OpenGL; I didn't check some of the 158.xx series under D3D10, but I'm pretty sure Rys did) and I'm not going to go through three zillion reboots just now, hehe.
struct PS_Output {
    float4 color : COLOR;
};

PS_Output ps(uniform sampler2D mytexture : TEX0)
{
    PS_Output OUT;
    float4 texCoord = 0.1f;
    float scalar = tex2D(mytexture, texCoord.xy).x;

    // 128 x MADD
    scalar = scalar*scalar+scalar;
    scalar = scalar*scalar+scalar;
    scalar = scalar*scalar+scalar;
    ...

    // 32 x LOG2
    scalar = log2(scalar);
    scalar = log2(scalar);
    scalar = log2(scalar);
    ...

    OUT.color = scalar.xxxx;
    return OUT;
}
Now, interestingly, with sin() instead of log2(), I get ~1000MPixels/s. And removing the SFs completely, I get ~1300MPixels/s (96% of peak). Without the MADDs, I get ~1340MPixels/s (99% of peak) with either sin() or log2().
So, clearly, there is no "bubble" or anything similar happening for LOG2: those instructions come nearly for "free", since the SFU pipeline (which is decoupled, just like the TMUs) was empty in the MADD-only case. In the sin() case, however, performance goes down. To see whether that was caused by a "bubble", I retried with 32x(4xMADD+1xSF), with every instruction dependent on the previous one. I got ~1150MPixels/s with LOG2 (lower!) and, with SIN, ~1000MPixels/s again. Doing this 16x with a Vec2 instead (-> independent instructions...) gave practically identical scores.
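As a sanity check on those percentages, here's the back-of-the-envelope arithmetic. This sketch assumes the commonly quoted G80 configuration of 128 scalar ALUs at a 1.35GHz shader clock; those figures are my assumption for the calculation, not something measured by the shader itself:

```python
# Theoretical peak for the 128-MADD shader, assuming a G80 with
# 128 scalar MADD ALUs running at a 1.35GHz shader clock (assumed, not measured).
ALUS = 128
SHADER_CLOCK = 1.35e9  # Hz

madds_per_pixel = 128
peak_mpixels = ALUS * SHADER_CLOCK / madds_per_pixel / 1e6
print(round(peak_mpixels))               # 1350 MPixels/s theoretical peak

# The measured figures from the post, as fractions of that peak:
print(round(100 * 1300 / peak_mpixels))  # 96 -> "96% of peak" (no SFs)
print(round(100 * 1340 / peak_mpixels))  # 99 -> "99% of peak" (no MADDs)
```

Note that the "no MADDs" case hitting the same ~1350 MPixels/s ceiling is consistent with the SFUs issuing at a quarter of ALU rate (32 SF ops/clock for 32 SF instructions per pixel), if that assumption holds.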
Part, if not all, of the difference between LOG2 and SIN can most likely be explained by the fact that the latter abuses the MADD unit to put the value in range, as explained in the following diagram from Stuart Oberman's FPU patent:
http://www.notforidiots.com/G80/ALUOps.png. Also, as per Bob's suggestion (we had already tested something quite similar back in November, I think, or at least reached the same conclusion another way):
struct PS_Output {
    float4 color : COLOR;
};

PS_Output ps(uniform sampler2D mytexture : TEX0,
             float4 texCoord1 : TEXCOORD0,
             ...
             float4 texCoord8 : TEXCOORD11)
{
    PS_Output OUT;
    float4 texCoord = 0.1f;
    float scalar = tex2D(mytexture, texCoord.xy).x;

    // 32 x MADD with Attribute Interpolation
    scalar = scalar*scalar+texCoord1.x;
    scalar = scalar*scalar+texCoord1.y;
    scalar = scalar*scalar+texCoord1.z;
    scalar = scalar*scalar+texCoord1.w;
    ...
    scalar = scalar*scalar+texCoord8.x;
    scalar = scalar*scalar+texCoord8.y;
    scalar = scalar*scalar+texCoord8.z;
    scalar = scalar*scalar+texCoord8.w;

    OUT.color = scalar.xxxx;
    return OUT;
}
This runs at ~4000MPixels/s, which is 74% of peak. I'd have expected higher, but perhaps the fact that there *only* are dependent instructions (each needing results from BOTH the SFU and the ALU!) exposes a latency-related bottleneck: there aren't enough warps/threads in flight to hide the ALU's and SFU's own latencies. That's just a theory, though. It's still quite a bit higher than the theoretical 20% peak of R600 anyway, not that such a worst-case scenario matters much...
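Same sanity check for this variant, under the same assumed configuration (128 scalar ALUs at 1.35GHz; assumed figures, not measurements):

```python
# Theoretical peak for the 32-MADD-with-interpolation shader,
# again assuming 128 scalar ALUs at a 1.35GHz shader clock.
ALUS = 128
SHADER_CLOCK = 1.35e9  # Hz

madds_per_pixel = 32
peak_mpixels = ALUS * SHADER_CLOCK / madds_per_pixel / 1e6
print(round(peak_mpixels))               # 5400 MPixels/s theoretical peak
print(round(100 * 4000 / peak_mpixels))  # 74 -> matches the "74% of peak" figure
```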
And finally, the RCP for perspective correction is indeed done once at the beginning of the shader. This can easily be verified by writing an SFU-limited program that uses one interpolated attribute and another, otherwise identical program that does not. The former will take 5 extra clocks (4 for the RCP, 1 for the interpolation).
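The verification logic above can be written down as a tiny model. Everything here is hypothetical bookkeeping to illustrate where the 5-clock delta comes from, not measured data; the per-SF-op issue cost of 1 clock is an assumption:

```python
# Hypothetical per-pixel SFU clock model for the two test programs.
RCP_CLOCKS = 4     # one-time RCP for perspective correction (per the post)
INTERP_CLOCKS = 1  # one attribute interpolation on the SFU (per the post)

def sfu_clocks(n_sf_ops, uses_attribute):
    """SFU clocks for a program with n_sf_ops special-function instructions,
    assuming 1 issue clock per SF op (assumption) plus the one-time
    RCP + interpolation cost if any interpolated attribute is read."""
    extra = (RCP_CLOCKS + INTERP_CLOCKS) if uses_attribute else 0
    return n_sf_ops + extra

# The attribute-using program takes exactly 5 extra clocks:
print(sfu_clocks(32, True) - sfu_clocks(32, False))  # 5
```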
As for the MUL, that information will have to be published at another date, probably in an article, so stay tuned... There still are some things I'd like to understand better first though, sigh...
EDIT: Added info & link to the patent diagram showcasing the FPU abuse for SIN.