Yes and no.
There's no way to specify combiner operations inside a fragment program, so technically no. However, the number of FX12 operations you can perform between each FP32 operation exactly matches what you could do with two combiners (without the extra scale/bias tricks or dual issue...
When doing just Z/stencil, there's no situation in which any of the fragment processing (FP or FX) can be used. That doesn't mean that the FP/FX hardware isn't used, but the pipeline was probably designed around the 4-pixel case, with 8Z being an addition rather than a main feature...
More numbers below. Looking at them more closely, there seems to be more structure to the slowdown than just a penalty for every two registers. I've grouped the numbers a bit to show this.
32.12 rounds 8.48 cycle/fragm: 1 regs, 32 instr
32.05 rounds 8.46 cycle/fragm: 2 regs, 32...
Yes, you are right. Integer units can be bypassed, or the float data just flows through unchanged.
Program length doesn't seem to matter. The program is probably loaded from memory, and long programs would only be slower in bandwidth-limited cases. I also tried texture fetches and they went the same...
The performance numbers don't really tell how many units are involved, just how many results we get each cycle. There could be four texture units that can do one trilinear or two bilinear. Or there could be 8 texture units, two connected to each fragment processor. The actual hardware wiring is...
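To make the ambiguity concrete, here is a minimal Python sketch comparing the two wirings mentioned above. It is only arithmetic under stated assumptions: the function name `results_per_clock` is made up for the example, the per-unit rates (2 bilinear per clock for the 4-unit case, 1 for the 8-unit case) are assumed rather than measured, and a trilinear result is counted as two bilinear samples.

```python
def results_per_clock(units, bilinear_per_unit):
    """Return (bilinear, trilinear) results per clock for a hypothetical wiring.

    Assumption: one trilinear result costs two bilinear samples
    (one from each of two adjacent mip levels).
    """
    bilinear = units * bilinear_per_unit
    trilinear = bilinear // 2
    return bilinear, trilinear


# Hypothetical wiring A: 4 units, each doing one trilinear or two bilinear.
config_a = results_per_clock(units=4, bilinear_per_unit=2)

# Hypothetical wiring B: 8 simpler units, two per fragment processor.
config_b = results_per_clock(units=8, bilinear_per_unit=1)

# Both wirings give the same per-clock results, so a throughput
# measurement alone cannot tell them apart.
print(config_a, config_b, config_a == config_b)
```

Under these assumptions both configurations produce identical per-clock numbers, which is why throughput tests alone cannot settle the unit count.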
After reading these comments and doing further tests based on them, I wrote a summary with more details and posted it as a new thread (as it only concerns NV30 and contains test results and not just speculation).
link: http://www.beyond3d.com/forum/viewtopic.php?t=5150
According to further tests, this is what the NV30 fragment shader architecture looks like:
FLOAT/TEXTURE-UNIT (handles FP16, FP32 and texture)
|
INTEGER-UNIT (handles FX12, 1-2 ops in parallel)
|
INTEGER-UNIT (handles FX12, 1-2 ops in parallel)
|
(loopback or output)
There are 4...
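As a rough illustration of the layout above, here is a minimal Python sketch of a toy scheduling model. It assumes (the tests do not confirm this) that one pass through the pipeline can issue at most one float/texture op followed by at most one FX12 op per integer unit, strictly in program order. The function name `count_passes` and the op labels "fp" and "fx" are invented for the example, and the model ignores the 1-2 parallel ops per integer unit as well as any register-count effects.

```python
def count_passes(ops, fx_per_pass=2):
    """Estimate pipeline passes for a list of ops in program order.

    ops: sequence of "fp" (float or texture op) or "fx" (FX12 op).
    fx_per_pass: assumed number of FX12 ops per pass (one per integer unit).
    """
    for op in ops:
        if op not in ("fp", "fx"):
            raise ValueError(f"unknown op type: {op!r}")

    passes = 0
    i = 0
    while i < len(ops):
        passes += 1
        # The float/texture unit sits first in the pipeline.
        if ops[i] == "fp":
            i += 1
        # Then up to fx_per_pass FX12 ops flow through the integer units.
        taken = 0
        while i < len(ops) and ops[i] == "fx" and taken < fx_per_pass:
            i += 1
            taken += 1
    return passes


# One float op followed by two FX12 ops fits in a single pass in this model.
print(count_passes(["fp", "fx", "fx"]))
# Two float ops back to back need two passes, leaving the integer units idle.
print(count_passes(["fp", "fp"]))
```

In this toy model, interleaving up to two FX12 ops after each float op costs no extra passes, which is the behaviour the diagram suggests.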
Actually, GF4 can do 4 multiply/dot or 2 add/mux ops/clock (for both RGB and alpha, equivalent to one RGBA vector). It can also do scale/bias/expand etc. operations on inputs/outputs, which are not supported in fragment programs.
Now that you mention it, this is actually very close to the FX12...
As for architectural speculation on the NV30 fragment shader:
Earlier GeForces had a TEXTURE SHADER followed by REGISTER COMBINERS. The texture shader is really very much like a simple fragment shader unit, as it can fetch textures and do some limited floating point operations.
It would make...
I have some results that might help you.
I've been testing NV30 (5800 Ultra) fragment program performance with driver 43.45 (results are the same as for 42.92 with which I started). Testing is done with OpenGL NV_fragment_program.
I have tested performance for all instructions with FP32...