Some initial Cg NV30 vs ARBFP1 vs PS2x thoughts...

pocketmoon_

I have 5 shaders in both full and partial precision versions:
1) A fake noise shader - no samplers, lots of nasty maths, requires full precision.
2) Simple 2x sampler (same texture), add and output
3) 5x sampler - cascaded dependent reads
4) A median filter - 5 samples, LOTS of 'conditional branching'
5) A bilinear filter - 4 samples and some maths.
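
For readers unfamiliar with shader 5, here is a minimal sketch of what "4 samples and some maths" usually means for a bilinear filter. It is plain Python rather than Cg so it can be run on the CPU, and it is not the shader used for the timings. The names `lerp`, `bilinear`, `t00`..`t11`, `fx` and `fy` are hypothetical: four neighbouring texel values and the fractional position between them.

```python
def lerp(a, b, t):
    """Linear interpolation between a and b by weight t in [0, 1]."""
    return a + (b - a) * t


def bilinear(t00, t10, t01, t11, fx, fy):
    """Blend four neighbouring samples (assumed layout: t00 top-left,
    t10 top-right, t01 bottom-left, t11 bottom-right)."""
    top = lerp(t00, t10, fx)        # blend along x on the top row
    bottom = lerp(t01, t11, fx)     # blend along x on the bottom row
    return lerp(top, bottom, fy)    # blend the two rows along y


corner = bilinear(1.0, 2.0, 3.0, 4.0, 0.0, 0.0)   # exactly on the top-left texel
centre = bilinear(1.0, 2.0, 3.0, 4.0, 0.5, 0.5)   # halfway between all four
```

The three interpolations are the "some maths"; the four arguments correspond to the four texture samples.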

I have used the NV30 and ARBFP1 profiles for OpenGL and PS_2_x for DirectX9, tested on a Quadro FX 2000.

Results.
1) PS2X runs fastest but all are close. Only the NV30 profile with FP displays correct results (FP needed for this one).
2) NV30 FP is about 10% faster than PS2X and 20% faster than ARBFP1. PP was slowest of them all!
3) Similar results to 2.
4) NV30 wins. Both FP and PP are >6x faster than PS2X and >3x faster than ARBFP1.
5) NV30 is 30% faster than PS2X.
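
Result 1 says the noise shader only looks right at full precision. A minimal sketch of why that can happen follows. It assumes the common "sin times a large constant, keep the fraction" style of fake noise, which is a guess, since the actual shader is not shown. It is plain Python; `to_half`, `to_single`, `fake_noise` and the constant 43758.5453 are hypothetical names and values chosen for illustration. Rounding goes through the standard `struct` module's 16-bit (`e`) and 32-bit (`f`) float formats.

```python
import math
import struct


def to_half(x):
    """Round x to the nearest IEEE 16-bit float (like a PP 'half')."""
    return struct.unpack('<e', struct.pack('<e', x))[0]


def to_single(x):
    """Round x to the nearest IEEE 32-bit float (like an FP 'float')."""
    return struct.unpack('<f', struct.pack('<f', x))[0]


def fake_noise(x, rnd):
    """Hash-style noise: frac(sin(x) * K), rounding after every step.
    K is an assumed constant, not taken from the original shader."""
    s = rnd(math.sin(rnd(x)))
    v = rnd(s * rnd(43758.5453))
    return rnd(v - math.floor(v))   # fractional part


noise_pp = fake_noise(1.0, to_half)     # partial precision
noise_fp = fake_noise(1.0, to_single)   # full precision
```

At half precision, values above 2048 can only land on whole numbers, so the fractional part this kind of noise depends on is rounded away and `noise_pp` comes out as zero.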

What does this tell us ? God knows :)

If your shader is mostly texture samplers then using PP appears to slow things down ?!

With lots of conditionals the NV30 profile is very very quick.
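
For context on why the median filter is so heavy on conditionals: the median of 5 samples can be written either with if/else comparisons or purely with min/max, and a compiler is free to turn one into the other. Below is a minimal sketch of the min/max form, in plain Python rather than Cg. The function names are hypothetical and this is not the shader that was timed.

```python
def median3(a, b, c):
    """Median of three values using only min/max."""
    return max(min(a, b), min(max(a, b), c))


def median5(a, b, c, d, e):
    """Median of five values using only min/max.
    f and g are the two middle values of a, b, c, d;
    the median of all five is then the median of e, f and g."""
    f = max(min(a, b), min(c, d))   # drops the smallest of the first four
    g = min(max(a, b), max(c, d))   # drops the largest of the first four
    return median3(e, f, g)


m = median5(0.7, 0.1, 0.9, 0.3, 0.5)
```

Each min or max is one compare-and-select, so how a profile maps these onto its instruction set can change the instruction count a lot.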

Some optimisations work better on floats e.g.

collapsing
a = (b.x + b.y + b.z) * 0.20
into
a = dot(b.xyz, float3(.2,.2,.2))

gives a speed up for floats but not halfs ?!
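
A quick CPU-side check that the two forms compute the same value up to rounding. This is a sketch in plain Python, not Cg; `scaled_sum` and `dot3` are hypothetical helper names standing in for the two shader expressions.

```python
def scaled_sum(b, k=0.2):
    """(b.x + b.y + b.z) * k : two adds, then one multiply."""
    return (b[0] + b[1] + b[2]) * k


def dot3(u, v):
    """dot(u.xyz, v.xyz) : typically a single instruction on shader hardware."""
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


b = (0.25, 0.5, 0.75)
a_sum = scaled_sum(b)
a_dot = dot3(b, (0.2, 0.2, 0.2))
```

The results agree mathematically, but the operations are rounded in a different order, so the last bits can differ.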

Of course a LOT depends on the compiled shader code produced by each Cg profile. E.g. the NV30 profile for shader 4 (the median filter) is 104 instructions, but for PS2X is 149 instructions.

It will be interesting to see the impact the DirectX PP fix will have in the future.
 
pocketmoon_ said:
I have 5 shaders in both full and partial precision versions:
1) A fake noise shader - no samplers, lots of nasty maths, requires full precision.
2) Simple 2x sampler (same texture), add and output
3) 5x sampler - cascaded dependent reads
4) A median filter - 5 samples, LOTS of 'conditional branching'
5) A bilinear filter - 4 samples and some maths.

I have used the NV30 and ARBFP1 profiles for OpenGL and PS_2_x for DirectX9, tested on a Quadro FX 2000.

This is Cg optimized output and not DX 9 HLSL optimized output for the PS_2_x tests, then? I'm not clear on what your wording means here. Whichever it is, it would be interesting to compare the alternate as well.

Results.
1) PS2X runs fastest but all are close. Only the NV30 profile with FP displays correct results (FP needed for this one).

So...ARBFP1 seems to be "precision optimized" now? What version drivers?

What does that "1" in that name signify anyways?

2) NV30 FP is about 10% faster than PS2X and 20% faster than ARBFP1. PP was slowest of them all!
3) Similar results to 2.

Hmm...if I understand comments by certain people, they have suggested a processing unit does duty as either fp32 processor or texture sampler. The percentage difference over PS 2.x and ARBFP seems likely to be representative of scheduling overhead/failure to optimize fully.

In such a case, the NV30 PP problems seem likely to be associated with an optimization conflict in Cg. Is the overhead of switching the processing unit to fp16 negating the benefits of fp16 processing speed? It would be nice to see the relative performance (in fps or instruction count, maybe even "proxel" fillrate! ;) ) compared to the other shaders, to get an idea of how heavily performance is impacted by the presumable deadlock due to texture sampling and being unable to schedule around it.

I'm assuming an fp texture?

4) NV30 wins. Both FP and PP are >6x faster than PS2X and >3x faster than ARBFP1.

Independent branch handling for each of the two pixels in each "proxel" pipeline makes sense...is that the cause? Without a frame of reference to relative instruction execution speed I don't know whether the difference is due to the NV30 path operating at a higher level of performance relative to other shaders (which would indicate independent branching management for each pixel in a proxel pipeline) and/or whether the other paths are being penalized for inadequate branching instructions (does ARB have such?).
Might be interesting to compare the speed for integer processing in the branching shaders (again, I'm assuming fp values are being specifically used throughout the shaders).

5) NV30 is 30% faster than PS2X.

fp16 processing speed allowing it to make up a bit for the theoretical "switch" penalty?

That's enough guessing for me for today I think. :p