Uttar said:
demalion said:
Uttar,
Why are you sure the NV35 has no FX12 registers? It probably simply does not expose them by default for PS 2.0 shaders, i.e., it behaves properly where the NV3x parts before the NV35 did not. The NV3x-specific FX12 shaders (e.g., Dawn) and the PS 1.x performance figures seem to support this.
I'm talking *registers* here.
No member of the NV3x family supports FX12 registers. They support FX12 *instructions*, but not FX12 registers.
Uttar
I still don't see why you're sure that no NV3x has FX12 registers, and are therefore sure that the NV35 does not.
Why do FP16 and FX12 performance for Dawn differ at all? My theory is that there is a complete "PS 1.x" (hmm... maybe I should say "PS 1.3") set of temporary registers for FX12, separate from the floating point registers, in at least the NV3x parts before the NV35. On that theory, the FP16 versus FX12 difference for Dawn on the NV35 comes from losing some "free" MOV instructions at half precision, because the FP16 path cannot take advantage of them.
Note from the NV35 results:
PS 1.4 - Simple - 565.649109M pixels/sec
PS 2.0 - Simple - 422.224335M pixels/sec
Now, this could be because FX12 uses the floating point register space but can hold more values before running into performance issues (that would be fairly impressive pack/unpack flexibility). But that doesn't seem to fit the shader files in the fillrate tester, with their limited register usage (I'll include the ones I have at the bottom, in case they are out of date). What does seem to make sense, AFAICS, is that the mov at the end is free (and maybe some instructions in Dawn are too) when a different set of registers can be used... I'm proposing those are FX12 registers.
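To put a number on the gap between the two NV35 results quoted above, here is a quick calculation in Python using only those figures. The variable names are made up for illustration, and the script says nothing about why the gap exists.

```python
# Fill rates quoted above for the NV35 "Simple" shaders, in Mpixels/sec.
ps14_simple = 565.649109
ps20_simple = 422.224335

# How much faster the PS 1.4 path is than the PS 2.0 path.
speedup = ps14_simple / ps20_simple

# PS 2.0 throughput as a fraction of PS 1.4 throughput.
fraction = ps20_simple / ps14_simple

print(f"PS 1.4 / PS 2.0 = {speedup:.3f}")
print(f"PS 2.0 reaches {fraction:.1%} of the PS 1.4 rate")
```

The ratio comes out at roughly 1.34, so the PS 1.4 version is about a third faster on the same hardware. That is the difference the free-mov idea is meant to account for.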
The pretty much identical 5800 PS 1.4 and PS 2.0 results contradict this, but I still think the 5800 is using FX12 for PS 2.0, depending on driver version and maybe some sort of triggering mechanism. (FYI: the range of values in the simple shader files I have is 0 to 1... one test of my prior "trigger" theory would be to change that range in the PS 2.0 file and test it on the 5800.)
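The range test suggested above can be thought through with a small model before touching hardware. The Python sketch below assumes FX12 is a 12-bit fixed-point format with steps of 1/1024 and a range of roughly -2 to 2, which is how it is usually described but is not established in this thread. `to_fx12` and `simple_shader` are hypothetical helpers that mimic the two adds in the simple shader; they are not taken from the driver or the fillrate tester.

```python
FX12_STEP = 1.0 / 1024          # assumed: 10 fractional bits
FX12_MIN = -2.0                 # assumed lower bound
FX12_MAX = 2047 * FX12_STEP     # assumed upper bound, just under 2.0


def to_fx12(x):
    """Quantize x to the assumed FX12 grid and clamp it to the assumed range."""
    q = round(x / FX12_STEP) * FX12_STEP
    return max(FX12_MIN, min(FX12_MAX, q))


def simple_shader(c0, v0, v1, quantize=lambda x: x):
    """Model one channel of: add r0, c0, -v0 ; add r0, r0, v1."""
    c0, v0, v1 = quantize(c0), quantize(v0), quantize(v1)
    r0 = quantize(c0 - v0)      # add r0, c0, -v0
    r0 = quantize(r0 + v1)      # add r0, r0, v1
    return r0


# Constant inside 0..1, as in the shader files as posted.
in_float = simple_shader(0.3, 0.5, 0.25)
in_fx12 = simple_shader(0.3, 0.5, 0.25, quantize=to_fx12)

# Constant pushed outside the assumed FX12 range.
out_float = simple_shader(3.0, 0.5, 0.25)
out_fx12 = simple_shader(3.0, 0.5, 0.25, quantize=to_fx12)

print("in range :", in_float, in_fx12)
print("out range:", out_float, out_fx12)
```

With the constant in 0 to 1, the two paths agree closely, so output alone could not tell them apart. With a constant such as 3.0, the fixed-point path clamps and the results diverge visibly. If the assumed format is right, that is why pushing the PS 2.0 constants out of range would make a silent FX12 substitution detectable.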
I'm not proposing this theory as proven; I'm just not seeing why you are sure it is false, or what you are proposing as an alternate explanation (hence the question). I am proposing it as not disproven, so addressing it from that angle would probably be most direct.
To state it more clearly, so it is easier to disprove if you have thoughts in that regard: what seems more likely, AFAICS at the moment, is that the FX12 registers are a separate set that is fully accessible throughout the pipeline. I'm not aware of any test where FX12 performance degrades in the same way floating point performance does when the same "register space" limit is exceeded. Pocketmoon's benchmarks run on an NV35 would probably provide some insight here, though there are probably relevant benchmark results already mentioned somewhere.
If there aren't different registers, where is the opportunity for performance increase coming from? There are other possibilities I can think of, but this one seems to fit right now.
Note that you also seem to be necessarily arguing against the idea that, when the NV35 "fixed" the prior NV3x designs, it enabled the existing fp32 units to output more than FX12, since that idea seems to depend on FX12 registers being a limitation in the prior NV3x designs. However, if you believe fp32 units were added rather than reallocated or slightly expanded, there is no conflict.
My fillrate tester shader files:
Code:
ps_1_1
def c0, 0.3f, 0.7f, 0.2f, 0.4f
add r0, c0, -v0
add r0, r0, v1
Code:
ps_1_4
def c0, 0.3f, 0.7f, 0.2f, 0.4f
texcrd r1.xyz, t0
texcrd r2.xyz, t1
add r3.xyz, c0, -r1
add r3.xyz, r3, r2
phase
mov r0.rgb, r3
+mov r0.a, c0.a
Code:
ps_2_0
dcl v0
dcl v1
def c0, 0.3f, 0.7f, 0.2f, 0.4f
add r0, c0, -v0
add r0, r0, v1
mov oC0, r0
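As a rough aid when comparing the three files, this Python sketch counts the executing instructions in each one. `count_ops` is a hypothetical helper: it skips the version line, `def`, `dcl`, and `phase`, and it counts a co-issued `+mov` as its own instruction. A static count like this says nothing about how many cycles the hardware actually spends.

```python
PS_1_1 = """ps_1_1
def c0, 0.3f, 0.7f, 0.2f, 0.4f
add r0, c0, -v0
add r0, r0, v1"""

PS_1_4 = """ps_1_4
def c0, 0.3f, 0.7f, 0.2f, 0.4f
texcrd r1.xyz, t0
texcrd r2.xyz, t1
add r3.xyz, c0, -r1
add r3.xyz, r3, r2
phase
mov r0.rgb, r3
+mov r0.a, c0.a"""

PS_2_0 = """ps_2_0
dcl v0
dcl v1
def c0, 0.3f, 0.7f, 0.2f, 0.4f
add r0, c0, -v0
add r0, r0, v1
mov oC0, r0"""

# Declarations and markers that do not execute per pixel.
NON_EXECUTING = {"def", "dcl", "phase"}


def count_ops(source):
    """Return a dict of opcode -> count, ignoring the version line and declarations."""
    counts = {}
    for line in source.splitlines()[1:]:      # first line is the version token
        line = line.strip()
        if not line:
            continue
        opcode = line.lstrip("+").split()[0]  # "+mov" is a co-issued mov
        if opcode in NON_EXECUTING:
            continue
        counts[opcode] = counts.get(opcode, 0) + 1
    return counts


for name, src in (("ps_1_1", PS_1_1), ("ps_1_4", PS_1_4), ("ps_2_0", PS_2_0)):
    print(name, count_ops(src))
```

Both the PS 1.4 and PS 2.0 files end in a mov, which is the instruction the free-mov argument is about; the PS 1.1 file has only the two adds.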