Dawn FP32 figures

PS 2.0 - Longer - 169.349930M pixels/sec
PS 2.0 PP - Longer - 338.089874M pixels/sec
PS 2.0 - Longer 4 Registers - 181.126297M pixels/sec
PS 2.0 PP - Longer 4 Registers - 421.864349M pixels/sec
Now this thing is actually interesting...??
 
Well, the 5800 just looks like it is using FX12...isn't that established?

As for the 5900 performance improvements for some tests...that does seem interesting. I think we need to compare with the same driver, though. Couldn't find any same version tests in the forum, though. Maybe Wavey?
 
demalion said:
Well, the 5800 just looks like it is using FX12...isn't that established?
Using FX12 where? With 44.10 drivers it performs somewhat as expected: fx12 in ps_1_1 and fp16 in ps_1_4 and ps_2_0...
 
Uttar,
Why are you sure the NV35 has no FX12 registers? It probably just does not expose them by default for PS 2.0 shaders...i.e., it behaves properly where the NV3x parts before the NV35 did not. The NV3x specific FX12 shaders (e.g., Dawn) and PS 1.x performance figures seem to support this.
 
MDolenc said:
demalion said:
Well, the 5800 just looks like it is using FX12...isn't that established?
Using FX12 where? With 44.10 drivers it performs somewhat as expected: fx12 in ps_1_1 and fp16 in ps_1_4 and ps_2_0...

For PS 2.0. The 5800 is using fp16 in ps 1.4 and 2.0? That's news to me...what did I miss? Hmm...well, perhaps I'm confused with 3dmark behavior?
 
Hmm...OK, I guess I'm a bit fuzzy headed. FP16 for PS 2.0 is the established 5800 behavior for default shaders, except for targeted benchmarks. FP16 for PS 1.4 is still completely contrary to anything that makes sense to me: FX12 usage for PS 1.4 is actually "legal" AFAIK (just compares unfavorably to the 8500/9000), and with a dependence on that for performance, why would nVidia use anything else?

With that bit of fuzzy-headedness out of the way, it seems what I should have been wondering is whether this benchmark has been targeted because of its popularity in these forums or whether there is an optimization triggering mechanism (like limited range constants) that causes FX12 to be used. Not the only possibilities, but ones that at least make more sense.

For the first: hmm...well, if they tried to anticipate benchmark usage for reviews for FUD, they certainly seem to have the requisite expertise in targeting. Maybe more details from whatever Unwinder is looking into would be an answer.

For the second: well, maybe that explains some of the results...I guess more examination of the shaders might be helpful.
 
demalion said:
Uttar,
Why are you sure the NV35 has no FX12 registers? It probably just does not expose them by default for PS 2.0 shaders...i.e., it behaves properly where the NV3x parts before the NV35 did not. The NV3x specific FX12 shaders (e.g., Dawn) and PS 1.x performance figures seem to support this.

I'm talking *registers* here.
No member of the NV3x family supports FX12 registers. They support FX12 *instructions*, but not FX12 registers.


Uttar
 
Uttar said:
demalion said:
Uttar,
Why are you sure the NV35 has no FX12 registers? It probably just does not expose them by default for PS 2.0 shaders...i.e., it behaves properly where the NV3x parts before the NV35 did not. The NV3x specific FX12 shaders (e.g., Dawn) and PS 1.x performance figures seem to support this.

I'm talking *registers* here.
No member of the NV3x family supports FX12 registers. They support FX12 *instructions*, but not FX12 registers.


Uttar

I still don't see why you're sure that no NV3x has FX12 registers, and are therefore sure that the NV35 does not.

Why do FP16 and FX12 performance for Dawn differ at all? My theory is that there is a complete "PS 1.x" (Hmm...maybe I should say "PS 1.3") set of temporary registers for FX12, separate from the floating point registers, at least in the parts before the NV35, and that the fp16 versus fx12 results for Dawn (for the NV35) are associated with some loss of "free" MOV instructions when using half precision, due to half precision not being able to take advantage of them.

Note from the NV35 results:

PS 1.4 - Simple - 565.649109M pixels/sec
PS 2.0 - Simple - 422.224335M pixels/sec

Now, this could be because the FX12 uses floating point register space, but can use more values before running into performance issues (that would be fairly impressive pack/unpack flexibility). But that doesn't seem to make sense with the shader files in the fillrate tester (I'll include what I have in mind at the bottom, in case it is out of date) with its limited register usage. What does seem to make sense, AFAICS, is that the mov at the end is free (and maybe some instructions in Dawn) when a different set of registers can be utilized...I'm proposing those are FX12 registers.
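For reference, the ratios behind the figures quoted in this thread can be worked out directly. This is only a minimal sketch using numbers already posted here; the dictionary and function names are made up for illustration, and the thread does not say which card or driver produced the "Longer" figures.

```python
# Throughput figures quoted in this thread, in millions of pixels per second.
nv35_simple = {"PS 1.4": 565.649109, "PS 2.0": 422.224335}

# "Longer" tests quoted at the top of the thread (card and driver not stated there).
longer_tests = {
    "Longer": {"full": 169.349930, "pp": 338.089874},
    "Longer 4 Registers": {"full": 181.126297, "pp": 421.864349},
}


def speedup(fast, slow):
    """Return how many times faster `fast` is than `slow`."""
    return fast / slow


# PS 1.4 versus PS 2.0 on the NV35 "Simple" test.
ps14_vs_ps20 = speedup(nv35_simple["PS 1.4"], nv35_simple["PS 2.0"])

# Partial precision (PP) versus full precision for each "Longer" test.
pp_gain = {name: speedup(r["pp"], r["full"]) for name, r in longer_tests.items()}

print(f"NV35 Simple, PS 1.4 vs PS 2.0: {ps14_vs_ps20:.2f}x")
for name, gain in pp_gain.items():
    print(f"{name}, PP vs full precision: {gain:.2f}x")
```

That works out to roughly 1.34x for PS 1.4 over PS 2.0 on the NV35 Simple test, and roughly 2.0x and 2.33x for PP over full precision in the two "Longer" tests. The ratios alone do not say which explanation is right; they only put a number on the gap being discussed.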

The pretty much identical 5800 1.4 and 2.0 results contradict this, but I do still think the 5800 is using FX12 for PS 2.0, depending on driver version and maybe some sort of triggering mechanism. (FYI: the range for the simple shader files I have is 0 to 1...one test of my prior "trigger" theory would be to change that for the PS 2.0 file and test it on the 5800).

I'm not proposing this theory as proven, I'm just not seeing why you are sure this theory is false, or what you are proposing as an alternate explanation (hence the question). I am proposing it as not disproven, so addressing it from that angle would probably be most direct.

To state it more clearly, so that it is easier to disprove if you have thoughts in that regard: what seems more likely, AFAICS at the moment, is that a full, separate set of FX12 registers is accessible throughout the pipeline. I'm not aware of any test where FX12 performance degrades the way floating point performance does when the same "register space" limit is exceeded. Pocketmoon's benchmarks run on an NV35 would probably provide some insight in that regard, though there are probably relevant benchmark indications already mentioned somewhere.

If there aren't different registers, where is the opportunity for performance increase coming from? There are other possibilities I can think of, but this one seems to fit right now.

Note, you also seem to necessarily be arguing against the idea that when the NV35 "fixed" the prior NV3x designs, it enabled existing fp32 units to output more than FX12, since that theory seems to depend on FX12 registers being a limitation for the prior NV3x designs. However, if you believe fp32 units were added and not reallocated or expanded slightly, this isn't a conflict.

My fillrate tester shader files:

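Code:
ps_1_1

; constant c0 = (0.3, 0.7, 0.2, 0.4)
def c0, 0.3f, 0.7f, 0.2f, 0.4f

; r0 = c0 - v0 + v1 (v0 and v1 are the interpolated color inputs)
add r0, c0, -v0
add r0, r0, v1

Code:
ps_1_4

def c0, 0.3f, 0.7f, 0.2f, 0.4f

; phase 1: copy texture coordinates t0 and t1 into r1 and r2
texcrd r1.xyz, t0
texcrd r2.xyz, t1

; r3 = c0 - r1 + r2
add r3.xyz, c0, -r1
add r3.xyz, r3, r2

phase

; phase 2: write the result to the output register r0
; (the "+" co-issues the alpha mov with the rgb mov)
mov r0.rgb, r3
+mov r0.a, c0.a

Code:
ps_2_0

; declare the two color inputs
dcl v0
dcl v1

def c0, 0.3f, 0.7f, 0.2f, 0.4f

; r0 = c0 - v0 + v1, then copy r0 to the color output
add r0, c0, -v0
add r0, r0, v1
mov oC0, r0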
 
demalion said:
Hmm...OK, I guess I'm a bit fuzzy headed. FP16 for PS 2.0 is the established 5800 behavior for default shaders, except for targeted benchmarks. FP16 for PS 1.4 is still completely contrary to anything that makes sense to me: FX12 usage for PS 1.4 is actually "legal" AFAIK (just compares unfavorably to the 8500/9000), and with a dependence on that for performance, why would nVidia use anything else?

because nVidia's FX12 doesn't sport the range required by PS 1.4
 
demalion said:
As for the 5900 performance improvements for some tests...that does seem interesting. I think we need to compare with the same driver, though. Couldn't find any same version tests in the forum, though. Maybe Wavey?

I can post results from 44.03 and 44.10 later on. I also have a NV30 so I can post result from the same rig using the same driver.
 
Ante P said:
demalion said:
Hmm...OK, I guess I'm a bit fuzzy headed. FP16 for PS 2.0 is the established 5800 behavior for default shaders, except for targeted benchmarks. FP16 for PS 1.4 is still completely contrary to anything that makes sense to me: FX12 usage for PS 1.4 is actually "legal" AFAIK (just compares unfavorably to the 8500/9000), and with a dependence on that for performance, why would nVidia use anything else?

because nVidia's FX12 doesn't sport the range required by PS 1.4

What do you mean? I thought the FX cards simply reported range support from -2 to 2 for PS 1.4? That's less than the R200 and RV250/280 report, and that just seems to be because they support better than FX12. Has something changed?
 
demalion said:
What do you mean? I thought the FX cards simply reported range support from -2 to 2 for PS 1.4? That's less than the R200 and RV250/280 report, and that just seems to be because they support better than FX12. Has something changed?
The texture registers/calculations must be at least [-8, 8], but that's the only requirement. This can be done by doing part of the first phase with FP16 and the second phase with FX12.
 
demalion said:
Ante P said:
demalion said:
Hmm...OK, I guess I'm a bit fuzzy headed. FP16 for PS 2.0 is the established 5800 behavior for default shaders, except for targeted benchmarks. FP16 for PS 1.4 is still completely contrary to anything that makes sense to me: FX12 usage for PS 1.4 is actually "legal" AFAIK (just compares unfavorably to the 8500/9000), and with a dependence on that for performance, why would nVidia use anything else?

because nVidia's FX12 doesn't sport the range required by PS 1.4

What do you mean? I thought the FX cards simply reported range support from -2 to 2 for PS 1.4? That's less than the R200 and RV250/280 report, and that just seems to be because they support better than FX12. Has something changed?

uhmm?
R200 PS1.4 has a range from -8 to 8
nV3x clamps FX12 range to -2 to 2

or perhaps I'm just a bit confused today =)
 
Ante P said:
...
uhmm?
R200 PS1.4 has a range from -8 to 8
nV3x clamps FX12 range to -2 to 2

or perhaps I'm just a bit confused today =)

OK, both you and Xmas seem to have misunderstood me, or I'm misunderstanding what Xmas is saying.

I'm going to restate my point. Hopefully that will show why I think your statement is wrong, and why I think Xmas's response misreads the text he quoted:

R200 offers -8 to 8 range for PS 1.4, due to offering better than FX12 for shader usage in that shader model.
GF FX offers -2 to 2 range for PS 1.4, due to offering FX12 for shader usage (-2 to +2 with 10 bits of precision) in that shader model.
Both are legal.
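To put numbers on the "-2 to +2 with 10 bits of precision" description, here is a minimal sketch of a 12-bit signed fixed-point format. It assumes a two's-complement layout with 1 sign bit, 1 integer bit and 10 fractional bits; that layout is an assumption for illustration, not something confirmed for the actual hardware, and the function names are hypothetical.

```python
def fixed_point_range(int_bits, frac_bits):
    """Return (lowest, highest, step) for a signed two's-complement
    fixed-point format with 1 sign bit, int_bits integer bits and
    frac_bits fractional bits."""
    step = 2.0 ** -frac_bits
    lowest = -(2.0 ** int_bits)
    highest = 2.0 ** int_bits - step
    return lowest, highest, step


def quantize(value, int_bits, frac_bits):
    """Clamp value to the representable range, then round to the nearest step."""
    lowest, highest, step = fixed_point_range(int_bits, frac_bits)
    clamped = min(max(value, lowest), highest)
    return round(clamped / step) * step


# Assumed FX12 layout: 1 sign + 1 integer + 10 fractional bits = 12 bits.
fx12_low, fx12_high, fx12_step = fixed_point_range(1, 10)
print(fx12_low, fx12_high, fx12_step)

# A shader constant such as 0.3 lands on the nearest 1/1024 step.
print(quantize(0.3, 1, 10))

# A value outside the range is clamped to just under 2.
print(quantize(8.0, 1, 10))
```

Under that assumption the largest representable value is 2 - 1/1024, so "-2 to 2" is a slight rounding of the positive end.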

Ante P: From the above, it seems legal to me for the NV3x to use FX12 for PS 1.4, yet you said "because nVidia's FX12 doesn't sport the range required by PS 1.4".

Xmas: I'm not contradicting anything you said in your reply, at least as far as I understood you. However, I wasn't aware that fp16 could be used for texture coordinates even in PS 1.4...does that supposition invalidate something I stated? I guess I'll go look into texture sizes and filtering options for PS 1.4.
 
From my limited understanding, I think that PS1.4 requires a range of -8 to 8, does it not?
Thus nVidia has to use FP16 for PS1.4.
 
Well, unless NVidia can change the way FX12 works, that is. 12 bits is enough to allow for 3 bits of range, 1 bit for sign, and 8 bits for mantissa -- which is exactly what PS1.4 needs. But, yeah, if it can't then FP16 is what they'd have to use to meet the spec.

EDIT: Actually, it should be easy to test this. Just assign 8 to a constant and move it into a register. Then multiply it by 0.125 and output it as the final colour. If the result is white, then FP16 is being used. If the result is a dark grey, then the value was clamped to 2, resulting in 0.25.
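The arithmetic of that test can be simulated off the card. This is only a sketch under stated assumptions: FX12 is modelled as two's-complement fixed point with 1 sign, 1 integer and 10 fractional bits, the 1 + 3 + 8 split is the hypothetical layout from the paragraph above, rounding is to nearest, and the helper names are made up. It says nothing about what the drivers actually do.

```python
import struct


def to_fp16(x):
    """Round-trip a value through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fixed(x, int_bits, frac_bits):
    """Clamp and round to signed fixed point with the given bit split."""
    step = 2.0 ** -frac_bits
    lowest = -(2.0 ** int_bits)
    highest = 2.0 ** int_bits - step
    return round(min(max(x, lowest), highest) / step) * step


def run_test(store):
    """Move 8 into a register, multiply by 0.125, output as a colour."""
    reg = store(8.0)                      # value as held in the register
    result = store(reg * 0.125)           # multiply, stored in the same format
    return min(max(result, 0.0), 1.0)     # output colour saturates to [0, 1]


fp16_colour = run_test(to_fp16)                              # FP16 registers
fx12_colour = run_test(lambda x: to_fixed(x, 1, 10))         # assumed FX12 layout
wide_fixed_colour = run_test(lambda x: to_fixed(x, 3, 8))    # hypothetical 1+3+8 split

print(fp16_colour, fx12_colour, wide_fixed_colour)
```

In this model FP16 gives 1.0 (white) and the assumed FX12 layout gives 0.25 (dark grey), matching the reasoning above. Note that the hypothetical 1 + 3 + 8 fixed-point split also comes out at or very near white, so a white result would show that the range is wider than -2 to 2 rather than prove FP16 specifically.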
 
EDIT: Ante P (told you I was muzzy-feaded...err...something),
For the part Xmas is talking about, yes. But I was talking about other uses (EDIT: not just color). The texture processing functionality for NV3x is associated with the fp32 unit.

Regarding what Xmas said, using temporary registers as texture coordinates (first phase) would work with fp16. Texture registers for PS 1.4 are fixed point, and I was stuck in a bit of DX 9 / DX 8.1 disconnect.
 
Okay, I guess I might not have been sufficiently accurate, or correct, there.

Let's say it this way then:
The NV3x has *only* FP32 registers, but it can divide them into FP16 registers. It may also be possible to ignore the exponent part in order to reduce latency and thus increase performance. This would be done automatically in the case of PS 1.1. Maybe the drivers are sufficiently smart to notice when they can do it, but I doubt that, although it might be possible to implement that manually (through application detection)...

That's just my understanding, of course, could be wrong.
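To illustrate what "dividing" a 32-bit register into two FP16 values means at the bit level, here is a minimal sketch that packs two half-precision values into one 32-bit word and reads them back. It only shows the storage idea; how the NV3x actually lays out its register file is not known from this thread, and the function names are hypothetical.

```python
import struct


def pack_two_halves(a, b):
    """Store two values as IEEE half floats in one 32-bit word (a in the low half)."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]


def unpack_two_halves(word):
    """Recover the two half floats from a 32-bit word."""
    return struct.unpack("<2e", struct.pack("<I", word))


word = pack_two_halves(0.5, -2.0)
low_half, high_half = unpack_two_halves(word)

print(hex(word), low_half, high_half)
```

One 32-bit slot holds either a single FP32 value or two FP16 values, which is why half precision can double the number of live temporaries for the same register space.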


Uttar
 