strange nVidia shader compiler precision problems

RacingPHT

Newcomer
strange nVidia *format* precision problems

Hi all..
I'm writing a long shader doing raytracing, and I've run into some serious problems with the nVidia GPU/driver. Both vendors may give completely different results compared with the MS REF device, but the nVidia side is harder to work around. All vars are intended to be fp32.

####
edit: it's a range problem. I can't store anything larger than 65536. I don't know why ATI and REF work.

1: one shader is doing a screen-space interleaving operation, such as
Code:
	int2 vEvenOdd = floor(fmod((scrPos.xy + 0.5), 2.0));
	int idx = vEvenOdd.x + vEvenOdd.y * 2;
	float4 index = float4(idx == 0, idx == 1, idx == 2, idx == 3);
when I use the index to sample an fp32 texture it's OK, but when sampling fp16 textures the shader does not work correctly. This is not a dependent read (see the sketch below).
#####
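To be clear about how the index gets used, here is a simplified sketch (the real shader is much longer; the sampler and texcoord names are just placeholders):
Code:
	// four independent (non-dependent) fetches, then select one with the index mask
	float4 s0 = tex2D(sampFp16, uv0);
	float4 s1 = tex2D(sampFp16, uv1);
	float4 s2 = tex2D(sampFp16, uv2);
	float4 s3 = tex2D(sampFp16, uv3);
	float4 result = s0 * index.x + s1 * index.y + s2 * index.z + s3 * index.w;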

####
edit: it might be an L16 format problem... thanks to Bob
2: I encode a texcoord to a 1D value, and then expand it in a standard way:
Code:
	float i = f / 256;
	return float2(frac(i), i / 256);

the shader just drops to fp16 precision: once f is larger than 1024, the 2D texcoord only hits 1 of every 2 texels, or fewer (see the packing sketch below).
#####
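For reference, the idea is to round-trip a texel position in a 256x256 texture through one float. The encode side looks roughly like this (a hypothetical sketch, my actual code differs a bit):
Code:
	// hypothetical encode: pack integer texel coords (0..255 each) into one float
	float encodeTexel(float2 texel)
	{
		return texel.y * 256 + texel.x;   // ranges 0..65535, already above fp16's max of 65504
	}
	// decode, as posted above
	float2 decodeTexel(float f)
	{
		float i = f / 256;
		return float2(frac(i), i / 256);
	}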

All of this works as expected on the ATI and REF devices.

Any suggestions on this? :?:

-------------------------------------------------------------
edit: NV43 with 91.47; tried 84.26 and it's the same.
 
Which card and which driver rev?

Not that that info will do me any good... but it might help someone else.
 
Default precision for all computations on nVidia hardware is fp16, because nVidia can't do fp32 fast, so they espouse that having 32-bit floats is inherently wrong and fp16 is superior in every way... and that if fp16 isn't enough precision for you, then you are predestined to be branded a galactically supreme fool for an eternity. :devilish:

You might want to see if putting a .f on every constant and typecasting some incoming data (much of which may be rounded to FP16 based on the nature of your vertex shaders) helps; it sometimes does with MS' compiler. If you're using Cg, you might also want to look at the compile settings, since I think a "compile for speed" will cancel out any use of the word float.
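Something along these lines (the variable names are made up, just to show what I mean):
Code:
	// explicit float literals and casts, in case something got demoted to FP16 on the way in
	float2 scr = (float2)scrPos;
	float  i   = f / 256.0f;
	return float2(frac(i), i / 256.0f);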

There may also be a driver setting (for instance a quality/speed setting) that causes the bytecode translation to ignore precision for the sake of speed.
 
I thought that kind of shenanigans was over with the release of NV40? :oops:
Why, that would be silly. If anything, the assertions and the speed difference between FP16 and FP32 arithmetic are more extreme than ever with the NV40. You might be thinking of the G80.
 
Thanks ShootMyMonkey.
I'm using HLSL. I tried the D3DXSHADER_SKIPOPTIMIZATION flag, set "high quality" in the driver, and tried different type casts, and none of it has any effect.

Yes, I do use a lot of temp registers (up to r17), but I think I should at least have the chance to choose between correctness and speed and optimize it myself. Maybe they have to reduce the register footprint?
 
I wonder... Is scrPos drawn from the VPOS register (akin to your previous shader MSAA things)? I have to wonder how that behaves (e.g. if it's FP16 whether you like it or not).
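I mean something like this (an untested sketch, just to make sure we're talking about the same input):
Code:
	// ps_3_0 pixel position input
	float4 main(float2 vPos : VPOS) : COLOR
	{
		float2 vEvenOdd = floor(fmod(vPos + 0.5, 2.0));
		float  idx      = vEvenOdd.x + vEvenOdd.y * 2;
		// ... rest of the shader
		return float4(idx, idx, idx, 1);
	}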

Also when it comes to things like reading from FP32 or FP16 textures, both vendors do some weird conversions that you can't really do anything about but try a different format. I think there were some threads before (pertaining to AndyTX's variance shadow map work, IIRC) where people had links to a chart of mappings of various texture formats for nVidia hardware.
 
RacingPHT said:
2: I encode a texcoord to a 1D value, and then expand it in a standard way:
How did you get that texcoord?

RacingPHT said:
when I use the index to sample an fp32 texture it's OK, but when sampling fp16 textures the shader does not work correctly. This is not a dependent read.
Describe "does not work correctly". Note that it's very easy to hit NaNs and Inf with fp16, so you might be falling into one of those cases.


ShootMyMonkey said:
Default precision for all computations on nVidia hardware is fp16, because nVidia can't do fp32 fast, so they espouse that having 32-bit floats is inherently wrong and fp16 is superior in every way... and that if fp16 isn't enough precision for you, then you are predestined to be branded a galactically supreme fool for an eternity.
Or the more reasonable explanation: You have no idea what you're talking about.
 
How did you get that texcoord?
store shorts as L16 and then * 65535 + 0.5 in the shader.
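i.e. roughly this (the sampler name is just a placeholder):
Code:
	float f = tex2D(sampIndexL16, uv).r * 65535 + 0.5;   // recover the stored short, 0..65535
	float i = f / 256;
	float2 texcoord = float2(frac(i), i / 256);          // expand back to a 2D coordinate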

Describe "does not work correctly". Note that it's very easy to hit NaNs and Inf with fp16, so you might be falling into one of those cases.

Yes, you are right. It seems to me that when a large number is stored to fp16, NV produces NaN but ATI gives INF...?
 
Hi all..
####
edit: it's a range problem. I can't store anything larger than 65536. I don't know why ATI and REF work.

when I use the index to sample an fp32 texture it's OK, but when sampling fp16 textures the shader does not work correctly. This is not a dependent read.
#####

But you aren't using fp16 to store those values above 65536, right? The maximum value supported by fp16 is 65504 IIRC, so any values above that would probably get you +infinity. Don't know why the same setup would work on ATI or REF though!
 
But you aren't using fp16 to store those values above 65536, right? The maximum value supported by fp16 is 65504 IIRC, so any values above that would probably get you +infinity. Don't know why the same setup would work on ATI or REF though!

I was storing 1e5 to the fp16 surface by mistake. I tested it, and that bug is now fixed.
But I think NV handles +INF differently from REF: the overflowed value shows up on screen as 0, while on REF and ATI it is 1.

Any suggestion on the latter problem? :)
 
I seem to remember that ATI and nVidia have slightly different values for infinity. They should both be on the order of 1e38 though (on Shader Model 3.0).
 
ATI fp16 doesn't support denorms or specials (Inf, NaN, etc) IIRC... that would explain the lack of inf. Note that it probably just saturates, which may *look* like the correct result in a lot of cases.
 
+Infinities are converted to 1.0 on output to (unsigned) fixed-point buffers. -Inf and NaN are both converted to 0.0. So if you see black, it's highly likely you have a NaN in there. Make sure you don't subtract infinities, or multiply them by 0, or something.

store shorts as L16 and then * 65535 + 0.5 in the shader.
GeForce 7 and under do not support unsigned normalized textures with 16-bit components. I'm not sure what they're converted to in the driver, but it's likely that fp16 is the end-result.

Why can't you use a 2 component 8-bit unsigned normalized texture (RG8/LA8), then skip over your needless "conversion" code?
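I.e. something like this (a rough sketch; the exact scaling depends on how you generate the data):
Code:
	// CPU side: low byte of the index in one channel, high byte in the other, e.g. for a 256x256 target:
	//   texel.L = column / 255.0;   texel.A = row / 255.0;
	// shader side: the two bytes come back directly as a 2D coordinate
	float2 packed   = tex2D(sampIndexA8L8, uv).ra;    // (column, row) / 255
	float2 texcoord = packed * (255.0 / 256.0);       // rescale to (column, row) / 256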
 
+Infinities are converted to 1.0 on output to (unsigned) fixed-point buffers. -Inf and NaN are both converted to 0.0. So if you see black, it's highly likely you have a NaN in there. Make sure you don't subtract infinities, or multiply them by 0, or something.

GeForce 7 and under do not support unsigned normalized textures with 16-bit components. I'm not sure what they're converted to in the driver, but it's likely that fp16 is the end-result.

Why can't you use a 2 component 8-bit unsigned normalized texture (RG8/LA8), then skip over your needless "conversion" code?

Many thanks for your info, Bob!
Using LA8 is painful in a few places... I'll try to work around it. :)
 