strange nVidia shader compiler precision problems

RacingPHT

Newcomer
strange nVidia *format* precision problems

Hi all..
I'm writing a long shader doing raytracing, and I've run into some serious problems with the nVidia GPU/driver. Both vendors may give completely different results compared with the MS REF device, but the nVidia side is harder to work around. All vars are intended to be fp32.

####
edit: it's a range problem. I can't store anything larger than 65536. I don't know why ATI and REF work.

1: one shader is doing a screen-space interleaving operation, such as
Code:
	int2 vEvenOdd = floor(fmod((scrPos.xy + 0.5), 2.0));
	int idx = vEvenOdd.x + vEvenOdd.y * 2;
	float4 index = float4(idx == 0, idx == 1, idx == 2, idx == 3);
when I use the index to sample an fp32 texture it's OK, but when sampling fp16 textures the shader does not work correctly. This is not a dependent read (see the sketch below).
#####
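To be clear about how the index gets used, here is a simplified sketch (the real shader is much longer; the sampler and texcoord names are just placeholders):
Code:
	// four independent (non-dependent) fetches, then select one with the index mask
	float4 s0 = tex2D(sampFp16, uv0);
	float4 s1 = tex2D(sampFp16, uv1);
	float4 s2 = tex2D(sampFp16, uv2);
	float4 s3 = tex2D(sampFp16, uv3);
	float4 result = s0 * index.x + s1 * index.y + s2 * index.z + s3 * index.w;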

####
edit: it might be an L16 format problem... thanks to Bob
2: I encode a texcoord to a 1D value, and then expand it in a standard way:
Code:
	float i = f / 256;
	return float2(frac(i), i / 256);

the shader just drops to fp16 precision: once f is larger than 1024, the 2D texcoord only hits 1 of every 2 texels, or fewer (see the packing sketch below).
#####
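For reference, the idea is to round-trip a texel position in a 256x256 texture through one float. The encode side looks roughly like this (a hypothetical sketch, my actual code differs a bit):
Code:
	// hypothetical encode: pack integer texel coords (0..255 each) into one float
	float encodeTexel(float2 texel)
	{
		return texel.y * 256 + texel.x;   // ranges 0..65535, already above fp16's max of 65504
	}
	// decode, as posted above
	float2 decodeTexel(float f)
	{
		float i = f / 256;
		return float2(frac(i), i / 256);
	}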

All of this works as expected on the ATI and REF devices.

Any suggestions on this? :?:

-------------------------------------------------------------
edit: NV43 with 91.47; tried 84.26 and it's the same.
 
Which card and which driver rev?

Not that that info will do me any good... but it might help someone else.
 
Default precision for all computations on nVidia hardware is fp16, because nVidia can't do fp32 fast, so they espouse that having 32-bit floats is inherently wrong and fp16 is superior in every way... and that if fp16 isn't enough precision for you, then you are predestined to be branded a galactically supreme fool for an eternity. :devilish:

You might want to see if putting a .f on every constant and typecasting some incoming data (much of which may be rounded to FP16 based on the nature of your vertex shaders) helps; it sometimes does with MS' compiler. If you're using Cg, you might also want to look at the compile settings, since I think a "compile for speed" will cancel out any use of the word float.
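Something along these lines (the variable names are made up, just to show what I mean):
Code:
	// explicit float literals and casts, in case something got demoted to FP16 on the way in
	float2 scr = (float2)scrPos;
	float  i   = f / 256.0f;
	return float2(frac(i), i / 256.0f);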

There may also be a driver setting (for instance a quality/speed setting) that causes the bytecode translation to ignore precision for the sake of speed.
 
I thought that kind of shenanigans was over with the release of NV40? :oops:
Why, that would be silly. If anything, the assertions and the speed difference between FP16 and FP32 arithmetic are more extreme than ever with the NV40. You might be thinking of the G80.
 
Thanks ShootMyMonkey.
I'm using HLSL. I tried the D3DXSHADER_SKIPOPTIMIZATION flag, set "high quality" in the driver, and tried different type casts, and none of it has any effect.

Yes, I do use a lot of temp registers (up to r17), but I think I should at least have the chance to choose between correctness and speed and optimize it myself. Maybe they have to reduce the register footprint?
 
I wonder... Is scrPos drawn from the VPOS register (akin to your previous shader MSAA things)? I have to wonder how that behaves (e.g. if it's FP16 whether you like it or not).
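I mean something like this (an untested sketch, just to make sure we're talking about the same input):
Code:
	// ps_3_0 pixel position input
	float4 main(float2 vPos : VPOS) : COLOR
	{
		float2 vEvenOdd = floor(fmod(vPos + 0.5, 2.0));
		float  idx      = vEvenOdd.x + vEvenOdd.y * 2;
		// ... rest of the shader
		return float4(idx, idx, idx, 1);
	}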

Also when it comes to things like reading from FP32 or FP16 textures, both vendors do some weird conversions that you can't really do anything about but try a different format. I think there were some threads before (pertaining to AndyTX's variance shadow map work, IIRC) where people had links to a chart of mappings of various texture formats for nVidia hardware.
 
RacingPHT said:
2: I encode a texcoord to a 1D value, and then expand it in a standard way:
How did you get that texcoord?

RacingPHT said:
when I use the index to sample an fp32 texture it's OK, but when sampling fp16 textures the shader does not work correctly. This is not a dependent read.
Describe "does not work correctly". Note that it's very easy to hit NaNs and Inf with fp16, so you might be falling into one of those cases.


ShootMyMonkey said:
Default precision for all computations on nVidia hardware is fp16, because nVidia can't do fp32 fast, so they espouse that having 32-bit floats is inherently wrong and fp16 is superior in every way... and that if fp16 isn't enough precision for you, then you are predestined to be branded a galactically supreme fool for an eternity.
Or the more reasonable explanation: You have no idea what you're talking about.
 
How did you get that texcoord?
store shorts as L16 and then * 65535 + 0.5 in the shader.
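i.e. roughly this (the sampler name is just a placeholder):
Code:
	float f = tex2D(sampIndexL16, uv).r * 65535 + 0.5;   // recover the stored short, 0..65535
	float i = f / 256;
	float2 texcoord = float2(frac(i), i / 256);          // expand back to a 2D coordinate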

Describe "does not work correctly". Note that it's very easy to hit NaNs and Inf with fp16, so you might be falling into one of those cases.

Yes, you are right. It seems to me that when a large number is stored to fp16, NV produces NaN but ATI gives INF...?
 
Hi all..
####
edit: it's a range problem. I can't store anything larger than 65536. I don't know why ATI and REF work.

when I use the index to sample an fp32 texture it's OK, but when sampling fp16 textures the shader does not work correctly. This is not a dependent read.
#####

But you aren't using fp16 to store those values above 65536, right? The maximum value supported by fp16 is 65504 IIRC, so any values above that would probably get you +infinity. Don't know why the same setup would work on ATI or REF though!
 
But you aren't using fp16 to store those values above 65536, right? The maximum value supported by fp16 is 65504 IIRC, so any values above that would probably get you +infinity. Don't know why the same setup would work on ATI or REF though!

I was storing 1e5 to the fp16 surface by mistake. I tested it, and that bug is now fixed.
But I think NV handles +INF differently from REF: the overflowed value shows up on screen as 0, while on REF and ATI it is 1.

Any suggestion on the latter problem? :)
 
I seem to remember that ATI and nVidia have slightly different values for infinity. They should both be on the order of 1e38 though (on Shader Model 3.0).
 
ATI fp16 doesn't support denorms or specials (Inf, NaN, etc) IIRC... that would explain the lack of inf. Note that it probably just saturates, which may *look* like the correct result in a lot of cases.
 
+Infinities are converted to 1.0 on output to (unsigned) fixed-point buffers. -Inf and NaN are both converted to 0.0. So if you see black, it's highly likely you have a NaN in there. Make sure you don't subtract infinities, or multiply them by 0, or something.

store shorts as L16 and then * 65535 + 0.5 in the shader.
GeForce 7 and under do not support unsigned normalized textures with 16-bit components. I'm not sure what they're converted to in the driver, but it's likely that fp16 is the end-result.

Why can't you use a 2 component 8-bit unsigned normalized texture (RG8/LA8), then skip over your needless "conversion" code?
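I.e. something like this (a rough sketch; the exact scaling depends on how you generate the data):
Code:
	// CPU side: low byte of the index in one channel, high byte in the other, e.g. for a 256x256 target:
	//   texel.L = column / 255.0;   texel.A = row / 255.0;
	// shader side: the two bytes come back directly as a 2D coordinate
	float2 packed   = tex2D(sampIndexA8L8, uv).ra;    // (column, row) / 255
	float2 texcoord = packed * (255.0 / 256.0);       // rescale to (column, row) / 256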
 
+Infinities are converted to 1.0 on output to (unsigned) fixed-point buffers. -Inf and NaN are both converted to 0.0. So if you see black, it's highly likely you have a NaN in there. Make sure you don't subtract infinities, or multiply them by 0, or something.

GeForce 7 and under do not support unsigned normalized textures with 16-bit components. I'm not sure what they're converted to in the driver, but it's likely that fp16 is the end-result.

Why can't you use a 2 component 8-bit unsigned normalized texture (RG8/LA8), then skip over your needless "conversion" code?

Many thanks for your info, Bob!
Using LA8 is painful in a few places... I'll try to work around it. :)
 