As some have said, thanks for your comments, Eric. However...
sireric said:
About some misconceptions, and some comments:
1) IE^3 standards do not specify what should be returned for transcendental functions (sqrt, sin, etc...). They specify the format of data (including NaNs, infinities, denorms) and the internal roundings for results -- this rounding is not the f2i conversion, but how to compute the lsbs of the results. Different HW can return different results. People have learned to live with this.
Correct, but here you have a fine alternative: if you need a reproducible version of cos, you can implement it yourself as a Taylor series, using floating-point add and mul, if add and mul are deterministic. But if add and mul aren't deterministic, it's impossible to implement anything deterministically at all. The basic arithmetic ops are the building blocks.
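To make that concrete, here's a minimal sketch of what I mean in plain C -- a truncated Taylor series for cos built from nothing but float mul and add (no argument reduction, low degree, purely for illustration):

```c
#include <stdio.h>

/* Reproducible cos(x) sketch: a truncated Taylor series evaluated with
 * nothing but float multiplies and adds. If add and mul are bit-exact
 * across implementations, this function is too (build with FMA
 * contraction disabled, e.g. -ffp-contract=off, so a*b+c isn't fused).
 * No argument reduction -- only meant to be sensible for |x| < pi/2. */
static float taylor_cos(float x)
{
    float x2 = x * x;
    /* cos x ~= 1 - x^2/2! + x^4/4! - x^6/6! + x^8/8!, in Horner form */
    float r = 1.0f / 40320.0f;        /*  1/8! */
    r = r * x2 - (1.0f / 720.0f);     /* -1/6! */
    r = r * x2 + (1.0f / 24.0f);      /* +1/4! */
    r = r * x2 - 0.5f;                /* -1/2! */
    r = r * x2 + 1.0f;
    return r;
}

int main(void)
{
    printf("%.9f\n", taylor_cos(0.5f));   /* cos(0.5) ~= 0.877582562 */
    return 0;
}
```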
If you need 24 bits of mantissa precision, FP32 is not enough for you anyway: FP32 is 1.8.23, i.e. 23 stored mantissa bits.
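For reference, here's what the 1.8.23 layout means in practice -- a quick sketch that pulls the three fields out of a float:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pull an FP32 value apart into its 1.8.23 fields. */
int main(void)
{
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);           /* bit-exact copy of the float */

    unsigned sign     = bits >> 31;           /* 1 bit                        */
    unsigned exponent = (bits >> 23) & 0xFFu; /* 8 bits, biased by 127        */
    unsigned mantissa = bits & 0x7FFFFFu;     /* 23 stored bits (+1 implicit) */

    printf("sign=%u exponent=%u (unbiased %d) mantissa=0x%06X\n",
           sign, exponent, (int)exponent - 127, mantissa);
    return 0;
}
```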
2) IE^3 does not guarantee, in any way, that operations are order independent. For example, consider:
result = a + b + c
The above, generally, needs to be broken down into 2 operations. If we select a = 1.0, b = -1.0 and c = 2^(-30), then the result can be 0 or c. Depends on the implementation. IE^3 does not "specify" anything. The programmer needs to specify what he wants (i.e. (a+b)+c).
Er... no, not according to my understanding. IEEE doesn't define any operation "a+b+c" whose order is left undefined; it only defines "a+b", so to add three numbers you have to write either "(a+b)+c" or "a+(b+c)", and either of those is reproducible. Or use a language like C, which has well-defined precedence and associativity rules, so that "a+b+c" means exactly "(a+b)+c" and never "a+(b+c)". Any violation of this in modern languages is an optional compiler optimization that defaults to off and goes by a name like "optimize floating-point operations aggressively" -- and we don't want that, even where it's already happening.
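Using the exact values from the quote above, a quick C sketch of the point -- both orderings are perfectly well-defined and reproducible under IEEE single precision, they're just not the same value:

```c
#include <stdio.h>

int main(void)
{
    float a = 1.0f, b = -1.0f;
    float c = 1.0f / (1 << 30);     /* 2^-30 */

    float left  = (a + b) + c;      /* (1 - 1) + 2^-30  = 2^-30            */
    float right = a + (b + c);      /* -1 + 2^-30 rounds back to -1, so 0  */

    printf("(a+b)+c = %g\n", left);   /* 9.31323e-10 */
    printf("a+(b+c) = %g\n", right);  /* 0 */
    return 0;
}
```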
2.5) IE^3 support for nans, inf and denorms is just not needed in PS. In the final conversion to [0,1.0] range, inf and 2.0 would give the same output. For that matter, it's not needed in VS either.
They're not needed if and only if you don't care about reproducibility. But if you're going to do what NVIDIA does and sometimes run VS operations on the CPU for load-balancing, then you get different results along the two paths, which is bad. This is exactly the kind of problem the IEEE spec was designed to remedy, so why not use it?
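And for what it's worth, the "inf and 2.0 give the same output" argument only holds if the value goes straight to the framebuffer. A toy C sketch (saturate here is just my stand-in for the final [0,1] conversion, not any particular API):

```c
#include <math.h>
#include <stdio.h>

/* Stand-in for the final clamp to the [0,1] output range. */
static float saturate(float x) { return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x); }

int main(void)
{
    float a = 2.0f;
    float b = INFINITY;   /* e.g. an overflowed intermediate */

    /* Written straight to the output, the two really are indistinguishable: */
    printf("%g %g\n", saturate(a), saturate(b));                /* 1 1     */

    /* Reused as an intermediate (say, as an attenuation divisor), they are not: */
    printf("%g %g\n", saturate(1.0f / a), saturate(1.0f / b));  /* 0.5 0   */
    return 0;
}
```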
Saying "this device is IEEE compliant, with this small set of exceptions" is like saying "I am a virgin, with this small set of exceptions". Either Tagrineth is a virgin, or she's not
.
Again, it all comes down to whether you see 3D hardware as a deterministic computational device that produces well-defined output for any input, or as just some black box that you feed polygons into to get some sort of random approximation of your scene.
3) FP24 has less precision than FP32, but has no worse other characteristics (well, range is reduced by 2^63 as well). Order of operations would be no "worse" than FP32, beyond the precision limits.
You're implying that the effect of precision loss with FP24 vs. FP32 is linear, a sort of one-time penalty -- that's not the case at all in my book. In the worst case, cascading loss-of-precision errors can increase exponentially as a function of instruction count divided by mantissa bit count.
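Here's a rough way to watch the gap widen -- not real R300 behaviour, just my own simulation that (assuming FP24 means 16 stored mantissa bits, which is my understanding of the format) rounds a double to N mantissa bits after every operation and runs a chain of dependent ops:

```c
#include <math.h>
#include <stdio.h>

/* Round x to 'bits' stored mantissa bits (plus the implicit leading 1).
 * This only models precision, not FP24's reduced exponent range or
 * denormal handling. */
static double round_mantissa(double x, int bits)
{
    if (x == 0.0 || !isfinite(x)) return x;
    int e;
    double m = frexp(x, &e);              /* x = m * 2^e, 0.5 <= |m| < 1 */
    double scale = ldexp(1.0, bits + 1);  /* 1 implicit bit + 'bits' stored */
    return ldexp(nearbyint(m * scale) / scale, e);
}

/* A chain of dependent multiply-adds, rounded to 'bits' after every op,
 * compared against the same chain carried in full double precision. */
static double chain_error(int bits, int steps)
{
    double ref = 1.0, lo = 1.0;
    for (int i = 0; i < steps; ++i) {
        ref = ref * 1.0009765625 + 0.0001;  /* double-precision reference */
        lo  = round_mantissa(round_mantissa(lo * 1.0009765625, bits) + 0.0001, bits);
    }
    return fabs(ref - lo) / fabs(ref);      /* relative error after 'steps' ops */
}

int main(void)
{
    for (int steps = 8; steps <= 512; steps *= 4)
        printf("%4d ops: rel. error  16-bit mantissa %.1e   23-bit mantissa %.1e\n",
               steps, chain_error(16, steps), chain_error(23, steps));
    return 0;
}
```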
4) What are the outputs of the shader? There are two: the 10b color or the 11b texture address (actually, with subtexel precision for filters, you could expect the texture address to be up to 15b). With FP24, you get 17b of denormalized precision, which would allow you to have up to 8 ALU instructions at maximum error rate (assuming 1/2 lsb of error per operation) before even noticing any change in the texture address or texture filtering. Until texture sizes increase significantly (2kx2k right now will give you a 16MB texture -- I don't know of many apps that even use that now), or texture filtering on FP textures exists, there really is no need for added precision. On the other hand, it's obvious that FP16 is not good enough, unless you have smallish textures (i.e. < 512x512).
"Assuming 1/2 lsb of error per operation" is not realistic. The number of lsb's of error in the worst case can be equal to the difference between the two exponents in the computation, so you can easily have 2, 4, 8, or 16 lsb's of error in any given computation. For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.
5) The only exception to 4 above is for long procedural texture generation. In those cases you could expect that some fp24 limits would come into play. There's no real use of that out there right now, and we are probably years from seeing it in mainstream.
If Intel had applied this kind of "gee, there's no real use for this combination of instructions" reasoning when designing the x86, it would have been impossible to write the kinds of programs people ended up writing.
Nevertheless, our Ashli program can take procedural texture generation code from Renderman, and generate long pixel shaders. We've generated shaders that are thousands of instructions long. What we found is that it looks perfect, compared to the original image. No one would ever complain about that quality -- It's actually quite amazing. So, empirically, we found that even in these cases, FP24 is a nearly perfect solution.
Sure, it's easy to come up with a 1000-instruction shader that looks perfect with FP24, and just as easy to come up with a 3-instruction shader that looks like crap with FP24 but looks great with FP32, and then a 3-instruction shader that looks like crap with FP32 but is great with FP64. Floating point is like that.
6) FP32 would not only be 30% larger than FP24 from a storage standpoint, the multipliers and adders would be nearly twice as big as well. That added area would have increased cost and given no quality benefits. We could have implemented a multi-pass approach to give higher precision, but we felt that simplicity was a much higher benefit. Given the NV3x PP fiasco, we even more strongly believe now that a single precision format is much better from a programming standpoint (can anyone get better than FP16 on NV3x?).
FP24 was a reasonable decision for the R3x0, which was available basically a year before NV30. It's a lot better than 8-bit integer and gave everyone a sneak peek at DX9's capabilities. But it should be considered a stepping stone, to be phased out as soon as FP32 is commercially viable, rather than a long-term solution. And FP32 may be becoming viable now with NV35 (I haven't got one, so I can't really say for sure). It's just like 3dfx's situation with 16-bit: it was the right solution in 1997, but when 1999 came and they were still arguing that it was good enough and nobody needed 32-bit, well, that was not a realistic view.
[edit]
Eh, didn't see that Russ had already mentioned the 3dfx 16-bit stance.
At the end of it all, we believe we made the right engineering decision. We weighed all the factors and picked the best solution. DX9 did not come before the R300 and specify FP24; we felt that FP24 was the "right" thing, and DX9 agreed.
Like I said, the R3x0 is a fine part and a good start wrt FP. I just hope it doesn't become set in stone.
Like Simon F said, IANAHE, so this is just my understanding.
[edit]
minor touch-ups