FP - DX9 vs IEEE-32

Reverend said:
Dave said:
"R400" (or whatever is next) is likely still to be a DX9 part so that will probably stay FP24 as well, I would expect DX10 parts to have different base precisions.
I hope this is not the case -- ATi is the leader now, with a lot of folks buying their parts, which is something developers always take into consideration... ATi should start pushing things, like NVIDIA did with 32-bit, static TnL and DX8 shaders back then. I'm sure nobody will complain. But then, time and market penetration is a very important factor so if R400 still has a FP24 "limit", I'd understand the decision but I personally would wish for more.

Given the recent rumours re: R400 program, we can be almost certain that what will now be marketed as R4XX will be FP24 limited. Sireric's (many thanks) view on FP framebuffers, etc, put the real limitations of FP24 into context. I, too, suspect no real change until the next process technology shift.
 
Reverend said:
Correct but here you have a fine alternative: if you need a reproducible version of cos, you can implement it yourself as a Taylor series, using floating point add and mul, if add and mul are deterministic. But if add and mul aren't deterministic, it's impossible to implement anything deterministically at all. The basic arithmetic ops are the building blocks.
By "reproducible" and "deterministic", I assume you mean "identical in all implementations". Otherwise, yes, operations are always reproducible and are deterministic. We did not add a random number generator. :)
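For illustration, a series like the one described above might look like this in C: only adds and muls at run time, so the result depends only on how those two ops round. A minimal sketch, assuming the input has already been range-reduced to roughly [-pi/4, pi/4]; the name cos_taylor and the truncation at x^8 are arbitrary.

Code:
/* Minimal sketch: cos(x) via a truncated series, using only adds and muls
   at run time (the divisions below are constant expressions).
   Assumes |x| has already been range-reduced to roughly [-pi/4, pi/4]. */
float cos_taylor(float x)
{
    float x2 = x * x;
    float r  =  1.0f / 40320.0f;   /* +x^8/8! */
    r = r * x2 - 1.0f / 720.0f;    /* -x^6/6! */
    r = r * x2 + 1.0f / 24.0f;     /* +x^4/4! */
    r = r * x2 - 0.5f;             /* -x^2/2! */
    return r * x2 + 1.0f;          /* +1      */
}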

On the other hand, even if floating point adds and muls have identical implementations, in general, you would still not be able to guarantee that two implementations of the same Taylor expansion would always give the same results, with the same inputs. You would also need to have not only identical hw, but identical compilers, identical source code and identical OS/APIs. IE^3 makes no guarantees on a sequence of operations. Floating point numbers are inexact and so are operations performed on them. Programmers have learned to live with that.
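A contrived C illustration of why the ordering matters: both evaluations below are deterministic and correct on their own, they just don't agree with each other, which is why even the compiler's choice of ordering has to match.

Code:
#include <stdio.h>

int main(void)
{
    double a = 1e20, b = -1e20, c = 1.0;

    printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c));  /* prints 0: b + c rounds back to b */
    return 0;
}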

FP32 is 1.8.23
The 23 stored bits are the mantissa of a normalized number whose leading 1 is implicit, so it is really a 24b value. You have 24b of precision.
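A quick way to see the implied bit on any machine with IEEE-style 32-bit floats (illustrative, using the standard C headers):

Code:
#include <float.h>
#include <stdio.h>

int main(void)
{
    /* 23 stored mantissa bits + the implied leading 1 = 24 bits of precision */
    printf("FLT_MANT_DIG = %d\n", FLT_MANT_DIG);  /* 24 */
    printf("FLT_EPSILON  = %g\n", FLT_EPSILON);   /* 2^-23, about 1.19e-07 */
    return 0;
}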
Er...no, not according to my understanding. IEEE doesn't have any such operation as "a+b+c" whose order is undefined. IEEE only has "a+b", so to add three numbers you need to either specify "(a+b)+c" or "a+(b+c)", either of which is reproducible. Or use a language like C which has well-defined precedence and associativity rules, so that "a+b+c" is defined as being exactly the same as "(a+b)+c" and not "a+(b+c)". Any violation of this in modern languages is an optional compiler optimization that defaults to "off" and is called something like "optimize floating-point operations aggressively". We don't want this even if it's already happening ;) :)

All I meant is that more complex operations do not have results that are guaranteed by IE^3. The implementation details influence the results. If a PowerPC implements an FMAD with higher precision than a MUL/ADD combo, that will lead to slight differences between that HW and others. Doesn't seem to offend most programmers.

Though there are some that require the exact same results. But they aren't programming pixel shaders :)
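Going back to the FMAD point, a small C illustration of how a fused multiply-add can legitimately differ from a separate mul and add (needs C99's fma(); it also assumes the compiler does not contract the plain expression into an fma on its own):

Code:
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + ldexp(1.0, -27);     /* 1 + 2^-27 */
    double c = -(1.0 + ldexp(1.0, -26));  /* -(1 + 2^-26) */

    /* a*a rounds away its lowest bit, so mul+add gives 0, while the fused
       op keeps the exact product internally and recovers 2^-54. */
    printf("mul+add: %g\n", a * a + c);     /* 0 */
    printf("fma:     %g\n", fma(a, a, c));  /* about 5.55e-17 */
    return 0;
}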

They're not needed if and only if you don't care about reproducibility. But if you're going to do like NVidia does and sometimes run VS operations on the CPU for load-balancing, then you get different results along both paths, which is bad. This is exactly the kind of problem the IEEE spec was designed to remedy, so why not use it?

One has to judge the cost of an item and make a call. Certainly if one is offloading VPU activity to the CPU, having identical implementations is required. On the other hand, PS shaders do not have the luxury of being offloadable. They could be used to offload the CPU, but there's no API available to do that. Would you increase the cost of the product for something that could not be used?

Saying "this device is IEEE compliant, with this small set of exceptions" is like saying "I am a virgin, with this small set of exceptions". Either Tagrineth is a virgin, or she's not :LOL: .

Sure. Quite colorful.

Again it all comes down to whether you see 3D hardware as a deterministic computational device which produces well-defined output for any input, or it's just some black box that you feed polygons into to produce some sort of random approximation of your scene.

Again, 3D hardware is very deterministic. You plug in something, and the same thing comes out. Every time. However, the HW is just part of the whole. The system HW, the OS, the application, the API, the drivers, etc… are all changing. Expecting exactly the same output on all systems is not realistic.

You're implying that the effect of precision loss on FP24 vs FP32 is linear, a sort of one-time penalty -- that's not the case at all in my books. In the worst case, cascading loss-of-precision errors can increase exponentially as a function of instruction count divided by mantissa bit count.

I did not say that it was linear. I was saying that it has the same properties. Yes, the error ranges are larger, but the properties are the same. Going to FP24 does not require a "change" of philosophy. It just needs the programmer to be aware of the ranges, and to code the applications taking that into account.


"Assuming 1/2 lsb of error per operation" is not realistic. The number of lsb's of error in the worst case can be equal to the difference between the two exponents in the computation, so you can easily have 2, 4, 8, or 16 lsb's of error in any given computation. For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.

No, for each operation, assuming ½ lsb of error is correct. Your example is a composite operation. A ½ lsb of error in one operation can be amplified in the next operation, regardless of the precision. But, you are correct that the errors do not just add; you can certainly construct operations that magnify errors. However, when coding PS operations, one should strive for stable code. In most shader codes I've seen, things are simple and errors are stable.

If Intel applied this kind of "gee, there's no real use for this combination of instructions" when designing the x86, it would be impossible to write the kind of programs people wrote.

Sure. VPUs really aren’t replacements for CPUs. If ATI gets in that business, we will lose. Intel and AMD are much better at it. Given that VPU outputs eventually get truncated down to 10b for color and 11~15b for texture addresses, things are good (for now).

Sure, it's easy to come up with a 1000-instruction shader that looks perfect with FP24, and easy to come up with a 3-instruction shader that looks like crap with FP24 then looks great with FP32, and then a 3-instruction shader that looks like crap with FP32 but is great with FP64. Floating point is like that. :)

Sure, FP is inexact. However, I was using empirical evidence to show that our assumptions appear justified.

FP24 was a reasonable decision for the R3x0, which was available basically a year before NV30. It's a lot better than 8-bit integer and gave everyone a sneak peek at DX9's capabilities. But it should be considered a stepping stone, to be phased out as soon as FP32 is commercially viable, rather than being considered a long-term solution. And FP32 may be becoming viable now with NV35 (haven't got one, can't really say for sure). It's just like 3dfx's situation with 16-bit: it was the right solution in 1997, but when 1999 came and they were arguing that it was good enough and nobody needed 32-bit, well, that was not a realistic view.

Never said that FP24 is the end. Neither is FP32, for that matter. At SGI, on GE11 (IR, Impact), we had double precision ALUs, just to compute higher order geometries (circles, spheres). But my point is that FP24 is still brand new and there are no applications yet showing up that push it at all. I explained that FP32 (at full speed) is significantly more expensive than FP24. I also noted that other items (Larger textures, FP displays, etc…) need to kick in as well to justify FP32. One has to weigh the cost and the benefits. I stand by our decision to use FP24. It’s fast and it’s high precision; nobody else can claim those things.

Like I said, the R3x0's is a fine part, a good start wrt FP. I just hope it doesn't become set in stone.

You're being silly. It's obvious that FP32 will come, when it's needed and cost effective. I don't really understand what you meant to do by all this; FP24 is justified and makes sense, for now; FP32 is not. Why not enjoy the benefits of what is available now? Anyway, I'm glad you think R300 is a "fine" part.
 
Chalnoth said:
I'm not so sure. There are many different ways to deal with the error. It is, for example, quite feasible for the error in these calculations to always be additive.

Such error would be absolutely devastating for the accuracy of anything resembling a long program.

Are you assuming that what I said would be more prone to biased rounding? I don't see why.

I can put it in a more "marketing worthy" way. :) Is it worth 40% more gates to gain ~0.7 bit higher precision in the case with a 24 bit mantissa (fp32)?
 
Addressing the last point first since I don't like being called silly :) :
sireric said:
Rev said:
Like I said, the R3x0's is a fine part, a good start wrt FP. I just hope it doesn't become set in stone.
You're being silly. It's obvious that FP32 will come, when it's needed and cost effective. I don't really understand what you meant to do by all this; FP24 is justified and makes sense, for now; FP32 is not. Why not enjoy the benefits of what is available now?
I should've said "long term" instead of "set in stone". [edit]Note that R300 debuted in, what, Sept last? And the R4xx, say, this year and presumably still FP24? Let's assume R300-based boards are the majority of boards consumers own this time next year. Do you think FP32 will be really important to developers if this is the scenario next year? Is it "needed" then based on this scenario? Certainly most developers now, with the R300 being the first DX9 board commercially available, and having the timeframe advantage over the NV30, won't be doing stuff requiring FP32 to work or run without looking shitty... not until it's needed according to you, which means not until we see FP32 hardware being the majority. Yes?

But I find it strange to hear you say "...FP32 will come, when it's needed and cost effective". I would argue with the "when it's needed" part but I can understand the cost effective part.

"when it's needed" sound awfully familiar to me and usually for the wrong reasons in my books -- surely you won't fault me for comparing this to 32-bit color depth back when the TNT debuted it... it "wasn't needed" back then but I could think of instances where it is definitely needed even before it made its debut. For that matter, many other things are not "needed"... until the hardware arrives, right? :) Why is AA and AF needed now? Because it was made available and folks got to see the difference. I think, for the few reasons I expressed either here or here, that once folks see on display what I expressed, it will be "needed". Hey, wait a minute, if I want to do this already, it is already needed! Of course, I don't work for a AAA development house... maybe that's one of the main points :)

I think I have expressed why I think FP32 is needed, even now, but that's probably because I know what IEEE-32 is, and what DX9 is not, compared to it. I don't know the entire history of DX9, from conception to fruition, but my understanding was that IEEE-32 was a basis for it wrt precision. Perhaps that is the entire summary of my thoughts in this thread! :)

I don't really understand what you meant to do by all this; FP24 is justified and makes sense, for now; FP32 is not.
"What I meant to do by all this" is to give examples of why I think FP32 is a necessity and why FP24 will not be good enough. Having brought R3x0 and ATi into the discussion probably wasn't a good idea but I needed an existing hardware with FP support but not quite full FP32 support to illustrate why I know FP32 is important. If this is redundant or a "duh" point, excuse me... I'm not so sure many others would know the imporatnt difference between FP24 and FP32.

Why not enjoy the benefits of what is available now?
I do enjoy seeing what FP24 gives me now via the R3x0. I would enjoy it more if I see full FP32, that's all.

Anyway, I'm glad you think R300 is a "Fine" part.
Aw, so sorry I didn't say a "very good part". :) It is as fine a part as NV20 was when it made its debut along with DX8, that's what I meant basically.

As for the rest, I think things are little clearer between us. Thanks for your comments. In the end, I gave my views in the hope of learning things that may show me I'm wrong in my understanding, that's all. I'm not here to argue the virtues or bad stuff about any one chip from any IHV. If anything, show me why you think it's wrong for me to think FP32 is that much more important than FP24 per se, leaving aside any IHV favoritisms or chip favouritisms or who you work for.

I'm in a hurry for lunch so excuse any inconsistent or dumb ramblings!
 
Reverend said:
I do enjoy seeing what FP24 gives me now via the R3x0. I would enjoy it more if I see full FP32, that's all.
You would? What would justify the cost? What application would require it? We've barely even touched the limits of FP24 and already you are complaining it's not enough. Sounds very similar to what I read before the NV30 launch. "FP24 is not enough, you gotta have FP32. Don't buy the R300, the NV30 will be better in all respects because it has real DX9 support." Yeah right.

Show me an interesting application that requires FP32 right now. Hell, show me an interesting application within the next year that will require FP32.
 
OpenGL guy, calm down. This isn't really about NVIDIA or ATI or some chip by either. Please read what I wrote. Your "show me an app" comment is really a chicken-and-egg situation, isn't it?

I'm not talking about current apps, please. I'm talking about specific instances where FP32 has an advantage over FP24, that's it, period. Is it wrong to wish for more than the R300 offers? You make it sound like I'm bashing the R300 when that is most definitely not what I'm doing.

Analogy - what's the point of discussing AA algorithms better than what current hardware offers, and their benefits? There's no point in discussing it then?

[edit]I started talking about IEEE and DX9 and thought that using the R300 as an example (since it was the first FP supporting hardware) would be a good basis for discussions. sireric in his first posts made some excellent points but he somehow had to bring his "I'm an ATi employee" mindset into the discussion. I countered with my own points. sireric again provided additional comments but still feels the need to be "defensive" about ATi (since he works for them). Why? Where can I go to talk about the need for progress and better solutions? Is it wrong to do so here without coming out like I'm criticizing a hardware or company unjustifiably? I want FP32. I appreciate FP24. I gave my views on why I think FP32 is more important than FP24 in specific instances. I did not say any app currently requires FP32 but I gave instances where such may be required in the future. Is that wrong?
 
Reverend said:
I'm not talking about current apps, please. I'm talking about specific instances where FP32 has an advantage over FP24, that's it, period. Is it wrong to wish for more than the R300 offers? You make it sound like I'm bashing the R300 when that is most definitely not what I'm doing.
As I said before, FP24 isn't a limitation now, and won't be for a while. What use is more precision? Sure, you may want it, just like I want a new sports car, but do you need it? Not that I can see.
Analogy - what's the point of discussing better AA algorithms and its benefits than what current hardware offers? There's no point in discussing it then?
Have we reached the limits of current AA algorithms? I don't know. I do know that I would like more samples, and I can see a need for that. I can't see a need for more than FP24 in the near future.

And I don't think FP24 is holding back anyone's development right now.
I gave my views on why I think FP32 is more important than FP24 in specific instances. I did not say any app currently requires FP32 but I gave instances where such may be required in the future. Is that wrong?
IHVs design chips with specific goals in mind. If FP24 meets those goals, then that's what will be designed. There's no reason to go overboard (i.e. over engineer) with FP32 because you won't be able to justify the extra cost. Again, you have to balance what you want vs. what you need, and it can be a tough balancing act.

In my opinion, ATi chose wisely and that's why the R300 products are doing well.

P.S. Don't worry, I'm not arguing against FP32, I'm just saying that right now, and in the near future, there is no need for it.
 
I suppose what you're saying is that my wants and needs and facts (presumably facts, since you didn't comment on them specifically, unlike sireric's attempts) about FP24 limitations vis-a-vis FP32 aren't terribly important right now :) (wait a sec, that should be a :( ).

Completely understandable.

Now, can we get back on topic? :) Can you tell me if I'm right or wrong in the examples (FP24 not being good enough compared to FP32 per se regardless of hardware availability) I gave, regardless of whether you think FP24 is good enough for now or near future? I'm not trying to be clever here -- I am sincerely not sure if I'm right or wrong in my thinking and experimentations (I'm still learning). Just the facts as per what I laid out re FP24 vs FP32. Forget you work for ATI :)
 
OpenGL guy said:
P.S. Don't worry, I'm not arguing against FP32, I'm just saying that right now, and in the near future, there is no need for it.

Not until ATI get their implementation right? ;)
 
Reverend said:
I suppose what you're saying is that my wants and needs and facts (presumably facts, since you didn't comment on them specifically, unlike sireric's attempts) about FP24 limitations vis-a-vis FP32 aren't terribly important right now :) (wait a sec, that should be a :( ).

Completely understandable.
Yay, so we're past that ;)
Now, can we get back on topic? :) Can you tell me if I'm right or wrong in the examples (FP24 not being good enough compared to FP32 per se regardless of hardware availability) I gave, regardless of whether you think FP24 is good enough for now or near future? I'm not trying to be clever here -- I am sincerely not sure if I'm right or wrong in my thinking and experimentations (I'm still learning). Just the facts as per what I laid out re FP24 vs FP32. Forget you work for ATI :)
Eventually, you can expose limitations of any level of precision. For example, I really wish I had (fast) FP128 for my fractal computations. Maybe I should put pressure on Intel and AMD for better FPUs :D Or maybe sireric will give me FP128 in the pixel shader ;)
 
LeStoffer said:
sireric said:
6) FP32 vs. FP24 would not only be 30% larger from a storage standpoint, the multipliers and adders would be nearly twice as big as well.

Interesting info. The FP24 choice obviously had a lot to do with your silicon budget, but I didn't realize that FP32 is so demanding over FP24.
Just do an N digit x N digit multiply by hand and count the steps needed. (=> O(N^2) complexity) The algorithm is going to be approximately the same for binary in silicon.

(Actually, there are algorithms that do it asymptotically faster than O(N^2), e.g. Karatsuba at roughly O(N^1.58), but they're complicated and don't pay off at these widths)
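As a rough back-of-the-envelope (assuming DX9's s16e7 layout for FP24): with the hidden bit the significands are 17 and 24 bits, and an array multiplier grows roughly with the square of the significand width, so 24^2 / 17^2 = 576 / 289, which is about 2.0. That lines up with the "nearly twice as big" figure for the multipliers.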

Reverend said:
As some have said, thanks for your comments Eric. However...

sireric said:
About some misconceptions, and some comments:

1) IE^3 standards do not specify what should be returned for transcendental functions (sqrt, sin, etc...). They specify the format of data (including NaNs, infinities, denorms) and the internal roundings for results -- This rounding is not the f2i conversion, but how to compute the lsbs of the results. Different HW can return different results. People have learned to live with this.

Correct but here you have a fine alternative: if you need a reproducible version of cos, you can implement it yourself as a Taylor series, using floating point add and mul, if add and mul are deterministic. But if add and mul aren't deterministic, it's impossible to implement anything deterministically at all. The basic arithmetic ops are the building blocks.
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x (assuming it's implemented properly!!!). It's just that floating point is not a mathematical group, so the usual maths rules you expect, e.g. associativity, (a+b)+c = a+(b+c), or the distributive law, a*(b+c) = a*b + a*c, aren't guaranteed.
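For the distributive law, a small float example (the values are chosen purely to force the roundings; illustrative only):

Code:
#include <stdio.h>

int main(void)
{
    float a = 3.0f, b = 16777216.0f /* 2^24 */, c = 1.0f;

    printf("a*(b + c) = %.0f\n", a * (b + c));    /* 50331648: b + c rounds back to b */
    printf("a*b + a*c = %.0f\n", a * b + a * c);  /* 50331652: the sum rounds up */
    return 0;
}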

Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.

Quote:
If you need 24b of mantissa precision, FP32 is not enough for you anyway.
FP32 is 1.8.23
But you are forgetting that, except for denormalised numbers, there is an implied '1' in the MSB, and so the mantissa is 24b.

"Assuming 1/2 lsb of error per operation" is not realistic. The number of lsb's of error in the worst case can be equal to the difference between the two exponents in the computation, so you can easily have 2, 4, 8, or 16 lsb's of error in any given computation.
Actually you can have all of the precision lost in a couple of calcs if you program things badly, eg
Code:
Inaccurate = (Big + Small) - Big
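The same thing as a runnable C snippet (using float; purely illustrative):

Code:
#include <stdio.h>

int main(void)
{
    float big = 1.0e8f, small = 1.0f;

    /* small is below half an ulp of big at float precision, so it is
       swallowed by the first add and the "recovered" value is 0, not 1. */
    float inaccurate = (big + small) - big;
    printf("%g\n", inaccurate);
    return 0;
}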
But if we assume sensible calculations, losing 1/2 a bit per calc is not a bad rule of thumb. I always liked the quote (can't remember who said it and this might be a misquotation)
Floating Point numbers are like piles of sand; every time you move one you lose a little sand, but you pick up a little dirt
 
It's arguable that double precision in the vertex shader could be substantially more useful than FP32 in the pixel shader.

There's probably quite a few guys in the professional (flightsim, etc. rather than film) space who would appreciate it already...
 
Reverend said:
Let's focus on FP24, FP32, DX9 and IEEE and what they all mean, without mentioning IHVs and parts.

Ok - sounds like a good approach.

Reverend said:
Dave said:
"R400" (or whatever is next) is likely still to be a DX9 part so that will probably stay FP24 as well, I would expect DX10 parts to have different base precisions.
I hope this is not the case -- ATi is the leader now, with a lot of folks buying their parts, which is something developers always take into consideration... ATi should start pushing things, like NVIDIA did with 32-bit, static TnL and DX8 shaders back then. I'm sure nobody will complain. But then, time and market penetration is a very important factor so if R400 still has a FP24 "limit", I'd understand the decision but I personally would wish for more.

Hmmm... interesting.

The situation here as I see it is this - the next clear step up from PS2.0 is really PS3.0, and my understanding is that PS3.0 is defined in the DX9 spec as having the same precision requirements as PS2.0.

Here is the quote from Microsoft's clarifying statement on the DXDev mailing list -

[from ps_2_0 section]
---Begin Paste---
Internal Precision
- All hardware that support PS2.0 needs to set
D3DPTEXTURECAPS_TEXREPEATNOTSCALEDBYSIZE.
- MaxTextureRepeat is required to be at least (-128, +128).
- Implementations vary precision automatically based on precision of
inputs to a given op for optimal performance.
- For ps_2_0 compliance, the minimum level of internal precision for
temporary registers (r#) is s16e7** (this was incorrectly s10e5 in spec)
- The minimum internal precision level for constants (c#) is s10e5.
- The minimum internal precision level for input texture coordinates (t#)
is s16e7.
- Diffuse and specular (v#) are only required to support [0-1] range, and
high-precision is not required.
---End Paste ---

For ps_3_0 the requirements are the same, however interpolated input
registers are now defined by semantic names. Inputs here behave like t#
registers in ps_2_0: they default to s16e7 unless _pp is specified (s10e5).

IHVs will be releasing PS3.0 parts, and those parts may well be making the step up to higher precisions that you desire, but as I see it here is the bind. Even if this is the case, developers still should not be assuming higher than 24 bit precision when coding, because that is what is specified.

You may get higher than 24-bit precision, which will be good, but we are in the situation that for the future of DX9 things are fixed at 24 bits and it is important that this is recognised by all parties so that the spec does not fragment.

As to the benefits of FP32 vs. FP24 in the short term I won't make any comment for the moment. My personal feeling is that it's the right balance for the current time (but then I would think that, wouldn't I ;)).

It will be interesting as we move forward to hear more from people developing shaders as and when they manage to run into any significant limitations caused by the use of 24-bit FP. At the moment it would seem to me to be rather early to be criticising the precision - after all, we stayed with at most 8 bits of guaranteed precision from well before DirectX even existed right through to DX8, which is really quite a long time. Even in DX8 the recommendation only went up to at least 9 bits for internal operations as I recall. Now in a single DX version we have made a huge step, so it seems appropriate to wait a bit for shader coding to catch up.

- Andy.
 
Reverend said:
For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.

You got it the wrong way round. It's when the light and the texel are close to each other relative to their distance from the origin that you've got problems. And squaring it doesn't double the number of error bits, but it can double the error (meaning adding one error bit).

But it's bad (wrt performance and precision) to do the calc that way. It's better to write square(magnitude(X)) as dot(X,X), or to do the latter in an inline function magnitude2(X).

The subtraction loses as many bits of precision as are needed to store the ratio between the light-to-texel distance and the light/texel-to-origin distance. I.e., if the light-texel distance is one unit, and they are 32 units away from the origin, then you'll lose 5 bits. If you lose too much precision (which certainly is possible if you're not careful), you can gain it back by moving part of the subtraction to the VS (using its higher precision). The VS can make sure that the PS gets a local coordinate system for (lightPos-texPos).
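A little C sketch of that bit counting, using float as a stand-in for the shader precision and double as the reference. The positions here are made up for illustration; for a roughly 32:1 ratio of distances the result comes out at about the five bits mentioned above.

Code:
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* "True" 1-D positions: roughly 32 units from the origin, under 1 unit apart. */
    double light = 32.1234567, texel = 31.2345678;

    float  diff_f = (float)light - (float)texel;  /* inputs stored at float precision */
    double diff_d = light - texel;                /* reference */

    double rel = fabs(diff_f - diff_d) / fabs(diff_d);
    printf("relative error %g, roughly %.0f good bits left out of %d\n",
           rel, -log2(rel), FLT_MANT_DIG);
    return 0;
}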


PS:
I'm interested in what the ATI (and other HW) developers say about not doing perfect rounding of exact calculations.
 
Simon F said:
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x (assuming it's implemented properly!!!). It's just that floating point is not a mathematical group, so the usual maths rules you expect, e.g. associativity, (a+b)+c = a+(b+c), or the distributive law, a*(b+c) = a*b + a*c, aren't guaranteed.

Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.

This is a core part of the discussion IMO. I don't want shader compilers to have to live under the same restrictions as C compilers have. Placing all this IEEE crap on the shader part will just be a huge roadblock in the development of graphics technology. If someone would seriously propose something similar to be written into an API spec I would actively oppose it. Or expressed in Carmackian terms, keep dumbass ideas that will haunt us for years out of the API. Reproducibility is only guaranteed by the API in one way: you should, on the same hardware and driver, get the same results if you do the exact same operations. This kind of repeatability is useful. Expecting mathematical operations to always return the exact same results is not very useful. In fact, I would support getting rid of all these kinds of restrictions on C compilers too. Since 99.99% of applications don't care about it, I think the default should be that these kinds of optimisations are valid. If your code requires that level of repeatability then you either have very odd needs or you just don't know how to write proper code.

As for the fragment pipeline, we have never had this level of repeatability there, and I certainly don't think we should introduce it, ever.

About FP24 vs FP32: sure, I want more precision if possible. But 32 is an arbitrary number, why not 33, it's even better? FP24 is pretty good for now; if I get more in the next generation I'm going to be happy, but I would be fine with it if it stayed at 24. They may choose to go 28 too, sounds like a good middle solution that's not too expensive but adds a little more precision for the few apps that would need it.

Edit:
Was also going to say, caring about the last bit or two of precision is kinda silly. Graphics is only half engineering, the other half is artistry. It's like bringing a microscope to the art museum and complaining about the quality of the work of Van Gogh.
 
Agreed. As long as it meets the minimum DX9 specs, that's good enough for me. Didn't mean to imply anything else. Just making the observation.
 
I fully agree with Humus. Shader precision is about producing artifact-less images, not about getting reproducible results down to the lsb.
 