FP - DX9 vs IEEE-32

If you need full IEEE compliance and control over evaluation order, you will have to do it in software and live with the fact that it's slower. Of course, there are much better alternatives than the reference rasterizer: swShader. :p

Else just be content. Your eyes can't see the difference anyway. Well I should be speaking for myself... :rolleyes:
 
Humus said:
Simon F said:
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x (assuming it's implemented properly!). It's just that floating point is not a mathematical group, so the usual maths rules you expect, e.g. associativity, (a+b)+c = a+(b+c), or the distributive law, a*(b+c) = a*b + a*c, aren't guaranteed.

Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.

This is a core part of the discussion IMO. I don't want shader compilers to have to live under the same restrictions as C compilers do.
Do you mean allowing the shader compiler to re-order the operations, e.g. assume associativity or the distributive law? That's risky. As I said, a certain IHV appeared to be using different calculations in the 'fixed T&L' part of the drivers depending on whether shading was on or off, and it definitely caused some major rendering errors (e.g. Z values changing and making objects flicker).
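The risk Simon F describes is easy to demonstrate: IEEE floating-point addition is deterministic, but it is not associative, so a compiler that reorders operations can change the result. A minimal sketch in Python (which uses IEEE-754 doubles; the particular constants are just the classic textbook example, not from this thread):

```python
# IEEE-754 addition is deterministic but NOT associative:
# reordering (a + b) + c into a + (b + c) can change the result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # evaluates 0.1 + 0.2 first
right = a + (b + c)   # evaluates 0.2 + 0.3 first

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

Each individual addition is perfectly reproducible; only the *ordering* changes the answer, which is exactly why reordering in a compiler is observable.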
 
A small comment from the scientific computing field.

Code that critically depends on the minutiae that Reverend brings up is effectively broken. You should never, ever write anything that makes those kinds of assumptions.

Assuming rounded rather than truncated results is pretty much as far as you can hope for. If you _need_ control, you should explicitly code for it, never leave it to the system to take care of for you.

Now, in scientific computing, codes tend to have very long lives and get ported all over the place, and are thus probably a worst case, but generally the experience should carry over.

Sireric explained nicely why FP24 is a good compromise for the tasks we ask of this hardware. If you do something else though and need fp32, by all means buy whatever supports it. But making the product significantly slower/costlier for some hypothetical benefit just doesn't make sense. The very same tradeoffs have been made on the CPUs you are currently running on.

BTW, the above should in no way be construed as endorsing general sloppiness when defining computational tools. From personal experience, I do however endorse extreme suspicion on the part of programmers as far as these issues are concerned. "Just don't count on it."

Entropy
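Entropy's advice — explicitly code for precision when you need it rather than trusting the system — can be illustrated with compensated summation, a standard scientific-computing trick. A hedged sketch in Python using the Neumaier variant of Kahan summation (the function name `neumaier_sum` and the sample data are mine, not from the thread):

```python
def neumaier_sum(xs):
    """Compensated (Neumaier/Kahan) summation: explicitly track the
    rounding error of each addition instead of trusting a naive '+'."""
    s = 0.0  # running sum
    c = 0.0  # running compensation for lost low-order bits
    for x in xs:
        t = s + x
        if abs(s) >= abs(x):
            c += (s - t) + x  # low-order bits of x were lost
        else:
            c += (x - t) + s  # low-order bits of s were lost
        s = t
    return s + c

data = [1e16, 1.0, -1e16]
print(sum(data))           # 0.0 -- the 1.0 is absorbed and lost
print(neumaier_sum(data))  # 1.0 -- explicitly coded for, so recovered
```

Naive summation silently drops the 1.0; the explicit version recovers it. Python's `math.fsum` does the same job, but writing it out shows what "explicitly code for it" means.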
 
Simon F said:
Humus said:
Simon F said:
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x (assuming it's implemented properly!). It's just that floating point is not a mathematical group, so the usual maths rules you expect, e.g. associativity, (a+b)+c = a+(b+c), or the distributive law, a*(b+c) = a*b + a*c, aren't guaranteed.

Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.

This is a core part of the discussion IMO. I don't want shader compilers to have to live under the same restrictions as C compilers do.
Do you mean allowing the shader compiler to re-order the operations, e.g. assume associativity or the distributive law? That's risky. As I said, a certain IHV appeared to be using different calculations in the 'fixed T&L' part of the drivers depending on whether shading was on or off, and it definitely caused some major rendering errors (e.g. Z values changing and making objects flicker).

The only place where such optimizations would cause problems is on the vertex position output, since it affects fragment depths. Otherwise it should be pretty safe. In the fragment pipeline I see no reason why it should ever be a problem.
 
Humus said:
The only place where such optimizations would cause problems is on the vertex position output, since it affects fragment depths. Otherwise it should be pretty safe. In the fragment pipeline I see no reason why it should ever be a problem.

what's with the 'fragment pipeline' and 'vertex pipeline'? - it all comes down to 'real' data ending up discrete (actually it's discrete data getting 'grossly more' discrete, so to say, but nevermind). so, until the very final color output of the very last 'pass' of the algorithm at hand you'd want as high error-proofness as possible (in consumer's terms - 'as money can buy'). saying that people don't scrutinize an artist's work under a microscope is not quite the analogy - microscopes deal w/ spatial not so w/ spectral precision, and with the latter you don't know if the artist wouldn't have liked the means to express his vision of a particular color even further than what the 'present art' allowed him. humans strive for perfection - and they wouldn't give it up if they had the means to achieve it (resources & time). in this regard, i'm perfectly fine with the dx9 ps/vs specs, but that does not mean i'm set with those for the rest of my life (any life span expectations aside).

ps: a pretty please w/ sugar on top goes to the well-respected ati employees who spend their well-deserved but sparse spare time to post on these forums - could you (arbitrarily) improve on the aniso algo for the next parts currently in design? i believe i'm speaking for those ppl of the mindset 'aniso should be rather costlier but nicer'. thank you.

pps: before anybody gets the wrong idea, my humble opinion is that r3xx is the best dx9 implementation by far for the time being. i just wish it could be a bit better ;)
 
Is ATI going with FP24 more akin to 3dfx going with 16-bit only in the Voodoo 3, or 3dfx going 16-bit only in the Voodoo1?

I think it's more akin to going 16-bit only with the Voodoo1. Sure, FP32 is nice to have, but the applications that would demand it and the technology to support it at fast speeds aren't here yet (I've yet to be convinced that the NV35 can do FP32 shaders at adequate speeds).

Also, when it comes to color precision, there are diminishing returns. The visible difference between FP24 and FP32 would be much smaller than between FP16 and FP24.
 
K.I.L.E.R said:
I would LOVE to see an instance where FP32 has an advantage over FP24.

Humus's Mandelbrot demo. It's basically a worst-case-scenario demo where precision errors can "spiral" out of control (pardon the pun :)). Even there you really have to zoom to see it. Kind of like the SS rotating floors for AF on the Radeon. The stuff we've seen so far suggests that FP24 can handle pretty much all that's out there to a very acceptable level.

Where I'm not sure is whether the same can be said of FP16. ATI and MS don't seem to think so, but Humus's demo can't really be used to judge that because it is a worst case scenario. The 3DMark demo did show differences too under close examination. The question is, does it fall more on the side of "worst case scenario" or "real games will see results like these"? The framerates are rather low, which implies intensive shaders that might not make it into DX9 games. Some people on this forum have said textures need the extra precision, but I don't have the knowledge to judge that.

In any case, we're just now starting to see DX8 pixel shader games. I think it's safe to say the r300 is the best DX9 design so far, but it's tough to say by how much without the games to compare.
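The "spiraling" in a Mandelbrot-style demo comes from iteration: the recurrence z &lt;- z*z + c feeds every rounding error back into the next step, so small precision differences compound. A rough sketch in Python, simulating a reduced-precision significand with `math.frexp`/`math.ldexp` (the 17-bit width matches the R300's FP24 mantissa as described later in the thread; the parameter c = -1.9 and the whole construction are my illustration, not Humus's actual demo):

```python
import math

def quantize(x, bits=17):
    """Round x to a significand of `bits` bits (crude low-precision model)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e, with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2**bits) / 2**bits, e)

c = -1.9        # a chaotic parameter of the real quadratic map z*z + c
z_full = 0.0    # iterated in full double precision
z_low = 0.0     # iterated with 17-bit rounding after every operation
for _ in range(30):
    z_full = z_full * z_full + c
    z_low = quantize(quantize(z_low * z_low) + c)

# After a few dozen iterations the two trajectories have diverged.
print(z_full, z_low, z_full != z_low)
```

A single iteration only differs in the last bits; thirty iterations of a chaotic map amplify that into a visibly different trajectory, which is why you "have to zoom" before the worst case shows up.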
 
Textures already do need extra precision.

If you consider the concept of a 'location' in a texture - well, the biggest textures are 2048x2048. That's 11 bits. But for smooth bilinear filtering, you have to have subtexel precision (because the bilinear interpolation factor is the fractional part of the texture coordinate). That's at least four more bits to be acceptable, and might be more like six.
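The bit-counting above can be checked directly: addressing a 2048-texel texture with subtexel precision needs more significand bits than FP16's 11 (10 stored plus the implicit leading 1), while 17 bits are comfortable. A sketch in Python that rounds a texture coordinate to a given number of significand bits (the rounding helper is my crude model, not how any particular chip addresses textures):

```python
import math

def round_to_bits(x, bits):
    """Round x to a floating-point significand of `bits` bits."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 2**bits) / 2**bits, e)

# Texel 2047 of a 2048-wide texture, halfway between texels:
coord = 2047.5  # needs 11 integer bits + 1 subtexel bit

print(round_to_bits(coord, 11))  # 2048.0 -- FP16-style: fraction lost
print(round_to_bits(coord, 17))  # 2047.5 -- 17-bit significand: exact
```

With only 11 significand bits, the bilinear interpolation factor (the fractional part) rounds away entirely at the far edge of a 2048-wide texture.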
 
That is exactly what sireric referred to when he wrote this a while back, in reference to R3*:

sireric said:
We don't have the 2^127 as the largest number, it's 1.999*2^63 -- Smallest is 2^-64. The range was deemed large enough for most items (1.8*10^19), while giving us 17b of mantissa, which is more than enough for texture lookup (2k texture requires 11b, plus 4b subprecision takes you to 15b -- The extra two bits improve precision in computations and reduce the probability of introducing errors in the max texture addressing computation) Our choice of 24b total was based on this -- enough to cover all texture addresses and most numerical items as well; a "good" balance, imho.
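sireric's numbers are consistent with a 1-sign / 7-exponent / 16-stored-mantissa layout (the exact split is my inference; he only gives the range and the 17 significand bits counting the implicit leading 1). A quick sanity check in Python:

```python
# Assumed FP24 layout: 1 sign + 7 exponent + 16 stored mantissa bits
# (17 significand bits counting the implicit leading 1).
stored_mantissa_bits = 16

largest = (2 - 2**-stored_mantissa_bits) * 2**63  # "1.999... * 2^63"
smallest_normal = 2.0**-64

print(largest)          # about 1.84e19, matching sireric's ~1.8*10^19
print(smallest_normal)  # about 5.4e-20, i.e. 2^-64
```

Both computed constants line up with the range he quotes, and 11 texture-address bits plus 4 subtexel bits indeed leave 2 spare bits inside a 17-bit significand.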
 
Simon F said:
Reverend said:
As some have said, thanks for your comments Eric. However...

sireric said:
About some misconceptions, and some comments:

1) IE^3 standards do not specify what should be returned for transcendental functions (sqrt, sin, etc.). They specify the format of data (including NaNs, infinities, denorms) and the internal rounding of results -- this rounding is not the f2i conversion, but how to compute the lsbs of the results. Different HW can return different results. People have learned to live with this.

Correct, but here you have a fine alternative: if you need a reproducible version of cos, you can implement it yourself as a Taylor series, using floating point add and mul, provided add and mul are deterministic. But if add and mul aren't deterministic, it's impossible to implement anything deterministically at all. The basic arithmetic ops are the building blocks.
Maybe I'm misinterpreting what you've written, but the ops are certainly deterministic, i.e. a*b always produces c on hardware x (assuming it's implemented properly!).
On a single machine, it's deterministic.
On all machines supporting DX9, no. NVIDIA's * function and ATI's * function are not the same function because NVIDIA's a*b and ATI's a*b differ.

It's just that floating point is not a mathematical group, so the usual maths rules you expect, eg associativity, (a+b)+c = a + (b + c), or the distributive law, a*(b+c)=a*b + a*c, aren't guaranteed.
Correct; the theoretical thing going on here is that floating point numbers form a "semi-field" rather than a field, because certain laws fail, such as associativity (a semifield is a data type equipped with addition, negation, multiplication, inverse, zero and one; a field is a semifield where all of the operations obey all of the associative, distributive, etc. laws). But at least IEEE defines the operations deterministically across machines.

Actually, this raises an interesting point - presumably C compilers are not allowed to make optimisations of this nature in FP calculations simply because they could change the behaviour of the code.
Whoa, so true. So, C compilers tend to have optimization options that you can turn on to let the compiler pretend that identities like (a+b)+c = a+(b+c) are true so it can rearrange your code to make it faster. Like most compilers' "assume no aliasing" optimization flag, this isn't strictly safe, but is usually good enough for most tasks. The difference here is that with C, the programmer can choose whether to do things precisely or quickly, whereas with DirectX9, the hardware has already decided for you.
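The "build a reproducible cos from deterministic add and mul" point above can be sketched as a Taylor series. This is an illustration only, not production-quality code: there is no range reduction, and the 1/(2k)! coefficients are precomputed constants so that each evaluation runs on nothing but adds and muls:

```python
import math

# cos(x) = sum over k of (-1)^k * x^(2k) / (2k)!
# Coefficients precomputed once; evaluation then uses only add and mul.
_COEFFS = [(-1)**k / math.factorial(2 * k) for k in range(10)]

def taylor_cos(x):
    """cos(x) via a 10-term Taylor series in Horner form.
    Accurate for |x| up to a few radians (no range reduction)."""
    x2 = x * x
    acc = 0.0
    for coeff in reversed(_COEFFS):
        acc = acc * x2 + coeff  # Horner step: one mul, one add
    return acc

print(abs(taylor_cos(1.0) - math.cos(1.0)) < 1e-12)  # True
```

Since every step is an IEEE add or mul, two machines with identical add/mul will produce bit-identical results from this routine, which is exactly the reproducibility argument being made.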
 
Basic said:
Reverend said:
For example, in a shader that says something like 1/square(magnitude(LightPosition-TexelPosition)), unless your light and texel are real close together, that subtract can easily have many bits of lsb error, and squaring that quantity then doubles the number of error bits.

You got it the wrong way. It's when the light and the texel are close to each other relative to the distance to the origin that you've got problems.
My bad, you're right. This case actually occurs, for example, when a player is in a small room with the light source, and the room is far away from the origin of the world.

And squaring it doesn't double the number of error bits, but it can double the error (meaning adding one error bit).
Yup, you're right.

But it's bad (wrt performance and precision) to do the calc that way. It's better to write square(magnitude(X)) as dot(X,X), or to do the latter in an inline function magnitude2(X).
The approaches are pretty similar. The problem occurs just as much in 1D as in 3D, for example (a-b)^2 where a and b are both large numbers that are almost equal.

The subtraction loses as many bits of precision as are needed to store the ratio between the light-to-texel distance and the light/texel-to-origin distance. I.e., if the light-texel distance is one unit, and they are 32 units away from the origin, then you'll lose 5 bits. If you lose too much precision (which certainly is possible if you're not careful), you can gain it back by moving part of the subtraction to the VS (using its higher precision). The VS can make sure that the PS gets a local coordinate system for (lightPos-texPos).
Yes, you can definitely reduce the amount of error by arranging calculations as carefully as possible, and moving certain things into the vertex shader (or doing them on the CPU in double-precision and passing the final results down to a VS). This all requires more programming effort of course. It also limits the generality of what you can set up. When you are writing a single pixel shader, you can look at the overall algorithm and manage its precision carefully.

But for example if you're writing a bunch of shader components that can be combined together to form pixel shaders (for example, a specular lighting module, a spherical harmonic module, attenuation modules, etc), you can't be so sure about how much precision will be lost as data is passed between the different routines, given that they can be plugged together arbitrarily, by artists. This is the essence of what engines are meant to do, not to provide a single shader or single feature, but a bunch of shaders that the content creators can piece together to achieve the effect they want.

I'm sure that, for all I've written thus far, the spirit of the arguments for FP24 (or other hardcoded hardware limitations in general) is always something like "for all the shaders we can think of, this isn't a problem. If you think there's a problem, send us a shader and we'll show you how to work around our limitations with it." The flaw in that logic is that it assumes isolated pieces of shader code matter, when what really matters is the set of all possible shaders an engine can generate. If you look at Max's or Maya's lighting models and material systems, they're all along these lines: not a single shader with a few knobs you can twiddle, but general frameworks for combining arbitrary other shader functionality.
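Basic's bit-counting rule can be demonstrated numerically: quantizing the positions to a 17-bit significand (my crude stand-in for FP24 fragment precision) destroys the light-to-texel difference once both are far from the origin, while doing the subtraction first at higher precision (the vertex-shader trick discussed above) preserves it. The positions and helper below are my example values, not from the thread:

```python
import math

def fp24ish(x, bits=17):
    """Round x to a 17-bit significand (rough FP24 stand-in)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)
    return math.ldexp(round(m * 2**bits) / 2**bits, e)

light_pos = 100000.5  # light and texel only 0.5 units apart...
texel_pos = 100000.0  # ...but ~100000 units from the origin

# Subtracting after quantization: the fraction was already rounded away.
bad = fp24ish(light_pos) - fp24ish(texel_pos)

# Subtracting first at higher precision (e.g. in the vertex shader),
# then quantizing the small local difference: exact.
good = fp24ish(light_pos - texel_pos)

print(bad)   # 0.0
print(good)  # 0.5
```

This is the catastrophic-cancellation scenario from the small-room-far-from-origin example: the order in which you quantize and subtract decides whether the lighting vector survives at all.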
 
antlers4 said:
Is ATI going with FP24 more akin to 3dfx going with 16-bit only in the Voodoo 3, or 3dfx going 16-bit only in the Voodoo1?

I think it's more akin to going 16-bit only with the Voodoo1. Sure, FP32 is nice to have, but the applications that would demand it and the technology to support it at fast speeds aren't here yet (I've yet to be convinced that the NV35 can do FP32 shaders at adequate speeds).

Also, when it comes to color precision, there are diminishing returns. The visible difference between FP24 and FP32 would be much smaller than between FP16 and FP24.
That's the thing... the entire chicken-and-egg scenario. Sure, the apps that demand it aren't here yet, but if we have a long timeframe where FP24 hardware is the majority of the video cards out there, it will be even longer before we see such apps than if FP32 hardware had debuted instead of FP24 in the first place. That's logical and makes business sense to developers who sell games.

Obviously, it all comes down to performance when you make a piece of hardware. But the point of my starting this thread really isn't about slower FP32 performance compared to FP24 -- it was simply about instances where I think FP24 has definite disadvantages compared to FP32, and I wanted others to confirm whether my understanding and thinking about this is correct or not, because I have never had much faith in myself when I see and know there are so many folks here more knowledgeable about coding and hardware than myself :)
 
Even though I get a feeling this comment is going to bite me in the ass sometime in the future: precision be damned! The biggest limitation that I'm facing with the R300 is purely instruction counts!

More instruction slots and more registers are needed right now, not really more precision. Of course, GFFX has all three, so Nvidia at least did something right with it.
 
It seems the R3xx's latest incarnation (R350) supports an unlimited number of instructions (in the fragment shader) via the F-buffer, although I'm not sure if the functionality is currently exposed in the drivers.
 
Reverend said:
That's the thing... the entire chicken-and-egg scenario. Sure, the apps that demand it aren't here yet, but if we have a long timeframe where FP24 hardware is the majority of the video cards out there, it will be even longer before we see such apps than if FP32 hardware had debuted instead of FP24 in the first place. That's logical and makes business sense to developers who sell games.

Yes, but I prefer to look at it from a much more practical view: going from FP24 to FP32 can't really take much of an effort for developers when you look at how much they had to upgrade their skills to write shaders in the first place (PS 1.1 - 1.4) and to work with FP (PS 2.0) the second time around.

And then Colourless brings up the crucial point of what you want the IHVs to include in their silicon budget (gotta love that word): do you really want them to use up so much space for FP32 when we are still at the very start of cinematic rendering (when, as Colourless mentions, we need more instruction slots and registers)?

In other words: I sincerely doubt that the industry will stop in its tracks if we don't see all IHVs doing FP32 before DX10. ;) Just for the record: I think ATI made the right decision with the R300 for all us non-developers, while I can see why nVidia wanted developers to have the opportunity to mess around with the future today.

I know this isn't the point you're making - I don't care about IEEE standards in my games :p - but I just like to keep part of the discussion within the constraints of reality (the given silicon budget). IMHO.
 
Reverend said:
On a single machine, it's deterministic.
On all machines supporting DX9, no. NVIDIA's * function and ATI's * function are not the same function because NVIDIA's a*b and ATI's a*b differ.

And that's the way it should be. We have never had any more determinism than that, and we should not enforce it, because it's basically useless and a heck of a burden to put on the shoulders of IHVs and, in the end, on the customers.

Reverend said:
Whoa, so true. So, C compilers tend to have optimization options that you can turn on to let the compiler pretend that identities like (a+b)+c = a+(b+c) are true so it can rearrange your code to make it faster. Like most compilers' "assume no aliasing" optimization flag, this isn't strictly safe, but is usually good enough for most tasks. The difference here is that with C, the programmer can choose whether to do things precisely or quickly, whereas with DirectX9, the hardware has already decided for you.

In OGL2 there have been talks about providing ways to turn optimizations off, though I don't know the status of that. That should satisfy everyone. For shaders, optimisations should default to on.


Either way Reverend, you haven't explained why exactly 32 bits is significant. It's an arbitrary number just like every other. Assume ATI had provided fp32 already; this whole discussion would still apply, except with all numbers += 8. The same argument could be made that "why don't we have fp40, there are applications that could use it".
 
LeStoffer said:
Reverend said:
That's the thing... the entire chicken-and-egg scenario. Sure, the apps that demand it aren't here yet, but if we have a long timeframe where FP24 hardware is the majority of the video cards out there, it will be even longer before we see such apps than if FP32 hardware had debuted instead of FP24 in the first place. That's logical and makes business sense to developers who sell games.

Yes, but I prefer to look at it from a much more practical view: going from FP24 to FP32 can't really take much of an effort for developers when you look at how much they had to upgrade their skills to write shaders in the first place (PS 1.1 - 1.4) and to work with FP (PS 2.0) the second time around.
There is no additional effort (FP24 -> FP32) if you know exactly what you aim for -- FP32 is available to me, I know what it offers and what its limitations are, and I work on that from the very start... this isn't about "upgrading". All of my postings in this thread are based on using FP32 -- I can't do this (which is important to me, for what I have in mind, which, as OpenGL guy pointed out in a hidden way, doesn't matter) with FP24. I don't know if what I want is important, nor what a game developer may want to do, of course.

And then Colourless brings up the crucial point of what you want the IHVs to include in their silicon budget (gotta love that word): do you really want them to use up so much space for FP32 when we are still at the very start of cinematic rendering (when, as Colourless mentions, we need more instruction slots and registers)?
Do I want them to? Yes I do. But I don't have/need to consider competition and I don't work for an IHV :)

In other words: I sincerely doubt that the industry will stop in its tracks if we don't see all IHVs doing FP32 before DX10. ;)
This is rather silly -- of course the "industry" won't stop because of this.

Just for the record: I think ATI made the right decision with R300 for all us non-developers, while I can see why nVidia wanted the developers to have the opportunity to mess around with the future today.
Perhaps all that I have written is based on the fact that the R300 is a resounding success -- and usually when I see a resounding success, I start thinking "Why didn't they do this in the first place?" Kinda like asking for a mile when I am given an inch :)
 
Humus said:
Either way Reverend, you haven't explained why exactly 32 bits is significant. It's an arbitrary number just like every other. Assume ATI had provided fp32 already; this whole discussion would still apply, except with all numbers += 8. The same argument could be made that "why don't we have fp40, there are applications that could use it".
The entire point of starting this thread is based on DX9 and IEEE-32, both available standards. It's not about "XX bits" nor an additional 1 bit -- it's about the two standards I know of, which I offered as the basis for this discussion. If I followed your way of thought, this thread wouldn't exist -- nothing is ever enough.

You appear not to know the basis of my wanting to start this discussion, which was very specific -- it's about FP24 and FP32, nothing more than that -- and you have digressed into "But what is enough for you, Rev?", which isn't what I want to talk about. I gave specific examples of why I want FP32, and not FP24 -- not why I always want more. I have explained why FP32 is significant to me (to me, to me, TO ME ALONE! :) ) compared to the availability of FP24. I have not explained why FP32 is enough as a distinct floating point spec (32 bits) because that would be pointless -- as I said, nothing is ever enough once you get more creative. I am simply comparing FP32 and 32 bits alone against FP24 and 24 bits. Hope this is clear.

If you want me to stick to talking about "what's available", I would have nothing to say and would live with what's available because, well, that's all I can do, right?
 
Then what's this talk about reproducibility all about? If someone goes to fp40, then any reproducibility is once again kicked out of the window. Arguing for a particular precision is odd IMO, be it 32, 24 or anything else. It's simply: more precision => better (assuming the same performance).
 
As is usually the case in any thread, things get sidetracked -- I didn't bring up reproducibility. Well, actually I did, but I had to, in response to sireric's first post in this thread, hehe :)

I can tell you one thing though -- I already know why I want more than FP32... but that'll have to be in another thread. And another time where I'll be damned for wanting more than what is the "API" standard. :)
 