FP - DX9 vs IEEE-32

Reverend

I had been playing around a lot with an R300 and an NV34, and something that has continuously stayed at the back of my mind is the apparent murky differences between DX9's FP "specs" and those of IEEE-32.

The thing that primarily stood out to me was the R300's 24-bit FP internal pixel pipeline vs IE3-32. The more I think about it, the more I favour the thought that the R300's designers wanted a DX9-compliant feature yet they definitely do know what IE3 is. Maybe it's a combination of transistor budget, a trusted process technology they feel they must go with and, well, performance. The R300 should've been IE3-32, not just DX9-minimum-spec compliant, IMHO.

Why?

IEEE defines all basic arithmetic operations (add, subtract, multiply, divide, square root) as producing the properly rounded 32-bit version of the (theoretical) exact result. No problems here with DX9.

But when you get into compound operations like mulad, dp4, etc., the DX9 spec doesn't specify an "order of operations", so the result may differ depending on (for example) whether mulad is round(a+round(b*c)) or round(a+b*c), etc. Further, the trig operations aren't well-defined at all, so implementations may differ. In this day and age, when much of the functionality you'd expect (such as blending of floating-point textures) remains unimplemented, this is not at all a limiting factor. But in a few years, when the more dire problems have been solved, this one will start poking up as a problem. Eventually, people will expect basic hardware operations to produce precisely-defined results, regardless of hardware manufacturer, control panel settings, time of day, etc.
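
(To make that concrete, here's a minimal C sketch of the two orderings; the input values are hypothetical, picked only so that the single-bit difference shows up at fp32. fmaf() stands in for a hardware unit that doesn't round the intermediate product.)

Code:
#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = -1.0f;
    float b = 1.0f + ldexpf(1.0f, -12);   /* 1 + 2^-12 */
    float c = b;

    /* round(a + round(b*c)): the product is rounded to fp32 first */
    float product  = b * c;               /* exact 1 + 2^-11 + 2^-24 rounds to 1 + 2^-11 */
    float separate = a + product;         /* 2^-11          = 4.88281250e-4 */

    /* round(a + b*c): one rounding of the exact result (fused multiply-add) */
    float fused = fmaf(b, c, a);          /* 2^-11 + 2^-24 ~= 4.88340855e-4 */

    printf("separate: %.9g\n", separate);
    printf("fused:    %.9g\n", fused);
    return 0;
}

Same inputs, same nominal instruction, two different answers, and both are legal under DX9 as written.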

Obviously, the R300 is a good part and ATi probably felt that their decision to go with FP24 internally was a good choice, and I wouldn't really disagree with that because it is the first available "DX9 part". Hey, we gotta start somewhere! However, one has to wonder why they didn't have full FP32 internally and, via drivers, do some possible, er, optimizations depending on the scenario, but leave the options open.

Am I asking for too much from the R300 given the current timeframe? Am I being a bit critical of DX9 in this aspect?
 
I remember that ATI actually did some similar funk with the Rage. They seemed to take less of a performance hit going from 16-bit to 32-bit (framebuffer), but it turned out that they were cutting corners on depth precision. It's not really a _big_ issue, but it just highlights that they have done similar things in the past.

FP24 is a good middle ground (and other parts of the architecture are good performance wise), but you're right about the IEEE thing.

Edit: 16->32bit compared to nvidia's chips
 
I'd just like to take this time to gripe about a major problem with DX9:

The documentation is crap! Ambiguities like you just spelled out should not exist. It would be just so much easier for everybody involved not to have to rely upon the reference rasterizer to check for adherence to the spec.
 
I hate me too posts, but I have to say that I agree.

There are a lot of gotchas in floating point programming, and IEEE at least hides some of them. It isn't really needed for standard scanline graphics, but it enables the gfx card to do other fun algorithms like matrix solvers and ray tracing. It turns it into an old school math co-processor. I wish that ATI could have crammed 32-bit into the R300, but I would rather have had the R300 out when it was than wait until they could fit 32-bit in as well.

I must say that even R300 was more than I was expecting. I expected 16bit/channel (64bit framebuffer) to be an intermediate step before "real" floating point came along.

If you think about it, there weren't too many artifacts at 8bpp; did we really need 16.7M times more accuracy? Or even 65K times?

Edit: Oops. I meant 8bpc, not 8bpp. :oops:
 
Me said:
If you think about it, there weren't too many artifacts at 8bpp; did we really need 16.7M times more accuracy? Or even 65K times?

We do, but some people are getting their knickers in a knot over precision and what it means for Image Quality in terms of the games they're playing (or the benchmarks they're playing *snicker* :LOL: ).

We don't have floating point framebuffers yet, but render-to-texture stuff is just as useful.
 
Umm, what happens when you've got a fixed palette? 8bpp images go downhill pretty quickly; when you can define the palette index yourself, 8bpp looks pretty good. Still, even with a 24bpp framebuffer (z/stencil don't count towards image/alpha) I still see banding in games at times, which annoys me. The biggest place I see it is with fog (eww, it's horrible), so I'm still waiting for those 12-16 bit DACs and 12+ bit framebuffers.
 
I think full IEEE-754 compliance is a bit too much for pixel shaders. The traps, for example, are not too useful for most pixel shaders. However, I think it's important to have good precision (24 bits is good, but 32 bits is better, of course).

16-bit FP is enough for computing colors. However, you need to store things other than colors in pixel shaders. The most important example is perhaps position. 16-bit FP is also not enough for texture addressing.

To ensure good behaviour, I think it's important to have a guideline (or specification) for every instruction in shaders. IEEE 754 only specifies a handful of basic operations (add, sub, mul, div, and sqrt). 754R added fma, but shaders have more (such as dp3/dp4). The shader specification should specify how they round, their precisions, what to do on overflow or underflow, etc. These are specified in 754, but we don't need them to be that complete. However, we also have to extend 754 for other "basic" operations such as dp3.

Currently it doesn't seem that important to specify shader operations precisely. However, if you don't want to limit shaders to only the simplest work, you'll need to do so and force all IHVs to obey these guidelines.
 
Reverend said:
whether mulad is round(a+round(b*c)) or round(a+b*c), etc.

Why would there be rounding at all? If a, b, and c, along with all the intermediate results, are 24-bit, then a+b*c = a+(b*c). But if you're saying that the R300 does the intermediate stuff at 32-bit and converts the results to 24-bit, then yes, all the math would stop making numerical sense (granted, probably never enough to notice).

I guess what I'm asking is, care to explain the quoted line further? :)
 
Me said:
If you think about it, there weren't too many artifacts at 8bpp; did we really need 16.7M times more accuracy? Or even 65K times?
For two things:

1. Non-color data
2. Very long shaders that don't hide errors well (Mandelbrot sets, for example).
 
Ilfirin said:
Reverend said:
whether mulad is round(a+round(b*c)) or round(a+b*c), etc.

Why would there be rounding at all? If a, b, and c, along with all the intermediate results, are 24-bit, then a+b*c = a+(b*c). But if you're saying that the R300 does the intermediate stuff at 32-bit and converts the results to 24-bit, then yes, all the math would stop making numerical sense (granted, probably never enough to notice).

I guess what I'm asking is, care to explain the quoted line further? :)
If a and b are 24-bits, then a+b can require more than 24 bits of precision. Here's a real (dumb) example: a and b are both 1 bit. If a = b = 1, then a + b = 10, which cannot be represented in a single bit. Here's another example: a and b are 2 bits, a = b = 11. Then a * b = 1001.

This is why most (many/all?) floating point units usually carry a little extra precision for intermediate results. However, when the data is written to memory, that extra precision is lost.
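
(A trivial C illustration of that last point, using double as a stand-in for the wider internal datapath and float as the storage format; the values are hypothetical, chosen just to show the extra bit falling off on the store.)

Code:
#include <stdio.h>

int main(void)
{
    float a = 16777216.0f;               /* 2^24 */
    float b = 1.0f;

    /* intermediate result kept at higher precision than the operands */
    double wide = (double)a + (double)b; /* 16777217 exactly */

    /* writing it out at storage precision rounds the extra bit away */
    float stored = (float)wide;          /* back to 16777216 */

    printf("wide:   %.1f\n", wide);
    printf("stored: %.1f\n", stored);
    return 0;
}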
 
OpenGL guy said:
If a and b are 24-bits, then a+b can require more than 24 bits of precision. Here's a real (dumb) example: a and b are both 1 bit. If a = b = 1, then a + b = 10, which cannot be represented in a single bit. Here's another example: a and b are 2 bits, a = b = 11. Then a * b = 1001.

This is why most (many/all?) floating point units usually carry a little extra precision for intermediate results. However, when the data is written to memory, that extra precision is lost.

Oh right, I knew there was something, I just couldn't think of it (was thinking in terms of fractions, not overflows) while reading (hey, it's 3:45AM here).

Well, with that aside, shouldn't a MAD, or any other instruction (macros would of course be different) be done totally at full precision, with only the final answer rounded down to 24-bit?
 
And to answer the last part:

DX9 fp32 is stored in IEEE 754 format in framebuffers/textures. (Fp24 is converted to fp32 before it's stored.)
The calculations are not IEEE 754 compliant.
 
Reverend said:
I had been playing around a lot with an R300 and an NV34, and something that has continuously stayed at the back of my mind is the apparent murky differences between DX9's FP "specs" and those of IEEE-32.
Are you discussing the VS, PS, or both? The PS would need to be a lot faster (i.e. more highly parallel) than the VS and so is more likely to use lower precision (for silicon reasons). Besides, the accuracy of the VS is probably more critical anyway.

The more I think about it, the more I favour the thought that the R300's designers wanted a DX9-compliant feature yet they definitely do know what IE3 is.
Rev, I think (but don't quote me!) that today's silicon compilers have all the IEEE mul/add etc. functionality built in if you want to use it, so it's probably just an issue of how much silicon area you want to use.

IEEE defines all basic arithmetic operations (add, subtract, multiply, divide, square root) as producing the properly rounded 32-bit version of the (theoretical) exact result. No problems here with DX9.
I think DX9 specifies a lower minimum accuracy so that the chips don't get ridiculously large.
But when you get into compound operations like mulad, dp4, etc., the DX9 spec doesn't specify an "order of operations", so the result may differ depending on (for example) whether mulad is round(a+round(b*c)) or round(a+b*c), etc.
This is certainly one area where you don't want to follow the ref-rast. Its DP4 order of operations is exactly the one you wouldn't want to use in a hardware implementation!

Anyway, it's not a problem to have HW 'A' and HW 'B' use slightly different ordering provided each always uses a consistent ordering. For example, a little while ago, a leading manufacturer's drivers used different transform operations depending on whether lighting was turned off or on, which led to Z sorting issues.

Further, the trig operations aren't well-defined at all, so implementations may differ. In this day and age, when much of the functionality you'd expect (such as blending of floating-point textures) remains unimplemented, this is not at all a limiting factor. But in a few years, when the more dire problems have been solved, this one will start poking up as a problem. Eventually, people will expect basic hardware operations to produce precisely-defined results,
I'm sure that in a few years the next generation of specifications will tighten this up. I don't think it's necessary now.


Obviously, the R300 is a good part and ATi probably felt that their decision to go with FP24 internally was a good choice, and I wouldn't really disagree with that because it is the first available "DX9 part". Hey, we gotta start somewhere! However, one has to wonder why they didn't have full FP32 internally and, via drivers, do some possible, er, optimizations depending on the scenario, but leave the options open.
IANAHE, but I suspect that because multiplication is possibly an O(n^2) operation, a change from, say, a 16-bit to a 24-bit mantissa probably represents a 2.25x increase in area (since (24/16)^2 = 1.5^2 = 2.25). That's a big jump.
 
The shader spec should IMO be left as flexible as possible. It shouldn't matter what happens in the least significant bits. A shader that relies on exact operation is faulty. Everyone knows that floating point has some twitchiness, even on CPUs with all the IEEE standards cracking down on it. You typically don't do exact compares, such as "if (a == b)" statements.

I want the driver and hardware to have as much flexibility as possible. If I write
float a = b / 3;
then I'm perfectly fine with it composing that into a
MUL a, b, 0.3333333333
I would be disappointed if it inserted RCP instructions and slowed down my performance.
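
(For what it's worth, the two forms really can differ in the last bit, which is exactly why relying on the lsbs is a bad idea either way. A hypothetical C sketch:)

Code:
#include <stdio.h>

int main(void)
{
    float b = 5.0f;

    float via_div = b / 3.0f;         /* correctly rounded divide */
    float via_mul = b * 0.33333334f;  /* reciprocal-multiply, as a driver might substitute */

    /* prints 1.66666663 vs 1.66666675 -- a one-lsb difference */
    printf("div: %.8f\nmul: %.8f\n", via_div, via_mul);
    return 0;
}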
 
There's also all the NaN (not a number) stuff in IEEE that is completely useless in a shader, since you don't want pixels to be void, but rather just saturate values to max, min, or round to zero. This is true for both vertex and pixel shaders.

Cheers
Gubbi
 
Yes, even though the FX uses the IEEE float format for FP32, I'd be really surprised if it did all its calculations to the IEEE standards.
 
About some misconceptions, and some comments:

1) IE^3 standards do not specify what should be returned for transcendental functions (sin, cos, etc.). They specify the format of the data (including NaNs, infinities, denorms) and the internal rounding of results -- this rounding is not the f2i conversion, but how to compute the lsbs of the results. Different HW can return different results. People have learned to live with this. If you need 24b of mantissa precision, FP32 is not enough for you anyway.

2) IE^3 does not guarantee, in any way, that operations are order independent. For example, consider:
result = a + b + c
The above generally needs to be broken down into 2 operations. If we select a = 1.0, b = -1.0 and c = 2^(-30), then the result can be 0 or c, depending on the implementation. IE^3 does not "specify" anything; the programmer needs to specify what he wants (i.e. (a+b)+c).
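
(Those numbers check out in plain C, if anyone wants to see it happen -- at fp32 here rather than fp24:)

Code:
#include <math.h>
#include <stdio.h>

int main(void)
{
    float a = 1.0f, b = -1.0f;
    float c = ldexpf(1.0f, -30);   /* 2^-30 */

    float left  = (a + b) + c;     /* 0 + c                     -> c   */
    float right = a + (b + c);     /* b + c rounds back to -1.0 -> 0.0 */

    printf("left:  %g\n", left);
    printf("right: %g\n", right);
    return 0;
}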

2.5) IE^3 support for NaNs, inf and denorms is just not needed in the PS. In the final conversion to the [0, 1.0] range, inf and 2.0 would give the same output. For that matter, it's not needed in the VS either.

3) FP24 has less precision than FP32, but has no other worse characteristics (well, range is reduced by 2^63 as well). Order of operations would be no "worse" than fp32, beyond the precision limits.

4) What are the outputs of the shader? There are two: the 10b color or the 11b texture address (actually, with subtexel precision for filters, you could expect the texture address to be up to 15b). With FP24, you get 17b of denormalized precision, which would allow you up to 8 ALU instructions (assuming 1/2 lsb of error per operation) at maximum error rate before even noticing any change in the texture address or texture filtering. Until texture sizes increase significantly (2kx2k right now will give you a 16MB texture -- I don't know of many apps that even use that now), or texture filtering on FP textures exists, there really is no need for added precision. On the other hand, it's obvious that FP16 is not good enough unless you have smallish textures (i.e. < 512x512); there's a rough sketch of that arithmetic below, after point 6.

5) The only exception to 4 above is long procedural texture generation. In those cases you could expect some fp24 limits to come into play. There's no real use of that out there right now, and we are probably years from seeing it in the mainstream. Nevertheless, our Ashli program can take procedural texture generation code from Renderman and generate long pixel shaders. We've generated shaders that are thousands of instructions long. What we found is that the output looks perfect compared to the original image. No one would ever complain about that quality -- it's actually quite amazing. So, empirically, we found that even in these cases FP24 is a nearly perfect solution.

6) FP32 vs. FP24 would not only be 30% larger from a storage standpoint, the multipliers and adders would be nearly twice as big as well. That added area would have increased cost and given no quality benefits. We could have implemented a multi-pass approach to give higher precision, but we felt that simplicity was a much bigger benefit. Given the NV3x PP fiasco, we believe even more strongly now that a single precision format is much better from a programming standpoint (can anyone get better than FP16 on NV3x?).
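
(A rough C sketch of the texture-address arithmetic in point 4. round_mantissa() is a made-up helper that just rounds a value to a given number of significant bits; it's not how real fp16/fp24 hardware behaves, only a way to see where the subtexel bits fall off.)

Code:
#include <math.h>
#include <stdio.h>

/* Round x to 'bits' significant (mantissa) bits, ignoring exponent range.
   Crude stand-in for storing x in an fp16- or fp24-like format. */
static double round_mantissa(double x, int bits)
{
    int e;
    double m = frexp(x, &e);            /* x = m * 2^e, 0.5 <= |m| < 1 */
    double scale = ldexp(1.0, bits);
    return ldexp(nearbyint(m * scale) / scale, e);
}

int main(void)
{
    /* a texel address near the far edge of a 2kx2k texture, plus a subtexel fraction */
    double addr = 2047.0 + 0.25;

    printf("fp16-ish (11 significant bits): %.4f\n", round_mantissa(addr, 11)); /* 2047.0000 - fraction gone */
    printf("fp24-ish (17 significant bits): %.4f\n", round_mantissa(addr, 17)); /* 2047.2500 - intact       */
    return 0;
}

With only 11 significant bits, integer texel coordinates are only exact up to 2048, never mind the subtexel bits; with 17 you have headroom to spare at today's texture sizes.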

At the end of it all, we believe we made the right engineering decision. We weighed all the factors and picked the best solution. DX9 did not come before the R300 and specify FP24; we felt that FP24 was the "right" thing, and DX9 agreed.
 
sireric said:
6) FP32 vs. FP24 would not only be 30% larger from a storage standpoint, the multipliers and adders would be nearly twice as big as well.

Interesting info. The FP24 choice obviously had a lot to do with your silicon budget, but I didn't realize FP32 was that much more demanding than FP24.
 