In my opinion, Shader Model 3.0 is a huge step forward compared with Shader Model 2.0. Shader Model 3.0 adds dynamic branching in the pixel shader, and while it's not required, I'm expecting the major IHVs to provide a complete orthogonal solution (FP16 texture filtering/blending) in their next HW iteration.
At this point (things may change), I'm expecting that Splinter Cell - X will only support SM 1.1 and SM 3.0 when it comes out:
- Very significant performance improvement because of dynamic branching in the PS unit.
- Orthogonal FP16 operations.
- Market (can't discuss that yet).
SM 3.0 is going to be good enough for some time. There is only one big step left (before GPUs start evolving just like CPUs --> performance only) that should allow classic global illumination algorithms to run efficiently on GPUs. I doubt SM 4.0 will provide that.
_________________
Dany Lepage
Lead Programmer
Splinter Cell - X
UbiSoft Montreal
Ailuros said: Is that New Year in Turkey or New Year with a turkey?

I most certainly do not tend to hang out with my dinner.
radar1200gs said: Given the insistence by some members of this forum that FP16 is not a valid part of the DX9 spec, I'd like to hear opinions on why, if this was the case, Microsoft would bother doing anything with it, and why they didn't instead apply the changes to FP24 (since, if you listen to the fanboys, FP24 is the logical format, obviously superior to anything else out there)?

Because what is being talked about here is the external floating-point formats, which have always been FP16 and FP32 in DX9 - nothing new in that. Also, people aren't arguing that FP16 isn't part of the spec (or at least they shouldn't be) - when the _pp hint is set, the spec defines FP16 as the minimum precision. What it isn't legal for a driver to do is to use FP16 under any circumstances when the _pp hint isn't set, whether it believes it will affect the output noticeably or not.

FP24 is the standard internal high-precision format of the shaders - in terms of external memory accesses it makes sense to restrict the widths of the data to powers of two, but internally you have much more freedom, and so you can be more flexible in the tradeoffs between silicon cost and overall precision.

DemoCoder said:
andypski said: What it isn't legal for a driver to do is to use FP16 under any circumstances when the _pp hint isn't set, whether it believes it will affect the output noticeably or not.
What about strength reductions or substitutions that won't affect the output at all? There are probably a few cases where the compiler can make conservative substitutions, especially on short shaders that do hardly anything, deal only with integer textures, and write to normal integer framebuffers.

Whether that is acceptable or not is really up to Microsoft to define as the owners of the API. Without guidance on this, the only thing you can say is that _pp can be run at partial precision, and non-_pp must be run at at least FP24. Perhaps Microsoft would be inclined to allow such substitutions - but I expect that situations where this is permitted would need to be very clearly defined - clearly any lower-precision substitution that could in any way affect the output value is obviously invalid.
Agreed, but it is nice when HW supports an extended FP32 precision if you are doing some scientific rendering or non-games stuff. I coded up some scientific algorithms on the GPU a few months ago as an experiment, and it kinda sucked that my accuracy turned out a lot worse than a C program's because my parameters were getting truncated.

Yes - naturally we will see support for higher-precision formats coming along as VPUs become used more frequently in applications outside of entertainment, and also just as a natural consequence of advances in technology.
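As a rough CPU-side illustration of the truncation effect being described (my sketch, not the actual GPU experiment), accumulating the same series in FP32 and FP64 shows how the lower-precision result drifts:

/* Sums the harmonic series in single and double precision.
   The FP32 result drifts because terms below the ulp of the
   running sum get rounded away. Illustration only. */
#include <stdio.h>

int main(void)
{
    float  sum32 = 0.0f;
    double sum64 = 0.0;

    for (int i = 1; i <= 1000000; ++i) {
        sum32 += 1.0f / (float)i;
        sum64 += 1.0  / (double)i;
    }

    printf("FP32: %.7f\n", (double)sum32); /* low digits off */
    printf("FP64: %.7f\n", sum64);         /* about 14.392727 */
    return 0;
}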
andypski said: What Dany is talking about here is a move towards treating external floating point formats as first class citizens - ie. having all the same capabilities as current integer formats with respect to blending and filtering, whereas in current hardware floating point formats can typically only be point-sampled and not blended. Initially Dany states that he expects to see this for FP16 formats because it is cheaper to do the filtering and blending on FP16 than FP32.

Yes, I'm really hoping that we get full FP16 framebuffer (and texture) support.
DemoCoder said: I agree. I brought up the issue of needing a "hinting" mechanism for the driver a while ago, due to the fact that the drivers now contain optimizers, and sometimes you need to switch optimizations off, especially if the optimizer is doing something bad on some pathological case.

Yeah, it would be really nice to be able to just tell the compiler, "Just use whatever precision you think won't affect the output," and not worry about it. Then just switch some optimizations off if it starts to look ugly.
Chalnoth said: Yeah, it would be really nice to be able to just tell the compiler, "Just use whatever precision you think won't affect the output," and not worry about it. Then just switch some optimizations off if it starts to look ugly.

Heh. I've used a few compilers like that. Except you have to switch off some optimizations at random times when the compiler doesn't like whatever construct you've come up with.
RussSchultz said: Heh. I've used a few compilers like that. Except you have to switch off some optimizations at random times when the compiler doesn't like whatever construct you've come up with.

It's not like it's something you'd be forced to use. Remember that this would be a substitute for paying close attention to which precisions are needed where.
From an engineering/programming standpoint, that sucks. It should work as designed, per spec, all the time, not some indeterminate output.
RussSchultz said:
Chalnoth said: Yeah, it would be really nice to be able to just tell the compiler, "Just use whatever precision you think won't affect the output," and not worry about it. Then just switch some optimizations off if it starts to look ugly.
Heh. I've used a few compilers like that. Except you have to switch off some optimizations at random times when the compiler doesn't like whatever construct you've come up with.
From an engineering/programming standpoint, that sucks. It should work as designed, per spec, all the time, not some indeterminate output.

Agreed. It can be just as hairy when a particular CPU architecture (actually, let's cut straight to the chase - it's the x86) decides it's going to use higher precision just because you've given the C compiler more opportunity to optimise the code.
Simon F said: Agreed. It can be just as hairy when a particular CPU architecture (actually, let's cut straight to the chase - it's the x86) decides it's going to use higher precision just because you've given the C compiler more opportunity to optimise the code.

The x86 will always use 80-bit FP calculations unless you're using some of the SIMD instructions. But that shouldn't be a problem unless you're using the equality operator, and not using the equality operator on floats was one of the first things I learned in programming classes. It's just not something you do.
Chalnoth said: The x86 will always use 80-bit FP calculations unless you're using some of the SIMD instructions. But that shouldn't be a problem unless you're using the equality operator, and not using the equality operator on floats was one of the first things I learned in programming classes. It's just not something you do.

I've mentioned this before, but I had no end of problems when I optimised/rewrote some floating-point-intensive code used in the texture compressor. When you are trying to determine if something converges, and sometimes it's computed at 32-bit and at other times at 80-bit, you get no end of problems. A complete nightmare.

Chalnoth, I suggest you try writing an SVD routine and see what happens.
Hyp-X said:
Chalnoth said: The x86 will always use 80-bit FP calculations unless you're using some of the SIMD instructions.
Wrong.
You can set the calculating precision of the FPU in the control word to 32, 64 or 80 bits.
For example, D3D sets 32-bit precision (for the entire program!!!).
The flags are set to 64-bit by default in MSVC.
Because of this, for FPU operations on x86 these two code sequences can produce different results in Reg0:
Program 1
Reg2 = Reg0 + Reg1
Store Reg2 to memory
Read memory to Reg2
Reg0 = Reg2 + Reg1
Program 2
Reg2 = Reg0 + Reg1
Reg0 = Reg2 + Reg1
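Here's a hedged C sketch of the same effect (my illustration, not Hyp-X's code). It assumes classic x87 code generation (e.g. gcc -m32 -mfpmath=387 without optimisation); under SSE maths the two results come out identical:

#include <stdio.h>

int main(void)
{
    float a = 16777216.0f; /* 2^24 -- the float ulp at this magnitude is 2 */
    float b = 1.0f;

    /* "Program 1": store Reg2 to memory and read it back.
       The store rounds the intermediate to 32-bit float. */
    volatile float t = a + b; /* 2^24 + 1 rounds back to 2^24 */
    float r1 = t + b;         /* still 2^24 */

    /* "Program 2": the intermediate can stay in an FPU register
       at extended precision. */
    float r2 = (a + b) + b;   /* can come out as 2^24 + 2 = 16777218 */

    printf("r1 = %.1f, r2 = %.1f\n", r1, r2);
    return 0;
}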
DemoCoder said: Well, even on sensible CPUs, you can get different results depending on the optimizer due to code motion and reordering, which can change between recompiles of your program. ANSI C/C++ "banned" associativity optimizations (e.g. (a+b)+c cannot be evaluated as a+(b+c)), but there are still compilers that offer this, other languages don't have similar bans, and even in the ANSI C case there are still some pathological optimizer issues (the compiler is still allowed to compute A, B, and C in any order and cache or move the results).

But still, most compilers enable 'aggressive' floating-point optimizations by default. The simple rule is to include error margins for any comparison when using floating point. Someone suggested that they'd better remove the equality and inequality operators from the language, to prevent 'stupid' programmers from fucking up because they have no clue that floating-point numbers don't have unlimited precision.
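For what it's worth, the usual shape of that error-margin rule in C looks something like this (the tolerances below are arbitrary illustrative picks, not universal constants):

#include <math.h>
#include <stdio.h>

/* Returns 1 if a and b are equal within a combined absolute/relative
   tolerance -- the absolute test covers values near zero, the relative
   test covers large magnitudes. */
static int nearly_equal(double a, double b, double abs_eps, double rel_eps)
{
    double diff = fabs(a - b);
    if (diff <= abs_eps)
        return 1;
    return diff <= rel_eps * fmax(fabs(a), fabs(b));
}

int main(void)
{
    double x = 0.1 + 0.2;
    printf("x == 0.3     -> %d\n", x == 0.3); /* 0 on IEEE doubles */
    printf("nearly_equal -> %d\n", nearly_equal(x, 0.3, 1e-12, 1e-9)); /* 1 */
    return 0;
}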
DeanoC said: Most compilers have a mode which will ensure that a float really is a float, by transferring to memory and reading back at every operation on floats/doubles etc.

Just for the record, for those interested: that is an option I wish to avoid, as gprof tells me that ~60% of the run time is in this function, so it would rather defeat my efforts to optimise that piece of code.
DeanoC said: And wrong again.

As SimonF says, the precision is always 80-bit for 'most' operations; the control-word precision is used mainly for divides. Long (64- or 80-bit) divides are expensive, whereas long multiplies aren't (on Intel x86). So you can tell the processor to stop division calculations at the appropriate stage (i.e. the 23rd bit for floats), but it doesn't change other operations.

This means that the truncation to the specified length occurs whenever the compiler decides to flush something from a floating-point register to memory.
IA software developer's manual said: The double precision and single precision settings reduce the size of the significand to 53 bits and 24 bits, respectively. These settings are provided to support the IEEE standard and to allow exact replication of calculations which were done using the lower precision data types. Using these settings nullifies the advantages of the extended-real format's 64-bit significand length. When reduced precision is specified, the rounding of the significand value clears the unused bits on the right to zeros.
The precision-control bits only affect the results of the following floating-point instructions:
FADD, FADDP, FSUB, FSUBP, FSUBR, FSUBRP, FMUL, FMULP, FDIV, FDIVP, FDIVR, FDIVRP, and FSQRT.
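For anyone who wants to poke at this themselves, MSVC exposes the precision-control field via _controlfp in <float.h>. A small sketch (assumes 32-bit x86 with x87 code generation; the PC bits have no effect under SSE2 or x64):

#include <float.h>
#include <stdio.h>

int main(void)
{
    volatile double x = 1.0, y = 3.0; /* volatile blocks constant folding */
    double q64, q24;

    _controlfp(_PC_64, _MCW_PC); /* full 64-bit significand */
    q64 = x / y;

    _controlfp(_PC_24, _MCW_PC); /* FDIV stops after 24 significand bits */
    q24 = x / y;

    _controlfp(_PC_53, _MCW_PC); /* restore the usual default */

    printf("PC_64: %.17g\nPC_24: %.17g\n", q64, q24); /* differ on x87 */
    return 0;
}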