FP16 and market support

In that case, when given a _PP hint and the only options are FP32 or FX16, it will default to FP32. Remember, the datatype overrides the precision hint. The hint indicates that it is acceptable to use a lesser degree of floating-point precision on this occasion, if available. Since the lowest degree of FP precision that hypothetical card supports is FP32, that's what it will get.
 
Ostsol said:
Hellbinder said:
The problem is that this is not taking into account the negative aspects of FP16, which will become more and more evident as shader use, and the complexity thereof, becomes more common.

After a while even FP24 will begin to show some signs of weakness. FP16, however, will start splitting at the seams even as early as late next year. What if you want to sample a texture that is 2048x2048, or hit other cases outside the limitations of FP16?

It makes no sense whatsoever to support FP16 currently, and it is flat-out insanity to push it with a new product coming out in Q1 of next year. There are several reasons for this and they are self-evident. IMO it borders on dishonesty for some of the people in this thread to be staunch supporters of FP16 when they know damn well it is not a wise course of action.

If even FP24 is going to need to go the way of the Dodo, why in the hell are people arguing for FP16 just because one IHV made such poor decisions over the last year?

The future, even the near future, is pure FP32, and that's all there is to it.
Indeed, if FP16 is inadequate in a situation, that's where FP32 is used. My only issue with FP16 is when potential FP32 performance is excessively compromised because of FP16. Not all shaders are really long and complex, though. Not all shaders will see a build-up of inaccuracies as a result of low precision. Simple, flat texturing operations, for example, will never need floating-point precision. Of course, one could always just use the fixed-function pipeline in some instances, but what if the fixed-function pipeline is emulated using shaders? In that case, one will want the fastest emulation possible. If it can be gained using lower precision that results in no quality penalty, then there's no problem with that.

Now, I'm not turning around and suddenly supporting FP16 entirely. As I said: I support it so long as its existence does not excessively compromise potential FP32 performance. That is because, by itself, it does not add anything at all to graphics programming. Performance is the only thing it can possibly add. It does not allow for special functionality within shader programming, nor does it provide the possibility of certain effects -- except for providing additional performance, but that's only good when the result won't compromise quality. As such, FP16 becomes a bonus -- a potential way to get performance when it is needed.
Here is the problem with that and it has already been covered.

FP16 (or any other lower precision) is only ever faster in the specific case where the hardware in question was specifically made to be so, over a supported but less robust implementation of a higher precision.

Meaning that if those same resources were used to make the higher precision as robust as possible, there is simply no need whatsoever for the lower precision.

There is no need whatsoever to ever process things at a lower precision *if* the hardware is correctly designed to fully support the FULL use of the maximum precision supported. There is simply no inherent speed increase from using a lower precision. On the contrary, hardware has to be ADDED in order to see an increase.

Which gets us back to the whole point: proper support for one full precision is the best answer and the best use of the transistor budget, with the best possible results.
 
Hellbinder said:
If even FP24 is going to need to go the way of the Dodo, why in the hell are people arguing for FP16 just because one IHV made such poor decisions over the last year?

The future, even the near future, is pure FP32, and that's all there is to it.

I know you're purposely polarizing this discussion on precisions as a proxy in the larger IHV 'contest' -- but in this case I think it's better to look at it on its own.

There seems to be a reductionist trend in recent years in macroscopic architectures: forgo complex constructs (which are often inefficient) and instead turn to simple ones that run at high efficiency and iterate to complete the same task, which would otherwise eat die area. The trend away from massive set-piece multitexturing in favor of multipass/loopback is one such example.

I don't see why the same shouldn't apply here. There will be times when IEEE FP32 (which would appear to be the industry status quo) is overkill, and since it comes at a heavy fixed cost as measured in logic, wasting it would be highly inefficient from an absolute PoV -- a statement that's beyond debate, even from your position. I would then continue: once you accept IEEE 754 as a standard, the rest of this argument is sheer logic -- FP16 seems to be a good lowest common denominator, and a hypothetical architecture capable of speedy FP32/FP16 is a good theoretical concept.

Am I saying FP16 is the end-all-be-all? Hell no. Am I saying it's better than FP24 when they're compared? Of course not. Have I even mentioned nVidia or ATI? No, and take note.

Personally, I don't see why you're so against this concept, at least theoretically.
 
That brings us back to Chalnoth's solution, where an FP32 unit could be designed such that it could double as a twin-FP16 unit. If that could be done without any significant compromise to potential FP32 performance (that is, without spending so many transistors that they would be better used on more plain FP32 units), then there's no problem. As such, the value of FP16 is directly related to how it can be implemented. I agree that having specialized FP16 units is a waste and that the performance boost provided by FP16 in the NV3x is only due to a seemingly strange design decision, but if there is a way to get around all of that, then the only issue that remains is -- as I have explained -- that only performance is gained, with nothing else added to graphics programming. It is simply a bonus with a precision compromise.
 
Razor04 said:
Yea... but what would happen if there are only two precisions implemented: FX16 and FP32? If the _pp hint defaults to the lower of the two precisions, like it is supposed to (at least as I understand it), then it would be using FX16 whether the developer intended FX16 or FP16.
If you request a floating-point type, then a floating-point type is what you will get - the _pp modifier won't convert from floating-point to fixed-point unless MS changes its meaning in PS/VS 4 and higher. In the absence of FP16, the _pp modifier will do nothing, just as on R3xx architectures now.
 
Ostsol said:
That brings us back to Chalnoth's solution, where an FP32 unit could be designed such that it could double as a twin-FP16 unit. If that could be done without any significant compromise to potential FP32 performance (that is, without spending so many transistors that they would be better used on more plain FP32 units), then there's no problem. As such, the value of FP16 is directly related to how it can be implemented. I agree that having specialized FP16 units is a waste and that the performance boost provided by FP16 in the NV3x is only due to a seemingly strange design decision, but if there is a way to get around all of that, then the only issue that remains is -- as I have explained -- that only performance is gained, with nothing else added to graphics programming. It is simply a bonus with a precision compromise.
An FP32 unit that doubles as a twin FP16 unit is a bit hard to design: splitting FP multipliers is easy and adds little extra gates and gate delays (~2 MUXes); splitting FP adders is very hard (AFAIK it's easier and cheaper to make an FP32 unit that can do single FP16 operations, then add a second FP16-only adder beside it); splitting RCP/RSQ/other operations can benefit from multiplier splitting, but the LUTs and adders also present in RCP/RSQ circuits are too expensive to share to be worthwhile.
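To make the field widths behind that argument concrete, here is a small Python sketch (standard library only; the MUX and adder-splitting claims above are arjan's, not something this code demonstrates) that splits FP32 and FP16 values into their sign/exponent/mantissa bit fields:

```python
import struct

def fp32_fields(x: float):
    """Split an IEEE 754 single into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp16_fields(x: float):
    """Split an IEEE 754 half into (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack('<H', struct.pack('<e', x))[0]
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

# FP32 is 1/8/23 (sign/exponent/mantissa); FP16 is 1/5/10.  Two 11-bit
# FP16 significands (mantissa plus hidden bit) fit side by side in one
# 24-bit FP32 significand datapath, which is why a multiplier array can
# be partitioned cheaply -- while an adder's alignment shifter and
# normalisation logic span the full width and resist splitting.
print(fp32_fields(1.5))   # sign 0, biased exponent 127, top mantissa bit set
print(fp16_fields(1.5))   # sign 0, biased exponent 15, top mantissa bit set
```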
 
Rugor said:
In that case, when given a _PP hint and the only options are FP32 or FX16, it will default to FP32. Remember, the datatype overrides the precision hint. The hint indicates that it is acceptable to use a lesser degree of floating-point precision on this occasion, if available. Since the lowest degree of FP precision that hypothetical card supports is FP32, that's what it will get.
Thanks! As I mentioned, I'm not completely familiar with how it works, so I appreciate the clarification.
 
Dio said:
Hellbinder said:
ATi dudes Jump into their SUV's
SUV's? There tends to be more interest in sports cars.

Except for hiring them to go up to Tahoe.

Are you aiming this comment at anyone specific, Dio?

By the way, if you need to get hold of me before next Wednesday call me on my cell - I'll be in Tahoe.

GP.
 
Hellbinder said:
After a while even FP24 will begin to show some signs of weakness. FP16 however will start splitting at the seams even as early as late next year. What if you want to sample a texture that is 2048x2048? or other cases outside the limitations of FP16.
In scenarios where FP16 isn't enough for a single instruction, you would simply use FP32. You wouldn't want to use FP16 for any texture addressing. Even the NV2x uses FP32 for texture addressing, and so does the R3xx (provided it's not a dependent texture read...).
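To illustrate the addressing limit being discussed, here is a quick Python sketch (standard library only; the texel-centre arithmetic is a hypothetical illustration, not any IHV's actual addressing path) showing FP16 texture coordinates collapsing texel centres near the top of a 2048-wide texture:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

# FP16 carries an 11-bit significand, so normalised coordinates near 1.0
# are spaced 2^-11 apart -- exactly one whole texel in a 2048-wide
# texture.  Half-texel offsets (texel centres) at the top of the range
# are lost after an FP16 round-trip, while mid-range ones survive.
for i in (1000, 2046, 2047):
    u = (i + 0.5) / 2048.0          # texel-centre coordinate
    texel = to_fp16(u) * 2048.0     # address recovered after FP16
    print(i, texel)
```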

The fact remains that there are many operations where FP16 is more than enough. Essentially any calculations done on color data, including high dynamic range lighting, could be done just fine with FP16. If you're doing graphics processing, there's going to be a lot of calculations on color data...
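That claim about colour data is easy to sanity-check in software. A minimal Python sketch (standard library only) showing that every 8-bit colour value survives an FP16 round-trip:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Every 8-bit colour value i/255 survives quantisation to FP16 and
# re-quantisation back to 8 bits: FP16's worst-case absolute error on
# [0, 1] is about 2^-12, far below the ~1/510 threshold that would
# flip an 8-bit result.
ok = all(round(to_fp16(i / 255.0) * 255.0) == i for i in range(256))
print(ok)  # True
```

Of course this says nothing about long chains of blends, only that a single pass through FP16 costs nothing visible at 8-bit output.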

It Makes no sense whatsoever to support FP16 currently
It makes a lot of sense, provided performance can be gained from its use. Remember that any architecture that is designed to use FP16 will have a higher-precision format available when it's needed.

If even FP24 is going to need to go the way of the Dodo,, Why in the hell are people arguing for FP16 just because one IHV made such poor decisions over the last year.
FP24's going away for two main reasons:
1. Unification of vertex and pixel shader pipelines is coming.
2. It is possible, with FP32, to support FP16 at even higher speeds.
 
arjan de lumens said:
Ostsol said:
That brings us back to Chalnoth's solution, where an FP32 unit could be designed such that it could double as a twin-FP16 unit. If that could be done without any significant compromise to potential FP32 performance (that is, without spending so many transistors that they would be better used on more plain FP32 units), then there's no problem. As such, the value of FP16 is directly related to how it can be implemented. I agree that having specialized FP16 units is a waste and that the performance boost provided by FP16 in the NV3x is only due to a seemingly strange design decision, but if there is a way to get around all of that, then the only issue that remains is -- as I have explained -- that only performance is gained, with nothing else added to graphics programming. It is simply a bonus with a precision compromise.
An FP32 unit that doubles as a twin FP16 unit is a bit hard to design: splitting FP multipliers is easy and adds little extra gates and gate delays (~2 MUXes); splitting FP adders is very hard (AFAIK it's easier and cheaper to make an FP32 unit that can do single FP16 operations, then add a second FP16-only adder beside it); splitting RCP/RSQ/other operations can benefit from multiplier splitting, but the LUTs and adders also present in RCP/RSQ circuits are too expensive to share to be worthwhile.
Heheh. . . I guess it was a pretty big "if". . . *shrugs* If it's not cost effective, then single precision certainly is the way to go.
 
Operations such as RCP/RSQ are probably done so infrequently that it doesn't make sense to split them for FP32/twin-FP16 use; for other units, I would estimate that shared FP32/twin-FP16 units would be about 15% or so larger/more expensive than pure FP32 units - the extra FP16 adder is quite small compared to, say, an FP32 multiplier. (The 15% number doesn't include infrastructure such as register files, texture mappers/caches, instruction decoders/schedulers etc.) It may well be quite cost-effective if FP16 gains enough acceptance, but whether we will see such solutions will be completely up to NV & ATI.
 
arjan de lumens said:
An FP32 unit that doubles as a twin FP16 unit is a bit hard to design: splitting FP multipliers is easy and adds little extra gates and gate delays (~2 MUXes); splitting FP adders is very hard (AFAIK it's easier and cheaper to make an FP32 unit that can do single FP16 operations, then add a second FP16-only adder beside it);
I don't see why you would need to split the entire unit. Of course, it might be simpler if every FP16 operation could be run in parallel with a separate FP16 operation, and obtaining that parallelism would be all about doing it with as few transistors as possible.

Of course, there still is the problem that adding parallelism requires that more pixels be in flight at any one time to take advantage of that parallelism.
 
arjan de lumens said:
Operations such as RCP/RSQ are probably done so infrequently that it doesn't make sense to split them for FP32/twin-FP16 use; for other units,
RSQ is done twice as fast at FP16 as at FP32 on the NV3x.
 
Chalnoth said:
arjan de lumens said:
Operations such as RCP/RSQ are probably done so infrequently that it doesn't make sense to split them for FP32/twin-FP16 use; for other units,
RSQ is done twice as fast at FP16 as at FP32 on the NV3x.
How so? Based on ATI's OpenGL Programming and Optimization Guide, it is performed as a single instruction and completed in a single clock cycle on the R300... Does NVidia's implementation require more cycles for FP32?
 
Chalnoth said:
arjan de lumens said:
Operations such as RCP/RSQ are probably done so infrequently that it doesn't make sense to split them for FP32/twin-FP16 use; for other units,
RSQ is done twice as fast at FP16 as at FP32 on the NV3x.
Which indicates that it is probably split into two operations: one LUT lookup/lerp operation that gives ~FP16 precision and one Newton-Raphson iteration or something similar that extends the precision from FP16 to FP32 - similar to how it's done in AMD 3dnow!.
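That two-step scheme is easy to model in software. Here is a Python sketch (standard library only; the FP16 quantisation merely stands in for the hypothetical LUT/lerp seed) of one Newton-Raphson refinement step for reciprocal square root:

```python
import math
import struct

def to_fp16(x: float) -> float:
    """Quantise to IEEE 754 half precision (stand-in for a ~FP16 LUT/lerp seed)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def rsq_refined(x: float) -> float:
    """Reciprocal square root: FP16-grade seed plus one Newton-Raphson step.

    The iteration y' = y * (1.5 - 0.5 * x * y * y) roughly doubles the
    number of correct bits per step, so a single extra multiply-heavy
    pass lifts an ~11-bit seed close to FP32-grade accuracy.
    """
    y = to_fp16(1.0 / math.sqrt(x))     # coarse seed at FP16 precision
    return y * (1.5 - 0.5 * x * y * y)  # one refinement step

exact = 1.0 / math.sqrt(2.0)
print(abs(to_fp16(exact) - exact))    # coarse seed error
print(abs(rsq_refined(2.0) - exact))  # several orders of magnitude smaller
```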
 
It makes a lot of sense, provided performance can be gained from its use. Remember that any architecture that is designed to use FP16 will have a higher-precision format available when it's needed.
Again you keep going back to this fallacy to support your argument.

FP16 is in no way at all faster than a higher precision, except in the express case where transistor count is increased/doubled for its correct and direct support. Which makes no logical sense: why waste the resources on a lower precision that is not going to be used, except where Nvidia itself PUSHES it on developers for no other reason than that they *currently* HAVE TO USE IT?

I guarantee that you (and I think everyone knows this) would be making the exact same arguments I am if the shoe was on the other foot. Except you would likely also be pushing how FP24 is the "wave of the future" or something.

The bottom line is that if Nvidia had delivered properly supported FP32 and left out FP16, you would not even be bringing this up. Instead you would be spending all your time talking about the evils of FP24.

I know part of your motivation is the fact that you know the NV40 maintains the use of FP16.
 
There has been ongoing speculation that ATi will increase the number of pipelines in the r4xx. I haven't closely followed the speculation, but what follows is my opinion/speculation on one easy way to do it. (I'm not suggesting that what follows is fact.)

ATi implements the r4xx as FP32 with FP16 partial precision. The partial precision is achieved by splitting the FP32 registers. This allows the claim of extra pipelines, but only when partial precision is used. It does not affect the number of pixels output per cycle (still only 8 of those).
 
ATi is not going to use FP16... so why even hope they will follow nVidia?


All this talk about why FP16 is good... bah. Game performance speaks volumes: take a look at Tom's VGA guide part III and you will see ATi leading in ALL of the benchmarks except for 2. There are even a few instances where a 9500 Pro beats out a 5950 Ultra... yeah, that's right, a year-old mid-range card puts the smackdown on nVidia's finest.

So let nVidia continue with FP16, as ATi's performance with FP24 is superior. Lay down some tasty eye candy and there really is NO COMPARISON between the excellent image you get from the R3x0 and the mediocre image you get with the NV3x when performance is similar...
 