FP16 and market support

Arun · Dec 24, 2003

Demirug said:
Yes, most of the stages (nVidia calls them slots) are used to cover the texture latency (nv: > 176). But even if you do not use the texture unit you have to go through all these stages, because the texture unit bypass has the same length. This is necessary because the order of the quads cannot change. There is no real thread scheduler.

Is it just me or is that an obvious modification for the NV40? If they need Branching in it, they NEED to be smarter than that anyway.
So assuming the number of "necessary" stages/slots/threads in the NV40 will greatly diminish would be a safe bet IMO. So if they reduced the average number of slots by 50-70%, and doubled the register file again as they did in the NV35... you might end up with a very reasonable register performance hit.

Heck, if you need 8 registers to get any sort of real performance hit, and the performance hit of 16 registers would be roughly the same as the one of 4-6 registers on the NV30... I'd even say that unless NVIDIA badly ****s up regarding their ALUs, their performance might be quite excellent indeed!

Of course, that's a BIG if.

Uttar

KimB · Dec 24, 2003

jimbob0i0 said:
If there are NO applications that can show 24-bit precision to not be enough, wouldn't that indicate that in the current generation it actually most likely is enough?......

No applications have tried to test texture addressing accuracy.

Demirug · Dec 24, 2003

Uttar said:
Is it just me or is that an obvious modification for the NV40? If they need Branching in it, they NEED to be smarter than that anyway.

Not necessarily.

If they add an instruction pointer to each quad (10 bits for 1024-instruction programs), it will be easy to execute a different instruction for each quad. At the end of the pipeline you need an additional unit that can change the instruction pointer. This will work well for static branching. Dynamic branching is a little more complicated, because not all pixels in one quad necessarily have to execute the same instructions. But this can be solved too. I know that it works because I have written a small simulation program.
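A minimal sketch of the per-quad instruction pointer idea for the static case. This is not the simulation program mentioned above: the instruction encoding, the `Quad` class and the `skip_detail` constant are all hypothetical, and the only things taken from the post are the 10-bit pointer, the fixed quad order and a unit at the end of the pipeline that rewrites the pointer.

```python
from dataclasses import dataclass, field

IP_BITS = 10  # 10 bits address programs of up to 1024 instructions


@dataclass
class Quad:
    consts: dict                      # constants bound when the quad was issued
    ip: int = 0                       # per-quad instruction pointer
    trace: list = field(default_factory=list)
    done: bool = False


def run(program, quads):
    """Cycle the quads through the pipeline in fixed order until all finish."""
    assert len(program) <= 1 << IP_BITS
    while not all(q.done for q in quads):
        for q in quads:               # the quad order never changes
            if q.done:
                continue
            instr = program[q.ip]
            if instr[0] == "op":
                q.trace.append(instr[1])
                q.ip += 1
            elif instr[0] == "jmp_if":
                # hypothetical unit at the end of the pipeline rewrites the ip
                _, const_name, target = instr
                q.ip = target if q.consts[const_name] else q.ip + 1
            else:                     # "end"
                q.done = True
    return quads


program = [
    ("op", "tex"),
    ("jmp_if", "skip_detail", 3),     # static branch on a constant
    ("op", "detail"),
    ("op", "out"),
    ("end",),
]
quads = run(program, [Quad({"skip_detail": False}), Quad({"skip_detail": True})])
for q in quads:
    print(q.trace)
```

The two quads stay in the same order on every pass but execute different instructions, because each one carries its own pointer. Dynamic branching would additionally need some per-pixel handling inside a quad, which this sketch leaves out.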

Uttar said:
So assuming the number of "necessary" stages/slots/threads in the NV40 will greatly diminish would be a safe bet IMO. So if they reduced the average number of slots by 50-70%, and doubled the register file again as they did in the NV35... you might end up with a very reasonable register performance hit.

50-70% is IMHO too much. I think nVidia has to fight for every single slot they want to remove.

Hyp-X · Dec 24, 2003

Demirug said:
even if you do not use the texture unit you have to go through all these stages, because the texture unit bypass has the same length. This is necessary because the order of the quads cannot change.

This doesn't make much sense.

Say you have 128 slots (because of high register usage), a PS program of 1 TEX instruction and some arithmetic instructions, a 176 tex latency but a <=128 arithmetic latency.

cycle 0: start the 1st PS instruction on slot 0
cycle 1: start the 1st PS instruction on slot 1
...
cycle 127: start the 1st PS instruction on slot 127
cycle 128: idle waiting for slot 0 to become runnable
...
cycle 175: idle waiting for slot 0 to become runnable
cycle 176: start the 2nd PS instruction on slot 0
cycle 177: start the 2nd PS instruction on slot 1
...
cycle 303: start the 2nd PS instruction on slot 127
cycle 304: start the 3rd PS instruction on slot 0, because by now its 2nd instruction is already complete

I'm not saying the FX works this way.
I'm just saying that your argument that the bypass has to be long because the quad order cannot change makes no sense.
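The schedule above can be reproduced with a few lines of Python. This is only a sketch of the hypothetical in-order machine described in this post, not of how the FX actually works: one issue per cycle, slots always visited in the same order, and the issue stalls until the next slot's previous instruction has completed. The function name is made up, and the arithmetic latency is taken as exactly 128 cycles (the post only says <=128).

```python
def issue_cycles(num_slots, latencies):
    """Return, per instruction, the cycle at which it is issued on each slot."""
    ready = [0] * num_slots           # cycle at which each slot is runnable again
    cycle = 0
    schedule = []
    for latency in latencies:         # one entry per PS instruction
        row = []
        for slot in range(num_slots):  # slot order is fixed
            cycle = max(cycle, ready[slot])  # idle until this slot is runnable
            row.append(cycle)
            ready[slot] = cycle + latency
            cycle += 1
        schedule.append(row)
    return schedule


# 128 slots, one TEX (176 cycles) followed by two arithmetic instructions
sched = issue_cycles(128, [176, 128, 128])
print(sched[1][0], sched[1][127], sched[2][0])  # 176 303 304
```

The only idle cycles are 128 to 175, while the first texture fetch is still in flight; after that the arithmetic instructions issue back to back without any reordering of the quads.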

Nick · Dec 25, 2003

Could someone direct me to where the DirectX SDK specifies that 24-bit floating-point is enough? I found only the 'Data Types' page of HLSL and it specified 16-bit, 32-bit and 64-bit but not 24-bit (of course I realize this is storage size). There seems to be no precision information about assembly shaders, or am I looking in the wrong document?

Anyway, I don't think 24-bit for texturing is enough. You can already see artifacts because there are fewer than 8 bits for the filtering fraction, as you can see here: ATI's filtering tricks. Of course you can only see them under rare conditions if there's only one, 'flat' texture. But as soon as the texture coordinates undergo some operations instead of just taking them from the interpolators, there's a great loss of precision that will be visible on things like detail textures. My software renderer uses the rcp SSE instruction, which gives at least 12 bits of mantissa precision, for perspective correction, but it gives clear artifacts for nearby surfaces. A 16-bit mantissa is only 16x more precise and this isn't sufficient if some bits are lost with extra operations on these registers.

So, ATI has to hurry to get some 32-bit floating-point hardware on the market before we have games that experience precision problems. It has been a good compromise for a while, but 24-bit isn't going to last long. I think it's pretty smart of Nvidia to allow different precisions depending on the use. Color calculations can be done perfectly with 16-bit floating-point format while 32-bit can be used for texture coordinates and such. I see no need for 24-bit actually. So I still think DirectX has a major role here in 'deciding' who has the fastest hardware that is 'considered' compliant. They should, like OpenGL, specify exactly what precision is required or recommended for every kind of operation.

As always, I could have made some serious errors, so please correct me if I'm wrong.

radar1200gs · Dec 25, 2003

WaltC said:
radar1200gs said:

...
The same thing goes for game developers. In the real world, most DX9 class chips sold to consumers will support FP16 and developers will ignore it and the potential performance increases at their peril.

You are looking at the issue exactly backwards. Why would any hardware manufacturer want to do fp16 at all, if he can do fp24 which runs as fast, if not faster, than fp16 runs on the products which support fp16? This is but underscored by the fact that fp24 is the API target, not fp16.

A manufacturer wouldn't, in my view, simply because it isn't needed for anything.

So why does nVidia need fp16?

Answer: Because the nV3x architecture doesn't support any rendering precision *above* fp16 which is competitive in terms of performance.

If your assumption is that fp16 is "always faster" than a higher rendering precision, such as fp24, that assumption is incorrect, as the actual performance of a gpu architecture is not solely determined by the precision of the fp pipeline, but is determined by all of the other factors relative to gpu design which lie outside of fp rendering precision. And that's why R3x0 does fp24 as fast, and faster, than nV3x does fp16. Estimating gpu performance strictly by the rendering precision of the fp pipeline is a mistake.

Hence, it will be "he who markets fp16-dependent products" who will suffer, as opposed to developers who write software which displays optimally at fp24. Relative to the 3d hardware market, nothing has changed from 18 months ago, when people were buying nVidia 3d cards because they provided the best performance. All that's changed in the last 16 months or so is that the same people who were buying nVidia then, are now buying ATi--for the same reasons. The market will tend to center around better standards when they can be brought to market in a practical and economic sense. So, just as 16-bit integer usurped and replaced 8-bit integer, and 24/32-bit integer replaced 16-bit integer as a market standard in both software and hardware, it should be no surprise that a speedy implementation of fp24 is preferred above a speedy implementation of fp16.

The simple reality is that there are more GPUs out there in consumer homes capable of doing FP16/32 than there are GPUs capable of doing FP24.

If I were a developer I would be supporting the format my customer is most likely to have in his box, rather than some theoretically superior standard.

Back in the day the Motorola 68000 family was considered superior to Intel x86, but x86 won because it had a larger installed base. Superiority does you no good if no one is making use of it.

Bouncing Zabaglione Bros. · Dec 25, 2003

radar1200gs said:
The simple reality is that there are more GPUs out there in consumer homes capable of doing FP16/32 than there are GPUs capable of doing FP24.

If I were a developer I would be supporting the format my customer is most likely to have in his box, rather than some theoretically superior standard.

Oh right, that must be why we are still using 8 bit colour and 286 CPUs ....

BRiT · Dec 25, 2003

radar1200gs said:
The simple reality is that there are more GPUs out there in consumer homes capable of doing FP16/32 than there are GPUs capable of doing FP24.

I don't buy that. Show me the numbers.

Ostsol · Dec 25, 2003

Chalnoth said:
jimbob0i0 said:

If there are NO applications that can show 24-bit precision to not be enough, wouldn't that indicate that in the current generation it actually most likely is enough?......

No applications have tried to test texture addressing accuracy.

Based on my own tests, precision in fragment shaders will only make a difference for dependent texture reads. For normal texture sampling, the texture coordinate is tossed into the texture sampler at the full FP32 provided by the vertex pipeline.
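To put rough numbers on the dependent-read case, here is a small sketch. It assumes the commonly cited layouts with 10, 16 and 23 explicit mantissa bits for FP16, FP24 and FP32, and it ignores exponent range and denormals. The function name, the coordinate 0.75 and the 2048-texel texture width are purely illustrative.

```python
import math


def coord_step(x, mantissa_bits):
    """Spacing between adjacent representable values around x."""
    if x == 0:
        return 0.0
    _, e = math.frexp(abs(x))         # abs(x) = m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e - 1 - mantissa_bits)


TEX_WIDTH = 2048                      # assumed texture width in texels
for name, bits in (("FP16", 10), ("FP24", 16), ("FP32", 23)):
    step_in_texels = coord_step(0.75, bits) * TEX_WIDTH
    print(name, step_in_texels)
```

For a coordinate near 0.75 on a 2048-wide texture this gives a step of 1 texel at FP16, 1/64 texel at FP24 and 1/8192 texel at FP32. That is only the resolution of a single stored coordinate; any error accumulated by the arithmetic that produced it comes on top.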

radar1200gs · Dec 25, 2003

BRiT said:
radar1200gs said:

The simple reality is that there are more GPUs out there in consumer homes capable of doing FP16/32 than there are GPUs capable of doing FP24.

I don't buy that. Show me the numbers.

The 5200 and 5600 made sure of that. ATi had the same chance but decided not to compete in the low-end with a DX9 class chip. I bet Dave Orton would love to strangle whoever was responsible for ordering several squillion RV250 chips from the foundries...

OpenGL guy · Dec 25, 2003

Nick said:
Anyway, I don't think 24-bit for texturing is enough. You can already see artifacts because there are fewer than 8 bits for the filtering fraction, as you can see here: ATI's filtering tricks.

Refrast does the same. Also, 32 steps is a lot; it takes some really extreme examples to make them visible.
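A rough way to see why it takes an extreme example: the sketch below counts how many distinct 8-bit output values a linear filter between two neighbouring texels can produce when the blend fraction is truncated to a given number of bits. It assumes that 32 steps corresponds to a 5-bit fraction; the function name and sample count are made up, and real hardware may round differently.

```python
def filtered_levels(a, b, frac_bits, samples=1000):
    """Count distinct 8-bit results of lerp(a, b, t) with a quantized fraction."""
    steps = 1 << frac_bits
    levels = set()
    for i in range(samples):
        t = i / samples
        tq = int(t * steps) / steps   # fraction truncated to frac_bits bits
        levels.add(round(a + (b - a) * tq))
    return len(levels)


print(filtered_levels(0, 255, 5))     # black-to-white texel pair: 32 levels
print(filtered_levels(100, 104, 5))   # typical small difference: 5 levels
print(filtered_levels(100, 104, 16))  # same pair, 16-bit fraction: 5 levels
```

Only when neighbouring texels differ by nearly the full range and the texture is heavily magnified does the 5-bit fraction limit the number of output levels. For texels that differ by a few levels, the result is the same as with a much finer fraction.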

Of course you can only see them under rare conditions if there's only one, 'flat' texture.

Exactly, and these don't occur in games.

But as soon as the texture coordinates undergo some operations instead of just taking them from the interpolators, there's a great loss of precision that will be visible on things like detail textures.

Where's this "great loss in precision" coming from? Not from the LOD fraction, certainly. Not from FP24 certainly.

My software renderer uses the rcp SSE instruction, which gives at least 12 bits of mantissa precision, for perspective correction, but it gives clear artifacts for nearby surfaces. A 16-bit mantissa is only 16x more precise and this isn't sufficient if some bits are lost with extra operations on these registers.

Prove it. The HW is here now. And why would you do perspective correction in the pixel shader? Also, there are 17 bits of mantissa when you count the implied bit.

So, ATI has to hurry to get some 32-bit floating-point hardware on the market before we have games that experience precision problems.

Whatever.

It has been a good compromise for a while, but 24-bit isn't going to last long. I think it's pretty smart of Nvidia to allow different precisions depending on the use. Color calculations can be done perfectly with 16-bit floating-point format while 32-bit can be used for texture coordinates and such. I see no need for 24-bit actually. So I still think DirectX has a major role here in 'deciding' who has the fastest hardware that is 'considered' compliant. They should, like OpenGL, specify exactly what precision is required or recommended for every kind of operation.

DX9 does specify. It specifies a minimum of 24-bit precision in the pixel shader, or 16-bit precision when _pp is specified. Not very complicated.

[maven] · Dec 25, 2003

Nick said:
Could someone direct me to where the DirectX SDK specifies that 24-bit floating-point is enough? I found only the 'Data Types' page of HLSL and it specified 16-bit, 32-bit and 64-bit but not 24-bit (of course I realize this is storage size). There seems to be no precision information about assembly shaders, or am I looking in the wrong document?

I think it would be in the DDK, couldn't find anything in the SDK...

Nick said:
Anyway, I don't think 24-bit for texturing is enough. You can already see artifacts because there are fewer than 8 bits for the filtering fraction, as you can see here: ATI's filtering tricks. Of course you can only see them under rare conditions if there's only one, 'flat' texture.

But they (limited fractional bits for texture-interpolators and dependent reads) are different sources of error...

Nick said:
But as soon as the texture coordinates undergo some operations instead of just taking them from the interpolators, there's a great loss of precision that will be visible on things like detail textures.

This is a notion I disagree with. To quote one of my Numerical Analysis lecturers, it's the same as proclaiming the patient dead while the operation hasn't even started yet.
There are certain types of operations (in conjunction with particular data) that can have catastrophic effects on accuracy, but you need to be aware of those (as a programmer) anyway; and they do not necessarily occur.

Note that I haven't addressed at all whether FP16/24/32 is enough for anything or not...

gokickrocks · Dec 25, 2003

radar1200gs said:
The simple reality is that there are more GPUs out there in consumer homes capable of doing FP16/32 than there are GPUs capable of doing FP24.

If I were a developer I would be supporting the format my customer is most likely to have in his box, rather than some theoretically superior standard.

Back in the day the Motorola 68000 family was considered superior to Intel x86, but x86 won because it had a larger installed base. Superiority does you no good if no one is making use of it.

so by your logic, developers should be coding for intel...

also, the majority of cards dont use floating point in the shaders, so again by your logic, it would imply that developers should be using integers...dont know about you, but I dont see many people running dx9 cards...

BRiT · Dec 25, 2003

radar1200gs said:
BRiT said:

radar1200gs said:

The simple reality is that there are more GPUs out there in consumer homes capable of doing FP16/32 than there are GPUs capable of doing FP24.

I don't buy that. Show me the numbers.

The 5200 and 5600 made sure of that. ATi had the same chance but decided not to compete in the low-end with a DX9 class chip. I bet Dave Orton would love to strangle whoever was responsible for ordering several squillion RV250 chips from the foundries...

Once again, show me the numbers.

SpellSinger · Dec 25, 2003

radar1200gs said:
The 5200 and 5600 made sure of that. ATi had the same chance but decided not to compete in the low-end with a DX9 class chip. I bet Dave Orton would love to strangle whoever was responsible for ordering several squillion RV250 chips from the foundries...

The NV products can't run FP32 at an acceptable performance level, even at the high end.

I still don't buy it.

KimB · Dec 25, 2003

Ostsol said:
Chalnoth said:

jimbob0i0 said:

If there are NO applications that can show 24-bit precision to not be enough, wouldn't that indicate that in the current generation it actually most likely is enough?......

No applications have tried to test texture addressing accuracy.

Based on my own tests, precision in fragment shaders will only make a difference for dependent texture reads. For normal texture sampling, the texture coordinate is tossed into the texture sampler at the full FP32 provided by the vertex pipeline.

Which is precisely what I'm worried about.

radar1200gs · Dec 25, 2003

gokickrocks said:
radar1200gs said:

The simple reality is that there are more GPUs out there in consumer homes capable of doing FP16/32 than there are GPUs capable of doing FP24.

If I were a developer I would be supporting the format my customer is most likely to have in his box, rather than some theoretically superior standard.

Back in the day the Motorola 68000 family was considered superior to Intel x86, but x86 won because it had a larger installed base. Superiority does you no good if no one is making use of it.

so by your logic, developers should be coding for intel...

also, the majority of cards dont use floating point in the shaders, so again by your logic, it would imply that developers should be using integers...dont know about you, but I dont see many people running dx9 cards...

Developers already do code for Intel, with no problems: with the exception of SSE2 on the Intel side and 3DNow! on the AMD side, the code is identical anyway, and AMD CPUs often benefit more from optimisations for Intel than Intel itself does.

As for integer support, that's where Cg would have come in.

Far from being an evil plan to take the graphics world over, Cg lets developers code once and then runs it optimally on the target architecture (including competitor architectures if the competitor actually bothers to write a competent backend). It is HLSL for DX8 and OpenGL.

KimB · Dec 25, 2003

OpenGL guy said:
Also, 32 steps is a lot; it takes some really extreme examples to make them visible.

This is the way aliasing works. You often don't even notice it in still screens unless extreme examples are taken.

Examples:
1. Edge aliasing: under most circumstances, you have to zoom very far in to see edge aliasing in a screenshot. The effect, however, is much more visible in motion.

2. Texture aliasing: again, you don't see it under most circumstances. It's pretty easy to show, in fact, that a LOD bias of -1 can look pretty darned good in a screenshot. But there will be quite a lot of shimmer when in motion.

This is why I have always opposed people using in-game screenshots to test texture quality and AA quality. For these things, synthetic screenshots are vastly more telling. If you want to use games to test such things, the only way to do it properly would be to use video. That never happens, however, and I don't think it ever will.

gokickrocks · Dec 25, 2003

way back, MikeC posted a Q3 video in regards to AA on both the 9700 and the 5900 IIRC...so there you go

radar1200gs · Dec 25, 2003

Chalnoth said:
OpenGL guy said:

Also, 32 steps is a lot; it takes some really extreme examples to make them visible.

This is the way aliasing works. You often don't even notice it in still screens unless extreme examples are taken.

Examples:
1. Edge aliasing: under most circumstances, you have to zoom very far in to see edge aliasing in a screenshot. The effect, however, is much more visible in motion.

2. Texture aliasing: again, you don't see it under most circumstances. It's pretty easy to show, in fact, that a LOD bias of -1 can look pretty darned good in a screenshot. But there will be quite a lot of shimmer when in motion.

This is why I have always opposed people using in-game screenshots to test texture quality and AA quality. For these things, synthetic screenshots are vastly more telling. If you want to use games to test such things, the only way to do it properly would be to use video. That never happens, however, and I don't think it ever will.

A good compromise would be a screenshot showing the overall scene, then a 10 to 30 frame animation in animated GIF format or similar (can .png do anim sequences?) of a small selected portion of the screen, chosen to show the effect in motion.
