FP16 Bilinear Filtering

Dave Baumann

Does anyone know the cost of doing an FP16 Bilinear Filter on NV40? 2 Cycles? 3 Cycles? Something else?
 
It seems that using FP16 textures costs 2 cycles on NV40, whether you're using point sampling or bilinear filtering.
 
FP16 bilinear filtering is free on NV40. However, its texturing unit can't output more than two FP16 components per cycle (that's the same with every GPU).

FP16 point sampling, x or xy: 1 cycle
FP16 point sampling, xyz or xyzw: 2 cycles
FP16 bilinear filtering, x or xy: 1 cycle
FP16 bilinear filtering, xyz or xyzw: 2 cycles
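 
(A toy model of those numbers, assuming, as above, that the unit hands back at most two FP16 components, i.e. 32 bits of filtered data, per cycle. This is just my own back-of-the-envelope sketch, not anything NVIDIA has documented:)

Code:
#include <stdio.h>

/* Toy cost model: assume the filtered texture result is delivered over a
   32-bit-per-cycle path, so cycles = ceil(components * bits_per_component / 32).
   This reproduces the FP16 numbers above, and also happens to predict the
   4-cycle FP32 case discussed below. */
int fetch_cycles(int components, int bits_per_component)
{
    int bits = components * bits_per_component;
    return (bits + 31) / 32;   /* round up to whole cycles */
}

int main(void)
{
    printf("FP16 x/xy     : %d cycle(s)\n", fetch_cycles(2, 16)); /* 1 */
    printf("FP16 xyz/xyzw : %d cycle(s)\n", fetch_cycles(4, 16)); /* 2 */
    printf("FP32 xyzw     : %d cycle(s)\n", fetch_cycles(4, 32)); /* 4 */
    return 0;
}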
 
Could the reason for that be that the data bus between the texture filtering unit and the shading units is only 32 bits wide (designed for the usual case of RGBA8 textures)?
 
akira888 said:
Could the reason for that be that the data bus between the texture filtering unit and the shading units is only 32 bits wide (designed for the usual case of RGBA8 textures)?
Actually, yes. There was a presentation at the Graphics Hardware 2004 conference where some people studied bottlenecks in GPUs for GPGPU-type tasks, and found that NO current GPU had a path from the texture cache to the pixel shader that was wider than 32 bits. (link to presentation (300K ppt)) So you could read a 4-component FP32 texture if you wanted to (on both R3xx, R4xx, NV3x, NV4x, unfiltered), but the hardware would take 4 cycles to deliver the data on all architectures. (That's 4 components × 32 bits = 128 bits, i.e. four transfers over a 32-bit path.) This was shown to be a cache->shader path limitation, not a memory bandwidth limitation (this can easily be tested by benchmarking reads of grossly magnified textures, which touch so few texels that memory bandwidth can't be the bottleneck).
 
Current GPU architectures can read four FP32 values in a cycle, but can something like NV4x bilinearly filter those four FP32 values in a single cycle, bandwidth limitations aside?
 
Luminescent said:
Current GPU architectures can read four FP32 values in a cycle
Source? It may seem obvious that they SHOULD have that ability, but actual benchmarking so far tells a different story.
 
Luminescent said:
Current GPU architectures can read four FP32 values in a cycle, but can something like NV4x bilinearly filter those four FP32 values in a single cycle, bandwidth limitations aside?
NV4x can't filter FP32 values. And if the TMUs are limited to 32 bits of output per clock (before conversion to FP32), there are certainly no units that would generate more than that.
 
Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?
 
arjan de lumens said:
Luminescent said:
Current GPU architectures can read four FP32 values in a cycle
Source? It may seem obvious that they SHOULD have that ability, but actual benchmarking so far tells a different story.
I was using the following statement of yours as my source:
So you could read a 4-component FP32 texture if you wanted to (on both R3xx, R4xx, NV3x, NV4x, unfiltered), but the hardware would take 4 cycles to deliver the data on all architectures.
I guess I glanced at it too quickly and added onto it. At first glance it seemed you were indicating that the pixel units could read four FP32 values from the texture units in a single cycle if it weren't for the fact that the data path from the texture units to the pixel shader was crippled to 32 bits.

Secondly, I meant four FP16 values, since, as Xmas pointed out, NV40 cannot filter FP32 textures.
 
Sage said:
Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?

Probably (disclosure: speaking as a hobbyist coder, not an engineer) because they correctly expected the vast majority of texture reads to return only a 32-bit value, and therefore it simply wasn't worth doubling the size of the on-chip data bus (which would use valuable die area) to accelerate a relatively uncommon operation.
 
As I tried to point out, it's not only the data path, it's the number of units as well. There would be no point in restricting one but not the other.
So there are only two FP16 bilinear interpolators, while there are four 8-bit-capable interpolators. They could be implemented as 2× FP16 + 2× FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that).
 
Xmas said:
So there are only two FP16 bilinear interpolators, while there are four 8-bit-capable interpolators. They could be implemented as 2× FP16 + 2× FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that).
The only issue with this idea is that it takes three interpolations to do the summation for bilinear texture filtering.
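 
(For reference, "three interpolations" is just the usual bilinear decomposition: two lerps along one axis, then one along the other. A minimal scalar sketch in C, where fx and fy are the fractional texel coordinates:)

Code:
/* One channel of a bilinear filter from four neighbouring texels.
   Two horizontal lerps plus one vertical lerp = three interpolations. */
float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

float bilinear(float t00, float t10, float t01, float t11,
               float fx, float fy)
{
    float top    = lerp(t00, t10, fx);   /* interpolation 1 */
    float bottom = lerp(t01, t11, fx);   /* interpolation 2 */
    return lerp(top, bottom, fy);        /* interpolation 3 */
}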
 
Chalnoth said:
Xmas said:
So there are only two FP16 bilinear interpolators, while there are four 8-bit-capable interpolators. They could be implemented as 2× FP16 + 2× FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that).
The only issue with this idea is that it takes three interpolations to do the summation for bilinear texture filtering.
That's why I wrote "FP16 bilinear interpolators".

And there's one additional MAD for trilinear/AF sample accumulation.
 
Actually, it was that one additional interpolator for sample accumulation that I was concerned with. I was assuming each interpolator would pretty much automatically be operating on 4-component objects (though it seems that in nVidia's case the FP16 interpolators are a bit more flexible and capable of dual-issue).
 
Chalnoth said:
Actually, it was that one additional interpolator for sample accumulation that I was concerned with. I was assuming each interpolator would pretty much automatically be operating on 4-component objects
That doesn't appear to be the case for FP16.
(though it seems that in nVidia's case the FP16 interpolators are a bit more flexible and capable of dual-issue).
Dual-issue? Where? It looks more like a "loopback" arrangement, i.e. x and y are interpolated first, then z and w. There's no dual issue involved here: there's a full 2-component FP16 bilinear interpolator.

-FUDie
 
FUDie said:
Dual-issue? Where? It looks more like a "loopback" arrangement, i.e. x and y are interpolated first, then z and w. There's no dual issue involved here: there's a full 2-component FP16 bilinear interpolator.

-FUDie
Actually, that does make more sense. The way I initially thought you'd split up a filtering operation would be to have interpolator 1 average samples 1 and 2, interpolator 2 average samples 3 and 4, and then a third interpolator average the results of those two. It makes more sense, I suppose, to divide this up by making the interpolators 2-component instead of 4-component, rather than simply having two interpolators instead of three (which would waste a lot of computation).
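 
(A rough sketch of that "loopback" reading, purely hypothetical on my part: one 2-component bilinear unit that handles x/y in a first pass and z/w in a second, which lines up with the 2-cycle xyzw figure from earlier:)

Code:
typedef struct { float x, y, z, w; } vec4;

float lerp(float a, float b, float t) { return a + (b - a) * t; }

/* Bilinear-filter one channel from the four texels t00..t11. */
float bilin1(float t00, float t10, float t01, float t11, float fx, float fy)
{
    return lerp(lerp(t00, t10, fx), lerp(t01, t11, fx), fy);
}

/* Hypothetical 2-component FP16 bilinear unit: each pass filters two channels
   of the four texels, so an xyzw fetch loops back for a second pass. */
vec4 bilinear_xyzw(vec4 t00, vec4 t10, vec4 t01, vec4 t11,
                   float fx, float fy, int *passes)
{
    vec4 out;
    /* pass 1: x and y */
    out.x = bilin1(t00.x, t10.x, t01.x, t11.x, fx, fy);
    out.y = bilin1(t00.y, t10.y, t01.y, t11.y, fx, fy);
    /* pass 2 (loopback): z and w */
    out.z = bilin1(t00.z, t10.z, t01.z, t11.z, fx, fy);
    out.w = bilin1(t00.w, t10.w, t01.w, t11.w, fx, fy);
    *passes = 2;   /* two trips through the same 2-component unit */
    return out;
}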
 
Sage said:
Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?

Internal buses are hardly free. Besides, when you're using floating-point textures, you're likely also doing a decent amount of math, which should balance out the extra latency.
 
Sage said:
Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?
Well, I would tend to think that it's the math units that are the limitation here. It's got to take quite a few more transistors for FP16 interpolators than the usual FX8 interpolators. And if the math units can't do it any faster, why waste any transistors on data paths?
 
Okay, I guess I didn't read carefully enough; I was under the impression that the bus was the limiting factor.
 