FP16 Bilinear Filtering

Dave Baumann

Does anyone know the cost of doing an FP16 Bilinear Filter on NV40? 2 Cycles? 3 Cycles? Something else?
 
It seems that using FP16 textures costs 2 cycles on NV40, whether you're using point sampling or bilinear filtering.
 
FP16 bilinear filtering is free on NV40. However, its texturing unit can't output more than two FP16 components per cycle (that's the same with every GPU).

FP16 point sampling, x or xy: 1 cycle
FP16 point sampling, xyz or xyzw: 2 cycles
FP16 bilinear filtering, x or xy: 1 cycle
FP16 bilinear filtering, xyz or xyzw: 2 cycles
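 
(A toy model of those numbers, assuming, as above, that the unit hands back at most two FP16 components, i.e. 32 bits of filtered data, per cycle. This is just my own back-of-the-envelope sketch, not anything NVIDIA has documented:)

Code:
#include <stdio.h>

/* Toy cost model: assume the filtered texture result is delivered over a
   32-bit-per-cycle path, so cycles = ceil(components * bits_per_component / 32).
   This reproduces the FP16 numbers above, and also happens to predict the
   4-cycle FP32 case discussed below. */
int fetch_cycles(int components, int bits_per_component)
{
    int bits = components * bits_per_component;
    return (bits + 31) / 32;   /* round up to whole cycles */
}

int main(void)
{
    printf("FP16 x/xy     : %d cycle(s)\n", fetch_cycles(2, 16)); /* 1 */
    printf("FP16 xyz/xyzw : %d cycle(s)\n", fetch_cycles(4, 16)); /* 2 */
    printf("FP32 xyzw     : %d cycle(s)\n", fetch_cycles(4, 32)); /* 4 */
    return 0;
}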
 
Could the reason for that be that the data bus between the texture filtering unit and the shading units is only 32 bits wide (designed for the usual case of RGBA8 textures)?
 
akira888 said:
Could the reason for that be that the data bus between the texture filtering unit and the shading units is only 32 bits wide (designed for the usual case of RGBA8 textures)?
Actually, yes. There was a presentation at the Graphics Hardware 2004 conference where some people studied bottlenecks in GPUs for GPGPU-type tasks, and found that NO current GPU had a path from the texture cache to the pixel shader that was wider than 32 bits. (link to presentation (300K ppt)) So you could read a 4-component FP32 texture if you wanted to (on both R3xx, R4xx, NV3x, NV4x, unfiltered), but the hardware would take 4 cycles to deliver the data on all architectures. (That's 4 components × 32 bits = 128 bits, i.e. four transfers over a 32-bit path.) This was shown to be a cache->shader path limitation, not a memory bandwidth limitation (this can easily be tested by benchmarking reads of grossly magnified textures, which touch so few texels that memory bandwidth can't be the bottleneck).
 
Current GPU architectures can read four FP32 values in a cycle, but can something like NV4x bilinearly filter those four FP32 values in a single cycle, bandwidth limitations aside?
 
Luminescent said:
Current GPU architectures can read four FP32 values in a cycle
Source? It may seem obvious that they SHOULD have that ability, but actual benchmarking so far tells a different story.
 
Luminescent said:
Current GPU architectures can read four FP32 values in a cycle, but can something like NV4x bilinearly filter those four FP32 values in a single cycle, bandwidth limitations aside?
NV4x can't filter FP32 values. And if the TMUs are limited to 32 bits of output per clock (before conversion to FP32), there are certainly no units that would generate more than that.
 
Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?
 
arjan de lumens said:
Luminescent said:
Current GPU architectures can read four FP32 values in a cycle
Source? It may seem obvious that they SHOULD have that ability, but actual benchmarking so far tells a different story.
I was using the following statement of yours as my source:
So you could read a 4-component FP32 texture if you wanted to (on both R3xx, R4xx, NV3x, NV4x, unfiltered), but the hardware would take 4 cycles to deliver the data on all architectures.
I guess I glanced at it too quickly and added onto it. At first glance it seemed you were indicating that the pixel units could read four FP32 values from the texture units in a single cycle if it weren't for the fact that the data path from the texture units to the pixel shader was crippled to 32 bits.

Secondly, I meant four FP16 values, since, as Xmas pointed out, NV40 cannot filter FP32 textures.
 
Sage said:
Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?

Probably (disclosure: speaking as a hobbyist coder, not an engineer) because they correctly expected the vast majority of texture reads to return only a 32-bit value, and therefore it simply wasn't worth doubling the size of the on-chip data bus (which would use valuable die area) to accelerate a relatively uncommon operation.
 
As I tried to point out, it's not only the data path, it's the number of units as well. There would be no point in restricting one but not the other.
So there are only two FP16 bilinear interpolators, while there are four 8-bit-capable interpolators. They could be implemented as 2× FP16 + 2× FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that).
 
Xmas said:
So there are only two FP16 bilinear interpolators, while there are four 8-bit-capable interpolators. They could be implemented as 2× FP16 + 2× FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that).
The only issue with this idea is that it takes three interpolations to do the summation for bilinear texture filtering.
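 
(For reference, "three interpolations" is just the usual bilinear decomposition: two lerps along one axis, then one along the other. A minimal scalar sketch in C, where fx and fy are the fractional texel coordinates:)

Code:
/* One channel of a bilinear filter from four neighbouring texels.
   Two horizontal lerps plus one vertical lerp = three interpolations. */
float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

float bilinear(float t00, float t10, float t01, float t11,
               float fx, float fy)
{
    float top    = lerp(t00, t10, fx);   /* interpolation 1 */
    float bottom = lerp(t01, t11, fx);   /* interpolation 2 */
    return lerp(top, bottom, fy);        /* interpolation 3 */
}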
 
Chalnoth said:
Xmas said:
So there are only two FP16 bilinear interpolators, while there are four 8-bit-capable interpolators. They could be implemented as 2× FP16 + 2× FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that).
The only issue with this idea is that it takes three interpolations to do the summation for bilinear texture filtering.
That's why I wrote "FP16 bilinear interpolators".

And there's one additional MAD for trilinear/AF sample accumulation.
 
Actually, it was that one additional interpolator for sample accumulation that I was concerned with. I was assuming each interpolator would pretty much automatically be operating on 4-component objects (though it seems that in nVidia's case the FP16 interpolators are a bit more flexible and capable of dual-issue).
 
Chalnoth said:
Actually, it was that one additional interpolator for sample accumulation that I was concerned with. I was assuming each interpolator would pretty much automatically be operating on 4-component objects
That doesn't appear to be the case for FP16.
(though it seems that in nVidia's case the FP16 interpolators are a bit more flexible and capable of dual-issue).
Dual-issue? Where? It looks more like a "loopback" arrangement, i.e. x and y are interpolated first, then z and w. There's no dual issue involved here: there's a full 2-component FP16 bilinear interpolator.

-FUDie
 
FUDie said:
Dual-issue? Where? It looks more like a "loopback" arrangement, i.e. x and y are interpolated first, then z and w. There's no dual issue involved here: there's a full 2-component FP16 bilinear interpolator.

-FUDie
Actually, that does make more sense. The way I initially thought you'd split up a filtering operation would be to have interpolator 1 average samples 1 and 2, interpolator 2 average samples 3 and 4, and then a third interpolator average the results of those two. It makes more sense, I suppose, to divide this up by making the interpolators 2-component instead of 4-component, rather than simply having two interpolators instead of three (which would waste a lot of computation).
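 
(A rough sketch of that "loopback" reading, purely hypothetical on my part: one 2-component bilinear unit that handles x/y in a first pass and z/w in a second, which lines up with the 2-cycle xyzw figure from earlier:)

Code:
typedef struct { float x, y, z, w; } vec4;

float lerp(float a, float b, float t) { return a + (b - a) * t; }

/* Bilinear-filter one channel from the four texels t00..t11. */
float bilin1(float t00, float t10, float t01, float t11, float fx, float fy)
{
    return lerp(lerp(t00, t10, fx), lerp(t01, t11, fx), fy);
}

/* Hypothetical 2-component FP16 bilinear unit: each pass filters two channels
   of the four texels, so an xyzw fetch loops back for a second pass. */
vec4 bilinear_xyzw(vec4 t00, vec4 t10, vec4 t01, vec4 t11,
                   float fx, float fy, int *passes)
{
    vec4 out;
    /* pass 1: x and y */
    out.x = bilin1(t00.x, t10.x, t01.x, t11.x, fx, fy);
    out.y = bilin1(t00.y, t10.y, t01.y, t11.y, fx, fy);
    /* pass 2 (loopback): z and w */
    out.z = bilin1(t00.z, t10.z, t01.z, t11.z, fx, fy);
    out.w = bilin1(t00.w, t10.w, t01.w, t11.w, fx, fy);
    *passes = 2;   /* two trips through the same 2-component unit */
    return out;
}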
 
Sage said:
Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?

Internal buses are hardly free. Besides, when you're using floating-point textures, you're likely also doing a decent amount of math, which should balance out the extra latency.
 
Sage said:
Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?
Well, I would tend to think that it's the math units that are the limitation here. It's got to take quite a few more transistors for FP16 interpolators than the usual FX8 interpolators. And if the math units can't do it any faster, why waste any transistors on data paths?
 
Okay, I guess I didn't read carefully enough; I was under the impression that the bus was the limiting factor.
 