Does anyone know the cost of doing an FP16 Bilinear Filter on NV40? 2 Cycles? 3 Cycles? Something else?
akira888 said: Would it make sense that the reason for that is that the data bus between the texture filtering unit and the shading units is only 32 bits wide? (designed for the usual case of RGBA_8 textures)

Actually, yes. There was a presentation at the Graphics Hardware 2004 conference where some people studied bottlenecks in GPUs for GPGPU-type tasks, and found that NO current GPU had a path from the texture cache to the pixel shader wider than 32 bits (link to presentation (300K ppt)). So you can read a 4-component FP32 texture if you want to (on R3xx, R4xx, NV3x and NV4x, unfiltered), but the hardware takes 4 cycles to deliver the data on all of these architectures. This was shown to be a cache->shader path limitation, not a memory bandwidth limitation (which is easy to test by benchmarking reads of grossly magnified textures, so that nearly every fetch hits the texture cache).
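For a rough sanity check on those numbers (my own arithmetic, not taken from the presentation): a 4-component FP32 texel is 128 bits, so a 32-bit cache-to-shader path needs four transfers to deliver it. A minimal sketch, assuming the path really is 32 bits wide:

Code:
/* Back-of-envelope check, assuming a 32-bit texture-cache-to-shader path. */
#include <stdio.h>

int main(void)
{
    const int components         = 4;  /* RGBA               */
    const int bits_per_component = 32; /* FP32               */
    const int path_width_bits    = 32; /* assumed path width */

    int texel_bits = components * bits_per_component; /* 128 bits */
    int cycles     = texel_bits / path_width_bits;    /* 4 cycles */

    printf("4xFP32 texel = %d bits -> %d cycles over a %d-bit path\n",
           texel_bits, cycles, path_width_bits);
    return 0;
}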
Luminescent said: Current GPU architectures can read 4 FP32 values in a cycle

Source? It may seem obvious that they SHOULD have that ability, but actual benchmarking so far tells a different story.
Luminescent said: Current GPU architectures can read 4 FP32 values in a cycle, but can something like NV4x bilinearly filter those 4 FP32 values in a single cycle, bandwidth limitations aside?

NV4x can't filter FP32 values. And if the TMUs are limited to 32-bit output per clock (before conversion to FP32), there certainly are no units that would generate more than that.
I was using the following statement of yours as my source:

arjan de lumens said: Source? It may seem obvious that they SHOULD have that ability, but actual benchmarking so far tells a different story.
Luminescent said: Current GPU architectures can read 4 FP32 values in a cycle
So you could read a 4-component FP32 texture if you wanted to (on R3xx, R4xx, NV3x, NV4x, unfiltered), but the hardware would take 4 cycles to deliver the data on all architectures.

I guess I glanced at it too quickly and added onto it. At first glance it seemed you were indicating that the pixel units could read 4 FP32 values from the texture units in a single cycle, if it weren't for the fact that the data path from the texture units to the pixel shader was crippled to 32 bits.
Sage said: Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?
Xmas said: So there are only two FP16 bilinear interpolators, while there are four 8-bit-capable interpolators. They could be implemented as 2* FP16 + 2* FX8, or maybe you can somehow combine two FX8 interpolators to form one FP16 interpolator (though I don't see an easy way to do that).

The only issue with this idea is that it takes three interpolations to do the summation for bilinear texture filtering.
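For anyone counting: here's what those three interpolations look like for a single channel, as a minimal C sketch (illustrative names, not any vendor's actual datapath):

Code:
/* One channel of bilinear filtering: three lerps per filtered result. */
static float lerp(float a, float b, float t)
{
    return a + t * (b - a);
}

float bilinear(float t00, float t10, float t01, float t11, float fx, float fy)
{
    float top    = lerp(t00, t10, fx); /* interpolation 1: top pair of texels    */
    float bottom = lerp(t01, t11, fx); /* interpolation 2: bottom pair of texels */
    return lerp(top, bottom, fy);      /* interpolation 3: blend the two rows    */
}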
Chalnoth said: The only issue with this idea is that it takes three interpolations to do the summation for bilinear texture filtering.

That's why I wrote "FP16 bilinear interpolators".
Chalnoth said: Actually, it was that one additional interpolator for sample accumulation that I was concerned with. I was assuming each interpolator would pretty much automatically be operating on 4-component objects.

That doesn't appear to be the case for FP16.
(though it seems that in nVidia's case the FP16 interpolators are a bit more flexible and capable of dual-issue)

Dual-issue? Where? It looks more like a "loopback" arrangement, i.e. x and y are interpolated first, then z and w. There's no dual issue involved here: there's a full 2-component FP16 bilinear interpolator.
FUDie said: Dual-issue? Where? Looks more like a "loopback" arrangement, i.e. x and y are interpolated first, then z and w. There's no dual issue involved here: there's a full 2-component FP16 bilinear interpolator.

Actually, that does make more sense. The way I initially thought you'd split up a filtering operation would be to have interpolator 1 average samples 1 and 2, have interpolator 2 average samples 3 and 4, then have a third interpolator average the results of the first two. It makes more sense, I suppose, to divide this up not by dropping from three interpolators to two (a large waste of computation), but by keeping the three interpolators and making them 2-component instead of 4-component.
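To make the loopback idea concrete, here's a rough C sketch of a 2-component-wide bilinear filter being looped over a 4-component texel (x and y on the first pass, z and w on the second). Purely illustrative; the real NV4x datapath isn't public, and plain float stands in for FP16 here:

Code:
/* Hypothetical "loopback" filtering: a 2-wide interpolator handles an
   RGBA texel in two passes instead of one. */
typedef struct { float c[4]; } Texel; /* four texture components */

static float lerp(float a, float b, float t)
{
    return a + t * (b - a);
}

/* One pass: bilinearly filter 'width' components starting at 'first'. */
static void filter_pass(const Texel quad[4], Texel *out,
                        float fx, float fy, int first, int width)
{
    for (int i = first; i < first + width; ++i) {
        float top    = lerp(quad[0].c[i], quad[1].c[i], fx);
        float bottom = lerp(quad[2].c[i], quad[3].c[i], fx);
        out->c[i]    = lerp(top, bottom, fy);
    }
}

void bilinear_loopback(const Texel quad[4], Texel *out, float fx, float fy)
{
    filter_pass(quad, out, fx, fy, 0, 2); /* first pass: x and y  */
    filter_pass(quad, out, fx, fy, 2, 2); /* second pass: z and w */
}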
Sage said: Why would they cripple FP reads in this way? Surely it wouldn't be very difficult to double or even quadruple that, since we're talking about an on-chip bus. Do they just not expect anyone to ever actually use FP textures on current-generation hardware?

Well, I would tend to think that it's the math units that are the limitation here. FP16 interpolators have to take quite a few more transistors than the usual FX8 interpolators. And if the math units can't do it any faster, why waste transistors on wider data paths?