FP16 Bilinear Filtering

Sage said:
okay I guess I didn't read carefully enough, I was under the impression that the bus was the limiting factor.
Well, to be fair, there's no way to determine for certain whether this is the case (without being able to look at the architecture). But it makes the most sense that the math units are the limitation.
 
If that were so, then why does point sampling an FP32 texture have a 4-cycle latency? Point sampling shouldn't require any interpolation at all.
 
akira888 said:
If that were so, then why does point sampling an FP32 texture have a 4-cycle latency? Point sampling shouldn't require any interpolation at all.
Now that is rather odd, because point sampling an FP32 texture also requires the same bandwidth as bilinear filtering an FX8 texture. It would seem that the architecture isn't as well-optimized for sampling FP32 textures as it could be.
 
Chalnoth - a complete guess, but maybe it's because the texture caching logic fetches/determines cache hits in 2x2 texel blocks. Or, if the latency is always 4 cycles, then the bus between the texture unit and the shader is probably only 32 bits wide, so you would need 4 cycles to transfer one FP32 4-vector.
 
Sage said:
why would they cripple fp reads in this way? surely it wouldn't be very difficult to double or even quadruple that since we're talking about an on-chip bus. do they just not expect anyone to ever actually use fp textures on current-generation hardware?
While akira's explanation is the most important reason, I think another issue is the way in which FP textures are currently used.

When doing HDR post-processing, the bandwidth needed will slow you down anyway, since you have a 1:1 mapping between pixels and texels. Say you're doing a blur of 4 pixels: you need to read 256 bits of data, then write 64 bits when you're done. I think many of today's GPUs have only 32 bits of bandwidth per pipe per clock, and you'll rarely get >90% utilisation.

It makes sense to me. Why make the GPU capable of more than the memory will be able to feed it? Only when you get into ordinary usage of FP textures (i.e. not 1:1) will bandwidth be less of an issue, and I think the sky in FarCry's HDR mode is the only example so far.
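To make the arithmetic above explicit, here's a quick back-of-the-envelope sketch (my own illustration; the 32-bit per-pipe bus figure is the assumption from the post, and FP16 RGBA texels are assumed):

```python
# Bandwidth cost of a 4-tap blur over an FP16 RGBA texture, per output pixel.
TEXEL_BITS = 4 * 16              # RGBA, 16 bits per channel = 64 bits per texel
taps = 4                         # 4-tap blur

read_bits = taps * TEXEL_BITS    # 4 * 64 = 256 bits read per output pixel
write_bits = TEXEL_BITS          # 64 bits written back

bus_bits_per_pipe = 32           # assumed bandwidth per pipe per clock
cycles = (read_bits + write_bits) / bus_bits_per_pipe  # clocks at 100% utilisation
```

At 100% utilisation that's 10 clocks per pixel just moving data, so there's little point in making the filtering units any faster for this workload.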
 
If current texture caches are implemented like those proposed in the literature ("The Design and Analysis of a Cache Architecture for Texture Mapping"), the way to get four texels per cycle per fragment (bilinear) is to store the texels in Morton (Z) order, interleaved across four separate banks in the cache. The texels can then be accessed in parallel and conflict-free regardless of the fragment-to-texel ratio (so it works well even without mipmapping).

Code:
Morton order at 2x2:

      0 1
      2 3

Morton order at 4x4:

      0  1  4  5
      2  3  6  7
      8  9  c  d
      a  b  e  f

With just four banks interleaved based on Morton order, any bilinear access for a single fragment (its 2x2 neighbouring texels) is conflict-free.
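To sketch why this works (my own illustration, not from the paper; the bank assignment from the low Morton bits is an assumption): interleaving the x/y coordinate bits gives the Morton index, and the low two bits select one of four banks. Since x and x+1 always differ in their lowest bit, and likewise y and y+1, every 2x2 footprint hits all four banks exactly once.

```python
def morton_index(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x (even positions) and y (odd positions)."""
    idx = 0
    for i in range(bits):
        idx |= ((x >> i) & 1) << (2 * i)
        idx |= ((y >> i) & 1) << (2 * i + 1)
    return idx

def bank(x: int, y: int) -> int:
    # The low two Morton bits (x&1 and y&1) pick one of the four banks.
    return morton_index(x, y) & 3

# Every 2x2 bilinear footprint touches all four banks exactly once,
# regardless of where it starts:
for x in range(7):
    for y in range(7):
        banks = {bank(x + dx, y + dy) for dx in (0, 1) for dy in (0, 1)}
        assert banks == {0, 1, 2, 3}
```

The 4x4 diagram above falls out of the same function: e.g. the texel at (x=3, y=1) gets Morton index 7.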

Those banks are one texel wide, which could be another reason for the FP16 penalty: the banks are likely 32 bits wide to match the common (optimized) texel size. Accessing a texel larger than 32 bits would imply at least two accesses to the same bank, and thus at least one additional cycle.
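Under that assumption (32-bit banks; my own sketch, not a confirmed architecture detail), the access count per texel is just a ceiling division, which also lines up with the 4-cycle FP32 point-sampling latency mentioned earlier:

```python
import math

BANK_WIDTH_BITS = 32  # assumed bank width, matching the common FX8 RGBA texel

def bank_accesses(texel_bits: int) -> int:
    """Accesses to the same bank needed to fetch one texel."""
    return math.ceil(texel_bits / BANK_WIDTH_BITS)

# FX8 RGBA  (32 bits)  -> 1 access
# FP16 RGBA (64 bits)  -> 2 accesses (the FP16 penalty)
# FP32 RGBA (128 bits) -> 4 accesses (the observed 4-cycle latency)
```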

From some tests I did, NVIDIA's NV35 and ATI's RV250 seem to have this kind of texture cache architecture. ATI's R350 and RV350 seem to have a different architecture that relies more on mipmapping and on 1:1 to 1:2 fragment-to-texel ratios. My experiments seem to point to 1- and 3-cycle penalties when doing bilinear filtering beyond those ratios on those GPUs (though their texture cache is larger than NV35's).
 