Bandwidth hit of AF

trinibwoy

Quick question. Does AF use any additional bandwidth? Or does it work solely with cached texture data? TIA
 
Hmmm, on second thought this might be a pretty dumb question :oops:. It would need to pull additional mipmaps over trilinear, hence the bandwidth hit, right? Or am I still completely clueless?
 
I don't think you can guarantee the texels reside in the texture cache, so there's a read from card memory when they don't. So yeah, I don't think the per-quad and L2 texture caches work that way for anisotropic, at least on the chips I have the best idea about.
 
Whether AF requires more or less texture bandwidth per clock than isotropic filtering depends on the AF algorithm used. But it almost always requires more bandwidth.
 
Well, if there was no hit, there would be no point to the angle-dependent hacks we see so often. Just because of the way caches work, I would expect most extra fetches to come from cache. But there has to be a significant number that cause cache misses if these IQ shortcuts are so prevalent.
 
Inane_Dork said:
Well, if there was no hit, there would be no point to the angle-dependent hacks we see so often. Just because of the way caches work, I would expect most extra fetches to come from cache. But there has to be a significant number that cause cache misses if these IQ shortcuts are so prevalent.

Sigh ... sorry, I don't see what the so-called 'angle-dependent hacks' have to do with the texture caches. Textures aren't aligned with the screen, so a screen-aligned 'hack' isn't going to help with the texture caches.

About bandwidth usage: as each additional AF sample takes an additional cycle (throughput), in the ideal case you are still consuming as much bandwidth per cycle as with no AF. You consume more bandwidth per fragment, though, just not necessarily per unit of time. In the non-ideal case I would say it consumes more, but I don't know by how much. It may depend on how the footprint of the AF samples falls over the mipmap.
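To make the per-cycle vs. per-fragment distinction concrete, here is a minimal back-of-the-envelope sketch in C. The byte counts are illustrative assumptions (8 texels per trilinear sample, 32-bit texels, no cache reuse), and it assumes the ideal fully pipelined case of one sample per cycle; it is not a model of any particular chip:

```c
/* Sketch: if each extra AF sample costs one extra cycle, bandwidth per
 * CYCLE stays flat while bandwidth per FRAGMENT grows linearly.
 * Byte counts are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    const int bytes_per_sample = 8 * 4;   /* one trilinear lookup, worst case */

    for (int af = 1; af <= 16; af *= 2) {
        int cycles_per_fragment = af;                 /* one sample per cycle */
        int bytes_per_fragment  = af * bytes_per_sample;

        printf("%2dx AF: %4d bytes/fragment, %2d bytes/cycle\n",
               af, bytes_per_fragment,
               bytes_per_fragment / cycles_per_fragment);
    }
    return 0;
}
```

The bytes/cycle column stays constant while the bytes/fragment column scales with the AF degree, which is the point above: more bandwidth per fragment, not necessarily per unit of time.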
 
The texture bandwidth hit for anisotropic mapping will depend on how many samples (1 sample = 1 trilinear lookup for the discussion below) the hardware chooses to take, but for a region where N samples are needed per pixel, it is roughly O(N^2). If you take a picture, say, 256x256 texels (mipmapped) and e.g. squeeze it by a factor of 16 vertically and keep its original size horizontally, then you get the following results:
  • With isotropic mapping, the usual trilinear filtering algorithm will touch each texel in the 16x16 mipmap once (assuming perfect texture caching), as this is the only mipmap picked. The result is an image that is severely blurred horizontally. Thus: 256 texels touched.
  • With 16x anisotropic mapping (assuming the Feline algorithm or similar algorithms that implement aniso by repeated trilinear lookups), you take 16 samples per pixel on a vertical line going down through the pixel, causing accesses 4 mipmap levels down from the isotropic case. You end up touching every texel in the 256x256 mipmap at least once. Thus: 65536 texels touched, and 256x texture bandwidth needed.
As should be clear, the bandwidth per cycle per texture unit will increase with increasing degrees of aniso (which is intuitively evident from the fact that you are sampling a lower/larger mipmap; this all assumes that you aren't getting clamped to the bottom of the mipmap pyramid).
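If anyone wants to check the arithmetic, this small sketch reproduces the idealized count above. It assumes perfect caching (each texel fetched exactly once) and the simple rule of choosing LOD from the footprint's major axis for isotropic and its minor axis for aniso; it models the counting argument, not any particular chip:

```c
/* Numeric check: a 256x256 mipmapped texture rendered at 256x16 pixels
 * (squeezed 16:1 vertically). Idealized model, perfect caching assumed. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double du = 256.0 / 256.0;  /* texels per pixel, horizontally */
    double dv = 256.0 / 16.0;   /* texels per pixel, vertically   */

    /* Isotropic: LOD from the major axis -> 16x16 mip, each texel once. */
    int  iso_mip    = (int)(256.0 / fmax(du, dv));    /* 16   */
    long iso_texels = (long)iso_mip * iso_mip;        /* 256  */

    /* 16x aniso: LOD from the minor axis -> 256x256 mip,
     * N = major:minor ratio = 16 samples per pixel. */
    int  n_samples    = (int)(fmax(du, dv) / fmin(du, dv)); /* 16    */
    int  aniso_mip    = (int)(256.0 / fmin(du, dv));        /* 256   */
    long aniso_texels = (long)aniso_mip * aniso_mip;        /* 65536 */

    printf("isotropic: %dx%d mip, %ld texels touched\n",
           iso_mip, iso_mip, iso_texels);
    printf("%dx aniso: %dx%d mip, %ld texels touched (%ldx more)\n",
           n_samples, aniso_mip, aniso_mip, aniso_texels,
           aniso_texels / iso_texels);
    return 0;
}
```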
 
RoOoBo said:
Sigh ... sorry, I don't see what the so-called 'angle-dependent hacks' have to do with the texture caches. Textures aren't aligned with the screen, so a screen-aligned 'hack' isn't going to help with the texture caches.
I'm not sure what you're trying to say.

All I'm saying is that we have these adaptive modes of AF that reduce the AF factor based on angles. The only reason for these modes to exist is if they provide a performance boost. If they did not, everyone would go for IQ so they wouldn't get clobbered by bad press. So if these modes offer better performance, I think there very well could be a connection to texture cache misses. It's possible that all the speed-up comes from processing fewer samples, but I don't see the logic in dismissing texture cache misses.
 
Inane_Dork said:
I'm not sure what you're trying to say.

All I'm saying is that we have these adaptive modes of AF that reduce the AF factor based on angles. The only reason for these modes to exist is if they provide a performance boost. If they did not, everyone would go for IQ so they wouldn't get clobbered by bad press. So if these modes offer better performance, I think there very well could be a connection to texture cache misses. It's possible that all the speed-up comes from processing fewer samples, but I don't see the logic in dismissing texture cache misses.

I would say that most AF algorithms are adaptive, as taking additional samples when they aren't required isn't useful. In fact, if you filter them in the wrong way the result could even be worse.

In my opinion the axis-based anisotropy detection algorithm used in current GPUs has more to do with a cheap hardware implementation than with performance. And in any case, if you look at the example code implementing such an algorithm that I posted in the G70 thread, there is little to suggest that the texture cache is going to be especially affected. The angle dependence arises because the anisotropy is measured along an axis: if there is no anisotropy along that axis the algorithm will detect that, and if there is no anisotropy along any of the axes the algorithm won't take additional samples. The problem is that the projected fragment area may still be anisotropic in a way that cannot be detected using that set of axes. I would like to check how it looks with 6 or 8 axes (rather than the 4 currently used) and whether the zones and angles where no anisotropy is detected in the AF test shrink, but I don't have the time right now.
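To illustrate the kind of blind zone I mean, here is a toy sketch of an axis-based test. This is not the actual code from the G70 thread; the footprint-extent approximation |a·e| + |b·e| and the evenly spaced axes are assumptions for the sketch. It uses a footprint stretched 8:1 at 22.5 degrees, which falls exactly between the standard 4 axes:

```c
/* Toy model of an axis-based anisotropy test (not the G70-thread code):
 * the footprint extent is measured along K fixed axes, and the detected
 * anisotropy is the ratio of the longest to the shortest extent. */
#include <stdio.h>
#include <math.h>

#define PI 3.14159265358979323846

/* Footprint extent along the unit axis (ex,ey), given the texture-space
 * footprint vectors a = (dudx,dvdx) and b = (dudy,dvdy). The sum of
 * absolute projections is an assumed, cheap approximation. */
static double extent(double dudx, double dvdx, double dudy, double dvdy,
                     double ex, double ey)
{
    return fabs(dudx * ex + dvdx * ey) + fabs(dudy * ex + dvdy * ey);
}

static double detected_aniso(double dudx, double dvdx,
                             double dudy, double dvdy, int k_axes)
{
    double lo = 1e30, hi = 0.0;
    for (int k = 0; k < k_axes; ++k) {
        double ang = PI * k / k_axes;   /* k_axes axes spread over 180 deg */
        double e = extent(dudx, dvdx, dudy, dvdy, cos(ang), sin(ang));
        if (e < lo) lo = e;
        if (e > hi) hi = e;
    }
    return hi / lo;   /* ~1.0 means no anisotropy seen along these axes */
}

int main(void)
{
    /* A footprint stretched 8:1, rotated 22.5 deg so its major axis falls
     * exactly between the four standard axes (0/45/90/135 deg). */
    double c = cos(PI / 8.0), s = sin(PI / 8.0);
    double dudx = 8.0 * c, dvdx = 8.0 * s;   /* major axis, length 8 */
    double dudy = -s,      dvdy = c;         /* minor axis, length 1 */

    for (int k = 4; k <= 8; k += 2)
        printf("%d axes: detected aniso %.2f (true ratio 8.00)\n",
               k, detected_aniso(dudx, dvdx, dudy, dvdy, k));
    return 0;
}
```

With 4 axes this 8:1 footprint registers as barely 2:1, while with 8 axes it is caught exactly, which is the sort of reduction in undetected zones I'd like to check against the AF test patterns.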

I don't know what the AF algorithm in NV2x and NV3x was, so I can't judge whether it was really more costly or more correct than the axis-based algorithm. Maybe NVidia decided it was taking more samples than really required and scrapped it as an inefficient implementation, or maybe they really thought it was 'unfair' for ATI to be using a cheap algorithm and decided to switch to it. I don't check benchmarks that often, but I don't remember there being such a noticeable difference between the performance penalty of ATI 16x AF and old-style NVidia 8x AF. So go and try asking them.
 