Is AF a bottleneck for Xenos?

Jawed said:
That comparison is only meaningful if you have one GPU with EDRAM and another without. It would be like one GPU with texture caching and another without, though the difference afforded by EDRAM wouldn't be as extreme.

Your workings for filtering bandwidth consumption are completely irrelevant. You might as well count bus interface pins for all the good it'll do you.

Your only hope is to try to assess the cache architecture, the pipeline architecture and their quantities. On top of that you have the out of order scheduling in Xenos, which adds another layer of intractability.

We don't even know if Xenos has an L1-only or an L1/L2 cache structure (I happen to think the latter).

Jawed

Read my reply to Titanio.

It was a worst case number.

It was an attempt to derive Mintmaster's 8 GB/sec number.

Nothing more, nothing less.
 
arjan de lumens said:
AFAIK, the Xenos GPU's texture units are not capable of taking more than 4 samples per texel per clock per texture unit (+1 for a second, unfiltered texture IIRC) - 2xAF for 16 pixels will take 4 cycles, not 1.

Thanks, is that mentioned in Dave's article?

arjan de lumens said:
Also, the number you get this way is NOT the bandwidth from memory to TMUs but instead bandwidth from texture-cache to the TMUs.

I wasn't suggesting it was. It was just a pathological number.
 
It seems people cannot follow points in this thread and are too quick to call some numbers "meaningless". For the record,

Mintmaster said:
...In those calculations, 8GB/s is all the rendering bandwidth needed from main memory...

...8 GB/sec is all that's needed for AF. Well, if no one can derive this, then it's a meaningless number to make that point.
 
Jaws said:
So would trilinear be 2 cycles and AF2, 4 cycles in most TMUs, right?

In absolute terms, where these operations are being performed, that would be the cost, but AFAIK it's not as straightforward as that in terms of actual use.

Trilinear merges two mip-map levels together in order to avoid the hard boundaries between mip levels that Bilinear gives, so normal Trilinear, where two mip levels are being blended, will take two cycles. However, there will be a portion of the screen where only the base texture level is being sampled, so for some of it it's still a single cycle for Bilinear sampling.

With Anisotropic Filtering the area where only the base texture is sampled is increased (though with AF the number of samples can also increase, dependent on the angle of the texture), and it's only further into the screen that you'll get the increased sampling requirements of both AF and Trilinear. Once you get beyond the base texture and first mip-map level, the mip-maps themselves are getting smaller, so you'll probably be able to fit more into the texture cache.

It's worthwhile taking a look at these images in a review, which highlight the mip-map levels used with the coloured boundaries:
http://www.beyond3d.com/previews/nvidia/gffxu/index.php?p=20
 
From earlier in the thread,

Mintmaster said:
...Whether AF is enabled or not, Xenos can only do 16 bilinear samples per clock. With >2xAF (actual) applied per pixel, only 2 of the 4 texels used in each sample won't be shared with neighbouring pixels. At 4-bits per compressed texel, that's 8GB/s peak...

From my earlier 1:8 compression number of 64 GB/sec, and taking 4 cycles for AF2 per TMU,

~ 64/4 cycles
~ 16 GB/sec

Since 2 of the 4 texels being shared halves the samples needed,

~ 8 GB/sec
QED
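As a sanity check, the chain of numbers above can be written out explicitly (a sketch using this thread's working figures; the 512 GB/sec uncompressed peak, 1:8 compression ratio, 4 cycles for AF2, and 2-of-4 texel sharing are the thread's assumptions, not hardware specs):

```python
# Sketch of the derivation above, step by step.
uncompressed = 512.0               # GB/sec, pathological uncompressed peak from earlier
compressed = uncompressed / 8      # 1:8 texture compression -> 64 GB/sec
per_af2 = compressed / 4           # 4 cycles for AF2 per TMU -> 16 GB/sec
shared = per_af2 / 2               # 2 of 4 texels shared with neighbours -> 8 GB/sec
print(compressed, per_af2, shared)  # 64.0 16.0 8.0
```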

BTW, thanks Dave for the info.
 
Jaws, please don't throw numbers and calculations out there if you don't know what they mean. 512GB/s is way off for uncompressed textures.

16x anisotropic filtering, whenever needed, takes 16 bilinear samples. Xenos can only do 16 bilinear samples per clock, period. Also, in normal bilinear filtering you always have a texel shared by at least 4 neighbours, so the 4 samples needed give you a net of one texel per pixel peak. For any anisotropic filtering level over 2x, the sharing is now a minimum of two pixels (the non-slanted direction, e.g. left-right for a floor in an FPS).

Thus it doesn't matter how much AF is applied as long as the current pixel needs more than 2xAF. The peak, worst case data requirement is 32 texels per clock. Average case will be lower for sure. Each mipmap level has 1/4 the texels of the previous, so I'd estimate you need a factor of ~0.5 because I don't think you'll ever sample from the same mip-map while simultaneously having worst case sharing. Even this case is only during minification where your texel density is high enough. AF can still kick in when your texel sharing is more than 2:1. Remember that these calcs assume you're purely texture fetch limited.
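The peak figure above can be written out as arithmetic (a sketch only; the 500 MHz clock and 4-bit compressed texel size are this thread's working assumptions):

```python
# Worst-case texel traffic for >2xAF as described above.
samples_per_clock = 16      # Xenos: 16 bilinear samples per clock
texels_per_sample = 4       # a bilinear sample touches 4 texels
sharing = 0.5               # >2xAF: each texel shared by ~2 pixels
texels_per_clock = samples_per_clock * texels_per_sample * sharing  # 32 texels/clock
bytes_per_texel = 4 / 8     # 4-bit compressed texel
clock_hz = 500e6            # assumed Xenos clock
peak_gbps = texels_per_clock * bytes_per_texel * clock_hz / 1e9
print(texels_per_clock, peak_gbps)  # 32.0 8.0
```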

In the end, we're looking at under 1GB/s where AF isn't needed (e.g. the nearest part of a road), less than 2GB/s average before the first mipmap boundary (which is quite far with AF) where AF is needed, and soon peaking at 4GB/s thereafter.

Texture bandwidth is generally very low unless you use uncompressed textures or are doing post processing. In the latter, you're forced to access your uncompressed framebuffer, and there's a 1:1 texel to pixel ratio.
 
Oh, and some of you were complaining about why AA was optimized to have a small hit but AF wasn't.

The answer is that if AF takes a lower hit, then you're wasting resources. If Xenos could do 4xAF for free, for example, then it can do 64 bilinear samples per clock. Why not give it the ability to do 64 arbitrary samples instead of only 16 when AF is not needed?

There's no shortcut for AF other than "rip-mapping" which only works along the two axes of the texture and consumes 3x more texture memory. You simply need more texture units, that's all.
 
I've seen this homemade pic on another forum...


(image: samsung-lcd-hdtv-GR3c.jpg)
 
Mintmaster said:
Jaws, please don't throw numbers and calculations out there if you don't know what they mean. 512GB/s is way off for uncompressed textures.
...

It wasn't thrown around. It was derived. Also, if you read the thread, I've already derived it to be a factor of 8 out, thanks to Arjan. It would be 8 GB/sec compressed and 64 GB/sec uncompressed peak. Do people even read threads before commenting these days?

Frankly, no one bothered to answer Titanio's question apart from me. I knew what I was posting, thank you. And I posted it clearly so you could see the working. The fact that I didn't know all the available info (4-cycle AF2) at the time is neither here nor there. From the subsequent discussion, I learnt something, which is what a discussion board is about.

So Mintmaster, please read the thread before making a pointless comment.
 
Mintmaster said:
16x anisotropic filtering, whenever needed, takes 16 bilinear samples. Xenos can only do 16 bilinear samples per clock, period. Also, in normal bilinear filtering you always have a texel shared by at least 4 neighbours, so the 4 samples needed give you a net of one texel per pixel peak. For any anisotropic filtering level over 2x, the sharing is now a minimum of two pixels (the non-slanted direction, e.g. left-right for a floor in an FPS).
I would presume that the sharing of texels between different bilinear-passes within an aniso-filtered pixel would be basically the same as the sharing of texels between pixels in the bilinear-texture, assuming that the bilinear-texture isn't greatly magnified..?
Thus it doesn't matter how much AF is applied as long as the current pixel needs more than 2xAF. The peak, worst case data requirement is 32 texels per clock. Average case will be lower for sure. Each mipmap level has 1/4 the texels of the previous, so I'd estimate you need a factor of ~0.5 because I don't think you'll ever sample from the same mip-map while simultaneously having worst case sharing. Even this case is only during minification where your texel density is high enough. AF can still kick in when your texel sharing is more than 2:1. Remember that these calcs assume you're purely texture fetch limited.
It is possible to get worse than 32 texels per clock with trilinear/trilinear-aniso. Start out with a perfect 1:1 pixel-to-texel mapping on a mipmapped trilinear-filtered polygon, scaled so that the texels belong to another mipmap than the largest one. This should give 100% accesses into one mipmap. So far, fine. Now, you can try to very slightly shrink the polygon (a few percent). What happens now is that you sample, for each pixel, the same mipmap as the 1:1-mapped one, and the next mipmap below it, which is 4 times as large. The result is that in 2 cycles in 1 TMU, you end up needing to fetch on average very close to 5 texels from memory (there will be practically zero texel sharing on the larger mipmap). This way, you can get arbitrarily close to a worst-case of 40 texels per clock (16 TMUs * 5 texels / 2 cycles per pixel). This calculation applies to Aniso too, just repeat the exercise with the polygon being squeezed together by a factor of 2,4,8 etc along one axis to provoke 2x, 4x, 8x etc anisotropic mapping.

This is AFAIK the worst case that can actually arise (barring LOD bias/clamp and texture cache capacity/granularity problems). It is possible, but not common in practice, to have this case apply to large regions; in particular, if in the example above you slightly expand the polygon instead of slightly shrinking it, you will only need to fetch ~10 texels per clock to keep the texture mappers happy.
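The arithmetic of that construction can be sketched out (illustrative only; it just restates the 16 TMUs × 5 texels / 2 cycles figure from the post above):

```python
# arjan's worst case: trilinear on a slightly-shrunk, 1:1-mapped polygon.
tmus = 16
texels_small_mip = 1.0   # ~1:1-mapped mip level, near-full texel sharing
texels_large_mip = 4.0   # next mip down is 4x as large, ~zero sharing
cycles_per_pixel = 2     # trilinear = 2 bilinear fetches in one TMU
worst_case = tmus * (texels_small_mip + texels_large_mip) / cycles_per_pixel
print(worst_case)        # 40.0 texels per clock
```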
 
arjan de lumens said:
I would presume that the sharing of texels between different bilinear-passes within an aniso-filtered pixel would be basically the same as the sharing of texels between pixels in the bilinear-texture, assuming that the bilinear-texture isn't greatly magnified..?
I don't think so. The min sharing you'll see for a bilinear texture is 4:1, which translates into 1 texel of data per pixel, and this means the texel is shared left-right as well as up-down (I'm assuming screen-aligned texture axes for simplicity). For aniso, consider looking at a floor. The sharing is only left-right. Along the line of sampling, only the end-point texels can be shared between the up-down pixel neighbours.

For your example of the worse case scenario, I guess it depends on the LOD algorithm. I would have expected the LOD to start transitioning before the 1:1 point, especially with AF enabled. Near the transition boundary I'd expect half the samples to be from the smaller mipmap, especially with the optimizations we've seen from IHV's, hence the ~0.5 factor in my calcs (which really should be 0.625 if my assumptions were correct).
 
arjan de lumens said:
I would presume that the sharing of texels between different bilinear-passes within an aniso-filtered pixel would be basically the same as the sharing of texels between pixels in the bilinear-texture, assuming that the bilinear-texture isn't greatly magnified..?
Generally, yes, although you can do anisotropic filtering with less bilinear samples. In other words, it depends on whether you regard anisotropic filtering as multiple bilinear fetches or as area accumulation where you touch each texel in the area only once.
 
For the most part, the bandwidth (main memory) cost of any filter is basically the cost of reading the used texels out of the appropriate mip maps exactly once each, minus a bit if the texture is heavily repeated, plus a bit because cache lines are bigger than the actual minimum required set.

That's what the cache does.

Most PC hardware is now optimised for Bilinear fetches, so the cost of a more advanced filter is directly proportional to the number of bilinear fetches required to accomplish the filter.
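That proportionality can be sketched as a tiny helper (a sketch of the "cost = number of bilinear fetches" model described above, not a statement about any specific hardware; the function name is made up for illustration):

```python
def fetch_cycles(aniso: int = 1, trilinear: bool = False) -> int:
    """Cycles for a filter on hardware optimised to do one bilinear
    fetch per TMU per clock, under the bilinear-fetch cost model."""
    base = 2 if trilinear else 1   # trilinear blends two mip levels
    return base * aniso            # NxAF repeats the base filter N times

print(fetch_cycles())                          # bilinear: 1
print(fetch_cycles(trilinear=True))            # trilinear: 2
print(fetch_cycles(aniso=2, trilinear=True))   # 2xAF + trilinear: 4
```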
 
Mintmaster said:
I don't think so. The min sharing you'll see for a bilinear texture is 4:1, which translates into 1 texel of data per pixel, and this means the texel is shared left-right as well as up-down (I'm assuming screen-aligned texture axes for simplicity). For aniso, consider looking at a floor. The sharing is only left-right. Along the line of sampling, only the end-point texels can be shared between the up-down pixel neighbours.
If you have understood my argument and I have understood yours, you are arguing that there is ZERO sharing of texels WITHIN the aniso-processing of a pixel..? In that case, the aniso-algorithm will necessarily need to be something substantially different from just repeated bi/trilinear-sampling, or else suffer substantial undersampling (if the bilinear-aniso fetches are spaced farther apart than 1 texel per fetch, then you get undersampling and thus texture aliasing). Also, your 4:1 argument relies critically on the bilinear-texture never being undersampled.
For your example of the worse case scenario, I guess it depends on the LOD algorithm. I would have expected the LOD to start transitioning before the 1:1 point, especially with AF enabled. Near the transition boundary I'd expect half the samples to be from the smaller mipmap, especially with the optimizations we've seen from IHV's, hence the ~0.5 factor in my calcs (which really should be 0.625 if my assumptions were correct).
If the transitioning is not placed at the 1:1 point, then you are in violation of at least the OpenGL spec (causing you to fail even the simplistic conformance tests), and probably the Direct3D spec as well. You may be able to adjust the transitioning point when you are safely within a region of high anisotropy, though (although this will result in a texture that is more blurry than what the anisotropy degree would suggest). I would be surprised if the LOD calculation in Xenos is THAT different from the desktop.
Xmas said:
Generally, yes, although you can do anisotropic filtering with less bilinear samples. In other words, it depends on whether you regard anisotropic filtering as multiple bilinear fetches or as area accumulation where you touch each texel in the area only once.
If it is implemented as multiple-bilinear-fetches, then it is not very useful for this discussion to treat it as anything else than multiple-bilinear-fetches; if it is area summation that doesn't just boil down to multiple-bilinear-fetches, the situation is different (in such a case, it should be possible to vary the degree of anisotropy continuously instead of in steps like 2x, 3x, 4x etc).
 
Jaws said:
It wasn't thrown around. It was derived. Also, if you read the thread, I've already derived it to be a factor of 8 out, thanks to Arjan. It would be 8 GB/sec compressed and 64 GB/sec uncompressed peak. Do people even read threads before commenting these days?
I did see your post, and it had flaws that showed you still didn't understand AF. The 8 GB/s number doesn't only hold for 2xAF, it holds for all AF. And arjan's comment was regarding your assumption of 16 texels, not universally for 2xAF (see below).
Frankly no one bothered to answer Titanios question apart from me. I knew what I was posting thank you. And I posted it clearly so you could see the working. The fact that I didn't know all available info (4 cycle AF2) at the time is neither here nor there. From the subsequent discussion, I learnt something. Which is what a discussion board is about.

So Mintmaster, please read the thread before making a pointless comment.
It was not pointless. Your post was very much authoritative in nature, without even an "IMO" or "I think" or "Sound right?", when you did not know how things worked.

"4 cycle AF2", as you call it, is nothing new. I said several times that Xenos is only capable of 16 bilinearly filtered samples per clock. A trilinear sample is two bilinear samples, and AF has N bilinear samples. This has been the case for every video card from ATI and NVidia since 2000.

That's six years, buddy. I have absolutely no problem with you not knowing that, but A) I explained it in a post with my comment about 16 samples for a very steep surface, and B) if you didn't know this fact or understand my comment then you're not qualified to make confident calculations like you did.

Even that bit of info from arjan was interpreted incorrectly by you. Most of the time 2xAF would only need 2 bilinear samples (see Dave's comment about trilinear), in which case it's 2 cycle. Even if you're near the mipmap transition, you only need 16 point samples (and thus 4 cycles) if piecing together trilinear samples, which is unlikely because it's doing redundant work. The lower resolution mipmap used in trilinear filtering doesn't need as many samples to cover the same area. In the end it should be viewed as the number of bilinear samples, and that's it, because you don't know anything else about the hardware.

I always assumed 16xAF means 16 bilinear samples total, as this makes sense from a hardware point of view. Either way, as I explained before, it doesn't matter how many samples there are per pixel, because Xenos can only do 16 of them per clock. I said this before your calculation post.

Now, I agree that I wasn't too clear about the texel sharing, but you picked this part up fine. (4 texels for bilinear * 0.5 for sharing) * 16 * 500MHz * 4-bits = 8GB/s. No need to complicate this with your weird starting numbers or "4 cycle AF2" or whatnot. The only thing that can change with AF level is the sharing factor, and when over 2xAF it gets near 0.5 in the worst case. For a simple bilinear filtering, it will be 0.25 worst case.

If you don't like my posts, I don't care, because others find them useful. Just don't go around misinforming people, especially without acknowledging your limited knowledge on the subject.
 
arjan de lumens said:
If you have understood my argument and I have understood yours, you are arguing that there is ZERO sharing of texels WITHIN the aniso-processing of a pixel ..?
No, not zero. You have sharing between adjacent "lines" of anisotropy. Each texel gets used by two pixels instead of four. I'm having a hard time explaining this without a diagram...
(if the bilinear-aniso fetches are spaced farther apart than 1 texel per fetch, then you get undersamping and thus texture aliasing.)
Don't you mean 2 texels apart? You still get all the information if your sample locations are chosen well. It's just like averaging a 16 pixel area with 4 bilinear samples. Undersampling isn't a necessity.
If the transitioning is not placed at the 1:1 point, then you are in violation of at least the OpenGL spec (causing you to fail even the simplistic conformance tests), and probably the Direct3d spec as well. You may be able to adjust the transitioning point when you are safely within an region of high anisotropy, though (although this will result in a texture that is more blurry than what the anisotropy degree would suggest).
That's basically what I meant. I'm still surprised the spec is written that way. Halfway to the next mipmap, you'd have severe undersampling of the high res texture and it would still contribute to 50% of the final colour. Anyway, our debate is about what happens for angled surfaces, and the blurriness would be fairly minimal as long as the high res mipmap contributes significantly to the final pixel.

If it is implemented as multiple-bilinear-fetches, then it is not very useful for this discussion to treat it as anything else than multiplie-bilinear-fetches; if it is area summation that doesn't just boil down to multiple-bilinear-fetches, the situation is different (in such a case, it should be possible to vary the degree of anisotropy continuously instead of steps like 2x,3x,4x etc).
I don't see why it would be implemented any other way. You can still get continuous weighting of the samples by varying the location of the bilinear fetch correctly.
 
Mintmaster said:
I did see your post, and it had flaws that showed you still didn't understand AF. The 8 GB/s number doesn't only hold for 2xAF, it holds for all AF. And arjan's comment was regarding your assumption of 16 texels, not universally for 2xAF (see below).

That's because I didn't look at the general case, or even attempt to, but only 2xAF. You're ASSUMING I looked at the general case. I was merely answering Titanio's question, not writing a thesis.

It was not pointless. Your post was very much authoritative in nature, without even an "IMO" or "I think" or "Sound right?", when you did not know how things worked.

Well, I didn't use "IMO" because I thought it was correct at the time. So what's your point again? That I was wrong? I already knew that by the time you posted your reply. So your comment is not only pointless but redundant.

"4 cycle AF2", as you call it, is nothing new. I said several times that Xenos is only capable of 16 bilinearly filtered samples per clock. A trilinear sample is two bilinear samples, and AF has N bilinear samples. This has been the case for every video card from ATI and NVidia since 2000.

Redundant. Already went back in the thread and reposted your post twice and derived the numbers.

That's six years, buddy. I have absolutely no problem with you not knowing that, but A) I explained it in a post with my comment about 16 samples for a very steep surface, and B) if you didn't know this fact or understand my comment then you're not qualified to make confident calculations like you did.

Redundant. Already went back in the thread and reposted your post twice and derived the numbers.

Even that bit of info from arjan was interpreted incorrectly by you. Most of the time 2xAF would only need 2 bilinear samples (see Dave's comment about trilinear), in which case it's 2 cycle. Even if you're near the mipmap transition, you only need 16 point samples (and thus 4 cycles) if piecing together trilinear samples, which is unlikely because it's doing redundant work. The lower resolution mipmap used in trilinear filtering doesn't need as many samples to cover the same area. In the end it should be viewed as the number of bilinear samples, and that's it, because you don't know anything else about the hardware.

No, it wasn't misunderstood in the context of my calculation.

I always assumed 16xAF means 16 bilinear samples total, as this makes sense from a hardware point of view. Either way, as I explained before, it doesn't matter how many samples there are per pixel, because Xenos can only do 16 of them per clock. I said this before your calculation post.

Redundant. I went back and reposted your comments.

Now, I agree that I wasn't too clear about the texel sharing, but you picked this part up fine. (4 texels for bilinear * 0.5 for sharing) * 16 * 500MHz * 4-bits = 8GB/s. No need to complicate this with your weird starting numbers or "4 cycle AF2" or whatnot. The only thing that can change with AF level is the sharing factor, and when over 2xAF it gets near 0.5 in the worst case. For a simple bilinear filtering, it will be 0.25 worst case.

What you see as weird was a simple adjustment for me from the context of my initial derivation.

If you don't like my posts, I don't care, because others find them useful. Just don't go around misinforming people, especially without acknowledging your limited knowledge on the subject.

Never said I didn't like your posts. I find most of them informative. And I don't go around misinforming people. I source things, or derive things clearly so they're open to correction. And I'm not a graphics guru, but I do hold a Master's in Engineering, so I know a thing or two about maths.

However, your comment was pointless. It added nothing to the thread that wasn't discussed already. If you wanted to make some points, you could have done so by simply picking comments and discussing them.
 
Mintmaster said:
No, not zero. You have sharing between adjacent "lines" of anisotropy. Each texel gets used by two pixels instead of four. I'm having a hard time explaining this without a diagram...
My point is still that if you count up all bilinear-samples taken (which may be multiple samples per pixel for high-aniso regions), then you should still get the same level of sharing as with an isotropically-mapped bilinear-texture. It would seem to me that this would apply even if each texel ends up contributing to fewer actual pixels.

If I take a textured polygon that is 1:1-mapped, and then reduce its height by a factor of 16, should I then take 8, 9 or 16 bilinear samples per pixel to perform 'proper' anisotropic mapping?
Don't you mean 2 texels apart? You still get all the information if your sample locations are chosen well. It's just like averaging a 16 pixel area with 4 bilinear samples. Undersampling isn't a necessity.
Averaging a 16-pixel area with 4 bilinear samples isn't impossible, but it puts severe constraints on how you can place your bilinear sample positions - the sample positions must be exactly in the middle between 4 pixel centers - and if you move your sample positions down by 1/2 texel, you have suddenly lost 8 of your pixels. Is it really possible to constrain sample positions in that manner for aniso without introducing popping artifacts?
That's basically what I meant. I'm still surprised the spec is written that way. Halfway to the next mipmap, you'd have severe undersampling of the high res texture and it would still contribute to 50% of the final colour. Anyway, our debate is about what happens for angled surfaces, and the blurriness would be fairly minimal as long as the high res mipmap contributes significantly to the final pixel.
Fair enough.
I don't see why it would be implemented any other way. You can still get continuous weighting of the samples by varying the location of the bilinear fetch correctly.
True, but such an algorithm is rather nontrivial.
 