Is AF a bottleneck for Xenos?

arjan de lumens said:
If I take a textured polygon that is 1:1-mapped, and then reduce its height by a factor of 16, should I then take 8, 9 or 16 bilinear samples per pixel to perform 'proper' anisotropic mapping?
8 AFAICS. The texel footprint is 1x16, and with 8 bilinear samples you can get contribution from each texel in the nearest 2x16 area. This is how I got my assumption of 2 "new" texels per sample during AF.

Is it really possible to constrain sample positions in that manner for aniso without introducing popping artifacts?
You're right in that it would be non-trivial, but still not too difficult. Here's one theory:

Let's say your line of sampling is from (u1, v1) to (u2, v2), and |u2-u1| > |v2-v1|. The u coordinate of each of your samples would be forced to 1/2 texel locations, and u for each sample would differ from the previous by 2. The v-coordinate is such that you still lie on your line of sampling. The endpoint samples are not forced, however, and their weight depends on how far they are from the nearest forced sample.

With a few rounding rules and maybe some special handling near 45 degrees, I'm sure you could guarantee temporal continuity for a slowly varying sampling line. Samples from the lower mipmap could be optimized this way also since the footprint touches many fewer texels.
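As an aside, this forced-sample idea can be sketched in code. The snippet below is purely illustrative: it assumes one possible rounding choice (snapping interior samples to a fixed grid of u = 2k + 0.5 positions), which is a guess, not anything ATI or NV have confirmed doing.

```python
import math

def aniso_sample_positions(u1, v1, u2, v2):
    """Sketch of the forced-sample scheme: interior bilinear samples are
    snapped to a fixed grid of half-texel u positions spaced 2 texels apart
    (here u = 2k + 0.5, one hypothetical rounding choice), with v chosen so
    each sample stays on the line of anisotropy. Endpoints are unforced."""
    assert abs(u2 - u1) > abs(v2 - v1), "u must be the major axis here"
    if u2 < u1:
        u1, v1, u2, v2 = u2, v2, u1, v1
    slope = (v2 - v1) / (u2 - u1)
    samples = [(u1, v1)]                 # unforced endpoint
    k = math.ceil((u1 - 0.5) / 2.0)     # first grid point at or after u1
    u = 2.0 * k + 0.5
    while u < u2:
        samples.append((u, v1 + slope * (u - u1)))  # stay on the line
        u += 2.0
    samples.append((u2, v2))             # unforced endpoint
    return samples
```

Because the grid is fixed in texel space, a slowly varying sampling line moves the interior samples continuously, which is what would give the temporal continuity described above.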

Of course, I don't know that ATI or NV do this, but given the great lengths they've gone to in order to increase trilinear filtering speed and how much they push the boundaries for acceptable image quality with AF, I'd be very surprised if this wasn't the case.

Another possibility is that the bilinear units are actually 2 semi-independent lerping units, each of which is fed data from an adjacent pair of texels, but the 2 pairs don't have to make a 2x2 block.
 
Mintmaster said:
8 AFAICS. The texel footprint is 1x16, and with 8 bilinear samples you can get contribution from each texel in the nearest 2x16 area. This is how I got my assumption of 2 "new" texels per sample during AF.
OK. This makes sense, as far as I can tell. I still think you can get higher than that, but that will still require a mechanism to grossly undersample a mipmap (perpendicular to the axis of anisotropy, that is).
Another possibility is that the bilinear units are actually 2 semi-independent lerping units, each of which is fed data from an adjacent pair of texels, but the 2 pairs don't have to make a 2x2 block.
For aniso in different directions, you want to be able to slice up the 2x2 block along both horizontal and vertical directions. AFAICT, this is actually not much harder for the texture cache than just delivering a 2x2 block for plain bilinear in the first place. And it also looks like it would be much easier to avoid glitches/popping by tracing the axis of anisotropy through 2x1-texel blocks than through 2x2-texel blocks; this would give performance characteristics similar to the tracing algorithm you have suggested while addressing the quality/correctness problems I was concerned about.

Thank you, learned something new :)
 
Jaws said:
That's because I didn't look at the general case, or even attempt to, but only 2xAF. You're ASSUMING I looked at the general case. I was merely answering Titanio's question, not writing a thesis.
You're not paying attention. If you were talking about the general case, then you can use that figure as an upper bound. But for 2xAF, you won't be sampling from 16 different texels. It'll often be 12 (or even less), with 4+4 from the nearest mipmap, and 4 from the one below. Moreover, the one below has much greater texel sharing by necessity. You don't know the number of cycles either, because not every pixel needs samples from both mipmaps, and also because you often don't need multiple samples from the lower mipmap. Both of these factors rely on the hardware implementation.

In fact, under the right circumstances that 2xAF does indeed take 4 cycles, you definitely will not use 8GB/s. The additional texel sharing for the lower mipmap would bring you down to 4GB/s if not lower.

This is not an easy problem to analyze, and in fact, this type of calculation never is. My best guess is 4GB/s peak in practice, with an upper bound of 8GB/s in the absolute worst case of sampling from one mipmap when texel to pixel ratio peaks for every in-flight pixel.
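For what it's worth, that 8GB/s upper bound can be sanity-checked from the figures assumed in this thread. All the inputs below are this thread's assumptions (16 bilinear TMUs at 500 MHz, 2 unshared "new" texels per sample, 4-bit compressed texels), not confirmed Xenos specs.

```python
# Back-of-envelope check of the 8 GB/s worst-case figure quoted above.
TMUS = 16                     # assumed bilinear texture units
CLOCK_HZ = 0.5e9              # assumed 500 MHz core clock
NEW_TEXELS_PER_SAMPLE = 2     # the other 2 of each 2x2 block are shared
BYTES_PER_TEXEL = 0.5         # 4-bit compressed texel

peak_bw = TMUS * CLOCK_HZ * NEW_TEXELS_PER_SAMPLE * BYTES_PER_TEXEL
# peak_bw == 8e9 bytes/s, i.e. 8 GB/s
```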
Well, I didn't use "IMO" because I thought it was correct at the time. So what's your point again? That I was wrong?
When someone asks you a question and you answer the way you did, they would not think there is much room for error. Unless someone is being duplicitous, anything they post is what they think is correct, so that goes without saying. You should write IMO when you feel there is reasonable uncertainty, say when making a prediction or calculation without sufficient experience. Especially when answering a question.

No, it wasn't misunderstood in the context of my calculation.
As I explained above, yes it was. Arjan told you sampling 16 texels would take 4 cycles per texture unit. He didn't tell you that all those samples are at peak texel to pixel density, which is what you assumed. Your "corrected" calculation where you declared "QED" acts as if those two things giving your missing factor of 8 were all that mattered. 8GB/s would not, in fact, ever apply to the scenario you considered.

However, your comment was pointless. It added nothing to the thread that wasn't discussed already.
You want me to spell out the purpose of that comment? Here you go: In the future, add a qualifier whenever making statements that go well beyond your knowledge, especially when answering someone's technical question. You did misinform people, e.g. Titanio, more so because of the way you replied rather than just what you said.
 
I know we all had our thoughts and expectations for the next generation of consoles, but wow. We are talking about first-generation titles here. The Xbox 360 (and soon the PS3), as we all know, has new hardware at its disposal. I believe most developers have yet to really dig into each console's hardware for their first titles, and are instead leaning on whichever strengths are readily available to them for their launch-window games. Although I must admit I have been underwhelmed by some launch titles, I firmly believe we are only getting a hint of what the next generation of consoles is truly capable of.

I think once developers have had time with the finalized hardware development kits and are able to create "directly" for them, we will see each system's strengths and weaknesses. Right now I think it might be a little premature, given the different architectures (CPU/GPU etc.) each possesses.
 
Mintmaster said:
You're not paying attention.

And you're not paying attention. As already explained several times, it was a simple pathological number that was derived with no analysis of cache reuse.

As I explained above, yes it was. Arjan told you sampling 16 texels would take 4 cycles per texture unit. He didn't tell you that all those samples are at peak texel to pixel density, which is what you assumed. Your "corrected" calculation where you declared "QED" acts as if those two things giving your missing factor of 8 were all that mattered. 8GB/s would not, in fact, ever apply to the scenario you considered.

Again, see above. And instead of multiplying by "0.5 GHz" in my derivation, it became 0.5/4 because of the 4-cycle cost. Simple calculation.

You want me to spell out the purpose of that comment? Here you go: In the future, add a qualifier whenever making statements that go well beyond your knowledge, especially when answering someone's technical question. You did misinform people, e.g. Titanio, more so because of the way you replied rather than just what you said.

I usually do add qualifiers. I've made plenty of derivations on this board, thank you. This wasn't the first and it won't be the last. And frankly, it didn't go beyond my knowledge, as already explained; it's you who keeps insisting this. I don't need to work for ATI or NV to attempt a simple calc. And lol at misinforming people as some sort of conspiracy! The method was fine, the number was wrong. Big deal. It's a discussion forum.
 
Jaws, this'll be my last post on the subject.

If you were not analyzing cache reuse, your analysis did NOT answer Titanio's question of "where did the 8GB/s number come from?"; moreover, it is absolutely useless unless you inform us of the gaping hole in your calculation which pushed it more than an order of magnitude off. Then multiplying this number by 0.5/4 still does not fix your calcs because 0.5 does not apply to your situation. Figuring out the number that should be in place of 0.5 is not at all trivial for the 2xAF w/ 16 point samples example.

The calculation is not simple, and it does go beyond your knowledge. You need to know about cache, mipmapping, LOD selection, geometry, specifics about the hardware, etc. The method wasn't fine, nor was the so called correction, and any engineer should know that just because you fluke upon the right answer for the theoretical peak doesn't mean your scenario or method were correct. This is not about a conspiracy, it's about maintaining the integrity of the excellent Beyond3D forums. Many people come here to learn about 3D hardware, and many experts come here to teach and participate. It's not right for you to post as if you're the latter on a subject where you aren't.
 
Mintmaster said:
If you were not analyzing cache reuse, your analysis did NOT answer Titanio's question of "where did the 8GB/s number come from?";

This was explained several times in the thread and you mention this now?

I'll explain the derivation in a moment so it's clear.

moreover, it is absolutely useless unless you inform us of the gaping hole in your calculation which pushed it more than an order of magnitude off. Then multiplying this number by 0.5/4 still does not fix your calcs because 0.5 does not apply to your situation.

Your misinterpretation explains your posts. I'm talking about 0.5/4 ~ 0.125 GHz ~ 125 MHz.

I'll explain in a moment.

Figuring out the number that should be in place of 0.5 is not at all trivial for the 2xAF w/ 16 point samples example.

You completely misunderstood the number I'm referring to: it's 0.5 GHz, i.e. the clock speed of Xenos.

The calculation is not simple, and it does go beyond your knowledge.

The pathological case is simple.

You need to know about cache, mipmapping, LOD selection, geometry, specifics about the hardware, etc. The method wasn't fine, nor was the so called correction, and any engineer should know that just because you fluke upon the right answer for the theoretical peak doesn't mean your scenario or method were correct.

Your assumption of a fluke is telling. It tells me that you did not understand my derivation. You're assuming it's a fluke.

This is not about a conspiracy, it's about maintaining the integrity of the excellent Beyond3D forums. Many people come here to learn about 3D hardware, and many experts come here to teach and participate. It's not right for you to post as if you're the latter on a subject where you aren't.

Which is why I'll make it clear by deriving your 8 GB/sec with MY method,

Assumptions,

Xenos has 16 TMUs

AF2 requires 16 samples per TMU

Each sample is 4 bytes for uncompressed textures (32bit)

Each TMU reads 4 samples over 4 cycles for the required 16 samples.

If Xenos is clocked at 0.5 GHz, then 0.5 GHz / 4 clocks ~ 0.125 GHz, or 125 MHz. That would be the equivalent clock rate at which each TMU could deliver these 16 samples every cycle.



For worst case b/w usage, with 100% TMU usage,

~ 16 TMUs x (0.5/4) Ghz x 16 samples per texel x 4 bytes per sample

~ 128 GB/sec for uncompressed textures

for 1:8 texture compression, i.e. for 4bit texture compression

~ 128/8

~ 16 GB/sec for compressed textures

Now, looking at the texture cache and assuming 2 of 4 texels are shared, this halves the samples needed from memory. I used 16 samples per texel above; that 16 can be replaced with half that, 8 samples per texel. The net effect is to halve the calculated b/w numbers. So,

~ 16/2
~ 8 GB/sec for compressed textures.
QED
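The derivation above, restated as plain arithmetic. These are the post's own assumptions (16 TMUs, 0.5 GHz, 4-cycle AF2, 16 samples of 4 bytes, 8:1 compression, half the texels shared); whether those assumptions are valid is the point under dispute.

```python
# Step-by-step restatement of the 128 -> 16 -> 8 GB/s chain above.
tmus = 16
clock_ghz = 0.5          # Xenos clock, as assumed in the post
cycles_per_af2 = 4       # assumed TMU cost of one AF2 lookup
samples = 16             # assumed point samples per AF2 lookup
bytes_per_sample = 4     # uncompressed 32-bit texel

uncompressed = tmus * (clock_ghz / cycles_per_af2) * samples * bytes_per_sample
compressed = uncompressed / 8   # 8:1 (4-bit) texture compression
shared = compressed / 2         # half the texels served from cache

# uncompressed == 128.0 GB/s, compressed == 16.0 GB/s, shared == 8.0 GB/s
```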

Mintmaster said:
...Whether AF is enabled or not, Xenos can only do 16 bilinear samples per clock. With >2xAF (actual) applied per pixel, only 2 of the 4 texels used in each sample won't be shared with neighbouring pixels. At 4-bits per compressed texel, that's 8GB/s peak...

I used a MY method to derive your 8 GB/sec. NO FLUKE.

THE END
 
Most of this stuff goes over my head, but what surprises me in these very technical discussions is that there is so much disagreement over something that should come down to math. 1+1=2. Simple. So how is it that when you guys are looking at something that should come down to equations, it causes so much trouble?
It gets very confusing for a lay person to learn anything. :oops:
 
ninzel said:
Simple. So how is it that when you guys are looking at something that should come down to equations, it causes so much trouble?

It gets very confusing for a lay person to learn anything. :oops:

:LOL: It's very confusing for everybody.

I think the biggest problem is the fact that they don't know exactly how the machine works (NDA) and on top of that how devs are using the features in their games is a big question too...
 
Jaws, the 0.5 I was talking about was the divide by two for the texel sharing, i.e. "assuming 2 of 4 texels are shared, this halves the samples needed from memory". That, along with your 4 cycles, is the modification you made to get from 64GB/s in your first post for compressed textures down to 8GB/s in your QED post.

Now, as simple as you think your case is, several of your assumptions are wrong.

-2xAF will NEVER require data from 16 different texels, and hence "16 samples per texel x 4 bytes per sample" is wrong.

-Hence it may not take 4 cycles for AF2.

-Sharing factor of 0.5 is only if all samples are from a single mipmap at peak texel density. "16 samples", as you put it, obviously includes samples from multiple mipmaps, or else it wouldn't be AF2.

The fluke is that (16 samples / 4 cycles * half for sharing), the portion of your calculation that includes your three incorrect assumptions for AF2, happens to give you the same result as (4 samples / clock * half for sharing), the number which applies only to bilinear AF (or trilinear optimized to bilinear where the former isn't necessary) at the peak texel to pixel ratio without LOD bias.

It is a fluke. 8GB/s does not apply to AF2 if it is trilinear and takes 4 cycles. The knowledge required for its calculation is not simple.
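The coincidence being described is easy to see numerically: the two models happen to give the same per-clock texel budget even though they describe different scenarios.

```python
# Two per-clock texel budgets per TMU that coincide, which is the "fluke".
jaws_model       = 16 / 4 * 0.5   # 16 texels over 4 cycles, half shared
bilinear_af_peak = 4 / 1 * 0.5    # 4 texels every clock, half shared
assert jaws_model == bilinear_af_peak == 2.0   # texels per TMU per clock
```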
 
pipo said:
:LOL: It's very confusing for everybody.

I think the biggest problem is the fact that they don't know exactly how the machine works (NDA) and on top of that how devs are using the features in their games is a big question too...
You're pretty much bang on.

We don't know the specifics of the graphics chips, whether the cache works to its full potential, whether the devs or drivers will adjust LOD, etc. On top of these uncertainties, mipmapping and its consequences on bandwidth are not easy to explain.
 
ninzel said:
Most of this stuff goes over my head, but what surprises me in these very technical discussions is that there is so much disagreement over something that should come down to math. 1+1=2. Simple. So how is it that when you guys are looking at something that should come down to equations, it causes so much trouble?
It gets very confusing for a lay person to learn anything. :oops:
The problem is, of course, WHICH equations to use, and what data to plug into them. While it is quite easy to compute the bandwidth needed from the texture-cache to the TMUs for optimal performance (texel-size * 4 * number-of-TMUs), that number is not necessarily very closely connected to the bandwidth needed from external memory to the texture-cache.

Let's call the ratio between memory-to-cache-bandwidth and cache-to-TMU-bandwidth 'R'. The texture bandwidth (assuming 100% TMU utilization and bilinear TMUs) then obviously becomes (R*texel-size*4*number-of-TMUs). The question then becomes: What is a realistic value of 'R'? THAT is the difficult part - it doesn't take a lot of consideration to see that the value of R can vary greatly; on one extreme, the infamous 3dmark single-texturing test can probably achieve R=0.0001 or something like that (the texture is very small and thus only rarely needing to be fetched into cache), on the other extreme, a severely undersampled texture (like, say, bilinear with severe minification and no mipmaps or large negative LOD bias) can achieve values of R much greater than 1 (usually due to the fact that one line in the texture cache is larger than a single texel; if your cache line contains 16 texels, then it is indeed possible to achieve R=16).

More to the point, we want to know the value of R with trilinear/mipmapping/aniso under texture minification. This helps us a bit, but not really that much. For e.g. mipmapping/trilinear filtering, the value of R near a mipmap transition will be nearly 4 times as large on one side as on the other. Plain trilinear will also choose smaller mipmaps if you slant/stretch the polygon strongly, resulting in blurring and a reduction of the value of R as well. I deduced a maximum value of R around 0.6 for trilinear above, however that number is loaded with a whole bunch of assumptions which may or may not be appropriate for any given case (mipmap selection algorithm, LOD bias, perfect texture cache, no granularity problems with polygon edges etc).

And aniso on top of all this? Depends on the algorithm. The well-known 'Feline' algorithm should have more or less the same bandwidth characteristics as trilinear usually has on non-slanted polygons; however, the aniso algorithms of both Nvidia and ATI don't seem to be exactly 'Feline', and non-Feline algorithms (such as the one suggested here by Mintmaster) are likely to behave very differently in this regard.

There are too many unknowns here for a number to be anything more than a stab in the dark; if you want the cold, hard truth, you will need to use the performance counters of Xenos itself (assuming it has any and that they are accessible to you).
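The 'R' model in this post can be written down directly. The R values below are the illustrative extremes quoted in the post, not measured figures, and the 16 TMUs / 0.5 GHz / 4-bit-texel numbers are this thread's assumptions.

```python
def texture_bandwidth(r, texel_size_bytes, num_tmus, clock_hz):
    """Memory-to-cache bandwidth, assuming 100% utilization of bilinear
    TMUs, each consuming 4 texels per clock from the texture cache."""
    return r * texel_size_bytes * 4 * num_tmus * clock_hz

# Illustrative extremes from the post (R is scene/algorithm dependent):
tiny_texture   = texture_bandwidth(0.0001, 0.5, 16, 0.5e9)  # ~1.6 MB/s
trilinear_peak = texture_bandwidth(0.6,    0.5, 16, 0.5e9)  # ~9.6 GB/s
```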
 
I suppose it's about time I posted this:

[Image: Xenos block diagram (012l.jpg)]


Jawed
 
Mintmaster said:
Jaws, the 0.5 I was talking about was the divide by two for the texel sharing, i.e. "assuming 2 of 4 texels are shared, this halves the samples needed from memory". That, along with your 4 cycles, is the modification you made to get from 64GB/s in your first post for compressed textures down to 8GB/s in your QED post.

I know you were, and you were misinterpreting what I posted immediately before. I wasn't talking about the divide by two when I mentioned "0.5 GHz" in my previous post; it's the clock speed of Xenos, used to calculate peak texel fillrate. And the 4 refers to the 4-cycle cost per TMU for AF2. The "0.5/4" effectively makes it a quarter of the peak texel fillrate.

Now, as simple as you think your case is, several of your assumptions are wrong.

-2xAF will NEVER require data from 16 different texels, and hence "16 samples per texel x 4 bytes per sample" is wrong.

Agreed, it will NEVER require 16 *different* texels; hence the number is a pathological one, taken *without* looking at texture cache reuse at the beginning of the derivation. And it is not wrong in the context of my derivation, because at the end of it the "16" became "8" when texels are shared, as I mentioned.

-Hence it may not take 4 cycles for AF2.

Yep, Dave has already explained these costs here...

-Sharing factor of 0.5 is only if all samples are from a single mipmap at peak texel density. "16 samples", as you put it, obviously includes samples from multiple mipmaps, or else it wouldn't be AF2.

Agreed, see above.

The fluke is that (16 samples / 4 cycles * half for sharing),

Wrong. The scaling factor of 8 is *inherent* in my derivation, but it doesn't come from the above. It comes from,

~ (0.5 GHz/4 cycles * half for sharing)

~ 0.5 GHz x (1/4 cycles * 1/2 for sharing)

~ 0.5 Ghz x (1/8)

You are just calling it a fluke to *try* and justify your claim that it was.

the portion of your calculation that includes your three incorrect assumptions for AF2, happens to give you the same result as (4 samples / clock * half for sharing), the number which applies only to bilinear AF (or trilinear optimized to bilinear where the former isn't necessary) at the peak texel to pixel ratio without LOD bias.

See above. Your basis for a fluke is wrong.

It is a fluke. 8GB/s does not apply to AF2 if it is trilinear and takes 4 cycles. The knowledge required for its calculation is not simple.

See above. Your basis for a fluke is wrong.

A detailed analysis is not simple; it requires a close look at cache reuse, scene content, etc., as Dave explained earlier. However, the point was to derive your 8 GB/sec number, and I did, without a fluke.

You can keep clinging to this if it makes you happy. This isn't rocket science and yes, my Masters covered rocket science...
 
Mintmaster said:
...It is a fluke. 8GB/s does not apply to AF2 if it is trilinear and takes 4 cycles. The knowledge required for its calculation is not simple.

I want to expand on this point because it's quite clear to me that you can't adapt to my *simple* model,

Bilinear = 4 samples/TMU, cost ~ 1 cycle

Trilinear = 8 samples/ TMU, cost ~ 2 cycles

AF2 = 16 samples/ TMU, cost ~ 4 cycles

AF4 = 32 samples/ TMU, cost ~ 8 cycles

AF8 = 64 samples/ TMU, cost ~ 16 cycles

etc, etc...
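That linear model, written out. Note the caveat disputed in this thread: it reproduces the table as posted, but real cycle counts depend on whether both mipmap levels are actually needed.

```python
# Jaws' simple cost model: samples and cycles scale linearly with the
# aniso degree, at 4 point samples (one bilinear lookup) per TMU per cycle.
SAMPLES_PER_CYCLE = 4

def cycles(mode_samples):
    return mode_samples // SAMPLES_PER_CYCLE

model = {"bilinear": 4, "trilinear": 8, "AF2": 16, "AF4": 32, "AF8": 64}
costs = {name: cycles(s) for name, s in model.items()}
# e.g. costs["AF2"] == 4, matching the "~4 cycles" line above
```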
 
Jawed said:
I suppose it's about time I posted this:

[Image: Xenos block diagram (012l.jpg)]


Jawed

So, it does two passes: first through the vertex shaders, then it does the z/stencil test using the EDRAM, then it goes back to calculate hierarchical Z, and then the tile goes through the pixel shaders and again through the smart logic for AA/HDR. But where does tile clipping happen?
 
Lysander said:
So, it does two passes: first through the vertex shaders, then it does the z/stencil test using the EDRAM, then it goes back to calculate hierarchical Z, and then the tile goes through the pixel shaders and again through the smart logic for AA/HDR. But where does tile clipping happen?
The order should usually be
VGT, SQ, SPI, ALU, SX, PA, SC (HZ), SPI, ALU, SX, BC, AZ, EDRAM.
EDRAM is only used on output. Clipping happens somewhere in between Primitive Assembler and Scan Converter.
 
Xmas said:
The order should usually be
VGT, SQ, SPI, ALU, SX, PA, SC (HZ), SPI, ALU, SX, BC, AZ, EDRAM.
EDRAM is only used on output. Clipping happens somewhere in between Primitive Assembler and Scan Converter.

Feldstein:
We have a great deal of internal memory in the daughter die referred to above. We actually use this memory as our back buffer. In addition, all anti-aliasing resolves, Z-Buffering and Alpha Blending occur within this internal memory.
What does he mean by "z-buffering" in EDRAM?
 
Lysander said:
What does he mean by "z-buffering" in EDRAM?
He means that the normal, full resolution, per-sample depth buffer is located in EDRAM along with the color backbuffer. The per-sample depth compare takes place in logic inside the daughter die that contains the EDRAM.
But that is unrelated to the coarse-grained hierarchical Z that is done on-chip to reject fragments before the shading takes place.

As you can see in the diagram, the only data that is passed from the Alpha/Z test (AZ) through the Backend Central (BC) to the Hierarchical Z/Stencil (HZ) are Z/Stencil test results. These only help the hierarchical Z to be more effective and are not required because hierarchical Z is a conservative culling scheme.
 
Jaws, you just aren't paying attention at all to my posts.

First of all, you were the one who brought up 0.5 GHz, not me. I was never talking about it. My reference to 0.5/4 referred to this post, where you divided by four and halved to get from 64GB/s to 8GB/s.

You have a lot of difficulty following my arguments. For example:
Jaws said:
Mintmaster said:
The fluke is that (16 samples / 4 cycles * half for sharing),
Wrong. The scaling of a factor of 8 is *inherent* in my derivation but it doesn't come from the above. It comes from,

~ (0.5 GHz/4 cycles * half for sharing)

~ 0.5 GHz x (1/4 cycles * 1/2 for sharing)

~ 0.5 Ghz x (1/8)
Why did you clip my post there? You completely missed the point. Forget about chip clock speed until the very end. Multiplication is commutative, and 0.5 GHz is the easy part. Bytes per pipe per clock is the sole point of debate here. This was the point: your method assumes (16 texels / 4 cycles * half for sharing) for AF2. This is incorrect, but the final number matches a different scenario. First of all, AF2 does not always need 4 cycles (you even found Dave's post yourself); more importantly, if it does need 4 cycles, "half for sharing" is incorrect. Now, I said you'll never need 16 texels, but then you say this:
Agreed it will NEVER require 16 *different* texels, hence the number is a pathological number *without* looking at texture cache reuse at the beginning of the derivation.
You simply don't understand. Even if you don't share texel data between pixels at all, you would never need 16 different texels' worth of data to fill a single pixel using AF2. This argument of mine is orthogonal to the sharing.

Mintmaster said:
...It is a fluke. 8GB/s does not apply to AF2 if it is trilinear and takes 4 cycles. The knowledge required for its calculation is not simple.
Jaws said:
I want to expand on this point because, it's quite clear to me that you can't adapt to my *simple* model,
It is quite clear to me that you have very little understanding about texture filtering. Why did you highlight trilinear? I was clearly using "trilinear" and "takes 4 cycles" to describe the AF2. I wasn't talking about plain trilinear filtering. Anisotropic filtering can be done from only one mipmap level, in which case it's the bilinear variety, or from two mipmap levels, in which case it's the trilinear variety. 4 cycle AF2, as arjan described, could only possibly occur for the latter.

Only when all the samples are from a single mipmap near its transition edge will you get peak bandwidth usage. For your trilinear AF2 that requires 4 cycles, not only is the "without cache" data usage below 16 texels, but 2 of the 4 cycles use samples from a mipmap with 1/4 the texel : pixel ratio. Hence the inter-pixel sharing factor is even lower than half. Instead of (16 texels * half) for the worst case scenario I described, for your AF2 example it'll be (~12 texels * one third) in the worst case.
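Putting rough numbers on that comparison, using the post's own figures:

```python
# Unshared texel data per pixel for the two worst cases described above:
# the peak single-mipmap scenario vs the 4-cycle trilinear AF2 scenario.
peak_case      = 16 * 0.5        # 8 unshared texels per pixel
trilinear_case = 12 * (1 / 3)    # ~4 unshared texels per pixel
```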

I fully understand your simple model. This discussion is about bandwidth for 2xAF. Such a calculation hinges entirely upon the ratio of texels to pixels at the active mipmap level, and this is where the sharing factor comes from. My factor of 0.5 applies to the peak you can get with bilinear AF. You applied it to 4-cycle trilinear AF2, which requires only marginally more data per pixel than bilinear AF2, yet takes twice as many clock cycles per pixel.

Please, do not cut my sentences when quoting. If you don't understand something, ask. All your replies to my posts have completely misunderstood what I was telling you. Contrary to your claims, I have not misunderstood anything about your calculations or assumptions.
 