Is AF a bottleneck for Xenos?

^^ Too many misinterpretations for me to correct one by one, so I'll get to the point.

Mintmaster said:
I fully understand your simple model

NO YOU DON'T.

Mintmaster said:
This discussion is about bandwidth for 2xAF.

It turned into YOU TRYING to prove that my 8 GB/sec was a FLUKE. IT WASN'T.

What you posted earlier,

Mintmaster said:
...I did see your post, and it had flaws that showed you still didn't understand AF. The 8 GB/s number doesn't only hold for 2xAF, it holds for all AF...

Plug these numbers into my model,


AF2 = 16 samples/TMU, cost ~4 cycles

AF4 = 32 samples/TMU, cost ~8 cycles

AF8 = 64 samples/TMU, cost ~16 cycles


8 GB/sec holds true for all AF.

Yeah, ANOTHER FLUKE! RIIIIGHT!



Taking my earlier equation for pathological b/w without texture cache,

~ 16 TMUs x (0.5/4) GHz x 16 samples per lookup x 4 bytes per sample

...and looking at it generally, i.e.

~ No. of TMUs x (clockrate of GPU / cost of sampling in cycles) x max no. of samples needed for filtering x sample memory size

...and rearranging,

[EQ1]

~ clockrate x TMUs x sample memory size x (max no. of samples / sampling cost)

...and for the following inputs,


AF2 = 16 samples/TMU, cost ~4 cycles

AF4 = 32 samples/TMU, cost ~8 cycles

AF8 = 64 samples/TMU, cost ~16 cycles

...we can see that the ratio (max no. of samples / sampling cost) ~ 4 for all these cases. So, substituting this ratio into EQ1, we get,

[EQ2]

~ clockrate x TMUs x sample memory size x 4

Looks familiar, doesn't it?

Yeah,

arjan de lumens said:
The problem is, of course, WHICH equations to use, and what data to plug into them. While it is quite easy to compute the bandwidth needed from the texture-cache to the TMUs for optimal performance (texel-size * 4 * number-of-TMUs), that number is not necessarily very closely connected to the bandwidth needed from external memory to the texture-cache.

Yeah, another FLUKE! RIIIGHT!
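To make the arithmetic explicit, here's a quick sketch of EQ1/EQ2 in Python, assuming Xenos-like inputs (16 TMUs at 500 MHz, 4-byte samples). The figures are illustrative, not official specs:

```python
# Quick sketch of EQ1/EQ2, assuming Xenos-like inputs:
# 16 TMUs at 0.5 GHz, 4 bytes per sample. Illustrative figures only.
TMUS = 16
CLOCK_GHZ = 0.5
SAMPLE_BYTES = 4

# (max samples per TMU, sampling cost in cycles) for each AF level
AF_MODES = {
    "AF2": (16, 4),
    "AF4": (32, 8),
    "AF8": (64, 16),
}

for mode, (samples, cycles) in AF_MODES.items():
    # EQ1: clockrate x TMUs x sample size x (max samples / sampling cost)
    bw_gbs = CLOCK_GHZ * TMUS * SAMPLE_BYTES * (samples / cycles)
    print(f"{mode}: ratio {samples // cycles}, pathological b/w ~{bw_gbs:.0f} GB/s")

# arjan's cache-to-TMU figure: texel-size x 4 x number-of-TMUs, per clock
per_clock = SAMPLE_BYTES * 4 * TMUS
print(f"cache-to-TMU: {per_clock} bytes/clock ~{per_clock * CLOCK_GHZ:.0f} GB/s")
```

The ratio comes out as 4 in every case, so the pathological peak (~128 GB/sec with these inputs) is identical for all AF levels; applying the earlier 1:8 compression and 0.5 sharing factors gives 128/8/2 = 8 GB/sec.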

Mintmaster, you can believe what you want.

The 8 GB/sec was NOT a FLUKE.

I'm repeating myself, so we can agree to disagree.
 
I'm at a loss as to what's going on here.

Jaws, you realise that the texture usage scenarios you bring up don't bear any relation to real-world use? Nor do they hold for large portions of the screen, because only bilinear samples are taken even when Trilinear is enabled (i.e. Trilinear 2xAF will actually be a maximum of only 8 samples where only the base texture is used, so the cost for much of the screen is 2 cycles)?
 
Dave Baumann said:
I'm at a loss as to what's going on here.

Mintmaster is saying that my 8 GB/sec derivation, a number Mintmaster originally brought up, was a fluke. It wasn't. And, by proxy, that I shouldn't post numbers I don't understand. I disagree strongly.

Dave Baumann said:
Jaws, you realise that the texture usage scenarios you bring up don't bear any relation to real-world use?

I didn't bring them up. Mintmaster brought the 8 GB/sec PEAK into this thread to make the point that b/w wasn't an issue for AF. And I'm not disagreeing with your comment above. The number just represents a limiting, boundary condition.

Dave Baumann said:
Nor do they hold for large portions of the screen, because only bilinear samples are taken even when Trilinear is enabled (i.e. Trilinear 2xAF will actually be a maximum of only 8 samples where only the base texture is used, so the cost for much of the screen is 2 cycles)?

Exactly why I posted,

" Trilinear = 8 samples/ TMU, cost ~ 2 cycles"

Thank you for proving my point.
 
That's not what I said:

Trilinear (without AF) = 4 samples on base texture (1 cycle), then 8 samples when mip-maps are used (2 cycles).

2x Trilinear AF = 8 samples on the base texture (2 cycles), then 16 samples when mip-maps are used (4 cycles)
 
Dave Baumann said:
That's not what I said:

Trilinear (without AF) = 4 samples on base texture (1 cycle), then 8 samples when mip-maps are used (2 cycles).

2x Trilinear AF = 8 samples on the base texture (2 cycles), then 16 samples when mip-maps are used (4 cycles)

You still plug in the appropriate numbers, i.e. 4 samples, cost 1 cycle; 8 samples, cost 2 cycles; 16 samples, cost 4 cycles; and so on.
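Plugging those per-region figures into EQ1 gives the same per-cycle peak in every region. A quick sketch (same assumed Xenos-like inputs as before; illustrative only):

```python
# Rough sketch of plugging Dave's per-region figures into EQ1.
# Same assumed Xenos-like inputs as before: 16 TMUs, 0.5 GHz, 4-byte samples.
TMUS, CLOCK_GHZ, SAMPLE_BYTES = 16, 0.5, 4

# (region, samples per TMU, cost in cycles) for each screen region
REGIONS = [
    ("base texture only, no AF", 4, 1),
    ("trilinear, mip-maps in use", 8, 2),
    ("2x trilinear AF, mip-maps in use", 16, 4),
]

for region, samples, cycles in REGIONS:
    bw = CLOCK_GHZ * TMUS * SAMPLE_BYTES * (samples / cycles)
    print(f"{region}: ~{bw:.0f} GB/s pathological peak")
```

The samples/cycles ratio stays at 4 in each region, so the per-cycle peak is unchanged; what differs between regions is how many cycles each lookup burns.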
 
Xmas said:
He means that the normal, full resolution, per-sample depth buffer is located in EDRAM along with the color backbuffer. The per-sample depth compare takes place in logic inside the daughter die that contains the EDRAM.
But that is unrelated to the coarse-grained hierarchical Z that is done on-chip to reject fragments before the shading takes place.

As you can see in the diagram, the only data that is passed from the Alpha/Z test (AZ) through the Backend Central (BC) to the Hierarchical Z/Stencil (HZ) are Z/Stencil test results. These only help the hierarchical Z to be more effective and are not required because hierarchical Z is a conservative culling scheme.

I respectfully disagree. This duality of Z logic is very uneconomical. Based on the ATI patent, the Z/stencil test is performed before any pixel shading takes place. I think only Z/stencil values go to the smart logic, which calculates overdraw and then sends data to the hierarchical Z (on the main die), which either "kills the tile" or passes it. From the design scheme it looks like there is an additional BW pipeline for Z/stencil test data transfer (from the daughter to the parent die).

ATI patent:
In determining whether to render the plurality of pixels within the tile, two different tests are performed, a stencil test and a hierarchical Z value test, otherwise known as a depth test. If the stencil test fails or the hierarchical Z value test fails, a determination is made to not render the pixels, otherwise referred to as killing the tile, as it is determined that the pixels are not visible in the graphical output.
If the stencil test passes and the hierarchical Z test passes, the pixels within the tile are rendered, as it is determined that there is a likelihood the pixels within the tile will be visible.
 
Lysander said:
I respectfully disagree. This duality of Z logic is very uneconomical. Based on the ATI patent, the Z/stencil test is performed before any pixel shading takes place. I think only Z/stencil values go to the smart logic, which calculates overdraw and then sends data to the hierarchical Z (on the main die), which either "kills the tile" or passes it. From the design scheme it looks like there is an additional BW pipeline for Z/stencil test data transfer (from the daughter to the parent die).

ATI patent:
That patent clearly refers to hierarchical Z. Yes, that hierarchical, per-tile Z test is done before pixel shading. Entirely on-chip. Which is what I wrote. But that is not enough. You need a per-sample check as well. And that is performed inside the daughter die.
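To illustrate why the coarse test can live on the main die without needing the per-sample results, here's a minimal sketch of a conservative hierarchical-Z tile test. The structures and names are hypothetical, not ATI's actual design:

```python
# Minimal sketch of a conservative hierarchical-Z tile test (hypothetical
# structures; not ATI's actual implementation). Depth convention: smaller
# z is nearer, and the depth test is "pass if fragment z < stored z".

class ZTile:
    """Coarse per-tile record kept on the main die: the farthest possible
    z of any sample already written in the tile."""
    def __init__(self):
        self.z_max = 1.0  # pessimistic bound; Z/stencil feedback from the
                          # daughter die can only tighten (lower) it

def hierarchical_z_test(tile: ZTile, frag_z_min: float) -> bool:
    """Return True if the fragment block might be visible.

    Conservative: if even the nearest z of the incoming block is behind
    the tile's farthest stored z, every sample would fail the per-sample
    test, so the whole block can be killed before shading. Otherwise it
    is passed on, and the exact per-sample Z test (in the daughter die,
    next to the EDRAM) makes the final call."""
    return frag_z_min < tile.z_max

tile = ZTile()
tile.z_max = 0.3  # e.g. tightened by Z feedback from the daughter die
print(hierarchical_z_test(tile, 0.5))  # False: block rejected on-chip
print(hierarchical_z_test(tile, 0.1))  # True: sent on for per-sample testing
```

The point is that the feedback only makes the coarse test tighter; the scheme is still correct with no feedback at all, which is why it's not a required data path.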
 
Jaws, I'm going to explain briefly one last time, and if you don't understand, then I give up.

Here is your "correction" post.
Jaws said:
From my earlier 1:8 compression number of 64 GB/sec, and taking 4 cycles for AF2 per TMU,

~ 64/4 cycles
~ 16 GB/sec

Since 2 of every 4 texels are shared, halving the samples needed,

~ 8 GB/sec
QED

BTW, thanks Dave for the info.
The statement "Since 2 of 4 texels, being shared halves the samples needed" DOES NOT APPLY FOR 4-CYCLE AF2. Have I hammered this point into your head yet? The sharing is more, so the factor is less than half. If only one mipmap is needed for AF2 - and hence only took 2 cycles and 8 texels all at peak density - then it does apply.

Jaws said:
AF2 = 16 samples/TMU, cost ~4 cycles

AF4 = 32 samples/TMU, cost ~8 cycles

AF8 = 64 samples/TMU, cost ~16 cycles


8 GB/sec holds true for all AF.

Yeah, ANOTHER FLUKE! RIIIIGHT!
If AF2 takes 4 cycles, worst case is ~4GB/s, not 8GB/s.
If AF4 takes 8 cycles, worst case is ~4GB/s, not 8GB/s.
If AF8 takes 16 cycles, worst case is ~4GB/s, not 8GB/s.

Your model assumes a sharing factor of 0.5 for all these cases, which is wrong. Get it?

8GB/s applies when sampling is from one mipmap level, and cycles taken are halved from your numbers.

When I was talking about a peak of 8GB/s being applicable to all AF, I was referring to regions where one mipmap was used because that's when you get peak bandwidth usage per cycle.
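To spell out where our numbers diverge, a rough sketch using the thread's assumed figures (16 TMUs at 0.5 GHz, 4-byte samples, 1:8 compression). The 0.25 two-mipmap sharing factor is illustrative only, standing in for "less than half":

```python
# Rough sketch of the two models, using the thread's assumed figures:
# 16 TMUs at 0.5 GHz, 4-byte samples, 1:8 texture compression. The 0.25
# sharing factor in the two-mipmap worst case is illustrative only.
TMUS, CLOCK_GHZ, SAMPLE_BYTES = 16, 0.5, 4
COMPRESSION = 8

def worst_case_bw(samples, cycles, sharing):
    # Pathological GB/s before compression, amortised over the lookup's
    # cycles, then discounted for compression and texel sharing.
    raw = TMUS * CLOCK_GHZ * samples * SAMPLE_BYTES
    return raw / COMPRESSION / cycles * sharing

# Jaws's model: sharing factor fixed at 0.5 for every AF level
for mode, samples, cycles in (("AF2", 16, 4), ("AF4", 32, 8), ("AF8", 64, 16)):
    print(f"Jaws {mode}: {worst_case_bw(samples, cycles, 0.5):.0f} GB/s")

# My point: 0.5 sharing only holds in one-mipmap regions, where the cycle
# counts are also halved - that's where the 8 GB/s peak comes from...
print(f"one-mipmap AF2: {worst_case_bw(8, 2, 0.5):.0f} GB/s")
# ...while the two-mipmap case shares more, so the factor drops below 0.5
print(f"two-mipmap AF2: ~{worst_case_bw(16, 4, 0.25):.0f} GB/s")
```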


EDIT: Okay, now I see you are also quoting arjan to back your calculations up:
Jaws said:
arjan de lumens said:
The problem is, of course, WHICH equations to use, and what data to plug into them. While it is quite easy to compute the bandwidth needed from the texture-cache to the TMUs for optimal performance (texel-size * 4 * number-of-TMUs), that number is not necessarily very closely connected to the bandwidth needed from external memory to the texture-cache.
Yeah, another FLUKE! RIIIGHT!
For god's sake, did you even read the rest of the same damn sentence you made bold? Here it is: "that number is not necessarily very closely connected to the bandwidth needed from external memory to the texture-cache." You made this connection using my claim that sharing halves the bandwidth, without understanding it.

End of discussion.
 
^^ I won't bother correcting your misinterpretations AGAIN. I'll get straight to the point,

Mintmaster said:
...
For god's sake, did you even read the rest of the same damn sentence you made bold? Here it is: "that number is not necessarily very closely connected to the bandwidth needed from external memory to the texture-cache." You made this connection using my claim that this halves the bandwidth without understanding it.

Jaws said:
...Taking my earlier equation for pathological b/w without texture cache,...

Thank you for proving that you have a READING COMPREHENSION PROBLEM in this thread.

Now, for the third time: believe what you want and cling to your fluke. This flukathon is over. We'll agree to disagree.
 