Why are AMD and NVIDIA still increasing TMU count?

Why are AMD and NVIDIA still increasing TMU count? Texture resolution and anisotropic filtering are performance-free... [...] I can crank textures and anisotropic filtering to max even with my Intel HD Graphics 1000. I only need to lower everything else.
Do they really?

GCN has a very healthy 64:4, i.e. 16:1, ratio of ALUs to TMUs, same as its predecessor Cayman. VLIW5 (Evergreen and earlier) was at 20:1, granted. Nvidia is currently at 192:16, or 12:1, coming from 8:1 in Fermi (GF100/110). The increase in total TMU count is due to the use of building blocks and the growing number of those blocks per chip. You really need quad-TMUs in order to feed the quad-based pixel pipeline from earlier ages. Nvidia apparently found that octo-TMUs gave them better ratios on whatever they're measuring (hit rate in the caches, for example).
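For reference, a quick back-of-the-envelope sketch of those per-block ratios (the unit counts are the ones quoted above, taken per building block; purely illustrative, not a spec sheet):

[code]
# Per-building-block ALU and TMU counts as quoted above (illustrative only).
blocks = {
    "GCN CU (Tahiti)":       (64, 4),    # 64 ALUs, 4 TMUs -> 16:1
    "Cayman VLIW4 SIMD":     (64, 4),    # 16 x VLIW4      -> 16:1
    "Evergreen VLIW5 SIMD":  (80, 4),    # 16 x VLIW5      -> 20:1
    "Kepler SMX":            (192, 16),  #                 -> 12:1
    "Fermi SM (GF100/110)":  (32, 4),    #                 -> 8:1
}

for name, (alus, tmus) in blocks.items():
    print(f"{name:22s} {alus:3d} ALUs : {tmus:2d} TMUs = {alus // tmus}:1")
[/code]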
 
Do they really?

GCN has a very healthy 64:4, i.e. 16:1, ratio of ALUs to TMUs, same as its predecessor Cayman. VLIW5 (Evergreen and earlier) was at 20:1, granted. Nvidia is currently at 192:16, or 12:1, coming from 8:1 in Fermi (GF100/110). The increase in total TMU count is due to the use of building blocks and the growing number of those blocks per chip. You really need quad-TMUs in order to feed the quad-based pixel pipeline from earlier ages. Nvidia apparently found that octo-TMUs gave them better ratios on whatever they're measuring (hit rate in the caches, for example).

Are we sure that ALU:TMU ratio is that important? Instead of increasing the SIMD cluster count and thus multiplying the TMUs and wasting silicon space, they should just go with a fixed number of clusters and only increase the ALUs and the ROPs (because antialiasing is still not free).
 
Those percentages can be misread. We don't know the actual FPS. Besides, I tested anisotropic filtering on my very own IGP and there's never any impact at all going from trilinear to 16x.

No, they cannot; actual fps is irrelevant here. AF costs performance as shown, period.
 
No, they cannot; actual fps is irrelevant here. AF costs performance as shown, period.

You talk like they lost 50% after going to 16x. Those percentages are ridiculous! They could even have invented 'em. And I already told you I don't lose a single frame by setting 16x with my IGP.
 
You talk like they lost 50% after going to 16x. Those percentages are ridiculous! They could even have invented 'em. And I already told you I don't lose a single frame by setting 16x with my IGP.

You claimed it was "free", which it definitely isn't. 10-20% is not negligible.
What you told me is irrelevant, since yours is not the only truth and I don't see any measurements to back your claim up. Aside from that, who cares about IGPs? If it's a current iGPU from AMD, it will likely behave similarly to the discrete GPUs from the test, since it's the same architecture.
 
I can crank textures and anisotropic filtering to max even with my Intel HD Graphics 1000. I only need to lower everything else.
If you lower everything else, chances are good you're not bound by the GPU but somewhere else. If anisotropic filtering were free, it should be the other way around: you would notice a performance drop with everything at low settings, but not so much of a drop with everything else set to high, because by then the bottleneck would be in other places and anisotropic filtering would run in parallel, hidden behind the other expensive parts -> virtually for free.
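A minimal toy model of that argument (the stage times and the max() model are made up purely for illustration; real frame timing is messier):

[code]
# Toy frame-time model: if the stages overlap, only the longest one
# determines the frame time. All numbers below are invented.
def frame_time_ms(texturing_ms, shading_ms, cpu_ms):
    return max(texturing_ms, shading_ms, cpu_ms)

# Everything on low: texturing (with 16xAF) is the longest stage,
# so the AF cost is fully visible in the frame time.
print(frame_time_ms(texturing_ms=6.0, shading_ms=4.0, cpu_ms=5.0))   # 6.0 ms

# Everything else on high: shading dominates, so the same AF cost is hidden.
print(frame_time_ms(texturing_ms=6.0, shading_ms=15.0, cpu_ms=5.0))  # 15.0 ms
[/code]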
 
You talk like they lost 50% after going to 16x. Those percentages are ridiculous! They could even have invented 'em. And I already told you I don't lose a single frame by setting 16x with my IGP.

Your IGP is likely completely limited by main memory bandwidth. The extra AF taps all hit the texture caches and are thus "free".

Cheers
 
You talk like they lost 50% after going to 16x. Those percentages are ridiculous! They could even have invented 'em. And I already told you I don't lose a single frame by setting 16x with my IGP.


That's not true, sorry. It depends on the game.
 
The performance hit depends quite a bit on the title and the settings used. Some games apply anisotropic filtering only to select surfaces, with varying degrees of anisotropy; others simply use a blanket approach for everything. Of course, there's always the driver override, if applicable.

Free 16xAF is simply not possible, since no existing GPU architecture dedicates that many sampling units (or that much L1 texture cache) per texture address unit. G80 did have free trilinear (2xAF) rate, though.
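A rough way to see why, assuming the usual one-bilinear-sample-per-cycle TMU (the sample counts below are the textbook worst case, not measurements of any particular chip):

[code]
# Worst-case filtering cost in bilinear samples per texel, assuming a TMU
# that delivers one bilinear sample per cycle (textbook model, not a spec).
def cycles_per_texel(max_aniso, trilinear=True):
    taps = max_aniso      # up to 'max_aniso' probes along the line of anisotropy
    if trilinear:
        taps *= 2         # each probe filters between two MIP levels
    return taps

print(cycles_per_texel(1))    # plain trilinear:    2 cycles per texel
print(cycles_per_texel(16))   # worst-case 16xAF:  32 cycles per texel
[/code]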
 
How significant a benefit would there be to increasing L2 cache size with the same # of shader/tex units?

i.e. if AMD attached 256kB to each MC (instead of the current 64/128kB config), or if nV attached 512kB to each MC (instead of 256kB). Would they be able to do it on the current node? (I presume not)
 
Your IGP is likely completely limited by main memory bandwidth. The extra AF taps all hit the texture caches and are thus "free".
For proper AF you need more detailed MIP levels, so in bandwidth-limited cases it should become slower.
 
How significant a benefit would there be to increasing L2 cache size with the same # of shader/tex units?
It should be between 'barely noticeable' and 'nothing', for two reasons:
1. Texture reading is very coherent: if you hit a tile once, you will hit it many more times during the next texture lookups, but afterwards you'll most probably not touch it again before the texture is switched, so doubling the cache would barely increase the hit rate.
2. The cache miss rate is fairly predictable from the tile size alone, simply by the number of texels you can sample from it. If you organize your textures in e.g. 128-byte tiles with DXT1 compression, a tile holds 16x16 texels, so you can sample from it about 256 times (assuming unique UV mapping) before you need a new tile. One bilinear sample costs 1 cycle, trilinear 2 cycles (on most hardware), so in the usual case you have roughly 512 cycles during which a tile stays in use. If you organize the hardware in a smart way, it holds more than those 256 sample requests in a queue, so you can prefetch the next tile about 512 cycles ahead; even assuming an L2 miss every time, you'd still have 512 cycles of slack for every memory read. (The CUDA documentation actually says you can assume around 300 cycles for a cache miss.) A rough version of that arithmetic is sketched below.
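Sketch of that tile arithmetic, under the same assumptions as above (128-byte tiles, DXT1, unique UV mapping, 2 cycles per trilinear sample; a back-of-the-envelope model, not hardware data):

[code]
# Back-of-the-envelope tile reuse arithmetic, assumptions as stated above.
TILE_BYTES = 128
DXT1_BITS_PER_TEXEL = 4                                    # DXT1: 8 bytes per 4x4 block
texels_per_tile = TILE_BYTES * 8 // DXT1_BITS_PER_TEXEL    # 256 texels (a 16x16 region)

samples_per_tile = texels_per_tile                         # ~256 with unique UV mapping
cycles_per_trilinear_sample = 2
cycles_per_tile = samples_per_tile * cycles_per_trilinear_sample   # ~512 cycles

MISS_LATENCY = 300                            # cycles, the figure from the CUDA docs
slack = cycles_per_tile - MISS_LATENCY        # latency is comfortably hidden if > 0
print(texels_per_tile, cycles_per_tile, slack)   # 256 512 212
[/code]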

That's the reason the L1 TMU caches are so tiny (and only doubled in size when FP16 textures were introduced) and are not shared across the TMUs. They should rather be seen as a "streaming" cache/buffer than as an acceleration cascade for main memory accesses, as is the case on a CPU. On a CPU you stall on a cache miss; on a GPU (with the usual workloads) there is no stall until you become memory bound anyway.

If you have OpenCL running, or some post effects like SSAO, it's another story, but in that case increasing and sharing the L1 of the TMUs would be far more beneficial.
 
How significant a benefit would there be to increasing L2 cache size with the same # of shader/tex units?
It increases the hit rate of the L2, of course, which reduces the external bandwidth requirements for texturing. So it helps a bit if external bandwidth is a limiting factor. If it's not (shader limited or already TMU/fetch limited), it's not going to change much.
i.e. if AMD attached 256kB to each MC (instead of the current 64/128kB config), or if nV attached 512kB to each MC (instead of 256kB). Would they be able to do it on the current node? (I presume not)
NV only uses 256kB per tile (64-bit memory controller) for GK110 (maybe GK208 has the same); the smaller chips have less (128kB per 64-bit controller, i.e. in the same league as AMD with 64kB per 32-bit channel in Tahiti/Pitcairn and 128kB per 32-bit channel in Cape Verde).
But of course they could do it on the current node. The L2 cache sizes in GPUs are still relatively small overall (up to 1.5MB in GK110), and area-wise they are almost tiny compared to the total die size or other SRAM pools on the die. For instance, Tahiti contains more than 12MB of SRAM in total, so doubling or even quadrupling the L2 from its 768kB doesn't appear to be extraordinarily costly. The question is whether it brings the same performance benefit for the average workload (games) as using the same area for something else. For a GPGPU/HPC-oriented chip it's probably worth it (and probably the reason GK110 packs three times the amount of GK104). For current games, I doubt it a bit. But the next generation of GPUs also has to handle future games, so I guess they will definitely think about it.
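To turn the per-controller figures quoted above into chip totals (the channel counts are the usual bus configurations of those chips; treat this as a quick sanity check, not official data):

[code]
# L2 total = slice size per memory channel x number of channels,
# using the per-channel figures quoted above.
chips = {
    # name:               (kB per channel, channels)
    "Tahiti (12x32b)":     (64, 12),    # ->  768 kB
    "Pitcairn (8x32b)":    (64, 8),     # ->  512 kB
    "Cape Verde (4x32b)":  (128, 4),    # ->  512 kB
    "GK104 (4x64b)":       (128, 4),    # ->  512 kB
    "GK110 (6x64b)":       (256, 6),    # -> 1536 kB, 3x GK104
}

for name, (kb, n) in chips.items():
    print(f"{name:20s} {kb * n:5d} kB L2")
[/code]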
 
hmm... Would they be looking at the usage of volume textures & HDR texture formats?
 
For proper AF you need more detailed MIP levels, so in bandwidth-limited cases it should become slower.
That doesn't sound right to me?
Without AF but with trilinear enabled, you interpolate between 2 MIP levels to get the texel value. With AF, don't you simply gather more pixels from the same MIP levels? Or do you hit more than 2 MIP levels for certain (but not all) triangle orientations?

As you say in your next reply: texture reading is very coherent. So the additional AF reads should still come from the texture cache.
 
That doesn't sound right to me?
Without AF but with trilinear enabled, you interpolate between 2 MIP levels to get the texel value. With AF, don't you simply gather more pixels from the same MIP levels? Or do you hit more than 2 MIP levels for certain (but not all) triangle orientations?
You need more samples from higher-resolution MIP levels compared to trilinear filtering. That's why it gets sharper; otherwise it would just be a directional blur. ;)
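Roughly, the textbook LOD math behind that (a simplified, spec-style approximation, not how any particular vendor implements it): the anisotropy ratio lets the sampler pick a MIP level about log2(N) steps sharper and take N probes along the major axis, instead of blurring down to the level dictated by the major axis alone.

[code]
import math

# Simplified anisotropic LOD selection (textbook model; real hardware differs in detail).
def aniso_lod(px, py, max_aniso=16):
    """px, py: texel footprint lengths along the two screen-space axes."""
    p_max, p_min = max(px, py), min(px, py)
    n = min(math.ceil(p_max / p_min), max_aniso)  # probes along the major axis
    lod = math.log2(max(p_max / n, 1.0))          # LOD from the reduced footprint -> sharper MIP
    return n, lod

# Isotropic trilinear on a 16:1 footprint would pick LOD log2(16) = 4;
# 16xAF instead takes 16 probes from LOD 0, i.e. a much more detailed MIP level.
print(aniso_lod(16.0, 1.0))   # (16, 0.0)
print(aniso_lod(4.0, 1.0))    # (4, 0.0)
[/code]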
 