Why are AMD and NVIDIA still increasing TMU count?

It increases the hitrate of the L2 of course. ...
It might help on the framebuffer; I think Intel's eDRAM is mainly targeting framebuffers.

If you sample a (compressed) texture, you read about 4 bits/pixel; if you write (Z-buffer + color), it's 64 bits/pixel. If that fits into the L2, it could IMO be the part of the pipeline that benefits most.
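Roughly, for a single 1080p frame (my assumptions: one block-compressed texture fetch and one color+Z write per pixel, which is of course a big simplification):

```python
# Back-of-envelope: per-frame traffic for texture reads vs. framebuffer writes.
# Assumptions (illustrative only): 1920x1080, one block-compressed texture
# fetch per pixel (~4 bits with DXT1/BC1), one 32-bit color + 32-bit Z write.
pixels = 1920 * 1080

texture_read_bits = 4             # DXT1/BC1: 64 bits per 4x4 block = 4 bits/texel
framebuffer_write_bits = 32 + 32  # color + Z

tex_mib = pixels * texture_read_bits / 8 / 2**20
fb_mib = pixels * framebuffer_write_bits / 8 / 2**20

print(f"texture reads:      {tex_mib:.1f} MiB/frame")   # ~1 MiB
print(f"framebuffer writes: {fb_mib:.1f} MiB/frame")    # ~16 MiB
# The write side is ~16x larger under these assumptions, which is why keeping
# the framebuffer on chip pays off first.
```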
 
Intel targets everything. The L4 also caches CPU traffic.

The thing is, at least AMD chips don't use the L2 cache to cache the framebuffer (at least not for ROP access, just when you read it back through the TMUs); it would simply pollute it too much. The ROPs have specialized caches which are small (Pitcairn/Tahiti have a 128 kB color cache and a 32 kB Z cache) but offer way higher bandwidth internally (enough for 4xMSAA and blending with a 64-bit color format). Framebuffer de-/compression is done when loading or storing render target tiles to memory (bypassing the L2 in the process). It also serves the purpose of coalescing reads and combining writes to enable high transfer efficiency for the memory (fewer read/write turnarounds, series of bursts for accessing the tiles).
As you say, one would need to fit the render target into the cache. That would need quite a bit of storage (high-end GPUs have to plan for at least 4 Mpixels with an HDR format => 32 MB without AA). And usually, the memory bandwidth isn't that far behind what the ROPs can actually do (memory bandwidth is enough for 32-bit color formats, and even 64-bit without blending, to achieve the theoretical fillrate on Tahiti, for instance). So it is obviously the better call to provide a slight increase of the L2 cache to reduce the bandwidth requirements for texturing, so more is available for the ROPs (which bypass the L2). That reasoning only changes on platforms with severely limited memory bandwidth (IGPs/APUs) and if you can provide large amounts of storage (like Intel with its 128 MB L4). It's not an option so far for AMD or nV until HBM (wide-interface stacked memory on an interposer or directly on the chip) or similar approaches become economical.
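To put rough numbers on that storage requirement (assuming a ~4 Mpixel target with a 64-bit RGBA16F format):

```python
# Storage needed to keep a whole render target on chip.
# Assumptions: ~4 Mpixel target (2560x1600), 64-bit HDR color (RGBA16F).
pixels = 2560 * 1600
bytes_per_pixel = 8               # RGBA16F

for msaa in (1, 2, 4):
    mib = pixels * bytes_per_pixel * msaa / 2**20
    print(f"{msaa}x AA: {mib:.0f} MiB")
# ~31 MiB without AA and ~125 MiB with 4xMSAA: far beyond any realistic L2,
# but in the ballpark of a 128 MB L4.
```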
 
Intel targets everything. The L4 also caches CPU traffic.
I have some doubts it gives a big gain on the CPU side. There might be some cases, but doubling the cache size (or associativity) will not even give you a 10% speedup (unless you have very specialized software with a working set that fits exactly into the doubled cache size).

As you say, one would need to fit the render target into the cache. That would need quite a bit of storage (high-end GPUs have to plan for at least 4 Mpixels with an HDR format => 32 MB without AA).
Exactly why Intel went for such an 'insane' size, and why I believe it targets the framebuffer mostly.

And usually, the memory bandwidth isn't that far behind what the ROPs can actually do (memory bandwidth is enough for 32-bit color formats, and even 64-bit without blending, to achieve the theoretical fillrate on Tahiti, for instance).
But it's not that the bandwidth was adjusted to the ROP count; it's the ROP count that was adjusted to the bandwidth. There is just no point in having more ROPs than your memory can serve. With about 20 GB/s (most desktops nowadays, or the PS3's RSX), you saturate at 8 ROPs @ ~500-600 MHz. (Sure, ROPs vary, but roughly counted it's ROP count * clock * bytes/pixel, e.g. 8 * 500 MHz * 4 B -> 16 GB/s.) Doubling the ROP count without doubling the bandwidth would barely change anything -> eDRAM. Or the other way around: while all other integrated GPUs peak at 8 ROPs, I'd bet the Haswell GPU has at least 16 ROPs at peak.
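As a small sketch of that formula (color writes only, ignoring Z and blending; the configurations are just illustrative):

```python
# Memory bandwidth needed to sustain peak fillrate (color writes only):
#   bandwidth ~= ROP_count * clock * bytes_per_pixel
def rop_write_bw_gbs(rops, clock_mhz, bytes_per_pixel=4):
    return rops * clock_mhz * 1e6 * bytes_per_pixel / 1e9

print(rop_write_bw_gbs(8, 500))    # 16.0 GB/s -> saturates a ~20 GB/s desktop memory setup
print(rop_write_bw_gbs(8, 600))    # 19.2 GB/s
print(rop_write_bw_gbs(16, 1000))  # 64.0 GB/s -> what a hypothetical 16-ROP IGP would want
```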

The thing is, at least AMD chips don't use the L2 cache to cache the framebuffer (at least not for ROP access, just when you read it back through the TMUs); it would simply pollute it too much. The ROPs have specialized caches which are small (Pitcairn/Tahiti have a 128 kB color cache and a 32 kB Z cache) but offer way higher bandwidth internally (enough for 4xMSAA and blending with a 64-bit color format). Framebuffer de-/compression is done when loading or storing render target tiles to memory (bypassing the L2 in the process). It also serves the purpose of coalescing reads and combining writes to enable high transfer efficiency for the memory (fewer read/write turnarounds, series of bursts for accessing the tiles).
Sure, I wasn't restricting my reply to just the vertex/index/texture/constant caches; I was talking about the L2 cache in general. If you double it, then it's for framebuffer purposes.
Sorry if I haven't made this explicitly clear or if "L2" doesn't fit some official naming convention.

So it is obviously the better call to provide a slight increase of the L2 cache to reduce the bandwidth requirements for texturing, so more is available for the ROPs (which bypass the L2). That reasoning only changes on platforms with severely limited memory bandwidth (IGPs/APUs) and if you can provide large amounts of storage (like Intel with its 128 MB L4). It's not an option so far for AMD or nV until HBM (wide-interface stacked memory on an interposer or directly on the chip) or similar approaches become economical.
For the reasons I've explained to AlNets, there is not much benefit for textures in increasing the L2. Reading textures is more of a streaming operation, and the 'cache' works rather like a streaming buffer. It's like watching a YouTube video: you have a 'cache', but no matter how big you make it, you won't increase the hit rate.
Yes, of course that's a bad example, and I agree there is some difference in the hit:miss ratio, but the hardware is designed in a way that balances the texture fetch queue against cache size and bandwidth. If you just double the size to increase the hit rate by 10%, and the rest of the hardware stays the same and was already hiding the misses for the common workloads it was designed for, you won't notice the change.
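To illustrate the point, here is a toy LRU simulation (the access pattern, tile size, and cache sizes are made up, and real texture caches are banked and set-associative; the flat shape of the curve is what matters):

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    """Hit rate of a fully associative LRU cache holding `capacity` texels."""
    cache, hits = OrderedDict(), 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)
        else:
            cache[addr] = None
            if len(cache) > capacity:
                cache.popitem(last=False)
    return hits / len(accesses)

# Crude model: a 256x256 pixel region rasterized in 8x8 tiles, texture mapped
# roughly 1:1, every pixel reading a 2x2 bilinear footprint.
TEX_W, TILE, SCREEN = 1024, 8, 256
accesses = []
for ty in range(0, SCREEN, TILE):
    for tx in range(0, SCREEN, TILE):
        for y in range(ty, ty + TILE):
            for x in range(tx, tx + TILE):
                for dy in (0, 1):
                    for dx in (0, 1):
                        accesses.append((y + dy) * TEX_W + (x + dx))

for texels in (128, 256, 512, 4096):
    print(f"{texels:5d}-texel cache: {lru_hit_rate(accesses, texels):.1%} hits")
# The short-range reuse (bilinear footprint, neighbouring pixels and tiles) is
# already captured by the small cache; doubling it once or twice barely moves
# the hit rate, because the rest of the frame is effectively a one-way stream.
```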
 
I have some doubts it gives a big gain on the CPU side. There might be some cases, but doubling the cache size (or associativity) will not even give you a 10% speedup (unless you have very specialized software with a working set that fits exactly into the doubled cache size).
It was a general statement. It caches everything: CPU accesses, texture accesses, framebuffer operations. If it "targeted" the framebuffer, it would cache only ROP accesses, not everything. It's just a general cache.
But it's not that the bandwidth was adjusted to the ROP count; it's the ROP count that was adjusted to the bandwidth. There is just no point in having more ROPs than your memory can serve.
There is no point in having more ROPs than your rasterizer can provide pixels for. ;)
In fact, overprovisioning the setup/raster stage compared to the ROPs makes more sense than the other way around. Hardly any game out there even approaches the theoretical pixel fillrates with the current number of ROPs. Usually other bottlenecks hit first.
Sure, I wasn't restricting my reply to just the vertex/index/texture/constant caches; I was talking about the L2 cache in general. If you double it, then it's for framebuffer purposes.
Sorry if I haven't made this explicitly clear or if "L2" doesn't fit some official naming convention.
You have to say which cache you mean. The ROPs only use a single cache level, as I described. Without increasing its size by more than two orders of magnitude, it's not going to make much of a difference, and that's not going to happen considering the really high internal bandwidth these caches have (~2 TB/s for Pitcairn and Tahiti) and the additional functions they serve (such as color compression).
What one could think about is not just increasing the size, but changing the cache hierarchy to make it more feasible. But for that it is quite important to know which cache you are actually talking about. ;)
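For reference on how a figure like ~2 TB/s falls out of the fillrate math (my assumptions: 32 ROPs at ~1 GHz, blending a 64-bit color with 4xMSAA, i.e. a read and a write of 8 bytes per sample):

```python
# Rough internal color cache bandwidth for blending a 64-bit format with
# 4xMSAA, assuming 32 ROPs at ~1 GHz (Pitcairn/Tahiti class).
rops        = 32
clock_hz    = 1e9
bytes_color = 8      # 64-bit color format
samples     = 4      # 4xMSAA
rw          = 2      # blending = read destination + write result

bw = rops * clock_hz * bytes_color * samples * rw
print(f"{bw / 1e12:.2f} TB/s")   # ~2 TB/s of internal ROP cache bandwidth
```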
For the reasons I've explained to AlNets, there is not much benefit for textures in increasing the L2. Reading textures is more of a streaming operation, and the 'cache' works rather like a streaming buffer. It's like watching a YouTube video: you have a 'cache', but no matter how big you make it, you won't increase the hit rate.
Yes, of course that's a bad example, and I agree there is some difference in the hit:miss ratio, but the hardware is designed in a way that balances the texture fetch queue against cache size and bandwidth. If you just double the size to increase the hit rate by 10%, and the rest of the hardware stays the same and was already hiding the misses for the common workloads it was designed for, you won't notice the change.
While I agree there is usually not a huge benefit (I answered AlNets, too ;)), it is a potential bottleneck. If you work with high-resolution textures (and AF, for instance), the needed bandwidth (and size) can become significant, especially if the hit rate in the L1 starts to drop. Enabling more reuse from a larger L2 can shield the memory interface from that. And you also have to consider that quite a lot of clients are connected to the L2; not only the L1s, but also the constant caches and the instruction caches. And I guess I don't have to mention that some of the more modern approaches tax the L2 harder, as more and more data is exchanged through it (compute being just one example; also, if you expand geometry by tessellation, the created data needs to end up somewhere).
 
I have some doubts it gives a big gain on the CPU side. There might be some cases, but doubling the cache size (or associativity) will not even give you a 10% speedup (unless you have very specialized software with a working set that fits exactly into the doubled cache size).
As usual with caches, it matters precisely for cases where the working set wouldn't have fit into 6-8 MB but does fit into 128 MB. I.e. not a ton of cases, but there are some. For instance, watch the measly 47W mobile chip walk all over the big desktop chips in the fluid dynamics benchmark here, purely due to the eDRAM:
http://techreport.com/review/24879/intel-core-i7-4770k-and-4950hq-haswell-processors-reviewed/13
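(Just to illustrate the kind of working set that flips; the grid size and array count below are made up and not what that benchmark actually uses.)

```python
# Illustrative only: a simulation working set that misses an 8 MiB LLC but
# fits comfortably into a 128 MiB eDRAM.
nx = ny = 2048
arrays = 3                # e.g. pressure plus two velocity components
bytes_per_cell = 8        # double precision

working_set = nx * ny * arrays * bytes_per_cell
print(f"{working_set / 2**20:.0f} MiB")   # ~96 MiB: > 8 MiB L3, < 128 MiB L4
```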

Exactly why Intel went for such an 'insane' size, and why I believe it targets the framebuffer mostly.
I'm not sure what you mean by "targets the framebuffer mostly" - as noted, it's a general-purpose cache for whatever you're doing, whether that be massive blending in a framebuffer, writing to UAVs, write/read cycles (post-processing, shadow rendering, etc.) or whatever else. The latter cases are some of the big wins actually... not burning off-chip bandwidth during various post-processing or shadowing passes is pretty huge.

Reading textures is more of a streaming operation, and the 'cache' works rather like a streaming buffer.
It still saves off-chip bandwidth if you're streaming from something on-chip vs. off-chip. Haswell's eDRAM can easily fit several entire render targets (G-buffer, shadow maps, etc.), so passes where you render to them and then texture from them need not go off-chip at all. But sure, making smaller changes to the L2, or making any changes to caches that need to be flushed between passes, isn't going to make a big difference.
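Rough sizes, just to show the headroom (the resolutions and formats below are my assumptions, not any particular title's):

```python
# Rough check that a typical deferred setup fits into 128 MiB of eDRAM.
w, h = 1920, 1080
gbuffer   = w * h * (4 * 4 + 4)     # four RGBA8 targets + 32-bit depth
shadowmap = 2048 * 2048 * 4         # one 32-bit shadow map
hdr       = w * h * 8               # RGBA16F lighting buffer

total = gbuffer + shadowmap + hdr
print(f"{total / 2**20:.0f} MiB of ~128 MiB")   # ~71 MiB -> plenty of room left
```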

Your analysis seems to be working under the assumption that the data always originates in memory, and that the only reuse happens in the caches "near" the sampler. As I've explained, if you have a big enough general-purpose cache, the data often need not go off-chip at all.

AF being "free" in general strikes me as untrue. Just because it's not in those games on that chip doesn't mean it isn't anywhere. Especially on integrated just run something texture-heavy (like an older Valve game or something) at high resolution and compare bi/trilinear to 16x AF and I think you'll see a non-trivial hit.

I'll also point out that MSAA is never going to be "free" (unless you intentionally slow down the non-MSAA path), because it actually implies doing more work (shading, rasterization, depth, etc.) due to triangles that would have missed samples with it off. Increasing ROP power doesn't eliminate this, as it can hit every part of the pipeline after triangle setup.
 