> You need more samples from higher resolved MIP levels compared to trilinear filtering. That's why it gets sharper.

Ah, that makes sense. But you usually still only need data from 2 MIP levels?
> Ah, that makes sense. But you usually still only need data from 2 MIP levels?

Yes, the individual taps for AF are basically trilinear samples (before DX10 one could also select bilinear AF, but that got dumped by MS).
> Ah, that makes sense. But you usually still only need data from 2 MIP levels?

Yes, but from higher res mips; one level higher -> 4x bandwidth.
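The 4x figure is just mip geometry: each mip level halves both dimensions, so stepping one level toward the base texture quadruples the texel count the footprint covers. A back-of-envelope sketch (the 1K texture size and the level numbers are arbitrary illustrative choices, not from the thread):

```python
# Sketch: texel count per mip level of a square texture.
# Each mip level halves width and height, i.e. has 1/4 the texels.

def texels_in_level(base_texels: int, mip_level: int) -> int:
    """Texels remaining at the given mip level (level 0 = base)."""
    return base_texels >> (2 * mip_level)

base = 1024 * 1024  # hypothetical 1K x 1K base texture

# Sampling one level higher resolved covers 4x the texels, hence
# roughly 4x the fetch bandwidth if the cache hit rate stays the same.
print(texels_in_level(base, 4))  # e.g. level trilinear would pick
print(texels_in_level(base, 3))  # one level higher resolved -> 4x
```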
> It increases the hitrate of the L2 of course. ...

It might help on the framebuffer; I think intel's EDRAM is targeting mainly framebuffers.
> might help on the Framebuffer, I think intel's EDRAM is targeting mainly framebuffers.

Intel targets everything. The L4 also caches CPU traffic.
If you sample a texture, you read 4 bit/pixel; if you write (Z-buffer + color), it's 64 bit/pixel. If that fits into the L2, it could IMO be the part of the pipeline that benefits most.
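That read/write asymmetry can be put in numbers. A back-of-envelope sketch; the 4 bits/texel read presumably assumes block-compressed textures, and the 1080p resolution is my assumption, not from the post:

```python
# Rough per-pixel traffic using the post's numbers:
# ~4 bits/pixel texture read (assumed: compressed textures),
# 32-bit color + 32-bit Z = 64 bits/pixel written.
BITS_READ_PER_PIXEL = 4
BITS_WRITTEN_PER_PIXEL = 32 + 32

pixels = 1920 * 1080  # assumed 1080p render target

read_bytes = pixels * BITS_READ_PER_PIXEL // 8
write_bytes = pixels * BITS_WRITTEN_PER_PIXEL // 8

# Writes dominate by 16x, so caching the framebuffer pays off most.
print(read_bytes / 2**20)   # ~1 MB of texture reads per full-screen pass
print(write_bytes / 2**20)  # ~16 MB of color+Z writes per full-screen pass
```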
> Intel targets everything. The L4 also caches CPU traffic.

I have some doubts it gives a big gain on the CPU side. There might be some cases, but doubling the cache size (or associativity) will not even give you a 10% speedup (unless you have very specialized software with workloads fitting exactly the doubled cache size).
> As you say, one would need to fit the render target into the cache. That would need quite a bit of storage (high end GPUs have to plan for at least 4 Mpixels with a HDR format => 32MB without AA).

That's exactly why intel went for such an 'insane' size, and why I believe it targets the framebuffer mostly.
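The 32MB figure in the quote works out directly, assuming the "HDR format" means FP16 RGBA at 8 bytes/pixel (my reading, not stated explicitly):

```python
# Sketch of the render target storage estimate from the quote:
# 4 Mpixels * 8 bytes/pixel (assumed FP16 RGBA) = 32 MB without AA.

def render_target_bytes(pixels: int, bytes_per_pixel: int, msaa: int = 1) -> int:
    """Raw storage for a render target, optionally multiplied by MSAA samples."""
    return pixels * bytes_per_pixel * msaa

size_mb = render_target_bytes(4 * 2**20, 8) / 2**20
print(size_mb)  # 32.0 -- already a quarter of intel's 128MB L4, and 4xMSAA quadruples it
```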
> And usually, the memory bandwidth isn't so much behind what the ROPs can actually do (memory bandwidth is enough for 32bit color formats, and even 64bit without blending, to achieve the theoretical fillrate on Tahiti for instance).

But it's not the bandwidth that was adjusted, it's the ROP count. There is just no point in having more ROPs than your memory can serve. With about 20GB/s (most desktops nowadays, PS3->RSX), you saturate at 8 ROPs @ ~500-600MHz (sure, ROPs vary, but roughly counted it's ROP count * clock * bytes/pixel, e.g. 8 * 500MHz * 4B -> 16GB/s). Doubling the ROP count without doubling the bandwidth would barely change anything -> EDRAM. Or the other way around: while all other integrated GPUs peak at 8 ROPs, I'd bet the haswell GPU has at least 16 ROPs peak.
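The rough formula from the post (ROP count * clock * bytes/pixel) can be sketched like this; the numbers below are the post's own examples plus one hypothetical haswell-style configuration, not measured figures:

```python
# Back-of-envelope ROP write bandwidth: ROPs * clock * bytes/pixel.

def fill_bandwidth_gb_s(rops: int, clock_mhz: int, bytes_per_pixel: int) -> float:
    """Peak bytes/s the ROPs can emit, in GB/s (1 GB = 1e9 bytes here)."""
    return rops * clock_mhz * 1e6 * bytes_per_pixel / 1e9

# 8 ROPs @ 500MHz with a 32-bit color format saturates ~16-20GB/s memory.
print(fill_bandwidth_gb_s(8, 500, 4))   # 16.0

# Doubling the ROPs doubles the demand -> needs EDRAM or far more bandwidth.
print(fill_bandwidth_gb_s(16, 500, 4))  # 32.0
```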
> The thing is, at least AMD chips don't use the L2 cache to cache the framebuffer (at least not for ROP access, just when you read it back through the TMUs); it would simply pollute it too much. The ROPs have specialized caches which are small (Pitcairn/Tahiti have 128kB color cache and 32kB Z cache) but offer way higher bandwidth internally (enough for 4xMSAA and blending with a 64bit color format). Framebuffer de-/compression is done on loading or storing render target tiles to memory (bypassing the L2 in the process). It also serves the purpose of coalescing reads and combining writes to enable high transfer efficiency for the memory (fewer read/write turnarounds, series of bursts for accessing the tiles).

Sure, I wasn't restricting my reply to just the vertex/index/texture/constant caches, I was talking about the L2 cache in general: if you double it, then for framebuffer purposes.
> So it is obviously a better call to provide a slight increase of the L2 cache to reduce the bandwidth requirements for texturing so more is available for the ROPs (bypassing L2). That reasoning changes only on platforms with severely limited memory bandwidth (IGPs/APUs) and if you can provide large amounts of storage (like intel with its 128MB L4). It's not an option so far for AMD or nV until HBM (wide interface stacked memory on interposer or directly on chip) or similar approaches become economical.

For the reasons I've explained to AlNets, there is not much benefit for textures in increasing the L2. Reading textures is more of a streaming workload, and the 'cache' works rather like a streaming buffer. It's like watching a youtube video: you have a 'cache', but no matter how big you set it, you won't increase the hit rate.
> I have some doubts it gives a big gain on the CPU side. There might be some cases, but doubling the cache size (or associativity) will not even give you a 10% speedup (unless you have very specialized software with workloads fitting exactly the doubled cache size).

It was a general statement. It caches everything: CPU accesses, texture accesses, framebuffer operations. If it would "target" the framebuffer, it would cache only ROP accesses, not everything. It's just a general cache.
> But it's not the bandwidth that was adjusted, it's the ROP count. There is just no point in having more ROPs than your memory can serve.

There is also no point in having more ROPs than your rasterizer provides pixels for.
> Sure, I wasn't restricting my reply to just the vertex/index/texture/constant caches, I was talking about the L2 cache in general: if you double it, then for framebuffer purposes.

You have to say which cache you mean. The ROPs only use a single cache level, as I described. Without increasing its size by more than two orders of magnitude, it's not going to make much of a difference, and that's not going to happen considering the really high internal bandwidth those caches have (~2 TB/s for Pitcairn and Tahiti) and the additional functions they serve (such as color compression).
Sorry if I have not made this explicitly clear, or if "L2" doesn't fit some official naming convention.
> for the reasons I've explained to AlNets, there is not much benefit for textures in increasing the L2. Reading textures is more of a streaming workload, and the 'cache' works rather like a streaming buffer. It's like watching a youtube video: you have a 'cache', but no matter how big you set it, you won't increase the hit rate.

While I agree there is usually not a huge benefit (I answered to AlNets, too), it is a potential bottleneck. If you work with high resolution textures (and AF, for instance), the needed bandwidth (and size) can become significant, especially if the hitrate in the L1 starts to drop. Enabling more reuse from a larger L2 can shield the memory interface from that. And you also have to consider that quite a lot of clients are connected to the L2: not only the L1s, but also the constant caches and the instruction caches. And I guess I don't have to mention that some of the more modern approaches tax the L2 harder, as more and more data is exchanged through it (compute being just one example; also, if you expand geometry by tessellation, the created data needs to end up somewhere).
Yes, of course, that's a bad example, and I agree there is a difference in hit:miss rate. But the hardware is designed in a way that balances the texture fetch queue vs. cache size vs. bandwidth per cache size. If you just double the size to increase the hit rate by 10%, while the rest of the hw stays the same and was already hiding the misses for the common workload it was designed for, you won't notice the change.
> I have some doubts it gives a big gain on the CPU side. There might be some cases, but doubling the cache size (or associativity) will not even give you a 10% speedup (unless you have very specialized software with workloads fitting exactly the doubled cache size).

As usual with caches, it matters precisely for the cases where the working set wouldn't have fit into 6-8MB but does fit into 128MB, i.e. not a ton of cases, but there are some. For instance, watch the measly 47W mobile chip walk all over the big desktop chips in the fluid dynamics benchmark here, purely due to EDRAM:
> exactly why intel went for such an 'insane' size and why I believe it targets the framebuffer mostly.

I'm not sure what you mean by "targets framebuffer mostly" - as noted, it's a general purpose cache of whatever you're doing, whether that be massive blending in a framebuffer, writing to UAVs, write/read cycles (post processing, shadow rendering, etc.) or whatever else. The latter cases are some of the big wins actually... not burning off-chip bandwidth during various post-processing or shadowing passes is pretty huge.
> reading textures is more of a streaming workload, and the 'cache' works rather like a streaming buffer.

It still saves off-chip bandwidth if you're streaming from something on-chip vs off-chip. Haswell's EDRAM can easily fit several entire render targets (G-buffer, shadow maps, etc.), so passes where you render those and then texture from them need not go off-chip at all. But sure, making smaller changes to the L2, or making any changes to caches that need to be flushed between passes, isn't going to make a big difference.