So the article with the updated latency measurements doesn't seem to change the picture much in terms of global memory latency on Navi 21 and GA102: they're both about the same.
I think these articles are biased towards identifying best-case latency figures, which is kind of the opposite of the "worst case" I was originally thinking about. So I'm not convinced they're that much use.
That assumes AMD's hit-rate stats are relevant for Nvidia's memory pipeline. There are other factors that influence hit rates besides size: pinning, compression, pre-fetching, eviction policies, etc. Nvidia also has caches local to the RT cores that help reduce the load on L2.
It makes more sense to compare AD102’s 96MB to the 6MB of GA102.
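For scale, here's a back-of-envelope using the old "miss rate scales with 1/sqrt(capacity)" rule of thumb. The 60% baseline miss rate at 6MB is invented purely for illustration, not AMD or Nvidia data, but it shows why a 6MB to 96MB jump can matter even if AMD's published hit-rate curves don't transfer to Nvidia's memory pipeline:

```python
def est_miss_rate(cache_mb, baseline_mb=6.0, baseline_miss=0.60, exponent=0.5):
    """Scale an assumed baseline miss rate by (size ratio)^-exponent (rule of thumb)."""
    return baseline_miss * (baseline_mb / cache_mb) ** exponent

for mb in (6, 32, 96, 128):
    print(f"{mb:>4} MB -> estimated miss rate {est_miss_rate(mb):.2f}")
```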
There's no doubt that more cache will be better. I'm merely questioning whether it's enough.
The fact that estimates put the die size at around 611mm² implies to me that Nvidia's performance models show no need for more cache. Honestly, I'm still surprised. Cache is effectively free die space, since it offers fine-grained redundancy and so shouldn't affect yields.
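On the "fine-grained redundancy" point, a toy Poisson yield model makes it concrete. Every number below is invented for illustration; the point is just that only the unrepairable fraction of the SRAM counts as killer area:

```python
import math

defect_density = 0.1     # defects per cm^2 -- invented, just for illustration
logic_area     = 4.0     # cm^2 of non-repairable logic -- invented
sram_area      = 2.0     # cm^2 of extra cache -- invented
repair_rate    = 0.98    # fraction of SRAM defects fixed by spare rows/columns -- assumed

def poisson_yield(area_cm2, d0):
    """Probability of zero killer defects landing in the area."""
    return math.exp(-d0 * area_cm2)

# Only the unrepairable slice of the SRAM contributes killer area.
yield_logic_only = poisson_yield(logic_area, defect_density)
yield_with_cache = poisson_yield(logic_area + sram_area * (1 - repair_rate), defect_density)

print(f"logic only:        {yield_logic_only:.3f}")
print(f"logic + big cache: {yield_with_cache:.3f}")  # barely changes
```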
Nvidia's semi-tiled rasterisation approach should also do wonders for the L2$ hit rate.
That's double-counting though, as it's something that Ampere already does.
The second article I linked regarding latency starts off with an analysis of Unigine Superposition "8K Optimized" and concludes that latency is Ampere's problem:
"Loosely translated, Nvidia is saying that if the shaders are using less than 80% of their issue bandwidth while they have work to do, and the top stall reason is because warps were waiting on cache/memory, then you’re limited by cache/memory latency."
Is that purely texturing related?
A single SIMD for FP32/Int32 certainly solves the primary utilisation problems that Ampere's dual-SIMD layout introduces. It would also save a fair amount of die space.
The extra register files and scheduling/scoreboarding obviously eat into the space saving, but it should still come out far ahead in terms of utilisation.
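To put a rough number on one of those utilisation problems, here's a toy instruction-mix model of a GA10x sub-core (16 FP32-only lanes plus 16 FP32/INT32 lanes) versus a hypothetical unified-SIMD sub-core with the same total lane count. This is my own simplification and only captures the mix side of the problem; scheduling and register-file pressure are the other half:

```python
def ampere_subcore(int_frac):
    """16 FP32-only lanes + 16 FP32/INT32 lanes: INT32 can only use the shared pipe,
    so integer-heavy mixes leave the FP32-only pipe partly idle."""
    total = 32.0 if int_frac == 0 else min(32.0, 16.0 / int_frac)
    return ((1 - int_frac) * total, int_frac * total)   # (fp32 ops/clk, int32 ops/clk)

def unified_subcore(int_frac, lanes=32):
    """Hypothetical: every lane handles FP32 or INT32, so no capacity gets stranded."""
    return ((1 - int_frac) * lanes, int_frac * lanes)

for f in (0.0, 0.3, 0.5, 0.8):
    print(f, ampere_subcore(f), unified_subcore(f))
```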
I like this speculation, which is supposedly founded upon stuff that can't be revealed.
The picture doesn’t really make sense. The ratio of control to compute seems unbalanced, and each sub-core having just a single 16-wide SIMD doesn’t increase peak SM throughput.
Within the SM there are twice as many of these partitions.
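Which is why the peak works out the same while the extra schedulers could help utilisation. A back-of-envelope issue model, using the lane counts from the speculation (not confirmed figures), where p is the chance that a sub-core's scheduler has an eligible instruction in a given clock:

```python
def ampere_sm_fp32(p, subcores=4, lanes_per_subcore=32):
    # One issue per clock covers a 32-thread instruction, so 32 lanes need an
    # eligible instruction every single clock to stay saturated.
    return subcores * min(lanes_per_subcore, 32 * p)

def speculated_sm_fp32(p, subcores=8, lanes_per_subcore=16):
    # Same 128 lanes per SM, but each 16-wide sub-core saturates at one issue
    # every other clock, so it tolerates p down to 0.5.
    return subcores * min(lanes_per_subcore, 32 * p)

for p in (1.0, 0.75, 0.5):
    print(p, ampere_sm_fp32(p), speculated_sm_fp32(p))
# 1.0 -> 128 vs 128 (no peak gain), 0.75 -> 96 vs 128, 0.5 -> 64 vs 128
```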
It's worth remembering that the register file in Ampere doesn't have the bandwidth to support, for example, FMA on both FP32 pipes simultaneously. So even if the instruction mix in the kernel would allow two independent FMAs to be issued "simultaneously", the register file will say no. The partition then depends on result forwarding, held in the operand collector or somewhere else, to avoid the register file bandwidth crunch.
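A rough operand-count check shows the crunch. The read-port figure here is an assumption for illustration, not a documented Ampere number; only the three-source-operands-per-FMA part is fixed:

```python
def rf_limited(fmas_per_window, reads_per_fma=3, rf_reads_per_window=4, forwarded=0):
    """True if the register file can't feed the FMAs in an issue window without
    forwarding/operand reuse. rf_reads_per_window=4 is an assumed figure."""
    return fmas_per_window * reads_per_fma - forwarded > rf_reads_per_window

print(rf_limited(fmas_per_window=1))               # False: a single FMA fits
print(rf_limited(fmas_per_window=2))               # True:  6 reads > 4 available
print(rf_limited(fmas_per_window=2, forwarded=2))  # False: reuse/forwarding covers it
```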