It doesn’t, though, since nobody counts tensors when talking about theoretical FLOPS for gaming.
"Tensors", no. FP16 most definitely yes. Or should do, because FP16 is relevant in gaming.
Maybe. But the topic was how application performance scaled with Ampere’s doubled FP32. We have lots of evidence that game performance did not scale anywhere close to the FLOPS increase. However, there are other workloads where it came reasonably close.
And what we appear to have discovered is that two distinct hardware threads must be issued across the two FP32-capable datapaths in order to reach those FLOPS. What's interesting about that is it means you need twice as many hardware threads in flight on the SM partition as would be required were there just a single FP32 (combined with integer) SIMD, e.g. as seen in RDNA.
That makes occupancy more brittle, or, if you prefer, it makes performance more sensitive to variation in the count of hardware threads resident in the partition.
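To make the brittleness argument concrete, here's a toy throughput model. The warp size and datapath counts are real enough, but the model itself is a deliberate simplification I've made up for illustration, not a description of how an actual SM partition schedules:

```python
def fp32_throughput(resident_threads, warp_size=32, datapaths=2):
    """Toy model: each FP32-capable datapath needs its own hardware
    thread (warp) to issue from per cycle, so reaching peak FP32 rate
    requires at least `datapaths` resident warps in the partition.
    Returns the fraction of peak FP32 throughput achieved.
    Purely illustrative -- not a real Ampere scheduling model."""
    warps = resident_threads // warp_size
    active_datapaths = min(warps, datapaths)
    return active_datapaths / datapaths

# Dual-datapath design (Ampere-like): one warp only feeds one datapath.
print(fp32_throughput(32))               # one warp  -> half of peak
print(fp32_throughput(64))               # two warps -> full rate
# Single-SIMD design (RDNA-like): one warp already saturates FP32.
print(fp32_throughput(32, datapaths=1))  # one warp  -> full rate
```

The point of the comparison is the last two calls: the single-SIMD design hits its peak with half the resident threads, which is exactly the occupancy sensitivity described above.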
RDNA appears to spend relatively more transistors on scheduling: there are fewer SIMDs per scheduler. AMD's focus in RDNA seems to have been on reduced brittleness: shrinking the ratio between best and worst cases, which also requires more, and more local, cache and bigger register files. So, more of the same in RDNA 3.
The change from VLIW to "scalar" was very much driven by the desire to reduce brittleness. A lot of problems were seen with scalar or vec2 instructions "wasting" VLIW throughput.
It'll be interesting to see if 2022 brings us conditional routing to help with the problems caused by divergent control flow. As ray tracing becomes dominant in AAA graphics, branching seems to be getting harder to avoid in shaders, and brittleness there truly is disastrous.
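For a sense of why divergence is so punishing, here's a toy SIMT cost model (the path names and instruction counts are invented for illustration): a warp has to execute, serially, every control-flow path taken by at least one of its lanes.

```python
def divergence_cost(lane_paths):
    """Toy SIMT model: the warp serially executes every distinct path
    any lane takes, so its cost is the sum of the distinct per-path
    instruction counts. Purely illustrative numbers.

    lane_paths: list of (path_id, instruction_count), one per lane."""
    distinct = {path_id: cost for path_id, cost in lane_paths}
    return sum(distinct.values())

# 32 lanes, coherent: every lane takes the 100-instruction "hit" path.
coherent = [("hit", 100)] * 32
# Divergent: half the lanes hit (100 instrs), half miss (60 instrs);
# the warp pays for both paths back to back.
divergent = [("hit", 100)] * 16 + [("miss", 60)] * 16

print(divergence_cost(coherent))   # 100
print(divergence_cost(divergent))  # 160
```

Note the divergent warp costs 160 cycles of work even though no single lane needed more than 100, and a ray-traced scene routinely produces exactly this kind of incoherent hit/miss mix within a warp.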