The arch is obviously not designed for gaming
Nonsense. It's nearly identical on the compute side, and what has changed is more related to being able to run RTX and DLSS concurrently.
the compute is way out of bounds for other bottlenecks
I don't think so. It's just people being knee-jerk fixated on one-dimensional GFLOPS as if that should be the absolute performance metric. It never has been.
There was a very clear opportunity to maximize the scheduling rate on the SM while increasing FP32, and they did it the only way possible: by adding a second FP SIMD (that didn't even get its own data path**) alongside the many other units that were already there. It was never supposed to come with a doubling of performance; there was simply no other way to increase FP32 throughput than doubling the unit. The same has happened with TMUs and ROPs in the past: every few generations they seem overkill, but they're just a small percentage of the actual transistor budget, and it's the same here. It's enough if the performance increase is greater than the area increase, and so far it has come with a more than 30% performance uplift against a card (2080 Ti) with the same number of SMs (68) for a very minor increase in area.
EDIT:
there's still a vast amount of hypothetical compute performance laying around doing nothing for most of a frame.
Yeah, and a lot of texturing performance laying around doing nothing, and a lot of ROP performance laying around doing nothing, and in Turing, also a lot of INT32 compute performance laying around doing nothing, and a long list of many other hypothetical performances laying around doing nothing. What exactly makes FP32 so special that it requires special consideration?
So far Ampere is, transistor for transistor, less efficient than Turing for gaming
How so? Even accounting for the much improved RT and Tensor cores plus the scheduling changes that let them run concurrently, GA104 is 17.4 billion transistors vs TU102's 18.4 billion and will most definitely beat it. As for the 3080, it has 20% of its chip disabled, so it would be equivalent to a 28 * 0.8 = 22.4 billion transistor chip, and that's just 20% more transistors for a 30%+ performance uplift.
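For anyone who wants to check the arithmetic, here's a quick back-of-the-envelope sketch in Python. It only uses the figures as quoted in this post (28 billion for the full chip, 20% disabled, 18.4 billion for TU102, 30% uplift); the variable and function names are made up for illustration, and the 0.8 factor and 1.30 uplift are rough estimates, not measurements.

```python
# Back-of-the-envelope check of the transistor-for-transistor argument.
# All figures are the ones quoted in the post (billions of transistors);
# the enabled fraction and the uplift are rough estimates, not measurements.
GA102_FULL = 28.0          # full chip behind the 3080, as quoted
ENABLED_FRACTION = 0.8     # "20% of its chip disabled"
TU102 = 18.4               # 2080 Ti chip, as quoted
PERF_UPLIFT = 1.30         # "30%+ performance uplift" over the 2080 Ti


def effective_transistors(full_count, enabled_fraction):
    """Transistor count of a hypothetical chip with only the enabled part."""
    return full_count * enabled_fraction


def perf_per_transistor_gain(perf_ratio, transistor_ratio):
    """Above 1.0 means more performance per transistor than the baseline."""
    return perf_ratio / transistor_ratio


effective = effective_transistors(GA102_FULL, ENABLED_FRACTION)
transistor_ratio = effective / TU102
gain = perf_per_transistor_gain(PERF_UPLIFT, transistor_ratio)

print(f"Effective 3080 transistors: {effective:.1f} billion")
print(f"Transistors vs TU102:       {transistor_ratio:.2f}x")
print(f"Perf per transistor:        {gain:.2f}x")
```

As long as the last number stays above 1.0, the "less efficient transistor for transistor" claim doesn't hold with these inputs. Plug in your own uplift figure if you think 30% is too generous.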
and will probably remain that way.
It is not that way, and its advantage will do nothing but grow as games that better support its strengths start popping up.
** Now that would have been an indication supporting your claim, if it had had its own datapath.