I'm referring to the game performance, which was disappointingly far from double Turing, whether expectations were defined in terms of FLOPS, TEX, bandwidth or power, or some combination. 3090Ti versus 2080Ti is pretty damning...
Pure compute applications, typically not games, saw a doubling or more in performance. I'm not aware of any analysis that identified the reasons, and crucially, whether co-issue or dual-issue is the source of the performance gain.
FMA isn't the right way to think about primary ALU performance. Apart from anything else, FMA isn't the only floating point instruction. That's why I talked about instruction throughput.
Ampere may well do dual-issue, and the disappointing game performance uplift it saw may also apply to RDNA 3, which looks highly likely to be a dual-issue design.
For what it's worth, dual-issue of FMA in RDNA 3 looks like it will be impossible in a subset of operand availability situations, since the register file can only provide four of the six required operands. In theory one or two operands can come from the destination operand cache and one operand can be supplied as a literal. So there are some situations where dual-issue will work, but plenty where it won't.
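To make the operand counting concrete, here's a toy Python sketch of that constraint. It only encodes the numbers I gave above (four register file reads, up to two operands from the destination operand cache, one literal) and is not a model of the real hardware rules, which presumably also involve things like bank conflicts. The function name, the operand labels ("reg", "cached", "literal") and the parameter names are all made up for illustration.

```python
from collections import Counter

def can_dual_issue(operands, rf_ports=4, cache_slots=2, literal_slots=1):
    """Toy check: can two FMAs (six source operands) be fed in one cycle?

    operands: six labels, each one of
      "reg"     - only available from the register file
      "cached"  - also available from the destination operand cache
      "literal" - encoded in the instruction itself
    The default limits are assumptions taken from the post, not from any spec.
    """
    if len(operands) != 6:
        raise ValueError("two FMAs need exactly six source operands")
    counts = Counter(operands)
    if set(counts) - {"reg", "cached", "literal"}:
        raise ValueError("unknown operand kind")

    # Only a limited number of operands can be supplied as literals.
    if counts["literal"] > literal_slots:
        return False

    # Cached operands beyond what the cache can supply fall back to
    # ordinary register file reads and compete for the same ports.
    spill = max(0, counts["cached"] - cache_slots)
    return counts["reg"] + spill <= rf_ports


if __name__ == "__main__":
    print(can_dual_issue(["reg"] * 6))                      # all six from the register file
    print(can_dual_issue(["reg"] * 4 + ["cached"] * 2))     # two served by the cache
    print(can_dual_issue(["reg"] * 4 + ["cached", "literal"]))
```

The dense, high-ILP case is the first one: six independent register operands, which this model rejects. The other two are the situations where the cache and/or a literal make up the shortfall.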
So RDNA 3 is likely to look worse than Ampere on dense FMA code with tons of instruction-level parallelism. Whether that's detectable is another question.