It is unclear how RDNA transcendental co-issue is currently implemented (it probably needs a dedicated 1R1W port pair to be pipelined), and extra read and write ports are expected in addition to the VALU pipelines: at least 1R1W for VMEM/LDS/Export I reckon, assuming they all share a read-out path and their RF writebacks are all centrally queued/buffered.
Yes, that other stuff has been an unknown to me. But it might be 3R2W on the basis that 3-operand reads are rare in VALU, so these other instructions simply wait, or read slowly, when they need the ports.
VMEM has an NSA (non-sequential addressing) specifier for extra instruction sizing, so an instruction can be up to 5 DWORDs in length, which supports up to 13 distinct components, specified by VGPR content, for the most complex read. The most complex BVH ray-intersect instruction consumes 12 components (with 64-bit addressing), all of which can come from VGPRs.
So clearly the hardware is designed not to satisfy the worst-case non-VALU read bandwidth in a single cycle.
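As a back-of-the-envelope check (a sketch with assumed port budgets, not confirmed hardware figures), streaming the 12 VGPR components of the BVH ray-intersect read through a modest per-cycle read budget necessarily spans several cycles:

```python
import math

def cycles_to_read(num_operands: int, read_ports: int) -> int:
    """Cycles needed to stream num_operands VGPR reads through read_ports ports."""
    return math.ceil(num_operands / read_ports)

# 12 VGPR components for the ray-intersect case, under assumed port budgets:
for ports in (2, 3, 4):
    print(ports, "ports ->", cycles_to_read(12, ports), "cycles")
```

So even a generous 4-port budget needs 3 cycles for that single instruction's reads, consistent with the hardware not targeting single-cycle worst-case non-VALU reads.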
RDNA 2 has clauses, which present opportunities for time slots dedicated to non-VALU register file reads/writes, a technique familiar since R300. A clause boundary is a very convenient point in time: it is where execution of VALU instructions is switched amongst hardware threads and where memory operations end up being scheduled. So RDNA 2 may halt VALU issue in order to give VRF read bandwidth to other parts of the GPU?
I don't see how 3 input operands can be avoided, since that's fundamental to FMA.
I'm suggesting that all 3-operand instructions become a 2-instruction macro, making use of a pipeline intermediate (vi0 and vi1 in my example) where necessary for "overflow" bits.
Sometimes the 2-instruction macro would simply first plop an operand into the pipeline intermediate register, followed by the old-style instruction. This would simplify cases such as the cubemap instructions, 324 to 327, as seen in:
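The proposal can be sketched as a decode-time expansion (a sketch only: the opcode names `v_set_vi0` and the `_vi0` variants, and the vi0 pipeline-intermediate register, are hypothetical, following the vi0/vi1 idea above):

```python
# Sketch: decode-time expansion of a 3-source VOP3* into a 2-instruction
# macro, so each issued instruction reads at most 2 VGPR operands.
# Opcode names and the vi0 intermediate register are illustrative.

def expand_vop3(opcode, srcs):
    """Expand a 3-source instruction into two ops, each with <= 2 VGPR reads."""
    if len(srcs) < 3:
        return [(opcode, srcs)]          # 1- and 2-operand ops pass through
    a, b, c = srcs
    return [
        ("v_set_vi0", (c,)),             # first, plop operand into intermediate
        (opcode + "_vi0", (a, b)),       # old-style op reads vi0 implicitly
    ]

macro = expand_vop3("v_fma_f32", ("v1", "v2", "v3"))
assert all(len(srcs) <= 2 for _, srcs in macro)
```

The point is that the register file never sees more than 2 reads per issued instruction; the cost is an extra issue slot whenever the third operand can't be folded into the pipeline intermediate.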
"RDNA 2" Instruction Set Architecture: Reference Guide (amd.com)
A possible approach would be 4R2W or 5R2W flexibly shared across the two VALU pipelines, similar to Zen's FMAC arrangement.
I know nothing about Zen's FMAC...
A flexible sharing would work, agreed. 4R2W would be no different from what I'm proposing in terms of register file bandwidth, but it would reduce the set of possible co-issues (two VOP3* instructions can't co-issue, nor can a VOP3* and a VOP2/VOPC). 5R2W would be better, but still would reduce the set of possible co-issues.
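The reduced co-issue sets can be enumerated directly (a sketch; it assumes the usual worst-case VGPR read counts of 2 for VOP2/VOPC and 3 for VOP3*):

```python
# Which encoding pairs fit within a shared read-port budget?
# Assumed worst-case VGPR reads: VOP2/VOPC read 2, VOP3* reads up to 3.
READS = {"VOP2": 2, "VOPC": 2, "VOP3*": 3}

def can_coissue(a: str, b: str, read_ports: int) -> bool:
    return READS[a] + READS[b] <= read_ports

for ports in (4, 5, 6):
    ok = [(a, b) for a in READS for b in READS if a <= b and can_coissue(a, b, ports)]
    print(f"{ports}R budget:", ok)
```

Under 4R only the 2-read encodings pair up; under 5R a VOP3* can pair with a VOP2/VOPC but not with another VOP3*; only a 6R budget allows every combination, which matches the trade-off described above.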
So then we get into a discussion of whether flexible sharing or macro-based VOP3* produces the best overall utilisation. We'll never be able to do that analysis at the scale that AMD can.
Dual-issue might be related to the "wave64" mode, which has two sub-modes, one of which sequences odd and even work items alternately (remembering this is just a software hack). These could instead be dual-issued. But that seems a bit odd, making it an exceptional mode of operation within an exceptional mode of operation.
Or, perhaps the SIMDs are 16-wide but paired for dual-issue, coupled with a partitioning of the vector register file into odd and even hardware threads. This way the RF can be doubled in size and the wiring doesn't go crazy. This would also result in a doubling of the SIMDs per SALU and SIMDs per coarse-grained scheduling.
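That pairing idea can be sketched as a static mapping from hardware-thread (wave) slots to RF partitions (purely illustrative; the odd/even partition rule is my assumption, not a known RDNA mechanism):

```python
# Sketch: two 16-wide SIMDs paired for dual-issue, with the vector RF split
# into odd/even banks by hardware-thread (wave slot) parity, so each
# half-SIMD only ever reads its own bank and no cross-bank wiring is needed.
# The parity rule is an assumption for illustration.

def rf_partition(wave_slot: int) -> str:
    return "even_bank" if wave_slot % 2 == 0 else "odd_bank"

def can_dual_issue(slot_a: int, slot_b: int) -> bool:
    """Two waves can dual-issue only if they draw from different banks."""
    return rf_partition(slot_a) != rf_partition(slot_b)

assert can_dual_issue(0, 1)       # even + odd wave slots: OK
assert not can_dual_issue(0, 2)   # both even: same bank, no dual-issue
```

The attraction is that doubling the RF then costs no extra porting per bank; the scheduler just has to keep an odd and an even wave ready to pair.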
This stuff is fun, but co-issue or dual-issue to VALUs seems risky in terms of average utilisation.