Not directly related to RDNA 3, but a newly published patent describes in detail operand fetching and caching on a GPU device, and it matches many clues in the RDNA whitepaper.
Given the asynchronous operand fetching especially, I would not be surprised if the ALU doubling is really happening inside a SIMD32 (1 -> 2 VALU instructions/clk) while still keeping the vector register file unwidened at four 1R1W banks. The primary architectural bet might be on running VALU instructions from 2 wavefronts at a time to maximize VRF port and operand network usage, with the expectation that most kernels do not need sustained availability of 3 VRF read ports (either because FMA is not the dominant opcode, and/or result forwarding does much of the heavy lifting).
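To make the port-counting argument concrete, here is a toy sketch (my own illustration, not from the patent): model a VRF split into four banks, each offering one read port per cycle, and check whether two co-issued VALU instructions can fetch all their operands in a single cycle. The bank-striping function and register numbers are assumptions for illustration only.

```python
NUM_BANKS = 4  # four 1R1W banks, one read port each per cycle

def bank(reg):
    """Assumed striping: registers map to banks by low index bits."""
    return reg % NUM_BANKS

def conflict_free(ops_a, ops_b):
    """True if two instructions' combined operand reads need
    no bank more than once, i.e. all reads fit in one cycle."""
    banks = [bank(r) for r in ops_a + ops_b]
    return len(banks) == len(set(banks))

# Two FMAs read 3 operands each: 6 reads against 4 ports,
# so they can never co-issue in a single cycle.
print(conflict_free([0, 1, 2], [3, 4, 5]))  # False

# Two 2-operand ops (e.g. ADD + MUL) fit when their reads
# land on distinct banks: 4 reads, 4 ports.
print(conflict_free([0, 1], [2, 3]))        # True
```

This matches the hypothesis above: sustained dual-issue only works when the instruction mix, bank allocation, and operand forwarding keep the per-cycle read demand at or below the four available ports.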
This might explain why VOPD ended up with a rather limited scope, i.e., it is probably more of a secondary bet that aims to improve execution latency (ILP) in targeted scenarios, rather than general throughput.