Yes, of course. A single CU should be able to do 64 FMAs per clock, so it should be able to read 3 operands per ALU and write one result. That's 4×32 bits per ALU per clock, or 16 bytes.
I usually calculate my read and write BWs separately, so I just concentrated on the write (register update).
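To make the arithmetic concrete, here is a minimal sketch that keeps the read and write bandwidths separate. The constants come from the figures quoted above (64 FMAs per clock, 32-bit operands, 3 reads and 1 write per FMA); the function name and the constant names are just illustrative.

```python
# Per-clock register bandwidth for one CU, using the figures from the post.
LANES_PER_CU = 64    # FMAs issued per clock
OPERAND_BYTES = 4    # 32-bit operands
READS_PER_FMA = 3    # a, b and c in a*b + c
WRITES_PER_FMA = 1   # one result per FMA


def cu_register_bandwidth(lanes=LANES_PER_CU):
    """Return (read_bytes, write_bytes) moved per clock for `lanes` FMAs."""
    read_bytes = lanes * READS_PER_FMA * OPERAND_BYTES
    write_bytes = lanes * WRITES_PER_FMA * OPERAND_BYTES
    return read_bytes, write_bytes


read_bw, write_bw = cu_register_bandwidth()
print(read_bw, write_bw, read_bw + write_bw)  # 768 256 1024
```

So the write side alone is 256 bytes per clock per CU, and reads plus writes come to 1024 bytes per clock, which is the 16 bytes per ALU times 64 ALUs.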
Yes, renaming can only increase the consumption. However, with renaming the (named) architecture register count can be kept much lower than would otherwise be needed. Extra registers can be used in a flexible way to hide the latency (pipeline, memory, hyperthreading, rename loops, etc) whenever needed.
Is this a comparison between code optimized for an in-order pipeline with a significant amount of loop unrolling, and one that assumes the hardware can do so via register renaming?
In terms of how register consumption appears relative to the same code sequence, register renaming can only increase consumption. At a minimum, the last non-speculative architected state is recorded in addition to any speculative registers. As far as the hardware goes, the fixed thread count and fixed register files leave plenty of slack.
In the x86 case, aside from the 8-16 architected registers of each type per thread, there can be tens to over a hundred registers in the register file that can only be used or wasted by 2 or possibly 4 contexts.
You could combine the GPU static shader register allocation idea and the architecture (named) register count. The shader compiler could decide the minimum number of registers needed, and the rest could be used for renaming (and/or running more waves).
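As a rough sketch of that trade-off: treat the register file as a fixed budget, reserve the compiler-declared minimum for each resident wave, and count whatever is left as a shared rename pool. Everything here is hypothetical, including the function names and the example sizes (256 physical registers, 24 registers per wave); it does not describe any real GPU.

```python
def rename_pool_size(physical_regs, min_regs_per_wave, waves):
    """Registers left for renaming after each wave gets its declared minimum."""
    reserved = min_regs_per_wave * waves
    if reserved > physical_regs:
        raise ValueError("register file too small for this many waves")
    return physical_regs - reserved


def max_waves(physical_regs, min_regs_per_wave, rename_pool):
    """Waves that fit if `rename_pool` registers are held back for renaming."""
    if rename_pool > physical_regs:
        raise ValueError("rename pool exceeds the register file")
    return (physical_regs - rename_pool) // min_regs_per_wave


# Hypothetical numbers: 256 physical registers, 24 declared per wave.
print(rename_pool_size(256, 24, waves=8))   # 64 registers left for renaming
print(max_waves(256, 24, rename_pool=64))   # 8 waves
print(max_waves(256, 24, rename_pool=0))    # 10 waves, nothing left to rename
```

The same budget can be spent either way: holding back 64 registers for renaming costs two waves of occupancy in this made-up configuration.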