A typical VALU instruction takes 4 clocks to execute, so being able to issue one VALU instruction per SIMD every 4 clocks is exactly the right balance. It only takes 4 wavefronts to get peak ALU performance, not 16.
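(As a quick sanity check of that cadence, here is the arithmetic spelled out; the 4-SIMD CU layout, 64-wide wavefronts and 16-wide SIMDs are the usual GCN figures rather than something stated above.)

```cuda
// Back-of-the-envelope check of the issue cadence described above.
// Numbers are the commonly quoted GCN ones, not taken from vendor docs.
#include <cstdio>

int main()
{
    const int simds_per_cu      = 4;   // GCN CU: four SIMD-16 units
    const int wave_width        = 64;  // work items per wavefront
    const int simd_width        = 16;  // lanes per SIMD
    const int valu_issue_cycles = wave_width / simd_width;  // 4 clocks per VALU op

    // One wavefront per SIMD is enough to feed that SIMD a new VALU op
    // every valu_issue_cycles clocks, i.e. to saturate it.
    const int waves_for_peak_alu = simds_per_cu * 1;

    printf("a VALU op occupies its SIMD for %d clocks\n", valu_issue_cycles);
    printf("wavefronts per CU for peak ALU rate: %d (not 16)\n", waves_for_peak_alu);
    return 0;
}
```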
Yes, but my argument was about the latency-hiding capability (for memory accesses) given by the number of wavefronts/warps/workgroups in flight (and the time it takes to issue instructions for them) under a certain (high) register allocation.
Thanks!
Actually, according to some low-level tests it is between 18 and 22 (hot clock) cycles on Fermi, depending on register bank conflicts, so maybe nV opted for a constant 11 cycles to get rid of the variable latency for static scheduling.
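(For anyone curious how such low-level tests arrive at those numbers: the usual trick is to time a long chain of dependent ALU ops with the on-chip cycle counter and divide by the chain length. Below is a minimal CUDA sketch of the idea; the kernel name and chain length are mine, and the counter ticks in whichever clock domain clock64() runs in, so Fermi's hot clock has to be accounted for separately.)

```cuda
// Minimal sketch of a dependent-chain ALU latency test (illustrative only).
#include <cstdio>

__global__ void fma_latency(float *out, long long *cycles, float seed)
{
    const int N = 1024;            // length of the dependent chain
    float x = seed;
    long long start = clock64();
    #pragma unroll
    for (int i = 0; i < N; ++i)
        x = x * 0.999f + 0.001f;   // each op depends on the previous result
    long long stop = clock64();
    *out = x;                      // keep the chain from being optimized away
    *cycles = stop - start;
}

int main()
{
    float *d_out;  long long *d_cycles;
    cudaMalloc(&d_out, sizeof(float));
    cudaMalloc(&d_cycles, sizeof(long long));

    fma_latency<<<1, 1>>>(d_out, d_cycles, 1.0f);   // one thread: nothing hides the latency

    long long c;
    cudaMemcpy(&c, d_cycles, sizeof(c), cudaMemcpyDeviceToHost);
    printf("~%.1f cycles per dependent op\n", (double)c / 1024);

    cudaFree(d_out);  cudaFree(d_cycles);
    return 0;
}
```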
Gipsel - good point on max register usage cutting down on the number of warps per SMX, I had forgotten to account for that.
I don't quite understand your GCN numbers though. Doesn't a CU track a maximum of 40 wavefronts? You can process an ALU op for 4 entire waves every 4 cycles. So by executing 1 ALU instruction over all 40 waves, you can hide 40 cycles of memory access latency (much more than on Fermi/Kepler), again assuming you have enough registers for 40 wavefronts.
I assumed high register usage, so this isn't the case anymore.
For such register-heavy threads, the number of workgroups in flight is simply limited by the size of the register files, and there nV's GPUs are at a disadvantage.
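(To put a rough number on that: the register-file sizes below are the commonly quoted ones for a GCN CU and a Kepler SMX, and the 64 registers per thread is just a stand-in for a "heavy" kernel; none of it is taken from the posts above, and the hardware caps on waves/warps are ignored since the register limit dominates in this scenario.)

```cuda
// Rough occupancy arithmetic under a heavy register allocation.
#include <cstdio>

int main()
{
    const int regs_per_thread = 64;                          // hypothetical heavy kernel

    // GCN CU: 4 SIMDs x 64 KB vector registers, 64-wide wavefronts
    const int gcn_vgprs_per_lane = (64 * 1024) / (64 * 4);   // 256 VGPRs per lane per SIMD
    const int gcn_waves_per_simd = gcn_vgprs_per_lane / regs_per_thread;
    const int gcn_waves_per_cu   = 4 * gcn_waves_per_simd;

    // Kepler SMX: 65536 32-bit registers, 32-wide warps
    const int kepler_warps_per_smx = 65536 / (regs_per_thread * 32);

    printf("GCN:    %2d waves/CU  -> %4d threads over  64 ALUs\n",
           gcn_waves_per_cu, gcn_waves_per_cu * 64);
    printf("Kepler: %2d warps/SMX -> %4d threads over 192 ALUs\n",
           kepler_warps_per_smx, kepler_warps_per_smx * 32);
    // Similar thread counts, but Kepler has to spread them over three times
    // the ALUs, so per ALU there is far less work in flight to hide latency.
    return 0;
}
```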
I think Gipsel was considering a scenario of heavy register allocation and using the same register allocation for both architectures.
He was indicating that whereas Kepler cannot hide ALU latency with relatively few hardware threads, GCN has no problem.
Exactly.
The scalar ALU with its separate register file should actually enable AMD to get away with slightly lower register usage than nVidia in quite a few cases, as for instance constants or addresses which are the same for all elements of a wavefront can be supplied from there and don't have to occupy vector registers.
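(A small illustration of the kind of values meant here; the kernel is written in CUDA syntax purely for concreteness, the name and parameters are made up, and which values actually land in SGPRs is of course up to AMD's compiler.)

```cuda
// Sketch: values that are uniform across a wavefront are candidates for
// GCN's scalar registers instead of being replicated into per-lane VGPRs.
__global__ void saxpy_rows(const float *a, float *b, float alpha,
                           int row_stride, int n)
{
    // Uniform per wavefront: the base pointers a/b, alpha, row_stride, n,
    // and the per-row offset derived from the group/block id.
    // On GCN these can be held once in scalar registers.
    int row_offset = blockIdx.x * row_stride;

    // Per-lane (divergent) value: needs a vector register on any architecture.
    int i = threadIdx.x;

    if (i < n)
        b[row_offset + i] = alpha * a[row_offset + i] + b[row_offset + i];
}
```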
Why does a MAD take NVidia 11 cycles, but AMD 4 cycles?
Are the 4 cycles confirmed by some low-level benchmark?
Considering the simplifications in the ALUs compared to the VLIW architecture (which had 8 cycles of arithmetic latency), it appeared quite possible (and the AFDS presentation mentioned back-to-back issue of vector ops without alternating between wavefronts [vector-to-scalar issue carries a 4 cycle latency penalty, btw.]). But seeing that GCN is actually able to hit frequencies above 1 GHz (my initial guess was that it wouldn't clock higher than VLIW), it could still be 8 cycles, even though the architecture presentation at the AFDS would then have been misleading on this point.
If it is indeed 4 versus 11 cycles, I can only speculate that the reasons are similar to those for Fermi: even as Kepler moves towards much more static scheduling, the actual register access could still be part of the effective latency (not hidden by result forwarding built into the pipeline), while it is not for AMD.
The ability to copy from work item to work item should be very nice, obviating moves through local memory. This is similar to Larrabee's shuffle.
I will wait for benchmarks of this feature. If it is as fast as on Larrabee, it somewhat contradicts NV's recent mantra of keeping the register files as close as possible to the ALUs to keep the power cost low. But even with some additional latency it appears to be a nice idea, as it should still be faster and lower power than an exchange through the local memory. Maybe they are even partly reusing the shuffle network for the local memory (which they duplicated for each scheduler/register file set in Kepler?) and just save the writes to and reads from the local memory SRAM.
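(On the NV side this is exposed as Kepler's warp shuffle. Below is a minimal sketch of the contrast with an exchange through shared memory, i.e. "local memory" in CL terms, using the Kepler-era __shfl intrinsic; later CUDA versions replaced it with __shfl_sync, and the kernels and lane-reversal example are mine, just for illustration.)

```cuda
// Sketch: exchanging values between lanes of a warp with Kepler's __shfl
// versus bouncing them through shared memory.  Requires compute capability 3.0+.
#include <cstdio>

__global__ void reverse_with_shfl(const int *in, int *out)
{
    int lane = threadIdx.x & 31;
    int v = in[threadIdx.x];
    // Direct lane-to-lane copy from the mirrored lane, no memory traffic.
    out[threadIdx.x] = __shfl(v, 31 - lane);
}

__global__ void reverse_with_shared(const int *in, int *out)
{
    __shared__ int buf[32];                // assumes a 32-thread block
    int lane = threadIdx.x & 31;
    buf[lane] = in[threadIdx.x];           // write to the local memory SRAM
    __syncthreads();
    out[threadIdx.x] = buf[31 - lane];     // read it back from the mirrored slot
}

int main()
{
    int h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = i;

    int *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    reverse_with_shfl<<<1, 32>>>(d_in, d_out);   // same result as reverse_with_shared
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("lane 0 now holds %d (expected 31)\n", h_out[0]);

    cudaFree(d_in);  cudaFree(d_out);
    return 0;
}
```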