Yes, I remember this claim from the GCN docs when the architecture was first revealed. I find it highly misleading, considering wave scheduler throughput is 1 wave per cycle and most of these different instruction types obviously share data paths with the FP32 units; hence INT execution will block FP32 from executing on the same data path, and likewise for SFU and so on.
AMD's description has been pretty open about the 1-instruction-per-wavefront-per-cycle behavior. It is true that vector FP and vector INT do not issue concurrently, since the architecture does not treat them as separate types.
The scalar unit's domain is separate, and the other types tend to have different domains or units handling them. Branch, GDS/export, LDS, vector memory, and special instructions round out the other general types, per the original GCN architecture description.
While the data paths are not sized for an arbitrary combination of 5 instructions, there are allowances made for some concurrent issue. AMD's general architectural thrust has been to give the SIMD the best chance at good utilization, but there are signs that this includes making sure at least some of the other operation types can issue at a reasonable rate.
For Navi, the LLVM changes note that the scalar register file is more heavily banked than the vector file. This makes sense, since multiple instruction types, including vector ones, can source from that file.
Prior GCN designs with the 4-cycle cadence also left the vector register file available for something other than a vector instruction at least 1/4 of the time. An FMA with three read operands in a 4-cycle cadence allows another unit to tap into the register file. While AMD's super-SIMD patent may not apply to Navi's implementation, it notes that on average vector instructions use the equivalent of 2 operands, meaning in many cases the register file could supply two operands to any other instruction type that might need a vector operand (export, VMEM, LDS, etc.).
If a vector instruction sourced a scalar register (unclear whether this happens, though the bus that allows it is mentioned as being its own entity), then that might have been another opportunity for the register file to find concurrent use.
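To put rough numbers on that, here's a back-of-the-envelope sketch; the one-operand-read-per-cycle simplification is my assumption for illustration, not documented behavior:

```python
# Vector register file (VRF) read-slot availability under GCN's
# 4-cycle cadence, assuming a vector instruction consumes one read
# cycle per source operand (a simplification, not documented behavior).

CADENCE = 4  # cycles a wave64 occupies a SIMD16

def free_read_cycles(vector_operands: int) -> int:
    """Cycles per cadence in which the VRF is free for another client
    (export, VMEM, LDS, ...)."""
    return CADENCE - vector_operands

print(free_read_cycles(3))  # FMA worst case: 1 free cycle
print(free_read_cycles(2))  # super-SIMD patent's claimed average: 2 free cycles
```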
A major limiting factor is operand bandwidth from a register file that is physically challenging to implement. For Turing, this is also a major consideration.
Both architectures have 4-banked vector register files. With Navi, this banking was exposed by the breaking of the 4-cycle cadence, and the same rule of one operand per bank per cycle is now a hazard both architectures make software deal with.
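As a toy illustration of that rule (the register-index-mod-4 bank mapping is my assumption; the real mapping isn't public):

```python
# Toy check for the "one operand per bank per cycle" hazard on a
# 4-banked VRF. Bank = register index mod 4 is assumed for
# illustration only.

NUM_BANKS = 4

def has_bank_conflict(src_regs: list[int]) -> bool:
    """True if two source operands land in the same bank and would
    need an extra read cycle."""
    banks = [r % NUM_BANKS for r in src_regs]
    return len(banks) != len(set(banks))

# e.g. v_fma_f32 v0, v1, v5, v9 -> v1/v5/v9 all map to bank 1
print(has_bank_conflict([1, 5, 9]))  # True
print(has_bank_conflict([1, 2, 3]))  # False
```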
A significant feature Nvidia's GPUs have relied upon since Maxwell is the operand reuse cache, which helps get around bank conflicts and provides bandwidth the register file lacks. There's a brief mention of something that might be similar for Navi in the LLVM flags, but it's unclear if it plays a similar role.
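A minimal sketch of the idea, continuing the toy bank model from above (purely illustrative; the real Maxwell+ behavior is more involved):

```python
# Operand reuse cache sketch: an operand flagged for reuse by the
# previous instruction is served from a small latch instead of its
# bank, dodging the conflict. Illustrative only.

NUM_BANKS = 4

def read_cycles(src_regs, reuse_cache):
    """Bank reads after filtering reused operands, plus one extra
    cycle per same-bank collision among the remainder."""
    fetched = [r for r in src_regs if r not in reuse_cache]
    banks = [r % NUM_BANKS for r in fetched]
    return len(banks) + (len(banks) - len(set(banks)))

# v1 and v5 collide on bank 1; reusing v1 from the prior instruction
# removes the collision.
print(read_cycles([1, 5, 2], reuse_cache=set()))  # 4 cycles (one conflict)
print(read_cycles([1, 5, 2], reuse_cache={1}))    # 2 cycles (no conflict)
```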
Scalar units have separate data paths, but with 1 wave per cycle, issuing a scalar op will block a SIMD op (FP, INT, SFU) from execution.
If this is a prior GCN GPU, the restriction is only that the two ops cannot come from the same wavefront. If any of the up to 9 other wavefronts on the SIMD has a scalar operation pending, it will probably be issued.
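A toy model of that per-cycle selection rule (the wave contents and the greedy pick are made up for illustration):

```python
# GCN-style co-issue rule discussed above: several instruction types
# may issue in one cycle, but each must come from a different
# wavefront.

def pick_issues(pending):
    """pending: list of (wave_id, instruction type). Select at most
    one instruction per type and per wave this cycle."""
    issued, used_types, used_waves = [], set(), set()
    for wave_id, itype in pending:
        if itype not in used_types and wave_id not in used_waves:
            issued.append((wave_id, itype))
            used_types.add(itype)
            used_waves.add(wave_id)
    return issued

pending = [(0, "vector"), (1, "scalar"), (2, "vector"), (3, "vmem")]
print(pick_issues(pending))
# [(0, 'vector'), (1, 'scalar'), (3, 'vmem')] -- wave 2's vector op waits
```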
How many candidate wavefronts Navi has to choose from on a given SIMD, and how many per cycle it can select, isn't something I've seen in the presentations.
Fully dedicating a scalar unit and scheduler to a SIMD can provide more opportunities for finding instructions without resource conflicts between the scalar and vector paths, and combining the hardware budget of two SIMD16 units should give it the ability to have more wavefront slots than before. However, the supposed streamlining of the architecture for clock speed and efficiency may put some downward pressure on the totals for a SIMD32 versus 2 SIMD16 blocks.
Also unclear from the Navi presentation is why the branch/message block in the GCN diagram went away for Navi. Most of those instructions still exist, although the loss of some instructions may allow for a shift in what categories or relative issue rates Navi's scheduler has.
Don't get me wrong: GCN obviously can execute up to 4 waves concurrently with just a single scheduler per CU because of the 4-cycle cadence of wave execution on the SIMD units. But unlike Turing, it can't execute different instruction types (different mixes of FP + INT + SFU) at full rate concurrently, because they all share the same data paths, and the scheduling rate isn't high enough to feed separate data paths anyway.
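Some back-of-the-envelope arithmetic on why the issue rate alone rules this out (public GCN parameters, simplified framing on my part):

```python
# GCN's vector issue rate exactly saturates one 16-wide pipe, leaving
# no headroom for a hypothetical separate full-rate INT pipe.

WAVE_SIZE = 64
SIMD_WIDTH = 16
OCCUPANCY = WAVE_SIZE // SIMD_WIDTH  # 4 cycles per vector op on a SIMD16

# The CU scheduler services each of the 4 SIMDs once every 4 cycles,
# so each SIMD sees at most 1 vector issue per 4 cycles:
vector_issues_per_cycle = 1 / 4
pipes_sustained = vector_issues_per_cycle * OCCUPANCY
print(pipes_sustained)  # 1.0 -- exactly one pipe fed, no FP+INT co-issue
```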
It can make forward progress on more than just four wavefronts in a CU, although it seems the types you are focusing on are architecturally in the same bucket for GCN.
The Navi presentation did mention that the SFUs use one issue cycle and then continue in parallel, which has some similarity with how Turing handles FP and INT instruction issue.
I want to share my experience regarding register allocation. The compiler doesn't currently minimize the number of allocated VGPRs (for the pre-Navi hardware I could analyze) and leave it at that; instead it rounds VGPRs up to the next power of two above the minimum allocation. As an example, I tried to optimize a compute shader for occupancy; it was basically a single wavefront that couldn't be swapped out. I statically calculated the number of needed registers myself, and it ended up needing 3 registers more than 32. Instead of leaving it at 35, the compiler skyrocketed to 65 (which oddly is counted as a 64-VGPR wavefront; probably one was some sort of pseudo-VGPR eliminated below the ISA level in the real machine code). The analysis can be conducted with the tools Rys' team is publishing, e.g. some options let you see register allocation and reuse on every instruction in the shader's ISA dump.
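For what it's worth, the jump matches simple round-to-next-power-of-two, modeled here with a hypothetical helper just to show the pattern:

```python
# Hypothetical model of the rounding behavior described above; this is
# an observation about the compiler's output, not a documented rule.

def next_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()

print(next_pow2(35))  # 64 -- matching the jump from 35 needed to 64 allocated
```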
I think the register allocation allowance has changed a lot for Navi. AFAIU from remarks here, power-of-two rounding isn't necessary anymore, so the kind of hand-counting you describe only became worthwhile from Navi on. Whether the older architectures could potentially be driven in a similar way is an interesting question. Technically it's a scheduling optimization problem, which is easily approachable in software (unlimited complexity), but not so much in hardware. It could mean the schedulers became more powerful at solving these types of "searches", or the problem is directed towards the driver and more complex software. Ah, the eternal issue of static vs. dynamic optimization (compile-time vs. run-time).
It's difficult to say why the compiler bumped the allocation that high, though there are multiple programmer posts on this board, and even notes on AMD's GPUOpen site from developers, referencing poor compiler allocation behavior for GCN.
The hardware itself doesn't have that extreme of a granularity.
https://gpuopen.com/amdgcn-assembly/
https://llvm.org/docs/AMDGPUUsage.html
The above give a VGPR granularity of 4, and 16 for SGPR, for GCN3 (though a given shader will see its allocation doubled). It's 4 and 8 for older GCN architectures.
From the second link, Navi's VGPR granularity goes to 8. However, there's not so much a scalar granularity as there's simply a fixed 128 registers at all times.
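To connect this back to the 35-VGPR example: with a granularity of 4, the hardware would only have required an allocation of 36. A quick worked example, using the usual GCN figures of 256 VGPRs and up to 10 waves per SIMD:

```python
# Occupancy cost of the allocation choices above. The 256-VGPR budget
# and 10-wave cap per SIMD are the commonly cited GCN figures.

def waves_per_simd(vgprs: int, granule: int = 4,
                   budget: int = 256, max_waves: int = 10):
    alloc = -(-vgprs // granule) * granule  # round up to the granule
    return alloc, min(max_waves, budget // alloc)

print(waves_per_simd(35))  # (36, 7): what granularity-4 hardware permits
print(waves_per_simd(64))  # (64, 4): what the compiler's 64-VGPR choice costs
```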
It has been the case that GCN's scalar register file has accumulated more and more context as more complex trap handling, complex addressing, and debugging have been added.
The large base allocation may also have some influence as to why the scalar path was duplicated and made per-SIMD.