Yes, but if the less utilized wave takes fewer cycles, the same number of registers isn't allocated for as long, freeing them up to be used by another wave, right? Or is allocation handled differently?
Yes, the wave will finish as soon or sooner, and its registers will be freed sooner, because execute-masked lanes skip all memory reads & writes and thus reduce the likelihood of a stall on each memory instruction. As an interesting tidbit, AVX2 gather also supports execution masking (a branchless way to skip loads for inactive lanes, saving bandwidth).
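For the curious, here's a minimal C++ sketch of that AVX2 masked gather (the function and variable names are mine, just for illustration): lanes whose mask element has the sign bit clear never issue a load, and the matching element of the source vector is passed through instead.

```cpp
#include <immintrin.h>

// Minimal sketch of AVX2 masked gather. Only lanes whose mask element has
// its sign bit set actually load from memory; masked-off lanes pass through
// the matching element of `src` (zeros here) and generate no memory traffic.
// `table`, `indices` and `lane_mask` are hypothetical names for illustration.
__m256i gather_active_lanes(const int* table, __m256i indices, __m256i lane_mask)
{
    const __m256i src = _mm256_setzero_si256(); // value for skipped lanes
    // Scale is 4 because the table elements are 4-byte ints.
    return _mm256_mask_i32gather_epi32(src, table, indices, lane_mask, 4);
}
```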
In the worst case, a 64-wide wave with a single pixel shader invocation (63 lanes unused) can stall as many times as a 64-wide wave packed full of PS invocations; in that case the full wave has done 64x as much work in the same time. In the best case, the single-invocation wave finishes much sooner with no memory stalls (if some other wave touched the same cache lines just before), but it still requires 64x more registers per unit of performed work (albeit for a shorter time) and does 64x less work, while still occupying the execution units and burning power.
Partially filled waves are only a good idea in a few corner cases (as described above), and sometimes with VS/GS/HS/DS, where you can be primitive-setup bound while waiting for the waves to fill. With small triangles it is best to launch the vertex waves as soon as possible, because GPU utilization is bursty (the GPU is not fully utilized all the time): you want to start the PS waves as early as possible to give the GPU enough meaningful work to fill all the CUs.
If GPU waves are frequently underutilized, a narrower wave width would be the better choice, improving performance and reducing power usage. Currently it seems that the optimum is somewhere around 16-64 threads per wave. Different GPU architectures have made different technical choices that shift the optimum: for example, AMD's scalar unit and resource-descriptor-based sampling model make wider waves better for their hardware. NVIDIA is stuck at 32-wide waves, as CUDA has exposed the wave width from the beginning, and the majority of optimized CUDA code would break if the wave width changed (see the sketch below). Luckily, 32 seems to be a very good wave width.
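To make that CUDA point concrete, here's a minimal sketch (my own, not from the post) of the kind of code that bakes the 32-wide warp into its logic; both the full-warp mask constant and the shuffle distances assume exactly 32 lanes.

```cuda
// Classic warp-level sum reduction, as found in countless optimized CUDA
// kernels. The 0xffffffff participation mask and the starting offset of 16
// both hard-code a 32-wide warp; change the wave width and this silently
// computes the wrong answer.
__device__ float warp_reduce_sum(float v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset); // lane i += lane i+offset
    return v; // lane 0 now holds the sum of all 32 lanes
}
```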
You mean because of multiple memory channels, right?
It takes an awfully long time for a load instruction to reach the physical DDR/GDDR memory. First the L1 cache is checked (on CPUs, the store-forwarding buffer is checked before that); if the line is not found, a message is sent to the L2 (coherency protocol). If the L2 also fails to find the line (and it is not in another core's L1), a memory read request is sent to the memory controller, which also buffers requests. Each core (or CU on GPUs) can issue some maximum number of memory requests per clock. These numbers are well documented for CPUs, but the GPU numbers are not. A controlled microbenchmark would of course give us exact memory request concurrency numbers for each GPU brand; the sketch below shows the basic idea.
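Here's what such a microbenchmark can look like (my own illustration, CPU-side; the GPU variant runs the same dependent pointer chase inside a kernel while sweeping the number of resident waves). Each chain is a serially dependent chase whose misses cannot overlap with each other, so time per step improves with the chain count until the hardware's outstanding-request limit is hit.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Sketch of a memory-level-parallelism microbenchmark: run N independent
// pointer chases from one thread. Each chase is a chain of dependent loads,
// so its misses serialize; independent chains overlap, and the ns/step/chain
// figure stops improving once the outstanding-miss limit is reached.
int main()
{
    const size_t kElems = 1u << 24; // 64 MiB of indices, well past the LLC
    const size_t kSteps = 1u << 20;
    std::vector<uint32_t> next(kElems);
    std::iota(next.begin(), next.end(), 0u);
    // Random permutation defeats the prefetcher. (A production benchmark
    // would build a single cycle so every chain is guaranteed to be long.)
    std::shuffle(next.begin(), next.end(), std::mt19937(42));

    for (int chains = 1; chains <= 16; chains *= 2) {
        std::vector<uint32_t> p(chains);
        for (int c = 0; c < chains; ++c)
            p[c] = static_cast<uint32_t>(c) * 997u; // scattered start points

        auto t0 = std::chrono::steady_clock::now();
        for (size_t s = 0; s < kSteps; ++s)
            for (int c = 0; c < chains; ++c)
                p[c] = next[p[c]]; // dependent load: next step needs this value
        auto t1 = std::chrono::steady_clock::now();

        volatile uint32_t sink = p[0]; // keep the loops from being optimized out
        (void)sink;
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        std::printf("%2d chains: %6.1f ns per step per chain\n",
                    chains, ns / kSteps / chains);
    }
    return 0;
}
```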