I can believe that in some cases, but I can absolutely max out the FP16 ALUs on M1/M2 without being limited by occupancy, memory, or power. Since M3 unifies the register file and threadgroup memory, occupancy should be much less of a problem there, and M-series GPUs have way more memory...