@Lurkmass The vid I linked says on-chip register memory is now dynamically allocated and de-allocated, which allows for higher thread occupancy because more SIMDgroups can be scheduled at the same time. They say that prior to Apple family 9, the worst-case number of registers had to be allocated from the register file for the entire execution of the shader. For Apple family 9, they show that registers can be dynamically allocated for each part of the program instead of reserving the worst case for the whole run.
Maybe I'm misunderstanding the difference you are explaining.
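
As a rough illustration of the worst-case vs. per-part allocation the video describes, here is a toy Python sketch. Every number in it (the register file size, the per-phase register counts) is made up for the example and is not an Apple GPU spec, and it assumes, unrealistically, that all SIMDgroups are in the same phase at once.

```python
# Toy occupancy model. All numbers are hypothetical, not Apple GPU specs.
REGISTER_FILE = 1024  # assumed registers available to share among SIMDgroups

# Assumed shader with three phases; registers one SIMDgroup needs in each.
PHASE_REGISTERS = [16, 64, 16]


def static_occupancy(register_file, phases):
    """SIMDgroups that fit if each reserves the worst case for its whole run."""
    return register_file // max(phases)


def dynamic_occupancy(register_file, phases):
    """SIMDgroups that fit per phase if each holds only that phase's registers."""
    return [register_file // regs for regs in phases]


print(static_occupancy(REGISTER_FILE, PHASE_REGISTERS))   # 16
print(dynamic_occupancy(REGISTER_FILE, PHASE_REGISTERS))  # [64, 16, 64]
```

In this toy model the worst-case scheme is capped at 16 SIMDgroups throughout, while per-phase allocation only drops to 16 during the register-heavy middle phase.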