Most likely both require assistance from the shader compiler to flag the lower part of the stack as "preemptable", so rather than a single hard register count, you now have two numbers: a peak working set size, and a total register count for the deepest part of the call tree. Worst case it requires hoisting a couple...