What do you mean HW does the scheduling? The HW is not going to replace a dp3/rsq/mul with a free normal. The HW is not going to replace an ADD/MUL with a dual-issue automatically. This is done by the compiler packing both into a VLIW instruction. And the HW certainly isn't going to do register allocation and expression inlining.
I took a look at some of the RTHDRIBL shaders via 3DAnalyze. Many look inefficient, almost as if they were written by hand, since they don't even do constant folding.
Let me give you an example. This fragment is from the 3DAnalyze output for RTHDRIBL:
Code:
def c3 , 256.000000, 0.111031, 0.000000, -128.000000
mad_pp r4.w , r0.wwww , c3.xxxx , c3.wwww
mul r11.w , r4.wwww , c3.yyyy
Here we have
r11.w = r4.w * c3.y
Substituting for r4.w gives
r11.w = (r0.w * c3.x + c3.w) * c3.y
      = r0.w * (c3.x * c3.y) + (c3.w * c3.y)
and with constant folding
r11.w = r0.w * c_fold.x + c_fold.y
which lets us rewrite the pair as
MAD r11.w, r0.w, c0.x, c0.y
where c0 now holds the two folded constants.
That just saved 1 register and 1 instruction. I saw one shader that used 11 registers when only 2 were really needed, and most of those 11 held scalars (r1.x, r2.w, r3.y)! They hadn't hit any port limits, so there was no justifiable reason not to reuse dead registers or pack the scalars.
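If you want to check the fold numerically, here is a rough Python sketch of it. The function names (fold_mad_mul, two_instructions, one_instruction) are made up for illustration and are not from any compiler or tool. It also assumes plain double-precision math, so it ignores the _pp partial-precision hint and any rounding differences real hardware would show.

```python
import math

def fold_mad_mul(scale, bias, factor):
    """Fold (x * scale + bias) * factor into x * new_scale + new_bias."""
    return scale * factor, bias * factor

# Constants taken from the "def c3" line in the fragment (x, y, w components).
C3_X, C3_Y, C3_W = 256.0, 0.111031, -128.0

def two_instructions(r0_w):
    r4_w = r0_w * C3_X + C3_W   # mad_pp r4.w, r0.w, c3.x, c3.w
    return r4_w * C3_Y          # mul r11.w, r4.w, c3.y

# The folded constants a compiler could compute once, ahead of time.
FOLD_X, FOLD_Y = fold_mad_mul(C3_X, C3_W, C3_Y)

def one_instruction(r0_w):
    return r0_w * FOLD_X + FOLD_Y   # mad r11.w, r0.w, c_fold.x, c_fold.y

# Both forms should agree up to floating-point rounding.
for v in (0.0, 0.25, 0.5, 1.0):
    assert math.isclose(two_instructions(v), one_instruction(v),
                        rel_tol=1e-9, abs_tol=1e-9)
```

The folded pair works out to roughly 28.423936 and -14.211968, so the whole fragment collapses to a single MAD against one new constant.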