I have the relevant documentation in front of me, and a 4x4 matrix multiply still takes up 4 instruction slots in VS3. In other words, the macro gets decomposed into a "mul" followed by 3 "mad"s (I'm assuming). I don't even think it's desirable to put 16 FP multipliers on one shader unit; it would be better IMHO to split those 4 vector units into 4 separate shader units.
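
To make the slot count concrete, here is a rough Python sketch of the decomposition I'm assuming: one 4-wide "mul" plus three 4-wide "mad"s, working on the matrix columns. The helper names (mul, mad, m4x4_as_mul_mad, m4x4_as_dp4) are made up for illustration and are not real shader API. For comparison it also does the same multiply as four 4-component dot products, which is the other obvious way to spend 4 slots. Either way it comes to 4 instructions of 4 multiplies each, so 16 multiplies spread over 4 slots.

```python
def mul(a, b):
    """4-wide component-wise multiply (one instruction slot)."""
    return [x * y for x, y in zip(a, b)]


def mad(a, b, c):
    """4-wide multiply-add, a * b + c per component (one instruction slot)."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def m4x4_as_mul_mad(cols, v):
    """Matrix * vector as 1 mul + 3 mads (assumed decomposition).

    cols holds the four column vectors of the matrix; each step scales
    one column by a broadcast component of v and accumulates.
    """
    r = mul(cols[0], [v[0]] * 4)       # slot 1
    r = mad(cols[1], [v[1]] * 4, r)    # slot 2
    r = mad(cols[2], [v[2]] * 4, r)    # slot 3
    r = mad(cols[3], [v[3]] * 4, r)    # slot 4
    return r


def m4x4_as_dp4(rows, v):
    """Same product as four 4-component dot products, one per row."""
    return [sum(x * y for x, y in zip(row, v)) for row in rows]


rows = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
cols = [list(c) for c in zip(*rows)]   # transpose to get the columns
v = [1, 0, 2, 1]

result_mad = m4x4_as_mul_mad(cols, v)
result_dp4 = m4x4_as_dp4(rows, v)
```

Both paths give the same vector, so the 4-slot cost doesn't depend on which decomposition the hardware actually uses.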