Neither the CUDA-to-PTX compiler nor the PTX-to-GPU-bytecode compiler needs to do any vectorization.
Nor is it needed for ATI's OpenCL-to-IL compiler, for instance. The difference is that it *can* be done (you can write r#.xyzw, or just r#, to apply an instruction to all four components; it also works with just 2 or 3 components), but this mostly just makes the code shorter and more readable. It makes no difference whether you pass
add r100.x, r25.w
add r100.y, r25.z
add r100.z, r25.x
or
add r100.xyz, r25.wzx
to the IL compiler. Both are handled as three independent additions, which may or may not end up in the same VLIW bundle (a sketch of that follows below). The slot the hardware uses for an instruction is also not determined by the IL code (that is only the case for double-precision adds, but it will likely be fixed in the future).

The Brook+ and OpenCL-to-IL compilers don't do any vectorization or other fancy stuff. They simply translate the code to IL in the most straightforward way (the result looks terrible most of the time). The optimization is done by the IL compiler, but as MfA already said, this optimization is about scheduling, finding ILP, and packing the VLIW bundles, not about vectorization.
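To illustrate how that packing might look (a rough sketch only, not actual disassembly output; the slot assignment is entirely up to the shader compiler), the three scalar adds above could all land in one 5-wide bundle:

x: add r100.x, r25.w
y: add r100.y, r25.z
z: add r100.z, r25.x
w: (free for another independent instruction)
t: (free for another independent instruction)

Or they could be spread over several bundles if the compiler finds better candidates to pair them with; the IL doesn't dictate it either way.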
If a programmer chooses instructions using 2, 3, or 4 components, that only means he structures the code to be shorter and more readable, with more of the parallelism readily exposed. It doesn't magically lead to higher utilization compared to the same algorithm written with more scalar instructions.
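As an example of why the notation doesn't matter (again just a sketch in the same pseudo-IL as above), consider an accumulation chain where each add folds its operand into r1.x:

add r1.x, r0.x
add r1.x, r0.y
add r1.x, r0.z

These three adds have to occupy three separate bundles no matter how you write them, because each one consumes the result of the previous one. Utilization comes from the ILP in the algorithm itself, not from whether the source uses vector or scalar notation.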