Lane masks in LRB and bitwise masks in SSE. Similar mechanisms are implemented in hardware on ATI and NV.
And for short vectors it's easy to do the same in software.
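To illustrate, here is a minimal sketch (made-up names, toy computation) of how per-lane masking can be emulated in software for a 4-wide vector with plain bitwise ops, which is essentially what an SSE compare-and-blend does per lane:

#include <stdint.h>

/* bitwise select: take a where the mask bits are set, b elsewhere */
static inline int32_t blend(int32_t a, int32_t b, uint32_t m)
{
    return (int32_t)(((uint32_t)a & m) | ((uint32_t)b & ~m));
}

/* dst[i] = x[i] > 0 ? x[i] + y[i] : x[i] - y[i], without a per-lane branch */
static void masked_addsub(int32_t dst[4], const int32_t x[4], const int32_t y[4])
{
    for (int i = 0; i < 4; ++i) {
        uint32_t m = (x[i] > 0) ? 0xFFFFFFFFu : 0u;   /* lane mask: all ones or all zeros */
        dst[i] = blend(x[i] + y[i], x[i] - y[i], m);  /* both sides computed, then masked */
    }
}

Both sides of the "branch" are computed and the mask picks the result per lane, which is exactly what the hardware lane masks do for you.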
I'd like to see a moderately complex problem "vectorized" over ATI's VLIW slots before I'd believe it. Divergence and all the warts included.
I've done that already.
Basically it looks like manual loop unrolling with special-case handling for the possible divergences. Often that can be done efficiently with conditional moves for short divergences, or with normal control structures, which have effectively the same characteristics as lane masking.
You bloat the code but get a decent speedup (as long as the divergences don't dominate, but in that case GPUs suck either way).
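To sketch the pattern (hypothetical names and a toy computation; on ATI you'd write the equivalent with float4 in OpenCL, I'm just using CUDA-flavoured code here for illustration):

// each thread handles 4 consecutive elements ("slots"), and the short
// divergent case is folded into a select, i.e. effectively a conditional move
__global__ void saxpy_clamped_vec4(float *y, const float *x, float a, int n)
{
    int base = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (base + 3 >= n) return;                 // ignore the ragged tail for brevity

    #pragma unroll
    for (int i = 0; i < 4; ++i) {              // manual unrolling over the 4 slots
        float v = a * x[base + i] + y[base + i];
        y[base + i] = (v < 0.0f) ? 0.0f : v;   // divergence handled by a select
    }
}

Longer divergent regions stay as normal if/else blocks applied per slot, which behaves much like the lane masking on the other architectures.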
And it twists the programming model completely out of shape.
Now there are multiple cores, one vector width (64) and another vector width (4) to tackle. All for at best a 4x perf gain.
No, as the normal vectorization on GPUs is implicit, you don't handle that part at all (except when dealing with shared memory). You simply add the explicit vectorization on top, that's it. It's roughly the same as using SSE intrinsics, only more flexible.
For one out of 3 vendors. Not worth the trouble in my book.
The following is from nv's marketing slides but IS absolutely true.
-- 2-3x is not worth spending much effort. Wait a little or spend some money.
-- 10-15x gains are worth doing big rewrites.
Obviously I'm not reading your or nvidia's books. A factor of 2 often decides if something is feasible or not. And even a 50% speedup is nothing to sneeze at in my book. If you have to write something from the ground up either way (which is the case for GPGPU much more often than not), it is definitely not an insurmountable task to get it implemented, and it comes with relative ease if one has thought about and planned for that stuff beforehand.
Btw, I mentioned this to you before and I'll reiterate it: NVIDIA GPUs also often gain from explicit vectorization, as it reduces the granularity of memory accesses and increases the burst lengths. It is simply more cache friendly, and with a lot of algorithms being bandwidth limited, it can be astonishingly effective for some problems given the "scalar" nature of NVIDIA GPUs.
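As a minimal sketch of what I mean (hypothetical kernel, purely illustrative): each thread issues one 16-byte float4 access instead of four separate 4-byte ones, assuming 16-byte alignment and an element count that is a multiple of four.

// explicitly vectorized scale: one 128-bit load and one 128-bit store
// per thread instead of four 32-bit ones (n4 = number of float4 elements)
__global__ void scale_vec4(float4 *out, const float4 *in, float s, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;

    float4 v = in[i];                          // one 16-byte load per thread
    v.x *= s; v.y *= s; v.z *= s; v.w *= s;
    out[i] = v;                                // one 16-byte store per thread
}

Whether it pays off depends on alignment and on how bandwidth-bound the kernel is, but for streaming kernels the difference is often measurable.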