Not sure what you mean by that.
It means the starting point for writing efficient code is about as far away as it's possible to get when using these architectures.
Given scalar-only code, a compiler has an obvious starting point on these architectures: simply lump multiple threads into a single hardware thread. But I think it'd be better if programmers were encouraged to think in vec4. Vectors aren't going away.
It'd be interesting to see an analysis of the CUDA apps out there, to find out how many of them are most efficient when coded as purely scalar, i.e. kernels that use no vector types or structures anywhere.
Does OpenCL have explicit accommodations for vec4 processing elements? I couldn't see anything along those lines in any of the sample code. It would make sense for each vendor's OpenCL back-end to pack work-items appropriately for the hardware, but then that's not an API issue.
OpenCL supports vectors of various sizes as well as structures, both built upon the usual data types such as float and integer. The programmer can query the underlying hardware to find out its preferred "packing", e.g. scalar or vec4, but as far as I can see that doesn't lead to any automated optimisation.
OpenCL is not particularly close to the metal (though it promotes itself as such, much as CUDA does, it seems), so yes, each vendor's just-in-time compiler has to translate the given code to its hardware architecture. When most of the hardware out there uses vec4-float as the most fundamental data type in its throughput ALUs/registers, obfuscating this by basing the programming model on scalars seems troublesome.
Larrabee is interesting because it is essentially a vec16 architecture. There is support for native vec16 data in OpenCL, so it hasn't been left out in the cold. It may well turn out to be useful to program NVidia GPUs with vec8 or vec16 as the underlying data type, matching either SIMD-width or bank-count for various types of memory.
The issue is fundamentally about appropriate ways to "auto-parallelise", and starting with scalar code makes this rather more fiddly than starting with vec4.
Sure, it's too early to tell yet, but it seems programmers who want to write cross-platform OpenCL are going to spend a lot of time tripping over the scalar-versus-vector issue, particularly as OpenCL makes scalar, not vector, the underlying programming model.
Maybe everyone thinks it's no big deal - clearly I'm on the sidelines
It's been amazingly quiet so far, there's no real OpenCL community to speak of.
A "vec4 native" data parallel API certainly doesn't sound very flexible to me.
What you're saying is that graphics, being vec4-native, isn't flexible at all, particularly on machines that are effectively vec4.
And further, that running this same graphics code on NVidia's scalar architecture is difficult.
Have you looked at:
http://www.khronos.org/registry/cl/specs/opencl-1.0.33.pdf
The forum's pretty quiet:
http://www.khronos.org/message_boards/viewforum.php?f=28
Jawed