I don't see that. Maybe you know something we don't know.
I can't find the posts again, and Google Desktop on my PC has a very selective approach to my web history. I don't know of any explicit statement to this effect, merely hints that it's a condition of OpenCL - which is why I don't state it as a fact, but as an understanding/impression/belief.
Coming back to my original statement: 'I can't see OpenCL ever achieving "performance parity" with CUDA because LLVM is in the way.' I think I should modify that with the caveat "as long as LLVM isn't modelling all of the GPU's relevant architectural parameters". (And this is apart from the general looseness of OpenCL as compared with CUDA.)
e.g. on ATI, clause temporary registers have a maximum lifetime of 32 cycles, or ~128 scalar operations (with a maximum of 5 packed into a cycle). How these are used can have a major effect on both cycle count and register allocation, i.e. major performance differences, not 5 or 10%.
Or, how does one get LLVM to model memory latencies? Are there heuristics or flags for burstiness? Modelling of hardware-thread-count versus latency-hiding?
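To make that concrete, here's a minimal CUDA sketch (the kernel name, unroll factor and the assumption of one output slot per thread are mine, purely for illustration) of the kind of decision a latency-aware code generator has to make: issuing several independent loads per thread so that global-memory latency overlaps, rather than serialising load/use pairs one at a time.

```cuda
// Hypothetical memory-bound kernel: each thread sums n/stride strided elements.
// An optimizer that models memory latency can unroll so that several
// independent loads are in flight before any of them is consumed; one
// that doesn't will tend to serialize load -> use -> load -> use.
__global__ void strided_sum(const float* __restrict__ in,
                            float* __restrict__ out,   // one slot per thread
                            int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    // Unroll by 4: four loads issue back-to-back, hiding much of the
    // latency of each behind the others. Choosing the factor well
    // requires knowing latencies, register pressure and how many
    // hardware threads are resident - exactly the parameters in question.
    int i = tid;
    for (; i + 3 * stride < n; i += 4 * stride) {
        float a = in[i];
        float b = in[i + stride];
        float c = in[i + 2 * stride];
        float d = in[i + 3 * stride];
        acc0 += a; acc1 += b; acc2 += c; acc3 += d;
    }
    for (; i < n; i += stride)
        acc0 += in[i];

    out[tid] = acc0 + acc1 + acc2 + acc3;
}
```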
This is pretty common. CUDA had similar beginnings - friends of mine who worked on it in 2007 have lots of horror stories from that era. It seems tools need ~1-1.5 years out of beta to mature. No compiler is perfect.
I agree. There are broken things in AMD's toolchain that have been broken for years - and that's at the IL level.
NVidia's advantage with a pure CUDA chain is that it can throw its own staff at the problem, because it's internal technology. OpenCL depends on third parties - though I presume issues in LLVM, say, can be attacked by NVidia as a contributor, given its status as an open source project.
Here's a recent one (HINT: look at the date :smile:)
http://gpuray.blogspot.com/2009/08/as-fast-as-glsltm.html
And this one's pretty good too.
I'm not asserting CUDA is faultless, merely that NVidia has tighter reins.
I have to say that person says some pretty worrying things. Confusing float and double: as one of my maths teachers used to say, "that's a schoolboy error" (for a class of 16-18 year-olds).
I'm bemused by this sentence, "It turns out that CUDA gpus do not do 4 wide vector operations (according to a reasonably reliable source)." Sigh.
CUDA is ahead, no doubt. But blaming LLVM is an oversimplification.
It's only one factor. And I don't think it's all bad - I'm sure I've seen the odd case where the OpenCL version is faster than the CUDA one. This is similar to the way some people are writing kernels in PTX instead of C for CUDA. And then there are people who are modifying the binary for a kernel...
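For what it's worth, here's a hedged sketch (my own toy example, not anyone's production code) of the milder version of that route: CUDA lets you embed hand-written PTX in a kernel with asm(), and the people writing whole kernels in PTX are doing this, only wholesale.

```cuda
// Minimal example of mixing hand-written PTX into a C-for-CUDA kernel.
// The multiply-add below is expressed directly in PTX rather than left
// to the compiler, to control the generated instruction stream.
__global__ void axpy_ptx(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r;
        // mad.f32 %out, %a, %x, %y  -- single-precision multiply-add
        asm("mad.f32 %0, %1, %2, %3;"
            : "=f"(r)
            : "f"(a), "f"(x[i]), "f"(y[i]));
        y[i] = r;
    }
}
```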
OCL drivers are still maturing, sure. But this is no reason to diss LLVM. People have simply invested more in their own compilers so far. And no, I'm not buying that OCL is an abstraction so far removed from your hardware that it's somehow difficult to optimize. It might be more generic, but if NV chose to generate WHIRL from OCL kernels and then use their own optimizers, it would hardly be any less performant. Once the front end transforms the code into IR, all the HLSL/GLSL/OCL/CUDA differences are washed away, and the optimizations happen at the IR level.
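As a rough illustration of that point (a sketch, with the OpenCL analogue shown only in a comment): the two source forms below differ in surface syntax, yet lower to essentially the same IR-level operations - compute a global index, guard it, then load, multiply and store.

```cuda
// C for CUDA version of a trivial scaling kernel.
__global__ void scale(float* data, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}

/* The OpenCL C equivalent differs only in spelling:
 *
 *   __kernel void scale(__global float* data, float k, int n)
 *   {
 *       int i = get_global_id(0);
 *       if (i < n)
 *           data[i] *= k;
 *   }
 *
 * After the front end lowers either one to IR, what's left is the same
 * index arithmetic, branch, load, multiply and store; the interesting
 * optimization work happens from there down.
 */
```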
CUDA/OpenCL programming is more than just a sequence of opcodes (or a question of how many hours it takes to JIT it) - it's a memory-hierarchy programming paradigm. I won't say it's intractable for LLVM, merely that the work needs to be done. And NVidia has full control over CUDA, but can only act as a contributor for LLVM.
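A minimal sketch of what "memory-hierarchy programming" means in practice (again my own toy example): the programmer explicitly stages data through on-chip shared memory, and the tool chain has to understand the staging, the barriers and the occupancy implications to do anything clever with it.

```cuda
// Block-wise sum using explicitly managed shared memory.
// The staging through __shared__ storage, the barrier, and the tree
// reduction are statements about the memory hierarchy, not just a
// sequence of ALU opcodes for a scheduler to reorder.
__global__ void block_sum(const float* in, float* block_out, int n)
{
    __shared__ float tile[256];          // assumes blockDim.x == 256

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                     // all loads visible before reuse

    // Tree reduction within the block, entirely in on-chip memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        block_out[blockIdx.x] = tile[0];
}
```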
It doesn't have to descend into something like the OpenGL extension wars, but NVidia might want to keep some of its technologies out of the public domain, especially if they reveal particularities of the GPUs' internal workings.
Perhaps Apple had other plans.
The open-source 3D drivers are supposed to have LLVM-based optimizers.
http://llvm.org/devmtg/2007-05/10-Lattner-OpenGL.pdf
I'm unclear on how things work when an ATI or NVidia GPU is running the OpenGL app, e.g. what happens when OpenGL 3.2 is used on a system with an OpenGL 2.0 graphics card.
To be sure, they are lagging, but building high-performance drivers takes time and effort.
Yes, and having the entire toolchain in-house for your own custom hardware (and its next generation, as you try to design the next architecture as a match for the problem space while also delivering a revised toolchain on time) beats being dependent on third parties. Sure, there's marketing in play when NVidia says it's a software company, but the claim is inescapable given the sheer complexity of current GPUs and the current state of parallel programming.
Intel didn't abandon at least one iteration of Larrabee for no good reason, after all.
Jawed