I'm writing something that I hope will run on NVidia (since the previous version does), so I'm sticking to OpenCL 1.1. Some time later I might try OpenCL 2.0.
Unfortunately I'm working with huge amounts of intermediate data (about 10GB per kernel invocation in the extreme case) that needs to be sorted. My sort is about 10x faster than off-the-shelf sorts, but that's because I don't need a full ordering, merely to know which are the best 128 items in a 624-item list (there are millions of these lists to sort per kernel invocation). I get that performance by using registers across cooperating work items to hold the list, rather than a combination of global and local memory.
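To make the selection idea concrete, here's a heavily simplified OpenCL C sketch, not my actual kernel: it collapses the work to one work item per list and keeps the 128 survivors in a private array, whereas my real version spreads them across a cooperating work group so they genuinely stay in registers (a 128-element private array like this would almost certainly spill). The names and the "higher score is better" convention are assumptions for illustration only.

    /* Hypothetical sketch: keep the best 128 of a 624-item list without
     * fully sorting it. One work item per list; replace the worst survivor
     * on the fly; output is unordered. */
    #define LIST_LEN 624
    #define KEEP     128

    __kernel void select_top128(__global const float *scores, /* LIST_LEN per list */
                                __global float       *best)   /* KEEP per list */
    {
        const size_t list = get_global_id(0);
        __global const float *in = scores + list * LIST_LEN;

        float kept[KEEP];      /* running best-128, unordered */
        int   count   = 0;
        int   min_idx = 0;     /* slot holding the current worst survivor */

        for (int i = 0; i < LIST_LEN; ++i) {
            float v = in[i];
            if (count < KEEP) {
                kept[count] = v;
                if (count == 0 || v < kept[min_idx]) min_idx = count;
                ++count;
            } else if (v > kept[min_idx]) {
                kept[min_idx] = v;         /* evict the worst survivor */
                min_idx = 0;               /* rescan for the new worst */
                for (int j = 1; j < KEEP; ++j)
                    if (kept[j] < kept[min_idx]) min_idx = j;
            }
        }

        __global float *out = best + list * KEEP;
        for (int j = 0; j < KEEP; ++j)
            out[j] = kept[j];
    }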
A fundamental problem with OpenCL 2.0 GPU enqueue is that newly generated parent data has to go off-chip to be used by the child kernels. Sure, caching might work, but since I'm sticking with OpenCL 1.1 for the time being, it's going to be a while before I get to explore whether it's possible to make caching work by constructing a suitably fine-grained cluster of parents and their children. I'm doubtful.
Ultimately I'm looking at finding the best 128 items in ~4000-item lists, for millions of lists (~80GB of raw data), potentially using multiple kernels in succession per list, which would necessitate off-chip storage twixt kernels. That would also require breaking the work up into sub-domain kernels, since the minimum working set is a smidgen more than 1GB, and I'd prefer not to exclude 1GB cards.
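On the host side that chunking would look roughly like the following OpenCL 1.1 sketch. The names, the chunk sizing and the elided transfers are all assumptions rather than my actual code, and it relies on the queue, kernel and chunk-sized buffers having been created already.

    /* Hypothetical host-side sketch: split the millions of lists into
     * sub-domains small enough that one launch's working set fits a 1GB card. */
    #include <CL/cl.h>

    cl_int run_in_subdomains(cl_command_queue queue,
                             cl_kernel        select_kernel, /* e.g. the top-128 kernel */
                             cl_mem           scores_buf,    /* sized for one chunk of input */
                             cl_mem           best_buf,      /* sized for one chunk of output */
                             size_t           total_lists,
                             size_t           lists_per_chunk)
    {
        cl_int err;

        for (size_t first = 0; first < total_lists; first += lists_per_chunk) {
            size_t lists_this_chunk = total_lists - first;
            if (lists_this_chunk > lists_per_chunk)
                lists_this_chunk = lists_per_chunk;

            /* Upload this chunk's input (clEnqueueWriteBuffer) -- elided. */

            err = clSetKernelArg(select_kernel, 0, sizeof(cl_mem), &scores_buf);
            if (err != CL_SUCCESS) return err;
            err = clSetKernelArg(select_kernel, 1, sizeof(cl_mem), &best_buf);
            if (err != CL_SUCCESS) return err;

            size_t global = lists_this_chunk;  /* one work item per list, as in the kernel sketch */
            err = clEnqueueNDRangeKernel(queue, select_kernel, 1, NULL,
                                         &global, NULL, 0, NULL, NULL);
            if (err != CL_SUCCESS) return err;

            /* Read back this chunk's 128 survivors (clEnqueueReadBuffer) -- elided. */
        }

        return clFinish(queue);
    }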
The Intel article's use of GPU enqueue seems to be partly about not being dependent on the CPU to set up kernel launch parameters based on the results produced. I imagine you were alluding to that aspect for my purposes, but not necessarily with varying parent-data-dependent kernel parameters. That use case, where a parent kernel simply issues sub-domain kernels, does sound like an interesting way to avoid the launch overheads that arise from CPU/GPU interaction. Though I can imagine it still being subject to the command processor bottleneck, if indeed that's what's happening.
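For reference, this is roughly what that parent-issues-sub-domain-kernels pattern looks like with OpenCL 2.0 device-side enqueue. It's a hypothetical sketch only (so of no use to me under 1.1): the helper body is a placeholder for the real selection, the names and chunk size are made up, and it assumes the host has created a default on-device queue.

    /* Hypothetical OpenCL 2.0 sketch: a parent kernel fans out one child
     * launch per sub-domain, so the CPU never sets up the individual launches. */
    #define LIST_LEN 4000
    #define KEEP     128

    /* Placeholder for the real top-128 selection on one list; this body just
     * copies the first KEEP items so the example is complete. */
    void select_best128_for_list(global const float *in, global float *out)
    {
        for (int j = 0; j < KEEP; ++j)
            out[j] = in[j];
    }

    /* Intended to be launched with a single work item. Requires a default
     * device queue created by the host with
     * CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT. */
    kernel void parent(global const float *scores,
                       global float       *best,
                       uint                total_lists,
                       uint                lists_per_chunk)
    {
        queue_t q = get_default_queue();

        for (uint first = 0; first < total_lists; first += lists_per_chunk) {
            uint n = min(lists_per_chunk, total_lists - first);
            ndrange_t range = ndrange_1D(n);   /* one child work item per list */

            enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, range,
                ^{
                    size_t list = first + get_global_id(0);
                    select_best128_for_list(scores + list * LIST_LEN,
                                            best   + list * KEEP);
                });
        }
    }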
Catalyst Omega has OpenCL 2.0 support; I'm unclear whether that is beta, per se. The APP SDK version 3.0, which adds OpenCL 2.0 support, is itself a beta.
http://developer.amd.com/community/blog/2014/12/09/amd-app-sdk-3-0-beta/