IIUC, you are saying that game engine people complain about writing shaders that are performance portable across IHVs. Did I understand your point correctly? I was referring to GPU ISAs only in that comment of mine.
Obviously people don't complain about portable shaders, but even ComputeShader doesn't expose a lot of what these chips can do, and people do complain about having to go to IHV-specific - if not hardware-specific - languages and ISAs to get at it. In a world where you had heterogeneous cores on the same die with a shared memory space, they would complain a lot more loudly; the necessity of the abstraction right now is what forgives the current model. FWIW though, yes, it's a pain that I have to compile code, query reflection data and bind parameters so manually at runtime, even in the fairly streamlined DX10/11 APIs!
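For what it's worth, here's a minimal sketch of what that per-shader dance looks like in D3D11 - compile at runtime, reflect to discover the constant buffer layout, then create and bind the buffer by slot. Error handling is omitted and the names (CompileReflectBind, psSource) are placeholders, not anyone's real engine code:

```
#include <d3d11.h>
#include <d3d11shader.h>
#include <d3dcompiler.h>

// Sketch only: compile HLSL, reflect the constant buffer layout, bind by slot.
void CompileReflectBind(ID3D11Device* dev, ID3D11DeviceContext* ctx,
                        const char* psSource, size_t psLen)
{
    // 1. Compile the pixel shader source to bytecode at runtime.
    ID3DBlob* code = NULL;
    ID3DBlob* errors = NULL;
    D3DCompile(psSource, psLen, NULL, NULL, NULL,
               "main", "ps_5_0", 0, 0, &code, &errors);

    // 2. Query reflection data to discover how big constant buffer 0 must be.
    ID3D11ShaderReflection* refl = NULL;
    D3DReflect(code->GetBufferPointer(), code->GetBufferSize(),
               IID_ID3D11ShaderReflection, (void**)&refl);
    D3D11_SHADER_BUFFER_DESC cbd;
    refl->GetConstantBufferByIndex(0)->GetDesc(&cbd);

    // 3. Create a matching constant buffer and bind it to slot 0 by hand.
    D3D11_BUFFER_DESC bd = {};
    bd.ByteWidth      = (cbd.Size + 15) & ~15u;   // constant buffers are 16-byte multiples
    bd.Usage          = D3D11_USAGE_DYNAMIC;
    bd.BindFlags      = D3D11_BIND_CONSTANT_BUFFER;
    bd.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
    ID3D11Buffer* buf = NULL;
    dev->CreateBuffer(&bd, NULL, &buf);
    ctx->PSSetConstantBuffers(0, 1, &buf);
}
```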
AFAICS, you are looking at a future where there will be, say, 6 Sandy Bridge cores and 32 LRB cores on a single die, and your OS will be able to kick threads around freely. If that is what you are looking at, then I am afraid it is not going to happen for a very long time.
Regardless of the time frame, it is a desirable end-point and starting with an x86-based part is a step in that direction.
Now you see why a thread cannot be kicked from a Sandy Bridge core to an LRB core (and vice versa) without risking SIGILL.
Right now that's obviously not the case unless you have some guarantees about the "modes" or common feature sets that a thread uses. I see no reason why it couldn't be the case in the future.
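To be concrete about what I mean by "common feature sets" (a purely hypothetical sketch - none of these names are a real OS interface): a thread that only dispatches into code paths guaranteed to exist on every core it might be migrated to can never hit SIGILL, no matter which core it lands on.

```
#include <stdio.h>

typedef void (*kernel_fn)(float* dst, const float* src, int n);

/* Baseline path: uses only instructions every core in the system implements. */
static void scalar_path(float* dst, const float* src, int n)
{
    for (int i = 0; i < n; ++i) dst[i] = src[i] * 2.0f;
}

/* Wide-vector path: only legal if *every* core implements the wide vector ISA. */
static void wide_simd_path(float* dst, const float* src, int n)
{
    for (int i = 0; i < n; ++i) dst[i] = src[i] * 2.0f;  /* stand-in for vector code */
}

/* Placeholder for whatever OS/firmware contract would report whether an ISA
 * extension is present on ALL cores, not just the one we happen to be running on. */
static int feature_on_all_cores(void) { return 0; }

int main(void)
{
    kernel_fn k = feature_on_all_cores() ? wide_simd_path : scalar_path;
    float src[4] = {1, 2, 3, 4}, dst[4];
    k(dst, src, 4);
    printf("%g\n", dst[0]);
    return 0;
}
```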
Even if it goes on-die, asking LRB to become binary-compatible with an Intel CPU is too much.
Why is it too much to ask for some unification of these things in the long run? Do you really want to continue with N different vector ISAs?
CUDA has two restrictions on pure C:
1) no function pointers
2) no recursion
Uhh and AFAIK a ton of rules to do with statically resolvable shared vs. global memory aliasing, unless something has changed recently.
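Concretely, the sort of thing I'm talking about (a hypothetical kernel with illustrative names): with no unified address space, the compiler has to work out statically whether every pointer refers to __shared__ or global memory, usually by fully inlining device functions, and a pointer whose space depends on a runtime value defeats that.

```
__device__ float sum4(const float* p)   // which address space does p point into?
{
    return p[0] + p[1] + p[2] + p[3];
}

__global__ void aliasing_demo(const float* g_in, float* g_out, int use_shared)
{
    __shared__ float tile[256];          // assumes a block of at most 256 threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_in[i];
    __syncthreads();

    // The address space of p cannot be resolved at compile time: it depends on a
    // runtime value. Pre-Fermi compilers warn and fall back to assuming global
    // memory, which is simply wrong when p actually points into shared memory.
    const float* p = use_shared ? &tile[(threadIdx.x / 4) * 4]
                                : &g_in[(i / 4) * 4];
    g_out[i] = sum4(p);
}
```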
But this is exactly my point: saying it "supports C" implies you can take a C program and port it easily to CUDA. This is not the case for all but the most trivial C programs, due to the lack of a standard library, a completely different threading and synchronization model, explicitly-managed cache hierarchies, etc. The fact that it can compile C code that doesn't use anything but the basic keywords is, and let me emphasize,
completely uninteresting. No programmer worth their salt gives a damn about the syntax of the language... it's the stuff that's completely different between typical C and CUDA that actually matters.
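To illustrate just how low that bar is, here's a contrived example of mine (not from any real codebase): the snippet below is about as vanilla as C gets, yet heap allocation, qsort's function-pointer callback, and stdio all have no counterpart in device code today, so "it compiles the keywords" tells you nothing about porting it.

```
#include <stdio.h>
#include <stdlib.h>

/* Comparator passed by function pointer - restriction #1 above rules this out. */
static int cmp_float(const void* a, const void* b)
{
    float x = *(const float*)a, y = *(const float*)b;
    return (x > y) - (x < y);
}

int main(void)
{
    int n = 8;
    float* v = (float*)malloc(n * sizeof *v);    /* no device-side heap allocation */
    for (int i = 0; i < n; ++i) v[i] = (float)(n - i);
    qsort(v, n, sizeof *v, cmp_float);           /* standard library: absent on the device */
    for (int i = 0; i < n; ++i) printf("%g ", v[i]);   /* no device-side stdio either */
    printf("\n");
    free(v);
    return 0;
}
```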
So yeah, it'll be awesome when I can use function pointers and "exception-like things" on Fermi, but let's not pretend that this has any relationship to the flexibility of a typical CPU. You still can't express the most basic producer/consumer or message-passing thread models in CUDA, and while this might come with Fermi's "task parallelism" stuff and CUDA 3.0, that remains to be seen.
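As a concrete example of the gap, here's a deliberately trivial pthreads sketch (hypothetical, host-only code) of a consumer blocking on a condition variable until a producer hands it an item. There's no way to express this kind of blocking hand-off between concurrently running threads inside today's CUDA kernel model.

```
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int item = 0, have_item = 0;

static void* consumer(void* arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!have_item)                 /* sleep until the producer signals us */
        pthread_cond_wait(&ready, &lock);
    printf("consumed %d\n", item);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);

    /* Producer side: publish one item and wake the waiting consumer. */
    pthread_mutex_lock(&lock);
    item = 42;
    have_item = 1;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}
```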
Don't get me wrong, Fermi looks impressive on paper and a lot of the features they are adding are desperately needed in the GPU computing space, but I've yet to see any indication that Fermi could - say - run a full OS, which is the sort of flexibility typically associated with a CPU. I'm also not saying that typical CPU programming models are ideal going forward (inheriting the "everything aliases" pointer model of C/C++ would be insane, so let's not hold those languages up as the ideal!), but there are still fundamental differences between programming a multi-core CPU and a GPU at the moment.