The focus seems to be on Compute Shader, and it looks to me like an effort to deflect attention from the noises NVidia's been making about CS. NVidia appears ready to market CS4/CS4.1 as all a developer needs (because every one of its GPUs since G80 works this way), so developers should focus on those, not CS5 - and therefore D3D11 isn't relevant until NVidia says so. I wouldn't be surprised if NVidia's already well under way with this campaign.
Some of the restrictions in CS4.1 come from existing ATI cards, though - e.g. the private-write/shared-read model of shared memory (a thread can only write to its own 16-256B region, though it can read the whole of the group's shared memory) and the lack of atomics. Both of these seem like fairly serious restrictions. Private-write/shared-read isn't even a hardware restriction on ATI (R7xx only), yet Brook+ has the same restriction, and I can't work out the underlying logic of it (I can only think it's super-slow on ATI).
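For concreteness, here's a minimal CUDA sketch of the access pattern that private-write/shared-read still permits - each thread writes only its own 4-byte slot, barriers, then reads its neighbours' slots. The kernel and the 3-tap smoothing are my own illustration, not anything from the whitepaper:

```cuda
// Private-write/shared-read pattern: each thread writes only its
// own 4-byte slot (well inside a 16-256B private region), then
// after a barrier it may read any other thread's slot.
__global__ void smooth3(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // one slot per thread

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = (gid < n) ? in[gid] : 0.0f;  // private write
    __syncthreads();                         // make writes visible

    if (gid < n) {
        // shared reads of other threads' slots are fine
        float l = (tid > 0)              ? tile[tid - 1] : tile[tid];
        float r = (tid < blockDim.x - 1) ? tile[tid + 1] : tile[tid];
        out[gid] = (l + tile[tid] + r) / 3.0f;
    }
}
```

What the model rules out is scattering into shared memory - any kernel where a thread writes to a slot it doesn't own.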
768 threads in a thread group is the basic limit of G80 - but, something I didn't fully realise until recently, the CUDA limit on a single block is 512 threads. I wonder if that 512 limit also influenced the shared memory model.
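You can query the 512 limit straight from the CUDA runtime - a trivial host-side check (assumes device 0; prints 512 on G80-class parts):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query device 0; G80-class parts report 512 threads per block
    // and 16KB of shared memory per block.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
    printf("sharedMemPerBlock  = %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```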
Actually, is there a difference between CS4 and CS4.1?
So, it seems that CS4 is significantly less functional than CUDA on current GPUs. G80's slight lack of functionality (no atomics) is a bit of a hindrance, but it does look like shared memory functionality has been nobbled for ATI's sake. And I still expect ATI earlier than R7xx to be incapable of CS4 - unless shared memory is emulated through video memory.
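For what it's worth, the sort of thing the missing atomics (and the CS4.x write restriction) shut out is trivial scatter, like a shared-memory histogram. A quick CUDA sketch - this needs compute capability 1.2+ for the shared-memory atomics, so it won't run on G80 at all, which is rather the point (kernel name is mine, and it assumes a block size of exactly 256):

```cuda
// A shared-memory histogram: threads scatter increments into bins
// owned by no particular thread - impossible without atomics, and
// impossible under CS4.x's private-write restriction.
__global__ void histogram256(const unsigned char *data, int n,
                             unsigned int *bins)
{
    __shared__ unsigned int local[256];
    int tid = threadIdx.x;

    local[tid] = 0;                       // one bin per thread
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);   // scatter: needs atomics
    __syncthreads();

    atomicAdd(&bins[tid], local[tid]);    // merge into global bins
}
```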
I don't remember seeing this before:
Indirect Compute Dispatch: This feature enables the generation of new workloads created by previous rendering or compute shading without CPU intervention. This further reduces CPU overhead and frees up more processing time to be used on other tasks.
This seems to imply that kernel domains can be sized, created and despatched by the GPU.
Or maybe it simply means that the GPU can auto-run a kernel over a domain that's defined by a buffer created in a prior rendering pass. The input buffer effectively defines the domain size, and completion of writes to that buffer is required before the new kernel can start. It might not even be a new kernel, but a repeated instance of the kernel that's just completed - some kind of "successive refinement"?
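Either way, the win is skipping today's round-trip. In CUDA (which has nothing like indirect dispatch) the host has to read the GPU-computed domain size back before it can launch the follow-up pass - roughly the sketch below, with made-up kernel and buffer names. Indirect dispatch would let the GPU consume d_count in place:

```cuda
#include <cuda_runtime.h>

// Hypothetical follow-up kernel; stands in for the "repeated
// instance" of the pass that just completed.
__global__ void refine(float *work, unsigned int count)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) { /* ... refine work[i] ... */ }
}

// Without indirect dispatch, the host must copy the GPU-computed
// domain size back before it can size and launch the next pass.
void launchRefine(float *d_work, const unsigned int *d_count)
{
    unsigned int count = 0;
    cudaMemcpy(&count, d_count, sizeof(count), cudaMemcpyDeviceToHost);

    if (count > 0) {
        unsigned int blocks = (count + 255) / 256;
        refine<<<blocks, 256>>>(d_work, count);
    }
}
```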
I can't find this whitepaper on AMD's site.
Jawed