Beyond the difficulty of writing efficient compute shaders, and how efficiently they run on any given piece of hardware, devs will think long and hard before stealing GPU resources from rendering.
Compute shaders are very useful for graphics rendering as well. If you can perform some graphics rendering steps more efficiently using compute shaders, you are naturally going to do it. In addition to speeding up existing processing steps, compute shaders allow completely new graphics algorithms to be implemented. Current pixel-shader-based rendering pipelines rely heavily on brute force (a huge regular sampling grid, etc.); compute shaders allow more clever algorithms to be implemented instead. In many cases you also want to run some algorithm steps on the GPU simply because the GPU->CPU->GPU roundtrip latency is too long.
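To make the latency point concrete, here is a minimal sketch of my own (resource names like SceneLuminance, ReducedOutput and the ReduceParams constant buffer are made up for illustration) of the kind of step you want to keep on the GPU: a cs_5_0 parallel reduction that sums scene luminance for tone mapping, so the exposure value never has to round-trip through the CPU.

```hlsl
cbuffer ReduceParams : register(b0)
{
    uint GroupCountX;   // number of thread groups dispatched in X (assumed to be set by the app)
};

Texture2D<float>          SceneLuminance : register(t0);
RWStructuredBuffer<float> ReducedOutput  : register(u0);   // one partial sum per thread group

groupshared float gs_sum[256];

[numthreads(16, 16, 1)]
void ReduceLuminanceCS(uint3 dtid : SV_DispatchThreadID,
                       uint  gi   : SV_GroupIndex,
                       uint3 gid  : SV_GroupID)
{
    // Each of the 256 threads loads one texel (bounds checks omitted for brevity).
    gs_sum[gi] = SceneLuminance[dtid.xy];
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction inside the thread group, entirely in groupshared memory.
    for (uint stride = 128; stride > 0; stride >>= 1)
    {
        if (gi < stride)
            gs_sum[gi] += gs_sum[gi + stride];
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 writes this group's partial sum; a second (smaller) dispatch
    // would reduce the partial sums to a single average for the tone mapper.
    if (gi == 0)
        ReducedOutput[gid.y * GroupCountX + gid.x] = gs_sum[0];
}
```

The whole chain (reduce, then feed the exposure into the tone mapping pass) stays on the GPU, which is exactly why you don't want to read the intermediate result back to the CPU.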
There is no inherent difference between PS and CS (or any other shader type) in this regard.
I do not agree with this. Since PS3.0 you have been able to do basically anything you wanted in a pixel shader. PS4.0 and 5.0 didn't bring any features that allowed anything radically new, except maybe integer support. And you can emulate 24-bit integers with 32-bit floating point values very well (for example, I have been doing image compression / bit packing in pixel shaders using floats as integers on consoles, and the performance is good).
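As a rough illustration of the floats-as-integers trick (my own sketch, not actual console code): two 12-bit values can be packed into a single fp32 channel using nothing but arithmetic that a SM3.0 pixel shader supports, because every integer up to 2^24 is exactly representable in a 32-bit float.

```hlsl
// hi and lo are assumed to be whole numbers in [0, 4095] (12 bits each).
float PackTwo12Bit(float hi, float lo)
{
    // 4095 * 4096 + 4095 = 2^24 - 1, which still fits exactly
    // in the 24-bit mantissa of a 32-bit float.
    return hi * 4096.0 + lo;
}

void UnpackTwo12Bit(float packedValue, out float hi, out float lo)
{
    hi = floor(packedValue / 4096.0);   // division by a power of two is exact
    lo = packedValue - hi * 4096.0;
}
```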
CS4_0/4_1, however, lacks many very important compute features (I picked the most important):
- A thread can only access its own region in groupshared memory for writing
- SV_GroupIndex or SV_DispatchThreadID must be used when accessing groupshared memory for writing
- A single thread is limited to a 256-byte region of groupshared memory for writing
- Only one unordered-access view can be bound to the shader
- No atomic instructions are available
Without atomics and/or scatter to groupshared memory, many algorithms are impossible (or very difficult/inefficient) to write. Groupshared memory is the most important feature that differentiates compute shaders from pixel shaders, and unfortunately groupshared memory usage is very much crippled in CS4_0/4_1.
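To show what those restrictions rule out, here is a minimal cs_5_0 sketch (hypothetical resource names) of a per-group histogram. It relies on data-dependent scattered writes to groupshared memory plus atomics, so it cannot be expressed as a CS4_0/4_1 shader at all.

```hlsl
Texture2D<float>         InputImage   : register(t0);
RWStructuredBuffer<uint> HistogramOut : register(u0);   // 256 bins, assumed cleared by the app

groupshared uint gs_histogram[256];

[numthreads(16, 16, 1)]
void HistogramCS(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    gs_histogram[gi] = 0;                     // 256 threads clear the 256 local bins
    GroupMemoryBarrierWithGroupSync();

    // Data-dependent write address plus an atomic: exactly the pattern
    // that the CS4_x groupshared write restrictions make impossible.
    uint bin = (uint)(saturate(InputImage[dtid.xy]) * 255.0);
    InterlockedAdd(gs_histogram[bin], 1);
    GroupMemoryBarrierWithGroupSync();

    // Merge the group-local histogram into the global one.
    InterlockedAdd(HistogramOut[gi], gs_histogram[gi]);
}
```

Under CS4_x you would typically have to split this into multiple passes or keep per-thread private copies instead, which tends to eat the performance advantage you were after in the first place.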
Also, none of the GPUs that support CS4_0/4_1 have generic read/write caches, or the other newer features (parallel kernel execution, context switching, etc.) that make compute shaders so much more usable. DX10 chips might technically be "compute shader capable", but that doesn't mean they come anywhere near the efficiency, flexibility and feature set of GCN/Fermi/Kepler. Personally I have never counted DX10 cards as "compute capable", and I would likely run these cards using a traditional pixel shader code path instead (grouping DX10 cards together with DX9 cards).