I'd like to see DirectCompute reach the same level of programmability as CUDA. Function pointers and dynamic parallelism FTW! Also, HLSL's virtual ISA should be changed to scalar, since vec4 execution has pretty much been dropped in modern hardware. A vec4 virtual register file causes problems with alignment restrictions, and in some cases it is impossible for the compiler to identify unused lanes in a computation, so the GPU has to do extra work to process them. For example:
float4 result = input[a] + input[b];
output[c] = result; // only result.xyz is ever actually used, but the compiler can't tell what might read the output array in the future, so it has to compute all of result.xyzw
Finally, we need a good way to do a deep copy of complicated structures to the GPU memory space.
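Until something like that exists, one common workaround is to flatten the structure on the host into a single buffer and replace pointers with byte offsets, so the whole thing can go up in one copy. Here is a minimal C++ sketch of that idea. The `Mesh` struct, `FlatMeshHeader` layout, and the helper names are all made up for illustration; they are not part of any DirectCompute or CUDA API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical host-side structure that owns heap memory. Copying its raw
// bytes to the GPU would only carry host pointers, which are useless there.
struct Mesh {
    std::vector<float> vertices;
    std::vector<std::uint32_t> indices;
};

// Hypothetical flat header: byte offsets from the start of the buffer
// stand in for pointers, so the buffer stays valid wherever it is copied.
struct FlatMeshHeader {
    std::uint32_t vertexCount;
    std::uint32_t indexCount;
    std::uint32_t vertexOffset;
    std::uint32_t indexOffset;
};

// Pack header + vertices + indices into one contiguous byte buffer.
std::vector<std::uint8_t> flattenMesh(const Mesh& m) {
    FlatMeshHeader h;
    h.vertexCount  = static_cast<std::uint32_t>(m.vertices.size());
    h.indexCount   = static_cast<std::uint32_t>(m.indices.size());
    h.vertexOffset = static_cast<std::uint32_t>(sizeof(FlatMeshHeader));
    h.indexOffset  = h.vertexOffset
                   + h.vertexCount * static_cast<std::uint32_t>(sizeof(float));

    std::vector<std::uint8_t> buf(
        h.indexOffset + h.indexCount * sizeof(std::uint32_t));
    std::memcpy(buf.data(), &h, sizeof h);
    if (h.vertexCount != 0)
        std::memcpy(buf.data() + h.vertexOffset, m.vertices.data(),
                    h.vertexCount * sizeof(float));
    if (h.indexCount != 0)
        std::memcpy(buf.data() + h.indexOffset, m.indices.data(),
                    h.indexCount * sizeof(std::uint32_t));
    return buf;
}

// Read the header back out of the flat buffer.
FlatMeshHeader readHeader(const std::vector<std::uint8_t>& buf) {
    FlatMeshHeader h;
    std::memcpy(&h, buf.data(), sizeof h);
    return h;
}

// Resolve an offset the same way a shader or kernel would: base + offset.
float readVertex(const std::vector<std::uint8_t>& buf, std::uint32_t i) {
    const FlatMeshHeader h = readHeader(buf);
    assert(i < h.vertexCount);
    float v;
    std::memcpy(&v, buf.data() + h.vertexOffset + i * sizeof(float), sizeof v);
    return v;
}

std::uint32_t readIndex(const std::vector<std::uint8_t>& buf, std::uint32_t i) {
    const FlatMeshHeader h = readHeader(buf);
    assert(i < h.indexCount);
    std::uint32_t v;
    std::memcpy(&v, buf.data() + h.indexOffset + i * sizeof(std::uint32_t),
                sizeof v);
    return v;
}
```

The result of `flattenMesh` could then be uploaded with a single copy into a raw buffer and read on the GPU by adding the offsets to the buffer base. This only sidesteps the problem, of course: every nested structure needs its own hand-written flattening code, which is exactly the tedium a real deep-copy facility would remove.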