They probably simply have no hardware for collision detection/resolution ... personally I'd say just make it undefined since it's a shame to leave the full crossbar they need anyway for the reads unused for the writes, but meh.
CUDA makes this undefined and writes to UAVs in D3D are undefined, so it's only really CS4.x that's affected. In Brook+ they might just be enforcing a "reasonableness", i.e. it's not reasonable in a high level abstraction to allow a programmer to "accidentally destroy data".
I wonder if the restriction in CS4.x is reflecting other restrictions in the hardware. CS4.x allows up to 768 threads in a thread group, all of which can share data with each other. But in CUDA only 512 threads can be in a block. I'm wondering if the inability of all 768 threads to write to an arbitrary shared memory address on NVidia also contributed to this limitation. But I'm not sure if that restriction on shared memory addressing actually exists. It doesn't make sense to me: isn't shared memory just a flat address space per multiprocessor?
PS. you can always implement a ring bus in software for cross thread communication
Yes it's very robust, and variations of this sort-of fit the "stream" paradigm. I dare say the general concept of streams amongst threads in a group is a useful, scalable abstraction (with the proviso that a group of threads is capped in size). Well, it appears scalable, though the compiler's first implementation has a pessimistic view, treating every read as a waterfalling read, which makes me think there could be a hardware restriction there that's about to be lifted with the next GPUs.
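To make the ring-bus idea concrete, here's a rough CPU-side sketch in C++ of how it could work under a private-write/shared-read rule. It's only a simulation, not shader or CUDA code: the inner loops stand in for the threads of a group, the gap between the read loop and the write loop stands in for a group barrier, and the names (`ring_all_sum`, `slot`, `reg`, `acc`) are made up for illustration. Each "thread" only ever writes its own slot and reads its left neighbour's, yet after n-1 hops every thread has seen every value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of a software ring bus within one thread group.
// Assumption: thread i may write only slot[i] but may read any slot.
// Returns, for each thread, the sum of all inputs it has seen pass by.
std::vector<int> ring_all_sum(const std::vector<int>& input) {
    const std::size_t n = input.size();
    std::vector<int> slot = input;  // shared memory: slot[i] is owned by thread i
    std::vector<int> acc = input;   // private accumulator per thread
    std::vector<int> reg(n);        // private register per thread

    for (std::size_t step = 1; step < n; ++step) {
        // Read phase: every thread reads its left neighbour's slot.
        for (std::size_t i = 0; i < n; ++i) {
            reg[i] = slot[(i + n - 1) % n];
        }
        // (a group barrier would sit here on real hardware)
        // Write phase: every thread forwards the value through its own slot only.
        for (std::size_t i = 0; i < n; ++i) {
            slot[i] = reg[i];
            acc[i] += reg[i];
        }
        // (and a second barrier here, before the next read phase)
    }
    return acc;
}
```

With four threads holding 1, 2, 3, 4, every thread ends up with 10 after three hops. The cost is obvious: n-1 barrier-separated steps to move data that a write-anywhere model could scatter in one.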
But some things just want a big fat blob of "fast" memory shared by all threads. Though now I think about it, what algorithms can't use the private-write/shared-read model? 256 bytes of writable memory per thread is quite a lot, though the performance on ATI (only 1 wavefront) would be pretty miserable, and the scheduling on NVidia effectively means that 64 threads function as "1 warp", so the performance there will be bad too.
I guess scan prefers write anywhere, and that's pretty fundamental.
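For what it's worth, here's a hedged sketch of the trade-off, again as a CPU simulation in C++ with made-up names (`inclusive_scan_private_write`, `shared`, `reg`). The step-doubling (Hillis-Steele style) form of scan can be written so that each thread writes only its own element, but it does O(n log n) additions. The work-efficient tree form is the one where threads naturally write to scattered indices, which is where write-anywhere is the better fit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU simulation of an inclusive scan under a private-write/shared-read rule.
// Assumption: thread i writes only shared[i] but may read any element.
// This is the step-doubling variant, not the work-efficient tree variant.
std::vector<int> inclusive_scan_private_write(const std::vector<int>& input) {
    const std::size_t n = input.size();
    std::vector<int> shared = input;  // shared memory: shared[i] owned by thread i
    std::vector<int> reg(n);          // private register per thread

    for (std::size_t offset = 1; offset < n; offset *= 2) {
        // Read phase: thread i reads its own element and the one `offset` to the left.
        for (std::size_t i = 0; i < n; ++i) {
            reg[i] = shared[i] + (i >= offset ? shared[i - offset] : 0);
        }
        // (group barrier on real hardware)
        // Write phase: thread i writes back to its own element only.
        for (std::size_t i = 0; i < n; ++i) {
            shared[i] = reg[i];
        }
        // (second barrier before the next pass)
    }
    return shared;
}
```

For the input 3, 1, 4, 1, 5 this gives 3, 4, 8, 9, 14 after three passes.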
Jawed