TimothyFarrar
Regular
If collisions can't occur due to addressing, then there's no need to use atomic operations — the obvious example is per-thread registers, which are entirely private.
I'd like to assume that shared-memory atomics are implemented with dedicated hardware instructions. If that is the case, it would seem you are right that nothing special happens at the shared-memory level. Otherwise things would get rather messy: you'd have to serialize groups of instructions on address "collisions". I'm not sure about the hardware complexity trade-off between these two options.
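To make the "serialize on collisions" alternative concrete, here's a toy Python cost model (entirely my own sketch, not how any GPU actually implements it): if colliding atomics on the same shared-memory address must serialize, a warp's atomic takes as many passes as the worst collision count on any single address.

```python
from collections import Counter

def atomic_cost_in_passes(addresses):
    """Toy model: a warp-wide atomic takes one pass per colliding
    access on the most-contended address."""
    return max(Counter(addresses).values())

# 32 threads, all distinct addresses: one pass, no serialization.
print(atomic_cost_in_passes(range(32)))  # 1
# 32 threads all hitting the same address: fully serialized.
print(atomic_cost_in_passes([0] * 32))   # 32
```

So the cost of that scheme would range from free (no collisions) to a full warp-width serialization, which is why dedicated hardware instructions would be the cleaner option.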
The paper you linked seems to use either 64 or 256 bins per patch, each of which is writable by any number of threads (source bins), with one source bin per source patch first claimed by an atomicMin. So the number of collisions here is variable, and the family of atomic variables is in fact huge.
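My reading of that atomicMin claim step, sketched sequentially in Python (the single-bin layout and the INT_MAX sentinel are my assumptions for illustration, not details from the paper):

```python
INT_MAX = 2**31 - 1

def atomic_min(mem, addr, val):
    """Sequential stand-in for CUDA's atomicMin: stores min(old, val)
    and returns the old value."""
    old = mem[addr]
    mem[addr] = min(old, val)
    return old

# Each contending thread tries to claim the patch's source bin with
# its own thread id; the thread holding the minimum id wins.
bins = [INT_MAX]  # one source bin, initialized to the sentinel
for tid in [7, 3, 12]:
    atomic_min(bins, 0, tid)
print(bins[0])  # 3 -> the lowest contending thread id owns the bin
```

The winner can then check whether its id survived in the bin and, if so, do the per-patch work exactly once.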
Yeah, that paper presents nearly the worst case I can see for atomic operations: a huge number of global atomics. On GT200, with its 32-byte minimum transfer size, each global atomic on a single 32-bit integer should cost 64 bytes of global memory traffic (load, atomic op, store).
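Spelling out that arithmetic (32-byte minimum transaction is the GT200 figure from above; the read-modify-write costs one minimum-size load plus one minimum-size store):

```python
MIN_TRANSFER = 32  # bytes, GT200 minimum global memory transaction

def global_atomic_traffic(n_atomics):
    # Each atomic is a read-modify-write: one 32-byte load plus one
    # 32-byte store, even though the integer itself is only 4 bytes.
    return n_atomics * 2 * MIN_TRANSFER

print(global_atomic_traffic(1))      # 64 bytes for one 32-bit atomic
print(global_atomic_traffic(10**6))  # 64000000 bytes for a million of them
```

At 16x the payload size per operation, it's easy to see how a kernel dominated by global atomics becomes bandwidth-bound.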
The more I look at this, the more I think fast atomic operations are in fact more important than any type of dynamic warp formation (DWF), so I'm changing my prediction about DWF. DWF for bank-conflict avoidance no longer seems worth it when you consider that you can simply load data into shared memory at a bank offset based on thread index, which completely avoids bank conflicts. So about the only thing DWF buys you is better branch performance, but divergent branching messes up everything required for data locality and tightly ordered synchronization. Which leads me to wonder what that "cGPU" buzzword actually means. Perhaps it is just Multiple Kernel SIMD (MK-SIMD) — better cross-core load balancing — combined with more shared caching for atomics/ROP?
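The bank-offset trick mentioned above, modeled in Python (32 banks of 4-byte words, as on GT200; skewing each row by its thread index is one standard choice of offset, my pick for the example):

```python
NUM_BANKS = 32

def bank(word_index):
    # Successive 4-byte words map to successive banks, wrapping at 32.
    return word_index % NUM_BANKS

# 32 threads each reading column c of a 32x32-word tile in shared memory.
c = 0
naive  = [bank(t * NUM_BANKS + c) for t in range(32)]
skewed = [bank(t * NUM_BANKS + (c + t) % NUM_BANKS) for t in range(32)]

print(len(set(naive)))   # 1  -> all 32 threads hit one bank (32-way conflict)
print(len(set(skewed)))  # 32 -> skewed addressing spreads across all banks
```

Since the skew is a pure addressing change done at load time, it costs nothing at access time — which is why DWF-for-bank-conflicts looks like hardware solving a problem software already solves.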