So Jawed, what am I missing here?
OK, quick post as I'm about to go to bed...
So now toss in atomic operations happening in the unit doing the coalescing for the given interface to DRAM.
1. Each atomic operation is composed of a {read,op,write} sequence.
2. Note that the result of the atomic operation is the value from the first read, so this can immediately be returned with the same latency as if it were just a read.
3. After this read, a parallel unit in the interface (I'm going to refer to it as the atomic unit) could service the atomic operations on the segment of DRAM that was read, then later re-issue the write back to DRAM. Intermediate results could be cached (right at the interface) so that repeated operations to the same segments don't eat repeated read/write cycles on the same segments of DRAM. This in-interface caching would make (2) above fast even in the case of repeated address collisions.
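To make (3) concrete, here's a rough host-side model of what I imagine the atomic unit doing. The single-segment cache, the traffic counters and all the names are my own invention, nothing documented:

// Hypothetical host-side model of an atomic unit at the DRAM interface.
// The segment size is the 32-byte CUDA minimum; the one-entry cache and
// deferred write back are assumptions about how such a unit might behave.
#include <cstdint>
#include <cstdio>
#include <cstring>

const int SEG_BYTES = 32;                       // minimum CUDA segment size
const int SEG_WORDS = SEG_BYTES / 4;

struct AtomicUnit {
    uint32_t* dram;                             // simulated DRAM (word addressed)
    long      cached_seg = -1;                  // which segment is held at the interface
    uint32_t  cache[SEG_WORDS];
    int       dram_reads = 0, dram_writes = 0;  // traffic counters

    // {read, op, write} for a 32-bit atomicAdd: the old value comes from the
    // first read, the write back is deferred while the segment stays cached.
    uint32_t atomic_add(long word_addr, uint32_t operand) {
        long seg = word_addr / SEG_WORDS;
        if (seg != cached_seg) {
            flush();                                            // write back old segment
            memcpy(cache, dram + seg * SEG_WORDS, SEG_BYTES);   // the "read"
            ++dram_reads;
            cached_seg = seg;
        }
        uint32_t old = cache[word_addr % SEG_WORDS];            // returned immediately
        cache[word_addr % SEG_WORDS] = old + operand;           // the "op"
        return old;
    }

    void flush() {                                              // the deferred "write"
        if (cached_seg >= 0) {
            memcpy(dram + cached_seg * SEG_WORDS, cache, SEG_BYTES);
            ++dram_writes;
        }
    }
};

int main() {
    uint32_t dram[64] = {};
    AtomicUnit u{dram};
    for (int i = 0; i < 1000; ++i) u.atomic_add(5, 1);          // 1000 colliding atomics
    u.flush();
    printf("value=%u reads=%d writes=%d\n", dram[5], u.dram_reads, u.dram_writes);
}

Run on 1000 colliding atomicAdds it reports one DRAM read and one DRAM write, which is exactly the blending-with-in-cache-data behaviour I'm likening it to.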
I think it's safe to liken this to blending in the ROPs as each fragment arrives. If multiple fragments for a given pixel arrive within a short time span, then the data is likely in-cache.
What about the worst case scenario: all multiprocessors repeatedly issuing atomic operations to the same address?
I think my idea of the worst case is worse, and I think it's why we're somewhat divergent: I'm thinking of all threads in a grid atomically updating an address and then all threads using the final value for immediately-dependent computation. It's actually stupid, and logically equivalent to issuing two kernels one after the other.
On that basis my worst case is a straw man.
The only mitigation might be that multiple atomic variables are all independently updated by "random" threads from the grid. Even then, trying to justify running this as one kernel rather than two successive kernels seems difficult.
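Just to spell that out with a made-up example (my own strawman code, nothing to do with the paper): CUDA on this hardware has no grid-wide sync inside a kernel, so the "use the final value" half has to be a second launch regardless, which is why the one-kernel formulation buys nothing.

// Hypothetical illustration of that worst case: every thread in the grid
// hammers one counter, then wants the grid-wide total for dependent work.
__global__ void accumulate(unsigned int* total, const unsigned int* input, int n)
{
    // assumes *total was zeroed before launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(total, input[i]);
    // No way to wait here for the value contributed by *other* blocks,
    // so the dependent computation has to go in a separate launch anyway...
}

__global__ void use_total(const unsigned int* total, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = (float)i / (float)(*total);    // stand-in dependent computation
}

// ...which is exactly the two-successive-kernels formulation:
//   accumulate<<<blocks, threads>>>(d_total, d_input, n);
//   use_total <<<blocks, threads>>>(d_total, d_out, n);
// (and a tree reduction would beat the single hammered atomic for the sum anyway).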
1. There are two cases here:
1a. One thread of the SIMD group is requesting the atomic operation (this would be the common case, for example writing to a global queue: a shared-memory scan/scatter to group the block's items, then one thread does a global atomicAdd to get a global position for a fully coalesced non-atomic global scatter of the group into the queue; see the sketch after this list). Performance in this case would be similar to issuing a non-atomic operation, with the addition of the possible latency of the atomic unit updating the read DRAM segment, and the bandwidth cost of the eventual write back (which could in fact happen on every operation in the worst case).
The real cost here is similar to a single non-coalesced 32-bit read, i.e. having to read the minimum-sized segment (which in CUDA is 32 bytes). However, this runs in parallel with the regular SIMD ALUs, so assuming you don't saturate the memory bus, in theory all is well.
1b. Second case: each thread of the SIMD unit is requesting an individual global atomic operation. In the worst case all these addresses collide; I cannot think of any case where a program would do that. The more likely cases are that these atomic operations are semi-random (in which case you end up with the same performance as case 1a), or have some locality and partially collide (in which case perhaps you get more latency at the atomic unit), or perhaps all hit separate coalesceable addresses (say if you had 32 separate global queue pointers, one for each thread of a warp; I don't have any usage cases for that one yet).
The immediately obvious issue here is that without extremely high locality of the addresses, per-thread atomics will swamp the atomic units, i.e. 16 atomic requests per core clock from each of 30 multiprocessors in GTX280 being serviced by 32 atomic units (GT200 can blend at 32 pixels per clock, can't it?). Worst case, the "blend rate" of the atomics is 1/32 of full speed.
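Here's the sort of thing I mean by case 1a (my guess at the usual pattern; the names, the 256-thread block size and the shared-memory atomicAdd standing in for a proper scan are all assumptions):

// Block-wide queue append: compact into shared memory, ONE global atomicAdd
// per block to reserve space, then a coalesced non-atomic scatter.
__global__ void append_to_queue(const int* input, int n,
                                int* queue, unsigned int* queue_count)
{
    __shared__ int s_items[256];          // assumes blockDim.x == 256
    __shared__ int s_count;
    __shared__ unsigned int s_base;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0) s_count = 0;
    __syncthreads();

    // Keep only "interesting" items (here: positive values), compacting into
    // shared memory. The shared-memory atomicAdd stands in for the scan/scatter
    // step; a real prefix sum would be cheaper.
    if (i < n && input[i] > 0) {
        int pos = atomicAdd(&s_count, 1);
        s_items[pos] = input[i];
    }
    __syncthreads();

    // One global atomic per block, not per thread.
    if (threadIdx.x == 0) s_base = atomicAdd(queue_count, (unsigned int)s_count);
    __syncthreads();

    // Fully coalesced, non-atomic scatter of the block's items into the queue.
    if ((int)threadIdx.x < s_count) queue[s_base + threadIdx.x] = s_items[threadIdx.x];
}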
Thinking about graphics, this need for address locality is the primary reason that rasterisation is highly localised via space-filling curves.
So, ahem, it might be best to model the cost of atomic operations in terms of their distance from the highly regularised form of blending in rasterised graphics, thinking in terms of:
- address radii per warp - page divergence
- count of overlapping radii by distinct warps within a multiprocessor - intra-multiprocessor serialisation factor
- count of overlapping radii by distinct multiprocessors (if any) - inter-multiprocessor serialisation factor
- available cycles to hide atomics, per kernel invocation or twixt fences - ALU:atomic
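To put a crude number on the first of those, something along these lines (an entirely ad-hoc metric of my own, just to make the list concrete):

// Back-of-envelope "page divergence" for one warp: how many distinct 32-byte
// segments the warp's 32 atomic addresses touch. 1 is the blending-like best
// case, 32 the fully scattered worst case.
#include <cstdint>
#include <cstdio>
#include <set>

int page_divergence(const uint64_t addr[32])
{
    std::set<uint64_t> segments;
    for (int t = 0; t < 32; ++t)
        segments.insert(addr[t] / 32);        // 32-byte segment index
    return (int)segments.size();
}

int main()
{
    uint64_t localised[32], scattered[32];
    for (int t = 0; t < 32; ++t) {
        localised[t] = 0x1000 + 4 * (t % 8);  // all within one segment
        scattered[t] = 0x1000 + 4096 * t;     // every thread on its own page
    }
    printf("localised: %d  scattered: %d\n",
           page_divergence(localised), page_divergence(scattered));
    // prints "localised: 1  scattered: 32"
}

The 32-byte divisor is just the minimum segment size mentioned earlier; the same counting done across warps within a multiprocessor, or across multiprocessors, would give something like the two serialisation factors.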
In graphics all of these are sane.
I still haven't understood the antiradiance algorithm in that paper well enough to get an idea of a practical instance of these factors; e.g. I'm unclear on the mapping of links to bins, specifically whether all bins in the destination are updated or only a selection.
Also, the algorithm appears to have no need for an atomic, as a non-atomic version runs. But I don't understand how the non-atomic version works. Does it use considerably more memory? Is the atomic version required for more complex scenes? The talk of a breadth-first version at the end of the paper implies that memory wouldn't be problematic, I guess.
Jawed