TimothyFarrar
Because you need tens or hundreds of atomics in flight to make the performance bearable.
Seems to me that sparse or frequent atomic usage doesn't matter as long as you can keep the ALUs busy with work (unless you are bandwidth bound). Throughput is what matters here; GPU atomics are only for unordered usage by design. It is just like a highly latent texture or global memory access. I cannot think offhand of cases where we are latency bound on the GPU now (tiny draw calls, CUDA kernels with tiny grids; with proper workload balancing those issues might go away). Maybe when the current GPU scaling starts to level off and on-chip networking becomes a bottleneck, we can return to the latency problem.
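To make that concrete, here's a minimal CUDA sketch (kernel and names are mine, purely hypothetical) of what "keeping the ALUs busy in the shadow of an atomic" looks like: the atomicAdd is issued early and its return value isn't touched until the end, so the independent math below it can keep the SIMD units fed while the atomic is in flight, provided enough warps are resident.

[code]
// Hypothetical kernel: histogram-style atomic plus independent ALU work.
__global__ void atomic_with_cover_work(const unsigned int *keys,
                                       unsigned int *bins,
                                       float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Issue the atomic early; this is the long-latency operation we want
    // to cover. Its return value is deliberately not consumed yet.
    unsigned int old = atomicAdd(&bins[keys[i] & 255u], 1u);

    // Independent ALU work that does not depend on 'old', so the scheduler
    // is free to run it (and other warps) while the atomic is in flight.
    float acc = (float)keys[i];
    #pragma unroll
    for (int k = 0; k < 16; ++k)
        acc = acc * 1.0001f + 0.5f;

    // Only here is the atomic's return value finally consumed.
    out[i] = acc + (float)old;
}
[/code]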
Are these applications even trying to do other work in the shadow of the atomic latency?
In a lot of cases they simply cannot, by hardware design. I attempt to explain below, but keep in mind this is from my in-brain knowledge; I don't have the time to re-look up all the info in detail, so someone else here might need to correct the rough spots.
In the "don't try, but get free work" case, i.e. where you have a hyperthreaded CPU and the atomic operation doesn't do nasty stuff to the shared ALU pipe, you still end up with the many-times-lower number of hyperthreads a CPU has (say 2) vs the variable up-to-32 "hyperthreads" you have with GT200. So you lose the ability to hide the latency with full-throughput ALU work.
The problem with typical PC usage of atomics is the memory barrier (because atomics are commonly used in cases where ordering matters, and the hardware often forces the barrier), which restricts compiler instruction reordering and stalls the CPU. With GCC, a full memory barrier (read+write) is built into the atomic intrinsics in most cases. With MSVC, you have special acquire and release variants which do a read-only or write-only barrier (which matter only on some platforms, like IA-64, etc.).
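Roughly, on the software side it looks like this (a hypothetical shared counter, just to illustrate the barrier semantics; assuming I'm remembering the MSVC acquire/release variants correctly):

[code]
#if defined(_MSC_VER)
#include <windows.h>

volatile LONG counter = 0;    /* hypothetical shared counter */

void bump_full(void)    { InterlockedIncrement(&counter);        /* full barrier */ }
void bump_acquire(void) { InterlockedIncrementAcquire(&counter); /* read barrier;
                             same as the full form on x86, lighter on IA-64      */ }
void bump_release(void) { InterlockedIncrementRelease(&counter); /* write barrier */ }

#else /* GCC and friends */

volatile long counter = 0;    /* hypothetical shared counter */

void bump_full(void)
{
    /* GCC's __sync intrinsics imply a full (read+write) memory barrier. */
    __sync_fetch_and_add(&counter, 1);
}
#endif
[/code]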
The PowerPC (old Macs, consoles, etc.) has special lwarx (load word and reserve indexed) and stwcx. (store word conditional indexed) instructions. These two instructions, in combination with regular instructions, emulate atomic operations. The load-and-reserve marks the cache line as reserved; the CPU tosses that reservation if another thread makes a reservation, or if the line gets dirtied before the store. If the reservation is lost, the conditional store fails and a software retry loop kicks in.
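From memory, the classic retry loop for an atomic add looks something like this (GCC inline asm; constraints and register usage are from my head, so treat it as a sketch, not production code):

[code]
/* Sketch of the lwarx/stwcx. retry loop for an atomic add on PowerPC. */
static inline int ppc_atomic_add(volatile int *p, int v)
{
    int old, tmp;
    __asm__ __volatile__(
        "1: lwarx   %0,0,%3\n"   /* load word and reserve the cache line   */
        "   add     %1,%0,%4\n"  /* compute the new value                  */
        "   stwcx.  %1,0,%3\n"   /* store only if the reservation survived */
        "   bne-    1b\n"        /* reservation lost -> software retry     */
        : "=&r"(old), "=&r"(tmp), "+m"(*p)
        : "r"(p), "r"(v)
        : "cc", "memory");
    return old;                  /* old value, like an Interlocked*() call */
}
[/code]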
Old x86 chips had the LOCK instruction prefix physically lock the bus for a given instruction; from the P6 onward, I think, Intel moved this to something saner and cache coherency handles the problem (I don't have clock-cycle counts in my head for actual and false cache-line contention, however). Locking does, however, serialize all outstanding loads and stores (i.e. a forced memory barrier). x86 does stores in order, but loads can go ahead of stores.
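The rough x86 equivalent, again from memory, would be a LOCK-prefixed XADD: the prefix makes the read-modify-write atomic and also acts as a full fence.

[code]
/* Sketch of an x86 fetch-and-add via LOCK XADD (GCC inline asm, from memory). */
static inline int x86_fetch_add(volatile int *p, int v)
{
    __asm__ __volatile__(
        "lock; xaddl %0, %1"   /* atomic RMW; also serializes loads/stores */
        : "+r"(v), "+m"(*p)
        :
        : "cc", "memory");
    return v;                  /* XADD leaves the old value of *p in the register */
}
[/code]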
Bottom line: CPU-side atomics have nasty side effects which reduce ALU throughput. It could be 20-250 cycles where the CPU isn't doing real work (i.e. it's just doing, say, an Interlocked*() atomic function).
The issue with NVidia's design is that it's write-through with no concept of MRU/LRU, so it generates worst-case latency for everything (this is how it seems, anyway - maybe there are some blending tests out there that show otherwise). It's like having no texture cache at all. The GPU can hide the latency, but it takes a lot more non-dependent instructions to do so than if basic caching were implemented. It's just using the ROPs as they are, i.e. minimal cost.
I'm not convinced that this is a problem or that caches would solve it. ROPs do have coalescing, so effectively a write-combining cache. And it seems as if the ROP/MC (or whatever) is doing the actual ALU work for the atomic operation, and NOT the SIMD units; this is what is important. In the worst case, i.e. address collisions, the normal SIMD ALU units keep right on working, and sure, the ROP/MC atomic ALU work gets serialized. It is likely this serialization and the extra ROP/MC ALU unit are the true reason for the atomic operation latency (the extra ALU unit is throughput bound, which increases latency). Also remember that CUDA atomics return the old value fetched from memory, before the atomic operation happens (so the latency would be about the same as a global fetch, minus the extra writeback causing a reduction in throughput, if not bottlenecked by something else).
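That "returns the old value" behavior is what makes things like stream compaction / queue append work on the GPU. A hypothetical CUDA kernel to illustrate (names are mine): the latency of getting 'slot' back looks like a global fetch, but nothing stops other warps from making progress in the meantime.

[code]
// Hypothetical compaction kernel: atomicAdd hands back the *old* counter
// value, which each surviving thread uses as its unique output slot.
__global__ void compact_positive(const float *in, float *out,
                                 unsigned int *count, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];
    if (x > 0.0f) {
        unsigned int slot = atomicAdd(count, 1u);  // old value = my slot
        out[slot] = x;
    }
}
[/code]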
So really, which would you rather have: a cache plus the CPU doing the atomic operation while the CPU thread stalls, or issuing the atomic operation and having the memory controller or ROP do the work while the processor keeps on going (this is what I like)?
EDIT: sure would be funny if the normal SIMD ALUs did do the atomic operations... clearly I'm making a lot of assumptions here!