On the register allocation topic, the CUDA docs talk about using a multiple of 64 threads per block to avoid register read-after-write dependencies and register bank conflicts. I wonder if the fat vs. thin allocation might have something to do with using registers for texture fetch parameter storage?
You mean because TMUs access registers much more slowly? Presumably the 2 quad-TMUs in a GT200 cluster work on the same warp's texturing in lock-step. But maybe not?
8 texture coordinates is 16 scalars, which is one fetch from all 16 banks, say. But if the minimum burst size is 4, a single fetch would actually deliver 64 scalars, i.e. coordinates for 32 texture results - one warp's worth. So 2 burst fetches are required for a pair of warps that are textured together.
Been thinking a little more about the dynamic warp formation idea, and it seems as if this would break the implicit warp-level synchronization that you have with CUDA.
Between synchronisation points, yes - but that's exactly what you want. Once the warp has re-synchronised at the end of the clause that produced divergence, the warp continues execution as normal.
It would also be a problem with the warp vote instruction.
When DWF is operational you have a problem of maintaining what are now "scattered predicates", e.g. threads A1, A4, A17, A31 (accompanied by B..., C... etc. threads) are in a loop while the rest of the warp is sleeping, and on this iteration A17 and C24 go to sleep. So, during divergence __any is always true and __all is always false (otherwise there'd be no divergence).
I suppose what you're alluding to are predicates that aren't linked to overall clause control flow, e.g. breaking out of a loop when a value falls below a threshold for all threads.
Not sure if simple nesting of predicates takes care of this. Masking for the currently active threads may do the job. I can't remember how Fung's proposal for DWF handles this, to be honest.
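To make that break-out-of-a-loop case concrete, here's a rough sketch (relax(), THRESHOLD and the kernel name are all mine, and __all needs compute 1.2): the warp only leaves the loop once every thread agrees, which is exactly the kind of whole-warp vote that DWF's re-grouping would muddle.

[code]
#define THRESHOLD 1.0e-3f                 // made-up convergence bound

__device__ float relax(float x)           // stand-in for the real per-thread work
{
    return 0.5f * x;
}

__global__ void relaxKernel(float *data)  // assumes gridDim.x * blockDim.x elements
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = data[i];

    // Warp vote: loop until *every* thread in the warp is under the
    // threshold, so the whole warp breaks out together.
    while (!__all(x < THRESHOLD))
    {
        if (x >= THRESHOLD)
            x = relax(x);                 // only the unconverged threads do work
    }

    data[i] = x;
}
[/code]

If DWF has quietly re-packed the unconverged threads into a different hardware warp, that __all is no longer voting over the 32 threads the programmer thinks it is.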
Jawed, do you have any ideas on how atomic operations are implemented? Seems to me that shared-memory atomic operations enforce that the required instructions are scheduled together as a packet. Really wondering how global atomics are implemented...
Hmm, I've not really spent any time on these, so this is my first impression.
It looks to me as if something like a loop with a rotating (e.g. shift-left) predicate mask across all the warps in a block is used to enforce serialisation of the instructions that comprise the atomic "macro". I guess there must be something in the instruction issuer that forces the entire macro to issue in sequence. There could be a cache in the issuer (operand collector) to hold the result of each atomic operation, in order to avoid the latency of literally writing to memory.
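The nearest software analogy I can think of is the tag trick from the SDK histogram sample (written for hardware without shared-memory atomics), where the serialisation loop is explicit. Not claiming the hardware macro looks like this - it's reproduced roughly from memory, names approximate - just that the shape is similar:

[code]
// Each thread stamps its write with a per-lane tag and retries until its
// own write is the one that survived - intra-warp serialisation in software.
__device__ void addByte(volatile unsigned int *s_WarpHist,
                        unsigned int bin, unsigned int threadTag)
{
    unsigned int count;
    do
    {
        count = s_WarpHist[bin] & 0x07FFFFFFU;   // strip the previous tag
        count = threadTag | (count + 1);         // increment and re-tag
        s_WarpHist[bin] = count;
    } while (s_WarpHist[bin] != count);          // lost the race? go round again
}

// threadTag would be something like (threadIdx.x & 31) << 27
[/code]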
Presumably for global atomics the issuers have a communication network dedicated to sequencing amongst themselves. But of course there's no ordering, so the sequencing should be fairly trivial, i.e. entirely co-operative rather than pre-emptive.
Presumably, with a bit of luck, something like the code on page 109 of the 2.2 Guide (I have the beta version), which only does atomicInc for thread == 0 of a block, ends up with the two predicates ANDed to produce just one loop iteration (single packet issue).
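From memory the pattern on that page is roughly the one below (names and the reduction details are mine) - only thread 0 of each block touches the counter, so the thread == 0 predicate and the atomic's own serialisation could indeed collapse into a single packet per block:

[code]
__device__ unsigned int blocksDone = 0;      // global counter, one tick per block

__global__ void finishReduction(float *partialSums, float *result)
{
    __shared__ bool amLast;

    // (this block's partial sum is assumed to already be in partialSums[blockIdx.x])

    if (threadIdx.x == 0)
    {
        __threadfence();                     // make our partial sum visible first
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        amLast = (ticket == gridDim.x - 1);  // last block to arrive finishes the job
    }
    __syncthreads();

    if (amLast && threadIdx.x == 0)
    {
        float total = 0.0f;
        for (unsigned int i = 0; i < gridDim.x; ++i)
            total += partialSums[i];
        *result = total;
        blocksDone = 0;                      // reset for the next launch
    }
}
[/code]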
Is there any analysis of atomic instruction performance out there?
Jawed