Mintmaster
Veteran
It can still be done at the first control flow point reached after the timeout while solving your deadlock problem, because deadlock isn't possible w/o control flow. Some hardware may already have a per-thread program counter (didn't Bob mention this?), as it's not clear to me that such an implementation for control flow is any harder than using a stack of predication masks.

> Interesting proposal, although I'm not sure this is as simple as it sounds either. As you mention in your edit, I think this still breaks down to being able to split up the "threads" in the lanes after some timeout, rather than just always doing it at control flow. In either case it does involve tracking additional program counters and potentially registers (unless you have some indirection map from the new warps back to the old register lanes they came from, I guess) when the split happens.
As for new registers and indirection maps, you lost me, as I don't see why they're necessary. Remember that all that we're doing is changing the order of thread execution. Dealing with deadlock could be as simple as this:
1. Initialize execution mask
2. Set n to the index of the first bit set in the mask
3. Find all threads in the batch with the same instruction pointer as thread n, use these as the predicate mask, and clear those bits from the execution mask
4. Execute the program on these threads until the timeout (or syncthreads, or fetch)
5. Go to step 2 until the mask is clear
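Just to make the idea concrete, here's a rough software model of steps 1-5. Everything here is my own invention for the sketch (`run_warp`, the `step` callback, the warp size, and the timeout value); real hardware would also break out of step 4 on a syncthreads or an instruction fetch, and would handle divergence with predication rather than by kicking threads back into the mask.

```c
/* Sketch of the per-warp scheduling loop described above.
   Each "thread" is just a program counter; step() advances one thread
   by one instruction and returns its new PC, or DONE when it exits. */

#define WARP 4
#define DONE -1

/* step callback: advance thread i from pc, return new pc or DONE */
typedef int (*step_fn)(int i, int pc);

static void run_warp(int pcs[WARP], step_fn step, int timeout)
{
    /* Step 1: initialize the execution mask, one bit per live thread. */
    int exec_mask = 0;
    for (int i = 0; i < WARP; i++)
        if (pcs[i] != DONE) exec_mask |= 1 << i;

    while (exec_mask) {
        /* Step 2: n = index of the first set bit in the mask. */
        int n = 0;
        while (!(exec_mask & (1 << n))) n++;

        /* Step 3: threads sharing thread n's instruction pointer form
           the predicate mask; clear those bits from the exec mask. */
        int pred = 0;
        for (int i = 0; i < WARP; i++)
            if ((exec_mask & (1 << i)) && pcs[i] == pcs[n]) pred |= 1 << i;
        exec_mask &= ~pred;

        /* Step 4: run the selected threads in lockstep until the
           timeout (real HW would also break on syncthreads or fetch). */
        for (int t = 0; t < timeout; t++) {
            int lead = DONE;
            for (int i = 0; i < WARP; i++)
                if ((pred & (1 << i)) && pcs[i] != DONE) { lead = pcs[i]; break; }
            if (lead == DONE) break;        /* all selected threads exited */
            for (int i = 0; i < WARP; i++) {
                if (!(pred & (1 << i)) || pcs[i] == DONE) continue;
                if (pcs[i] != lead) {       /* diverged: back into the mask */
                    pred &= ~(1 << i);
                    exec_mask |= 1 << i;
                } else {
                    pcs[i] = step(i, pcs[i]);
                }
            }
        }

        /* Threads that hit the timeout unfinished go back into the mask,
           and step 5 loops until the mask is clear. */
        for (int i = 0; i < WARP; i++)
            if ((pred & (1 << i)) && pcs[i] != DONE) exec_mask |= 1 << i;
    }
}
```

Note that in this model reconvergence only happens by the accident of equal instruction pointers, which is exactly the weakness I get into next.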
The biggest issue is, as you mentioned, the lack of thread reconvergence. However, if you want to guarantee deadlock freedom (for appropriate code, of course), then I don't know if it's possible to run threads in a way that still lets them reconverge. The control-path-dependent partial-warp syncs that you mentioned can also deadlock if inserted by the compiler: some threads may need to run through a section of code common to all threads before others do in order to break the deadlock, which mandates divergence.
I think the solution has to involve both HW and language changes. HW-only changes could at best heuristically weight that timer I mentioned to encourage reconvergence, and I can't see that working too well. CPUs don't have to worry about reconvergence at all beyond regular syncs, so how can we expect GPUs to efficiently tackle this problem? What you need is subwarp desync/sync pairs in the language (e.g. syncthreads() takes a mask parameter which you initialize with desyncthreads(), with all code between being reducible) that the coder explicitly uses to prevent deadlock arising from compiler-defined static ordering of branch evaluation. This way you only use that timer when absolutely necessary, because otherwise divergence screws up performance.
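To show what I mean, here's a hypothetical sketch of such a desync/sync pair in CUDA-ish code. To be clear, desyncthreads() and the mask argument to syncthreads() don't exist in any real language; the kernel and its producer/consumer pattern are purely illustrative.

```cuda
// Hypothetical syntax -- desyncthreads() and syncthreads(mask) are
// made up here; this only illustrates the proposed pairing.
__global__ void producer_consumer(volatile int *flag, int *data)
{
    // The coder explicitly marks the point where the warp is allowed
    // to diverge, and receives a mask naming this thread's subwarp.
    unsigned mask = desyncthreads();

    if (threadIdx.x == 0) {
        *data = 42;             // producer
        *flag = 1;
    } else {
        while (*flag == 0) {}   // consumers spin; under compiler-defined
                                // static branch ordering this deadlocks,
                                // but the desync lets HW split the warp
    }

    // Reconverge exactly the threads named in the mask; all code
    // between the desync/sync pair must be reducible.
    syncthreads(mask);
}
```

The point of the explicit pair is that the timer only ever has to fire between a desync and its matching sync, so well-behaved code keeps full lockstep performance everywhere else.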
Until we're ready to incorporate something like this into the language, IMO it's not worth it to do anything about this problem.
BTW, what are all the advantages you're thinking of with DWF? And how general of a solution are you thinking of?