Well, maybe it wouldn't be hard to implement, but it's certainly not trivial unless I'm missing something. Furthermore, if it were easy, I doubt they would have had the restriction in the first place.
Consider a kernel something like:
Code:
groupshared float x; // assume x starts at 0

void main(uint threadID)
{
    if (threadID == 0) {        // Consumer: spin until the producer writes x
        while (x == 0) {}
    } else if (threadID == 1) { // Producer
        x = 1;
    }
}
Now, as mapped to SIMD lanes, this doesn't necessarily work properly. If the vectorized code decides to predicate the threadID branch and evaluate the consumer block first, it will never get out of the resulting while(), and hence never reach the producer block... deadlock - ouch.
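To make the failure concrete, here's a small CPU-side C simulation of that mapping (the structure and names are mine, not any real API): both branches are scalarized under predicate masks, the consumer block is scheduled first, and all lanes step in lockstep, so the producer store can never retire.
Code:
#include <stdbool.h>
#include <stdio.h>

#define LANES 2

int main(void) {
    float x = 0;  /* stands in for the groupshared variable */
    bool consumer_mask[LANES], producer_mask[LANES];
    for (int t = 0; t < LANES; ++t) {
        consumer_mask[t] = (t == 0);
        producer_mask[t] = (t == 1);
    }

    /* Consumer block, scheduled first under predication. */
    int steps = 0;
    bool spinning = true;
    while (spinning && steps < 1000000) { /* bounded so the demo halts */
        spinning = false;
        for (int t = 0; t < LANES; ++t)
            if (consumer_mask[t] && x == 0) spinning = true;
        ++steps;
    }
    if (spinning)
        printf("deadlock: consumer lane still spinning after %d steps\n", steps);

    /* Producer block: the real lockstep mapping never reaches this;
       we only do because the spin above was artificially bounded. */
    for (int t = 0; t < LANES; ++t)
        if (producer_mask[t]) x = 1;
    return 0;
}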
To avoid this, you basically have to know to run that control block independently, and not predicated/SIMD in this case. You can't know this statically in the general case, so you need the hardware (or software, if targeting a SIMD ISA) to dynamically evaluate arbitrarily different control flow paths generated by predication across the warp/SIMD lanes. Fully generally, this involves packing and unpacking masked warp lanes on the fly.
Always splitting into separate warps on predication/control flow would solve the deadlock problem, but it would be pretty inefficient: whenever a single lane diverges, it would never converge again, even for simple, reducible control flow. Thus for a "proper" implementation, you also need to be able to detect re-convergence and pack the separate warps back into a single one.
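For contrast, here's a toy C model of the usual per-warp reconvergence stack (the layout and names are my own sketch, not any particular GPU's): a divergent branch pushes the full mask with the post-dominator PC, then the else-side mask, and popping the last entry re-merges the lanes. Splitting into separate warps with no equivalent mechanism is exactly what loses that final re-merge.
Code:
#include <stdint.h>
#include <stdio.h>

typedef struct { uint32_t mask; int pc; } Entry;

int main(void) {
    uint32_t full  = 0xFu; /* 4-lane warp, all lanes live   */
    uint32_t taken = 0x5u; /* lanes 0 and 2 take the branch */

    /* On divergence: push (full mask, reconvergence pc), then the
       else side, and execute the taken side first. */
    Entry stack[2];
    int top = 0;
    stack[top++] = (Entry){ full,          100 }; /* reconvergence point */
    stack[top++] = (Entry){ full & ~taken,  50 }; /* else side           */

    printf("execute taken side, mask=0x%x\n", (unsigned)taken);
    Entry e = stack[--top];
    printf("execute else side,  mask=0x%x, pc=%d\n", (unsigned)e.mask, e.pc);
    e = stack[--top];  /* pop at the post-dominator: lanes re-merge */
    printf("reconverged,        mask=0x%x, pc=%d\n", (unsigned)e.mask, e.pc);
    return 0;
}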
Note also that the sync() operations become ill-defined in this context... there's no longer any guarantee that a given "thread" will hit a given sync() (or any sync() at all), so it's unclear what the barrier should mean... should it apply to all threads that take that control flow path? You end up getting more utility from more traditional shared memory and atomic threading constructs.
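For comparison, the same producer/consumer written against genuinely independent threads with atomics is perfectly well-defined, since preemptive scheduling guarantees the producer eventually runs; a minimal C sketch using pthreads (the function names are mine):
Code:
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x = 0;

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load(&x) == 0) { /* spin until the producer's store lands */ }
    return NULL;
}

static void *producer(void *arg) {
    (void)arg;
    atomic_store(&x, 1);
    return NULL;
}

int main(void) {
    pthread_t c, p;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    puts("done: both threads made forward progress");
    return 0;
}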
That's why I say this problem is fairly equivalent to dynamic warp formation, which appears to be quite difficult to do efficiently: despite its obviously huge benefits, going back to the first time someone ran a ray tracer on a GPU, it has yet to show up in any hardware that I know of, and I'm pretty sure that if Fermi had it, they'd be making more noise about it.