Somewhat OT, but regarding G70 - each group of 4 shader units works on a single quad each cycle, correct?
One way to improve dynamic branching performance might be to scan a batch for "active quads" (those containing at least 1 pixel that executes the current instruction) ahead of issue time, and at a rate greater than the issue rate (which, from what I understand, is 1 quad per cycle per 4 shader units), so that quads with no active pixels could be skipped instead of issued.
To cope with memory latency, you'd still want to be able to handle multiple batches, but maybe one could get away with a much smaller number of threads (say 4 batches of 64 pixels per group of 4 shader units, or 2 batches of 128 pixels) and still get quite a decent speedup...
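To make the idea concrete, here is a rough Python sketch of the scan-ahead scheme. It is only a toy model, not a description of how G70 actually works: the names `active_quads`, `issue_cycles` and `scan_rate` are made up for illustration, and the cycle estimate assumes the scanner is perfectly pipelined with issue and ignores memory latency entirely.

```python
def active_quads(mask, quad_size=4):
    """Return the indices of quads that contain at least one active pixel.

    mask: list of bools, one per pixel, laid out quad after quad
    (an assumed layout, chosen only for this sketch).
    """
    n_quads = -(-len(mask) // quad_size)  # ceiling division
    return [q for q in range(n_quads)
            if any(mask[q * quad_size:(q + 1) * quad_size])]


def issue_cycles(mask, quad_size=4, scan_rate=None):
    """Estimate cycles to issue one instruction for a whole batch.

    scan_rate=None models no scanning: every quad is issued, 1 per cycle.
    Otherwise the (hypothetical) scanner examines scan_rate quads per
    cycle and only active quads are issued, so the batch is bounded by
    whichever is slower: scanning all quads or issuing the active ones.
    """
    n_quads = -(-len(mask) // quad_size)
    if scan_rate is None:
        return n_quads
    scan_time = -(-n_quads // scan_rate)
    return max(len(active_quads(mask, quad_size)), scan_time)


# A 64-pixel batch (16 quads) where only two quads take the branch.
mask = [False] * 64
mask[1] = True    # pixel 1 lies in quad 0
mask[22] = True   # pixel 22 lies in quad 5

print(active_quads(mask))               # [0, 5]
print(issue_cycles(mask))               # 16
print(issue_cycles(mask, scan_rate=4))  # 4
```

In this model the scan rate sets the floor: with a scanner that looks at 4 quads per cycle, a mostly inactive 16-quad batch costs 4 cycles instead of 16, and scanning at the issue rate (`scan_rate=1`) gains nothing.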