Generally, there's no need for branch prediction in a GPU. (Though GPUs and the driver software, together, can make use of hinted information about branches.)
This reflects the fact that a GPU's pipeline is quite different from a CPU's.
In general, a GPU runs one instruction repeatedly, clock after clock, say a thousand times, on (say) 4 or 16 pixels simultaneously. Only after all the pixels in the batch have run that instruction does the GPU move on to the next instruction in the program.
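Purely as an illustration of that ordering (the batch size, program length and execute_instruction function here are all made up, not any real GPU's internals), the execution order is instruction-major rather than the pixel-major loop you'd write on a CPU:

Code:
#include <stddef.h>

#define BATCH_SIZE 1000   /* made-up number of pixels in a batch */
#define NUM_INSTR  8      /* made-up shader program length       */

/* Stand-in for "run shader instruction i on one pixel's value". */
static float execute_instruction(size_t i, float value)
{
    return value + (float)i;   /* dummy arithmetic, just so this compiles */
}

/* Sketch of the execution order only: every pixel in the batch runs
 * instruction i before any pixel moves on to instruction i + 1. */
static void run_batch(float pixels[BATCH_SIZE])
{
    for (size_t i = 0; i < NUM_INSTR; ++i) {        /* one instruction...        */
        for (size_t p = 0; p < BATCH_SIZE; ++p) {   /* ...across the whole batch */
            pixels[p] = execute_instruction(i, pixels[p]);
        }
    }
}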
The GPU pipeline is there to organise the vast amounts of data that a GPU can crunch through - 10 to 100x what a typical CPU can do.
The relatively simple instruction set of a GPU, and the outwardly quite limited concepts that can be programmed in a shader language, mean that instruction decoding and branch prediction are essentially irrelevant side-issues in pipeline organisation (a generalisation, of course - they don't totally disappear).
When a branch occurs, the GPU has the entire remainder of the batch in which to organise the data/instructions to run next (e.g. if pixel 667 is the first pixel to branch then there are still roughly 333/16 = 20 clocks to go). Though the 1000th pixel might be the little bugger that branches first, in which case you'd get a glitch.
Predication is used in the GPU to differentiate which pixels run which instructions:
Code:
....if pixel is in shadow then
1001 make
1001 this
1001 pixel
1001 darker
....end if
So the first and last pixels (in a group of four, here - the 1s in the 1001 mask) run the code; the other two pixels failed the branch, so those instructions have no effect on them.
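In C-ish pseudocode (just a sketch of the idea, not how the hardware is actually wired, and the names are made up), predication amounts to always computing the branch body and letting a per-pixel predicate bit decide whether the result is kept:

Code:
#define QUAD 4   /* a group of 4 pixels, as in the example above */

/* Sketch of predication for the shadow example: the body is computed for
 * every pixel, and a per-pixel predicate decides whether the result is
 * written back. With in_shadow = {1, 0, 0, 1} this matches the 1001 mask. */
void darken_if_in_shadow(float colour[QUAD], const int in_shadow[QUAD])
{
    for (int p = 0; p < QUAD; ++p) {
        float darker = colour[p] * 0.5f;                 /* "make this pixel darker"  */
        colour[p] = in_shadow[p] ? darker : colour[p];   /* keep or discard per pixel */
    }
}

Nothing actually branches here: both outcomes exist and the mask merely selects which one each pixel keeps, which is part of why no branch prediction is needed.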
---
A CPU runs a different instruction (in general) on each successive clock. A CPU deals with relatively small amounts of data, but generally that data is a moderately disorganised slew scattered all across memory. Caching on modern CPUs takes care of that problem, but still leaves the problem of latency in fetching data.
One way a modern CPU pipeline deals with latency is by preparing alternative instructions (in a different hardware thread) to execute should the data not arrive on time.
The pipeline also has to deal with not knowing which instruction will execute after a branch in the code (a simple if...then...else, or a loop). The CPU can speculatively fetch its best guess (or, in some designs, both the branch-fail and branch-succeed paths) and have those instructions ready to roll, or even start executing them. That's the basic idea behind branch prediction and speculative execution. It's costly in terms of transistors on the CPU, but that's what you get for having such laggardly data flows.
A CPU's pipeline tends to be fairly long, 20 to 30 stages. If a branch prediction is wrong then the entire contents of the pipeline become worthless, and the program effectively stalls until instructions from the correct path have travelled the pipeline's full length and execution can get going again.
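As a rough illustration (a toy C sketch, not a measurement of any particular CPU): the branch in the first loop below is essentially a coin flip on random data, so the predictor is wrong about half the time and the pipeline gets flushed over and over; a branch-free version of the same loop gives the predictor nothing to get wrong.

Code:
#include <stddef.h>

/* Toy example: if data[] is random, this branch is unpredictable, so a
 * 20-30 stage pipeline pays the full flush penalty on roughly half the
 * iterations. Sort the data first and the same loop behaves far better. */
long sum_large_values(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (data[i] >= 128)          /* hard to predict on random data */
            sum += data[i];
    }
    return sum;
}

/* Branch-free alternative: the condition just selects a value, which the
 * compiler can typically turn into a conditional move instead of a jump. */
long sum_large_values_branchless(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (data[i] >= 128) ? data[i] : 0;
    return sum;
}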
Of course I'm talking in rough terms about AMD/Intel CPUs in PCs.
---
GPUs benefit from having a pipeline that's always "1 instruction" long, for any given program. Hence there's no need for branch prediction, as such.
(There is a time overlap in the pipeline, as the new instruction starts up while the old instruction is just finishing, so it's not technically 1 instruction long, but that's essentially how it appears.)
NVidia GPUs with dynamic branching have a 6 clock latency on executing a branch (it may be less now; it seems to have changed because of driver improvements). I can't for the life of me work out why, because it should be "hidden" by the 1-instruction pipeline. So I'm missing something there.
Jawed