How will branch prediction work with a 16-wide vector machine like Larrabee? With GPUs, each vector element is a vertex or pixel that can branch independently, so predication is used unless all elements choose the same branch.
If you are talking about branching granularity, Larrabee, G80, and R6xx have a granularity that is greater than one.
The question is how we go about comparing them.
Larrabee's vector is 16-wide, though if we were to pack a pixel quad's worth of data into each register, that is only 4 pixels. If the data is arranged differently, we could push the granularity up to 16.
G80 has a granularity of 16 for vertex threads, and 32 for pixels.
R600 has a granularity of 64.
If there is any divergent branch behavior, there is going to be idled hardware, duplicated work, or serialized run-through of both code paths.
Larrabee's best case is less than that of G80, which is less than that of R600.
With predication on GPUs you're running both sides of the branch as well, and in those cases part of the SIMD hardware sits idle/empty too.
If I understand it right, there won't be a vector branch instruction in Larrabee, so there's no chance of having both a taken and a not-taken result for the same branch. Instead, you'll have vector comparison instruction(s) that compute essentially a 1-bit result for each vector element, and a way to tell whether any of those results are true or any false.
Then a potentially-divergent branch on Larrabee would look like this:
Code:
vcond = <vector comparison>
push current vector predicate mask
if (any bits set in vcond)
    AND vcond into vector predicate mask
    run "true" side of branch
    restore pushed predicate mask
if (any bits set in NOT(vcond))
    AND NOT(vcond) into vector predicate mask
    run "false" side of branch
pop vector predicate mask
The branch predictor would predict the result of the two branches in that code, each of which is either taken or not taken.
I'd expect there to be specialized instructions for making this pattern (and similar patterns for loops) efficient.
That sounds more like value prediction that just happens to be used to change program behavior. There may be some theoretical advantage shown in scalar code, but I haven't seen much on wider values.
At 16 elements, a branch predictor storing a bit mask per vector branch would need 16 bits of state, eight times the size of a 2-bit saturating counter.
At 16 elements, the space of possible outcomes also widens far beyond taken/not-taken: a 16-bit mask has 2^16 possible values.
Given how much of this behavior is data-driven, the chance that the predictor is wrong on at least one of the 16 elements is pretty high.
I'm not convinced predicting a vector mask's value is better.
edit:
Never mind, I read too much into what you were saying.
The pseudocode there branches based on comparing the mask against zero.
The problem of branching granularity is not helped, since the branch predictor only avoids work when the vcond value is zero. It even ignores the other coherent branching case, where vcond is all 1s.
GPUs at least are capable of handling both the all 0s and all 1s cases pretty well.
The branch predictor in this case adds nothing and potentially makes things worse.