I agree with that, but with predication you're only executing empty instructions during a branch. For parts of the code outside a branch all units are working. I don't know how significant this is, though.

With predication on GPUs you're running both sides of the branch too, and in those cases part of the SIMD hardware is idle/empty as well.
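To make the idle-lane point concrete, here is a minimal scalar sketch of GPU-style predication, assuming a hypothetical 8-lane unit (illustration only, not real GPU code): both sides of the branch execute for every lane, and the predicate merely selects which result each lane keeps, so the masked-off lane-slots on each side are wasted work.
Code:
/* Illustrative scalar model of GPU-style predication (assumed 8-lane unit;
   not real GPU code). Both sides execute for every lane; the predicate
   only selects which result survives, so masked-off lane-slots are wasted. */
#include <stdio.h>
#define LANES 8

int main(void) {
    int x[LANES] = {3, -1, 4, -1, 5, -9, 2, -6};
    int r[LANES];
    for (int lane = 0; lane < LANES; lane++) {
        int cond = x[lane] > 0;   /* per-lane predicate bit */
        int t = 2 * x[lane];      /* "true" side: executed for all lanes */
        int f = -x[lane];         /* "false" side: also executed for all lanes */
        r[lane] = cond ? t : f;   /* predicate picks the surviving result */
    }
    for (int lane = 0; lane < LANES; lane++)
        printf("%d ", r[lane]);   /* 6 1 8 1 10 9 4 6 */
    printf("\n");
    return 0;
}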
What about the case where bits are set in both vcond and NOT(vcond)? Won't it need to run both sides of the branch like a GPU?

If I understand it right, there won't be a vector branch instruction in Larrabee, so there's no chance of having both a taken and a not-taken result for the same branch. Instead, you'll have vector comparison instruction(s) that compute essentially a 1-bit result for each vector element, and a way to tell whether any of those results are true, or whether any are false.
Then a potentially-divergent branch on Larrabee would look like this:
Code:
vcond = <vector comparison>
push current vector predicate mask
if (any bits set in vcond)
    AND vcond into vector predicate mask
    run "true" side of branch
    restore predicate mask from the top of the stack
if (any bits set in NOT(vcond))
    AND NOT(vcond) into vector predicate mask
    run "false" side of branch
pop vector predicate mask

The branch predictor would predict the result of the two branches in that code, each of which is either taken or not taken.
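As a purely illustrative scalar model of that pattern (mask_stack, pred and run_side are my own names, not Larrabee instructions, and the 16-lane predicate mask in a uint16_t is an assumption), note that the two "any bits set" tests are ordinary scalar branches, which is exactly what a conventional branch predictor can predict:
Code:
/* Illustrative scalar model of the pattern above; mask_stack, pred and
   run_side are invented names, not Larrabee instructions. Assumes a
   16-lane vector whose predicate mask fits in a uint16_t. */
#include <stdint.h>
#include <stdio.h>

static uint16_t mask_stack[64];      /* predicate-mask stack */
static int      mask_top = 0;
static uint16_t pred = 0xFFFF;       /* current predicate mask: all lanes on */

static void run_side(const char *side, uint16_t m) {
    printf("%s side runs with mask %04x\n", side, m);  /* stand-in for vector work */
}

void divergent_branch(uint16_t vcond) {
    mask_stack[mask_top++] = pred;            /* push current predicate mask */
    if (pred & vcond) {                       /* any enabled lane true? (scalar branch) */
        pred &= vcond;                        /* AND vcond into predicate mask */
        run_side("true", pred);
        pred = mask_stack[mask_top - 1];      /* restore before testing the other side */
    }
    if (pred & (uint16_t)~vcond) {            /* any enabled lane false? (scalar branch) */
        pred &= (uint16_t)~vcond;             /* AND NOT(vcond) into predicate mask */
        run_side("false", pred);
    }
    pred = mask_stack[--mask_top];            /* pop predicate mask */
}

int main(void) {
    divergent_branch(0x00FF);  /* mixed lanes: both sides run */
    divergent_branch(0xFFFF);  /* uniform true: false side skipped entirely */
    return 0;
}
In the mixed case both calls to run_side execute, but in the uniform cases one of the scalar branches falls through and a whole side is skipped, which is the win over always-run-both predication.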
I'd expect there to be specialized instructions for making this pattern (and similar patterns for loops) efficient.
Code:
if (all bits set in vcond)
    run "true" side of branch
else if (all bits set in NOT(vcond))
    run "false" side of branch
else
    run both sides
edit: The above pseudocode might illustrate my question better. Even setting aside that branch predictors are never 100% correct in practice, if the architecture works like a GPU it seems impossible for the branch predictor to be right all the time, since the divergent case has to run both sides regardless of any prediction.
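To pin that down, here is the same three-way pattern as a scalar C model (assumed names and a 16-lane mask, illustration only): the first two tests are ordinary scalar branches a predictor can guess, but the final case runs both sides no matter what was predicted.
Code:
/* Illustrative scalar model of the three-way pattern above (assumed names,
   16-lane mask). The first two tests are scalar branches a predictor can
   guess; the else case must run both sides no matter what was predicted. */
#include <stdint.h>
#include <stdio.h>
#define ALL_LANES 0xFFFFu

void three_way(uint16_t vcond) {
    if (vcond == ALL_LANES)
        printf("uniform: run only the true side\n");
    else if (vcond == 0)
        printf("uniform: run only the false side\n");
    else
        printf("divergent: run both sides, masks %04x and %04x\n",
               vcond, (unsigned)(ALL_LANES & ~vcond));
}

int main(void) {
    three_way(0xFFFF);  /* all lanes true  */
    three_way(0x0000);  /* all lanes false */
    three_way(0x00F0);  /* mixed: prediction can't avoid running both */
    return 0;
}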
I guess it would pick one branch and, if necessary, run the other branch as well, either throwing away the first branch's results or keeping the results from both under a predicate mask. If both branches are frequently taken, then a branch predictor seems like a waste of silicon.
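For a sense of how often the mixed case comes up, here is a back-of-the-envelope sketch (my numbers, not from the thread): assuming each lane's condition is an independent coin flip, the probability that all lanes agree, so that a single side suffices, drops off quickly with vector width.
Code:
/* Back-of-the-envelope, not from the thread: probability that every lane
   agrees (so only one side must run), assuming each lane's condition is an
   independent coin flip with probability p. Real lane conditions are
   usually correlated, so this overstates divergence. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double p = 0.5;  /* assumed per-lane probability of "true" */
    for (int lanes = 4; lanes <= 16; lanes *= 2) {
        double all_agree = pow(p, lanes) + pow(1.0 - p, lanes);
        printf("%2d lanes: all-agree probability = %.5f\n", lanes, all_agree);
    }
    return 0;  /* prints 0.12500, 0.00781, 0.00003 */
}
Under that independence assumption the mixed case dominates at wide vectors, which is exactly where prediction buys nothing; in real code lane conditions are strongly correlated, so the uniform branches that a predictor can exploit are much more common than these numbers suggest.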
Or is a GPU mindset making me miss how the SIMD unit will work?