Ah. I get it now. I guess I've never had the need to do that so it took a while to get it.
It's a very good way to gain performance if you have a special-case branch that can only handle the special case but is considerably faster, and a general-purpose branch that can handle both cases.
For example: the general-purpose branch is 20 instructions and the special-case branch is 10 instructions. Without an "ifAny"-style branch, a divergent warp executes 30 instructions. With it, it executes only 20 (a saving of 10 instructions).
This kind of branching violates the SPMD model, so it's not officially supported in most SPMD languages (OpenCL, DirectCompute). You can do a similar thing in CUDA with warp voting. However, the GPU hardware doesn't need any extra instructions for this.
if (c)
{
// common case
[20 instructions]
}
else
{
// optimal special case
[10 instructions]
}
Compiles to something like this:
do the comparison c
if comparison result of c has no bits set, jump to A
(c) instruction 1
(c) instruction ..
(c) instruction 20
A:
if comparison result of c has all bits set, jump to B
(!c) instruction 1
(!c) instruction ...
(!c) instruction 10
B:
Here the comparison result of c is a bitfield (one bit per lane), and the bracketed (c) and (!c) instructions are predicated.
The so-called "ifAny" version looks like this:
do the comparison c
if comparison result of c has ANY bits set, jump to A
instruction 1
instruction ...
instruction 10
jump to B
A:
instruction 1
instruction ..
instruction 20
B:
The second version uses only jump instructions, and it doesn't even need lane predication. On consoles you can write your own GPU microcode, so you can write constructs like this. In CUDA you can also write constructs like this (with warp voting), but it easily makes your program unportable. You also need to be absolutely sure that the general-case code produces correct results for all the special-case inputs as well; otherwise each thread's result will depend on the other threads in its warp. Floating-point rounding can also be a problem, since the compiler may optimize the two paths differently (again giving slightly different results depending on the other threads in the same warp). But most CUDA code is already written assuming that the warp width is 32, so it isn't portable either way: you can assume that 32 threads are always lock-stepped and use this for optimization purposes (you need fewer barriers this way), but the code breaks down if the warp width is not what was expected. Many libraries do this extensively.
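To make the instruction-count argument concrete, here's a minimal host-side simulation of the two compilation schemes. The cost figures (20 and 10 instructions) come from the example above; everything else (function names, the warp represented as a list of per-lane predicates) is illustrative, not any real GPU API.

```python
# Instruction-count model of the two branch compilation schemes.
# A "warp" is a list of per-lane boolean predicates for condition c.
# Path costs are the figures from the example: 20 instructions for the
# general-purpose path, 10 for the special-case path.

GENERAL_COST = 20   # instructions in the (c) general-purpose path
SPECIAL_COST = 10   # instructions in the (!c) special-case path

def predicated_cost(warp):
    """Standard scheme: a divergent warp walks both predicated paths."""
    cost = 0
    if any(warp):          # at least one lane needs the general path
        cost += GENERAL_COST
    if not all(warp):      # at least one lane needs the special path
        cost += SPECIAL_COST
    return cost

def if_any_cost(warp):
    """'ifAny' scheme: if ANY lane needs the general path, ALL lanes take
    it (so the general code must also be valid for special-case inputs)."""
    return GENERAL_COST if any(warp) else SPECIAL_COST

divergent = [True] * 16 + [False] * 16   # mixed 32-lane warp
uniform_special = [False] * 32           # every lane hits the special case

assert predicated_cost(divergent) == 30  # both paths executed
assert if_any_cost(divergent) == 20      # only the general path: saves 10
assert predicated_cost(uniform_special) == if_any_cost(uniform_special) == 10
```

Note that for non-divergent warps the two schemes cost the same; the "ifAny" version only pays off (and only changes behavior) when the warp is divergent.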
Intel's desktop CPUs aren't going to represent the best balance between CPU and GPU for obvious reasons. Mobile chips, which represent the majority of present-day chips and the future of computing, will tell quite a different tale.
OK, so let's bring some mobile chips into the debate.
Sandy Bridge i7 extreme (2960XM: 4 cores, 2.7 GHz / 3.4* GHz turbo + HD 3000, 650 MHz / 1.3 GHz turbo):
- CPU: 4 (cores) * 8 (AVX) * 2 (separate mul + add ports): Nominal (2.7 GHz) = 172.8 GFLOP/s, turbo (3.4 GHz) = 217.6 GFLOP/s
- GPU: 12 (EU) * 4 (physical width of EU) * 2 (FMA): Nominal (650 MHz) = 62.4 GFLOP/s, turbo (1.3 GHz) = 124.8
Sandy Bridge i7 performance (2760QM: 4 cores, 2.4 GHz / 3.2* GHz turbo + HD 3000, 650 MHz / 1.3 GHz turbo):
- CPU: 4 (cores) * 8 (AVX) * 2 (separate mul + add ports): Nominal (2.4 GHz) = 153.6 GFLOP/s, turbo (3.2 GHz) = 204.8 GFLOP/s
- GPU: 12 (EU) * 4 (physical width of EU) * 2 (FMA): Nominal (650 MHz) = 62.4 GFLOP/s, turbo (1.3 GHz) = 124.8
Sandy Bridge i3 mainstream (2370M: 2 cores, 2.4 GHz, no turbo + HD 3000, 650 MHz / 1.15 GHz turbo):
- CPU: 2 (cores) * 8 (AVX) * 2 (separate mul + add ports): Nominal (2.4 GHz) = 76.8 GFLOP/s
- GPU: 12 (EU) * 4 (physical width of EU) * 2 (FMA): Nominal (650 MHz) = 62.4 GFLOP/s, turbo (1.15 GHz) = 110.4 GFLOP/s
The CPU wins the FLOP/s race in all the other Sandy Bridge mobile parts except the low-end i3 model, and even then only if the GPU is running at its maximum turbo clock. In high-end Ivy Bridge chips the CPU and GPU are basically tied. The mainstream Ivy Bridge models with HD 4000 graphics win against the CPU handily, but the low-end models with HD 2500 graphics do not. So I wouldn't personally describe it as "quite a different tale": currently sold Intel integrated mobile GPUs are pretty much tied with the CPU in peak FLOP/s. Of course this alone doesn't warrant any kind of vector ALU sharing between CPU and GPU. That would require much more than equal peak FLOP/s.
(*) CPU turbo clocks in the chart are based on the lowest turbo values (four cores active).
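The peak-FLOP/s figures above can be sanity-checked with the formula implied by the list: units x SIMD width x flops per lane per clock x clock. The helper below is illustrative; the numbers are the ones from the chart.

```python
# Sanity check of the peak-FLOP/s arithmetic above.
# Assumed formula: units * SIMD width * flops per lane per clock * GHz.

def gflops(units, simd_width, flops_per_clock, ghz):
    return units * simd_width * flops_per_clock * ghz

# 2960XM CPU: 4 cores * 8 (AVX) * 2 (separate mul + add ports)
assert abs(gflops(4, 8, 2, 2.7) - 172.8) < 1e-9   # nominal
assert abs(gflops(4, 8, 2, 3.4) - 217.6) < 1e-9   # turbo

# HD 3000 GPU: 12 EUs * 4 (physical width) * 2 (FMA)
assert abs(gflops(12, 4, 2, 0.65) - 62.4) < 1e-9  # nominal
assert abs(gflops(12, 4, 2, 1.3) - 124.8) < 1e-9  # turbo
assert abs(gflops(12, 4, 2, 1.15) - 110.4) < 1e-9 # i3 turbo cap

# 2370M CPU: 2 cores at 2.4 GHz, no turbo
assert abs(gflops(2, 8, 2, 2.4) - 76.8) < 1e-9
```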
..Haswell GPU is substantially bigger than IB...
Haswell will double GPU performance relative to Ivy Bridge. But Haswell also doubles the CPU flops (dual FMA pipes), so the CPU/GPU FLOP/s ratio shouldn't change much at all.
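A quick per-clock check of that argument (illustrative numbers, ignoring clock speeds; the 16-EU HD 4000 figure is an assumption on my part): doubling both sides leaves the ratio untouched.

```python
# Per-clock peak flops, clocks ignored for simplicity.
ivb_cpu = 4 * 8 * 2          # 4 cores * 8-wide AVX * (mul + add ports)
ivb_gpu = 16 * 4 * 2         # HD 4000: 16 EUs * width 4 * FMA (assumed)

hsw_cpu = 4 * 8 * 2 * 2      # dual FMA pipes: flops per clock double
hsw_gpu = ivb_gpu * 2        # roughly doubled GPU throughput

# Both sides doubled -> the CPU:GPU ratio is unchanged.
assert hsw_cpu == 2 * ivb_cpu
assert hsw_cpu / hsw_gpu == ivb_cpu / ivb_gpu
```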