Performance -> Boolean vs Number crunching

K.I.L.E.R

Not really specific to any processor, however x86 and x87 CPUs tend to crunch numbers quite well. My issue is in designing a fast algorithm for these processor families. I've always believed, from ages past that branching is always slower than crunching a series of floating point numbers.

For simple boolean ops like:
a &= b;
a |= c;

Assume these are *not* bitwise ops, just basic conditional tests.

The cost of branching shouldn't exist, so conditional tests should be the middle ground of performance. Branching < Conditional tests < Floating Point < Fixed Point. Branching does use conditional tests, but my belief is that branching uses stacks, while simple conditional tests compounded on one another do not create a branch stack.

Can someone please confirm or explain any of my wrong beliefs? Thanks.
 
Not really specific to any processor, however x86 and x87 CPUs tend to crunch numbers quite well.
The x87 hasn't existed for over 10 years...

My issue is in designing a fast algorithm for these processor families. I've always believed, from ages past that branching is always slower than crunching a series of floating point numbers.
For x86 processors, yes, branching is worse than straightline number crunching.

For simple boolean ops like:
a &= b;
a |= c;

Assume these are *not* bitwise ops, just basic conditional tests.
Then write them that way. I'm not exactly sure what you're writing. It LOOKS like A (bitwise AND) B, which is most likely executed in a single instruction, but then you discuss conditional tests, which the code, as written, isn't. If you meant A = A && B, then it really depends on the underlying architecture whether it's a branch or not (on the ARM, for example, it probably wouldn't compile down to a branch). It might on the x86. I highly recommend you write the code out and look at the underlying assembly.

The cost of branching shouldn't exist, so conditional tests
A conditional test is branching, unless you mean conditional execution (which the ARM has, but I don't think the x86 does).

should be the middle ground of performance. Branching < Conditional tests < Floating Point < Fixed Point.
I believe on current x86 architectures floating point is as fast as, or faster than, integer math.

Branching does use conditional tests, but my belief is that branching uses stacks, while simple conditional tests compounded on one another do not create a branch stack.
Again, depends on the architecture, but typically branching (as in if/else) doesn't create any sort of stacks on the basis of the branch alone. You might have a local variable declared inside the if/else block that requires something be put on the stack, but the branch itself doesn't demand it.

Branches are slower because they stall the pipeline.

Again, I highly recommend compiling code and looking at the resultant assembly. Even then, you need a book about the processor architecture you're interested in to understand the impact branches etc. have on the pipeline. Some are pretty easy to understand; others (more modern processors) are very complex.
 
The cost of branching shouldn't exist, so conditional tests should be the middle ground of performance. Branching < Conditional tests < Floating Point < Fixed Point. Branching does use conditional tests, but my belief is that branching uses stacks, while simple conditional tests compounded on one another do not create a branch stack.

Can someone please confirm or explain any of my wrong beliefs? Thanks.

A mispredicted branch will always be a performance problem for any processor with a pipeline.
Usually, there is a performance cost that is amortized over many branch encounters, if there is a pattern to the branch's behavior.

Properly predicted branches can have as little penalty as a single issue slot in one cycle--or as much as any other simple instruction.

Some architectures can't maintain this. The K8 inserts a pipeline bubble for every branch, and it's also possible that a branch instruction will split a superscalar processor's instruction fetch.
The P4's trace cache avoided the second problem.

If anything, the conditional test in your hierarchy should be the slower of the two. A chip hitting an unconditional branch doesn't have to worry about mispredicts that plague conditional branches.

For OoO processors, branching can sometimes be a significant win, if the conditional branches can be well predicted by the hardware. A software loop is usually picked up very well, and Intel's current chips have predictors that are dedicated to loop detection.

There are ways, like if-conversion, that use predication to burn through both sides of a branch, but that should only be used if a branch cannot be predicted well. Avoiding branches by sacrificing decode and execution bandwidth will inflict a penalty in wasted execution 100% of the time. If a branch can avoid part of it in the common case, it is better to branch.

It's more of a question of whether the code is branching only where it must. A branch that if predicted well avoids 200 unnecessary ops is suddenly much more worthwhile than running those 200 ops and then undoing all the work.
 
Not really specific to any processor, however x86 and x87 CPUs tend to crunch numbers quite well. My issue is in designing a fast algorithm for these processor families. I've always believed, from ages past that branching is always slower than crunching a series of floating point numbers.

For simple boolean ops like:
a &= b;
a |= c;

Assume these are *not* bitwise ops, just basic conditional tests.

The cost of branching shouldn't exist, so conditional tests should be the middle ground of performance. Branching < Conditional tests < Floating Point < Fixed Point. Branching does use conditional tests, but my belief is that branching uses stacks, while simple conditional tests compounded on one another do not create a branch stack.

Can someone please confirm or explain any of my wrong beliefs? Thanks.

Simple version.....

Because of the somewhat brain-dead conventions in C++, logical operations will almost always generate branches. Doing bitwise operations on integer values, where true/false is signified by 0 or 1, will not.

Branches are best avoided.

Generally float ops are as fast or faster than integer ops, but they may have higher latency.

Converting between floats and ints is just plain slow.

And most importantly most code is performance limited by memory accesses.
 
Thanks guys. One question about your last point ERP, in what way are applications memory access limited?
 
Thanks guys. One question about your last point ERP, in what way are applications memory access limited?

If your app doesn't loop over the same data repeatedly, the cache misses on load usually dwarf the execution time of the logic.
Arranging data to be "cache optimal" usually buys more than trivial optimizations of logic operations.
Another C++ gotcha I see a lot is unnecessary copying of STL containers, which leads to being malloc-bound.
 
That does assume that the code is accessing the data in a way that doesn't get caught by the hardware prefetcher (assuming we're just thinking of chips from the last 5 years). A lot of sub-optimal code does access in a regular pattern that can be picked up.

It's hardly perfect, but the fact Intel's Conroe has such an improvement in IPC can probably be attributed to the loop cache and much more aggressive prefetching.

It doesn't make the problem go away, but it does make it so optimization efforts can focus on other things.

In workloads that don't hit the prefetchers, I wonder if x86 wouldn't benefit from some kind of bulk load or bulk prefetch, something broader than just a cache line. With just a little extra software control of locking cache lines, portions of cache could be used to emulate a local store. If access behavior isn't totally out of whack, the shorter latency of the L1 would in the general case do better than a larger uniform-latency local store.
 
Because of the somewhat brain dead conventions in C++ logical operations will almost always generate branches. Doing bitwise operations on integer values where true/false is signified by 0 or 1 will not.

I believe that's because the logical operations are short circuited, e.g. if A is true in (A && B) then B won't be evaluated. This is a very useful feature. For example, it's safe to write:

if(a != 0 && *a > 100) {
...
}

If it's not short circuited then it's not safe because *a > 100 may be evaluated when a == 0.
 
In general, I don't think that's true.

a may be a memory-mapped register with read side effects (or, in C++, a class with the same, though the compiler could tell that), and the operation of the program could be adversely affected if the compiler skipped evaluation of a.

The only time things are short-circuited is when you use the " ? : " construct, AFAIK (or nested if/else blocks)
 
logical operators (&& and ||) are by definition short-circuited in C, C++, Java and most similar languages.

if(a && b) is guaranteed not to evaluate B if A evals to false.
if(a || b) is guaranteed not to evaluate B if A evals to true.

Don't confuse them with the & and | boolean/bitwise ops, which are NOT short-circuited.

Short-circuit behavior is fundamentally relied upon in huge volumes of code. If left-to-right short-circuit eval were not guaranteed, there would be huge bugs.

Hell, I sometimes "comment out" stuff by writing "if(false && ...)"
 
In general, I don't think that's true.
No. C definitely has short-circuited evaluation of (logical) Boolean expressions. Pascal, OTOH, did not.
a may be a memory mapped register with read side effects)(or in C++ a class with the same, though the compiler could tell that), and the operation of the program could be adversely affected if the compiler skipped evaluation of a.
But his test was to see if he had a null pointer and so this was a highly valid use.
 
Branches aren't always slower.

Since doing predicated execution (which is essentially what converting conditionals into explicitly calculated booleans is) can have worse execution latency than a correctly predicted branch, you can end up with lower performance. This of course depends heavily on how well the branch can be predicted and how large an execution latency is avoided.

The branch prediction apparatus essentially breaks dependencies.

Cheers
 
I believe that's because the logical operations are short circuited, e.g. if A is true in (A && B) then B won't be evaluated. This is a very useful feature. For example, it's safe to write:

if(a != 0 && *a > 100) {
...
}

If it's not short circuited then it's not safe because *a > 100 may be evaluated when a == 0.

It's actually for two reasons: the one you mention, and the fact that bools are not first-class constructs in C++. They can be indiscriminately mixed with ints, and the standard specifies that any non-zero value is true.

I always used to ask the following question in interviews

Under what conditions do the following two code fragments yield different program state?

bool fn1();
bool fn2();

Fragment one is

bool res = fn1() && fn2();

Fragment two is

bool res = fn1() & fn2();

I'd guess less than 30% of people who consider themselves proficient C programmers get it right.
 
ERP,
You'll have the "functions should be side-effect free" fanatics down on you like a ton of bricks :)
SF
 
ERP,
You'll have the "functions should be side-effect free" fanatics down on you like a ton of bricks :)
SF

See, that's the obvious answer... But that's the one I'm usually looking for.
A second option would be that the first function tests a shared pointer value, returning false on a null pointer; the second function then dereferences it without testing.....

Neither are particularly good practice, but I've seen both in real applications.
 
See, that's the obvious answer... But that's the one I'm usually looking for.
A second option would be that the first function tests a shared pointer value, returning false on a null pointer; the second function then dereferences it without testing.....

Neither are particularly good practice, but I've seen both in real applications.
And the third option would be that the functions assume that fn1 will be executed first, but that's not guaranteed in the second fragment.
I guess that's a more popular mistake than option 2.
 
And the third option would be that the functions assume that fn1 will be executed first, but that's not guaranteed in the second fragment.
I guess that's a more popular mistake than option 2.
That is, sort of, already covered by the "has side effects".
 
No. C definitely has short-circuited evaluation of (logical) Boolean expressions. Pascal, OTOH, did not.
Oh well, you learn something new every day.

But his test was to see if he had a null pointer and so this was a highly valid use.
Ahh, but I believe the order of evaluation isn't dictated there, so this code may work on some compilers, and not on others.

Anyways, I've always treated everything in a boolean expression as getting executed all the time, and do nested if/elses or the " ? : " construct to dictate ordering if it's important.
 