Post your code optimizations

MfA said:
Assuming it matters for correctness whether you use i++ or ++i, and you change your program flow to use one or the other, I wouldn't be so sure pre-increment is faster. With pre-increment you avoid creating a copy, but you create a short-range dependency ... a bit of a mixed bag. Sure, for a complex iterator the copy weighs heavily, but for an int?

Assuming it doesn't matter which you use, I doubt a compiler creates different code either way.

Pre increment will be faster for iterators, but not for basic types.
 
arjan - thanks for those links :smile:

Xmas - now you've got me. The minimum LUT size should be found for the constant M with the smallest consecutive # of 1 bits in it. I don't have a proof, but my feeling is this number is at least 4, so I'm thinking ~120 entries minimum. I'm at work though so I'll have to think about this more later... :smile:
 
ERP said:
Pre increment will be faster for iterators, but not for basic types.
I tested it for iterators and pre-increment was no faster than post-increment. :!:

So once more, trust the compiler and write what's best for readability.

Besides, if performance really matters that much, don't use STL containers in the first place. Just use an array with simple indexing if possible, or write your own container class that is faster to iterate through. Beware, STL is already pretty fast, so it's not trivial to beat it (nor maintain robustness). But it's a whole lot more worthwhile than changing from post-increment to pre-increment!
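To make the copy argument concrete, here's a minimal sketch — BigIter is a made-up type, not anything from the thread, just a stand-in for a "complex iterator":

```cpp
// A made-up heavyweight iterator to show why post-increment can cost more:
// it must return the *old* value, so it copies the whole object first.
struct BigIter {
    int pos;
    int scratch[16];                  // stand-in for expensive iterator state
    BigIter& operator++() {           // pre-increment: no copy, returns *this
        ++pos;
        return *this;
    }
    BigIter operator++(int) {         // post-increment: full copy of 17 ints...
        BigIter old = *this;
        ++pos;
        return old;                   // ...which the caller usually discards
    }
};
```

For an int the "copy" is a register move the optimizer deletes, which is why measurements on basic types (and, as above, often on real iterators too) show no difference.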
 
i personally love this one. simple and logical, but a hell of an expression ;)

Code:
// BEHOLD: the expression of the devil
delta = ((long long)delta * unsigned(((1.0 / SAMPLERATE) * (1 << 6)) * (1LL << 32)) + (1ULL << 31)) >> 32;

this is from a routine to calculate a sample-cursor delta from a given amiga-period in an XM-player. the result of this expression is a UMLAL-instruction on ARM-cpus :)
 
Euclidean distance on ARM in ~10 cycles

Code:
int fast_pythagoras(int x, int y, int z) {
	x = x ^ (x>>31);
	y = y ^ (y>>31);
	z = z ^ (z>>31);
	int summ = x+y+z;
	int maxx = max(max(x,y),z);
	return (summ+maxx)>>1;
}
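Self-contained version for checking (the only assumption is that max() above means std::max):

```cpp
#include <algorithm>   // std::max, assumed to be the max() in the post

// The routine from the post, reproduced verbatim apart from qualification.
int fast_pythagoras(int x, int y, int z) {
    x = x ^ (x >> 31);                      // cheap almost-abs (see later posts)
    y = y ^ (y >> 31);
    z = z ^ (z >> 31);
    int summ = x + y + z;
    int maxx = std::max(std::max(x, y), z);
    return (summ + maxx) >> 1;              // (sum + max) / 2
}
```

As noted further down the thread, it's only an approximation: fast_pythagoras(30, 40, 0) gives 55 against a true distance of 50.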
 
Chalnoth said:
Code:
const int stride = 16;
index = 0;
for (i = 0; i < maxvalue/stride; i++)
  for (j = 0; j < stride; j++)
    p += arr[index++];

for (i = 0; i < maxvalue % stride; i++)
  p+= arr[index++];
This resulted in about a factor of two speedup when I tested it on my system.
Thou shalt be flogged for having computations in your terminal comparisons.

Granted, the compiler will likely fix this for you, but not recognizing the issue could send you down the dark path of putting function calls in the terminal comparison (on many processors without intrinsic divide and modulo, what you have above in your terminal comparisons IS a function call). This will guarantee register and stack spillage.
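A sketch of the fix for Chalnoth's loop — the function name sum_strided is mine, the body is his code with the divide and modulo hoisted out of the terminal comparisons:

```cpp
// Hoisted version: maxvalue/stride and maxvalue%stride are computed once,
// so the loop conditions are plain register compares even on CPUs where
// divide and modulo are library calls.
int sum_strided(const int* arr, int maxvalue) {
    const int stride = 16;
    const int blocks = maxvalue / stride;   // once, before the loops
    const int tail   = maxvalue % stride;   // once
    int p = 0;
    int index = 0;
    for (int i = 0; i < blocks; i++)
        for (int j = 0; j < stride; j++)
            p += arr[index++];
    for (int i = 0; i < tail; i++)
        p += arr[index++];
    return p;
}
```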
 
kusma said:
i personally love this one. simple and logical, but a hell of an expression ;)

Code:
// BEHOLD: the expression of the devil
delta = ((long long)delta * unsigned(((1.0 / SAMPLERATE) * (1 << 6)) * (1LL << 32)) + (1ULL << 31)) >> 32;

this is from a routine to calculate a sample-cursor delta from a given amiga-period in an XM-player. the result of this expression is a UMLAL-instruction on ARM-cpus :)
Got anything to do alpha blending?

I'm down to about 25 cycles per pixel for 16BPP/565, but can't really do much better.
 
spookysys said:
Code:
int fast_pythagoras(int x, int y, int z) {
	x = x ^ (x>>31);
	y = y ^ (y>>31);
	z = z ^ (z>>31);
	int summ = x+y+z;
	int maxx = max(max(x,y),z);
	return (summ+maxx)>>1;
}
Doesn't seem to work.

fast_pythagoras(30,40,0) returns 55. It should return 50.

Unless I'm confused about what it's supposed to be doing.
 
RussSchultz said:
Got anything to do alpha blending?

I'm down to about 25 cycles per pixel for 16BPP/565, but can't really do much better.
Sounds like a prime candidate for cute little SIMD-in-a-register tricks:
Code:
unsigned int color1, color2; // unsigned 32-bit variables
// load color1, color2 and alpha here ...

color1 &= 0xFFFF; // zero out upper 16 bits; might not be needed ...
color2 &= 0xFFFF;
color1 |= color1 << 16; // move green bits into upper half
color2 |= color2 << 16;
color1 &= 0x07E0F81F;
color2 &= 0x07E0F81F;

color2 *= 32 - alpha; // perform the blend
color1 = (color1 * alpha) + color2;
color1 = (color1 >> 5) & 0x07E0F81F; // move green bits back into lower half.
color1 |= color1 >> 16; // low 16 bits of color1 should now contain an alpha-blended color
In this case, alpha is taken to range from 0 to 32.
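Wrapped into a function for testing — the wrapper and the name blend565 are mine, the body is the trick above, alpha still 0 to 32:

```cpp
#include <cstdint>

// SIMD-in-a-register 565 blend; alpha in [0, 32].
uint16_t blend565(uint16_t c1, uint16_t c2, unsigned alpha) {
    uint32_t a = c1, b = c2;
    a |= a << 16;                  // duplicate so green can sit in the top half
    b |= b << 16;
    a &= 0x07E0F81F;               // G in bits 21-26, R in 11-15, B in 0-4
    b &= 0x07E0F81F;
    b *= 32 - alpha;               // the blend: fields are far enough apart
    a = a * alpha + b;             //   that the products never overlap
    a = (a >> 5) & 0x07E0F81F;     // divide by 32, drop the cross-field bits
    a |= a >> 16;                  // fold green back into the low 16 bits
    return (uint16_t)a;
}
```

The endpoints come out exact: alpha of 32 returns c1 untouched and alpha of 0 returns c2.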
 
spookysys said:
Code:
int fast_pythagoras(int x, int y, int z) {
	x = x ^ (x>>31);
	y = y ^ (y>>31);
	z = z ^ (z>>31);
	int summ = x+y+z;
	int maxx = max(max(x,y),z);
	return (summ+maxx)>>1;
}
IIRC, that looks similar to the approximation in Graphics Gems
 
arjan de lumens said:
Sounds like a prime candidate for cute little SIMD-in-a-register tricks: ....
In this case, alpha is taken to range from 0 to 32.
Ahh, but channels are typically given in 0 to (2^N)-1. This is where Jim Blinn's "3 wrongs make a right" comes into play (at least for the 8-bit case). Unfortunately, I can't copy the code because the book is at work.
 
arjan de lumens said:
In this case, alpha is taken to range from 0 to 32.
That's the unfortunate rub. I'm doing 8-bit alpha. A 0-to-32 alpha would be easier, since the multiplication won't 'smear' any components together (they'd be enough bits apart in the register).

I may revisit my requirements to see if I can sell 5 bit alpha if performance isn't good enough.

There's a 'better' equation to use, by the way, for alpha blending:
component = (alpha * (source - dest) + dest * 256)/256

where 256 would be 32 in this case. Some algebraic manipulation allows a reduction of a few cycles.
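A sketch of that rearrangement per channel, using the 0-32 alpha from the post above (the function names are mine):

```cpp
// One multiply instead of two:
//   (alpha*src + (32 - alpha)*dst) / 32  ==  (alpha*(src - dst) + dst*32) / 32
// The subtraction can go negative, so the intermediate must be signed, but
// the full sum is always >= 0, so the final shift is a plain divide by 32.
int blend_channel(int src, int dst, int alpha) {
    return (alpha * (src - dst) + (dst << 5)) >> 5;
}

int blend_channel_2mul(int src, int dst, int alpha) {   // reference form
    return (alpha * src + (32 - alpha) * dst) >> 5;
}
```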
 
Simon F said:
Ahh but channels are typically given in 0 to(2^N)-1. This is where Jim Blinn's "3 wrongs make a right" comes into play (at least for an 8 bit case). Unfortunately, I can't copy the code because the book is at work.
Speaking of that, that's terribly irritating.

0 makes sense. (2^n)-1 just doesn't work right. :/ Is it just slightly translucent? Or opaque? If it's slightly translucent, how do you code opaque? If it's opaque, do you simply accept that the calculations are going to be slightly off?
 
RussSchultz said:
Speaking of that, that's terribly irritating.

0 makes sense. (2^n)-1 just doesn't work right. :/ Is it just slightly translucent? Or opaque? If it's slightly translucent, how do you code opaque? If it's opaque, do you simply accept that the calculations are going to be slightly off?
No, in 8-bit land 0<=>0.0 and 255<=>1.0 and so 128 is slightly more than a half.

So, if you have "C" = "A" * "B" you need to compute C as = int( A * B / 255.0f + 0.5f).

The mighty Blinn shows that you can do that correctly with just integer maths- three incorrect calculations magically turn out to give the right result :D

AHH, I remembered I did try it out once
Code:
static int JimBlinns(int a, int b)
{
        int Prod, Result;

        Prod = a * b + 128;
        Result  = (Prod + (Prod >> 8)) >> 8;

        return Result;
}
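A quick check of that routine against the float version quoted above (the harness is mine):

```cpp
// Blinn's three "wrong" steps, next to the exact rounded divide-by-255.
static int JimBlinns(int a, int b)
{
    int Prod = a * b + 128;
    return (Prod + (Prod >> 8)) >> 8;
}

static int exact255(int a, int b)
{
    return (int)(a * b / 255.0 + 0.5);   // the reference from Simon's post
}
```

They agree for every pair of 8-bit inputs, including JimBlinns(255, 255) == 255 — which a plain (a*b) >> 8 gets wrong (254).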
 
RussSchultz said:
Thou shalt be flogged for having computations in your terminal comparisons.

Granted, the compiler will likely fix this for you, but not recognizing the issue could send you down the dark path of putting function calls in the terminal comparison (on many processors without intrinsic divide and modulo, what you have above in your terminal comparisons IS a function call). This will guarantee register and stack spillage.
How do you loop through an object whose count comes from a property or method? That's probably the kind of loop I use most.

Although the compiler might optimize that into a fixed number if you don't add or delete members in your loop.
 
K.I.L.E.R said:
Sorry, but I have to put the ++i/i++ debate to sleep for good.

http://www.parashift.com/c++-faq-lite/operator-overloading.html#faq-13.15
Yes, it's never slower, but it doesn't mention the argument that i++ is more natural to read (and that's a personal opinion anyway). Also, for pointers used in a loop you mostly have to write *p++. The ++i form is hardly ever semantically required, and most people would write i + 1 in those situations. And here's another minor reason:
Code:
int i = 0;
int j = i+++i;
I hope you don't write code like this, but guess what the value of j is? (Trick question: i is modified and read without sequencing, so the behavior is actually undefined.)

i++ is simply used more, and thus easier to read, so I write my loops with i++ as well. I couldn't find a practical performance benefit anyway, and in theory with dead code elimination there is none at all. So the only argument remaining is readability, and that's where i++ wins.
 
RussSchultz said:
Doesn't seem to work.

fast_pythagoras(30,40,0) returns 55. It should return 50.

Unless I'm confused about what it's supposed to be doing.

Yeah.. well it's a pretty crude approximation.

Simon F said:
IIRC, that looks similar to the approximation in Graphics Gems

Hm yeah i guess that's it...

Note:
foo = foo ^ (foo>>31)
Takes the absolute value (almost — it gives -foo-1 for negative foo :) and compiles to one instruction: EOR with an ASR-shifted operand.
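If the exact absolute value is needed, adding the sign back in costs one more op and is still branch-free (on ARM the shifted operand folds into the data-processing instructions):

```cpp
// x ^ (x >> 31) maps -5 to 4; subtracting the sign word fixes it up.
// Assumes 32-bit int and arithmetic right shift of negative values.
int iabs(int x) {
    int s = x >> 31;        // 0 for non-negative, -1 for negative
    return (x ^ s) - s;     // exact |x| (except for INT_MIN, as with abs())
}
```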
 
DiGuru said:
How do you loop through an object whose count comes from a property or method? That's probably the kind of loop I use most.

Although the compiler might optimize that into a fixed number if you don't add or delete members in your loop.
It won't, because it can't know what's going on inside your object.

If all of the methods called within the loop are marked const, you could get away with it, but I doubt your objects are like that.

The right thing to do is get the count prior to engaging in the loop, then iterate on that value.
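A sketch of that, with std::vector standing in for the object (the function name is mine):

```cpp
#include <vector>
#include <cstddef>

// Fetch the count once; the loop condition becomes a register compare
// instead of a call the compiler can't always prove is loop-invariant.
int total(const std::vector<int>& items) {
    int sum = 0;
    const std::size_t n = items.size();   // hoisted out of the condition
    for (std::size_t i = 0; i < n; ++i)
        sum += items[i];
    return sum;
}
```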
 