Post your code optimizations

arjan de lumens · May 22, 2006

psurge said:
arjan, Xmas:
I totally forgot about this x87 stuff. I would like to retract my earilier hasty comment. A serious question though: I had thought that SSE (1,2,3?) included scalar fp instructions (which encode the operand precision in the instruction) - are those not there yet from a compiler/HW POV?

The scalar SSE instructions tend to have large instruction encodings, leading to large executables, and they are sometimes actually slower than their x87 counterparts - in particular, the P4 can e.g. schedule 1 FADD per clock cycle, but only 1 ADDSS every 2 clock cycles. I also suspect (but don't know for certain; I don't have one for testing right here) that the AthlonXP actually decodes the ADDSS as a vector instruction, which adds a nearly 1 cycle penalty compared to FADD. The assumption that everyone has an SSE-capable processor is still not considered 100% safe either, even now.

As such, scalar SSE does not normally offer any real benefit over x87 for ordinary use (except perhaps on AMD64, where scalar SSE has access to 16 registers and is not hampered by silly scheduling limitations).

psurge · May 22, 2006

Xmas said:
Well, I now understand why the 64 bits of the pseudo-70 (or 69)-bit number need to have no sequence of zeros, because the implicit last bits are zero. But still there is no problem at all with ((v & -v) * M) == 0, as long as that result is unique as well.
And because of this, 6 bit sequences are sufficient.

Xmas, i couldn't follow your post but... let me try again:

First note that
(v & -v) can take 65 different values: { 0, 2^0, 2^1, .... , 2^63 }.

(v & -v)*M = 0 when v = 0, regardless of M, so if we are using the highest
N bits of (v&-v)*M as indexes into the lookup table, then (M << (N-1))
cannot contain any sequence of N consecutive zero bits. Otherwise, for some v != 0,
one has ((v & -v) * M) >> (64 - N)) = 0 and the method breaks.

Finally, N cannot be 6 since this gives you only 2^6 = 64 indexes (and you have 65 possible return values).

You can get away with using 6 bits for your index only if you treat v = 0 as a special case for which garbage is returned. The GNU __builtin_ctz and __builtin_clz do this, but IMO it makes a lot more sense to just return 64.

sorry if I'm missing something or being unclear -
Serge

KimB · May 23, 2006

arjan de lumens said:
pow(a,b):

If b is a positive integer, then you can use the Exponentiation by squaring algorithm.

if b is a negative integer, then you can use the relation pow(a,b) = pow(1/a, -b) and then apply exponentiation-by-squaring.

Otherwise you will have to use the relation pow(a,b) = exp(b*log(a))

You'd probably also want to do some performance testing to ensure that if b is greater than some number, that the method exp(b*log(a)) is used instead. Otherwise the compute time required will just keep growing, while exp(b*log(a)) executes in constant time.

Rodéric · May 23, 2006

I'm doing that small optimisation, doesn't cost much, doesn't hurt readability, and last time I checked ++i was shorter (in ASM) than i++.

Code:

for ( int i = 0; i < iCount; ++i )

Nick · May 23, 2006

Ingenu said:
I'm doing that small optimisation, doesn't cost much, doesn't hurt readability, and last time I checked ++i was shorter (in ASM) than i++.

Code:

for ( int i = 0; i < iCount; ++i )

I'm sorry but that's nonsense. They both translate to the same instruction on any modern compiler. So for the sake of consistency with the rest of the world just write i++.

arjan de lumens · May 23, 2006

Chalnoth said:
You'd probably also want to do some performance testing to ensure that if b is greater than some number, that the method exp(b*log(a)) is used instead. Otherwise the compute time required will just keep growing, while exp(b*log(a)) executes in constant time.

You may additionally wish to check that 'a' is positive before you resort to exp(b*log(a)) (perhaps using the identity pow(a,b) = pow(-a,b) * pow(-1, b % 2), where the latter pow() can be implemented as a 2-entry lookup table), and also that your exponentiation-by-squaring implementation does not perform an actual if-then-else branch in its innerloop.

Ingenu said:
I'm doing that small optimisation, doesn't cost much, doesn't hurt readability, and last time I checked ++i was shorter (in ASM) than i++.

Code:

for ( int i = 0; i < iCount; ++i )

For a standalone expression, as in that for-loop, there is no reason that ++i should be faster than i++ (1 instruction in x86 either way). In a compound expression, it's a bit different: ++i is then always evaluated as a part of the expression while i++ effectively functions as a separate expression to be evaluated after the compound expression is done. This means that the compiler is more likely to spill i out of the register file between the use and increment in the i++ case than in the ++i case, adding at worst about 1-2 instructions. OTOH, using ++i causes the increment operation to be attached to the dependency chain of the compound expression, which can increase the latency of evaluating the expression.

Npl · May 23, 2006

Nick said:
I'm sorry but that's nonsense. They both translate to the same instruction on any modern compiler. So for the sake of consistency with the rest of the world just write i++.

if i is something more complicated, like a iterator-instance of a STL-Container, ++i is faster than i++. ++i increments i and returns i, i++ creates a new iterator-instance (copy of i`s current value) then increments i then returns the new instance.
it doesnt gets you anything with a simple integer obviously, but I`m too using ++i, if only for consistency with loops where i use iterators.

My own optimizations are few, I usually prefer the tertiary statement over if-else whenever possible, as compilers then more reliably optimize branches away ( current compilers could be alot better in that regard, but I got used to this after evaluating MSVC 6.0 ).

I also often review the assembly code created, the most common downfall for compilers is if they have to take the most generic case into acount, ie. expect than a variable could be altered concurrently and load/store it multiple times. Usually can be fixed by restrictive qualifiers like const or additional local variables.

Rodéric · May 23, 2006

I knew it was useful, even if I coudln't remember why, I did took that habit 6 years ago, so...

off topic:
Makes me wonder how evil this is

Code:

int iIndex;
for ( iIndex = 0; (iIndex < iCount) && (thingy[ iIndex ] != pFoo); ++iIndex );
return iIndex;

Nick · May 23, 2006

arjan de lumens said:
The scalar SSE instructions tend to have large instruction encodings, leading to large executables, and they are sometimes actually slower than their x87 counterparts - in particular, the P4 can e.g. schedule 1 FADD per clock cycle, but only 1 ADDSS every 2 clock cycles. I also suspect (but don't know for certain; I don't have one for testing right here) that the AthlonXP actually decodes the ADDSS as a vector instruction, which adds a nearly 1 cycle penalty compared to FADD. The assumption that everyone has an SSE-capable processor is still not considered 100% safe either, even now.

As such, scalar SSE does not normally offer any real benefit over x87 for ordinary use (except perhaps on AMD64, where scalar SSE has access to 16 registers and is not hampered by silly scheduling limitations).

The linear register file makes up for some of the inefficiencies in complex computations. Also, the Athlon 64 has no extra penalties so it's very good at scalar SSE. The Pentium 4 architecture is a bit cripled but has the advantage of a high clock frequency (compared to Athlon 64 - but obviously it helps x87 the most).

But most importantly for the future, the soon to be released Intel Core 2 architecture will be extremely well equipped for SSE. It has 128-bit wide execution units (instead of 64-bit), and three issue ports (instead of one). It can also decode up to four SSE instructions per clock cycle. So SSE will be much faster than x87 even for scalar operations.

For the interested here's a scalar SSE version of the fraction calculation:

Code:

inline float frc(float x)
{
	const float MAGIC = 12582912.0f;

	__asm
	{
		movss xmm0, x
		movss xmm1, xmm0

		addss xmm0, MAGIC
		subss xmm0, MAGIC
		subss xmm1, xmm0

		movss x, xmm1
	}

	return x;
}

To make it behave like x - floor(x), also for negative arguments, I do this:

Code:

inline float frc(float x)
{
	const float MAGIC = 12582912.0f;
	const float HALF = 0.4999999f;

	__asm
	{
		movss xmm0, x
		movss xmm1, xmm0

		subss xmm0, HALF
		addss xmm0, MAGIC
		subss xmm0, MAGIC
		subss xmm1, xmm0

		movss x, xmm1
	}

	return x;
}

Note that HALF is 0.5 minus a tiny epsilon (I could have used the binary format of the floating-point value below 0.5 but that's uglier). This is required to have the right rounding for arguments 1.0 and -1.0. I'm not entirely sure it always works correctly but some basic tests good results.

Nick · May 23, 2006

Ingenu said:
Makes me wonder how evil this is

Code:

int iIndex; for ( iIndex = 0; (iIndex < iCount) && (thingy[ iIndex ] != pFoo); ++iIndex ); return iIndex;

It's not exactly evil. Any trained programmer should be able to understand it. But it can be improved (the semicolon is easy to miss). I think it makes most sense to write it as a while loop:

Code:

int index = 0;

while((index < count) && (thingy[index] != foo))
{
    index++;
}

return index;

This makes it much clearer that incrementing the index to find a certain element is the actual thing you're doing. It won't be faster nor slower.

By the way, your Hungarian notation is inconsistent for thingy. Also, index and count are so obviously integers that it seems silly to write iIndex and iCount. But this discussion belongs in the code style thread.

Xmas · May 23, 2006

psurge said:
(v & -v) can take 65 different values: { 0, 2^0, 2^1, .... , 2^63 }.

Right, I did not consider the case of v being zero. I guess it comes down to whether you want to special-case 0 or have a <128 byte (I'm too lazy to calculate the minimum now

)LUT instead of a 64 byte one.

arjan de lumens · May 23, 2006

For a whole bunch of nifty little bit-fiddling algorithms/tricks and other micro-optimizations, you guys may check out these links:

and for some things you should NOT do (in terms of both performance and coding style), http://thedailywtf.com would probably be a nice place to start. One of my co-workers recently very nearly WTFed me for the following little gem (the function involved has been renamed to 'Foo' to protect the guilty):

Code:

int Foo( int x, int y, int z );

int xFoo( int i )
    {
    return Foo( (i & 0xFF), ((i>>8) & 0xFF), ((i>>16) & 0xFF) );
    } 

#define Foo(A, B, C) xFoo( A | (B << 8) | (C << 16))

The motivation for this bizarre little piece of code was that Foo() in my program was called from hundreds of places, always with 3 very small constant arguments, and I needed to reduce the size of the compiled object file in a hurry - IIRC, the object file in question shrank by ~5 Kbytes by this trick alone.

Simon F · May 23, 2006

arjan de lumens said:
The motivation for this bizarre little piece of code was that Foo() in my program was called from hundreds of places, always with 3 very small constant arguments, and I needed to reduce the size of the compiled object file in a hurry - IIRC, the object file in question shrank by ~5 Kbytes by this trick alone.

I've done similar on occasions by packing a few parameters (or pointers to params) into a struct and passing a pointer to that.

Actually, your example reminds me of the trick to process two colour channels at a time by leaving an 8-bit gap in a 32-bit int.

Another useful one for 8-bit colour manipulation (e.g. alpha blending) is described by Jim Blinn in "3 wrongs make a right".

Frank · May 23, 2006

Code optimizations?

I think the only things I ever do as long as everything runs "fast enough" is using arrays and growing them by hand instead of classes for storage (or use a single instance of that array for all the element instances), or for indices (lookup tables rule), and making a point of traversing unstructured data serially instead of using inbuild functions for finding, sorting and parsing.

That's about it, I think. I agree with the "premature optimizations are the root of all evil" school of thinking, and value readability above all.

I like the "magic number" examples, but I would probably use a lookup table myself if I had to come up with it.

ShootMyMonkey · May 23, 2006

So for the sake of consistency with the rest of the world just write i++.

Since when does the rest of the world use postincrement? I haven't used it for a long time in a forloop, and I can't recall ever meeting someone who would use it when not necessary (but then, I do live in a circle of people who care about speed). If I actually find it useful to use index i and then increment after that, then I'd use i++. Otherwise, never.

The only reason it ever works out to be the same in ASM is because in standalone instructions on a simple type, the compiler optimizes away the copying operation. That would likely never happen, for instance, on a class with an overloaded ++ operator.

[maven] · May 23, 2006

ShootMyMonkey said:
The only reason it ever works out to be the same in ASM is because in standalone instructions on a simple type, the compiler optimizes away the copying operation. That would likely never happen, for instance, on a class with an overloaded ++ operator.

Or because you were using a less dangerously implicit language than C++...

Using a single variable to hold multiple data entries (essentially SIMD without preventing overflows) can definitely be useful, e.g.

Code:

/** Converts a Hilbert Curve derived key into the coordinates of a point.
 * This implementation is based on the paper "Vertex-Labeling Algorithms for the
 * Hilbert Spacefilling Curve" by John J. Bartholdi, III and Paul Goldsman. I
 * have combined the x and y coordinate into a single unsigned int. Also, due to
 * fixed point arithmetic, the coordinates of the points grow with each
 * iteration so that sufficient accuracy is maintained.
 * @param[in] key Hilbert derived key to be converted. In essence, we interpret
 *   every two bits of the key as a certain spatial subdivision of the
 *   unit-square, as described in the paper.
 * @param[in] numbits number of bits in the key (which has to be even). Only
 *   valid for even values smaller than or equal to 30 (as otherwise the point
 *   coordinates overflow or the key does not provide enough bits).
 * @param[in] a, b, c, d coordinates (combined x and y) for vertex labelling.
 * @return fixed point coordinates with a precision of numbits / 2 (combined).
**/
static unsigned int recurse_hilbert(unsigned int key, unsigned int numbits,
	unsigned int a, unsigned int b, unsigned int c, unsigned int d)
{
	if (numbits > 1)
	{
		unsigned char subcell;

		numbits -= 2; /* each subdivision decision takes 2 bits */
		subcell = (key >> numbits) & 3; /* namely these two (taken from the top) */

		switch (subcell)
		{ /* divide into subcells */
			case 0:
				return recurse_hilbert(key, numbits, a << 1, a + d, a + c, a + b);
			case 1:
				return recurse_hilbert(key, numbits, b + a, b << 1, b + c, b + d);
			case 2:
				return recurse_hilbert(key, numbits, c + a, c + b, c << 1, c + d);
			case 3:
				return recurse_hilbert(key, numbits, d + c, d + b, d + a, d << 1);
		}
	}
	else
	{ /* final result is the midpoint of the cell, i.e. (a + b + c + d) / 4 */
		unsigned int outx, outy;

		/* extract x from the lower 16 bits */
		outx = ((a & 0xffff) + (b & 0xffff) + (c & 0xffff) + (d & 0xffff) + 1) >> 2;
		/* extract y from the upper 16 bits */
		outy = ((a >> 16) + (b >> 16) + (c >> 16) + (d >> 16) + 1) >> 2;
		return (outy << 16) | outx;
	}
	return 0; /* make gcc happy */
}

although not quite as nice as the reverse storage of fixed point values (fractional part in the upper word) and using adc to add deltas. That struck me as ingenious back in the day of software texture mapping.

arjan de lumens · May 23, 2006

I notice that the 'recurse_hilbert' function is actually tail-recursive - gcc indeed produces a completely recursion-free function when I try to compile it. In this particular case, I think I would prefer to return data as a struct with 2 shorts in it (readable, causes redundant data copying) - or pass into the function a pointer to a struct that I can fill up with return data (if I need maximum performance).

Playing tricks with the ADC instruction is no longer nearly as attractive as it used to be; the instruction takes something like 7-8 cycles on Pentium4.

As for the 'leaving-a-space-of-8-bits' trick for computing two color components at the same time - I have seen a variant that leaves two 5-bit holes to allow computation on three 5-bit color components at the same time.

Nick · May 23, 2006

ShootMyMonkey said:
Since when does the rest of the world use postincrement? I haven't used it for a long time in a forloop, and I can't recall ever meeting someone who would use it when not necessary (but then, I do live in a circle of people who care about speed). If I actually find it useful to use index i and then increment after that, then I'd use i++. Otherwise, never.

For a loop index I practially always see i++. It's not that ++i is wrong, but writing i++ is sort of a historical convention that is a tiny bit more fluent to read. It's like writing C++ instead of ++C.

I mean, i++ easily reads like "i incremented", or "i plus one", while ++i is more like a bunch of plusses followed by i. Prefix operators are just a tiny bit harder to read. And in my opinion any possible performance advantage from using them is not worth the ease of reading a postfix operator.

By the way, even for things like iterators post-increment gets optimized because the operator is inlined. Dead code elimination is really very effective these days.

The way I actually think about high-level code style versus optimizations is that the compiler is perfect. So something like doing a loop backward or writing ++i instead of i++ can in theory not improve performance, it only degrades readability. And even if in practice compilers aren't perfect then the tiny performance advantage of sacrificing code quality is still not worth it. Compilers also get better every year. An 'optimization' that turned out to be effective ten years ago can now be nothing more than code 'noise' creating confusion. Real optimizations are algorithmic optimizations. And if you do care about the performance of the generated assembly code, just write it in assembly. It's the only way to make sure you get what you want, independent of compiler capabilities.

Nick · May 23, 2006

[maven] said:
although not quite as nice as the reverse storage of fixed point values (fractional part in the upper word) and using adc to add deltas. That struck me as ingenious back in the day of software texture mapping.

Speaking of which, if anyone knows any incredibly fast software texture mapping (bilinear filtering in particular) that could speed up SwiftShader, I'm all ears.

MfA · May 23, 2006

Assuming whether you use i++ or ++i matters for the correctness, and you change your program flow to use one or the other I wouldn't be so sure pre-increment is faster. With pre-increment you avoid creating a copy, and you create a short range dependency ... bit of a mixed bag. Sure for a complex iterator the copy weighs heavily, but for an int?

Assuming it doesn't matter which you use I doubt a compiler creates different code either way.

Post your code optimizations

arjan de lumens

psurge

KimB

Rodéric

a.k.a. Ingenu

Nick

arjan de lumens

Npl

Rodéric

a.k.a. Ingenu

Nick

Nick

Xmas

Porous

arjan de lumens

Simon F

Tea maker

Frank

Certified not a majority

ShootMyMonkey

[maven]

arjan de lumens

Nick

Nick

MfA

Similar threads