Empirical parametric runtime code optimization.

soylent

Newcomer
So I was playing around with some time-critical inner loops, translating them into SSE using the SSE intrinsics, aligned malloc, __declspec(align(16)) declarations and so forth. The first thing I noticed is that using the intrinsics sucks; I usually ended up with lower performance than straight x87 FPU C code for simple stream-processing-like loops. This was puzzling, so I learned enough assembler to do the same thing in inline asm and got a nice speed bump over the x87 FPU code.

So now I was trying to improve my asm code. I noticed that even simple things could make a huge difference.

For some loops, including a prefetch instruction was a performance loss of 5% or so; for others it was a performance gain of 20% or so, and it's fairly sensitive to how far ahead you prefetch data.

Instruction order is another simple parameter that made a few percent of difference.

Unrolling the loop to handle two data points at a time was sometimes a performance gain due to less loop overhead, and sometimes a loss due to the additional juggling of constants and such in and out of the xmm registers.
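
To make the "prefetch distance" and "unroll factor" knobs concrete, here is a minimal sketch (using intrinsics purely for readability, and with made-up names like scale_stream and made-up distances) of a loop where the prefetch distance is a compile-time parameter you could instantiate at several values and race against each other:
Code:
// Hypothetical sketch: a simple SSE stream loop where the prefetch distance
// (in floats ahead of the current index) is a template parameter, so several
// distances can be instantiated and benchmarked against each other.
#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics: _mm_load_ps, _mm_prefetch, ...

template <int PREFETCH_AHEAD>
void scale_stream(float* dst, const float* src, float scale, std::size_t n)
{
    const __m128 s = _mm_set1_ps(scale);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)                   // unrolled by two (2 x 4 floats)
    {
        if (PREFETCH_AHEAD > 0)                  // compile-time constant
            _mm_prefetch(reinterpret_cast<const char*>(src + i + PREFETCH_AHEAD),
                         _MM_HINT_T0);
        __m128 a = _mm_load_ps(src + i);         // both pointers assumed 16-byte aligned
        __m128 b = _mm_load_ps(src + i + 4);
        _mm_store_ps(dst + i,     _mm_mul_ps(a, s));
        _mm_store_ps(dst + i + 4, _mm_mul_ps(b, s));
    }
    for (; i < n; ++i)                           // scalar remainder
        dst[i] = src[i] * scale;
}

You would then instantiate, say, scale_stream<0>, scale_stream<64> and scale_stream<128> and time each one on the target machine.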

Pure algorithmic changes sometimes give a good speed-up for no apparent reason.

CPUs are very complex things, with a lot of prefetching, out-of-order execution, register renaming and other cleverness, to the extent that you can't assume that just because a particular version of the code performs best on an Athlon 64 it will also be the one that performs best on a Core 2 Duo or a future processor.

It seems fairly straightforward to write a small framework that lets you optimize for a few parameters at runtime on the first run. You'd want to test several versions of the code; you'd want to fiddle with the prefetch, either leaving it out or changing how many data points into the future it prefetches; and you might also want to fiddle with the order of independent instructions. Once you've benchmarked these alternatives you'd select a particular version of the code for use and store that choice in some sort of settings file.
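
A minimal sketch of what such a first-run selection framework could look like; the KernelFn signature, the repetition count and the notion of storing the winning index in a settings file are all assumptions, not an existing API:
Code:
// Hypothetical sketch of the "benchmark on first run, then remember the
// winner" framework.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

typedef void (*KernelFn)(float* dst, const float* src, std::size_t n);

// Time one candidate over a few repetitions and keep the best observed time,
// which is less sensitive to interruptions than an average.
static double time_kernel(KernelFn fn, float* dst, const float* src, std::size_t n)
{
    using clock = std::chrono::steady_clock;
    double best = 1e300;
    for (int rep = 0; rep < 16; ++rep)
    {
        auto t0 = clock::now();
        fn(dst, src, n);
        auto t1 = clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// On first run, benchmark all candidates; the returned index is what you
// would write to the settings file and reuse on later runs.
std::size_t pick_fastest(const std::vector<KernelFn>& candidates,
                         float* dst, const float* src, std::size_t n)
{
    std::size_t best_idx = 0;
    double best_time = 1e300;
    for (std::size_t i = 0; i < candidates.size(); ++i)
    {
        double t = time_kernel(candidates[i], dst, src, n);
        if (t < best_time) { best_time = t; best_idx = i; }
    }
    return best_idx;
}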

For a 0-20% performance gain on inner loops that profiling shows to be major CPU hogs, wouldn't this be a good idea for inner loops that will never be changed?

Most notably, wouldn't it be worthwhile in graphics drivers as those can hog quite a bit of CPU and they're used by so many games?
 
ATLAS (Automatically Tuned Linear Algebra Software) does something like this at install time: they generate lots of versions of each core procedure with different block sizes, loop orderings, FP instruction set, prefetch distance -- all the things you mentioned -- compile them, benchmark them, and then the final installed version gets whatever was fastest.

I know at one point the Linux kernel did something similar but simpler for a few inner loops like memcpy and the checksum/hash/whatever used for error-checking in TCP. They had a bunch of hand-coded routines optimized for different processors, and at every boot they'd benchmark them and pick the fastest.

ATLAS has some ability to get an optimal version on CPUs that didn't exist when it was written since it explores a large parameter space. But the install process took a couple hours last time I tried...

This is an optimization problem (in the mathematical sense), and like many optimization problems there are sub-optimal local minima and other traps; you do need a reasonably sophisticated search algorithm, or lots of time for a brute-force approach.
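
As a rough illustration of a "less brute force" search, here is a sketch of random-restart hill climbing over a tiny, made-up parameter space; measure() is an assumed callback that selects the variant for the given parameters and returns its run time:
Code:
// Hypothetical sketch: random-restart hill climbing over a small, discrete
// parameter space, as an alternative to exhaustive search.
#include <array>
#include <cstdlib>
#include <functional>

struct Params { int prefetch_ahead; int unroll; };

Params search(const std::function<double(const Params&)>& measure, int restarts)
{
    const std::array<int, 4> prefetch = { 0, 32, 64, 128 };   // illustrative values
    const std::array<int, 3> unroll   = { 1, 2, 4 };

    Params best = { prefetch[0], unroll[0] };
    double best_t = measure(best);

    for (int r = 0; r < restarts; ++r)
    {
        int pi = static_cast<int>(std::rand() % prefetch.size());   // random start
        int ui = static_cast<int>(std::rand() % unroll.size());
        double cur_t = measure(Params{ prefetch[pi], unroll[ui] });

        bool improved = true;
        while (improved)                          // climb while a neighbour is better
        {
            improved = false;
            for (int dp = -1; dp <= 1; ++dp)
                for (int du = -1; du <= 1; ++du)
                {
                    int npi = pi + dp, nui = ui + du;
                    if (npi < 0 || nui < 0 || npi >= (int)prefetch.size()
                                           || nui >= (int)unroll.size())
                        continue;
                    double t = measure(Params{ prefetch[npi], unroll[nui] });
                    if (t < cur_t) { cur_t = t; pi = npi; ui = nui; improved = true; }
                }
        }
        if (cur_t < best_t) { best_t = cur_t; best = Params{ prefetch[pi], unroll[ui] }; }
    }
    return best;
}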

As for graphics drivers, I'm not sure they have very many compute-heavy inner loops left now that vertex processing has pretty firmly moved to the GPU. Well, except for Intel... The work the driver does these days is mostly like "general applications" -- integer, branchy, non-coherent, all that nasty stuff. D3D 9 allows the app to choose software vertex processing, in which case the runtime (not the driver) uses hand-tuned libraries supplied by Intel and AMD. I'm sure that internally those libraries pick between several versions based on which particular CPU you're on. I'm not sure how often these libraries are updated as new CPUs are released though, so maybe they could benefit from a parametric auto-generation approach.
 
For the longest time, this was the only way to tell if a rendering feature in OpenGL was fast enough to be used. There were a number of performance suites around to measure various rendering paths.

I'm not familiar with anything similar for CPUs, though; then again, I've mostly worked on consoles, where the CPU is a known quantity.
 
My SoftWire project allows you to do exactly this.

But in my experience the cost/gain ratio is really not that interesting, and it's getting 'worse' as CPUs get better. The first problem is that the search space is immense: a sequence of just 10 independent instructions can be ordered in 10! different ways, i.e. 3,628,800 permutations. Secondly, it's very hard to do accurate performance measurements. Even with thread priority at maximum you need a few milliseconds per measurement to get a good average. Furthermore, you have to warm up the cache to get consistent measurements, but in actual use cache misses have an influence that is far from negligible.
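
For what it's worth, a sketch of the kind of measurement harness this implies, with a warm-up pass and a median over repeated runs; the run count is arbitrary and kernel() is just a stand-in for the variant under test:
Code:
// Hypothetical measurement harness: one warm-up pass, then the median of
// several timed runs to damp interrupts and scheduling noise.
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

double median_time(const std::function<void()>& kernel, int runs = 15)
{
    using clock = std::chrono::steady_clock;

    kernel();                                    // warm-up: touch code and data once

    std::vector<double> samples;
    samples.reserve(runs);
    for (int i = 0; i < runs; ++i)
    {
        auto t0 = clock::now();
        kernel();
        auto t1 = clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::nth_element(samples.begin(), samples.begin() + runs / 2, samples.end());
    return samples[runs / 2];                    // median run time in seconds
}

Of course the warm-up is exactly the double-edged sword mentioned above: it makes the numbers repeatable, but real use includes the cache misses the warm-up hides.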

Michael Abrash reached the same conclusions with his Speedboy tool.

I find it much more productive to do just a couple of manual performance measurements and take the code you expect to work best on the range of CPUs it's meant to run on. In the end the performance gain is pretty low, and you can get much bigger and more consistent gains from algorithmic and generic optimizations.
 
Some very interesting replies; thanks.

Nick, does the parameter space need to be that large?

If you can identify the kinds of permutations and changes that tend to be useful and only search that subset, you don't have to explore as many parameters. The simplest case is to just have a few hand-written equivalents and select the fastest one; that shouldn't be such a can of worms, as equivalence checking on corner cases and random valid inputs should easily reveal obvious bugs.
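
A sketch of such an equivalence check under assumed interfaces (the KernelFn signature and the tolerance are illustrative):
Code:
// Hypothetical sketch: run two variants on a corner case plus random inputs
// and compare the outputs within a tolerance.
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <vector>

typedef void (*KernelFn)(float* dst, const float* src, std::size_t n);

bool equivalent(KernelFn reference, KernelFn candidate, std::size_t n, int trials)
{
    std::vector<float> src(n), a(n), b(n);
    for (int t = 0; t < trials; ++t)
    {
        for (std::size_t i = 0; i < n; ++i)
            src[i] = (t == 0) ? 0.0f             // trial 0: all-zero corner case
                              : static_cast<float>(std::rand()) / RAND_MAX - 0.5f;
        reference(a.data(), src.data(), n);
        candidate(b.data(), src.data(), n);
        for (std::size_t i = 0; i < n; ++i)
            if (std::fabs(a[i] - b[i]) > 1e-5f)  // allow for small FP reordering drift
                return false;
    }
    return true;
}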

Would it be worthwhile to pursue a slightly less brute-force approach like simulated annealing or steepest descent from a set of random starting locations in parameter space?

Speed and reliability seem like very important aspects to optimize for if such methods are to be useful outside of a niche.

Choosing N different parameter combinations at a time and testing them a few times in random order ought to reduce variation caused by other processes and by whatever state the CPU happens to be in when it enters your test.
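
A sketch of that randomised, interleaved measurement order, assuming the candidates can be wrapped as simple callables:
Code:
// Hypothetical sketch: measure all candidates round-robin, in a freshly
// shuffled order each round, and keep the best time seen per candidate, so
// no variant is systematically helped or hurt by the machine's initial state.
#include <algorithm>
#include <chrono>
#include <functional>
#include <random>
#include <vector>

std::vector<double> interleaved_times(const std::vector<std::function<void()> >& candidates,
                                      int rounds)
{
    using clock = std::chrono::steady_clock;
    std::vector<double> best(candidates.size(), 1e300);
    std::vector<std::size_t> order(candidates.size());
    for (std::size_t i = 0; i < order.size(); ++i)
        order[i] = i;

    std::mt19937 rng(std::random_device{}());
    for (int r = 0; r < rounds; ++r)
    {
        std::shuffle(order.begin(), order.end(), rng);      // new random order each round
        for (std::size_t j = 0; j < order.size(); ++j)
        {
            auto t0 = clock::now();
            candidates[order[j]]();
            auto t1 = clock::now();
            best[order[j]] = std::min(best[order[j]],
                                      std::chrono::duration<double>(t1 - t0).count());
        }
    }
    return best;                                            // best observed time per candidate
}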

How does the NX bit work? Is it opt-in or opt-out from an application's point of view? Do I need to do something special to tell the OS that a memory buffer may be 'legally' used as code?
 
Nick, does the parameter space need to be that large?
No, it doesn't. But it's definitely an explosive problem; even a 'handful' of parameters is too many. Part of the problem is that you need more parameters to get an appreciable gain. For just one or two parameters it's most often possible to find a good compromise manually; I doubt you couldn't find a single prefetch distance that works well enough for all brands and models...
Would it be worthwhile to pursue a slightly less brute-force approach like simulated annealing or steepest descent from a set of random starting locations in parameter space?
It depends on the situation. If you know it converges to a reliable optimum in little time then it can be worthwhile. But if someone runs it on say a Transmeta processor it might have totally different behavior.

So I would avoid using it at run-time. It can be useful to find a compromise that is close to optimal, but I wouldn't risk doing it during program execution.
How does the NX bit work? Is it opt-in or opt-out from an application's point of view? Do I need to do something special to tell the OS that a memory buffer may be 'legally' used as code?
It's opt-out. You have to make the memory buffer executable. SoftWire uses the following code:
Code:
#ifdef WIN32
    unsigned long oldProtection;
    VirtualProtect(machineCode, length, PAGE_EXECUTE_READWRITE, &oldProtection);   // #include <windows.h>
#elif __unix__
    mprotect(machineCode, length, PROT_READ | PROT_WRITE | PROT_EXEC);   // #include <sys/mman.h>
#endif
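
An alternative (not SoftWire's code, just a sketch) is to allocate the buffer as executable from the start; this also avoids worrying about the fact that mprotect expects a page-aligned address, since VirtualAlloc and mmap hand back page-aligned memory anyway. As in the snippet above, length is assumed to come from the surrounding context:
Code:
// Alternative sketch: allocate the code buffer as executable up front
// instead of changing the protection afterwards.
#ifdef WIN32
    #include <windows.h>
    void* machineCode = VirtualAlloc(0, length, MEM_COMMIT | MEM_RESERVE,
                                     PAGE_EXECUTE_READWRITE);
#elif __unix__
    #include <sys/mman.h>
    void* machineCode = mmap(0, length, PROT_READ | PROT_WRITE | PROT_EXEC,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
#endif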
 
It is not easy to do something like that.

First, you can measure the impact of prefetch, but will it behave the same way in a measurement loop as it does in the actual code? You would have to run the actual code to test it properly, which sort of defeats the purpose of measuring, right?

Then, not only the prefetch distance but also its position in the code matters.

Cache line size plays a role too.

Associativity is a major pain because it determines the maximum block size you can use.

Certain instruction mixes perform better on certain flavors of CPUs.

Then the AMD memory controller likes interleaved loads and stores, while Intel prefers separate loads and stores.

But the worst, IMO, is aliasing. You can't work around that dynamically. You would have to vary the starting addresses of all the arrays you access in a piece of code and measure the performance each time -- plenty of combinations there.
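
As a sketch of what varying a starting address could look like for a single array (the callback time_kernel_at and the offset range are assumptions):
Code:
// Hypothetical sketch: over-allocate, shift the array base in cache-line
// steps within one page, and time the kernel at each offset.
#include <cstddef>
#include <functional>
#include <vector>

std::size_t best_offset(const std::function<double(float*)>& time_kernel_at,
                        std::size_t elements)
{
    const std::size_t slack = 4096;                       // one page of extra room
    std::vector<char> storage(elements * sizeof(float) + slack + 16);

    std::size_t best = 0;
    double best_t = 1e300;
    for (std::size_t off = 0; off < slack; off += 64)     // cache-line sized steps
    {
        // assumes storage.data() is at least 16-byte aligned; align explicitly if not
        float* base = reinterpret_cast<float*>(storage.data() + off);
        double t = time_kernel_at(base);                  // run and time the kernel at this base
        if (t < best_t) { best_t = t; best = off; }
    }
    return best;                                          // offset with the least aliasing penalty
}

With several arrays in play the offset combinations multiply quickly, which is exactly the explosion described above.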

In the end, it is much better to let a good compiler (like the Intel C++ Compiler, for example) do the majority of the work for you. Then you only have to fine-tune the most performance-critical parts of the code.
 
Personally, and especially with a game, I would take a distributed-computing approach. I would use something like LLVM to let the program compile variations on each client's computer, then periodically send the improvements to a central server. There they can be amalgamated by processor type, system configuration, etc. and delivered on an as-needed basis to the clients. Moreover, future software would have the advantage of this comprehensive library.

Just for kicks, if it were a game, I would offer rewards to aspiring programmers for optimizing core routines beyond what the in-house team did. I wonder how many thousands of hours of effort I could reap on the cheap?
 
Personally, and especially with a game, I would take a distributed-computing approach. I would use something like LLVM to let the program compile variations on each client's computer, then periodically send the improvements to a central server. There they can be amalgamated by processor type, system configuration, etc. and delivered on an as-needed basis to the clients. Moreover, future software would have the advantage of this comprehensive library.
JIT compilation kind of already does that. A processor-specific compiler takes the bytecode and tries to generate optimal code for the system. Some C# code on a Pentium 4 should beat a C++ program compiled for a 386.

That's not always the case in practice, and a JIT doesn't second-guess the code it has already generated, but I think it still has some interesting development ahead.
 
JIT has a problem. It is still an abstraction when you want to get down to the metal.

I have recently optimized old hand-written SSE assembler code, originally written for Northwood, which was tuned for Prescott over time so that it performed well on everything from the Pentium III to the latest Pentium D. When the Core 2 Duo came out I had to revisit the code, since the Intel C++ compiler was beating it. I managed to fix it: it was some aliasing, so I hand-tuned the array starting positions on the stack and beat the compiler once again, and by a margin, especially in 64-bit code.

What is interesting is that I didn't use any new instructions, and when I tested it on the older machines it was faster there by about the same amount. Goes to show that good optimization is good optimization wherever you run it.
 
Most of the time, changing the algorithm has the most impact. And I tend to focus on the representation of the data. A small lookup table is probably the best overall method to speed things up.
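
As a trivial illustration of the lookup-table point (the bit-count operation is chosen purely as an example):
Code:
// Trivial illustration: a 256-entry lookup table replaces a per-byte
// computation and stays resident in L1 once touched.
// Call init_popcount_lut() once at startup.
#include <cstdint>

static std::uint8_t popcount_lut[256];

static void init_popcount_lut()
{
    for (int i = 0; i < 256; ++i)
    {
        int count = 0;
        for (int b = 0; b < 8; ++b)
            count += (i >> b) & 1;               // count the set bits of byte value i
        popcount_lut[i] = static_cast<std::uint8_t>(count);
    }
}

inline int popcount32(std::uint32_t x)
{
    return popcount_lut[x & 0xff] + popcount_lut[(x >> 8) & 0xff] +
           popcount_lut[(x >> 16) & 0xff] + popcount_lut[x >> 24];
}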
 