So I was playing around with some time-critical inner loops, translating them into SSE using the SSE intrinsics, aligned malloc, __declspec(align(16)) declarations and so forth. The first thing I noticed is that using the intrinsics sucks: I usually ended up with lower performance than straight x87 FPU C code for simple stream-processing-style loops. This was puzzling, so I learned enough assembler to do the same thing in inline asm and got a nice speed bump over the x87 FPU code.
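To make the setup concrete, here's a minimal sketch of the kind of loop I mean, using the SSE intrinsics; the function name and the scale-by-a-constant workload are just for illustration, not from any real codebase:

```cpp
#include <cstddef>
#include <xmmintrin.h>   // SSE intrinsics

// Scale an array of floats by a constant, 4 elements per iteration.
// Assumes 'n' is a multiple of 4 and 'data' is 16-byte aligned
// (e.g. allocated with _aligned_malloc(n * sizeof(float), 16) on MSVC).
void scale_sse(float* data, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);              // broadcast k into all 4 lanes
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);    // aligned load of 4 floats
        v = _mm_mul_ps(v, vk);               // 4 multiplies at once
        _mm_store_ps(data + i, v);           // aligned store back
    }
}
```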
So then I set about improving my asm code, and noticed that even simple changes could make a huge difference.
For some loops, including a prefetch instruction was a performance loss of around 5%; for others it was a gain of around 20%. It's also fairly sensitive to how far ahead you prefetch the data.
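Here's roughly what the prefetch experiment looks like; the PREFETCH_AHEAD distance is the knob I mean, and both the name and the value are made up for the sketch:

```cpp
#include <cstddef>
#include <xmmintrin.h>

// How many elements ahead to prefetch. Too small and the data isn't in
// cache when you need it; too large and you evict lines you still want.
const size_t PREFETCH_AHEAD = 64;

void scale_prefetch(float* data, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i < n; i += 4) {
        // _MM_HINT_T0 fetches into all cache levels; prefetch is only a
        // hint, so running past the end of the array here doesn't fault.
        _mm_prefetch((const char*)(data + i + PREFETCH_AHEAD), _MM_HINT_T0);
        __m128 v = _mm_mul_ps(_mm_load_ps(data + i), vk);
        _mm_store_ps(data + i, v);
    }
}
```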
Instruction order is another simple parameter that made a few percent of difference.
Unrolling the loop to handle two data points at a time was sometimes a performance gain thanks to lower loop overhead, and sometimes a loss due to the extra juggling of constants and such in and out of XMM registers.
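The unrolled variant, again as a sketch under the same assumptions (here the element count is assumed to divide evenly by 8):

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Two __m128 operations per trip through the loop, halving the loop
// overhead at the cost of more register pressure.
void scale_unrolled(float* data, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i < n; i += 8) {
        __m128 a = _mm_load_ps(data + i);
        __m128 b = _mm_load_ps(data + i + 4);
        a = _mm_mul_ps(a, vk);               // independent ops, so the CPU
        b = _mm_mul_ps(b, vk);               // can overlap their execution
        _mm_store_ps(data + i, a);
        _mm_store_ps(data + i + 4, b);
    }
}
```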
Pure algorithmic changes sometimes give a good speed-up for no apparent reason.
CPUs are very complex things; with all the prefetching, out-of-order execution, register renaming and other cleverness going on, you can't infer that just because a particular version of the code performs best on an Athlon 64, it will be the one that performs best on a Core 2 Duo or some future processor.
It seems fairly straightforward to write a small framework that lets you tune a few parameters at runtime on the first run. You'd want to test several versions of the code; you'd want to fiddle with the prefetch, either leaving it out or varying how many data points ahead it fetches; and you might also want to fiddle with the order of independent instructions. Once you've benchmarked the alternatives, you'd select a particular version of the code for use and store that choice in some sort of settings file.
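A bare-bones version of that framework might look like this; the variant table, timing loop, repetition count, and settings-file name are all invented for the sketch:

```cpp
#include <cstdio>
#include <cstddef>
#include <chrono>

typedef void (*KernelFn)(float*, size_t, float);

// The candidate implementations (the variants sketched above).
extern void scale_sse(float*, size_t, float);
extern void scale_prefetch(float*, size_t, float);
extern void scale_unrolled(float*, size_t, float);

static KernelFn g_variants[] = { scale_sse, scale_prefetch, scale_unrolled };
static const int g_numVariants = sizeof(g_variants) / sizeof(g_variants[0]);

// Time each variant on representative data and return the fastest index.
int tune(float* data, size_t n)
{
    using namespace std::chrono;
    int best = 0;
    double bestTime = 1e300;
    for (int v = 0; v < g_numVariants; ++v) {
        auto t0 = high_resolution_clock::now();
        for (int rep = 0; rep < 100; ++rep)   // repeat to beat timer noise
            g_variants[v](data, n, 1.0001f);
        double t = duration<double>(high_resolution_clock::now() - t0).count();
        if (t < bestTime) { bestTime = t; best = v; }
    }
    return best;
}

// On the first run, benchmark and cache the winner in a settings file;
// on later runs, just load the stored index.
int load_or_tune(float* data, size_t n)
{
    int best;
    FILE* f = fopen("kernel_tuning.cfg", "r");
    if (f && fscanf(f, "%d", &best) == 1 && best >= 0 && best < g_numVariants) {
        fclose(f);
        return best;
    }
    if (f) fclose(f);
    best = tune(data, n);
    f = fopen("kernel_tuning.cfg", "w");
    if (f) { fprintf(f, "%d", best); fclose(f); }
    return best;
}
```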
For a 0-20% performance gain on inner loops that profiling has shown to be major CPU hogs, wouldn't this be a good idea, at least for inner loops that will never change?
Most notably, wouldn't it be worthwhile in graphics drivers, since those can hog quite a bit of CPU and are used by so many games?