cristic said:
ShootMyMonkey said:
The real problem with QueryPerformanceCounter and QueryPerformanceFrequency is that they're so amazingly slow that they completely screw with the profile. It's not significant if you're measuring something that takes a really long time by itself, but if you're trying to time a block that's just a few hundred cycles or something, the Win32 functions will just kill you.
Well, run that code like a gazillion times and divide the measured time by a gazillion. Running and timing a few instructions once will most likely tell you nothing.
It's not that simple. I can confirm what ShootMyMonkey said because I hit the same problem myself.
QueryPerformanceCounter takes, on average (and that's the problem), one microsecond to complete. That time will be included in your measurements. Apart from the annoying direct consequence that code sprinkled with QPC-based profiling timers can be slowed down a whole lot, there's no simple way to compensate for the huge, inherent error margin, because it fluctuates wildly from call to call.
That's been a huge problem for me.
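To see how much a single timer read costs and how much it jitters, something along these lines can help. This is only a rough sketch: std::chrono::steady_clock is used as a portable stand-in for QueryPerformanceCounter (as far as I know recent MSVC standard libraries build steady_clock on QPC, but treat that as an assumption), and ClockOverhead and measure_clock_overhead are made-up names for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Apparent cost of one clock read, in nanoseconds (hypothetical helper type).
struct ClockOverhead {
    double min_ns;
    double mean_ns;
    double max_ns;
};

// Reads the clock back-to-back and summarises the gaps between consecutive
// reads. steady_clock is an assumption standing in for QueryPerformanceCounter.
inline ClockOverhead measure_clock_overhead(std::size_t samples) {
    assert(samples > 0);
    using clock = std::chrono::steady_clock;

    // Collect all time stamps first so no arithmetic sits between the reads.
    std::vector<clock::time_point> stamps(samples + 1);
    for (auto& s : stamps) {
        s = clock::now();
    }

    ClockOverhead r{std::numeric_limits<double>::max(), 0.0, 0.0};
    for (std::size_t i = 0; i < samples; ++i) {
        const double ns =
            std::chrono::duration<double, std::nano>(stamps[i + 1] - stamps[i]).count();
        r.min_ns = std::min(r.min_ns, ns);
        r.max_ns = std::max(r.max_ns, ns);
        r.mean_ns += ns;  // accumulate the sum, divided below
    }
    r.mean_ns /= static_cast<double>(samples);
    return r;
}
```

Calling measure_clock_overhead(100000) and comparing min_ns with max_ns gives a feel for the spread. A large gap between the two is the call-to-call fluctuation that makes the overhead hard to subtract out.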
I originally used RDTSC for everything and had great, reliable results on my old (fixed clock speed) machine. But I figure that's not good enough anymore, now that clock speeds change on the fly. OTOH QPC isn't good enough for profiling.
The solution I've come up with was to use QPC only for timing large things, in the 5ms+ ballpark. Seconds per frame for animation purposes is a prime candidate here. I still use RDTSC for profiling, but that's not active in "shipping" code. Whenever I need to collect accurate profiling data, I just deactivate "Cool'n'Quiet" and all is fine.
In my little benchmark project it turned out to be more complicated. There, I need to make sure that some specific loops run only for a limited amount of time, but it's unknown (at compile time) how many iterations that would take.
I originally had something like this:
Code:
RDTSCTimer t;
int frames = 0;
// prep benchmark
t.reset();
do
{
    // workload
    ++frames;
} while ((frames < 1000) && (t.elapsed_seconds() < 0.75));
double delta_t = t.elapsed_seconds();
Limiting the runtime with a QPC-based timer takes a lot of time away from the actual benchmark workload, because the timer is queried on every iteration, while taking the benchmark result time with RDTSC is vulnerable to clock speed variations. So I had to do something.
I finally settled on a hybrid approach. Computing elapsed seconds is very fast with my RDTSCTimer class. I figured that I don't need the time limit to be precise. A rough ballpark is enough; I can easily tolerate +/-50 percent error here. The overall delta_t for the run needs to be precise OTOH, so I use QPC for that.
Like this:
Code:
RDTSCTimer t;
QPCTimer robust_t;
int frames = 0;
// prep benchmark
robust_t.reset();
t.reset();
do
{
    // workload
    ++frames;
} while ((frames < 1000) && (t.elapsed_seconds() < 0.75));
double delta_t = robust_t.elapsed_seconds();
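For anyone who wants to try the hybrid idea without my timer classes, here is a minimal, self-contained sketch. It is not the actual RDTSCTimer/QPCTimer code. RoughTimer, RobustTimer, read_ticks, calibrate_ticks_per_second and run_limited are hypothetical names, std::chrono::steady_clock stands in for QPC, and the 20 ms calibration spin is an arbitrary choice. On non-x86 targets the sketch falls back to steady_clock for the cheap tick source as well.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#  if defined(_MSC_VER)
#    include <intrin.h>
#  else
#    include <x86intrin.h>
#  endif
#  define SKETCH_HAS_RDTSC 1
#else
#  define SKETCH_HAS_RDTSC 0
#endif

// Cheap tick source: the time-stamp counter on x86, steady_clock elsewhere.
inline std::uint64_t read_ticks() {
#if SKETCH_HAS_RDTSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Precise but slower timer; an assumption standing in for the QPC-based class.
class RobustTimer {
public:
    RobustTimer() { reset(); }
    void reset() { start_ = std::chrono::steady_clock::now(); }
    double elapsed_seconds() const {
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start_).count();
    }
private:
    std::chrono::steady_clock::time_point start_;
};

// Estimates ticks per second by spinning for about 20 ms against the robust clock.
inline double calibrate_ticks_per_second() {
    RobustTimer wall;
    const std::uint64_t t0 = read_ticks();
    double secs = 0.0;
    do {
        secs = wall.elapsed_seconds();
    } while (secs < 0.02);
    const std::uint64_t t1 = read_ticks();
    return static_cast<double>(t1 - t0) / secs;
}

// Cheap, approximate timer; its reading drifts if the tick rate changes.
class RoughTimer {
public:
    explicit RoughTimer(double ticks_per_second) : tps_(ticks_per_second) {
        assert(ticks_per_second > 0.0);
        reset();
    }
    void reset() { start_ = read_ticks(); }
    double elapsed_seconds() const {
        return static_cast<double>(read_ticks() - start_) / tps_;
    }
private:
    double tps_;
    std::uint64_t start_;
};

struct BenchResult {
    int frames;
    double seconds;  // measured with the robust timer only
};

// The rough timer only decides when to stop; the robust timer is read once.
template <typename Workload>
BenchResult run_limited(Workload&& workload, int max_frames,
                        double rough_limit_s, double ticks_per_second) {
    RoughTimer rough(ticks_per_second);
    RobustTimer robust;
    int frames = 0;
    robust.reset();
    rough.reset();
    do {
        workload();
        ++frames;
    } while (frames < max_frames && rough.elapsed_seconds() < rough_limit_s);
    return {frames, robust.elapsed_seconds()};
}
```

Usage would look like run_limited(my_workload, 1000, 0.75, calibrate_ticks_per_second()). The point of the split is that the expensive clock is touched once per run instead of once per iteration, so a calibration error in the rough timer only moves the cutoff and never the reported time.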