Benchmarks and "optimizations"

Reverend

Let's say we're talking about either a synthetic benchmark or a recorded demo of a game used for benchmarking (via the game's actual "timedemo" feature).

In what ways do you think, or know, that a driver guy can "optimize" to get higher framerates (which in turn means a higher benchmark score, whether in terms of framerates or an aggregate "score") without much chance that the person running the benchmark will notice any difference (dropped frames or image-quality infidelity, for example)?

Take into consideration factors like precision modes and the number of frames that must be rendered.
 
I think that depends on what you want to get from a benchmark.

If an IHV can optimize their drivers to get very good performance from a certain benchmark without compromising its rendering results, you get a "best performance from this hardware" type of result. Such results may reveal the potential of the hardware, but not necessarily reflect real gaming performance.

However, can a synthetic benchmark show real gaming performance? Unlikely, IMHO. So I think it's acceptable for an IHV to do such optimizations. Timedemos are another issue, however. Some timedemos do reflect real gaming performance by recording all input events and recreating everything (including AI/physics computation). If a driver is optimized just for the timedemo but not for the game, that would be pointless. On the other hand, if the game is also improved, there wouldn't be any problem (without compromising the rendering results, of course).
 
Specifically regarding shaders, I think it's going to be down to the benchmark vendors to supply shader quality tests that take the shaders they use and throw extremes of values and conditions at them, to ensure that a driver isn't, say, replacing float with fixed-point math in recognised shaders, or in execution paths it thinks can be executed at lower precision (outside of the rules laid down by the API for precision)...
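
A minimal sketch of the kind of check Dave describes, with the GPU left out so it stays self-contained: the "shader" here is a stand-in function evaluated in numpy at full and reduced precision, and the function, input values and names are all assumptions for illustration. In a real test, the suspect result would be read back from the card (e.g. via glReadPixels or GetRenderTargetData) rather than emulated.

```python
import numpy as np

def toy_shader(x, one):
    # Stand-in for a recognised benchmark shader. The math is arbitrary;
    # what matters is that it is numerically sensitive at extreme inputs.
    return np.sqrt(x * x + one) - x

# Throw extremes of values at the shader, as suggested above.
extremes = np.array([1e-4, 1.0, 256.0, 2048.0, 60000.0], dtype=np.float32)

full = toy_shader(extremes, np.float32(1.0))

# Emulate a driver silently substituting half precision for float.
halved = toy_shader(extremes.astype(np.float16), np.float16(1.0))

# At ordinary inputs the two agree; at the extremes they diverge badly,
# which is exactly what a shader quality test should be looking for.
print(np.abs(full - halved.astype(np.float32)))
```

The large inputs overflow half-precision intermediates entirely, so a substituted shader betrays itself immediately, even if it looks identical on typical benchmark content.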
 
I'm not a software person, but I do think that with any benchmark that is exactly equal on each run, the driver guys can and will know exactly what is on screen each frame. Knowing this, it's not hard to "optimize" by removing seldom-seen visuals: simply don't render them, or replace them with a one-color texture.
 
What about actual timing issues? Like some "spoof" partial frame writes to bump up the fps or a mechanism that actually knocks the benchmark out of realtime (guess that would be pretty obvious if you stopwatched it, huh?!).
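
The stopwatch remark can be made literal. A rough sketch, where the game binary, its timedemo arguments and the demo's frame count are all hypothetical placeholders: run the demo under a wall-clock timer and check that the fps the benchmark reports is consistent with how long it actually took.

```python
import subprocess
import time

FRAMES_IN_DEMO = 5000   # assumed: the recorded demo's known frame count

start = time.perf_counter()
# Hypothetical command line; substitute the real game and demo name.
subprocess.run(["game.exe", "+timedemo", "demo1"], check=True)
wall_seconds = time.perf_counter() - start

reported_fps = 123.4    # as printed by the benchmark itself

# A benchmark "knocked out of realtime" or padded with spoofed partial
# frames will report an fps that no stopwatch can reproduce.
implied_seconds = FRAMES_IN_DEMO / reported_fps
if abs(implied_seconds - wall_seconds) > 0.05 * wall_seconds:
    print("Reported fps doesn't match the stopwatch - look closer.")
```

In practice you would subtract level-load time before comparing, but even this crude check catches gross mismatches.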

MuFu.
 
...which is why UT2003's botmatch demos use the game's engine, including a saved seed and everything, and can result in wild variations in their action at times. Not that fantastic for reproducible results, but it provides the most accurate possible performance picture.
 
One possible avenue of optimization is extreme buffering.
It can happen that a benchmark has both CPU-limited and GPU-limited parts.

To avoid holding back the GPU, the driver could try buffering (the instructions to render) a lot of frames ahead while running the GPU-limited parts of the benchmark.

This kind of optimization ruins gameplay, as it can introduce intolerable mouse lag.
That's why, say, UT2003 had to resort to a 'reduce mouse lag' option, which basically forces synchronisation between the CPU and the GPU.
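
A back-of-the-envelope model of why extreme buffering hurts: input sampled when a frame is submitted only reaches the screen after everything already queued ahead of it has been drawn. The queue depths and framerate below are made-up numbers purely for illustration.

```python
def input_lag_ms(buffered_frames: int, fps: float) -> float:
    # Each buffered frame adds one whole frame-time of latency between
    # sampling the mouse and seeing the result on screen.
    return buffered_frames * 1000.0 / fps

for depth in (1, 3, 10):
    print(f"{depth} frames buffered at 30 fps -> "
          f"{input_lag_ms(depth, 30):.0f} ms of lag")
# 10 buffered frames at 30 fps is a third of a second of mouse lag,
# which is the kind of thing a 'reduce mouse lag' sync option undoes.
```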
 
Calavaro said:
I'm not a software person, but I do think that with any benchmark that is exactly equal on each run, the driver guys can and will know exactly what is on screen each frame. Knowing this, it's not hard to "optimize" by removing seldom-seen visuals: simply don't render them, or replace them with a one-color texture.
So what can we do about this? "We" meaning the reviewers and the developers of the synthetic benchmark and/or game(s).

There's a reason why I'd like to follow up on Calavaro's comment.
 
You could have a degree of predictable randomness in the benchmark, where the benchmark would be slightly different every time you ran it with a different [user-inputtable] seed value (but identical if you ran the test twice with the same seed value, regardless of what computer it was run on).
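
A tiny sketch of that seeded-randomness idea; the camera-path content is invented for illustration, but the mechanism is just a seeded PRNG, which Python's random.Random keeps stable across machines:

```python
import random

def camera_path(seed: int, frames: int = 5):
    # Same seed -> identical path on any machine; a different seed ->
    # a different, but still perfectly reproducible, benchmark run.
    rng = random.Random(seed)   # isolated PRNG, not the global one
    return [(rng.uniform(-180.0, 180.0), rng.uniform(-90.0, 90.0))
            for _ in range(frames)]

assert camera_path(42) == camera_path(42)   # reproducible for reviewers
assert camera_path(42) != camera_path(43)   # unpredictable for drivers
```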
 
I've always thought that synthetic benchmarks are just a way for a driver team to show off how well they can optimise their drivers for that application, just as they can optimise for known pathways in games.

The only real way to get around it would be, as Ilfirin says, to randomise the content of the benchmark each time it was run.

You could always hook into a static benchmark and replace known shader code with your own, or even sections of the code itself.
 
Hmm...I've already posted such an idea, I think (I'll do a search in a minute).

The idea was to create a customizable benchmark scene that allowed the person evaluating to create custom camera paths to expose such optimizations. This could expose many such optimizations (the kind that are "invisible" only when the circumstances are predictable, and therefore bad), while allowing other optimizations (the truly invisible kind) to stand; the mixture of tests would then go further towards making sure that the set of optimizations the IHV offered is more likely to be truly general-case.

This also works with my automated and somewhat randomized screenshot-capture idea, so that special driver screenshot code would be more difficult to pull off.
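
A sketch of that randomized-screenshot idea, with all names assumed: pick the capture frames from a seed the driver can't anticipate, and fingerprint the raw framebuffer bytes so runs can be compared mechanically. Exact hashes only make sense for comparing the same card against its own earlier drivers; across different IHVs, legitimate precision differences mean you would want a tolerance-based image diff instead.

```python
import hashlib
import random

def frames_to_capture(seed: int, total_frames: int, samples: int = 8):
    # Choose which frames get screenshotted from a per-session seed,
    # so a driver can't special-case the known capture points.
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_frames), samples))

def fingerprint(raw_pixels: bytes) -> str:
    # Hash the captured framebuffer so two runs can be diffed quickly.
    return hashlib.sha256(raw_pixels).hexdigest()

print(frames_to_capture(seed=1234, total_frames=5000))
```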
 
Of course, there's always the alternative of ensuring that no IHV can get their hands on such synthetic benchmarks and/or user-recorded game demos. The media guys would need to work on this together, and there would need to be trust (i.e. no leaks!). The general users would lose out, of course, since they couldn't verify any benchmark results on a website without having the benchmarks and/or game demos themselves :(
 
Good luck,
Companies with lots of cash will get to reviewers somehow (make them the first to have next-gen cards, etc.).

I think the best thing to do is this: if you find out a company is cheating on a public benchmark, and the cheat can be reproduced, get other reviewers in on it together and expose it. If you can find them doing it in one benchmark, you can bet it's their strategy across the board.

What if Microsoft designed a custom benchmark for themselves only and ran it at the time they certified the driver (make it one more test that has to be run)?

Larry
 
My initial idea was to propose that the person evaluating could specify directly what was tested, record it, and play it back to benchmark several cards against each other. It was prompted by discussion of the 3DMark03 aniso test, I think... I consider it a problem that it doesn't automatically provide regular and easily reproducible motion (EDIT: that can be objectively applied for further comparative testing) to allow easy comparison of aliasing in actual use.

My thought was originally of a scene designed to readily expose all pertinent issues at once, which would then let the reviewer/evaluator input keys and direct camera rotation around the scene. This could then be saved out to a file and played back, and would be unique to each evaluation session and under the control of the person evaluating.
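
A minimal sketch of that save-and-replay idea (the file format and event structure are assumptions, as shown below): record timestamped input events during the evaluator's session, write them to a file, and feed them back to the engine identically for every card being compared.

```python
import json
import time

class DemoRecorder:
    """Records timestamped input events so a unique, evaluator-driven
    session can be replayed identically on other cards."""

    def __init__(self):
        self.t0 = time.perf_counter()
        self.events = []

    def record(self, event: dict):
        self.events.append((time.perf_counter() - self.t0, event))

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump(self.events, f)

def playback(path: str):
    # Yield (timestamp, event); the engine applies each at frame time t.
    with open(path) as f:
        for t, event in json.load(f):
            yield t, event

rec = DemoRecorder()
rec.record({"rotate_camera": [10.0, 0.0]})
rec.save("session1.demo")
print(list(playback("session1.demo")))
```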

Further facets would be letting the person control changing parameters (colors, light intensity, and other properties that could expose image-quality shortcuts), and maybe animation properties (a "living" model with skeletal animation, etc.) as part of this "recorded demo".

This wouldn't eliminate cheating opportunities, but it would expand the number of parameters that have to be taken into consideration when cheating to a rather large amount of variance, hopefully making such optimizations impractical except where they are genuinely general-case optimizations.

Ilfirin's example is a shortcut that fits as an extension of what 3DMark is already doing, and could be applied to the criteria above as an option as well. I think user control is more important, however, to facilitate an association between the specific thoughts of the person evaluating and the testing, and to account for possibilities not anticipated by the benchmark creators, so as not to limit the possibilities and likelihoods that can be expressed. Reproducibility would be covered by saving the file, and such customizable testing would be focused on expressing the ideas from the game tests (which are separate) in such a way as to evaluate whether the game test results indicate special-case "cheating" or not.

Of course, many games already do half of this, but doing it for a synthetic test (EDIT: in this usage, a test designed not for actual gaming, but for the "synthetic" criteria of accurately reflecting what would be stressed by games) and with these types of controls opens up many new doors for testing and exposure, and an optimization that survived this would at least be more likely to be both general-case and truly invisible.

I can't find my original suggestion yet, though (searching with asterisks still appears to have issues, like not being able to go to another page within one set of search results, so I gave up for now). However, some opinions I've expressed before seem somewhat related to my concerns in this regard (except that I've since been shown to be wrong about what nVidia was doing with their 3DMark03-boosting driver set :-?).

EDIT: To address Rev's question, and Calavaro's one-color-texture example: that would be covered by allowing variance that requires the standard number of colors in the texture to render properly in all cases, and by highlighting the difference in performance and/or image quality between the game tests and the associated "cheat check" test, the latter being where such cheats can more easily be exposed.
 
Reverend said:
Calavaro said:
I'm not a software person, but I do think that with any benchmark that is exactly equal on each run, the driver guys can and will know exactly what is on screen each frame. Knowing this, it's not hard to "optimize" by removing seldom-seen visuals: simply don't render them, or replace them with a one-color texture.
So what can we do about this? "We" meaning the reviewers and the developers of the synthetic benchmark and/or game(s).

There's a reason why I'd like to follow up on Calavaro's comment.


Are you trying to catch the cheats, or make a hack-proof benchmark? I agree with lar2r: I don't think you will accomplish the latter. It's sort of like trying to make a cheat-proof multi-player PC game, and the graphics companies have a lot more at stake, and the resources, to make hack-proof benchmarks impossible. Why can't the reviewers use recorded in-game sequences with FRAPS, at least as a baseline comparison, to validate whether there is something fishy with the driver? Unless you come up with an inexpensive way to make a benchmark foolproof, or make it very expensive (negative-publicity- or money-wise) to write a performance hack, IMO you are better off just trying to find the cheats. Your solution needs to be simple if you are serious about making it work. Just my 2 cents.
 
Tagrineth said:
...which is why UT2003's botmatch demos use the game's engine, including a saved seed and everything, and can result in wild variations in their action at times. Not that fantastic for reproducible results, but it provides the most accurate possible performance picture.

Results shouldn't vary if you use the same settings/aspect ratio. You can't directly compare Linux and Windows, though you can compare Intel vs AMD since the 2166 patch. Always check the random seed the game appends to the benchmark.log file.

I can't stress enough that results are only comparable at the same aspect ratio, so 1280x1024 is out - use 1280x960 instead :)

-- Daniel, Epic Games Inc.
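
Following Daniel's advice about the seed, a small sketch (the "seed: NNN" line format is an assumption for illustration, not UT2003's actual log syntax) that refuses to compare two runs unless their benchmark.log files record the same seed:

```python
import re

def read_seed(logpath: str):
    # Scan the log for a line like "seed: 12345" (format assumed).
    with open(logpath) as f:
        for line in f:
            m = re.search(r"seed\s*[:=]\s*(\d+)", line, re.IGNORECASE)
            if m:
                return int(m.group(1))
    return None

a = read_seed("run_a/benchmark.log")
b = read_seed("run_b/benchmark.log")
if a is None or a != b:
    print("Different (or missing) seeds - these runs are not comparable.")
```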
 
I'd suggest:

(1) Don't write shaders which do the same thing over and over again; that makes it easy for IHVs to gain significantly by changing the shader code itself. Instead, write the shaders in such a way that each part of the shader is crucial for the final output.

(2) Is it possible to run pass/fail tests on shader results? It's important that the test samples be taken randomly somehow. This way it should be possible to check whether the shader output is 100% accurate or not. You could perhaps even check whether the shader was executed at 16-bit, 24-bit or 32-bit precision (see the sketch after this post).

(3) Another option would be to create shaders which don't look "nice", but which are written in such a way that any change to the shader code dramatically changes the final output picture. Unfortunately such shaders would probably not look very beautiful, which could reduce the attractiveness of the benchmark...

I guess that just doing random camera paths or slightly changing the rendered objects won't help much - unless you also change the shader code itself completely, but then the results are not comparable anymore. A game uses a specific set of shaders, and those shaders won't change just because of a different camera path. If an IHV replaces all the available shaders with easier code which trades accuracy/quality for speed, nothing helps - except checking the shader output somehow.

So for me the big question is: is it possible to check the exact result of a specified shader operation? Is there a Direct3D/OpenGL call like "TellMeResultOfLastRunShader"? It's important that you can ask *after* the shader has run; if you ask before the shader is run, the driver can trick you again.
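
There's no literal "TellMeResultOfLastRunShader" call, but both APIs let you read the render target back after the draw (glReadPixels in OpenGL, GetRenderTargetData in D3D9), which answers the question in practice, assuming a floating-point render target. Once you have the values, inferring the precision is a round-trip test; the numbers below are illustrative only:

```python
import numpy as np

def looks_like_fp16(readback: np.ndarray) -> bool:
    # If every read-back value survives a round-trip through half
    # precision unchanged, the shader was most likely run at 16 bit.
    roundtrip = readback.astype(np.float16).astype(np.float32)
    return bool(np.array_equal(readback, roundtrip))

fp32_result = np.array([0.1234567, 3.1415927], dtype=np.float32)
fp16_result = fp32_result.astype(np.float16).astype(np.float32)

print(looks_like_fp16(fp32_result))   # False: has full-precision detail
print(looks_like_fp16(fp16_result))   # True: quantized to half precision
```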
 
Reverend said:
Calavaro said:
I'm not a software person, but I do think that with any benchmark that is exactly equal on each run, the driver guys can and will know exactly what is on screen each frame. Knowing this, it's not hard to "optimize" by removing seldom-seen visuals: simply don't render them, or replace them with a one-color texture.
So what can we do about this? "We" meaning the reviewers and the developers of the synthetic benchmark and/or game(s).

There's a reason why I'd like to follow up on Calavaro's comment.

Good question... maybe use a standalone fps counter (FRAPS)?
You could also re-run benchmarks with the standalone fps counter on different drivers (i.e. older drivers). That will help you identify any "optimizations" made, and whether they "cheat".

Maybe also use non-standard benchmarks, i.e. create your own that is never released publicly. Use it only for your internal purposes and as a guide, and compare with established benchmarks to see if there have been benchmark-specific optimizations in driver X (if that made sense).
 
Reverend said:
So what we can do about this? "we" meaning the reviewers ...

Benchmark games that are not normally used in 3D card reviews. You can use Unreal 2 instead of UT2003 or Serious Sam, you can benchmark AquaNox 2 instead of Aquamark, and using FRAPS you can accurately benchmark almost any cool-looking DX8 game out there. Indiana Jones, C&C Generals and DTM Race Driver are quite fillrate-limited. With FRAPS you could basically just take a bunch of fillrate-limited titles out of the latest game charts instead of using one-year-old (Serious Sam) or even three-year-old games (Q3A).
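
To turn a FRAPS-style capture into comparable numbers, here's a sketch that aggregates a per-frame time log into average fps plus a worst-case percentile; the one-value-per-line file format is an assumption for illustration, not FRAPS's actual output format.

```python
def summarize(path: str):
    # One frame time in milliseconds per line (format assumed).
    with open(path) as f:
        frame_ms = sorted(float(line) for line in f if line.strip())
    avg_fps = 1000.0 * len(frame_ms) / sum(frame_ms)
    # The 99th-percentile frame time exposes stutter that average
    # fps (and a cheating driver) can hide.
    p99_ms = frame_ms[int(len(frame_ms) * 0.99)]
    return avg_fps, p99_ms

avg, p99 = summarize("frametimes.txt")
print(f"avg {avg:.1f} fps, 99th percentile frame time {p99:.1f} ms")
```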
 