Hi people,
There is a lot of discussion going on about this benchmarking issue, and since I do not have the time and patience to read all of the statements, I may be repeating someone else's thoughts here. If so, I apologize. I also want to mention that there is a suggestion I find interesting at the end, so please read all the way through to the end. I would really appreciate your comments on it.
Ok! Let's start with a question: what is the sole purpose of benchmarking? To compare the performance of different chipsets? Or to find out how a chipset performs on a particular task? If we pick the first answer, then the obvious next question is how to pick the right tools to make such a comparison, or how to write those tools:
First of all, there is an ongoing misunderstanding about these vertex/pixel shaders. People assume that if an architecture supports a particular shader version, it can run any shader code written for that version efficiently. Unfortunately, this is not true. Even if two architectures supported exactly the same configuration (the same instruction set and numerical precision), different implementations may deliver different performance on different code for the same shader version.
Let me give an example. Assume we have two architectures, implementation A and implementation B, and that both use the same numerical precision and support the same instruction set. Now suppose implementation A runs a 3-instruction shader at the same speed at which implementation B runs a 2-instruction shader that does exactly the same thing, and that implementation B runs the 3-instruction shader much slower than implementation A does. Even in this simple case, where A and B share the same numerical precision and instruction set, we cannot write one piece of code that works equally well on both implementations. So how can we benchmark them? If we use the 3-instruction shader in the benchmark, implementation A looks much faster than implementation B. If we use the 2-instruction version (and implementation A gains little from the shorter code), implementation B comes out faster.
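To make this concrete, here is a minimal C++ sketch of the idea (the lighting formula, the function names, and the shader-instruction mapping in the comments are all hypothetical, just for illustration): the first version uses three operations (mul, mul, add), while the second folds the last two into a single multiply-add and computes the same result.

[code]
#include <cmath>
#include <cstdio>

// Hypothetical shader math: result = diffuse*albedo + specular*gloss.

// Version 1: three operations, the form the benchmark might ship with.
float shaderThreeInstr(float diffuse, float albedo, float specular, float gloss) {
    float a = diffuse * albedo;   // mul r0, v0, c0
    float b = specular * gloss;   // mul r1, v1, c1
    return a + b;                 // add r0, r0, r1
}

// Version 2: two operations; the fused multiply-add does the same math in one step.
// In the scenario above, implementation B runs this form much faster,
// while implementation A is largely indifferent to which form it gets.
float shaderTwoInstr(float diffuse, float albedo, float specular, float gloss) {
    float a = diffuse * albedo;           // mul r0, v0, c0
    return std::fma(specular, gloss, a);  // mad r0, v1, c1, r0
}

int main() {
    std::printf("%f %f\n",
                shaderThreeInstr(0.8f, 0.5f, 0.3f, 0.9f),
                shaderTwoInstr(0.8f, 0.5f, 0.3f, 0.9f));
}
[/code]

A benchmark that ships only one of these two forms implicitly favors whichever implementation happens to like that form, even though the math is identical.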
Now, things get more complicated in the Nvidia/ATI case. They have different implementations of pixel shader 2.0: they support different numerical precisions (FP24 versus FP16/FP32) and they have different instruction sets. So how can we define a proper way to benchmark them with the same code, when we could not even find one for two architectures with exactly the same features? This calls into question not only the validity of popular benchmarking tools like 3DMark, but also the games published under Nvidia's so-called "The way it's meant to be played" program (which makes me believe those games are designed around Nvidia's architecture).
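As a side note on what those precision figures mean: FP16 stores a 10-bit mantissa, ATI's FP24 (s16e7) a 16-bit mantissa, and FP32 a 23-bit mantissa. The rough C++ sketch below (the test value is arbitrary, and the rounding helper only approximates real hardware behavior) shows how far the three formats can diverge on a single value; after many shader operations the differences grow.

[code]
#include <cmath>
#include <cstdio>

// Round x so it keeps only 'sigBits' significant binary digits
// (stored mantissa bits plus the implicit leading 1), roughly
// simulating a lower-precision floating-point format.
float roundToSigBits(float x, int sigBits) {
    int exp;
    float m = std::frexp(x, &exp);            // x = m * 2^exp, 0.5 <= |m| < 1
    float scale = std::ldexp(1.0f, sigBits);  // 2^sigBits
    m = std::round(m * scale) / scale;        // drop the extra bits
    return std::ldexp(m, exp);
}

int main() {
    float v = 1.0f / 3.0f;  // an arbitrary value no binary format stores exactly
    std::printf("FP32 (~24 significant bits): %.8f\n", roundToSigBits(v, 24));
    std::printf("FP24 (~17 significant bits): %.8f\n", roundToSigBits(v, 17));
    std::printf("FP16 (~11 significant bits): %.8f\n", roundToSigBits(v, 11));
}
[/code]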
Therefore, for me, the answer to the question "what is the sole purpose of benchmarking?" is "to find out how a chipset performs on a particular task." Now, this is trickier, because it requires special code for each architecture. In this case, the developer has to define the particular task and then write code that is optimized for each architecture; the benchmark then shows us how each chipset performs on that task. But this raises more questions: is it possible for a developer to write a separate shader for each architecture? And what happens if a new architecture comes out after the tool is written? As one developer states in his forum, because of pressure from publishers they do not have the time to optimize the same code for different architectures. So how do we measure the performance of a new architecture if there is no special code written for it?
I guess this is the point where the driver developers come into the game. Since they can figure out what a shader does and they know the architecture they are programming for, they can replace the shader code with a new one that does the same thing but is optimized for their architecture. So the new question is: is this legitimate, or is this cheating? I think both Tim Sweeney and John Carmack give appropriate answers to that. If the optimization does not change the particular task, then it is not cheating; in fact, it lets users see the real performance of their chipset on that task. ATI's optimization for Test 4 of 3DMark2003 is a nice example of this. That particular optimization (reordering the instructions) shows that it is possible to gain 10% in performance without altering the quality. If this is possible, why avoid it? In the case of Nvidia's optimizations, I believe that setting extra clip planes and altering the back-buffer clearing procedure using knowledge of the camera and world position is cheating, but changing shader code in a way that does not alter image quality (IQ) is not cheating, even if it uses less precision; as Carmack stated, using less precision may not affect quality in some cases. Please note the qualifier "does not alter the IQ": if it changes IQ, then it is also cheating.
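To be clear about what "replacing the shader code" means mechanically, here is a minimal C++ sketch of the idea and nothing more: the hash, the replacement table, and the entry point below are hypothetical and not taken from any real driver.

[code]
#include <cstdint>
#include <unordered_map>
#include <vector>

using ShaderBytecode = std::vector<uint32_t>;

// Hypothetical fingerprint: an FNV-1a-style hash over the shader tokens.
uint64_t fingerprint(const ShaderBytecode& code) {
    uint64_t h = 14695981039346656037ull;
    for (uint32_t token : code) {
        h ^= token;
        h *= 1099511628211ull;
    }
    return h;
}

// Hand-optimized replacements, keyed by the hash of the original shader.
// A legitimate entry must compute exactly the same result as the original.
std::unordered_map<uint64_t, ShaderBytecode> g_replacements;

// Called when the application creates a pixel shader: hand back the tuned
// version if we recognize the shader, otherwise use the developer's code.
const ShaderBytecode& selectShader(const ShaderBytecode& appShader) {
    auto it = g_replacements.find(fingerprint(appShader));
    return (it != g_replacements.end()) ? it->second : appShader;
}
[/code]

Whether this is legitimate then comes down entirely to what goes into that table: equivalent code is an optimization, anything that changes the output is a cheat.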
So, things get even more confusing. First, it is very hard to pin down a definition of benchmarking. Second, it is also hard to define the legitimacy of driver optimizations that change the developers' code. So what is the solution? Here is one; I do not know whether it is implementable or feasible, but I do not think it is a bad one either:
For a long time, we (the users) have had control over rendering image quality through the display control panel. We can select the degree and quality of anisotropic filtering, the degree of full-screen anti-aliasing, the LOD bias that trades texture quality against performance, and other settings that affect performance and quality. So why not build the very same kind of feature into the control panel so that we can select the degree of optimization as well?
The control panel could include a tab listing the applications that can benefit from special optimizations that alter the developers' shader code and/or rendering paths. The optimizations could be classified into four classes:
1- Improved IQ, improved performance
2- Same IQ, improved performance
3- Slightly less IQ, improved performance
4- Less IQ, significantly improved performance
The first class might include a shader change from PS1.x to PS2.0 (integer to floating point): an architecture may execute PS2.0 code faster than PS1.x code with the same functionality, and the output quality may even increase because of the higher precision. The second class might include changes like instruction reordering (such as ATI's optimization in 3DMark2003). The third class might include changes like using faster but less precise operations (such as Nvidia's lower-precision optimizations in 3DMark2003). Finally, the fourth class might include a complete rewrite of the shader, which changes the IQ significantly but gives a huge performance boost (GeForce FX 5200 users may prefer that).
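To make the suggestion a bit more concrete, here is a rough C++ sketch of the data such a control-panel tab could work with; every name is hypothetical, and the developer-validation flag anticipates the QA idea a couple of paragraphs below.

[code]
#include <string>

// The four optimization classes from the list above.
enum class OptimizationClass {
    ImprovedIQ_ImprovedPerf,       // class 1: e.g. PS1.x -> PS2.0 rewrite
    SameIQ_ImprovedPerf,           // class 2: e.g. instruction reordering
    SlightlyLowerIQ_ImprovedPerf,  // class 3: e.g. reduced-precision operations
    LowerIQ_MuchImprovedPerf       // class 4: e.g. full shader rewrite
};

// One driver-side optimization for one application.
struct AppOptimization {
    std::string application;     // e.g. "3DMark2003"
    std::string description;     // what the driver actually changes
    OptimizationClass category;  // which of the four classes it falls into
    bool developerValidated;     // has the game's QA group signed off on it?
};

// The user's choice in the control panel: a master switch plus the highest
// class of optimization they are willing to accept.
struct UserOptimizationSetting {
    bool optimizationsEnabled = true;
    OptimizationClass maxAllowedClass = OptimizationClass::SameIQ_ImprovedPerf;
};

// The driver applies an optimization only if the user's setting permits it.
bool isAllowed(const AppOptimization& opt, const UserOptimizationSetting& user) {
    return user.optimizationsEnabled && opt.category <= user.maxAllowedClass;
}
[/code]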
Such a control-panel option would give us more freedom to trade quality off against performance. It would also give hardware reviewers the chance to play with those settings: they could compare quality and performance under different settings, and evaluate the usability of an application under different optimization levels.
Finally, these optimizations could be validated by the developer's quality assurance (QA) group. If a developer validates a particular optimization, that could even be shown in the control panel, so the user would know that the IQ change from that optimization has been found acceptable by the developer.
Any comments?
Best,