Game benchmarks vs benchmark applications

Patric Ojala

There has been some talk about how game benchmarks measure ‘real world’ performance while benchmark applications like 3DMark are just ‘synthetic’ (said in the worst possible tone of voice :) ). Fortunately, things are not quite that grim. There is a clear distinction between benchmark applications and game benchmarks, and both are needed for a complete performance measurement.

Back when the 3DMark concept was invented, only a few games included a benchmark. Nowadays many of the big titles have one, and many lesser-known games with lower predicted sales also add a benchmark to the product. This is most likely done both for the common good and in the hope of increasing sales through the added publicity. This is a positive trend, but it does not mean that benchmark applications like 3DMark have become unnecessary.

- Then how does 3DMark differ from the game benchmarks, and what can it offer that game benchmarks cannot?

A good approach is to start by defining what game benchmarks actually measure, then compare this to 3DMark and see how the two benchmark categories can complement each other.

This text was written for professional benchmark users, like the hardware press, to give recommendations on how to measure graphics hardware performance. Enthusiasts should also find this text enlightening, since it may help you understand and evaluate the hardware reviews you read. That's why I thought it would be suitable content for Beyond3D.

What do game benchmarks measure?
A game benchmark typically renders a recording from a game in real time. This produces a repeatable measurement of how fast the graphics of that game run on the tested system. Since it uses a recording of the animation, a game benchmark seldom measures in-game workloads such as physics, artificial intelligence and user input. This may be quite acceptable when benchmarking graphics performance, but it must be noted that the game benchmark may not quite indicate how fast that game will run when it is actually played.

A more important aspect of game benchmarks is that they usually run like the game itself, optimized for the system they are run on. Several code paths are often implemented for hardware from different vendors, and the workload of these code paths may not be equal. The results from that game benchmark are therefore not comparable between hardware from different vendors, or to be more precise, those results do not directly reflect the hardware performance. Also, the rendering workload may be adjusted to offer at least a playable frame rate on each system. If you have a high-end system, the graphics are set to high detail. If you have an older or a value system, a lot of the details in the scene are discarded in order to reach a playable frame rate on that system too. The positive aspect of this is that the game play is enjoyable both on high-end systems and on older or value systems. The negative aspect from a benchmarking perspective is that even though a value system gets half the frame rate of the high-end system, the high-end system may have done four times the work the value system did. The high-end system probably produces a better rendering, but this is not reflected in the benchmark result.
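The half-the-frame-rate example can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes that "four times the work" means four times the rendering work per frame, measured in arbitrary units, and the function name and all numbers are hypothetical.

```python
def work_per_second(fps, work_per_frame):
    """Rendering work completed per second, in arbitrary units."""
    return fps * work_per_frame


# Hypothetical numbers: the value system runs a reduced-detail code path,
# so each of its frames costs a quarter of the high-end system's frames.
high_end_fps, high_end_work = 60.0, 4.0
value_fps, value_work = 30.0, 1.0

# What the game benchmark reports: only the frame rates.
fps_ratio = high_end_fps / value_fps

# What the hardware actually did: frame rate times work per frame.
work_ratio = (
    work_per_second(high_end_fps, high_end_work)
    / work_per_second(value_fps, value_work)
)

print(f"frame rate ratio: {fps_ratio:.1f}x")
print(f"work ratio:       {work_ratio:.1f}x")
```

Under this reading, the reported frame rates differ by a factor of two while the work done per second differs by a factor of eight, which is exactly the kind of gap a frame rate number alone cannot show.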

This leads us to the conclusion that a game benchmark measures the performance of that particular game. It does not give a comparable hardware performance reading by itself. Also, the user seldom has a complete understanding of what kind of workload the game benchmark generates on each piece of hardware.

Which game benchmarks to use?
A game benchmark gives the best available measurement of how fast the tested system will run that particular game. This means that you should not just look at what game benchmarks other (maybe more famous) websites or magazines use; you should only use the benchmarks of games your readers are most likely to play. There is no reason to benchmark using Quake3 if relatively few people these days play Quake3. You should also keep in mind that Quake3 hardly offers a challenge for today’s high-end graphics hardware. The benchmark results will most likely be limited by some part of the system other than the graphics card. Another example of a game benchmark is Aquamark 3. It looks and feels like a benchmark application, but as in most game benchmarks, the workload changes depending on which system runs it. The benchmark therefore most likely measures how the upcoming game Aquanox 3 will run, but it does not offer comparable hardware performance measurements. Also, you should only use Aquamark 3 if you believe the game Aquanox 3 will sell well and be widely played.

The game benchmarks to use are those of the most played games, or of upcoming games that are expected to become widely played. The latest Unreal Tournament benchmark should be one obvious choice, and the highly hyped Half Life 2 is another benchmark to recommend once it becomes available. Do not bias the review by using only benchmarks that are recommended by a single IHV. On the graphics side, the reviewer should choose game benchmarks recommended by both ATI and NVIDIA, and by other graphics IHVs too for that matter. Add benchmark application results, like the 3DMark score, and take an average of all obtained results when concluding which hardware really is the fastest. The results should naturally also be presented separately, but the average is important in order to obtain a result that better reflects the overall hardware performance. Otherwise the reader (or the editor) might be influenced by which results are presented first in the review, or by which game is the editor’s favorite. Another approach is to average the game benchmark results and present the benchmark application results separately, since these numbers represent somewhat different measurements.
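One practical question with such an average is that frame rates and benchmark application scores are on different scales. A minimal sketch of one way to handle this follows. The text above only says "average"; normalizing every result to a reference card and taking the geometric mean is an assumption made here, and the card names, benchmark names and numbers are all hypothetical.

```python
from math import prod


def relative_scores(results, reference):
    """Divide each card's result by the reference card's result, per benchmark."""
    ref = results[reference]
    return {
        card: {bench: score / ref[bench] for bench, score in scores.items()}
        for card, scores in results.items()
    }


def geometric_mean(values):
    """Geometric mean, which suits ratios better than an arithmetic mean."""
    values = list(values)
    return prod(values) ** (1.0 / len(values))


# Hypothetical results: two game benchmarks (fps) and one application score.
results = {
    "card_a": {"game_1_fps": 100.0, "game_2_fps": 50.0, "app_score": 4000.0},
    "card_b": {"game_1_fps": 80.0, "game_2_fps": 60.0, "app_score": 5000.0},
}

# One overall number per card, relative to card_a (which scores 1.0).
overall = {
    card: geometric_mean(rel.values())
    for card, rel in relative_scores(results, reference="card_a").items()
}

for card, score in overall.items():
    print(f"{card}: {score:.3f}")
```

Because every result is first turned into a ratio, a benchmark with large raw numbers (such as an application score in the thousands) does not dominate the average. To follow the alternative approach, pass only the game benchmark entries to the average and report the application score on its own.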

About FRAPS
FRAPS is an application that can measure the frame rate of any 3D application. This is basically a good thing, since now all the most popular games could be used as benchmarks. Additionally, the limitations of designated benchmark modes in games do not apply to FRAPS benchmarking. You can benchmark any part of the game, meaning that possible unfair driver optimizations for a famous benchmark mode in a game do not necessarily distort the results of a randomly selected FRAPS run. Also, when measuring actual game play with FRAPS, the measurement is indeed of the game play itself, not just of real time playback of a recording. Thereby the CPU load should match that of actual game play, which is seldom the case in designated benchmark modes.

The problem with FRAPS is finding the sequence in the game to benchmark. Which part of the game is repeatable enough to offer genuinely comparable measurements? If a sequence is played just slightly differently on one system, that measurement may no longer be comparable. Another choice is measuring a real time cinematic sequence in the game. Then again, how representative is that cinematic of the workload in actual game play? All this leads to the conclusion that FRAPS is basically a great substitute for game benchmarks, but it is quite difficult to generate genuinely comparable and relevant measurements using it. FRAPS is not a substitute for benchmark applications, since the same ambiguity about the various code paths in games also applies to FRAPS measurements.
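One way to judge whether a chosen sequence is repeatable enough is to run it several times on the same system and look at the spread of the results. The sketch below is an assumption about how a reviewer might do this, not something FRAPS itself provides: the average frame rates are typed in by hand, the numbers are hypothetical, and the 3% threshold is an arbitrary choice.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (unitless spread)."""
    return stdev(samples) / mean(samples)


def is_repeatable(samples, threshold=0.03):
    """True if the run-to-run spread stays within the chosen threshold."""
    return coefficient_of_variation(samples) <= threshold


# Hypothetical average frame rates from four runs of each sequence
# on the same system with the same settings.
scripted_runs = [58.9, 59.4, 59.1, 59.3]    # e.g. an in-game cinematic
free_play_runs = [52.0, 61.5, 47.8, 58.2]   # e.g. a manually played level

for name, runs in [("scripted", scripted_runs), ("free play", free_play_runs)]:
    cv = coefficient_of_variation(runs)
    print(f"{name}: spread {cv:.1%}, repeatable: {is_repeatable(runs)}")
```

If the spread between runs on one system is as large as the difference between the systems being compared, the sequence cannot separate them, however representative of real game play it may be.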

What does 3DMark measure?
3DMark simply answers the question of which hardware is fastest, disregarding how much some game may be optimized for a certain piece of hardware. The workload is always mathematically equivalent and mostly even identical on all supported hardware. The various code paths for different hardware in games affect their benchmark results enough to make game benchmarks measure above all the efficiency of that game, the software. 3DMark, on the other hand, is designed to give a comparable performance measurement of the hardware. In other words, game benchmarks answer the question “how fast does this game run on different hardware”, while 3DMark answers the question “how fast are different pieces of hardware”.

Lately there has been discussion about how some IHVs identify benchmarks in their drivers and execute these programs differently than the application requests. Optimizations in the drivers are a good thing, but these must not alter the application that is running. This has also been addressed by Futuremark, in order to keep the benchmark results valid. A game developer is pleased as long as the game runs well on all buyers’ systems, and seldom spends many resources on safeguarding the validity of the benchmark results.

While game benchmarks measure the performance of current games, 3DMark is designed to offer the rendering workload of next generation games. When a new hardware generation is introduced to the market, it usually takes between one and two years before even a small number of games utilize the new hardware features. 3DMark is usually launched around the same time as the new hardware generation, and can immediately give a prediction of how fast the new hardware will run games designed for it. Both 3DMark2001 and 3DMark03 have succeeded well in predicting the 3D performance of the next generation games. The Futuremark Benchmark Development Program (BDP) members (the PC hardware manufacturers and Microsoft) see to it that each new 3DMark version correctly predicts the future game features and in general measures the graphics hardware performance correctly.

Why doesn't 3DMark03 measure CPU performance?
3DMark03 has been questioned a lot because it does not offer the same CPU load as game benchmarks. It has even been criticized in some forums as a bad graphics card benchmark because it does not scale with the CPU. This is most confusing :) . When developing 3D benchmarks, there is a clear trade-off between making the benchmark scale with the CPU or with the GPU. The weakest link in the chain will dictate the overall performance. If you want to compare graphics cards, you clearly need to choose a benchmark that scales with the graphics card. At launch, 3DMark2001 was very much limited by graphics performance. Now, 2.5 years after launch, high-end graphics hardware is powerful enough to make the rest of the system the bottleneck in many cases. For this reason Futuremark has received a lot of feedback claiming that 3DMark2001 is a better graphics benchmark than 3DMark03. This is not the case; 3DMark2001 is now better suited to benchmarking value or legacy hardware. It is no mistake that 3DMark03 scales mostly with the graphics card; it was designed that way on the recommendation of the BDP members, and for a good reason.
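The weakest-link argument can be illustrated with a deliberately simple model. The sketch below assumes that CPU and GPU work overlap completely, so the slower of the two sets the time per frame; real systems are more complicated, and the function name and all timings are hypothetical.

```python
def frames_per_second(cpu_ms, gpu_ms):
    """Frame rate when the slower of the CPU and GPU sets the frame time."""
    return 1000.0 / max(cpu_ms, gpu_ms)


# A GPU-limited workload: the GPU needs far longer per frame than the CPU.
gpu_limited_base = frames_per_second(cpu_ms=5.0, gpu_ms=20.0)
gpu_limited_fast_cpu = frames_per_second(cpu_ms=2.5, gpu_ms=20.0)
gpu_limited_fast_gpu = frames_per_second(cpu_ms=5.0, gpu_ms=10.0)

# A CPU-limited workload: an older test running on a modern graphics card.
cpu_limited_base = frames_per_second(cpu_ms=10.0, gpu_ms=4.0)
cpu_limited_fast_gpu = frames_per_second(cpu_ms=10.0, gpu_ms=2.0)

print("GPU-limited:", gpu_limited_base, gpu_limited_fast_cpu, gpu_limited_fast_gpu)
print("CPU-limited:", cpu_limited_base, cpu_limited_fast_gpu)
```

In this model the GPU-limited workload ignores a CPU twice as fast but doubles its frame rate with a GPU twice as fast, while the CPU-limited workload does not respond to a faster graphics card at all. A benchmark meant to compare graphics cards has to sit in the first situation.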

Because 3DMark03 is designed to limit the measurement by the graphics card performance, a CPU test was added. This test gives results that scale with the CPU and memory performance, when run on a system with a high end graphics card. It is important to use the same graphics card for each system or CPU to be compared, since the different graphics cards may otherwise distort the results.

Other benchmark applications
After Ziff Davis 3D Winbench was discontinued, there are no other benchmark applications comparable to 3DMark. This is unfortunate, and only time will tell if there will be any alternatives. There are a number of less comprehensive benchmark applications, mostly concentrating on feature-specific benchmarking. A few examples of these are Shadermark and Rightmark. These mostly measure the performance of one feature at a time in separate smaller tests. Each feature is thereby tested thoroughly, for example by testing one type of shader at a time. Still, these results do not necessarily correlate with the performance obtained when larger amounts of data and more shaders are in use in the same scene. 3DMark also contains a number of smaller tests like these to isolate the performance of some key features. For professionals who want insight into why game benchmarks and 3DMark scale the way they do, these kinds of tests can reveal certain strengths and weaknesses of the hardware.

Conclusions
There is no conflict or trade-off between using game benchmarks and benchmark applications like 3DMark. Both are needed to get a comprehensive hardware performance measurement. A game benchmark measures the performance of that game, but the result is highly affected by the various code paths implemented for different hardware. This kind of measurement reflects above all the efficiency of the software, or game in this case, and not the hardware itself. The drivers may also be tuned for benchmark modes in popular games and less attention is paid to this problem with game benchmarks than with benchmark applications. Only an average of a number of results from game benchmarks picked wisely will produce something of a hardware performance measurement.

3DMark, on the other hand, is designed to measure the hardware performance itself. It is not distorted by unclear optional code paths, and active work is done to keep questionable driver tuning away. Future game technology is used as the basis for the measurement, meaning that the next generation of games will most likely stress the graphics hardware like 3DMark does. By using the right game benchmarks, benchmark applications like 3DMark, and possibly feature-specific benchmarks, the hardware performance measurement and comparison should be as comprehensive as possible.
 