Introduction
With the current difficulties inherent in obtaining reliable benchmarks, many reviewers and enthusiasts are starting to use Fraps to benchmark games that do not have any built-in benchmarking capability. One method of benchmarking with Fraps is detailed in Neeyik's recent review of the GeforceFX 5600 Ultra (Rev2). However, this method can only be used with a few games. Using Fraps to record a standard gameplay situation provides a wealth of benchmarking opportunities. The problem is, no one knows how repeatable the results are. So let's find out...
Test System
CPU: AMD 2800+ running at 180*11.5
Motherboard: Epox 8RDA+, unified nForce drivers 2.45
Ram: Geil PC3700 256meg x 2
Video card: Gigabyte 9700pro, Cat 3.6
Operating system: Windows XP, service pack 1
Unless otherwise specified the following driver settings were used:
AA: 4x
AF: Quality 8x
Texture Preference: High Quality
Mipmap Detail Level: High Quality
Wait for vertical sync: Default Off
Truform: Always Off
Benchmark
Quake3, point release 1.32
Level: Q3DM4
The following settings were used:
Graphics Settings: High Quality
Screen Resolution: 1152 x 864
Geometric Detail: High
Default Level Bots
Difficulty: Hardcore
Q3DM4 was selected because I like the level, and since I had to play it a lot - I might as well like it.
Test Procedure
The actual death match length and the average, min and max frame rates were recorded with Fraps for death matches of 1, 2, 4 and 8 minutes nominal length. A minimum of 10 death matches were played for each death match length. Fraps was started as soon as the death match started and stopped when the match finished. No attempt was made to visit every part of the map or to produce consistent scores in any way.
Results
Code:
1 min
4AA
Sample Time (s) Min (fps) Average (fps)
1 63.797 113 216.232
2 61.39 147 223.342
3 61.375 148 220.806
4 61.235 101 218.747
5 61.156 117 212.963
6 61.219 154 234.649
7 60.641 130 217.229
8 61.328 137 204.963
9 62.813 145 211.58
10 61.2973 111 212.816
Sample Mean 61.62513 130.3 217.3327
Sample Standard Deviation 0.939704570655907 18.6252516761519 8.00590182372571
Max 63.797 154 234.649
Min 60.641 101 204.963
Range 3.156 53 29.686
Confidence Interval (95%) 0.58242386898597 11.5438314134092 4.96201514869886
11 60.641 156 236.721
12 61.297 143 222.8
13 60.985 153 223.136
14 61.078 140 208.307
15 61.125 140 218.159
16 62.078 75 224.507
17 61.109 71 214.289
18 61.547 129 197.296
19 61.046 126 223.765
20 60.985 126 218.201
21 60.609 93 220.775
22 61.109 153 231.111
23 61.125 143 226.47
24 61.063 142 205.967
25 60.984 105 204.233
26 61.047 101 216.357
27 60.782 120 214.685
28 61.766 156 230.045
29 61.156 148 221.09
30 61.234 151 228.925
Sample Mean 61.3005766666667 129.133333333333 218.6722
Sample Standard Deviation 0.6367920494667 23.8034093955167 9.19708126639583
Max 63.797 156 236.721
Min 60.609 71 197.296
Range 3.188 85 39.425
Confidence Interval (95%) 0.227868781953625 8.51777893559193 3.29107077806208
2 min
4AA
Sample Time (s) Min (fps) Average (fps)
1 121.093 144 225.933
2 121.015 137 212.882
3 122.359 88 207.425
4 122.765 131 217.423
5 121.016 74 206.518
6 121.204 86 207.187
7 121.219 96 212.804
8 121.094 72 215.089
9 121.687 139 217.697
10 121.047 108 216.7
Sample Mean 121.4499 107.5 213.9658
Sample Standard Deviation 0.625364507046685 28.0960653631232 6.0051247521504
Max 122.765 144 225.933
Min 121.015 72 206.518
Range 1.75 72 19.415
Confidence Interval (95%) 0.387597578105219 17.4137911031454 3.72194421641432
4 min
4AA
Sample Time (s) Min (fps) Average (fps)
1 241.453 103 211.892
2 241.578 126 215.098
3 241.485 78 213.532
4 241.515 73 209.601
5 240.797 136 213.827
6 241.563 117 214.565
7 240.797 141 213.258
8 241.266 117 211.003
9 241.343 88 213.389
10 240.438 115 216.247
Sample Mean 241.2235 109.4 213.2412
Sample Standard Deviation 0.400535266857954 23.4245834778574 1.9654169927948
Max 241.578 141 216.247
Min 240.438 73 209.601
Range 1.14000000000001 68 6.64600000000002
Confidence Interval (95%) 0.248249617032201 14.5184316056224 1.21815494450061
8 min
4AA
Sample Time (s) Min (fps) Average (fps)
1 480.656 124 211.059
2 481.281 75 209.757
3 480.016 85 208.313
4 481.344 87 212.968
5 480.547 127 214.732
6 481.172 90 207.726
7 480.562 84 212.958
8 481.25 65 211.929
9 480.609 120 214.823
10 481.328 74 218.456
Sample Mean 480.8765 93.1 212.2721
Sample Standard Deviation 0.457161350176893 22.3728506106049 3.26615827506475
Max 481.344 127 218.456
Min 480.016 65 207.726
Range 1.32799999999997 62 10.73
Confidence Interval (95%) 0.283346160735418 13.8665731973385 2.02434743714822
4 min
2AA
Sample Time (s) Min (fps) Average (fps)
1 241.063 155 233.026
2 240.968 126 231.478
3 241.125 138 225.215
4 241.234 140 226.062
5 240.625 153 238.636
6 243.797 145 237.988
7 241.281 118 231.257
8 240.907 145 231.786
9 241.282 127 234.584
10 241.016 132 233.702
Sample Mean 241.3298 137.9 232.3734
Sample Standard Deviation 0.889376660622191 12.0963722752824 4.36261005464231
Max 243.797 155 238.636
Min 240.625 118 225.215
Range 3.172 37 13.421
Confidence Interval (95%) 0.551230899413252 7.49726686584818 2.70392238821223
If you need a statistics refresher, a glossary of statistics terms can be found here:
http://www.cas.lancs.ac.uk/glossary_v1.1/main.html
Analysis 1 - The Effect of Sample Sizes
30 death matches of 1 minute duration were recorded in order to get an insight into the number of samples required to get a reasonably accurate 95% confidence interval. Because the confidence interval is based on the sample standard deviation and not the population standard deviation, the accuracy of the confidence interval depends on how close the sample standard deviation is to the population standard deviation. The first series in the graph below shows the 95% sample confidence intervals of the average frame rate for varying numbers of samples. The second series approximates the population confidence intervals for the same sample sizes by using the 30 sample standard deviation for each confidence interval calculation. The general trend is that increasing the number of samples reduces the size of the confidence interval. But the glaring exception is at 5 samples, which produces a fantastically low result. This is merely a statistical anomaly produced by a tight grouping of the first 5 samples. From 10 to 30 samples the sample confidence intervals match the "pseudo" population confidence intervals much more accurately. However, it should be remembered that the population confidence intervals are only an approximation, and the apparent convergence of the intervals occurs quicker here than it would with the actual population standard deviations.
The main result that can be gained from this is that 10 samples is the minimum number required to achieve reasonably consistent results.
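As a rough check of how the confidence intervals in the Results tables can be reproduced, here is a minimal Python sketch for the first ten 1 minute matches. It assumes the intervals are normal-based half-widths (z * s / sqrt(n)), as a spreadsheet function such as Excel's CONFIDENCE would give; the post does not say which formula was used, but the tabulated 4.962 is consistent with that assumption. The Student's t critical value is a standard table constant added here for comparison, and the names `ci_half_width`, `ci_normal` and `ci_student` are made up for this example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Average frame rates (fps) of the first ten 1 minute, 4xAA matches,
# copied from the Results table.
avg_fps = [216.232, 223.342, 220.806, 218.747, 212.963,
           234.649, 217.229, 204.963, 211.58, 212.816]

# Two-sided 95% critical value of Student's t for 9 degrees of freedom
# (standard table value, not taken from the post).
T_CRIT_DF9 = 2.262157


def ci_half_width(samples, critical_value):
    """Half-width of a confidence interval: critical value * s / sqrt(n)."""
    return critical_value * stdev(samples) / sqrt(len(samples))


z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% normal critical value, about 1.96
ci_normal = ci_half_width(avg_fps, z_crit)
ci_student = ci_half_width(avg_fps, T_CRIT_DF9)

print(f"mean = {mean(avg_fps):.4f} fps")
print(f"normal-based 95% CI half-width: {ci_normal:.3f} fps")
print(f"t-based 95% CI half-width:      {ci_student:.3f} fps")
```

With only 10 samples the t-based interval comes out noticeably wider than the normal-based one (roughly 5.7 fps against 5.0 fps), which is another way of seeing why small sample counts give optimistic-looking intervals.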
Analysis 2 - The Effect of Death Match Length
In order to test the effect of death match length, 10 matches were played of 1, 2, 4 and 8 minutes duration. The graph below shows the average of the min, average and max frame rates for each of the death match lengths. The error bars show the 95% sample confidence intervals for each result.
If we examine the average frame rates first, we see that they all have very similar means and confidence intervals. Performing an analysis of variance (ANOVA) gives no evidence that the population averages differ between death match lengths, as F is considerably lower than F crit. The average frame rates can therefore be considered quite repeatable, whatever the match length.
Code:
Anova: Single Factor - Nominal Death Match Length, Average Frame Rate
SUMMARY
Groups Count Sum Average Variance
1 min 10 2173.327 217.3327 64.0944640
2 min 10 2139.658 213.9658 36.0615232
4 min 10 2132.412 213.2412 3.86286395
8 min 10 2122.721 212.2721 10.6677898
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 145.047199 3 48.3490665 1.68630159 0.18721065 2.86626544
Within Groups 1032.17977 36 28.6716602
Total 1177.22696 39
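For anyone who wants to check the table without a spreadsheet, the following Python sketch rebuilds the between-groups and within-groups sums of squares and the F statistic from the per-group counts, averages and variances alone. The group values and F crit are copied from the table above; the function name `anova_from_summary` is invented for this illustration, and the sketch does not compute the P-value.

```python
# Per-group summary statistics copied from the ANOVA table above:
# (count, average fps, variance) for each nominal death match length.
groups = {
    "1 min": (10, 217.3327, 64.0944640),
    "2 min": (10, 213.9658, 36.0615232),
    "4 min": (10, 213.2412, 3.86286395),
    "8 min": (10, 212.2721, 10.6677898),
}

F_CRIT = 2.86626544  # taken from the table, not computed here


def anova_from_summary(groups):
    """One-way ANOVA from (n, mean, variance) triples."""
    stats = list(groups.values())
    total_n = sum(n for n, _, _ in stats)
    grand_mean = sum(n * m for n, m, _ in stats) / total_n
    # Spread of the group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in stats)
    # Spread of the samples inside each group.
    ss_within = sum((n - 1) * v for n, _, v in stats)
    df_between = len(stats) - 1
    df_within = total_n - len(stats)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat


ss_between, ss_within, df_between, df_within, f_stat = anova_from_summary(groups)
print(f"SS between = {ss_between:.4f} (df = {df_between})")
print(f"SS within  = {ss_within:.4f} (df = {df_within})")
print(f"F = {f_stat:.4f}, F crit = {F_CRIT}")
print("difference detected" if f_stat > F_CRIT else "no difference detected")
```

Because F stays below F crit, the test does not detect a difference between the four match lengths, which is the reading used in the text.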
As expected, the longer death matches have closer means and smaller confidence intervals than the 1 minute matches. However, it does seem that increasing the death match length above 4 minutes will not yield a significant increase in accuracy over either the 2 or 4 minute match lengths.
The min frame rates tell a very different story. There is a general trend towards lower frame rates for the longer death match lengths, which is not surprising given the nature of what is being measured. The longer a match is, the more chances there are for computationally and graphically intensive scenarios to arise. Also of interest are the relatively large sample confidence intervals. Performing an ANOVA on the min frame rates shows that it is unlikely that they have the same population mean. F is considerably larger than F crit in this case. Despite the min frame rate being quite an important statistic, it is difficult to recommend using the min frame rate values obtained from Fraps.
Code:
Anova: Single Factor - Nominal Death Match Length, Min Frame Rate
SUMMARY
Groups Count Sum Average Variance
1 min 10 1303 130.3 346.900000
2 min 10 1075 107.5 789.388888
4 min 10 1094 109.4 548.711111
8 min 10 931 93.1 500.544444
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 7042.875 3 2347.625 4.29664106 0.01086496 2.86626544
Within Groups 19669.9 36 546.386111
Total 26712.775 39
Finally, the relatively useless measurement of max frame rates. These behave much like the min frame rate measurements, except the general trend is of increasing max frame rates for longer match lengths. Performing an ANOVA shows no significant difference between the population averages, as F is lower than F crit. However, given the fairly large confidence intervals and the fact that it is not a very useful statistic, there is not much to be gained from using the max frame rate results.
Code:
Anova: Single Factor - Nominal Death Match Length, Max Frame Rate
SUMMARY
Groups Count Sum Average Variance
1 min 10 3179 317.9 458.988888
2 min 10 3249 324.9 109.211111
4 min 10 3301 330.1 269.655555
8 min 10 3340 334 119.111111
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 1455.27500 3 485.091666 2.02762200 0.12734480 2.86626544
Within Groups 8612.69999 36 239.241666
Total 10067.9749 39
Analysis 3 - 2xAA - 4xAA Comparison Test Case
In order to test the legitimacy of the results above, 10 death matches of 4 minutes duration were recorded with 2xAA enabled. Comparing the results with the 4xAA, 4 minute results produced the graph below. Despite concluding that the min and max measurements were not accurate/useful enough, I have included them anyway. Why? Because I can.
Doing yet another ANOVA on the average frame rates shows that it is extremely unlikely that both sets of data have the same mean. The P-value gives the probability of seeing a difference at least this large if the null hypothesis were true (the null hypothesis for this test is: average 2xAA = average 4xAA), in this case a very-easy-to-ignore 2.16e-10. We can be quite certain that there is a statistically significant difference between the measured average frame rates.
Code:
Anova: Single Factor - 2xAA/4xAA, Average Frame Rate
SUMMARY
Groups Count Sum Average Variance
2AA 10 2323.734 232.3734 19.0323664
4AA 10 2132.412 213.2412 3.86286395
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 1830.20538 1 1830.20538 159.876564 2.16e-10 4.41386305
Within Groups 206.057073 18 11.4476152
Total 2036.26245 19
[edit]Fixed incorrect Average Frame Rate ANOVA results[/edit]
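With only two groups, the between-groups degrees of freedom is 1 and a single factor ANOVA is equivalent to a pooled two-sample t-test, with F equal to t squared. The Python sketch below recomputes both from the 4 minute average frame rates in the Results tables. The function names `one_way_anova` and `pooled_t` are hypothetical helpers written for this example, and no P-value is computed.

```python
from math import sqrt
from statistics import mean, variance

# Average frame rates (fps) of the 4 minute matches, copied from the Results tables.
aa2 = [233.026, 231.478, 225.215, 226.062, 238.636,
       237.988, 231.257, 231.786, 234.584, 233.702]
aa4 = [211.892, 215.098, 213.532, 209.601, 213.827,
       214.565, 213.258, 211.003, 213.389, 216.247]


def one_way_anova(*samples):
    """Return (df_between, df_within, F) for a single factor ANOVA."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    grand_mean = sum(sum(s) for s in samples) / n_total
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    ss_within = sum((len(s) - 1) * variance(s) for s in samples)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat


def pooled_t(a, b):
    """Two-sample t statistic using the pooled (equal variance) estimate."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


df_between, df_within, f_stat = one_way_anova(aa2, aa4)
t_stat = pooled_t(aa2, aa4)

print(f"df between = {df_between}, df within = {df_within}")
print(f"F = {f_stat:.3f}, t = {t_stat:.3f}, t squared = {t_stat ** 2:.3f}")
```

The same `one_way_anova` helper accepts any number of groups, so it could be reused for the min and max frame rate comparisons if the raw samples are available.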
Checking out the min frame rates now, we see that even they manage to produce some useful results, with a paltry P-value of about 0.3% under the assumption that both have the same average.
Code:
Anova: Single Factor - 2xAA/4xAA, Min Frame Rate
SUMMARY
Groups Count Sum Average Variance
2AA 10 1379 137.9 146.322222
4AA 10 1094 109.4 548.711111
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 4061.25 1 4061.25 11.6864898 0.00306301 4.41386305
Within Groups 6255.29999 18 347.516666
Total 10316.55 19
Doh! The max frame rates let us down. Or do they? It's quite possible that the max frame rates are more system limited than fill rate limited. Without more information we can only guess.
Code:
Anova: Single Factor - 2xAA/4xAA, Max Frame Rate
SUMMARY
Groups Count Sum Average Variance
2AA 10 3441 344.1 251.433333
4AA 10 3301 330.1 269.655555
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 980 1 980 3.76135442 0.06828722 4.41386305
Within Groups 4689.79999 18 260.544444
Total 5669.79999 19
Conclusion
In general, it seems that Fraps works very well as a benchmarking program. A fairly minor change from 2xAA to 4xAA produced some very conclusive results for the average frame rates. The statistical significance of the max and min frame rates is dubious and, in my opinion, they should be used with caution - or not at all. The sweet spot for minimizing benchmarking time and maximizing accuracy is probably around 10 - 20 samples with death match lengths of around 2 - 4 minutes.
The way the results are displayed should also be considered. In general, I think it is prudent to display the calculated confidence intervals with any graphs as they give a very good visual indication of the ANOVA results.
It must also be remembered that only one level in one game was tested. It is probably quite dangerous to extrapolate the results obtained here to other games and levels. More testing is required to get a fuller picture of how games behave when using Fraps. Depending on interest, I may update this with extra levels at a later date. Ideally, we need someone with both a GeforceFX 5900 Ultra and Radeon 9800 Pro to do some testing, then the truth will finally be revealed.