Fraps Benchmarking

The Baron said:
Has anyone made a functional yet remarkably stupid bot that takes negligible amounts of CPU power?
Some of my friends would make excellent stupid bots. Maybe we could use them? :D
Being serious now: I think using bots (and some sort of scripting to restart matches, turn fraps on and off, etc...) is a great idea. I don't know how possible it is to make a bot that's cheap on cpu cycles. You'll still need the route finding and aiming/shooting code. I suppose you could dumb the tactics down a bit. I wonder what percentage of cpu time the tactics part of the AI takes up.

BTW, thanks for the feedback guys, it makes me feel all warm and fuzzy inside. Mmmm, warm fuzz. :D
 
Bolloxoid said:
Of course we know that there is negligible variation with recorded timedemos compared to random scenes, which is exactly why we need statistical techniques to find out how we can arrive at reliable results using random scenes.

What would make you think there'd be only a "negligible" difference between multiple random scenes and multiple reiterations of pre-recorded timedemos? I would expect that there'd be "negligible" differences between iterations of the timedemo, because the machine is being instructed to do exactly the same thing each iteration. OTOH, when a person attempts these iterations manually, there could be significant differences in the framerates between manual iterations simply because of the general impossibility of duplicating all of the actions you took in the original iteration in each succeeding iteration. I would think that would be obvious...

There are some good reasons for using bot deathmatches (or randomised flybys) instead of recorded timedemos. First, they are immune to the much-publicised "on-rail" cheats found in 3DMark03 and possibly in some widely used benchmarking timedemos for actual games. Secondly, they might be more representative samples of graphics card performance, because they might expose the card (and driver) to a wider selection of scenes to render than a short timedemo that is always identical.

If you make your own timedemo, though, you automatically eliminate the possibility of an IHV doing anything similar to what was done in 3DMark03, because the IHV has never seen your timedemo...;) If you are talking about optimizations in the drivers relative to the game engine itself, these will apply equally in your own timedemos and in your own manually played, time-limited deathmatches.

The point of using your own timedemos, as opposed to manual Fraps runs, is to eliminate as many of the variables between iterations that affect frame rates as you can manage, it seems to me.
 
Nathan said:
Some of my friends would make excellent stupid bots. Maybe we could use them? :D
Being serious now: I think using bots (and some sort of scripting to restart matches, turn fraps on and off, etc...) is a great idea. I don't know how possible it is to make a bot that's cheap on cpu cycles. You'll still need the route finding and aiming/shooting code. I suppose you could dumb the tactics down a bit. I wonder what percentage of cpu time the tactics part of the AI takes up.

BTW, thanks for the feedback guys, it makes me feel all warm and fuzzy inside. Mmmm, warm fuzz. :D


Nice work on the Fraps stuff.

Also there is a command in UT2k3 that will show you how much time is spent on AI code, rendering and other stuff. I don't remember it offhand... I will look it up and get back to you (provided someone else does not post it here).
 
Try one of these:

STAT ALL - Shows all stats
STAT AUDIO - Shows audio stats
STAT FPS - Displays your frames per second
STAT GAME - Displays game stats
STAT HARDWARE - Shows hardware stats
STAT NET - Shows network game play stats
STAT NONE - Turns off all stats
STAT RENDER - Displays rendering statistics
 
WaltC said:
What would make you think there'd be only a "negligible" difference between multiple random scenes and multiple reiterations of pre-recorded timedemos?

I was a bit unclear; I was trying to say that there are negligible differences between different runs of the same recorded timedemo, as opposed to running random time-limited scenes.

If you make your own timedemo, though, you automatically eliminate the possibility of an IHV doing anything similar to what was done in 3DMark03, because the IHV has never seen your timedemo...;)

Yes, but then the question is why use timedemos at all, if by introducing randomness we can sample the graphics card's performance better, making it render a larger variety of scenes.

The point of using your own timedemos, as opposed to manual Fraps runs, is to eliminate as many of the variables between iterations that affect frame rates as you can manage, it seems to me.

A bit of test theory. All tests have two attributes: reliability (how consistent the results are between runs) and validity (how representative the tests are of the variable they are supposed to measure). By using identical timedemos you are certainly increasing reliability, but by using several random scenes you are increasing validity by making the card render a larger selection of scenes, helping you to get a more comprehensive measure of the card's performance.

An exaggerated example: benchmarking a five-second timedemo of an empty room in a game is extremely reliable (there is practically no variation between different runs) but it is hardly a valid benchmark of the graphics card's overall capability to run that game. So in essence we are talking about a satisfactory combination of reliability and validity. Randomness may be a way to increase validity.
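
To make that trade-off concrete, here is a toy simulation; it is only a sketch, and every number in it is invented for illustration. The "empty room" benchmark below is extremely reliable but badly biased, while the random scenes scatter more from run to run yet centre on real gameplay performance:
Code:
# Toy illustration of reliability vs validity -- all figures are invented.
import random
random.seed(0)

TRUE_GAMEPLAY_FPS = 60.0  # assumed "true" average fps over real gameplay

def empty_room_run():
    # Extremely reliable (tiny run-to-run spread) but invalid:
    # biased far above what the card manages in actual play.
    return random.gauss(95.0, 0.5)

def random_scene_run():
    # Noisier per run, but valid: centred on real gameplay performance.
    return random.gauss(TRUE_GAMEPLAY_FPS, 8.0)

for name, run in [("empty room", empty_room_run), ("random scenes", random_scene_run)]:
    runs = [run() for _ in range(10)]
    mean = sum(runs) / len(runs)
    spread = max(runs) - min(runs)
    print(f"{name:13s} mean {mean:5.1f} fps, run-to-run spread {spread:4.1f} fps")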
 
Bolloxoid said:
...A bit of test theory. All tests have two attributes: reliability (how consistent the results are between runs) and validity (how representative the tests are of the variable they are supposed to measure). By using identical timedemos you are certainly increasing reliability, but by using several random scenes you are increasing validity by making the card render a larger selection of scenes, helping you to get a more comprehensive measure of the card's performance.

An exaggerated example: benchmarking a five-second timedemo of an empty room in a game is extremely reliable (there is practically no variation between different runs) but it is hardly a valid benchmark of the graphics card's overall capability to run that game. So in essence we are talking about a satisfactory combination of reliability and validity. Randomness may be a way to increase validity.

Well, if you wanted to be more thorough, you could simply record your timedemos for different scenes/maps--as many as you wanted, right?

Agreed that a 5-second timedemo, or manual run of equal length, is unlikely to tell us much of anything. I guess I'm just not sold on the idea that doing multiple manual runs introduces randomness as much as it would introduce sample inconsistency regarding frame rate averages between runs.

An extreme example of what I mean: first run you sit for two minutes facing a wall; second run you shoot everything that moves for two minutes while standing still facing away from the wall; third run you shoot everything that moves while at a constant run through the level. You wind up with three wildly different frame rate results for the game that have little to do with the software, your system, the 3d card, etc., but mostly have been determined by what you did or didn't do while playing the game manually for two minutes. You can get major framerate deviation in practically any 3d game by doing similar things while you play. So, to try and get more consistent frame rates while playing manually you try to be as consistent in the way you play as you can humanly be--yet you will inevitably fail in being perfectly consistent between runs. Using a pre-recorded timedemo of your own makes gameplay absolutely consistent between iterations, so why not do that, instead? Just MO...;)
 
jb said:
Try one of these:
STAT ALL - Shows all stats
STAT AUDIO - Shows audio stats
STAT FPS - Displays your frames per second
STAT GAME - Displays game stats
STAT HARDWARE - Shows hardware stats
STAT NET - Shows network game play stats
STAT NONE - Turns off all stats
STAT RENDER - Displays rendering statistics
Cool, maybe I should get UT2003 - though 2004 is just around the corner, and it has vehicles. :)

Bolloxoid said:
A bit of test theory. All tests have two attributes: reliability (how consistent the results are between runs) and validity (how representative the tests are of the variable they are supposed to measure). By using identical timedemos you are certainly increasing reliability, but by using several random scenes you are increasing validity by making the card render a larger selection of scenes, helping you to get a more comprehensive measure of the card's performance.
Nicely said. I would also like to point out that recorded demos do not accurately portray the cpu load for a game. They may be very useful for testing video cards, but I think that a reviewer should be careful about how they use them for testing other computer components.

WaltC said:
An extreme example of what I mean: first run you sit for two minutes facing a wall; second run you shoot everything that moves for two minutes while standing still facing away from the wall; third run you shoot everything that moves while at a constant run through the level. You wind up with three wildly different frame rate results for the game that have little to do with the software, your system, the 3d card, etc., but mostly have been determined by what you did or didn't do while playing the game manually for two minutes. You can get major framerate deviation in practically any 3d game by doing similar things while you play.
Standing still looking at a wall is not representative of actual game play. As long as you're playing the game "the way it's meant to be played" :D - I don't see what the problem is. Who cares what fps standing looking at a wall generates?

I think there is a problem, though. Comparing a 9800 Pro against a 5200 @ 1600x1200 will generate completely different gameplay. I'm pretty sure I'd get hammered on the 5200, and do pretty well with a 9800. Having said that, the frame rates will be so completely different that it should be obvious which is faster. There may be a problem if the image quality differences between a 5900 Ultra and a 9800 Pro cause differences in the game play. I don't think that it's possible to isolate those effects completely.
 
Nathan said:
Cool, maybe I should get UT2003 - though 2004 is just around the corner, and it has vehicles. :)
Actually, there are already vehicles in UT2k3; open the maps folder and look for vehicledemo.bsp. Just open the map by double-clicking it.

Sorry for offtopic.
 
AAlcHemY said:
Nathan said:
Cool, maybe I should get UT2003 - though 2004 is just around the corner, and it has vehicles. :)
Actually, there are already vehicles in UT2k3; open the maps folder and look for vehicledemo.bsp. Just open the map by double-clicking it.

Sorry for offtopic.

It's a pretty lame demo compared to what 2004 will be. You can't even aim the rockets, only drive.

But yea there are vehicles. I guess modders were supposed to make vehicle mods or something?

offtopic too
 
WaltC said:
Well, if you wanted to be more thorough, you could simply record your timedemos for different scenes/maps--as many as you wanted, right?

Sure, if that is possible with the particular title we are using. The point of this article is that we can get reliable results even when the game does not have built-in benchmarking/timedemo functionality, which is of course why people use Fraps in the first place. That is why your suggestions about using recorded timedemos to "hit the mean faster" struck me as odd.
 
Hi,

Thanks for the link. Very interesting read. I was initially very surprised by the data indicating such a variation in minimum FPS values between samples. Then I read further and saw that each sample was unique rather than run on a fixed demo, which explained why the results were different.

Kinda cool that if you do enough benchmarks at random while running through a level you can derive values that have some weight.

I would still recommend that people use Fraps with pre-recorded demos or replays, if only to ensure that everything is reproducible. You also don't need to take so many samples.

Regards,
Rod Maher

k_r_u_n_o@hotmail.com wrote:

> User Feedback From: Kruno
>
> http://www.beyond3d.com/forum/viewtopic.php?t=7527&postdays=0&postorder=asc&start=0
>
> Something worth taking a look at if you have the time.
>
> "The sweet spot for minimizing benchmarking time and maxmizing accuracy is probably around 10 - 20 samples with death match lengths of around 2 - 4 minutes. "
>
> Maybe you can look into it and keep improving FRAPS?
>
> From what I read there is still room for improvement. Please don't quit now.
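
Rod's point about needing fewer samples with a reproducible demo falls out of the confidence-interval arithmetic used later in this thread (95% half-width = 1.96 × s / √n, the normal approximation). A quick sketch in Python; the two standard deviations below are assumptions for illustration only:
Code:
# How many runs for a target 95% CI half-width? Normal approximation,
# matching the z = 1.96 intervals used elsewhere in this thread.
import math

def samples_needed(stddev, half_width, z=1.96):
    """Smallest n with z * stddev / sqrt(n) <= half_width."""
    return math.ceil((z * stddev / half_width) ** 2)

# Assumed spreads: ~6 fps between random manual runs, ~1 fps for a fixed demo.
print(samples_needed(6.0, 1.0))  # -> 139 random runs for a +/- 1 fps interval
print(samples_needed(1.0, 1.0))  # -> 4 runs of a recorded demo for the same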
 
Just simple averages of 10 runs using Vice City. Each sample was approximately 4m:45s long, with no attempt made to keep the route taken in each sample similar. Like I said, these are just averages, but I still think the results are accurate.
Code:
Grand Theft Auto: Vice City
Sapphire 9800 Atlantis

No Anti-Aliasing / No Anisotropic
Sample      1024x768   1280x1024   1600x1200
1           64.5       65.0        61.9
2           66.0       65.0        63.0
3           70.6       56.9        68.4
4           66.1       63.2        48.8
5           70.3       67.4        69.6
6           70.6       69.3        53.3
7           57.3       70.0        65.8
8           80.4       62.3        55.1
9           73.4       57.9        65.1
10          64.7       59.8        51.6
Average     68.4       63.7        60.3

4x Anti-Aliasing / 8x Anisotropic
Sample      1024x768   1280x1024   1600x1200
1           67.5       54.3        55.8
2           63.3       70.7        55.2
3           69.1       61.4        52.9
4           63.0       67.0        56.3
5           58.9       48.1        48.6
6           74.3       60.3        46.7
7           72.4       50.2        46.6
8           56.3       66.1        49.1
9           62.0       66.5        47.3
10          71.5       67.1        54.9
Average     65.8       61.2        51.4
A feature I would like to see in FRAPs is an automatic timer mode. That is, you can set a time limit for how long FRAPs logs framerates once you start logging.
 
Bolloxoid said:
Sure, if that is possible with the particular title we are using. The point of this article is that we can get reliable results even when the game does not have built-in benchmarking/timedemo functionality, which is of course why people use Fraps in the first place. That is why your suggestions about using recorded timedemos to "hit the mean faster" struck me as odd.

Well, as Nathan's stats were compiled under Q3, which I presume can be used to build your own timedemos...;)...what would be odd about suggesting it?

Nathan said:
Standing still looking at a wall is not representative of actual game play. As long as you're playing the game "the way it's meant to be played" - I don't see what the problem is. Who cares what fps standing looking at a wall generates?

I think there is a problem, though. Comparing a 9800 Pro against a 5200 @ 1600x1200 will generate completely different gameplay. I'm pretty sure I'd get hammered on the 5200, and do pretty well with a 9800. Having said that, the frame rates will be so completely different that it should be obvious which is faster. There may be a problem if the image quality differences between a 5900 Ultra and a 9800 Pro cause differences in the game play. I don't think that it's possible to isolate those effects completely.

Nathan, I don't see a problem, except one of variability between frame rates based on the way the reviewer happens to play--his style of play--and the differences in that play over each succeeding iteration. There's no question that differences in frame rates will occur between iterations simply because each successive iteration is played differently than the one before it. It's guaranteed, IMO. My examples were meant only to illustrate extremes in those differences, and to illustrate what they could do to frame rates, when each iteration is played manually. Because of this variable, it seems difficult to judge the caliber of Fraps, since the differences in the frame rates between iterations are not the fault of the Fraps software, but rather result from the simple fact that each successive iteration is played differently than the one before it.
 
WaltC said:
Well, as Nathan's stats were compiled under Q3, which I presume can be used to build your own timedemos...;)...what would be odd about suggesting it?

Quote from the opening paragraph of Nathan's text (emphasis mine):

Nathan said:
With the current difficulties inherent in obtaining reliable benchmarks, many reviewers and enthusiasts are starting to use Fraps in order to benchmark games that do not have any benchmarking capability. One method of benchmarking with Fraps is detailed in Reverend's recent review of the GeforceFX 5600 Ultra (Rev2) [...] However, this method can only be used with a few games. Using Fraps to record a standard gameplay situation provides a wealth of benchmarking opportunities. The problem is, no one knows how repeatable the results are. So let's find out...

WaltC said:
Because of this variable, it seems difficult to judge the caliber of Fraps, since the differences in the frame rates between iterations are not the fault of the Fraps software, but rather result from the simple fact that each successive iteration is played differently than the one before it.

That is the whole premise of the article, to find out statistically how many iterations are needed to make this type of benchmarking methodology reliable. This is not a test of the technical accuracy of Fraps! You have misunderstood the purpose of the analysis.
 
WaltC said:
Nathan, I don't see a problem, except one of variability between frame rates based on the way the reviewer happens to play--his style of play--and the differences in that play over each succeeding iteration. There's no question that differences in frame rates will occur between iterations simply because each successive iteration is played differently than the one before it. It's guaranteed, IMO. My examples were meant only to illustrate extremes in those differences, and to illustrate what they could do to frame rates, when each iteration is played manually. Because of this variable, it seems difficult to judge the caliber of Fraps, since the differences in the frame rates between iterations are not the fault of the Fraps software, but rather result from the simple fact that each successive iteration is played differently than the one before it.

My point was that image quality differences could cause the reviewer to play the game differently on different cards. This could mean the differences in fps between the 2 cards are not just due to the hardware differences, but also to a CONSISTENT difference in the reviewer's playing style. The whole point behind doing multiple runs and using statistical analysis is that the effect of random variations between runs is reduced.

Another example is using 2 different reviewers, where one does the benchmarking with a card1 and the other with card2. Are the differences in the fps of each card due to the hardware, or the playing style of the reviewers?

Ratchet, thanks for the results. I'll stick 'em through my super-stats-benchmark-o-matic machine and see what comes out. :D
 
OK, here goes...

Code:
Ratchet's Vice City Results Analysis

No Anti-Aliasing / No Anisotropic
                         1024x768   1280x1024   1600x1200    
Mean                     68.39      63.68       60.26
Standard Deviation       5.88       4.31        7.07
95% Confidence Interval  3.64       2.67        4.38

4x Anti-Aliasing / 8x Anisotropic
                         1024x768   1280x1024   1600x1200    
Mean                     65.83      61.17       51.34
Standard Deviation       5.73       7.43        3.84
95% Confidence Interval  3.55       4.60        2.38


Anova: Single Factor, No Anti-Aliasing / No Anisotropic
Probability of all 3 resolutions having the same mean: 2.3%

Anova: Single Factor, 4x Anti-Aliasing / 8x Anisotropic
Probability of all 3 resolutions having the same mean: 0.0056%

Anova: Single Factor, 1024x768
Probability of No AA / No AF and 4x AA / 8xAF having the same mean: 37%

Anova: Single Factor, 1280x1024
Probability of No AA / No AF and 4x AA / 8xAF having the same mean: 39%

Anova: Single Factor, 1600x1200
Probability of No AA / No AF and 4x AA / 8xAF having the same mean: 0.37%

It's a bit of a mixed bag, really. On the plus side, there is definitely a trend of decreasing frame rates with increasing resolution, both for No AA / No AF and for 4x AA / 8x AF. Unfortunately, the confidence intervals for 1024x768 and 1280x1024 are too wide to draw any firm conclusions. 1600x1200 works beautifully though, with a 99.63% chance of the means being different.

If I was to write a conclusion, it would go something like this: The results show, with high probability, that increasing the resolution in GTA Vice City has a negative impact on the frame rate. It is also likely that adding 4x AA and 8x AF will reduce the frame rate as well.

Ratchet, I think you probably picked one of the least suitable games to benchmark with Fraps. Considering that, I'm quite happy, since the results are generally quite useful. Thanks again. :)
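
For anyone who wants to rerun these numbers, the analysis above can be reproduced from Ratchet's raw samples. A sketch in Python with numpy/scipy (assumed available); note that matching the figures above requires the population standard deviation (ddof = 0) and z = 1.96 intervals, and the ANOVA here is scipy's one-way test, which should land on the same probabilities:
Code:
# Reproducing the No AA / No AF analysis from Ratchet's Vice City samples.
import numpy as np
from scipy import stats

res_1024 = [64.5, 66.0, 70.6, 66.1, 70.3, 70.6, 57.3, 80.4, 73.4, 64.7]
res_1280 = [65.0, 65.0, 56.9, 63.2, 67.4, 69.3, 70.0, 62.3, 57.9, 59.8]
res_1600 = [61.9, 63.0, 68.4, 48.8, 69.6, 53.3, 65.8, 55.1, 65.1, 51.6]

for name, data in (("1024x768", res_1024), ("1280x1024", res_1280),
                   ("1600x1200", res_1600)):
    a = np.asarray(data)
    sd = a.std(ddof=0)                 # population SD, matching 5.88 / 4.31 / 7.07
    ci = 1.96 * sd / np.sqrt(len(a))   # 95% CI half-width, normal approximation
    print(f"{name}: mean {a.mean():.2f}, sd {sd:.2f}, 95% CI +/- {ci:.2f}")

# One-way ANOVA across the three resolutions; the p-value should land
# near 0.023, i.e. the 2.3% quoted above.
f_stat, p_value = stats.f_oneway(res_1024, res_1280, res_1600)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")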
 
ok, how about Rallisport Challenge...

I raced each of the four race types twice (Rally, Hillclimb, Ice Racing, RallyCross) using a random course/track and car. I logged the FRAPs framerate from the start of the race to the finish, and logged again while watching the replay of the race from the default replay camera. This gave me around 2-3 minutes per sample and 16 samples when everything was said and done (8 races, 2 samples per race). I then averaged the results once again.
Code:
             Average   Min    Max
0x AA / 0x AF
1024x768     96.8      50.0   166.9
1280x1024    75.6      46.9   119.9
1600x1200    58.9      33.4    92.6
4x AA / 8x AF
1024x768     58.7      32.9   103.6
1280x1024    41.3      23.7    70.1
1600x1200    31.2      18.7    54.0
Of course, I never recorded the samples, so this post is pretty much useless *cough*
 
Nathan:

That was truly a joy to read. I've been lamenting the lack of statistical analysis in this industry for a long time. Thank you for the wonderful piece of work.

I'm currently working on a project that would be helped significantly by tests like the one you just performed. The idea is that with enough samples of data from around the web, a database can be compiled and analysis done to make predictions about how variations in systems will affect certain benchmark scores. (Say that across the board we have 1000 different samples of the Radeon 9700 running the UT2k3 botmatch on various kinds of hardware. With what degree of accuracy can we predict variations in scores versus changes in, say, CPU speed? Alternatively, how do different cards perform given the same test system?) In addition, with enough data we should be able to start making predictions about how a specific configuration should perform based on how others do. I think there are a *lot* of really neat things that could be done if someone was willing to do them.

Now that you've actually found Fraps to be a reasonably good benchmark (at least in the average frame rate case), I'd absolutely love to see people start doing tests like yours for many different combinations of hardware. This is very *very* cool. :)

edit: spelling

Thanks,
Nite_Hawk
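
A minimal sketch of the kind of prediction Nite_Hawk describes, fitting average fps against CPU clock with an ordinary least-squares line; the data points are invented placeholders rather than real measurements:
Code:
# Hypothetical illustration: predict fps from CPU clock for one card/game.
# The sample points below are made up; a real database would supply them.
import numpy as np

cpu_mhz = np.array([1800, 2000, 2200, 2400, 2600], dtype=float)
avg_fps = np.array([52.0, 57.5, 61.0, 66.5, 69.0])

slope, intercept = np.polyfit(cpu_mhz, avg_fps, 1)  # least-squares line fit
predicted = slope * 2800 + intercept                # extrapolate to 2800 MHz
print(f"fps ~ {slope:.4f} * MHz + {intercept:.1f}; at 2800 MHz: {predicted:.1f}")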
 