Neutral Benchmarks

AJ

Newcomer
Hi guys,

While so many different debates concerning benchmarking are going on at the same time, it's hard (or next to impossible) to answer them all extensively. Having now read more B3D forums than ever before (thanks for giving me a nudge, worm!), I do think most of the points have already been more or less evaluated, but I thought I'd join the conversation regardless. :)

Here are some of the points of view that we (Futuremark) see as important:

- Performance measurement is a crucial part of the PC industry. Manufacturers spend a lot of time, resources and money on how new technologies are designed, and measuring the end result is the final step before a customer purchases hardware. From a manufacturer's point of view the important part is how hardware is being bought and sold.

- Any application can be used as a benchmark. The applications that are widely used will become important for the previously mentioned purchasing process. Most manufacturers will do whatever they can to ensure their hardware is shown in a favorable light with at least a dozen of the most important applications like this. As benchmark developers we get constant pressure from all manufacturers, and in an evenly applied environment that's a good thing. Please note that many game developers also live with this pressure. (I hope that any game developers here in the forums might actually enlighten the rest of us with one or two examples?)

- Cheating is possible with any application. With games there are a number of things you can do that will improve your performance profile (dropping frames, funny stuff with AA or filtering, texture resolutions, etc.) without improving the end-user experience, or while actually degrading it. While the argument that games should be used as benchmarks does have a lot of merit, the conclusion that synthetic benchmarks work against the industry is false. Synthetic benchmarks like 3DMark03 make manufacturers do the same optimisation work, not just for the application but also for the DirectX API. There are currently few games on the market that aggressively use DX8 features. 3DMark03 is the first of many tools that will enable manufacturers to work on these kinds of issues. In the end both developers and gamers benefit if manufacturers optimise for the APIs, because all developers then get the same benefits.

- Our aim in developing benchmarks (not just 3DMark, but all benchmarks that we design) is to ensure that objective apples-to-apples comparisons are possible. In order to do this we give all manufacturers an equal opportunity to be involved in our development through beta program cooperation. We also disclose all technical information about our benchmarks to the public. How many other developers do the same? We have policies in place with regard to what DirectX features we adopt, when we adopt them etc. Finally, we work very closely with Microsoft to ensure that we are measuring DX in a proper manner. Summing this up, we realize that developing benchmarks is a significant responsibility and do our best to meet this absolute requirement for objectivity.

- We began developing benchmarks back in 1997 when the industry was still very confused about what was fast 3D hardware and what was not. Back then the only thing that we had for performance metrics was games. The confusion was due to the fact that each and every manufacturer was always able to show a game that performed fast with their hardware, or if everything else failed they'd show a game that wasn't even published yet and claim something about future performance.
I believe most of the readers here in B3D forums were already involved back in 1995 to 1997 and remember what kind of jungle the industry was back then. With such a confused market selling hardware was dictated almost solely by the size of your marketing budget and the skill of your PR department.
When benchmarks emerged from Ziff-Davis and us the net result was that everyone sold more hardware. Not necessarily because they received good benchmark scores (all manufacturers offer both high end and low end products) but because consumers felt safer with their purchase decisions. When in doubt one tends to postpone a purchase and confusion hurts everyone in the industry.

- Frankly, after last week it feels like we're going back to 1997. Games as benchmarks certainly have a lot of positive aspects, but the scary part is that then every manufacturer can highlight a game that they do well with. Looking at the examples so far I'm on the verge of despair. While people demand an amazing level of technical disclosure from the benchmarks, I've seen very few technical breakdowns of _any_ of the games that are 'commonly accepted' as solid benchmarks. Additionally, there does not seem to be any logical, or at least openly disclosed, way to explain why one game is used as a benchmark while another is not (Serious Sam and Comanche 4 being good examples: while both are excellent games, neither really qualifies as a triple-A title, nor to my knowledge has either publicly disclosed any technical benchmark information).

- Deliberately creating confusion in the marketplace is a very short-term goal which is not beneficial to anyone. First and foremost it will lead to unhappy customers who will move over from the PC world to the PS2, or somewhere else where they have a warm and happy feeling about how to spend their money.

- Futuremark as a company is dedicated to creating objective and technically accurate benchmark software that is publicly available with technical disclosures. Solid and time-tested standards for performance measurement are an integral part of the computer industry and in part ensure that computer hardware manufacturers focus more on their actual engineering budgets instead of relying simply on their marketing budgets.

Cheers,

AJ

Ps. Given the level of criticism over last week I’ll have to pre-empt a few follow-up questions:

Yes, games are good as benchmarks and we support their use. They as well as other tools may be used to mislead, so demand public technical disclosures about them also (incidentally John Carmack does this very well in his plan files).

No, 3DMark03 did not invalidate 3DMark2001. :) They are different tools that should be used for measuring different generations of hardware. 3DMark99 and 3DMark2000 have now been retired as legacy products.
 
What I think would help is to do away with the final scores and the weights. The FPS scores in the test are what we find really important. The weighted score is for more or less bragging rights and does not reflect well....
 
To answer one aspect of that, I think reviewers have a woefully inadequate set of tools to review anything with. A game is chosen because it *can* benchmark, not because it's great. Even the games that do have a benchmark mode implement it poorly, the exception being Serious Sam, which I think is more benchmark than game. 3DMark is used so much because it is one thing that is well done, that gives consistent results, and results you can make comments on. Comanche 4 as a benchmark crashes quite a bit and is not very broad in video card support; I distrust any review that uses it, frankly: it ran, but did it display correctly? UT2003 is ok for testing UT2003, but its benchmarking mode could be vastly improved: the high and low peaks could be generated by a rand(), and the precision in the results is a joke, which doesn't give me great confidence in the averages. Ironically some ok benchmarks come from one vendor, NVIDIA; that chameleonmark is quite useful, but I tend to suspect some heavy leaning towards NVIDIA's architecture in the code. There is also the aspect of good benchmarks that just go stale: they become out of date, leaving reviewers with even less to choose from. So, it's really not much about choice, but lack of options.

Game benchmarks are good because they tell users what some aspect of playing that one particular game would be like on their hardware. Game benchmarks are bad because they can be targeted at whoever the market leader is at the time they are in development, or whoever helps them the most or outright sponsors them. The result is still valid for that game, but it can give a warped impression of general performance.

All games should have a benchmark mode, or the API they run on should be capable of generating stats on their behalf, which could be better if done right. Reviewers resort to eyeballing frame counters using things like fraps, which is just sad.
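To make the "benchmark mode" idea a bit more concrete, here is a minimal Python sketch of the kind of summary a game (or the API under it) could produce from a per-frame timing log. It is an illustration only: the function name `summarize_frame_times` and the millisecond input format are assumptions made for this example, not something any game or tool mentioned in this thread actually exposes.

```python
def summarize_frame_times(frame_times_ms):
    """Summarize a list of per-frame render times given in milliseconds.

    Hypothetical helper for illustration; the input format is an assumption.
    """
    if not frame_times_ms:
        raise ValueError("no frames recorded")
    if any(t <= 0 for t in frame_times_ms):
        raise ValueError("frame times must be positive")

    frames = len(frame_times_ms)
    total_ms = sum(frame_times_ms)

    return {
        "frames": frames,
        # Average FPS over the whole run: frames divided by elapsed seconds.
        "avg_fps": 1000.0 * frames / total_ms,
        # The slowest single frame gives the instantaneous low.
        "min_fps": 1000.0 / max(frame_times_ms),
        # The fastest single frame gives the instantaneous peak.
        "max_fps": 1000.0 / min(frame_times_ms),
    }


if __name__ == "__main__":
    print(summarize_frame_times([10.0, 20.0, 10.0]))
```

For a three-frame log of 10 ms, 20 ms and 10 ms this reports an average of 75 FPS with a 50 FPS low, which is the sort of detail an eyeballed frame counter does not capture.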

My $0.02
 
But JB, the entire purpose of the benchmark is exactly to provide a "score." OEMs and the "clueless paper mag reviewer" don't have enough technical knowledge (and don't want to have it either) to look at several individual scores and draw their own conclusions.

Yes, FutureMark has to make more or less a "judgement" call on the relative importance of each game scene to come up with a weighting formula. Ask 100 different people how they would weight the scores, and you're likely to get 75 different answers. What's important is to look at the formulas and, even if you would tweak them differently, see if you at least find them "reasonable."

I personally find 27% for GT 1,2,3 and 20% for GT4 to be perfectly reasonable.
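As a rough illustration of what weighting the game tests means, here is a small Python sketch. It is not Futuremark's published formula: the function names are made up for this example, the weights are applied directly to FPS (a real formula may well scale each test differently), and because the shares quoted above add up to 101% (presumably rounding), the sketch normalizes them first. A test the card cannot run is treated as contributing zero.

```python
# Shares as quoted in the post above; they sum to 101, so they are normalized below.
QUOTED_SHARES = {"GT1": 27, "GT2": 27, "GT3": 27, "GT4": 20}


def normalized_weights(shares):
    """Scale the shares so that the weights sum to exactly 1."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


def overall_score(fps_by_test, shares=QUOTED_SHARES):
    """Weighted sum of per-test FPS (hypothetical scoring, for illustration).

    A missing test, or one set to None, means the card cannot run it and
    therefore earns nothing from that test.
    """
    weights = normalized_weights(shares)
    return sum(
        weight * (fps_by_test.get(name) or 0.0)
        for name, weight in weights.items()
    )


if __name__ == "__main__":
    card_a = {"GT1": 50.0, "GT2": 50.0, "GT3": 50.0, "GT4": 50.0}
    card_b = {"GT1": 50.0, "GT2": 50.0, "GT3": 50.0, "GT4": None}
    print(overall_score(card_a), overall_score(card_b))
```

Under these assumptions a card that runs every test at 50 FPS scores 50, while an otherwise identical card that cannot run GT4 scores about 40.1, so the single number reflects missing feature support as well as raw speed.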
 
jb said:
What I think would help is to do away with the final scores and the weights. The FPS scores in the test are what we find really important. The weighted score is for more or less bragging rights and does not reflect well....

I couldn't agree more
"24 avg fps" is perhaps not as eye catching as 73127612 3dmarks but at least it gives a number that we can deal with

I personally use 3dmark2001se overall score for my reviews
but I've been thinking for some time of just using the actual fps numbers instead
and for 3dmark03 I'll only use the actual fps in everything but the standard test at 1024x768


on 2001:
sometimes the overall score can be very misleading
as for an example
the default score of 3000 marks with my Radeon "256" equalled perfectly playable games
a score of 4000 using 1600x1200 with 6x aa and 16x aniso on my 9700 pro does not in any way reflect playability in even 25% of those games that were fully playable on my Radeon 7200 when it scores 1000 points less

of course the kids out there (and yeah a lot of grown people too) like to have their big score to have something to brag about (because they can't get a job, a girlfriend etc. who knows? ;) )

taking away the weighting would be a starter at least
 
I couldn't agree more
"24 avg fps" is perhaps not as eye catching as 73127612 3dmarks but at least it gives a number that we can deal with

Problem with that is...it has more room for abuse than a 3DMark score.

What does 24 AVG FPS tell you for GT3? That the card that does that can only run DX8 games at 24 FPS?

If Card A runs GT4 at 30 FPS, and card B doesn't run GT4 at all, what does that tell us?

The 3DMark score is purposely NOT given in FPS for that very reason. FutureMark could have done a "real weighted average" of the scores and come up with an "overall average FPS" instead of an overall "score." They purposely chose not to, because looking at the 3DMark score as "FPS" is the wrong way to look at it.

I personally use 3dmark2001se overall score for my reviews
but I've been thinking for some time of just using the actual fps numbers instead and for 3dmark03 I'll only use the actual fps in everything but the standard test at 1024x768

Why? What has changed between '01se and '03? (Or are you saying you changed?)

the default score of 3000 marks with my Radeon "256" equalled perfectly playable games a score of 4000 using 1600x1200 with 6x aa and 16x aniso on my 9700 pro does not in any way reflect playability in even 25% of those games that were fully playable on my Radeon 7200 when it scores 1000 points less

If you are going to completely misuse the benchmark like that, what did you expect? (Assuming I understand your statement correctly...it's a bit garbled.)

At any given setting, does the 9700 Pro score higher than the Radeon 7200? Is the Radeon 9700 Pro not also more playable than the 7200 at the same settings?

of course the kids out there (and yeah a lot of grown people too) like to have their big score to have something to brag about (because they can't get a job, a girlfriend etc. who knows? )

taking away the weighting would be a starter at least

How so? Now you'll have the same "kids" gunning for the "highest FPS" score instead. (Flashes back to the same "my score is bigger than yours" race to get the highest GLQuake score....)
 
Before anyone else gets there, Lars takes another look at NVIDIA's position:

Ah....much better, Lars. :)

It's more or less an attempt to explain each side of the case. I have no issues with that, and it's reasonably well done to boot.

Two quibbles:

As far as PS1.4 support in currently available games is concerned - ATi has published the following list. Decide for yourself:

Where's the list? ;)

So we now see that from NVIDIA's standpoint, the criticism of the new 3DMark 2003 is justified.

I would have qualified that a bit differently, to say: "we now see that from NVIDIA's standpoint, assuming that they can continue to convince developers to specifically cater to their hardware, the criticism of the new 3DMark 2003 is justified."
 
Joe DeFuria said:
...

So we now see that from NVIDIA's standpoint, the criticism of the new 3DMark 2003 is justified.

I would have qualified that a bit differently, to say: "we now see that from NVIDIA's standpoint, assuming that they can continue to convince developers to specifically cater to their hardware, the criticism of the new 3DMark 2003 is justified."

No, I think in the context of the statements before and after, the statement is well worded and suitable.

Sorry, Joe, but I felt with all the improvements in the approach I had to defend Lars for once. ;)
 
Sabastian said:
Just where is this list of games? Can't find it anywhere. I hear about it but can't find this list at all.
This thing is beginning to be a big laugh. Everyone is talking about that "big list" but nobody is able to publish it. I don't see the point. Either it exists, and then publish it, or it doesn't, and then :rolleyes:

Hope to see it soon :!: :devilish:
 
Joe DeFuria said:
Problem with that is...it has more room for abuse than a 3DMark score.

What does 24 AVG FPS tell you for GT3? That the card that does that can only run DX8 games at 24 FPS?

If Card A runs GT4 at 30 FPS, and card B doesn't run GT4 at all, what does that tell us?

The 3DMark score is purposely NOT given in FPS for that very reason. FutureMark could have done a "real weighted average" of the scores and come up with an "overall average FPS" instead of an overall "score." They purposely chose not to, because looking at the 3DMark score as "FPS" is the wrong way to look at it.

I simply can't see what your point is here.
If Futuremark is weighting the score that would make more sense to you than the individual scores?

Why? What has changed between '01se and '03? (Or are you saying you changed?)
primarily the fact that benchmarking any other card than R300-based ones today is pointless, 2001 gives a much better view on actual performance there

If you are going to completely misuse the benchmark like that, what did you expect? (Assuming I understand your statement correctly...it's a bit garbled.)

At any given setting, does the 9700 Pro score higher than the Radeon 7200? Is the Radeon 9700 Pro not also more playable than the 7200 at the same settings?

a score should be a score IMHO
if I get a 70 avg fps in quake 3 in 1600x1200 with no AA / AF on my 7200
and 70 fps at the same res but with full AA/AF on my 9700 I know that they are equally playable
but a result from 3dmark2001 could be almost up to 2000 points higher than another score yet it would actually reflect worse gameplay in the vast majority of games
this is especially true when benchmarking with AA/AF

As for misusing a benchmark I dunno why you would call it that
many people would like to find out:
"I have a Radeon 8500 now and I run most games at 1024x768 with 16x AF, what settings will I be able to use if I get a 9700 Pro?"
of course I don't publish any benchmarks like that (ie I don't put up a 7200 to a 9700 using different settings)

How so? Now you'll have the same "kids" gunning for the "highest FPS" score instead. (Flashes back to the same "my score is bigger than yours" race to get the highest GLQuake score....)

still 42.7 fps isn't really much to start bragging about compared to "4809 3dmarks"
I'm just saying that I think the "mark" thing is pointless
it doesn't give a figure that tells you anything, well actually I'd say it tells you less than the individual scores


beyond that I think the "if you don't have the hardware function you can't get any points from this test" stuff is pointless
sure 3dmark03 is meant to reflect future performance but we'll probably have 3dmark07 by the time we actually start seeing games where you must have DX9 hardware as a bare minimum
that also adds to the effect that I think the overall score is pointless
and it's also one of the contributing factors that make scores incomparable between many cards/settings

in 2001 it made some sort of sense though me thinks
either you had HW shaders or you didn't

the game test 4 in 03 makes as little sense as a
Pixel Shader 1.2 or up only test would IMHO
 
nice to see you, AJ, joining here as well. :)

but back to the question about having a score or not having it...
we all know that in benchmarking you just can't make a bench that would be perfect for everyone. And I think the way 3DMark is now is pretty good. For normal reviewers and for gamers, there's a simple number that represents the card's performance in these particular tests. For advanced technical reviewers there is an option for individual performance figures for all tests, to get enough information for making their own conclusions and metrics.

the problem seems to be more that maybe my joke to Worm about having Final Reality 2003 wouldn't have been a bad idea. A new name would have made clear that this new benchmark shouldn't make 2001SE obsolete. Now most people on the boards believe that 3DMark 2003 is replacing 2001SE, and in that light it really doesn't look like as good a "complete gamers benchmark" as 2001SE was. 03 differs so much from 2001SE that maybe a name change would have been a good idea, to make it look like a completely new GPU/card benchmark, not like the "gamers benchmark" that it is supposed to be now. But, yes... it is so easy to be wise afterwards. (too bad that I didn't have any idea how radically 03 would differ from 2001SE... otherwise I would have tried to tell about possible problems ahead. but again, this is "afterward wise men talk" so, I'll drop it here.)

and of course, there was a bunch of ppl complaining during the 2001 launch about how Nature didn't run on their hardware. (umh... I must admit that I was in that group too... after all I had bought a DX8 compatible (but not compliant) AIW Radeon just before 3DMark2001 was launched. the disappointment was rather big after hearing just a bunch of explanations of why Radeon didn't support PS 1.0 after all.)

so, what are my suggestions to make it better?? umm... hard to say really... maybe showing the scores from the individual tests as a sum on the result screen would at least explain to the regular gamer, as well as the normal reviewer, how the score is put together, along with text that tells which DirectX version is mostly used in each Game Test.
 
This thing is beginning to be a big laugh. Everyone is talking about that "big list" but nobody is able to publish it. I don't see the point. Either it exists, and then publish it, or it doesn't, and then
It is a bit like finding a 4 leaf clover... ;)

I don't think it has to be that big. Honestly there are perhaps 10-12 AAA must-have titles a year. Titles that everyone talks about etc. Having PS 1.4 support in a majority of those should be more than enough to justify PS 1.4

Generally the mom and pop games don't even need PS 1.1... let alone true DX8.1 features.
 
If Futuremark is weighting the score that would make more sense to you than the individual scores?

Yes, because the 3DMark score is not simply a game performance score! Its purpose is to reflect both performance and feature support.

primarily the fact that benchmarking any other card than R300-based ones today is pointless, 2001 gives a much better view on actual performance there

Correct. 3DMark 2001 gives a much better view for OLDER HARDWARE for current games! FutureMark has made exactly this point. Continue to use 3DMark01 for older hardware, and use 3DMark03 for newer hardware and future performance estimates.

And I don't agree that 3DMark03 is pointless for DX8 hardware. It's no less pointless than 3DMark01 was for DX7 hardware. DX7 hardware couldn't run all the 01 tests either.

if I get a 70 avg fps in quake 3 in 1600x1200 with no AA / AF on my 7200 and 70 fps at the same res but with full AA/AF on my 9700 I know that they are equally playable

Right....equally playable ON THAT GAME. Does this translate to other games? Are you saying that because Quake3 shows that the 7200 and 9700 are equally "playable" at some setting, that other games will show the same relative playability between the two cards?

Will that translate to Doom3? No way.

but a result from 3dmark2001 could be almost up to 2000 points higher than another score yet it would actually reflect worse gameplay in the vast majority of games this is especially true when benchmarking with AA/AF

I say again, you are trying to use the benchmark in a way it's not intended to be used. In any case, when Doom3 comes out, see if the "score difference" in 3DMark 03 is more in line with how the two cards are "playable" on that title. And again, "playable" refers to performance and image quality.

"I have a Radeon 8500 now and I run most games at 1024x768 with 16x AF, what settings will I be able to use if I get a 9700 Pro?"

Right, and the answer is, it depends completely on what game you're talking about!

it doesn't give a figure that tells you anything, well actually I'd say it tells you less than the individual scores

Well, I strongly disagree:

http://www.beyond3d.com/forum/viewtopic.php?t=4297

I wouldn't say it tells us less than the individual scores. It "generalizes" the individual scores to come up with a meaningful result.

What does a score of "can't run it" for GT4 on a 7200 tell us, as opposed to 30 FPS on a 9700 Pro? That the 7200 doesn't support DX9? We can get that from the spec sheets.

The overall score places a WEIGHT on the importance of being able to run the test at all, presumably based on some estimate of how much games going forward will make use of the techniques in each test.

beyond that I think the "if you don't have the hardware function you can't get any points from this test" stuff is pointless

Well then, what are you using 3DMark01 for?

sure 3dmark03 is meant to reflect future performance but we'll probably have 3dmark07 by the time we actually start seeing games where you must have DX9 hardware as a bare minimum

So then your complaint is it's "too forward looking?" We have one group of people saying "it's not DX9 enough" and others like you saying it's too forward looking.

That's an indication to me, that it's probably about right. ;)
 