Chalnoth said:
3DMark is a terrible benchmark. It tries far too much to be a game benchmark, but the goals are entirely different from a real game. This makes it irrelevant as a benchmark of performance within games.
Of course its goals are different - it's a benchmark, not a game.
What's your point?
It is still fair to criticize the instances where it doesn't use optimally efficient code, because that invites/simplifies cheating. But that doesn't otherwise make the data any more or less relevant. Which ties into....
A good synthetic benchmark is one that tests specific aspects of a video card.
I'm sorry to be blunt, but this statement is complete and utter nonsense.
The quality of a benchmark is directly tied to how feasible it is to apply its results to predicting application performance.
Whether the benchmark is synthetic simply doesn't enter into it, other than that a synthetic benchmark can be designed to avoid the very specific rendering tricks that you could predict would introduce a corresponding bias in the results. That's an advantage of synthetic benchmarks that mimic real applications.
Take SPEC (int/fp) as an example. The benchmark codes are selected to be representative of the application space to be modeled, to have reasonable memory access characteristics, et cetera. The care with which the SPEC suite is constructed makes it as good a predictor as is possible for its target application space. Pretty much unarguably better than just about any real application(s) - except of course if you can benchmark exactly the ones you are interested in, with exactly the same parameters that you will use for production runs. Generally not a realistic option these days. (Yes, the SPEC codes have their origin in actual problems. That doesn't make the suite any less artificially constructed. Constructed = better, if done well.)
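To make the mechanics concrete, here is a minimal sketch of SPEC-style composite scoring: each benchmark contributes the ratio of a fixed reference runtime to the measured runtime, and the suite score is the geometric mean of those ratios, with the individual subresults published alongside it. The benchmark names and timings below are made up purely for illustration.

[code]
# SPEC-style composite score: per-benchmark ratio of reference runtime to
# measured runtime, combined with a geometric mean. All names and timings
# here are invented for illustration.
from math import prod

reference_times = {"bench_a": 1200.0, "bench_b": 900.0, "bench_c": 1500.0}  # seconds
measured_times  = {"bench_a":  400.0, "bench_b": 450.0, "bench_c":  300.0}  # seconds

ratios = {name: reference_times[name] / measured_times[name] for name in reference_times}
suite_score = prod(ratios.values()) ** (1.0 / len(ratios))

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")                 # individual subresults stay visible
print(f"composite (geometric mean): {suite_score:.2f}")
[/code]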
Testing specific aspects is possible, but the problem of going from the data points to a prediction about application performance becomes difficult or impossible. Difficult but doable for developers - impossible for consumers/reviewers who do not know the exact specifics of how the games are coded.
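To illustrate why: even with perfect subsystem numbers, any prediction needs the application's per-frame workload mix, which is exactly what outsiders don't have. A hypothetical sketch, with every figure invented and a deliberately crude "slowest subsystem bounds the frame" model:

[code]
# Hypothetical: turning per-subsystem throughputs into a frame-time estimate
# requires knowing how much of each kind of work one frame of the target game
# generates. Without that mix, the subsystem numbers predict nothing.
subsystem_rates = {
    "fill_mpix_per_s":   1800.0,   # measured by a synthetic fillrate test
    "tri_mtris_per_s":     90.0,   # synthetic triangle throughput test
    "tex_mtexels_per_s": 3600.0,   # synthetic texturing test
}
frame_workload = {                 # the part only the developers really know
    "fill_mpix_per_s":   3.1,      # millions of pixels drawn per frame
    "tri_mtris_per_s":   0.05,     # millions of triangles per frame
    "tex_mtexels_per_s": 9.0,      # millions of texels fetched per frame
}
# Crude estimate: assume the slowest subsystem bounds the frame.
frame_time_s = max(frame_workload[k] / subsystem_rates[k] for k in subsystem_rates)
print(f"estimated ceiling: {1.0 / frame_time_s:.0f} fps")
[/code]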
There has been one very important exception to this, and that is the parameter that is measured as "fillrate" in the 3DMark tests. I did some lightweight multivariate statistical analysis three years ago which indicated that, out of all the synthetic tests 3DMark used, the fillrate numbers were the only ones that correlated with game performance at all. The rest were deep in the noise. The domination of fillrate for games was also the reason that all those pages that reviewers filled with performance graphs looked remarkably similar - they were effectively showing fillrate graphs over and over and over again. They might as well have run Q3 at high resolution (to measure effective fillrates) and low resolution (to be able to factor out the host system performance) and been done with it.
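For what it's worth, the analysis itself doesn't have to be anything fancy. A minimal sketch of the idea (card names and numbers invented here, not the original data): correlate each subtest score with measured game frame rates across a set of cards and see which subtests carry any predictive signal.

[code]
# Lightweight sketch: correlate each synthetic subtest score with measured game
# frame rates across a set of cards and see which subtests carry any predictive
# signal. All numbers are invented for illustration.
import numpy as np

# scores across five hypothetical cards, in the same card order for every series
synthetic_tests = {
    "fillrate": np.array([ 800, 1100, 1500, 1900, 2400]),
    "polygons": np.array([  30,   28,   55,   40,   70]),
    "dot3":     np.array([  90,   95,  140,  110,  200]),
}
game_fps = np.array([45, 60, 82, 101, 128])    # measured game frame rates, same cards

for name, scores in synthetic_tests.items():
    r = np.corrcoef(scores, game_fps)[0, 1]    # Pearson correlation with game performance
    print(f"{name}: r = {r:.2f}")
[/code]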
The 3DMark score does not do this, and the specific tests that 3DMark does offer are pretty narrow in scope, and thus not very useful.
They never were useful, apart from fillrate. Interesting - yes, useful - no. They were always there to satisfy curiosity, basically.
Again, subsystem measurement requires a subsequent synthesis step in order to be able to make application predictions. Which is nigh on impossible to do if you don't have the code in front of your eyes, and not particularly trivial even then.
Benchmarking, when done for utilitarian purposes rather than to alleviate boredom, is all about prediction. Judging by what little data we have available to us, this task will get somewhat more difficult in the future because fillrate might not be the one overwhelming factor. This does not in any way imply that synthetic benchmarking is worse off than application benchmarking. Look at the data points in that ugly little benchmarking paper that came out of ATI - does it look as if any one application would be a particularly good descriptor of the DX9 group?

The approach taken by both SPEC and FutureMark - to include tests that are representative of different application behaviour, and then weight the results into a final score while still requiring that individual subresults be published - is very reasonable. You might quibble with the weighting scheme. You might argue that the precision of the predictions it makes possible is not good enough for the problem you want to address (this may well be true!). But the approach is sound, and is as good as anyone has been able to come up with in the area of general purpose benchmarking tools so far.
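A rough sketch of that approach, with hypothetical test names and a made-up weighting scheme (not FutureMark's actual formula): publish every subresult, and derive the headline number from a declared weighting of them.

[code]
# Hypothetical weighting scheme for turning representative test results into one
# headline score while keeping every subresult visible. Test names and weights
# are made up; this is not FutureMark's actual formula.
subresults = {                  # fps in each representative scene test
    "scene_low_detail":   120.0,
    "scene_high_detail":   48.0,
    "scene_shader_heavy":  30.0,
}
weights = {                     # the part you are free to quibble with
    "scene_low_detail":   0.2,
    "scene_high_detail":  0.4,
    "scene_shader_heavy": 0.4,
}

final_score = sum(subresults[name] * weights[name] for name in subresults) * 10

for name, fps in subresults.items():
    print(f"{name}: {fps:.1f} fps")            # individual subresults still published
print(f"weighted final score: {final_score:.0f}")
[/code]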
PS. Going away over Christmas in less than an hour, or so my wife tells me.
Be well, all of you.