Benchmark Life Cycles

Chris

Newcomer
Hello All,

I believe that what we are all witnessing is part of the natural benchmark life cycle.

Marketing and P/R departments exist to control the information available to consumers. Any entity or process that provides the consumer with information outside this control is considered dangerous. If this information reflects poorly on a particular company's products, that company's P/R, Marketing, Developer Relations, and Legal staff _will_ take action.

This leads to a repeatable and predictable life cycle that benchmarks experience.

It goes something like this:


*Birth*

Entity A attempts to create a benchmark to measure competing product performance in an unbiased manner.


*Childhood*

Benchmark becomes widely used.
Begins to make a difference in consumer choice (i.e. $).
The $'s bring the attention of the product manufacturers.


*Adolescence*

Benchmark makes measurable difference in consumer choice.
The winners exalt the results.
The losers (and, to a lesser degree, the winners) have four choices:

(a) improve product performance (most companies are already very motivated to do this)
(b) improve _perceived_ product performance (optimize for the test)
(c) alter testing methodologies to their advantage (cajole, lobby, threaten, beta test...)
(d) discredit the test. (Nvidia has done an exemplary job of this)


*Adulthood*

Benchmark still makes measurable difference in consumer choice.
Consumers begin to hear about competitors' efforts at b, c, & d.
Seeds of doubt in the benchmark's accuracy as a product comparison tool are sown.

Failing to accomplish b and/or c, the marketing arm of almost any company _will_ attempt (d).
It is their job to do so.
This is not about truth or morality.
It is about the livelihood of a company and its employees.
Marketing and P/R departments do not choose wording based upon truth.
They choose wording based upon 'defensibility'. As in, 'Will this stand up in court?'


*Old Age*

As competitors' efforts towards b, c, or d become more widely known, the relevance of the benchmark is increasingly questioned by the consumer.
The makers of the benchmark alter the test to reduce the effect of b, c, and d but the damage is done.
Benchmark loses effectiveness in swaying consumer choice.


*Death*

We're not there yet. DirectX benchmarks have a sort of phoenix cycle built in, as each new DX version is released.


If you look back at benchmarking in general (not just video) you will see a similar pattern in each benchmark's life cycle.

I think that FM has done a great job in surviving and keeping their tests relevant for as long as they have. But the fact is that they have created a contest with winners and losers. The losers will take whatever steps they feel are necessary to prevent the benchmark from affecting their sales. These steps usually bring an end to the effectiveness of the benchmark for product comparison.

This particular cycle feels as though it is a bit accelerated, but it appears to be the same fate that most benchmark methodologies eventually meet.

Regards, Chris.
 
Benchmarks will never die, despite what anyone thinks. If all the benchmarks on the market become unreliable, then someone will always come along and try to make a reliable one.

I want a company that has these traits to create a benchmark:

Honest.

Never backs down, whether a lawsuit or a bribe is being forced down their throats. (This has NOTHING to do with the NV and FM fiasco, if anyone believes I am relating to that issue)

Makes it free and open source.

Unfortunately I don't think this will ever happen. ;)

Actually, I am making a benchmark and it will soon be released. You guys will love it; it will replace 3DMark altogether and it will be 100% freeware.
 
Hello K.I.L.E.R.

I did not mean to imply that benchmarking will die. I was just pointing out that individual benchmarks have life cycles.

I (and many others I'm sure) welcome your efforts. I will be watching the birth of your benchmark with great interest.

Regards, Chris.
 
Chris said:
Hello K.I.L.E.R.

I did not mean to imply that benchmarking will die. I was just pointing out that individual benchmarks have life cycles.

I (and many others I'm sure) welcome your efforts. I will be watching the birth of your benchmark with great interest.

Regards, Chris.

Just to let you know I am doing a parody benchmark and not a serious one. :)

The reason I am doing it is for laughs. I am really going to get a kick out of this and I am sure some others will.

I am going to make it humorous.
 
I have to say I disagree with this "Evolution of a benchmark" on just about all levels. It's incorrect when applied to anything resembling a model that we have, no matter what one chooses to be a model.

If you take time-honored, traditional benchmarks like Dhrystone, Whetstone, etc., these fall into your "Birth" stage, but then enter a semi-permanent "Lifecycle" stage and continue to be used for what they bring.

If we try to apply your cycle to, say, 3DMark, it fails right from the onset with the "Birth" stage.

From your evolution, Birth is dictated as:
Entity A attempts to create a benchmark to measure competing product performance in an unbiased manner.

3DMark fails from the onset as it was not designed to measure competing products and never did so in an unbiased manner.

From 3DMark's onset, the early versions served only one purpose: to champion a new feature that only a single IHV supported at the time. Early tests championed HW T&L and created massive bias in favor of HW T&L. By 3DMark2001, they had enough clout to champion the GF3 by including shader tests and docking the scores of any other "competing product" that didn't have this feature set.

So to declare a benchmark's birth goal to be "measure competing product performance in an unbiased manner" would suggest a benchmark that creates a baseline of features against which competing products can then be compared. So this evolution doesn't apply to this benchmark either.
 
Hello Sharkfood,

Sharkfood said:
I have to say I disagree with this "Evolution of a benchmark" on just about all levels. It's incorrect when applied to anything resembling a model that we have, no matter what one chooses to be a model.

Wow, I had to work hard at being that completely wrong... No, really. The all-encompassing nature of this statement leaves me literally nowhere to turn.

If you take time-honored, traditional benchmarks like Dhrystone, Whetstone, etc., these fall into your "Birth" stage, but then enter a semi-permanent "Lifecycle" stage and continue to be used for what they bring.

I am very familiar with the 'time-honored' benchmarks. How many versions of each exist due to specific compiler optimizations? Do consumers use them any longer when choosing equipment or compilers?

Try this: "Single synthetic benchmarks like Drystone, Whetstone have been quite common for small machines for a decade [8]. But due to increasing cache sizes and better optimizing compilers, which are able to detect and eliminate unnecessary code, they became obsolete."

more detail here

This assessment was made on behalf of a "new" benchmark called LINPACK. Of course, a little while after it became widely used, we have:

"This was not entirely successful, as specific "optimizers" were created to make LINPACK run faster on some CPU architectures"

more detail here

Then came SPEC. Oh, wait... sorry.

"Recently, a California Court ordered a major microprocessor manufacturer to pay back $50 for each processor sold of a given speed and model, because the manufacturer had distorted SPEC results with a modified version of gcc, and used such results in its advertisements."

more detail here
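
To make the dead-code point from the Dhrystone/Whetstone quote above concrete, here is a rough sketch in C. The kernel, iteration count, and behavior described are invented for illustration and are not taken from any real benchmark suite; the point is simply that a synthetic loop whose result is never used can be deleted outright by an optimizing compiler, leaving the timer measuring nothing:

```c
/* Hypothetical synthetic kernel in the Dhrystone/Whetstone spirit.
   This only illustrates the quoted point: an optimizing compiler can
   prove the loop's result is never used and remove it, so the "work"
   being timed may never actually happen. */
#include <stdio.h>
#include <time.h>

static void synthetic_kernel(long iterations)
{
    double x = 1.0001;
    for (long i = 0; i < iterations; i++)
        x = x * 1.0000001 + 0.5;   /* result never escapes this function */
    /* At -O2, GCC or Clang may eliminate this loop entirely as dead code. */
}

int main(void)
{
    clock_t start = clock();
    synthetic_kernel(100000000L);
    clock_t end = clock();

    printf("elapsed: %.3f s\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}
```

Build it without optimization and then with -O2: the reported time can collapse to near zero, not because the hardware got faster, but because the compiler removed the work. That is exactly why these single synthetic loops stopped being trusted for product comparison.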

If we try to apply your cycle to, say, 3DMark, it fails right from the onset with the "Birth" stage.

From your evolution, Birth is dictated as:
Entity A attempts to create a benchmark to measure competing product performance in an unbiased manner.

3DMark fails from the onset as it was not designed to measure competing products and never did so in an unbiased manner.

OK... feel free to leave the 'unbiased' part out. Clearly from your view, you have _never_ considered 3DMark to actually be a benchmark. Fine by me.

My point was (attempt at restatement):

Benchmarks, or anything claiming to be a benchmark (better?) create winners and losers in the marketplace. The losers invariably do things that, over time, make that particular benchmark untrustworthy as a product differentiation tool. People like to compare performance so this creates a continuous need for benchmark creation as the old benchmarks get tainted. Thus, benchmarks appear to have a lifespan.

It is cool to realize that even if the model I choose is of an alternate universe where things work exactly as I have described them...

no matter what one chooses to be a model

I would still be wrong :)

Regards, Chris.
 
Well, I rather liked the principle of the analogy, and think it offers a useful perspective. In this case, I care less about absolute correlation to final circumstances than I do about the applicability to trends.
 
I am very familiar with the 'time-honored' benchmarks. How many versions of each exist due to specific compiler optimizations? Do consumers use them any longer when choosing equipment or compilers?

Of course LINPACK (and SPEC, for that matter) would choose to declare Dhrystone/Whetstone "obsolete" in their attempt to rise to power - they are still used at all levels today just due to their fairness/robustness. Sure, open-source benchmarks can be abused and compiler optimizations can lean results a bit, but there is still only a single version of each in use today.

One of the first things we do when we book time at benchmark centers is to load up and run some basics. The folks in Atlanta looked at us a little strangely as we ran these on a Superdome, but it gave us some sort of baseline to compare. Of course they had many commercial suggestions for us to try - and interestingly enough, they were almost mutually exclusive from the ones recommended to us by the folks at Sun a few days prior. Nothing besides traditional/open-source benchmarks is what we consider trustworthy. When benchmark time is measured in tens of thousands of dollars per day, having skewed results isn't acceptable... and is of even less use when trying to tune a machine.

This is much better than purely marketing-funded ventures that are total black boxes. This is what usually splits the "commercial" vs. "traditional" border, and it is why neither falls into your evolution model.

Even a while back with that database benchmark, someone got busted: perfectly valid queries run in record times were invalidated because the benchmark didn't notice it was running on a database completely incapable of updates. This goes to show how useless black-box benchmarks truly are.

Benchmarks, or anything claiming to be a benchmark (better?) create winners and losers in the marketplace. The losers invariably do things that, over time, make that particular benchmark untrustworthy as a product differentiation tool. People like to compare performance so this creates a continuous need for benchmark creation as the old benchmarks get tainted. Thus, benchmarks appear to have a lifespan.

I agree with this basis to a degree, with the exception that the onset "winners" and "losers" are oftentimes tainted right at conception. It can oftentimes be the "winner" that did something right at the start to make the benchmark untrustworthy, and this seems to be the case with so many custom-tailored benchmarks, which would better be labeled as commercial benchmark ventures.

On the whole, I don't believe it's possible to even produce a trustworthy benchmark as a commercial enterprise. There is too much legal liability in being unbiased, and this removes the usefulness from such a tool at the point of birth. I've seen it at all levels - from cheapo consumer video cards all the way up to multi-million dollar enterprise servers, which I benchmark two to three times a year. The more dollar signs behind the results, the more you tend to shy away from any form of commercial benchmark, and the more you tend to lean towards traditional and open-source models... there is an obvious reason for this.
 
Perhaps FutureMark should have approached Dell for investment and support. Dell can represent the voice of the consumers. You can lie to me, you can lie to FutureMark - but lie to Dell and its tens of millions of customers and even NVidia would be in a world of hurt.

That to me would seem, on the surface, to be a sensible partnership. Dell acquires the expertise to independently showcase and offer the best - to spot flaws and demand they be corrected. FutureMark gets a steady cash flow and protection. ATi and Nvidia get to compete on their individual merits.

Could you see FutureMark folding to implied threats from NVidia if Dell was solidly in their corner?
 
Sharkfood said:
From 3DMark's onset, the early versions served only one purpose: to champion a new feature that only a single IHV supported at the time. Early tests championed HW T&L and created massive bias in favor of HW T&L. By 3DMark2001, they had enough clout to champion the GF3 by including shader tests and docking the scores of any other "competing product" that didn't have this feature set.
So your definition means that a benchmark is invalid when only one IHV supports a certain feature at the time of the benchmark's release even though that feature is one of the major new things in a new API version? :rolleyes:

3DMark has always been one of the first benchmarks to support new DX features. When you're in that kind of business, it's only natural that not every card out at that moment supports all of the features in the benchmark.

If you want all cards to support all features of a benchmark, you'll have to use old tech, and that in turn means that your benchmark won't have any importance or relevance. (AMD's nBench, or whatever it was called, comes to mind... yes, I know it was a CPU bench, but those are benchmarks too.)
 
So your definition means that a benchmark is invalid when only one IHV supports a certain feature at the time of the benchmark's release even though that feature is one of the major new things in a new API version?

There is nothing wrong with measuring performance for a particular feature that only a single IHV supports... BUT... anyone with even the smallest shred of logic about them should then instantly realize that the results can no longer possibly be used to "compare" with other IHVs - they are incomparable.

After all, how can you compare A with B? You need to compare A with A, otherwise you are comparing/measuring apples vs. oranges. 3DMark has done this since day one and continues to do so. It tabulates a final score, which is used to "compare" various IHVs' hardware with totally incomparable results. But the results did serve the goal as stated: to champion a particular IHV, regardless of its *comparable* real-world performance in all other things.

Such a caveat can be handled a number of ways. The most common is a "no-show" for the other IHVs, or an "NA" score, which in turn excludes it from tabulation in an overall score (or, in this case, from the "3DMarks" total).

So it's pretty simple, really. It's a total breach of logic to suggest a benchmark should measure things that are only possible on a single IHV, then fall back and say the results are used to compare performance across multiple IHVs. It simply can't be done, since you can't quantify A = B for run vs. no run. Alternatively, you can adapt your benchmark to produce more exhaustive results: create the scoring from a baseline of shared functionality so that performance comparisons can be derived from those tests, PLUS measure any newer/isolated features but leave them out of the "comparison" score.
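
As a rough sketch of the split described above (this is not FutureMark's actual scoring; the test names, numbers, and weights are made up purely for illustration), the idea is that only tests every card can run feed the comparison total, while a vendor-specific feature test is reported on its own instead of being folded into the score:

```c
/* Rough sketch of "baseline score + separately reported feature tests".
   Hypothetical tests and weights; not any real benchmark's formula. */
#include <stdio.h>

struct test_result {
    const char *name;
    int  in_shared_baseline;  /* 1 = every card under comparison can run it */
    double fps;               /* measured result for this card */
    double weight;
};

/* Only shared-baseline tests contribute to the comparable score. */
static double comparison_score(const struct test_result *r, int n)
{
    double score = 0.0;
    for (int i = 0; i < n; i++)
        if (r[i].in_shared_baseline)
            score += r[i].fps * r[i].weight;
    return score;
}

int main(void)
{
    struct test_result card[] = {
        { "fill_rate",    1, 850.0, 0.5 },
        { "hw_tnl_scene", 1, 120.0, 1.0 },
        { "shader_test",  0,  60.0, 1.0 },  /* only one IHV supports this */
    };
    int n = (int)(sizeof card / sizeof card[0]);

    printf("comparison score: %.1f\n", comparison_score(card, n));
    for (int i = 0; i < n; i++)
        if (!card[i].in_shared_baseline)
            printf("%s: %.1f (reported separately, not in the score)\n",
                   card[i].name, card[i].fps);
    return 0;
}
```

A card that can't run the isolated test would simply show "NA" for it, and its comparison score is unaffected either way.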
 
Hello All,

Sharkfood: I do not find myself disagreeing with much of anything in your subsequent posts. I find that even somewhat tainted benchmarks are a valuable tool for use in system tuning. My focus was on their use as a product differentiation mechanism.

I started the topic largely to point out two things:

(1) Benchmarks come and go...
(2) Corporate behavior is bound differently than personal behavior.

Many have expressed moral outrage (even on our rather sedate boards) at how events have transpired. This is largely due to attaching personally accepted standards of behavior to those of a corporation.

Reverend was dead-on when he suggested to FM that they simply describe the optimizations in question rather than label them as 'cheats'. Many have derided FM for not 'standing their ground'. Well... in my view they stood their ground when they used the term 'cheat' publicly. That public use of the word is a good candidate for the opening Nvidia's legal dept needed to present FM with a choice: to be or not to be.

FM used personal standards of conduct when choosing their wording and deeds. Nvidia acted like a corporation. Courts (try to) define the bounds of corporate behavior, not morality.

I have always been impressed at the courage FM has shown in trying to build a business model around benchmarking. Certainly there are other successful models: Consumer Reports, J.D. Power, etc.

But those models benchmark a wide variety of products in ways where optimization by manufacturers is difficult. FM's world is a much tougher place to exist in due to its narrowness of focus.

Regards, Chris.
 