Benchmark code of ethics

demalion said:
Hyp-X said:
it puts quite different workloads on different cards.

For now, could you clarify this? What do you mean by a different workload for different cards here? It is the ability to execute the workload that differs between drivers and cards.

In some tests a GF4 card has to transform a vertex twice as many times as an R9700.
And it isn't that it has to do this because there is an API call that happens to be implemented more or less efficiently on different cards.
No, a vertex is transformed as many times as is specified through the API, and that is generally up to the application.
The GF4 transforms a vertex twice as many times because the application calls for twice as much transformation.

Is this the same workload executed differently?
If it is, please specify what you mean by workload.
 
Hyp-X said:
What about hardware that came out half a year after the game?

What could they do?
They either ask the game developer to create a patch with a new optimization, or they try to do that in the driver.

FM will refuse the former, and then disallow the latter.
But a game is finished once it's finished. Supporting new hardware is not something developers want to go back and add to their last game; they'll be working on their next game by now. FM is the same - the new features and possible optimisations can be considered for the next iteration of the product. A version of 3DMark is designed to test the feature set of that generation. New features will be tested by the next release.
 
Myrmecophagavir said:
Hyp-X said:
What about hardware that came out half a year after the game?

What could they do?
They either ask the game developer to create a patch with a new optimization, or they try to do that in the driver.

FM will refuse the former, and then disallow the latter.
But a game is finished once it's finished.

You wish. (Actually you do not.)

Supporting new hardware is not something developers want to go back and add to their last game; they'll be working on their next game by now.

I said either the developer or the IHV.
If the title is important, there's a chance one of them will do that.

FM is the same - the new features and possible optimisations can be considered for the next iteration of the product. A version of 3DMark is designed to test the feature set of that generation. New features will be tested by the next release.

But what if a current-generation card comes out, but it requires optimization?
Does that mean the card is flawed, and that it is therefore rightfully penalized?
Is the P4 flawed because it needs existing code re-optimized to run as fast as possible, unlike the Athlon, which does not?

I'm not talking about next generation cards or new features. I'm talking about current (and in some cases past) generation cards and the same feature set 3dmark is testing.
 
Hyp-X said:
demalion said:
Hyp-X said:
it puts quite different workloads on different cards.

For now, could you clarify this? What do you mean by a different workload for different cards here? It is the ability to execute the workload that differs between drivers and cards.

In some tests a GF4 card has to transform a vertex twice as many times as an R9700.

That's because of the capabilities of the card within the DX API, Hyp-X. In fact, it is because of the capabilities of the card in any API.

And it isn't that it has to do this because there is an API call that happens to be implemented more or less efficiently on different cards.

Yes, there is: PS 1.3 versus PS 1.4. The GF 4 doesn't have the capability to perform at the level of PS 1.4.

No, a vertex is transformed as many times as is specified through the API, and that is generally up to the application.

The GF 4 is in actuality only capable of accomplishing the workload through multiple transformations. Including that fallback only prevents exclusion, and recognizes the differing levels of API support offered in DX for the cards. The problem isn't that it is a GF 4; it's that the GF 4 only supports PS 1.3.

The GF4 transforms a vertex twice as many times because the application calls for twice as much transformation.

Because that is the way the workload can be accomplished with PS 1.3.

Is this the same workload executed differently?
If it is, please specify what you mean by workload.

If the above wasn't clear: using an API to deliver the same scene and effects to the limits of the capability of the hardware. The API exposes multiple shader levels for pixel shading, and it is the limitations of the GF 4 that are coming into play. There are no exclusion games played by depending on utilizing the dynamic range of the 8500 (which is a real advantage for the 8500 that isn't represented), so I don't understand why you are attacking the idea of the GF 4 doing more transformation when that reflects an actual limit of its abilities within the DX API.

Are you saying nVidia should have the right to tell Futuremark to exclude API usage workloads that their cards don't handle as efficiently as other, even older, cards do?
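
To put a number on the workload point, here's a rough sketch (the pass counts and vertex count are made-up assumptions for illustration, not anything measured from 3DMark03) of why a PS 1.1/1.3 multipass fallback ends up re-transforming geometry that a single PS 1.4 pass handles once:

[code]
# Illustrative sketch only: the pass counts and vertex count below are
# made-up assumptions, not measurements from 3DMark03.

def vertex_transforms(vertex_count: int, passes: int) -> int:
    """Every rendering pass re-submits (and re-transforms) the geometry."""
    return vertex_count * passes

MESH_VERTICES = 100_000  # hypothetical per-frame geometry

# Assumed pass counts: a PS 1.4-capable card collapses the per-pixel
# work into one pass, while a PS 1.1/1.3 card needs two passes.
PASSES = {"ps_1_4_path": 1, "ps_1_1_fallback": 2}

for path, passes in PASSES.items():
    total = vertex_transforms(MESH_VERTICES, passes)
    print(f"{path}: {passes} pass(es) -> {total:,} vertex transforms per frame")
[/code]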
 
To take some thoughts further:
Much of the "cheating" issue comes down to "trust". Whom do you trust: nVidia, 3DMark, or the reviewer who quotes benchmark numbers? Let's speculate a little...
Imagine a body called "OpenGL hardware quality labs" or something along those lines.
A body that does hardware/driver testing and puts a stamp of approval on the drivers and the hardware that goes with them. They would first test general compliance with the OpenGL standard. Then they would test that the drivers are doing what drivers are supposed to do: translate software calls to hardware and do nothing else.
How could this be accomplished? Run lots of different testing suites, comparing the output to a reference (even an analog monitor signal can be recorded and re-digitized for comparison with adequate hardware; the comparison algorithm just has to be aware of possible degradation of quality due to the analog signal).
Testing suites could come from any current benchmark or game developer. The lab would just take care to always run the most thorough testing suite.
But they should focus only on compliance with the standard, not on benchmarking.
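
As a rough sketch of the kind of tolerance-aware comparison such a lab might run (the threshold and the synthetic frames standing in for captured and reference output are my own assumptions, not any existing harness):

[code]
# Hypothetical sketch of a tolerance-aware comparison against a reference
# image; a real harness would load a captured frame and a reference
# rasterizer's output, so synthetic frames stand in for them here.
import numpy as np

def psnr(reference: np.ndarray, captured: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two 8-bit RGB frames."""
    diff = reference.astype(np.float64) - captured.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def passes_compliance(reference, captured, min_psnr_db=40.0):
    # The threshold absorbs benign differences (analog capture noise,
    # LSB rounding) while still flagging wholesale image changes.
    return psnr(reference, captured) >= min_psnr_db

ref = np.random.randint(0, 256, (768, 1024, 3), dtype=np.uint8)
noisy = np.clip(ref.astype(int) + np.random.randint(-2, 3, ref.shape), 0, 255).astype(np.uint8)
print(passes_compliance(ref, noisy))      # tiny noise -> passes
print(passes_compliance(ref, 255 - ref))  # wholesale change -> fails
[/code]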

Such an organization could theoretically be formed by current hardware journalists/benchmark developers working in cooperation, supported by the ARB itself.

That's in an ideal world :p
 
What's the difference between an app recognizing a card (one assumes this is accomplished by analyzing the information provided by the drivers) and the drivers recognizing the app? Aren't both done for the purpose of changing the resulting experience for the gamer compared to what they would have experienced otherwise?

So we're putting the devs on a higher plane morally by allowing them to do card recognition? I suspect there are very valid reasons to allow both to do so, while recognizing that the ability to abuse the capability is inherent. I still think a useful code of ethics can be achieved, but it will take greater granularity than "don't detect the x".
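
As a hedged illustration of the distinction (the adapter-query helper and the values it returns are hypothetical, not a real API), an application can key its render path off the capabilities the driver reports, or off the specific vendor/device it recognizes:

[code]
# Hypothetical sketch; get_adapter_info() stands in for whatever the real
# API query would be -- it is not an actual library call.

def get_adapter_info():
    # Made-up example values (0x10DE is nVidia's PCI vendor ID).
    return {"vendor_id": 0x10DE, "max_pixel_shader": (1, 3)}

def choose_path_by_caps(info):
    # Capability-based selection: any card reporting PS 1.4 support gets
    # the single-pass path, regardless of who made it.
    return "ps_1_4_single_pass" if info["max_pixel_shader"] >= (1, 4) else "ps_1_1_multipass"

def choose_path_by_device(info):
    # Device-based selection: behaviour keyed to a recognized vendor --
    # the kind of recognition whose ethics this thread is debating.
    return "vendor_tuned_path" if info["vendor_id"] == 0x10DE else "generic_path"

info = get_adapter_info()
print(choose_path_by_caps(info), choose_path_by_device(info))
[/code]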

For getting the IHVs to sign up, it seems to me that a substantial subset of the largest webmasters could make this happen by prominently displaying on every review the IHV's status relative to this code of ethics. Giving it the right name will add to the pressure. The PR departments will be going nuts if their company is constantly being identified as "Has REFUSED to sign the NO BENCHMARK CHEATING Code of Ethics". Maybe an asterisk on every single benchmark result with the note at the bottom of each graph. Heh.
 
Hyp-X said:
Is it that it doesn't reflect their concerns because they quit the beta testing,
- or -
that they quit the beta testing because FM didn't want to "reflect their concerns"?

A bit of both, perhaps. Frankly, given how late in the process Nvidia quit, I doubt 3dMark03 would have significantly different performance characteristics (on Nvidia hardware or otherwise) had they stayed in. But I don't have any details, so I don't know. Since Nvidia has been implying heavily that this is the case (recently going so far as to, in what is apparently going to be their only response to this whole scandal, say "we don't know what they did, but it looks like they have intentionally tried to create a scenario that makes our products look bad"), it bears mentioning that they only have themselves to blame for quitting.

Did they quit in the first place because their concerns weren't being met? Well, yes, they did--although obviously the concerns they now say are "intentionally creating a scenario that makes our products look bad" (the changes from build 320 to build 330) are not the concerns they had when they decided to quit in December. Between the whine paper they sent to all the sites (except B3D of course), and the competitive landscape at the time 3dMark03 was released, I think it's relatively clear what Nvidia's concerns were that weren't being met. At the time 3dMark03 was released, Nvidia was trying to position its GF4Ti series against ATI's 9500/9700 series, and its GF4MX series against ATI's 9000/8500 series. 3dMark03, being a forward-oriented benchmark, put these comparisons in a pretty dim light. As any genuinely forward-oriented benchmark would have.

As for Nvidia's specific complaints, I spent a lot of time analyzing them back then, and came to the conclusion that they were generally bunk, with the possible exception of calculating silhouette edges in GT2 and GT3 on-GPU with a VS 1.1 shader instead of on-CPU (because I didn't understand the trade-offs well enough to comment, and because the posters at opengl.org, who did, seemed to agree that this was suspicious). Even this complaint (and the related skinning complaint), however, has to answer to the fact that, despite Nvidia's explicit claims to the contrary, GT2 and GT3 are not by any stretch of the imagination vertex shader limited, on GF4 or any other card.
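
For anyone who hasn't looked at that particular complaint, here's a purely illustrative sketch of the CPU-side approach (the toy mesh and light direction are made up; this is not FM's or anyone's actual code): an edge is a silhouette edge when exactly one of its two adjacent faces faces the light, and those edges are what get extruded into the stencil shadow volume.

[code]
# Illustrative sketch of CPU-side silhouette-edge extraction for stencil
# shadows; the toy mesh and light direction are made-up example data.
import numpy as np

def face_normal(verts, tri):
    a, b, c = (verts[i] for i in tri)
    return np.cross(b - a, c - a)

def silhouette_edges(verts, tris, to_light):
    """Edges shared by one light-facing and one back-facing triangle."""
    facing = [np.dot(face_normal(verts, t), to_light) > 0 for t in tris]
    edge_faces = {}
    for fi, tri in enumerate(tris):
        for e in ((tri[0], tri[1]), (tri[1], tri[2]), (tri[2], tri[0])):
            edge_faces.setdefault(tuple(sorted(e)), []).append(fi)
    return [e for e, fs in edge_faces.items()
            if len(fs) == 2 and facing[fs[0]] != facing[fs[1]]]

# Toy tetrahedron (outward winding) and a direction toward the light:
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
tris = [(0, 2, 1), (0, 1, 3), (0, 3, 2), (1, 2, 3)]
print(silhouette_edges(verts, tris, to_light=np.array([0.3, 0.2, 1.0])))
# -> the three edges of the single light-facing triangle
[/code]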

Here's one of my posts from back then discussing most of the issues (written in the context of Wavey's article on the subject). As you can see, I found a number of the choices FM made somewhat questionable, and generally would have come up with a somewhat different result if it were up to me to create a benchmark along their stated goals. Reading it over today, I'd say I agree pretty wholeheartedly with what I said back then. The only change I might make is that back then I was more on the stencil shadowing bandwagon whereas now I realize that near-future games are likely to be using techniques such as shadow maps as well. So while I did say back then that having two tests to measure the same rendering technique seemed like a bad idea, now I would probably add the suggestion that one use shadow maps instead.

Hyp-X said:
I said:
As for issue d, Futuremark obviously makes a conscious decision to use a general rendering path, based around the DX9 spec, rather than use vendor-specific rendering paths.

The fun thing is that they have different rendering paths, but they say they are not vendor-specific paths.
Since they do this, they have no basis to justify those paths.

I assume you're referring to the fact that GT2 and 3 have a PS 1.4 rendering path and a multipass PS 1.1 fallback? If so, a couple of responses. First, both paths embody valid approaches from the point of view of the DX 8.1 spec, so they are in some way not vendor-specific but merely different-levels-of-the-spec-specific. Of course, it's obvious that FM would not have bothered to code the PS 1.1 fallback path if there were no hardware that could run PS 1.1 but not PS 1.4.

So does this completely demolish FM's claims to be hardware-neutral? Not really; it only demolishes the straw-man argument that they should be hardware-ignorant. Again, it all comes down to compromise. FM's attempts to be as fair as possible should not require it to ignore reality.

Making a benchmark like 3dMark requires a balance between fairness/vendor-neutrality and accuracy. Sometimes one side is compromised, sometimes the other. You seem to be saying that if FM cannot purely uphold one side or the other, the benchmark has no meaning or merit. That seems silly to me.

Hyp-X said:
So you agree that by disallowing equal-quality optimizations (I mean the things ATI did, and NOT what nVidia did), they become less accurate at predicting future game performance?
Cool

Sure, of course, particularly when it comes to high-profile games. When it comes to low-profile games, such a choice might make the benchmark more accurate. But giving up some accuracy for fairness does not mean that the benchmark has no accuracy whatsoever.

Hyp-X said:
I said:
Are you referring to me? I've certainly argued that 3dMark03 has some relevance in representing the performance characteristics of future games. I hope I've never claimed, though, that there will not be a systematic performance difference between it and those games, as if 3dMark03 were a 2004 PC game that traveled back in time to today.

I didn't say it was possible to write such a benchmark.
But some people think it is.
Not necessarily you.

I'm not sure some people really think this. I've never seen such a statement made by anyone too credible on any forum, nor by Futuremark themselves. I think you may be chasing a straw-man.

Hyp-X said:
The problem is not when you'll see future games give you different framerates (that's inevitable).
The problem is when you'll see future games having a different vendor bias.

And for this issue to be solved, it's NOT the IHVs but the ISVs who are the right ones to consult.
IHVs couldn't tell you what features will be used in future games, only perhaps what they'd like to be used.

Agreed 100%. This fits in with something I noticed when I was looking at the membership of the SPEC Consortium for my earlier post: it's chock full of members who are actual users of scientific computing simulations, like the national labs and so on. FM's beta program now has representation from the IHVs and the press. ISV representation would be a big plus, and could probably improve the benchmark a great deal.

Hyp-X said:
Again I say: if a mid-2004 PC game runs either 2x as fast or 0.5x as fast as 3dmark on most of the video cards of that time, then it's good.
It's not the absolute, it's the relative performance that matters.

Agreed again. And I'm not at all sure that the attempted vendor-neutrality of 3dMark03 doesn't produce some systematic performance bias as compared to actual results from the games it tries to model (as I wrote in my last reply). I do think that the fact that most games are written in DX, that DX is a relatively "narrow" API and each particular DX level "embodies" a particular level of hardware pretty closely, and that many games will resemble the 3dMark03 game tests in rendering style and workload, all mean that whatever systematic bias exists won't be so great as to make the results lose all predictive power.

Hyp-X said:
I said:
Do I seriously think 3dMark03 compromises its predictive abilities if FM refuses to allow special-case optimizations in IHV drivers because doing so fails to model this "last ditch, devrel didn't do enough to prevent a/b/c/d during game development" motivation for driver optimizations? No. Do you?

Well "devrel didn't do enough to prevent" is a one sided view.
What about hardware that came out half a year after the game?

What could they do?
They either ask the game developer to create a patch with a new optimization, or they try to do that in the driver.

FM will refuse the former, and then disallow the latter.

It'd be my guess that this issue is one in which the beta program works especially well. In fact I'd think this is one of the explicit purposes of having the IHVs play such a large role in the development of the benchmark: to make sure the final benchmark will be generally optimized not just for their current products but for their upcoming ones as well. (And it's one of the reasons for confidentiality rules on the beta program.)

3dMark03 tries to capture the performance of the DX9 PS/VS 2 generation of cards. Presumably the IHV participation in the beta program allows them to do so fairly. When a truly new generation of cards comes out, FM should release a new 3dMark benchmark. (Whether PS/VS 3 will qualify or we'll have to wait for DX10, who knows? Personally I'd hope PS/VS 3 does get met with a 3dMark04.)

Hyp-X said:
So there are two game tests with Doom3-like visuals, but with code paths that behave very differently from Doom3's. Will other developers do the same?

Which one of the tests represents techniques present in future RTS games?

Will every developer use stencil-based shadows instead of shadow buffering?

What general techniques and rendering styles are we talking about?

As I said in my old post linked above, I don't think 3dMark03 got it perfect. Two stencil shadow game tests seem silly; again, one test with stencils and one with maps/buffers seems like a great idea. Maybe they should have used an HLSL for GT4 (although in hindsight this argument, common at the time, seems IMO a little silly now, part of a hoopla over HLSLs that is still slightly premature). And I'm not sure GT1 really has a clear purpose in life; more of a "let's lump in a test with a DX7 fallback as a sop to GF4MX owners with a test of a left-out genre like flight simulation, and toss in some vertex shaders for kicks" sort of thing. (As for your RTS suggestion, it's interesting and would certainly provide a different sort of workload, although traditionally RTSs don't tend to be particularly stressful on the graphics end...although a graphically advanced "close-in" RTS like a spiffed-up WC3 certainly might.)

Is the selection of game tests in 3dMark03 perfectly representative of a cross-section of 3d performance-intensive 2H04 PC games? No. Is each of the individual tests somewhat representative of what many individual games in that category will be like? I'd say yes.

In the end I think our ideas about what 3dMark03 is really like are pretty similar. We both recognize that a benchmark like that is subject to an inherent tension between modelling the performance of (non-vendor-neutral) games accurately but itself remaining vendor-neutral.

The difference between our viewpoints lies in what we conclude from the existence of this tension. You seem to think that since such a benchmark will inevitably sacrifice something in terms of accuracy, we can't treat it as having anything accurate or useful at all to say about future game performance. I think that's too stringent, and creates an ideological purity where none is needed. You probably think I'm too wishy-washy, always admitting the contradictions of the approach and the flaws of 3dMark03 as it actually is, but never giving up on the idea that we can get useful information out of it anyways.

I think that we're having an interesting discussion. But in the end (not yet, necessarily) we might have to agree to disagree.
 
geo said:
What's the difference between an app recognizing a card (one assumes this is accomplished by analyzing the information provided by the drivers) and the drivers recognizing the app? Aren't both done for the purpose of changing the resulting experience for the gamer compared to what they would have experienced otherwise?

I don't think the driver vendors should be rewriting games or benchmarks. If they want to sit down with the developer and help them with optimizations for their card, that is fine.

Could you imagine if Microsoft designed NTFS to detect whether the Oracle DB was running on it and then replaced code in the Oracle DB to make it run faster? (Not likely to happen.)

Nvidia was able to gain 25% on 3DMark03 with no benefit to 3D applications or games at all. What is the point?
 
Don't know if you've read this, but it seems pertinent to me for consideration with regard to improving 3dmark's workload selection moving forward. Actually, it could be that such consideration was already represented; I'm not familiar with a discussion of it in that context.


On a more political note: I also wonder about a PS 3.0/VS 3.0 timetable for the next 3dmark in relation to nVidia, ATI, and Imagination Technologies.
 
rwolf said:
I don't think the driver vendors should be rewriting games or benchmarks. If they want to sit down with the developer and help them with optimizations for their card, that is fine.

Could you imagine if Microsoft designed NTFS to detect whether the Oracle DB was running on it and then replaced code in the Oracle DB to make it run faster? (Not likely to happen.)

Nvidia was able to gain 25% on 3DMark03 with no benefit to 3D applications or games at all. What is the point?

I would expect MS to sell it as the "Oracle Special Edition" at 5x the cost to large businesses who'd see that as a bargain to get better Oracle performance. Unfortunately, the competitive landscape between Oracle and MS would prevent it from happening anyway.

Nvidia was bad. No doubt about it. It's wrong to drive your car into a crowd of innocent bystanders, wiping them out. That doesn't mean there aren't proper usages for a car... or for detecting an app.
 
demalion said:
Don't know if you've read this, but it seems pertinent to me for consideration with regard to improving 3dmark's workload selection moving forward. Actually, it could be that such consideration was already represented; I'm not familiar with a discussion of it in that context.

An interesting link, worth keeping in mind for anyone designing a benchmark. To a great extent, 3dMark "works around" this problem by only encouraging tests to be run in a single configuration (1024x768, noAA/AF). This, um...certainly simplifies things, but it is of course yet another point where 3dMark diverges from the trade-offs made when people play (or benchmark) real games.

As for the idea of scaling non-resolution-dependent workload along with scaling the resolution, it's absolutely a great idea for a "gamer-oriented" website that only posts a few benchmark configurations, each meant to reflect realistic settings for gameplay. OTOH, keeping all other variables the same while varying resolution is what makes the realized-fillrate vs. resolution style of graph in B3D's reviews so damned useful. OTTH (on the third hand), it would be even better for B3D to do several such graphs, each one at different settings for all the resolution-independent stuff; or even better, if there were some objective way to quantify those other factors, make some graphs which keep resolution constant and use e.g. vertex count as the independent variable. (I just love making more work for other people. 8))
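
As a sketch of what a realized-fillrate-vs-resolution figure boils down to (the frame rates below are invented placeholders, not measurements from any review, and overdraw/AA samples are deliberately ignored):

[code]
# Sketch of the realized-fillrate-vs-resolution calculation; the frame
# rates are invented placeholders, and overdraw/AA samples are ignored.

MEASURED_FPS = {  # hypothetical results at otherwise fixed settings
    (1024, 768): 140.0,
    (1280, 1024): 95.0,
    (1600, 1200): 68.0,
}

for (w, h), fps in MEASURED_FPS.items():
    # Realized fillrate: pixels written to the screen per second.
    mpixels_per_s = w * h * fps / 1e6
    print(f"{w}x{h}: {fps:.0f} fps -> ~{mpixels_per_s:.0f} Mpixels/s realized")
[/code]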

Similar options would be great in a 3dMark-type benchmark, both for use in the hands of a great review site like B3D and to make it fun for us 3d nerds to play around with at home. (Hell, give us options like that and the Pro version might actually be worth something to me.) But it does seem to run counter to FM's view of 3dMark as a "simple" benchmark that can be run with one click and summarize its findings with one single number. :?
 
geo said:
I would expect MS to sell it as the "Oracle Special Edition" at 5x the cost to large businesses who'd see that as a bargain to get better Oracle performance. Unfortunately, the competitive landscape between Oracle and MS would prevent it from happening anyway.
yeah, but to be an accurate comparison, small flaws would have to occur each time the DB was accessed, leading to eventual data corruption.
 
Althornin said:
geo said:
I would expect MS to sell it as the "Oracle Special Edition" at 5x the cost to large businesses who'd see that as a bargain to get better Oracle performance. Unfortunately, the competitive landscape between Oracle and MS would prevent it from happening anyway.
yeah, but to be an accurate comparison, small flaws would have to occur each time the DB was accessed, leading to eventual data corruption.

Now that is funny... :LOL:
 
Althornin said:
geo said:
I would expect MS to sell it as the "Oracle Special Edition" at 5x the cost to large businesses who'd see that as a bargain to get better Oracle performance. Unfortunately, the competitive landscape between Oracle and MS would prevent it from happening anyway.
yeah, but to be an accurate comparison, small flaws would have to occur each time the DB was accessed, leading to eventual data corruption.

[Oracle DBA perusing documentation for "Oracle Special Edition"]: "What the hell's a 'rail', and why aren't we supposed to go off it??"
 