Benchmark code of ethics

rwolf

Rather than whine about all the "cheating" going on in benchmarks, we should create a benchmark code of ethics and ask the video card vendors to certify in writing, on their websites, which benchmarks their drivers treat according to that code.

e.g.

1. The driver will not detect the benchmark.
2. The driver will not detect code in the benchmark and replace it with "optimized" code.
3. The driver will not alter the code path of the benchmark.

etc. etc. etc.

Is this a good idea or do you think it's lame?
 
I think it's a good idea - *if* it's possible to make them agree to sign it.

Heck, maybe nVidia would be interested, because that would be *very* good for their reputation IMO. But then again, they might still be living in their dream world where nobody knows that they're cheating :rolleyes: (I seriously believe many nVidia employees, even big, influential people, were taken by surprise by this.)

Anyway, my point here is that IMO this would be good if somebody actually agrees to it and there aren't fifty different contracts flowing around, hehe.


Uttar
 
@rwolf: I would love to see that. I doubt we could get them to actually keep their word, though. So the benchmark would have to do what FM did and trivially change the code from time to time, to see who did/didn't detect it and optimize/cheat for it.

later,
epicstruggle
 
rwolf said:
1. The driver will not detect the benchmark.
2. The driver will not detect code in the benchmark and replace it with "optimized" code.
3. The driver will not alter the code path of the benchmark.

I don't think there is a problem with true optimizations, as long as they don't alter the result. (So if you have a better shader that gives the *exact* same results, then there is no problem with doing this.)
 
mat said:
I don't think there is a problem with true optimizations, as long as they don't alter the result.

So if an IHV's drivers detect (say) 3dMark03 and replace it with a lossless video file of the benchmark being rendered at, say, 200fps, that's not a problem? It doesn't alter "the result" (i.e. the final output to the screen) one bit...

(So if you have a better shader that gives the *exact* same results, then there is no problem with doing this.)

In general, there is nothing wrong with replacing one sequence of code with another that can mathematically be proven to give the exact same results. Indeed, this is exactly what optimizing compilers do.
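For what it's worth, here's a minimal C++ sketch of that kind of provably equivalent rewrite (the functions and the closed-form trick are mine, purely to illustrate what an optimizer is allowed to do on its own):

```cpp
#include <cstdio>

// Naive version: sums 1..n with a loop.
long sum_naive(long n) {
    long total = 0;
    for (long i = 1; i <= n; ++i)
        total += i;
    return total;
}

// Rewritten version: the closed form, provably identical for every
// non-negative n in range (ignoring overflow), just faster.
long sum_fast(long n) {
    return n * (n + 1) / 2;
}

int main() {
    // Same inputs, same outputs; only the speed differs.
    for (long n = 0; n <= 1000; ++n)
        if (sum_naive(n) != sum_fast(n))
            std::printf("mismatch at n=%ld\n", n);
    std::printf("outputs identical for all tested inputs\n");
    return 0;
}
```

Both functions return the same value for every input in range, so swapping one for the other changes nothing but speed.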

But here's the problem: by the spec, shaders are supposed to be compiled, at run-time, from some shader language (whether a low-level language like DirectX's PS1.x or PS2.0 or OpenGL's ARB_fragment_program or NV_fragment_program, or a high-level language like Cg or DirectX's HLSL) into machine code fit for execution on the graphics card. Thus any program which seeks to benchmark shader performance is inherently also, in part, a compiler benchmark.

By searching for two particular shaders and replacing them with hand-coded optimized code, ATI's drivers are circumventing that facet of the benchmark. That's cheating. If, on the other hand, the compiler was smart enough to recognize in the general-case when such an optimization was legal (i.e. would not change the output) and make the change, that would be great.

If you're confused as to where exactly to draw the line between optimization and cheat, maybe this will help:

DirectX is an API, which is to say, a set of functions. As with any API, there is an implicit contract between the programmer programming in the API and the software + hardware which carries out her instructions: namely, that if the programmer calls a function with particular inputs, the computer will output the proper result as defined in the API. Ideally, the actual way the software and hardware implement the API's functions is irrelevant to the programmer. From her perspective, the API is a black box: she feeds it inputs, it provides the proper outputs.

Now, 3dMark03 is a DirectX 9 benchmark. Literally. Its reason for being is to benchmark different implementations of the DirectX 9 API (more specifically, to benchmark them rendering real-time workloads which resemble the estimated visuals of a mid-2004 PC game). 3dM03 passes off a list of DX9 commands to a renderer (generally a graphics card), and times how long it takes for the renderer to return the proper output.

So, from this perspective, what's the difference between cheating and not? It all has to do with information. Any optimization which relies only on information contained within the stream of DX9 commands which make up 3dM03 is legal. (And with no foreknowledge of what that command stream will be, either.) Any optimization which relies on outside information, including information gained from knowing what commands are coming next, is cheating, as is any optimization which causes the renderer to output the wrong result.
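A toy C++ sketch of that distinction (everything here is invented; real drivers obviously don't pattern-match strings like this). The first routine uses only information found in whatever code it is handed, so it keeps working when the benchmark changes; the second keys off a fingerprint of one known benchmark shader, which is outside information, and a trivial code change (like Futuremark's build 320 to 330 tweak) makes it silently stop firing:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Toy stand-in for a shader: just a string of "instructions".

// Legal: a general rewrite that uses nothing but the code it is given.
// It applies to any shader containing the redundant pattern, including
// ones the driver author has never seen.
std::string optimize_general(std::string code) {
    const std::string redundant = "mul r0, r0, 1.0;";  // multiply by one
    for (size_t pos; (pos = code.find(redundant)) != std::string::npos; )
        code.erase(pos, redundant.size());              // provably a no-op
    return code;
}

// Cheating: the "optimization" fires only when the incoming code matches
// a fingerprint of one specific benchmark shader, and swaps in a
// hand-written replacement prepared with foreknowledge of that benchmark.
std::string optimize_by_detection(const std::string& code) {
    static const size_t benchmark_fingerprint =
        std::hash<std::string>{}("mul r0, r0, 1.0;add r1, r0, c0;");
    if (std::hash<std::string>{}(code) == benchmark_fingerprint)
        return "add r1, r0, c0;";   // hand-tuned replacement fires
    return code;                     // any other app gets nothing
}

int main() {
    std::string original = "mul r0, r0, 1.0;add r1, r0, c0;";
    std::string tweaked  = "mul r0, r0, 1.0;add r2, r0, c1;";  // trivially changed

    std::cout << optimize_general(original)      << "\n";  // simplified
    std::cout << optimize_general(tweaked)       << "\n";  // still simplified
    std::cout << optimize_by_detection(original) << "\n";  // replacement used
    std::cout << optimize_by_detection(tweaked)  << "\n";  // detection misses
}
```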

Coming back to the ATI example: if ATI's run-time shader compiler were able to make this optimization on the fly, based only on the characteristics of the shader code which make the optimization legal, that would be totally kosher. If, on the other hand, someone at ATI had to have foreknowledge of the fact that a particular shader would come up in the set of DX9 commands that is 3dMark03, and used that foreknowledge to instruct the driver to replace it with a hand-coded optimized version, that is cheating.

Now, is that the way it happens in a game? No. A game developer would likely optimize a shader by hand if they knew the shader compiler wasn't capable of making the optimization on its own. (Or, more likely, Nvidia/ATI's devrel people would make the optimization for them.) But a game benchmark--for example a UT03 benchmark--is not a DirectX benchmark (in the sense of benchmarking DirectX, in the way 3dMark does). It is...a UT03 benchmark.
 
For a game, a general optimization (one which works for the whole game, not just timedemos) might include information that is not deduced from the rendering commands, but comes from "knowing" the game.

It is possible that a game developer failed to fully optimize the game for one of the following reasons:
a.) The developer didn't read the optimization docs, or was unable to implement them correctly.
b.) The developer wrote the game when such optimizations weren't necessary or beneficial for the cards around at the time.
c.) The developer was working with an API, or a version of an API, which does not contain the features necessary to fully optimize the game.
d.) The developer was using only a general rendering path (not optimized for the given card), or was using the general path on that card.

If an IHV circumvents these application limitations without a quality loss, then from a user's perspective this is beneficial, and it isn't cheating.
(It is annoying for developers but that's a different question.)

It shouldn't be done, however, for a benchmark that "wasn't meant to represent gaming conditions".
For example, reordering the houses in VillageMark is clearly cheating.

The problem is with a certain benchmark which is sometimes referred to as providing info on how future games are expected to run, while at other times it is stated that it only gives the card a workload that represents nothing.

It is funny seeing people jump from one representation to the other and back depending on what they want to prove (or bash). People should make up their minds. It's either one or the other; it can't be both.

If it's a benchmark that is meant to simulate gaming conditions, then the optimizations above are legal, as they will be done for games as well (assuming one of the conditions a/b/c/d applies).

If it's a benchmark that is only meant to give the card a certain workload, then such optimizations are not legal, but then please stop saying such stupid things as that the benchmark has anything to do with either present or future games.
 
Dave H said:
mat said:
I don't think there is a problem with true optimizations, as long as they don't alter the result.

So if an IHV's drivers detect (say) 3dMark03 and replace it with a lossless video file of the benchmark being rendered at, say, 200fps, that's not a problem? It doesn't alter "the result" (i.e. the final output to the screen) one bit...

OK, that's too much ;)

BTW: a video is not really an option, since it's just one resolution, one type of FSAA, ...

In general, there is nothing wrong with replacing one sequence of code with another that can mathematically be proven to give the exact same results. Indeed, this is exactly what optimizing compilers do.

That's my point. If the drivers can optimize shaders, then I have no problem with that.

Replacing them with a hand-coded, hand-optimized shader just for a benchmark may not be a very... well, nice solution, but just look at the SPEC benchmark. The last time I read about it, the source code was fixed and everyone could choose their own compiler to optimize it as much as they want. (I don't know if that's still true.)

Thus any program which seeks to benchmark shader performance is inherently also in part a compiler benchmark.

I'm not familiar enough with 3D coding, but if all shaders are compiled at run time, then a compiler benchmark wouldn't be a bad idea.

By searching for two particular shaders and replacing them with hand-coded optimized code, ATI's drivers are circumventing that facet of the benchmark. That's cheating.

I'm not saying that this is the right way, but I don't consider replacing a shader with something equivalent a cheat (as long as the new shader gives the same result for all possible inputs).

If, on the other hand, the compiler was smart enough to recognize in the general-case when such an optimization was legal (i.e. would not change the output) and make the change, that would be great.

Sure, but the other thing is the second-best thing. (But of course I'd prefer optimizations for real game engines and not benchmarks.)
 
I think that's a great idea! The code that would cover it all could be very simple:

Thou shalt not create thy drivers so that they recognize any benchmark.

In fact, a smart driver developer could stamp his drivers with a kind of corporate seal which would signify to the end user that each driver set so sealed would be free of benchmark recognition. This would be something driver developers could do quite independently of Microsoft--and they'd have to be telling the truth, of course, since such a seal would invite everyone to test it vigorously...;)

Good idea.
 
Hyp-X said:
...
If it's a benchmark that is meant to simulate gaming conditions, then the optimizations above are legal, as they will be done for games as well (assuming one of the conditions a/b/c/d applies).

If it's a benchmark that is only meant to give the card a certain workload, then such optimizations are not legal, but then please stop saying such stupid things as that the benchmark has anything to do with either present or future games.

I disagree, and I think it is a matter of component isolation.

Components: workload for an API, drivers, hardware, developer relations.

Where we depart is that you are focusing on the benchmark being representative of games by being treated exactly as games are by IHVs. That's a bad goal for a benchmark that is trying to be representative, because devrel interaction with particular games can't be a successful control parameter at all (there is no set quality guideline for devrel's input, for starters, as they can propose lower-quality alternatives if it is acceptable to the developer...and whether it is acceptable or not will vary).

What 3dmark's goal seems to be is to be representative of the hardware and driver capabilities to handle a fixed workload for the DX API. Since games use a workload for an API, drivers, and hardware, isolating out developer relations doesn't make it "useless" for gamers to get an indication about future games (that use shaders in the DX API); it only removes the uncontrolled variable of devrel.

The developer (Futuremark) has already made the code with this goal in mind, and the devrel (the person who presumably hand-optimized the GT 4 shader code) is precluded from participating, because that factor is not universally reproducible. However, if the drivers can achieve the optimization (at least in the case of ATI, since it is output-identical) without foreknowledge, it is universally reproducible, and that focus is the key departure from a typical game.

It doesn't take "jumping from one representation to another" to hold that viewpoint. :-? All it requires is to disagree that 3dmark 03 is a "workload that represents nothing." It seems to represent atleast as much as any game's workload, and some specific thought was given to allow it to be more directly useful for comparison of hardware and drivers for the workload within one particular API.

I think our disagreement is most simply stated as a disagreement about the inclusion of the "developer relations" component in benchmarking. Devrel isn't omnipresent for every game...Futuremark is trying to remove the popularity of 3dmark 03 as a determining factor in its performance results, so it can be more successfully reproducible for games, and I only think that makes sense. :?:

Did I miss some part of this idea being addressed?

I think Dave H's comments reflect the same outlook, so replying to his post would also be illustrative for discussing disagreement with what I state.

Also, here is one place I go into more detail on my thoughts, and this can also be seen in my prior responses along these lines in this forum.
 
Ah hell, why don't they just have open-source drivers so we can tell for ourselves...

Oh wait. Yeah. That'll happen when George W. Bush does something smart...
 
Hyp-X said:
If it's a benchmark that is only meant to give the card a certain workload, then such optimizations are not legal, but then please stop saying such stupid things as that the benchmark has anything to do with either present or future games.

I know what you are saying, but if you want benchmarks to be like games, then they are no longer worth a damn as benchmarks, since they lose even the rationale that it is the performance of an actual game being measured.

A timedemo of quake3 won't tell you anything about jedi knight 2 performance, so any idea that game benchmarks using game engines are somehow better is false. The engines get heavily modified, and even were they not, just running with different data means that anything you concluded before is meaningless. Even demo1 results tell you nothing about demo3 results, and both are rather trivial runs on today's cards. So, a benchmark using a game is good only for that game and only for that scene being rendered in that game. Unless the scene is very well rounded, you are testing a corner of a corner of the possible work that the video cards are capable of. And naturally, each card is probably running different code, making the whole exercise pointless to begin with. Not a good benchmark at all. What's good about synthetic benchmarks like 3dmark03 is that all the features of a video card get tested, not just how much fillrate it has or how well it can multitexture. In that sense it is vastly superior at telling us how future games are going to perform than any game timedemo you care to name.

Video card benchmarks have to do the same work for all cards, and they should test all the features of the cards and go out of their way to find new things to test, not just basic current game-like situations. That's if you are benchmarking hardware and not benchmarking game support for a particular video card, where image quality means nothing and it's all about frames per second in games with crude graphics designed around the lowest common denominator.

Bleh :)
 
Driver vendors need to work with game developers, but they shouldn't replace developers' code with their own.

Dave H is bang on.

I don't want vendors producing better drivers for benchmarks that don't help real games. What is the point of that?

Nvidia's new drivers make the ut2003 benchmark run up to 30% faster, yet it crashes on my Geforce 3 every time I play it and my TNT card goes into screensaver mode and never comes back except for the mouse pointer. Is this the way it is meant to be played?

I want a better shader compiler not shaders that replace the water effect in 3DMark03.

If the vendor can't alter the benchmark they either have to make better drivers or better video cards.
 
Nvidia's new drivers make the ut2003 benchmark run up to 30% faster, yet it crashes on my Geforce 3 every time I play it and my TNT card goes into screensaver mode and never comes back except for the mouse pointer. Is this the way it is meant to be played?

Crashes with UT2003 on your GF3? I should probably try it then--43.51s have worked fine for me, but I have yet to try 44.03. Haven't heard any bad reports about it though, so I don't know what to tell you.
 
mat said:
BTW: a video is not really an option, since it's just one resolution, one type of FSAA, ...

Well, you could just include a bunch of videos and have the driver select the right one based on the benchmark settings. (Of course if the "videos" were uncompressed--which they would have to be--this would make the driver download somewhere in the hundreds of gigabytes...but I think you get the point...)

Replacing them with a hand-coded, hand-optimized shader just for a benchmark may not be a very... well, nice solution, but just look at the SPEC benchmark. The last time I read about it, the source code was fixed and everyone could choose their own compiler to optimize it as much as they want. (I don't know if that's still true.)

SPEC CPU provides a great perspective on these sorts of issues, as it has been dealing with them for IIRC 14 years now. SPEC CPU is a collection of 26 "application kernel" benchmarks, meaning each benchmark is a section (usually but not always the "main workload" section) of a real application.

[side note: everybody refers to 3dMark as a "synthetic" benchmark (including Futuremark themselves), but going by the terminology generally used, it is not; it's much more like an application kernel benchmark, like SPEC. Of course it's not technically an application kernel benchmark, since the code doesn't come from real applications (although one could argue 3dMark01 was a legit application kernel benchmark), so maybe "simulated application kernel" is the best terminology. But "synthetic" is totally wrong. That is, the "feature tests" of 3dMark, like the fillrate tests, are synthetic benchmarks. The more complex "feature tests", like the pixel shader tests and the ragdoll test, are pushing it, and are probably better termed "kernel" benchmarks (like Dhrystone/Whetstone on the CPU side). The game tests are not synthetic whatsoever. Just a pet peeve of mine. Anyways.]

The benchmark code itself is open source. The benchmark harness (i.e. the code that runs each benchmark, tests results for correctness, calculates the score, etc.) is closed source. The dataset is secret.

Each IHV (all of whom, incidentally, pay a fee to join the SPEC Consortium, which makes the rules and creates the benchmark) (oh, and double incidentally, Nvidia somehow managed to scrounge around in between the couch cushions and find the money to join the SPEC Consortium, clearly for SPEC_Viewperf) gets a copy of the test source, and a copy of the benchmark with a test dataset. Profiling compiler runs can be made on this test dataset (i.e. some compilers can insert branch hints etc. based on performance analysis of the application running). When they make an official score run, they get a new dataset for one-time use. The test harness checks to make sure all computed answers are correct.
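To give a flavor of the branch-hint side of that, here's a small C++ sketch (data and function made up): `__builtin_expect` is the GCC/Clang builtin for a hand-written branch hint, standing in for the hints a profile-guided build would derive automatically from the recorded run on the test dataset.

```cpp
#include <cstdio>

// After a profiling run on the test dataset shows a branch is almost
// never taken, the compiler can lay out the code so the common path
// falls straight through. The macro below is the hand-written
// equivalent of that hint (GCC/Clang only; it's a no-op elsewhere).
#if defined(__GNUC__)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define UNLIKELY(x) (x)
#endif

long process(const long* data, long n) {
    long sum = 0;
    for (long i = 0; i < n; ++i) {
        if (UNLIKELY(data[i] < 0))   // profiling showed this is rare
            continue;                // cold path: skip bad samples
        sum += data[i];              // hot path: laid out fall-through
    }
    return sum;
}

int main() {
    long data[] = {3, 1, 4, 1, -5, 9, 2, 6};
    std::printf("%ld\n", process(data, 8));
}
```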

It is true that the IHVs often have full control over the compilers they use, but that doesn't mean the benchmarks can just be compiled with any old "compiler". The compilers used have to be real, available, general-purpose compilers. (Well, technically, they have to be available within 3 months of score publication.) Special "SPEC versions" are not allowed. Nor is it allowed to have a compiler which coincidentally only works for the SPEC benchmarks; it has to be a real, viable compiler.

Moreover, it is not legal for the compiler to recognize and special-case code from SPEC; all optimizations have to be legitimate, general-case optimizations. Now, of course this rule is difficult to enforce precisely (after all, these are all closed source compilers), and might be said to get bent somewhat. Certainly many optimizations in current compilers would not be there were they not in some way applicable to a SPEC benchmark. On the other hand, you can 100% guarantee that none of them would be fooled by trivial source code modifications like the ones Futuremark made from build 320 to build 330, nor will they know to attempt one optimization in one spot and not in another because they know it doesn't work there. (Except see below on base vs. peak.)

As a small illustrative example, about a year ago or so, Sun released a new compiler which produced spectacular gains in "art", a SPEC_FP subtest (like 600% or something). Everybody thought they were cheating. Well, apparently there was a meeting of the Consortium held and Sun had to prove that it was a general-case optimization. Which they did apparently to everyone's satisfaction. (Unfortunately the details are not public, as it's of course a proprietary optimization in their compiler.) Strangely, none of the other IHVs have managed the same results for art, but it's pretty much a given that art won't make it back for SPEC_2004.

One final nuance about SPEC CPU: there are two different scores reported, "base" and "peak". As it turns out, even truly general-case compiler optimizations don't always work; sometimes they make certain assumptions which may not be true (e.g. about data alignment, etc.); sometimes they just break things for unknowable reasons. (A comparison to shader programs is probably not fair; after all, C on a computer is an infinitely more complex language and platform combination than PS2.0 on an R3xx/NV3x fragment shader pipeline.) That's why all compilers have a large number of switches which turn on and off various optimizations; these switches facilitate trade-offs between performance, correctness and code size.
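One tiny, concrete example of an optimization assumption that can break results: floating-point addition isn't associative, so a switch that permits reassociation (GCC's -ffast-math, say) can speed code up while changing the answer. This standalone C++ snippet (numbers picked only to expose the effect) shows the two groupings of the same sum disagreeing:

```cpp
#include <cstdio>

// Floating-point addition is not associative, so an optimization that
// reorders it can legitimately be faster while changing the numeric
// result. That is exactly the kind of switch that helps one subtest
// but has to stay off for another.
int main() {
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   // = 1.0f
    float right = a + (b + c);   // = 0.0f: c is lost in b's rounding
    std::printf("left=%g right=%g\n", left, right);
}
```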

A "base" SPEC run means the same switches must be used to compile all 26 tests. Thus the compiler can "know" that it is compiling SPEC CPU, but given how huge a codebase SPEC CPU represents, that doesn't really tell it much. Conversely, a "peak" run means each subtest can use different compiler switches, so that an optimization which breaks one subtest can still be used on another. In practice, the gap between "base" and "peak" scores has steadily come down over the years, as compilers have gotten smarter and more able to differentiate between when an optimization will be legal and when it will not. Vendor PR tends to quote peak scores, but engineers tend to quote base scores. :D

So, that's how SPEC works. What can it tell us about how to benchmark graphics cards? Well, for one thing, we can say that all the cheats identified by Futuremark--both Nvidia's and ATI's--would be illegal for SPEC. And those of Nvidia's cheats which affected output quality would get automatically caught by the benchmark harness itself; the run would never even be scored. At first I thought the insertion of static clip planes would be a "legal" optimization under SPEC. After all, it is a general precept of compiler optimization that if the compiler can prove that a piece of code will never be executed, it is totally legal to cut it out. But only if it can be proven at compile time that, no matter what the input, that code will never get touched; if it gets touched for certain input but not for other input, it must be left in.
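In code terms, the rule looks like this (toy C++, functions invented for the example): the first branch can be removed because it is provably unreachable for every input, while the second only looks dead for the inputs you happened to feed it, which is exactly the static-clip-plane situation.

```cpp
#include <cstdio>

// Provably dead: for every possible input the condition is false, so an
// optimizer may legally delete the branch.
int shade_provable(int x) {
    int y = x * 0;          // always 0, no matter what x is
    if (y != 0)             // provably never taken
        return -1;          // dead code: safe to remove for all inputs
    return x;
}

// Not provably dead: whether the branch runs depends on the input. It
// may never fire for one camera path (one dataset), but cutting it out
// changes the answer for another, so it must stay in.
int shade_input_dependent(int x) {
    if (x < 0)              // "off-screen" only for the inputs seen so far
        return 0;
    return x;
}

int main() {
    std::printf("%d %d\n", shade_provable(7), shade_input_dependent(-3));
}
```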

Which brings me to my next point: the camera path in a 3d benchmark is analogous to the dataset input in SPEC. Under the SPEC way of doing things, 3dMark would ship with a "test" camera path to allow vendors and everyone else to play around with it, but "official 3dMark scores" would only be obtained by using a different, secret camera path as input. (Of course, many other features could be considered "input" in addition to the camera path: the geometry rendered, the textures used, etc. It's difficult to know whether to classify shaders as "input" or "benchmark code"...)

A big lesson from SPEC CPU is that whenever a benchmark includes uncompiled code, the benchmark inherently tests the compiler as much as the hardware performance; thus the benchmark rules must clearly specify what is and is not allowed of the compiler. Since shader code must be compiled at run-time to run on graphics cards, this lesson applies to any graphics benchmark with shader code. That's why ATI's optimization was a cheat even though the exact same optimization would have been legitimate if the compiler generated it in the general-case rather than as a special-case search-and-replace for 3dMark03 GT4 only.

A sort of social lesson from SPEC CPU is that most vendors don't cry and whine and pout and throw things and have temper tantrums and quit the Consortium and smear SPEC in the media when their products don't win. Indeed, in recent years the big story with SPEC CPU has been how Intel's chips have emerged at the top of the heap, with AMD right behind, even though they cost an order of magnitude (or more) less than the competition on a CPU-by-CPU basis. (Of course, right now IBM's 1.7 GHz Power4 has vaulted to the top of the heap, but the P4 still holds the SPECint crown, and the upcoming 1.5 GHz.13u Itanium2 ("Madison") is going to blow everything out of the water.) Amazingly, HP and IBM dutifully submitted SPEC scores for their mega-expensive PA-RISC and POWER3 chips even as they were getting doubled in SPEC performance by an $80 Celeron (not that Intel submits SPEC scores for Celeron, but one can easily estimate). Despite the fact that they had a legitimate disadvantage (unlike Intel, they do not have their own compiler group and thus have to use Intel's P4-optimized compiler), AMD has never complained about SPEC (although their fanboys sure have), and indeed proudly featured their SPEC scores when Opteron launched.

Of course, there is one company that, whilst never accusing SPEC of "intentionally trying to create a scenario that makes our products look bad," has certainly chosen to never submit scores to SPEC (although they are a member, interestingly!), and instead rely on hand-created benchmarks and even comparisons of the theoretical ALU throughput of handpicked instructions! I'm referring, of course, to Steve Jobs and Apple; the SPEC scores of the G3 and G4 are technically unknown, although tests at the well-respected German tech magazine c't (incidentally, their media group company is a member of the SPEC Consortium) showed they were about equal clock-for-clock with a PIII.

Welcome to the reality distortion field, Nvidia.
 
Hyp-X said:
For a game, a general optimization (one which works for the whole game, not just timedemos) might include information that is not deduced from the rendering commands, but comes from "knowing" the game.

It is possible that a game developer failed to fully optimize the game for one of the following reasons:
a.) The developer didn't read the optimization docs, or was unable to implement them correctly.
b.) The developer wrote the game when such optimizations weren't necessary or beneficial for the cards around at the time.
c.) The developer was working with an API, or a version of an API, which does not contain the features necessary to fully optimize the game.
d.) The developer was using only a general rendering path (not optimized for the given card), or was using the general path on that card.

In the case of 3dMark, the inclusion of all the 3d IHVs as members of the beta program is meant to generally take care of issues a-c. Of course Nvidia is no longer a member, but they were there during most of the 3dMark03 development process, and obviously only have themselves to blame if, having taken their ball and gone home, the benchmark does not reflect their concerns.

As for issue d, Futuremark obviously makes a conscious decision to use a general rendering path, based around the DX9 spec, rather than use vendor-specific rendering paths. Clearly this results in an inevitable loss of "accuracy", as far as how closely 3dMark scores will reflect actual game benchmarks. On the other hand, FM is obviously put in a position where they need to choose between accuracy and fairness. I think their choice is the right one, and is the same choice made by all vendor-neutral industry-standard benchmarks. Again, this isn't to say it doesn't have consequences for accuracy which must be kept in mind.

The problem is with a certain benchmark which is sometimes referred to as providing info on how future games are expected to run, while at other times it is stated that it only gives the card a workload that represents nothing.

It is funny seeing people jump from one representation to the other and back depending on what they want to prove (or bash). People should make up their minds. It's either one or the other; it can't be both.

Are you referring to me? I've certainly argued that 3dMark03 has some relevance in representing the performance characteristics of future games. I hope I've never claimed, though, that there will not be a systematic performance difference between it and those games, as if 3dMark03 were a 2004 PC game traveled back in time to today.

As for the notion that it either "provides info on how future games are expected to run" or has a workload "that represents nothing", I reject that as far too binary a dichotomy. Instead I'll stick with what I wrote above: it is a benchmark of a vendor-neutral conception of the DX9 API "rendering real-time workloads which resemble the estimated visuals of a mid-2004 PC game". Its goal is not to estimate a mid-2004 PC game's performance as exactly as possible, but neither does the workload represent nothing.

Again, a balance between accuracy and fairness. And luckily, since most of those 2004 PC games will be written in DX9 (although of course not a vendor-neutral conception of it), I think 3dMark03 at least has something useful to say about how today's cards will stack up on a broad range of tomorrow's games.

Does that mean we shouldn't bother benchmarking those games when they come out, because 3dMark03 already told us all we needed to know a year prior? No.

If it's a benchmark that is meant to simulate gaming conditions, then the optimizations above are legal, as they will be done for games as well (assuming one of the conditions a/b/c/d applies).

Usually the best time to catch conditions a/b/c/d is in development; that's why all the IHVs have extensive developer relations departments, which continually answer questions and even go so far as to write code for game developers. Again, IMO the 3dMark beta program is meant to fill a role similar to devrel's for issues a-c, with d purposely left alone in the interest of being a vendor-neutral benchmark.

Now, if a game manages to go to production without enough devrel correction of a/b/c/d, or if an IHV figures out a cool optimization after the game is out, or for new hardware features ala b (although the IHV usually has a decent idea what their hardware will be like well ahead of time), then sure it's ok for game-specific optimizations to be built into drivers so long as they do not affect visual quality. (Although it would be better, to the extent it's possible, to have devrel go through the developer who could issue these things through a patch).

Do I seriously think 3dMark03 compromises its predictive abilities if FM refuses to allow special-case optimizations in IHV drivers because doing so fails to model this "last ditch, devrel didn't do enough to prevent a/b/c/d during game development" motivation for driver optimizations? No. Do you?

If it's a benchmark that is only meant to give the card a certain workload, then such optimizations are not legal, but then please stop saying such stupid things as that the benchmark has anything to do with either present or future games.

IMO the stupid thing to say is that, because a benchmark cannot possibly be broadly vendor-neutral but still model the optimization process of a real game 100% accurately, it therefore has nothing to do with the performance of present or future games. Rather it just sacrifices a certain amount of accuracy for a vital amount of fairness. Considering how closely coupled the various DX level specs are to the graphics hardware of their time, I don't think FM is going so far wrong in writing to the DX9 API and using some of the general techniques and rendering styles which are certain to become popular in upcoming games.
 
demalion said:
Hyp-X said:
...
If it's a benchmark that is meant to simulate gaming conditions, then the optimizations above are legal, as they will be done for games as well (assuming one of the conditions a/b/c/d applies).

If it's a benchmark that is only meant to give the card a certain workload, then such optimizations are not legal, but then please stop saying such stupid things as that the benchmark has anything to do with either present or future games.

I disagree, and I think it is a matter of component isolation.

Components: workload for an API, drivers, hardware, developer relations.

Where we depart is that you are focusing on the benchmark being representative of games by being treated exactly as games are by IHVs. That's a bad goal for a benchmark that is trying to be representative, because devrel interaction with particular games can't be a successful control parameter at all (there is no set quality guideline for devrel's input, for starters, as they can propose lower-quality alternatives if it is acceptable to the developer...and whether it is acceptable or not will vary).

I agree that a benchmark wouldn't make sense if it produced different output on different cards.

What 3dmark's goal seems to be is to be representative of the hardware and driver capabilities to handle a fixed workload for the DX API.

We both know this is not true; it puts quite a different workload on different cards.

Since games use a workload for an API, drivers, and hardware, isolating out developer relations doesn't make it "useless" for gamers to get an indication about future games (that use shaders in the DX API); it only removes the uncontrolled variable of devrel.

You mean that if I optimize my shaders to run fast on NV3x, keeping in mind to use as few registers as possible without increasing the number of instructions, then this is a devrel issue?

The developer (Futuremark) has already made the code with this goal in mind, and the devrel (the person who presumably hand-optimized the GT 4 shader code) is precluded from participating, because that factor is not universally reproducible. However, if the drivers can achieve the optimization (at least in the case of ATI, since it is output-identical) without foreknowledge, it is universally reproducible, and that focus is the key departure from a typical game.

The ATI optimization is a good example, as it came without a quality sacrifice. It was made with knowledge of what shaders would be used in 3dmark.
This type of optimization would be carried out for a normal game (if they had to), and it benefits the gamers who play that game; it doesn't just push the timedemo result.

The fact that they could do such an optimization comes down to one of the following reasons:
1.) The DX9 API is not efficient enough to describe the hardware (so a more optimal shader couldn't be written).
2.) A more optimal shader could have been written, but FM didn't write it for one reason or another.

It doesn't take "jumping from one representation to another" to hold that viewpoint. :-? All it requires is to disagree that 3dmark 03 is a "workload that represents nothing." It seems to represent atleast as much as any game's workload, and some specific thought was given to allow it to be more directly useful for comparison of hardware and drivers for the workload within one particular API.

The workload was pushed from the CPU to the GPU as much as possible, even if that requires skinning the same mesh 6 or more times...
All this because users of 3dmark 2001 were complaining it was too CPU-limited?

I think our disagreement is most simply stated as a disagreement about the inclusion of the "developer relations" component in benchmarking. Devrel isn't omnipresent for every game...Futuremark is trying to remove the popularity of 3dmark 03 as a determining factor in its performance results, so it can be more successfully reproducible for games, and I only think that makes sense. :?:

Futuremark is trying to remove the popularity of 3dmark03 ???
Ohh.
 
I think that a code of ethics would be great, although I would like to see one for reviewers also.
 
Hyp-X said:
...

What 3dmark's goal seems to be is to be representative of the hardware and driver capabilities to handle a fixed workload for the DX API.

We both know this is not true; it puts quite a different workload on different cards.
...

For now, could you clarify this? What do you mean by different workload for different cards here? It is the ability to execute the workload that differs between drivers and cards.
 
Himself said:
What's good about synthetic benchmarks like 3dmark03 is that all the features of a video card get tested, not just how much fillrate it has or how well it can multitexture. In that sense it is vastly superior at telling us how future games are going to perform than any game timedemo you care to name.

Video card benchmarks have to do the same work for all cards, and they should test all the features of the cards and go out of their way to find new things to test, not just basic current game-like situations. That's if you are benchmarking hardware and not benchmarking game support for a particular video card, where image quality means nothing and it's all about frames per second in games with crude graphics designed around the lowest common denominator.

I agree.
I think I'd expect a synthetic benchmark to be... well, more synthetic.
I like feature tests.
Vertex shader tests, pixel shader tests, fillrate tests: they are good at giving you information about some things.

It is the game tests that are more problematic.
In making a game you want to balance the rendering pipeline.
It means that (greatly simplified) you choose some workload for vertex processing and some workload for pixel processing.

Being vertex-limited on one card and pixel-limited on another is possible. You have to choose. In a game it's an agreement between the programmers and the artists. (Artists don't want the pipeline to be balanced; they want it to look as good as possible.)

But how do you choose the balance of the workload in a benchmark?
Can you tell why the workload was chosen this way or that?
Was that workload balance chosen because it represents future games?
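To put rough numbers on that balance question (all throughputs and scene sizes below are invented purely for illustration), here's a back-of-the-envelope C++ sketch: the frame time is roughly set by whichever stage is slower, so the same workload can come out vertex-limited on one card and pixel-limited on another.

```cpp
#include <algorithm>
#include <cstdio>

// Rough model: frame time ~ max(vertex work / vertex rate,
//                               pixel work / pixel rate).
// All numbers are hypothetical, chosen only to show the crossover.
struct Card { const char* name; double verts_per_sec; double pixels_per_sec; };

int main() {
    const double verts_per_frame  = 2.0e6;   // hypothetical scene geometry
    const double pixels_per_frame = 40.0e6;  // hypothetical scene fill

    Card cards[] = {
        {"Card A (strong vertex, weak pixel)", 200e6,  800e6},
        {"Card B (weak vertex, strong pixel)",  60e6, 4000e6},
    };

    for (const Card& c : cards) {
        double t_vertex = verts_per_frame  / c.verts_per_sec;
        double t_pixel  = pixels_per_frame / c.pixels_per_sec;
        double t_frame  = std::max(t_vertex, t_pixel);
        std::printf("%s: %s-limited, ~%.0f fps\n", c.name,
                    t_vertex > t_pixel ? "vertex" : "pixel", 1.0 / t_frame);
    }
}
```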
 
Dave H said:
Hyp-X said:
It is possible that a game developer failed to fully optimize the game for one of the following reasons:
a.) The developer didn't read the optimization docs, or was unable to implement them correctly.
b.) The developer wrote the game when such optimizations weren't necessary or beneficial for the cards around at the time.
c.) The developer was working with an API, or a version of an API, which does not contain the features necessary to fully optimize the game.
d.) The developer was using only a general rendering path (not optimized for the given card), or was using the general path on that card.

In the case of 3dMark, the inclusion of all the 3d IHVs as members of the beta program is meant to generally take care of issues a-c. Of course Nvidia is no longer a member, but they were there during most of the 3dMark03 development process, and obviously only have themselves to blame if, having taken their ball and gone home, the benchmark does not reflect their concerns.

Is it that it doesn't reflect their concerns because they quit the beta testing,
- or -
did they quit the beta testing because FM didn't want to "reflect their concerns"?

As for issue d, Futuremark obviously makes a conscious decision to use a general rendering path, based around the DX9 spec, rather than use vendor-specific rendering paths.

The funny thing is that they have different rendering paths, but they say they are not vendor-specific paths.
Since they do this, they have no basis to justify those paths.

Clearly this results in an inevitable loss of "accuracy", as far as how closely 3dMark scores will reflect actual game benchmarks. On the other hand, FM is obviously put in a position where they need to choose between accuracy and fairness. I think their choice is the right one, and is the same choice made by all vendor-neutral industry-standard benchmarks. Again, this isn't to say it doesn't have consequences for accuracy which must be kept in mind.

So you agree that by disallowing equal-quality optimizations (I mean the things ATI did, and NOT what nVidia did), they become less accurate at displaying future game performance?
Cool.

Are you referring to me? I've certainly argued that 3dMark03 has some relevance in representing the performance characteristics of future games. I hope I've never claimed, though, that there will not be a systematic performance difference between it and those games, as if 3dMark03 were a 2004 PC game traveled back in time to today.

I didn't say it was possible to write such a benchmark.
But some people think it is.
Not necessarily you.

The problem is not that future games will give you different framerates (that's inevitable).
The problem is when future games show a different vendor bias.

And for this issue to be solved, it's NOT the IHVs but the ISVs who are the right ones to consult.
IHVs can't tell you what features will be used in future games, only maybe what they'd like to be used.

As for the notion that it either "provides info on how future games are expected to run" or has a workload "that represents nothing", I reject that as far too binary a dichotomy. Instead I'll stick with what I wrote above: it is a benchmark of a vendor-neutral conception of the DX9 API "rendering real-time workloads which resemble the estimated visuals of a mid-2004 PC game". Its goal is not to estimate a mid-2004 PC game's performance as exactly as possible, but neither does the workload represent nothing.

Again I say: if a mid-2004 PC game runs either 2x as fast or 0.5x as fast as 3dmark does on most of the video cards of that time, then that's fine.
It's not the absolute performance that matters, it's the relative performance.

Do I seriously think 3dMark03 compromises its predictive abilities if FM refuses to allow special-case optimizations in IHV drivers because doing so fails to model this "last ditch, devrel didn't do enough to prevent a/b/c/d during game development" motivation for driver optimizations? No. Do you?

Well "devrel didn't do enough to prevent" is a one sided view.
What about hardware that came out half a year after the game?

What could they do?
They either ask the game developer to create a patch with a new optimization, or they try to do that in the driver.

FM will refuse the former, and disallow the latter.

I don't think FM is going so far wrong in writing to the DX9 API and using some of the general techniques and rendering styles which are certain to become popular in upcoming games.

So there are two game tests with Doom3-like visuals, but with codepaths that behave very differently from Doom3's. Will other developers do the same?

Which one of the tests represents techniques present in future RTS games?

Will every developer use stencil-based shadows instead of shadow buffering?

What general techniques and rendering styles are we talking about?
 