mat said:
BTW: a video is not really an option, since it's just one resolution, one type of FSAA, ...
Well, you could just include a bunch of videos and have the driver select the right one based on the benchmark settings. (Of course if the "videos" were uncompressed--which they would have to be--this would make the driver download somewhere in the hundreds of gigabytes...but I think you get the point...)
Replacing them with a handcoded, hand-optimized shader just for a benchmark may not be a very... well, nice solution, but just look at the SPEC benchmark. As of the last time I read about it, the source code is fixed and everyone can choose their own compiler to optimize it as much as they want (don't know if that's still true).
SPEC CPU provides a great perspective on these sorts of issues, as it has been dealing with them for IIRC 14 years now. SPEC CPU is a collection of 26 "application kernel" benchmarks, meaning each benchmark is a section (usually but not always the "main workload" section) of a real application.
[side note: everybody refers to 3dMark as a "synthetic" benchmark (including Futuremark themselves), but going by the terminology generally used, it is not; it's much more like an application kernel benchmark, like SPEC. Of course it's not technically an application kernel benchmark, since the code doesn't come from real applications (although one could argue 3dMark01 was a legit application kernel benchmark), so maybe "simulated application kernel" is the best terminology. But "synthetic" is totally wrong. Granted, the "feature tests" of 3dMark, like the fillrate tests, are synthetic benchmarks. The more complex feature tests, like the pixel shader tests and the ragdoll test, are pushing it, and are probably better termed "kernel" benchmarks (like Dhrystone/Whetstone on the CPU side). But the game tests are not synthetic whatsoever. Just a pet peeve of mine. Anyways.]
The benchmark code itself is open source. The benchmark harness (i.e. the code that runs each benchmark, tests results for correctness, calculates the score, etc.) is closed source. The dataset is secret.
Each IHV (all of whom, incidentally, pay a fee to join the SPEC Consortium, which makes the rules and creates the benchmarks) (oh, and double incidentally, Nvidia somehow managed to scrounge around in between the couch cushions and find the money to join the SPEC Consortium, clearly for SPECviewperf) gets a copy of the test source, and a copy of the benchmark with a test dataset. Profiling compiler runs can be made on this test dataset (i.e. some compilers can insert branch hints etc. based on performance analysis of the application running). When they make an official score run, they get a new dataset for one-time use. The test harness checks to make sure all computed answers are correct.
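(To make the "profiling run" idea concrete, here's a minimal C sketch, not from SPEC itself, of the kind of fact a profiling run feeds back to the compiler. I'm writing the hint by hand with GCC's __builtin_expect; a profile-guided build would derive the same "this branch is almost never taken" information automatically from the test dataset and lay out the code accordingly.)

    #include <stdio.h>

    /* Hot loop with a rare error path.  A profiling run teaches the
     * compiler which way the branch almost always goes; here the same
     * hint is hard-coded with GCC's __builtin_expect for illustration. */
    static long process(long value)
    {
        if (__builtin_expect(value < 0, 0)) {   /* rarely-taken path */
            fprintf(stderr, "negative input\n");
            return 0;
        }
        return value * 2;                       /* hot path */
    }

    int main(void)
    {
        long sum = 0;
        for (long i = 0; i < 1000000; i++)
            sum += process(i);
        printf("%ld\n", sum);
        return 0;
    }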
It is true that the IHVs often have full control over the compilers they use, but that doesn't mean the benchmarks can just be compiled with any old "compiler". The compilers used have to be real, available, general-purpose compilers. (Well, technically, they have to be available within 3 months of score publication.) Special "SPEC versions" are not allowed. Nor is it allowed to have a compiler which coincidentally only works for the SPEC benchmarks; it has to be a real, viable compiler.
Moreover, it is not legal for the compiler to recognize and special-case code from SPEC; all optimizations have to be legitimate, general-case optimizations. Now, of course this rule is difficult to enforce precisely (after all, these are all closed source compilers), and might be said to get bent somewhat. Certainly many optimizations in current compilers would not be there were they not in some way applicable to a SPEC benchmark. On the other hand, you can 100% guarantee that none of them would be fooled by trivial source code modifications like the ones Futuremark made from build 320 to build 330, nor will they know to attempt one optimization in one spot and not in another because they know it doesn't work there. (Except see below on base vs. peak.)
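(Concretely, and speaking very loosely, here are two hypothetical "builds" of the same inner loop; the second just renames the variables and reorders the operands, the way Futuremark shuffled things between 320 and 330. A real general-case optimizer treats them the same, e.g. it hoists the loop-invariant s * k out of both loops, whereas anything keyed to the exact original text would only "recognize" the first.)

    /* Build "320": */
    float dot_scaled_v1(const float *a, const float *b, int n,
                        float s, float k)
    {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += a[i] * b[i] * (s * k);   /* s * k is loop-invariant */
        return acc;
    }

    /* Build "330": renamed and reordered, but the same computation
     * (up to floating-point rounding). */
    float dot_scaled_v2(const float *x, const float *y, int n,
                        float s, float k)
    {
        float total = 0.0f;
        for (int j = 0; j < n; j++)
            total += (s * k) * x[j] * y[j]; /* still hoisted just fine */
        return total;
    }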
As a small illustrative example: about a year or so ago, Sun released a new compiler which produced spectacular gains on "art", a SPECfp subtest (something like 600%). Everybody thought they were cheating. Apparently a Consortium meeting was held and Sun had to prove it was a general-case optimization, which they did, apparently to everyone's satisfaction. (Unfortunately the details are not public, as it's of course a proprietary optimization in their compiler.) Strangely, none of the other IHVs have managed the same results on art, and it's pretty much a given that art won't make it back for SPEC CPU 2004.
One final nuance about SPEC CPU: there are two different scores reported, "base" and "peak". As it turns out, even truly general-case compiler optimizations don't always work; sometimes they make certain assumptions which may not be true (e.g. about data alignment, etc.); sometimes they just break things for unknowable reasons. (A comparison to shader programs is probably not fair; after all, C on a computer is an infinitely more complex language and platform combination than PS2.0 on an R3xx/NV3x fragment shader pipeline.) That's why all compilers have a large number of switches which turn on and off various optimizations; these switches facilitate trade-offs between performance, correctness and code size.
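(A classic, non-SPEC example of an optimization resting on an assumption that real code sometimes violates: under C's strict-aliasing rules the compiler may assume pointers to different types never refer to the same object, so old-style type punning can break at higher optimization levels, which is exactly why a switch like GCC's -fno-strict-aliasing exists.)

    #include <stdio.h>
    #include <string.h>

    /* Reinterpreting an int's bits as a float: the cast version violates
     * the aliasing assumption and can misbehave when the optimizer relies
     * on it; the memcpy version is well-defined and typically compiles to
     * the same single move. */
    float bits_to_float_punned(unsigned int u)
    {
        return *(float *)&u;            /* undefined behaviour under C's
                                           strict-aliasing rules          */
    }

    float bits_to_float_safe(unsigned int u)
    {
        float f;
        memcpy(&f, &u, sizeof f);       /* well-defined                   */
        return f;
    }

    int main(void)
    {
        printf("%g %g\n", bits_to_float_punned(0x3f800000u),
                          bits_to_float_safe(0x3f800000u));
        return 0;
    }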
A "base" SPEC run means the same switches must be used to compile all 26 tests. Thus the compiler can "know" that it is compiling SPEC CPU, but given how huge a codebase SPEC CPU represents, that doesn't really tell it much. Conversely, a "peak" run means each subtest can use different compiler switches, so that an optimization which breaks one subtest can still be used on another. In practice, the gap between "base" and "peak" scores has steadily come down over the years, as compilers have gotten smarter and more able to differentiate between when an optimization will be legal and when it will not. Vendor PR tends to quote peak scores, but engineers tend to quote base scores.
So, that's how SPEC works. What can it tell us about how to benchmark graphics cards? Well, for one thing, we can say that all the cheats identified by Futuremark--both Nvidia's and ATI's--would be illegal for SPEC. And those of Nvidia's cheats which affected output quality would get automatically caught by the benchmark harness itself; the run would never even be scored. At first I thought the insertion of static clip planes would be a "legal" optimization under SPEC. After all, it is a general precept of compiler optimization that if the compiler can prove that a piece of code will never be executed, it is totally legal to cut it out. But only if it can be proven at compile time that, no matter what the input, that code will never get touched; if it gets touched for certain input but not for other input, it must be left in.
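(Here's the dead-code rule in miniature, as a hypothetical C sketch: the first branch may legally be deleted because it is unreachable for every possible input; deleting the second because the one dataset you happen to test never takes it is, in essence, the static clip plane trick.)

    #include <stdio.h>

    static int expensive_work(int x)
    {
        for (int i = 0; i < 1000; i++)
            x = (x * 31 + i) % 65537;
        return x;
    }

    int score_frame(int frame, int debug_mode)
    {
        int result = frame;

        if (0) {                /* provably dead for ALL inputs: the
                                   compiler may remove this branch     */
            result = expensive_work(result);
        }

        if (debug_mode) {       /* live for SOME inputs: must stay,
                                   even if the known benchmark run
                                   never sets debug_mode               */
            result = expensive_work(result);
        }
        return result;
    }

    int main(void)
    {
        printf("%d %d\n", score_frame(42, 0), score_frame(42, 1));
        return 0;
    }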
Which brings me to my next point: the camera path in a 3d benchmark is analogous to the dataset input in SPEC. Under the SPEC way of doing things, 3dMark would ship with a "test" camera path to allow vendors and everyone else to play around with it, but "official 3dMark scores" would only be obtained by using a different, secret camera path as input. (Of course, many other features could be considered "input" in addition to the camera path: the geometry rendered, the textures used, etc. It's difficult to know whether to classify shaders as "input" or "benchmark code"...)
A big lesson from SPEC CPU is that whenever a benchmark includes uncompiled code, the benchmark inherently tests the compiler as much as the hardware; thus the benchmark rules must clearly specify what is and is not allowed of the compiler. Since shader code must be compiled at run time to run on graphics cards, this lesson applies to any graphics benchmark that includes shader code. That's why ATI's optimization was a cheat even though the exact same optimization would have been legitimate had the compiler generated it in the general case rather than as a special-case search-and-replace for 3dMark03 GT4 only.
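(To put the distinction in code, a deliberately toy and entirely made-up sketch: the first routine applies a rewrite rule to any instruction that matches its preconditions, in any program it is handed; the second only fires when it fingerprints one specific shader and swaps in a canned replacement. The first is what a compiler is allowed to do; the second is the GT4 search-and-replace.)

    #include <stdio.h>
    #include <string.h>

    /* General case: rewrite "mul rD, rS, 2" into "add rD, rS, rS" for
     * any registers, wherever it appears. */
    int optimize_general(const char *instr, char *out, size_t outlen)
    {
        char dst[16], src[16];
        if (sscanf(instr, "mul r%15[0-9], r%15[0-9], 2", dst, src) == 2) {
            snprintf(out, outlen, "add r%s, r%s, r%s", dst, src, src);
            return 1;                       /* rewritten */
        }
        snprintf(out, outlen, "%s", instr);
        return 0;                           /* left alone */
    }

    /* Special case: only fires on one recognized program. */
    const char *optimize_cheat(const char *whole_shader)
    {
        if (strstr(whole_shader, "GT4 water shader, build 320"))
            return "<hand-written replacement valid only here>";
        return whole_shader;                /* everything else: nothing */
    }

    int main(void)
    {
        char buf[64];
        optimize_general("mul r3, r3, 2", buf, sizeof buf);
        printf("%s\n", buf);                /* prints: add r3, r3, r3 */
        printf("%s\n", optimize_cheat("mul r3, r3, 2"));
        return 0;
    }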
A sort of social lesson from SPEC CPU is that most vendors don't cry and whine and pout and throw things and have temper tantrums and quit the Consortium and smear SPEC in the media when their products don't win. Indeed, in recent years the big story with SPEC CPU has been how Intel's chips have emerged at the top of the heap, with AMD right behind, even though they cost an order of magnitude (or more) less than the competition on a CPU-by-CPU basis. (Of course, right now IBM's 1.7 GHz POWER4 has vaulted to the top of the heap, but the P4 still holds the SPECint crown, and the upcoming 1.5 GHz 0.13µ Itanium 2 ("Madison") is going to blow everything out of the water.) Amazingly, HP and IBM dutifully submitted SPEC scores for their mega-expensive PA-RISC and POWER3 chips even as they were getting doubled in SPEC performance by an $80 Celeron (not that Intel submits SPEC scores for Celeron, but one can easily estimate). And despite having a legitimate grievance (unlike Intel, AMD doesn't have its own compiler group and thus has to use Intel's P4-optimized compiler), AMD has never complained about SPEC (although their fanboys sure have), and indeed proudly featured their SPEC scores when Opteron launched.
Of course, there is one company that, whilst never accusing SPEC of "intentionally trying to create a scenario that makes our products look bad," has certainly chosen never to submit scores to SPEC (although they are a member, interestingly!), and instead relies on hand-created benchmarks and even comparisons of the theoretical ALU throughput of handpicked instructions! I'm referring, of course, to Steve Jobs and Apple; the SPEC scores of the G3 and G4 are technically unknown, although tests at the well-respected German tech magazine c't (incidentally, their media group company is a member of the SPEC Consortium) showed they were about equal clock-for-clock with a PIII.
Welcome to the reality distortion field, Nvidia.