To make a valid benchmark...

Discussion in 'General 3D Technology' started by Nite_Hawk, Jun 3, 2003.

  1. WaltC

    Veteran

    Joined:
    Jul 22, 2002
    Messages:
    2,710
    Likes Received:
    8
    Location:
    BelleVue Sanatorium, Billary, NY. Patient privile
    I like the seed value idea for camera paths very much. I think it would pretty well eliminate the possibility of a company doing what nVidia did with 3DMark build 320. It would also allow people at home to use the same seed values as reviewers for the camera tracks and compare results, and it would let reviewers use multiple camera tracks in a review to get a better overall performance picture of each scene that benchmarks a particular API feature.
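
    To make the seed idea concrete, here is a rough sketch (all the names and ranges are made up for illustration) of how a benchmark might derive a repeatable camera track from a shared seed value, so a reviewer and a reader at home run the identical path just by exchanging one number:

    ```python
    import random

    def camera_track(seed, num_frames=1000):
        """Generate a deterministic camera path from a shared seed.

        Anyone running the benchmark with the same seed gets the exact same
        sequence of camera positions, so results are directly comparable --
        but a driver can't pre-bake the path, because the reviewer can pick
        any seed at review time.
        """
        rng = random.Random(seed)  # private RNG, independent of global state
        # Pick a handful of random waypoints inside the scene bounds.
        waypoints = [(rng.uniform(-100, 100), rng.uniform(0, 50), rng.uniform(-100, 100))
                     for _ in range(8)]
        frames = []
        for i in range(num_frames):
            t = i / (num_frames - 1) * (len(waypoints) - 1)
            a = waypoints[int(t)]
            b = waypoints[min(int(t) + 1, len(waypoints) - 1)]
            f = t - int(t)
            # Simple linear interpolation between waypoints; a real benchmark
            # would use splines, but the principle is the same.
            frames.append(tuple(a[k] + (b[k] - a[k]) * f for k in range(3)))
        return frames

    # Same seed => identical track on every machine.
    assert camera_track(31337)[:3] == camera_track(31337)[:3]
    ```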

    As far as what one sees on the screen while the benchmark is running, I see nothing wrong whatsoever with the way 3DMark currently does it. I mean, simply because an image is "abstract" doesn't mean the benchmark is any more reliable. The fill rate tests in 3DMark are visually abstract. So what?...;) I see nothing wrong with the "3D game" style of visuals that 3DMark currently employs; it seems far more fitting for the subject matter.

    The important point to me is that the benchmark avoid vendor-specific shader paths and other vendor-specific code like the plague, and attempt to stay as close to the generic API standards as the authors can make it. If benchmarks started using vendor-specific, optimized paths and so on, we'd descend to the level of the game engine as opposed to the generic API-feature benchmark. The fact is that most D3D games do not use vendor-specific paths, and the 3D hardware manufacturers do not build maximally optimized routines into their drivers for the majority of D3D games that ship.
     
  2. SA

    SA
    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    100
    Likes Received:
    2
    The idea of a seed is very useful of course.

    However, I think the problem with current benchmarks is not whether particular code paths are modified. I think that is a red herring.

    I think the real problem is what benchmarks currently measure in the first place.

    I will give an example to clarify my point. Suppose there is a CPU floating-point benchmark. It performs typical floating-point workloads, such as solving large systems of linear equations. It is written in C and compiled using standard platform compilers.
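
    For illustration, a toy sketch (in Python rather than C, purely to keep it short) of the kind of straight-line floating-point kernel such a benchmark might time; a compiler has to vectorize loops like these on its own, which is exactly the limitation in the scenario below:

    ```python
    import random
    import time

    def solve(a, b):
        """Naive Gaussian elimination with partial pivoting over plain lists."""
        n = len(a)
        for col in range(n):
            # Pivot on the largest remaining entry for numerical stability.
            pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
            a[col], a[pivot] = a[pivot], a[col]
            b[col], b[pivot] = b[pivot], b[col]
            for row in range(col + 1, n):
                factor = a[row][col] / a[col][col]
                for k in range(col, n):
                    a[row][k] -= factor * a[col][k]
                b[row] -= factor * b[col]
        x = [0.0] * n
        for row in range(n - 1, -1, -1):
            x[row] = (b[row] - sum(a[row][k] * x[k] for k in range(row + 1, n))) / a[row][row]
        return x

    n = 150
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [random.random() for _ in range(n)]
    start = time.perf_counter()
    solve(a, b)
    print(f"{n}x{n} solve: {time.perf_counter() - start:.3f} s")
    ```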

    Now suppose there are three new processors. One has a new floating-point vector unit that can run 10 times faster than its predecessor, but its standard floating point is a bit slower than the others' and the vector unit requires assembler. A competing processor has no vector co-processor but has slightly faster standard floating point. A third processor can remove all latencies for a 2x performance increase, but this also requires a small amount of assembler.

    Software vendors would definitely rewrite their inner loops in assembler to take advantage of the new floating-point features on the two processors that require it.

    The question is: is the benchmark a good measure of how real-world floating-point applications are going to perform on these platforms in a year or so? Obviously not. The platform that scores the highest will likely perform the worst, since software developers will not constrain their code to plain compiled C if they can achieve major performance benefits by making a few vendor-specific optimizations.

    Benchmark writers typically do not solve this dilemma, since they are not motivated to continually keep their benchmarks optimized for vendor-specific hardware. They should be unbiased, which means they shouldn't care how well any specific hardware performs on their benchmarks. Only the hardware vendor cares about this and, in the future, application developers. As a result, benchmarks often do not reflect how hardware will actually perform, which reduces their usefulness.

    Since the benchmark is not written this way, hardware vendors are motivated to find other approaches to optimizing it for their hardware. However, because this is not an open practice, it creates an uneven playing field and the benchmark becomes useless to everyone.

    Now suppose the benchmark developer creates a new benchmark using a new paradigm. Under this paradigm, the benchmark is split into two programs: an open-source model program (so called because it models the applications that will be used), which runs on the hardware being benchmarked and can be freely modified by anyone, and a reference program, which ships as a binary executable and can be run on any machine at any time after the model program has been run. That means nothing on the benchmarked hardware can possibly affect or modify it; the reference program is not allowed to be modified. Benchmark data can be communicated between the two either on disk or over the network.

    Hardware vendors are allowed, in fact encouraged, to modify the benchmark model program to their hearts' content to optimize it for their platforms. There is one stipulation: they must make their optimizations publicly available on their website and allow others to use them freely (much like today's software development demos). In this way, the optimizations can be communicated to software developers, allowing the lessons learned from them to find their way into actual applications.

    When the first (model) program is run, it saves its output to disk. In the case of 3D graphics, the output should consist of randomly sampled pixels from the screen images (random sampling greatly minimizes the amount of data that needs to be collected and stored, while giving essentially the same results as keeping every pixel, which would be prohibitive). The performance of the model program is measured and also stored. Then the reference program is run, which executes the same workload in an idealized fashion. In the case of a 3D program, it uses a maximum-precision software renderer with essentially ideal AA for the entire image (textures and geometry), much like offline CG software. It saves precisely the same set of random pixels as the first program.
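
    A rough sketch of how that sampling step might look (the file format, sample count, and framebuffer layout are just placeholders): both the model program on the benchmarked hardware and the reference renderer draw the same seeded set of pixel coordinates and dump only those samples to disk.

    ```python
    import random
    import struct

    def sample_frame(framebuffer, width, height, seed, num_samples=2048):
        """Pick a seeded random subset of pixels and return their colors.

        Both the model program and the reference renderer use the same seed,
        so they sample exactly the same pixel coordinates, and the stored
        data stays tiny compared to the full frame. 'framebuffer' is assumed
        to be indexable as framebuffer[y][x] -> (r, g, b).
        """
        rng = random.Random(seed)
        coords = [(rng.randrange(width), rng.randrange(height)) for _ in range(num_samples)]
        return [(x, y, framebuffer[y][x]) for (x, y) in coords]

    def save_samples(path, samples):
        # Simple binary dump: x, y as ints, then R, G, B as floats.
        with open(path, "wb") as f:
            for x, y, (r, g, b) in samples:
                f.write(struct.pack("<iifff", x, y, r, g, b))
    ```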

    After the reference program is run, the benchmark compares the outputs of the two programs and measures the differences. With 3D, the sum of the squared differences between the pixels (divided by the number of pixels to normalize it) gives a measure of image quality relative to the reference. Since the reference is rendered with idealized software AA, antialiasing and anisotropic filtering are automatically included in the image quality metric: the better the AA, the closer the output is to the reference and the better the image quality score. Shaders, shader precision, subpixel precision, and all other image quality variables automatically show up in the image quality score.
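
    The comparison itself is then little more than a normalized sum of squared differences over the shared pixel samples, something like this sketch (sample lists in the same form as the sampling code above; lower is closer to the ideal reference):

    ```python
    def image_quality_error(hw_samples, ref_samples):
        """Sum of squared color differences between hardware and reference,
        divided by the number of sampled pixels to normalize it.

        0.0 means the sampled pixels match the reference exactly; larger
        values mean the hardware image deviates more (worse AA, lower shader
        precision, filtering shortcuts, and so on all show up here).
        """
        assert len(hw_samples) == len(ref_samples)
        total = 0.0
        for (x1, y1, hw_rgb), (x2, y2, ref_rgb) in zip(hw_samples, ref_samples):
            assert (x1, y1) == (x2, y2)  # must compare the same sampled pixel
            total += sum((h - r) ** 2 for h, r in zip(hw_rgb, ref_rgb))
        return total / len(hw_samples)
    ```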

    The final score would show both the average frame rate and the image quality score.

    There could be no such thing as cheating with such a benchmark. Any and all modifications to the first program are allowed, and there is no way for the benchmarked hardware or its drivers to modify the reference program. If one hardware vendor found a way to optimize the benchmark by adding clip planes or some other trick, the other hardware vendors would quickly see the published results and add the same trick to their own optimizations. Even more interesting, game developers would also see the various tricks and optimizations and add them to actual games wherever those situations arise.

    What would quickly result is a benchmark that runs optimally on all the hardware it benchmarks. What's more, since measured image quality is an important part of the final score, hardware vendors would be motivated to increase image quality as well as performance. Even more interesting, such a benchmark could become an R&D arena for finding new and interesting 3D graphics optimizations, since hardware vendors would be highly motivated to be the first to find them.

    Note that while an initial seed that creates nondeterministic code paths would make the benchmark more beneficial, by requiring optimizations that are more likely to benefit interactive applications, it is not strictly needed to create a level playing field. Even static scenes and scenes on a rail would work well enough as benchmarks; in that case, the hardware vendors would all simply optimize fully for those static or constrained cases.
     
  3. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,253
    Likes Received:
    13
    Location:
    Land of the 25% VAT
    Interesting idea indeed.

    My first thought, however, is: how useful would this actually be at reflecting common everyday performance in plain games, when all it really shows is how clever IHVs are at extracting every last bit of performance without sacrificing image quality?

    Would we be benchmarking the cards - or the driver crew? :wink:

    Having said that I can't come up with a better idea. :oops:
     
  4. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,750
    Likes Received:
    127
    Location:
    Taiwan
    I think this depends on what you want to benchmark.

    SA's suggestion aims at finding out what the hardware can do, that is, at uncovering the full potential of the hardware. Of course, existing or near-future applications are not necessarily able to exploit that potential. This is a bit like SPEC CPU; however, IHVs are not allowed to modify the source of the SPEC CPU benchmarks.
     
  5. Sharkfood

    Regular

    Joined:
    Feb 7, 2002
    Messages:
    702
    Likes Received:
    11
    Location:
    Bay Area, California
    I'm very impressed with the thought that went into the proposed idea. It takes into account a wide number of factors that might help the current situation in performance measurement.

    I'd add one additional requirement to such a proposed standard, to make the benchmark more robust:

    The binaries for such a benchmark should be distributed in link-lib form and include a mini-linker with elementary symbol remapping, much like commercial Unix kernel linkers provide. This would allow the benchmark suite to be generated with a different executable name and different internal symbols on every run. The "seed" for other factors can be rolled into this process to make application detection by video drivers harder.
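
    As a crude sketch of part of that idea (there is no real linker here; the file names and marker string are invented for illustration), a launcher could emit the benchmark under a freshly randomized executable name and swap a fixed-length identifier inside the binary on each run, so a driver can't simply key off the file name or an embedded marker:

    ```python
    import random
    import shutil
    import string

    def make_disguised_copy(src_binary, seed, marker=b"BENCHMARK_ID_00000000"):
        """Copy the benchmark executable to a randomized file name and replace
        a fixed-length placeholder string inside it with a per-run value.

        A full implementation of the suggestion would relink from a lib and
        remap internal symbols; this sketch only shows the name/marker
        randomization part.
        """
        rng = random.Random(seed)
        tag = "".join(rng.choices(string.ascii_lowercase, k=8))
        dst = f"bench_{tag}.exe"
        shutil.copyfile(src_binary, dst)

        with open(dst, "rb") as f:
            data = f.read()
        # Replace the placeholder with a same-length string so file offsets
        # elsewhere in the binary are untouched.
        new_marker = ("BENCHMARK_ID_" + tag).encode().ljust(len(marker), b"0")[:len(marker)]
        data = data.replace(marker, new_marker)
        with open(dst, "wb") as f:
            f.write(data)
        return dst
    ```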

    At the end of the day, I don't think it's possible to make a benchmark that is perfect and completely secure from IHV mischief... after all, build a better mousetrap and nature builds a better mouse. BUT... if enough steps are taken, the effort involved in getting around them may become prohibitive enough to discourage such behavior.
     
  6. WaltC

    Veteran

    Joined:
    Jul 22, 2002
    Messages:
    2,710
    Likes Received:
    8
    Location:
    BelleVue Sanatorium, Billary, NY. Patient privile
    I would prefer to find out what the hardware can likely do while running a typical game under the API, exercising the various features the hardware purports to support under the API. This presumes minimal to no vendor-specific optimization, either in the benchmark or permitted in the driver code [the benchmark could be patched regularly to prevent driver recognition]. I say this because I can't see much good in discovering maximum performance under ideal conditions in a benchmark when running the majority of games under the API will not mirror those conditions--or that performance. Hence, it seems to me such a benchmark would create false expectations of routine product performance, which would only compound the problem nVidia started with build 320 of 3DMark.
    I would think that false expectations of routine performance are something we'd like to get away from.
     
  7. bdmosky

    Newcomer

    Joined:
    Jul 31, 2002
    Messages:
    167
    Likes Received:
    22
    How so? Aren't most 3D-intensive games specifically optimized for maximum performance on popular video cards anyway? How does this not mirror real life? What game doesn't optimize for specific platforms (with the exception, perhaps, of less intensive 3D applications)?

    I think the point that SA, and even nVidia to a certain extent, are trying to make is that most games are targeted at specific graphics platforms, and any benchmark that tries to remain absolutely generic does not reflect "real life." Couple that with SA's goal of cooperative open-source optimizations and I think you can see the "good" in all of this.
     
  8. Doomtrooper

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,328
    Likes Received:
    0
    Location:
    Ontario, Canada
    No, they are optimized for the installed base; UT2003's engine is a good example.
     
  9. bdmosky

    Newcomer

    Joined:
    Jul 31, 2002
    Messages:
    167
    Likes Received:
    22
    Which would be because they were popular, no? :wink:
     
  10. Doomtrooper

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    3,328
    Likes Received:
    0
    Location:
    Ontario, Canada
    Popular meaning cheap..yep :D

    DX7-class engines are still the main target for most developers :cry:
     
  11. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    SA,

    As far as shaders go, aren't high-level shader languages in and of themselves rather similar to the benchmark model you propose? I realize this varies from your intent, but please bear with me as I draw some parallels:

    DX 9 deviates, because for GPUs, the "assembly" is the GPU "microcode". The question for DX 9 HLSL and LLSL is whether IHVs can optimize for their closest profile output from the HLSL expressed in the LLSL to their GPU "assembly". If you consider the LLSL the lowest level, I think you're removing optimization opportunities of the subtlety level you're indicating.

    It's not clear to me what general-case and intra-hardware optimizations can be exposed by one benchmark. Currently, the prominent hardware performance issues don't seem to fit your example AFAICS: the prominent hardware issues depend on running a different benchmark (i.e., not floating point) and/or limitation of utilization of factors that are already known and can be expressed in the LLSL (register usage, texture ops) by the HLSL.


    However, things could possibly be otherwise.

    If/when that happens, either: the HLSL and LLSL would have to adapt, the IHV LLSL optimization ability would have to adapt, or we have to discuss a different model.

    The closest existing functional parallel to your model is Cg for nVidia hardware in OpenGL using the nVidia extension. What is different is that the party maintaining the "benchmark" is also a competitor in the benchmark, and usage of it would depend on that changing, and also would depend on OpenGL continuing its extension model to continue exposing the optimizations.
    This seems to me to say that, for shaders, subtle distinctions like what you propose either don't exist (if LLSL is your "assembly"...since benchmarks already use assembly), or aren't likely to be universally applicable (if GPU microcode is your "assembly"). The optimizations achieved would be extension and GPU specific.

    Assuming you don't necessarily mean LLSL expression as your assembly, the best form of the benchmark you describe, to me, looks like GLslang, which goes from high level -> GPU microcode under IHV control.
    The big deviation is that IHVs won't share, and have no reason to share, their optimizations, though it would indeed be informative if they all could agree to your stipulation and learn from each other. But, I think not sharing is unavoidable when trying to fit your parallel for GPUs and shaders, because the type of subtle distinction you describe depends on unique features in hardware. If you stuck to the LLSL usage, IHVs would still have reason to have hidden optimizations in translating the LLSL, and I'm not sure what new universally applicable optimizations would be exposed by your model.

    ...

    Barring a failure to understand something you meant to convey (there should be plenty of opportunities to clarify where I'm going wrong above if that is the case), it seems to me that your example is more applicable to other things, like shadowing techniques, etc., but in that case, a new benchmark would have to be written for each new focus of exposure.

    Also, it seems to me that we sort of have that already with conferences, papers on techniques, and source code for demos, etc., with input from general developers and IHVs alike. Your model would be a formalization of this, along with a standard method of comparison, and an avenue of constant growth and adaptation, but not as general as I understood you to mean.

    Am I missing something, or misunderstanding?
     
  12. bdmosky

    Newcomer

    Joined:
    Jul 31, 2002
    Messages:
    167
    Likes Received:
    22
    Sorry Doomtrooper, I couldn't help but pull this one back up :p


    Tim Sweeney:

    We optimize our games to run on popular hardware because it pleases our customers. We partner with NVidia on their "The Way It's Meant To Be Played" marketing program because it helps achieve our mutual business and marketing goals.

    Just trying to give you a hard time. :lol:
     
  13. Himself

    Regular

    Joined:
    Sep 29, 2002
    Messages:
    381
    Likes Received:
    2
    Define popular, most sold or most gabbed about? :lol:
     
  14. g__day

    Regular

    Joined:
    Jun 22, 2002
    Messages:
    580
    Likes Received:
    2
    Location:
    Sydney Australia
    Is it hard to reverse-compile a driver and look for shader-substitution code triggered by a popular benchmark's name, or a commonly used demo's name, in the symbol table, in order to detect driver optimisations?
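
    A first pass doesn't even need a decompiler: a scan of the driver binary for printable strings that match known benchmark or game names already surfaces candidates for app detection. A quick sketch (the name list and driver path are only examples, and string matching obviously misses shader-hash style detection):

    ```python
    import re

    def find_app_detection_strings(driver_path, suspects=(b"3DMark", b"UT2003", b"quake3")):
        """Scan a driver binary for printable strings that match known
        benchmark or game names -- a cheap first hint of app-specific paths.
        Absence of such strings proves nothing, of course.
        """
        with open(driver_path, "rb") as f:
            data = f.read()
        # Pull out runs of 4+ printable ASCII characters, like the `strings` tool.
        found = []
        for s in re.findall(rb"[ -~]{4,}", data):
            if any(name.lower() in s.lower() for name in suspects):
                found.append(s.decode("ascii", "replace"))
        return found

    # e.g. find_app_detection_strings("nv4_disp.dll")
    ```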
     