To make a valid benchmark...

Discussion in 'General 3D Technology' started by Nite_Hawk, Jun 3, 2003.

  1. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    This was discussed before, and at the time I thought it was a good idea. Now I think it's a necessary one. We need a benchmark that is proof against cheating. Where it can't be made cheat-proof, we need to find a way to make cheating much more difficult (exponentially so) than fixing (if only temporarily) those cheats. Here are some of the ideas put forth in the other thread, and some of my own as well.

    - First and foremost, I believe an open-source benchmark is likely the way to go. The community has more eyes than any vendor does, and I think if we take the time to approach the benchmark in a proper manner, we can find ways to overcome the more prevalent methods of cheating. I believe having the benchmark be closed source would slow us as a community down more than it would slow any potential cheaters. Open source has the advantage that things get exposed: both errors in your code and, most likely, attempts to subvert it.

    - Rather than approaching this problem as a graphics "demo" per se, the focus should not primarily be on painting a pretty picture, but on making it a proper test. If the test must look abstract to be secure, then it will look abstract.

    - We want the test to be repeatable by multiple parties. This means that if there is going to be a random component to various parts of the test, we will need to make sure that in all cases the same seeds can be used by different people so that the test can be run the exact same way.

    - Camera paths should have a randomized component, with seed values supplied by the user (see the sketch after this list). This gets around the issue of clipping planes.

    - The issue of shaders is a difficult one. First it should be decided whether a shader which produces the exact same output as one in the benchmark is a valid optimization. If, for example, the driver can dynamically reorder the instructions, that's probably valid imho. At the least, shaders should probably be created dynamically by the benchmark based on a randomized component, again with a user-supplied seed value (or values) for test reproducibility. Now, given that we are talking about an open-source benchmark, someone could simply look at how the shaders are produced and insert a shader generator that produces faster shaders for that architecture. The only thing I can think of to get around this is to read back the output of the shader, have the program run the exact same shader on the CPU, and compare the outputs to make sure it's correct. This still doesn't get around optimizations that produce the same output. It may be necessary to allow those. If allowed, it may at the very least get vendors to talk about how to optimize a specific piece of code for their architecture.

    - Partial Precision and Full Precision both probably make sense to test, if only to highlight the difference in performance each makes on each architecture, and to note the differences in image quality.

    - A method to statistically compare the final output of the program versus a software rasterizer is probably necessary. Any significant deviations from the software rasterizer should be scrutinized. AA and AF will muck this up to a certain extent.

    - Since we are still at the design stages, we should probably think about how to convey the most information to the user. A histogram, fps/time graph, and standard deviation/errors should be presented to show how the test was performed.

    - A lot of people aren't going to understand how seeding, histograms, standard errors/deviations, etc. work. The program should automate everything as much as possible and output two files: one that can be used to run the exact same input on a different machine, and one that is loaded to show the output from a specific run. We should make it as easy as possible for people to reproduce and analyze results.
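
    For illustration, here is a minimal sketch of the seed-driven randomization idea (hypothetical names, C++, using std::mt19937_64 as the generator): the same published seed reproduces the exact same camera path on any machine, while a fresh seed defeats canned clip-plane tricks.

    Code:
    
    // Minimal sketch, hypothetical names: a user-supplied seed deterministically
    // generates the camera path, so any two parties can reproduce the exact run.
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>
    
    struct CameraKey { float x, y, z, yaw, pitch; };
    
    std::vector<CameraKey> BuildCameraPath(uint64_t seed, int keyCount)
    {
        std::mt19937_64 rng(seed);                        // deterministic for a given seed
        std::uniform_real_distribution<float> pos(-100.0f, 100.0f);
        std::uniform_real_distribution<float> ang(-180.0f, 180.0f);
    
        std::vector<CameraKey> path;
        path.reserve(keyCount);
        for (int i = 0; i < keyCount; ++i)
            path.push_back({ pos(rng), pos(rng), pos(rng), ang(rng), ang(rng) });
        return path;
    }
    
    int main()
    {
        const uint64_t userSeed = 0xB3D2003;              // published alongside the results
        std::vector<CameraKey> path = BuildCameraPath(userSeed, 256);
        std::printf("first key: %.2f %.2f %.2f\n", path[0].x, path[0].y, path[0].z);
        return 0;
    }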



    So, if anyone else has ideas and comments to add, let's get this ball rolling. I know Humus and others came up with a lot of these ideas, and I think we can improve on them further. I want to see this actually happen.

    Thanks,
    Nite_Hawk

    Edit: I don't want this thread to focus on any specific vendor, and thus have removed a reference to one.
     
  2. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    I'm all in favor of such a benchmark, but I've got two things to note:

    I believe that to make IQ tests as precise as possible, and to avoid flagging differences that are actually improvements, the following precision would be required for the reference:
    Vertex Shader: FP64 ( = Double )
    Pixel Shader: FP64 ( = Double )
    Sub-pixel accuracy: FX32 ( = long )

    EDIT: Additional note: I believe that 99.9% of shaders should not show any difference running at FP32 or FP64, and that many should show little or no difference between FP16 and FP32.

    Oh, sure, it would kill performance. But if we want to be serious about this, I believe that's required. I don't believe we'll get such precision in the near future, or at least not at viable speeds. Thus, this makes sure the GPU's picture can only look WORSE, not better.
    A simple error metric, such as the traditional one based on the square of the error, would therefore be a good idea.

    Obviously, you'd have to compare on an 8-bit output (using FP16 or FP32 framebuffers is alright, but you shouldn't base the errors on that) - *always*. Current screens are 8-bit per channel (x4 = 32-bit, but I'm too lazy to spell all that out), thus testing IQ at higher precision is ridiculously illogical (although in the future, 10-bit might make sense).
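
    To make that concrete, here is a minimal sketch (hypothetical buffer layout, assuming an 8-bit RGBA readback from the card) of the squared-error comparison against a higher-precision reference, with the reference quantized to 8 bits per channel first:

    Code:
    
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    
    // Sum of squared differences between the card's 8-bit framebuffer readback and a
    // high-precision reference image, after quantizing the reference to 8 bits.
    double SquaredError8Bit(const uint8_t* cardRGBA, const float* refRGBA, size_t pixelCount)
    {
        double sum = 0.0;
        for (size_t i = 0; i < pixelCount * 4; ++i)
        {
            // Quantize the FP reference the way a display would: clamp to [0,1], scale, round.
            float r = refRGBA[i];
            if (r < 0.0f) r = 0.0f;
            if (r > 1.0f) r = 1.0f;
            int ref8 = static_cast<int>(r * 255.0f + 0.5f);
    
            int diff = static_cast<int>(cardRGBA[i]) - ref8;
            sum += static_cast<double>(diff) * diff;   // square of the error, summed
        }
        return sum;   // 0 means pixel-identical at display precision
    }
    
    int main()
    {
        const uint8_t card[4] = { 255, 128, 0, 255 };   // one RGBA pixel from the card
        const float   ref[4]  = { 1.0f, 0.5f, 0.0f, 1.0f };
        std::printf("squared error: %.0f\n", SquaredError8Bit(card, ref, 1));
        return 0;
    }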

    For AA/AF, you could use a reference rasterizer with godly amounts of AA, and consider the jaggies as errors. Sadly, it might get annoying for AF...

    Secondly, about shaders.
    I believe the way to go is to let them do whatever they want. We shouldn't care how the result is achieved - as long as it's public knowledge.

    Heck, if they're using FX12 for everything, they're shooting themselves in the foot due to the IQ tests. If they figure out a way to do the same thing faster, even as a global optimization, then if they've got to release it publicly, other vendors could base their own paths on what they did and get similar speed boosts.

    We should police driver-side optimizations as much as possible, IMO. But letting the IHVs do their own, *public* path should be alright. All IMO, of course.


    Uttar
     
  3. Bjorn

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,775
    Likes Received:
    1
    Location:
    Luleå, Sweden
    It would also be VERY interesting to see what they did if they could do whatever they wanted.

    My opinion with regard to image quality is that I don't think you can measure it in a good way. Just use screenshots and/or perhaps videos and let the user decide what he/she finds acceptable.
     
  4. Nite_Hawk

    Veteran

    Joined:
    Feb 11, 2002
    Messages:
    1,202
    Likes Received:
    35
    Location:
    Minneapolis, MN
    Uttar:

    If 99.9% of the shaders in use showed no difference between FP64 and FP32, would it really be necessary to use FP64 for the reference test? I'm thinking that at this point we are probably not going to see FP64 hardware for quite some time (feel free to correct me if I'm wrong). I'm not sure about the FP16 versus FP32 quality issue. On one hand, you could make it so that nothing looks different, or you could probably make it so that everything looks better with FP32. In either case, a rather significant message is being sent to the end user, and neither case seems terribly representative of reality. I'd like to hear from some game developers on this and see what their opinions are. Is targeting FP16 more attractive than FP24/32?

    As you stated, the AA/AF issue is annoying. We probably could apply ungodly amounts of AA using a well-respected method and measure the difference. It's going to muck with the tests, though. We'd probably have to allow for a greater margin of error, and a company could try to use a poor AA implementation with a blur filter to hide differences in the visual output of the test. We'll need to take this into account in any analysis software.

    I like the idea of allowing vendors to submit rendering modules to the benchmark if they can prove that the output is the same as the reference code. While two cards may not be doing the *exact* same thing, if one can arrive at the same result faster with another method, that does show the architecture has potential for speed. Open-source vendor modules would also provide optimized code for would-be developers to see, and I think the benefits would probably outweigh the negatives. Certainly it's better than the current situation. I'd still prefer to only see global optimizations that affect *all* games, but I think completely open code submissions by the vendor would be a halfway decent compromise. You could at least do code audits before certifying the module.

    Nite_Hawk
     
  5. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    Different aspects of image quality can be easily quantified and measured (most easily through synthetic programs):
    1. Edge anti-aliasing quality
    2. Texture clarity

    And this one can be a bit harder to measure:
    3. Texture aliasing

    In-game screenshots, on the other hand, can be very misleading. The best example is texture clarity vs. texture aliasing: in a screenshot, it is almost invariably much better-looking to force more aggressive LOD. But once a person is actually playing that game, the visual quality will suddenly go downhill, as texture aliasing becomes apparent in motion.

    It is also typical of those taking the screenshots to primarily look at horizontal and vertical surfaces (frequently, it's just easier to do so), which obviously favors ATI hardware.

    Here's what I propose:
    1. Analyse the various aspects of 3D image quality separately.
    2. Explain to the users under which cases they'll notice the deficiencies of each type of image quality.
    3. Allow the user to decide which aspect of image quality is most important to him/her, and from that, which card therefore looks better.
     
  6. SA

    SA
    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    100
    Likes Received:
    2
    I think that for a benchmark to be really useful it needs to include image quality results as well as performance results in the same benchmark.

    The benchmark should be built to require a specific precision or a particular set of shader operations only where it makes a difference in the results.

    The benchmark should have a randomization seed to initialize it. The seed could be published with the results so that results can be compared and confirmed by others.

    The benchmark should randomly (based on the seed) sample portions of the screen and save the samples for later image quality comparison.

    To measure the image quality, the same scenes should be sampled at the same random positions using a software reference driver on a second pass. The sum of the squares of the differences would be reported as the image quality result. This eliminates the subjective nature of current image quality comparisons and requires that a driver/hardware combination produce high quality image results as well as high performance, in a measurable fashion. This measured image quality would of course include the effect of anti-aliasing and anisotropic filtering. This means the reference driver should create its images using very high quality software AA techniques to generate near-ideal images as a baseline.
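
    A minimal sketch of how that could look (hypothetical names; assumes 8-bit RGBA images and a published 64-bit seed): the seed picks the sample positions, and the sum of squared differences against the reference image at those positions is the reported quality number.

    Code:
    
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>
    
    struct Sample { int x, y; };
    
    // The published seed deterministically selects which screen positions get compared.
    std::vector<Sample> PickSamplePositions(uint64_t seed, int count, int width, int height)
    {
        std::mt19937_64 rng(seed);
        std::uniform_int_distribution<int> dx(0, width - 1), dy(0, height - 1);
        std::vector<Sample> s(count);
        for (Sample& p : s) p = { dx(rng), dy(rng) };
        return s;
    }
    
    // Both images are 8-bit RGBA, width*height*4 bytes; a lower score is closer to the reference.
    double ImageQualityScore(const std::vector<Sample>& samples,
                             const uint8_t* hw, const uint8_t* ref, int width)
    {
        double sum = 0.0;
        for (const Sample& p : samples)
            for (int c = 0; c < 4; ++c)
            {
                int idx  = (p.y * width + p.x) * 4 + c;
                int diff = int(hw[idx]) - int(ref[idx]);
                sum += double(diff) * diff;
            }
        return sum;
    }
    
    int main()
    {
        const int W = 64, H = 64;
        std::vector<uint8_t> hw(W * H * 4, 200), ref(W * H * 4, 200);   // identical dummy images
        auto samples = PickSamplePositions(0xB3D2003, 128, W, H);
        std::printf("score: %.0f\n", ImageQualityScore(samples, hw.data(), ref.data(), W));
        return 0;
    }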

    Software (driver) optimizations or hardware optimizations should be allowed. If it improves performance and yields the same image quality, it shouldn't matter whether the improvement was made in the driver or in the hardware. However, it's important to build the benchmark so that it already uses the optimal methods and precisions to achieve its output, and to ensure its code paths are not static or repeatable. Any improvements that achieve the same or better image quality would therefore be assumed to be allowable.

    What should not be allowed is any changes to the code paths in the execution of the benchmark via the reference driver, since it forms the baseline for comparison.
     
  7. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    Nite_Hawk, the software rasterizer referencing has potential hurdles, related to the IEEE thread and the discussion therein.

    Here are some comments concerning image quality comparison.

    Here is some commentary concerning shader testing, and why I somewhat disagree with your seed idea.
    In short, I'm approaching IHV cheating from the angle of removing the places where cheats can be hidden, rather than simply sacrificing repeatability. Demo files could gain reputations and evaluations associated with names ("Beyond 3D's demo files 1 through 5", for example), rather than arbitrary numbers.

    An example of why that could be important: IHVs could check which seeds allow cheats to succeed - some seeds could easily leave cheating opportunities hidden - and then have websites use those seeds. The phrase "random seeds" would lend false legitimacy: the randomness could be defeated by IHVs simply selecting which seeds to promote, removing the virtue of randomness even though the phrase "random seed" could still be used. Human determination would make that more difficult, with the right humans.

    Something better still would be a "randomizer" function that picked its own seeds each run and applied them to specified demo files that a human being picked with the criteria I specified in the above link. That combines the advantages of both, I think. The seeds would then be truly random and could be output, but used purely for reference image generation (EDIT: not any fps figures). This does leave a margin for error, but it shouldn't be significant... that unpredictability is representative of gaming, and the demo file usage still maintains a standard of repeatability.
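
    A minimal sketch of that randomizer, with hypothetical names (the demo file name and the two commented-out calls are placeholders): the run picks its own seed, reports it, and the seed feeds only the reference-image step, never the fps measurement.

    Code:
    
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <string>
    
    int main()
    {
        const std::string demoFile = "b3d_demo_01.dem";   // curated demo, known by name
    
        std::random_device rd;                            // seed chosen by the run itself
        uint64_t runSeed = (uint64_t(rd()) << 32) | rd();
    
        std::printf("demo: %s\n", demoFile.c_str());
        std::printf("seed used for reference-image generation: %016llx\n",
                    (unsigned long long)runSeed);
    
        // RunDemo(demoFile) would produce the fps numbers (unaffected by the seed);
        // GenerateReferenceImages(demoFile, runSeed) would pick the frames to verify.
        return 0;
    }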

    ...

    Anyways,

    An old comparison of mine that I stumbled upon while searching, though the discussion of plain water in the analogy might not be clear... the idea was that testing the ability of the coffee machine to heat water, and what it did to the water, would be useful before moving on to trying to taste the coffee. The taste would depend on the coffee too.

    Continuing the analogy, we already have ample "water heating" tests, and such tests are a subset of what 3dmark 03 offered (the fillrate test is a "water heating" test, really), but the next step is to allow careful control of the coffee to get more directly useful results, even though there are many different types of coffee. 3dmark 03, in my view, is a collection of tests where several custom coffees, carefully selected for applicability and good representation of many coffees, were used to test the coffee machine, sometimes with lots of coffee, sometimes with a low coffee concentration mixed in, to expose strengths and weaknesses.

    What benchmarks other than purely "water heating" benchmarks need to do is prevent the coffee machine makers from tampering with coffee selection and dictating the tests. Here is some discussion on why I think "Open Source" is not quite as simple a solution to this as you propose, though the question of 3dmark's evolution is far more in doubt right now. So the virtue of doing the extra legwork is more important, but I think my concerns are still significant.

    Note that many solutions are independent of it being Open Source, and being able to defeat those solutions, rather than optimize for them, would be made easier by that decision. This introduces a new set of challenges, and I think it is important to recognize that.
     
  8. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    No, it doesn't eliminate the subjective nature of current image quality comparisons. It does the exact same thing as most current image quality comparisons: the situation chosen will bias the image quality comparison in one direction or another.

    One cannot realistically quantify how the various aspects of image quality come together for an overall look for the average user. The user has to decide for himself/herself.
     
  9. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    Chalnoth,
    We're not talking about AA and AF, though, we're talking about shader output. That is more readily achievable as long as there is a standardized and wholly applicable basic reference available - which there is, since shaders are specified in far more detail than AF and AA. The problem with fp16/fp24/fp32 becomes a lot more manageable in that context, though the IEEE discussion points out some hurdles that would still hinder a purely mathematical comparison. But they are much simpler, directly mathematical hurdles (as opposed to implementation decisions, like AA and AF, that vary drastically between pixels), and are more likely to be surmountable by that means.

    Hmm, going back and reading SA's post, I think I'm trying to say similar things.
     
  10. SA

    SA
    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    100
    Likes Received:
    2
    Concerning not being able to quantify all the aspects that might satisfy a user: the same is true of all aspects of benchmarking, including average frame rates. What image comparison provides is a simple objective measurement for discussing image quality, in the same manner that average frame rates provide a simple objective measurement for discussing performance. Average frame rates have many faults and do not adequately address many users' concerns about performance, such as how slow things get, how long they stay slow, etc. Even providing the variance in the frame rate or frame rate histories may not necessarily help.

    However, the average frame rate is a simple objective measurement and it is much better than just subjective opinions about performance. Likewise, a sum of squared differences comparison of images against a reference image has traditionally been used by many in graphics as a simple objective measurement of image quality.
     
  11. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    My personal view,

    Complex game-like benchmarks have little value other than measuring how fast a combination of hardware runs that specific benchmark.

    I wonder if a small set of tests, with known well defined results, where an IHV provides the "best case" code (available for public inspection) to produce those results, wouldn't be a more useful test.

    The other problem I see is that any benchmark that becomes popular not only tries to measure current-generation hardware, but its architecture to some extent dictates how hardware evolves, and I'm not sure that's a good thing.
     
  12. SA

    SA
    Newcomer

    Joined:
    Feb 9, 2002
    Messages:
    100
    Likes Received:
    2
    I think the problem with benchmarking revolves around what is being measured and why.

    If someone wants to know how well some hardware performs with current applications, the best thing is to measure it using those applications. In fact, the best thing is to buy the hardware, try it with those applications, and if you don't like it, take it back.

    If someone wants to know how well some hardware will perform when applications are written that use new aspects of the hardware that are not used today, you need some sort of application that uses those new aspects. Otherwise you'll never know. Of course, you could wait a couple of years before buying the hardware, when applications that use the new features become generally available, but then you'll always be buying two-year-old hardware.

    This is the benefit of a hardware benchmark. A benchmark can be upgraded much more quickly than a full-blown sophisticated application. It also does not require a minimum hardware install base. As a result, it can keep up with the latest hardware advancements.

    The problem is that it may not be representative of how developers will write their future applications on the various vendors' hardware.

    This is why vendors need to be able to optimize a benchmark for their hardware. Since it is in their interest to make their hardware shine as much as possible, each hardware vendor will ensure that the benchmark uses their hardware optimally.

    The important thing therefore is to construct a benchmark in a way that allows hardware vendors to do this while providing meaningful benchmark comparisons.

    What I mentioned above relies on the following paradigm. You hand each of the hardware vendors a program that produces a widely variable set of possible outputs based on some input value. In addition, you also supply a program that produces corresponding reference outputs. You then allow the hardware vendors to optimize the first program however they wish, with the caveat that their output will be compared to the reference using some objective measurement. Their final rating will be a combination of the measured performance and the measured quality of their output compared to the reference.

    Optimizations to the benchmark that are found to apply generally across most hardware without sacrificing quality could be added to the benchmark on the next update.

    This paradigm would not only provide a means of benchmarking hardware in a somewhat equitable fashion, but, if the various vendor optimizations were made public, it would also give future application programmers insight into how best to take advantage of the new hardware features of each specific vendor's hardware.
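
    As a rough illustration only (the formula below is an assumption on my part, not something specified in this thread), a final rating could fold the measured quality error into the measured performance like this:

    Code:
    
    #include <cstdio>
    
    // Higher is better. qualityError is the sum-of-squared-differences versus the
    // reference output; errorScale sets how quickly quality losses eat into the score.
    double FinalRating(double avgFps, double qualityError, double errorScale)
    {
        double qualityFactor = 1.0 / (1.0 + qualityError / errorScale);
        return avgFps * qualityFactor;
    }
    
    int main()
    {
        std::printf("clean run : %.1f\n", FinalRating(120.0,   0.0, 1e6));
        std::printf("cheat run : %.1f\n", FinalRating(160.0, 5.0e6, 1e6));   // faster but wrong
        return 0;
    }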
     
  13. demalion

    Veteran

    Joined:
    Feb 7, 2002
    Messages:
    2,024
    Likes Received:
    1
    Location:
    CT
    ERP,
    I see merit in the gist of that, but: the guidelines need to be established by an independent body, and the output needs to be comparable to be useful.

    The first is already done. It is even already done to a degree that will allow IHV advantages to be exposed in a standard way, in DX HLSL, and it looks like it will be done again for glslang.

    The second is the key issue that requires work, focus, and vigilance.

    As for hardware evolution: with standardized expression in a language for pixel and vertex processing, the metric is speed of execution of the tokens in that language while maintaining output quality. This is a parallel to CPU evolution. Things like AF and AA are separate concerns that are not easily represented in this way, and likely need to be addressed distinctly.

    Anyways, all popular benchmarks have the influence you propose, but a shader benchmark has more generally applicable repeatability and extrapolation potential than any other situation.
    The only solution apparent to me for popular benchmarks is to have more than one popular benchmark, and we do. However, some dedicated benchmarks can be better than others for being more popular/used as a reference, because they offer more opportunities to expose weaknesses and strengths in and of themselves.
    Vulpine seems a poor benchmark (or should I say "benchmark suite"?) to me in this regard, same with CodeCreatures. Glexcess seems better in theory, and closer to 3dmark. Rightmark also seems like it will be evolving in this direction, once the Cg issues are sorted out.
     
  14. micron

    micron Diamond Viper 550
    Veteran

    Joined:
    Feb 23, 2003
    Messages:
    1,189
    Likes Received:
    12
    Location:
    U.S.
    Reading through these posts, I'm seeing a lot of the complexities involved in actually making a valid benchmark, which is the topic of this thread. Does it really have to be as hard as you're all making it seem?
     
  15. gkar1

    Regular

    Joined:
    Jul 20, 2002
    Messages:
    614
    Likes Received:
    7
    Code:
    
    //---------------------------------
    bool nVidiaDriverOptimizations()
    //---------------------------------
    {
        if ( bUsingBF1942 = 1 )
        {
            OutputQuality = rofl;
            ClipPlane = 0.1;
        }
        else if ( bUsingUnreal2 = 1 )
        {
            OutputQuality = kekekeke;
            ClipPlane = 0.002;
        }
        else
        {
            OutputQuality = LOL;
            ClipPlane = 0.34;
        }
    }
    
    Sorry, I couldn't help myself.
     
  16. Colourless

    Colourless Monochrome wench
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,274
    Likes Received:
    30
    Location:
    Somewhere in outback South Australia
    No wonder nvidia's drivers have so many problems. The comparisons are broken: they are using the assignment operator, not the equality operator. And as such, they always think they are running Battlefield 1942. :)
     
  17. Himself

    Regular

    Joined:
    Sep 29, 2002
    Messages:
    381
    Likes Received:
    2
    Does one benchmark have to be all things for all people? Instead of trying to create the ultimate all in one test, what's wrong with just testing one specific thing and doing it well?
     
  18. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    It still applies, only it's less meaningful.

    That is, that FP32 is more precise than FP24 is obvious (and so on and so on). It is easy to create a benchmark to show these precision differences.

    It is not easy to show how these precision differences will relate to real games.
     
  19. Dio

    Dio
    Veteran

    Joined:
    Jul 1, 2002
    Messages:
    1,758
    Likes Received:
    8
    Location:
    UK
    It's not generally useful as a performance metric. Most of the people here are focused on the web media, but although those in-depth benchmarks that run 40 different tests are all very well, print media doesn't have the time or page space to run thousands of tests. There, the desire is for a single number that can compare one piece of hardware to another - originally just the Quake2 score.

    As a result, for a benchmark to gain wide acceptance it needs to test a wide range of things and boil them down to one number. 3dmark and GameGauge are the two main attempts at doing this so far.
     
  20. Babel-17

    Veteran Regular

    Joined:
    Apr 24, 2002
    Messages:
    1,004
    Likes Received:
    245