Old 03-Jun-2003, 18:32   #1
Nite_Hawk
Senior Member
 
Join Date: Feb 2002
Location: Minneapolis, MN
Posts: 1,202
Default To make a valid benchmark...

This was discussed before, and at the time I thought it was a good idea. Now I think it's a necessary one. We need a benchmark that is proof against cheating. Where it cannot be made proof against cheating, we need to find a way to make cheating much (exponentially) more difficult than fixing those cheats, even if the fixes are only temporary. Here are some of the ideas put forth in the other thread, and some of my own as well.

- First and foremost, I believe an open-source benchmark is likely the way to go. The community has more eyes than any vendor does, and I think if we take the time to approach the benchmark in a proper manner, we can find ways to overcome the more prevalent methods of cheating. I believe having the benchmark be closed source will slow us down as a community more than it will slow any potential cheaters. Open source has the advantage that things get exposed: both errors in your own code and, most likely, attempts to subvert it.

- Rather than approaching this as a graphics "demo" per se, the focus should be on building a test, not on painting a pretty picture. If the test must look abstract to be secure, then it will look abstract.

- We want the test to be repeatable by multiple parties. This means that if there is going to be a random component to various parts of the test, we will need to make sure that in all cases the same seeds can be used by different people so that the test can be run the exact same way.

- Camera paths should have a randomized component with seed values inserted by the user. This gets around the issue of clipping planes.

- The issue of shaders is a difficult one. First it should be decided whether a shader which produces the exact same output as one in the benchmark is a valid optimization. If, for example, the driver can dynamically reorder the instructions, it's probably valid imho. At the least, shaders should probably be created dynamically by the benchmark based on a randomized component, again with a user-supplied seed value (or values) for test reproducibility. Now, given that we are talking about an open-source benchmark, someone could simply look at how the shaders are produced and insert a shader generator that produces faster shaders for that architecture. The only thing I can think of to get around this is to request back the output of the shader, have the program run the exact same shader on the CPU, and compare the outputs to make sure they are correct. This still doesn't get around optimizations that produce the same output. It may be necessary to allow these. If allowed, it may at the very least get vendors to talk about how to optimize a specific piece of code for their architecture.

- Partial Precision and Full Precision both probably make sense to test, if only to highlight the difference in performance each makes on each architecture, and to note the differences in image quality.

- A method to statistically compare the final output of the program versus a software rasterizer is probably necessary. Any significant deviations from the software rasterizer should be scrutinized. AA and AF will muck this up to a certain extent.

- Since we are still at the design stages, we should probably think about how to convey the most information to the user. A histogram, fps/time graph, and standard deviation/errors should be presented to show how the test was performed.

- A lot of people aren't going to understand how seeding, histograms, standard errors/deviations, etc. work. The program should automate everything as much as possible, and output two files: one that can be used to run the test with the exact same input on a different machine, and one that can be loaded to show the output from a specific run. We should make it as easy as possible for people to reproduce and analyze results.
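
To make the seed and input-file ideas above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `camera_path`, `write_input_file` and `read_input_file`, the JSON layout, and the way the path is perturbed are all hypothetical, not something proposed in the thread.

```python
import json
import math
import random


def camera_path(seed, num_keyframes=8, radius=10.0):
    """Return a list of (x, y, z) camera positions derived only from `seed`."""
    rng = random.Random(seed)  # private generator, so global state cannot interfere
    path = []
    for i in range(num_keyframes):
        # Walk around a circle, but jitter angle, height and distance per keyframe.
        angle = 2.0 * math.pi * i / num_keyframes + rng.uniform(-0.3, 0.3)
        height = rng.uniform(1.0, 5.0)
        dist = radius * rng.uniform(0.8, 1.2)
        path.append((dist * math.cos(angle), height, dist * math.sin(angle)))
    return path


def write_input_file(filename, seed, num_keyframes):
    """Save everything another machine needs to repeat the run (hypothetical format)."""
    with open(filename, "w") as f:
        json.dump({"seed": seed, "num_keyframes": num_keyframes}, f)


def read_input_file(filename):
    """Rebuild the exact same camera path from a saved input file."""
    with open(filename) as f:
        cfg = json.load(f)
    return camera_path(cfg["seed"], cfg["num_keyframes"])
```

Two people who exchange the input file get the same path, while a driver cannot know in advance which seed a reviewer will pick. A real benchmark would probably ship its own generator rather than rely on one language's library, so that the sequence is identical on every platform.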



So, if anyone else has ideas and comments to add, let's get this ball moving. I know Humus and others came up with a lot of these ideas, and I think we can improve on them more. I want to see this actually happen.

Thanks,
Nite_Hawk

Edit: I don't want this thread to focus on any specific vendor, and thus have removed a reference to one.
Old 03-Jun-2003, 19:00   #2
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,882
Default

I'm all in favor of such a benchmark, but I've got two things to note:

Quote:
A method to statistically compare the final output of the program versus a software rasterizer is probably necessary. Any significant deviations from the software rasterizer should be scrutinized. AA and AF will muck this up to a certain extent.
I believe that, to make IQ tests as precise as possible and to avoid flagging differences that are actually advantages, the reference would need the following precision:
Vertex Shader: FP64 ( = Double )
Pixel Shader: FP64 ( = Double )
Sub-pixel accuracy: FX32 ( = long )

EDIT: Additional note: I believe that 99.9% of shaders should not show difference running FP32 or FP64, and that many should not show any or much difference between FP16 or FP32.

Oh, sure, it would kill performance. But if we want to be serious about this, I believe that's required. I don't believe we'll get such precision in the near future, or at least not at viable speeds. Thus, this makes sure the GPU's picture can only look WORSE, not better.
Thus, a simple error algorithm, such as the traditional one which says that you use the square of the error, would be a good idea.

Obviously, you'd have to compare on an 8-bit output ( although using FP16 or FP32 framebuffers is alright, but you shouldn't base errors on that ) - *always*. Current screens are 8-bit ( x4 = 32-bit, but I'm too lazy to mention all that ) , thus testing IQ for higher precision is ridiculously illogical ( although in the future, 10-bit might make sense )
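
A rough sketch of that squared-error idea, measured on the 8-bit output rather than on the internal precision. The function names and the flat list-of-channel-values image layout are assumptions made up for this example.

```python
def to_8bit(value):
    """Clamp a float colour channel to [0, 1] and quantize it to 0..255."""
    clamped = min(max(value, 0.0), 1.0)
    return int(round(clamped * 255.0))


def squared_error(reference, candidate):
    """Sum of squared per-channel differences after 8-bit quantization."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of channel values")
    return sum((to_8bit(r) - to_8bit(c)) ** 2 for r, c in zip(reference, candidate))
```

Because both images are quantized first, differences too small to change the displayed 8-bit value contribute nothing, which matches the point that errors should not be judged on a higher-precision framebuffer.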

For AA/AF, you could use a reference rasterizer with godly amounts of AA, and consider the jaggies as errors. Sadly, it might get annoying for AF...

Secondly, about shaders.
I believe the way to go is let them do whatever they want. We shouldn't care how the result is achieved - as long as it's public knowledge.

Heck, if they're using FX12 everything, they're footing themselves in the shoot due to the IQ tests. If they figure out a way to do the same thing faster, even in a global optimization, then if they've got to release it publicly, it means other vendors could base their own paths on what other vendors did, and get similar speed boosts.

We should persecute driver optimizations as much as possible, IMO. But letting the IHVs do their own, *public* path, should be alright. All IMO, of course.


Uttar
Old 03-Jun-2003, 20:22   #3
Bjorn
Senior Member
 
Join Date: Feb 2002
Location: Luleå, Sweden
Posts: 1,775
Default

Quote:
But letting the IHVs do their own, *public* path, should be alright.
It would also be VERY interesting to see what they did if they could do whatever they wanted.

My opinion with regards to image quality is that i don't think you can measure it in a good way. Just use screenshots and/or perhaps videos and let the user decide what he/she finds acceptable.
__________________
"Yeah, well, i'm gonna build my own theme park, with Black Jack, and hookers. In fact, forget the park"

//Bender - Futurama - episode 2
Old 03-Jun-2003, 20:32   #4
Nite_Hawk
Senior Member
 
Join Date: Feb 2002
Location: Minneapolis, MN
Posts: 1,202
Default

Uttar:

If 99.9% of the shaders in use showed no difference between FP64 and FP32, would it really be necessary to use FP64 for the reference test? I'm thinking that at this point we are probably not going to see FP64 for quite some time (feel free to correct me if I'm wrong). I'm not sure about the FP16 versus FP32 quality issue. On one hand, you could make it so that nothing looks different, or you could probably make it so that everything looks better with FP32. In either case, a rather significant message is being sent to the end user. Neither case seems to be terribly representative of reality either. I'd like to hear from some game developers on this and see what their opinions are. Is targeting FP16 more attractive than FP24/32?

As you stated, the AA/AF issue is annoying. We probably could apply ungodly amounts of AA using a well-respected method, and measure the difference. It's going to muck with the tests though. We'd probably have to allow for a greater margin of error, and a company could try to use a poor AA implementation with a blur filter to hide differences in the visual output of the test. We'll need to take this into account for any analysis software.

I like the idea of allowing vendors to submit rendering modules to the benchmark if they can prove that the output is the same as the reference code. While the two cards may not be doing the *exact* same thing, if one can arrive at the same result faster with another method, it does show that the architecture has potential for speed. Open-source vendor modules would also provide optimized code for would-be developers to see, and I think the benefits probably would outweigh the negatives. Certainly it's better than the current situation. I'd still prefer to only see global optimizations that affect *all* games, but I think completely open code submissions by the vendor would be a halfway decent compromise. You could at least do code audits before certifying the module.

Nite_Hawk
Old 04-Jun-2003, 00:34   #5
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,679
Default

Quote:
Originally Posted by Bjorn
It would also be VERY interesting to see what they did if they could do whatever they wanted.

My opinion with regards to image quality is that i don't think you can measure it in a good way. Just use screenshots and/or perhaps videos and let the user decide what he/she finds acceptable.
Different aspects of image quality can be easily quantified and measured (most easily through synthetic programs):
1. Edge anti-aliasing quality
2. Texture clarity

And this one can be a bit harder to measure:
3. Texture aliasing

In-game screenshots, on the other hand, can be very misleading. The best example is texture clarity vs. texture aliasing: in a screenshot, it is almost invariably much better-looking to force more aggressive LOD. But once a person is actually playing that game, the visual quality will suddenly go downhill, as texture aliasing becomes apparent in motion.

It is also typical of those taking the screenshots to primarily look at horizontal and vertical surfaces (frequently, it's just easier to do so), which obviously favors ATI hardware.

Here's what I propose:
1. Analyse the various aspects of 3D image quality separately.
2. Explain to the users under which cases they'll notice the deficiencies of each type of image quality.
3. Allow the user to make the decision as to which piece of image quality is most important to him/her, and make a decision on which one therefore looks better from that.
Old 04-Jun-2003, 00:36   #6
SA
Member
 
Join Date: Feb 2002
Posts: 100
Default

I think that for a benchmark to be really useful it needs to include image quality results as well as performance results in the same benchmark.

The benchmark should be built to only use a specific precision or a particular set of shader operations where it makes a difference in the results.

The benchmark should have a randomization seed to initialize it. The seed could be published with the results so results can be compared and confirmed by others.

The benchmark should randomly (based on the seed) sample portions of the screen and save the samples for later image quality comparison.

To measure the image quality, the same scenes should be sampled at the same random positions using a software reference driver on a second pass. The sum of the squares of the differences would be reported as the image quality result. This eliminates the subjective nature of current image quality comparisons and requires that a driver/hardware combination produce high quality image results as well as high performance in a measurable fashion. This measured image quality would of course include the effect of anti-aliasing and anisotropic filtering. This means the reference driver should create its images using very high quality software AA techniques to generate near-ideal images as a baseline.
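
As a sketch of how the seeded sampling and the sum of squared differences might fit together (the function names and the rows-of-8-bit-grey-values image layout are hypothetical, chosen only to keep the example short):

```python
import random


def sample_positions(seed, width, height, count):
    """Pick `count` pixel positions that depend only on the published seed."""
    rng = random.Random(seed)
    return [(rng.randrange(width), rng.randrange(height)) for _ in range(count)]


def sampled_ssd(hardware_image, reference_image, positions):
    """Sum of squared differences at the sampled positions only.

    Both images are lists of rows of 8-bit grey values.
    """
    total = 0
    for x, y in positions:
        diff = hardware_image[y][x] - reference_image[y][x]
        total += diff * diff
    return total
```

The first pass would store the hardware samples, and the second pass would regenerate the same positions from the seed and read them from the reference render. A lower result means the hardware output is closer to the reference.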

Software (driver) optimizations or hardware optimizations should be allowed. If it improves the performance and yields the same image quality, it shouldn't matter if improvements were made in the driver or in the hardware. However, it's important to build the benchmark so it already uses the optimal methods and precisions to achieve its output, and it should ensure its code paths are not static or repeatable. Any improvements that achieve the same or better image quality would therefore be assumed to be allowable.

What should not be allowed is any changes to the code paths in the execution of the benchmark via the reference driver, since it forms the baseline for comparison.
Old 04-Jun-2003, 00:53   #7
demalion
Senior Member
 
Join Date: Feb 2002
Location: CT
Posts: 2,024
Default

Nite_Hawk, the software rasterizer referencing has potential hurdles, related to the IEEE thread and the discussion therein.

Here are some comments concerning image quality comparison.

Here is some commentary concerning shader testing, and why I somewhat disagree with your seed idea.
In short, I'm approaching the IHV cheating from the angle of removing the places where cheats can be hidden, rather than simply sacrificing repeatability. Demo files could gain reputations and evaluation associated with names ("Beyond 3D's demo files 1 through 5", for example), rather than arbitrary numbers.

An example of why that could be important: IHVs could check seeds that allow cheats to succeed, because some seeds could easily allow cheating opportunities to remain hidden, and then IHVs could then have websites use those seeds. The number and phrase "random seeds" would lend false legitimacy, because the legitimacy would be random and easily defeatable by IHVs by simply selecting seeds to promote, and removing the virtue of randomness, though the phrase "random seed" could still be used. Human determination would make that more difficult, with the right humans.

Something better still would be a "randomizer" function that picked its own seeds each run and applied them to specified demo files that a human being picked with the criteria I specified in the above link. That combines the advantages of both, I think. The seeds would then be truly random, and could then be output, but only used purely for reference image generation (EDIT: not any fps figures). This does leave a margin for error, but it shouldn't be significant...that unpredictability is representative of gaming, and the demo file usage still maintains a standard of repeatability.

...

Anyways,

An old comparison from me I stumbled upon while searching, though the discussion of plain water in the analogy might not be clear...the idea was testing the ability of the coffee machine to heat water, and what it did to the water, would be useful as well before moving on to trying to taste the coffee. The taste would depend on the coffee too.

Continuing the analogy, we already have ample "water heating" tests, and such tests are a subset of what 3dmark 03 offered (the fillrate test is a "water heating" test really), but the next step is to allow careful control of the coffee to get more directly useful results, even though there are many different types of coffee. 3dmark 03, in my view, is a collection of tests where several custom coffees, carefully selected for applicability and good representation of many coffees, were used to test the coffee machine, sometimes with lots of coffee, sometimes with low coffee concentration mixed in, to expose strengths and weaknesses.

What benchmarks other than purely "water heating" benchmarks need to do is prevent coffeemakers from tampering with coffee selection and dictating tests. Here is some discussion on why I think that "Open Source" is not quite as simple a solution to this as you propose, though the question of 3dmark's evolution is far more in doubt right now. So, the virtues of doing the extra leg work is more important, but I think my concerns are still significant.

Note that many solutions are independent of it being Open Source, and being able to defeat those solutions, rather than optimize for them, would be made easier by that decision. This introduces a new set of challenges, and I think it is important to recognize that.
Old 04-Jun-2003, 01:03   #8
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,679
Default

Quote:
Originally Posted by SA
To measure the image quality, the same scenes should be sampled at the same random positions using a software reference driver on a second pass. The sum of the squares of the differences would be reported as the image quality result. This eliminates the subjective nature of current image quality comparisons and requires that a driver/hardware combination produce high quality image results as well as high performance in a measurable fashion. This measured image quality would of course include the effect of anti-aliasing and anisotropic filtering.
No, it doesn't eliminate the subjective nature of current image quality comparisons. It does the exact same thing as most current image quality comparisons: the situation chosen will bias the image quality comparison in one direction or another.

One cannot realistically quantify how the various aspects of image quality come together for an overall look for the average user. The user has to decide for himself/herself.
Old 04-Jun-2003, 01:23   #9
demalion
Senior Member
 
Join Date: Feb 2002
Location: CT
Posts: 2,024
Default

Chalnoth,
We're not talking about AA and AF, though, we're talking about shader output. That is more readily achievable as long as there is a standardized and wholly applicable basic reference available, which there is because there is more detailed specification than for AF and AA. The problem with fp16/fp24/fp32 becomes a lot more manageable in that context, though the IEEE discussion points out some hurdles that would still hinder purely mathematical comparison. But they are much simpler and directly mathematical hurdles (as opposed to implementation decisions that vary drastically in an interpixel regard), and are more likely to be surmountable by that means.

Hmm, going back and reading SA's post, I think I'm trying to say similar things.
Old 04-Jun-2003, 01:43   #10
SA
Member
 
Join Date: Feb 2002
Posts: 100
Default

Concerning not being able to quantify all the aspects that might satisfy a user: that is true of every aspect of benchmarking, including average frame rates. What image comparison does is provide a simple objective measurement for discussing image quality, in the same manner that average frame rates provide a simple objective measurement for discussing performance. Average frame rates have many faults and do not adequately address many users' concerns about performance, such as how slow things get, how long they stay slow, etc. Even providing the variance in the frame rate or frame rate histories may not necessarily help.

However, the average frame rate is a simple objective measurement and it is much better than just subjective opinions about performance. Likewise, a sum of squared differences comparison of images against a reference image has traditionally been used by many in graphics as a simple objective measurement of image quality.
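
For reference, here is a small sketch of the kind of frame rate summary discussed in this thread (average, spread, slowest frame and a histogram). The function name, the dictionary keys and the 10 fps bucket width are assumptions for illustration.

```python
import statistics
from collections import Counter


def frame_rate_summary(frame_times_ms, bucket_fps=10):
    """Summarise a run from its per-frame render times in milliseconds."""
    fps = [1000.0 / t for t in frame_times_ms]  # instantaneous fps of each frame
    # Count how many frames fall into each fps bucket (0-9, 10-19, ...).
    histogram = Counter(int(v // bucket_fps) * bucket_fps for v in fps)
    return {
        # Classic average: total frames divided by total time.
        "average_fps": len(frame_times_ms) * 1000.0 / sum(frame_times_ms),
        "stdev_fps": statistics.stdev(fps),
        "min_fps": min(fps),
        "histogram": dict(sorted(histogram.items())),
    }
```

A run of three 10 ms frames and one 40 ms frame averages about 57 fps, yet its slowest frame corresponds to 25 fps. That is the sort of detail the average alone hides.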
Old 04-Jun-2003, 02:13   #11
ERP
Moderator
 
Join Date: Feb 2002
Location: Redmond, WA
Posts: 3,322
Default

My personal view,

Complex game-like benchmarks have little value other than to measure how fast a combination of hardware runs the specific benchmark.

I wonder if a small set of tests, with known well defined results, where an IHV provides the "best case" code (available for public inspection) to produce those results, wouldn't be a more useful test.

The other problem I see is that any benchmark that becomes popular not only tries to measure current-generation hardware; its architecture also, to some extent, dictates how hardware evolves, and I'm not sure that's a good thing.
Old 04-Jun-2003, 03:22   #12
SA
Member
 
Join Date: Feb 2002
Posts: 100
Default

I think the problem with benchmarking revolves around what is being measured and why.

If someone wants to know how well some hardware performs with current applications, the best thing is to measure it using those applications. In fact, the best thing is to buy the hardware, try it using those applications, and if you don't like it, take it back.

If someone wants to know how well some hardware will perform when applications are written that use new aspects of the hardware that are not used today, you need some sort of application that uses those new aspects. Otherwise you'll never know. Of course you could wait a couple of years before buying the hardware when applications become generally available that use the new features, but then you'll always be buying two year old hardware.

This is the benefit of a hardware benchmark. A benchmark can be upgraded much more quickly than a full-blown sophisticated application. It also does not require a minimum hardware install base. As a result, it can keep up with the latest hardware advancements.

The problem is, that it may not be representative of how developers may write their future applications using various vendor's hardware.

This is why vendors need to be able to optimize a benchmark for their hardware. Since it is in their interest to make their hardware shine as much as possible, each hardware vendor will ensure that the benchmark uses their hardware optimally.

The important thing therefore is to construct a benchmark in a way that allows hardware vendors to do this while providing meaningful benchmark comparisons.

What I mentioned above relies on the following paradigm. You hand each of the hardware vendors a program that produces a widely variable set of possible outputs based on some input value. In addition, you also supply a program that produces corresponding reference outputs. You then allow the hardware vendors to optimize the first program, however they wish, with the caveat that their output will be compared to the reference using some objective measurement. Their final rating will be a combination of the measured performance and the measured quality of their output compared to the reference.
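
The post does not say how performance and quality would be combined, so the following is purely an assumed example of one possible rule: divide the frame rate by a penalty that grows with the RMS error against the reference. The name `final_rating` and the `error_weight` knob are hypothetical.

```python
import math


def final_rating(average_fps, sum_squared_error, num_samples, error_weight=1.0):
    """Combine speed and image quality into one number (hypothetical formula)."""
    # Root-mean-square error per sample, in 8-bit colour levels.
    rms_error = math.sqrt(sum_squared_error / num_samples)
    # A perfect match keeps the full frame rate; errors scale the rating down.
    return average_fps / (1.0 + error_weight * rms_error)
```

With this rule, a driver that gains speed by lowering quality only comes out ahead if the speedup outweighs the error penalty. Choosing the weight is itself a judgment call that would need agreement.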

Optimizations to the benchmark that are found to apply generally across most hardware without sacrificing quality could be added to the benchmark on the next update.

This paradigm would not only provide a means of benchmarking hardware in a somewhat equitable fashion, but if the various vendor optimizations were made public, it would give future application programmers insight into how best to take advantage of the new hardware features for each specific vendor's hardware.
Old 04-Jun-2003, 03:29   #13
demalion
Senior Member
 
Join Date: Feb 2002
Location: CT
Posts: 2,024
Default

ERP,
I see merit in the gist of that, but: the guidelines need to be established by an independent body, and the output needs to be comparable to be useful.

The first is already done. It is even already done to a degree that will allow IHV advantages to be exposed in a standard way, in DX HLSL, and it looks like it will be done again for glslang.

The second is the key issue that requires work focus and vigilance.

As for hardware evolution: with standardized expression in a language for pixel and vertex processing, the metric is speed of execution of the tokens in that language while maintaining output quality. This is a parallel to CPU evolution. Things like AF and AA are separate concerns that are not easily represented in this way, and likely need to be addressed distinctly.

Anyways, all popular benchmarks have the influence you propose, but a shader benchmark has more generally applicable repeatability and extrapolation potential than any other situation.
The only solution apparent to me for popular benchmarks is to have more than one popular benchmark, and we do. However, some dedicated benchmarks can be better than others for being more popular/used as a reference, because they offer more opportunities to expose weaknesses and strengths in and of themselves.
Vulpine seems a poor benchmark (or should I say "benchmark suite"?) to me in this regard, same with CodeCreatures. Glexcess seems better in theory, and closer to 3dmark. Rightmark also seems like it will be evolving in this direction, once the Cg issues are sorted out.
Old 04-Jun-2003, 03:47   #14
micron
Diamond Viper 550
 
Join Date: Feb 2003
Location: U.S.
Posts: 1,189
Default

Reading through these posts, I'm seeing a lot of the complexities that are involved in actually making a valid benchmark, which is the topic of this thread. Does it really have to be as hard as you all are making it seem?
Old 04-Jun-2003, 04:01   #15
gkar1
Member
 
Join Date: Jul 2002
Posts: 497
Default

Code:
//---------------------------------
function bool nVidia driver optimizations 
//---------------------------------

if ( bUsingBF1942=1 )
{
OutputQuality = rofl;
ClipPlane = 0.1;
}
else if ( bUsingUnreal2=1 )
{
OutputQuality = kekekeke;
ClipPlne = 0.002;
}
else
{
OutputQuality = LOL;
ClipPlane = 0.34;
}
Sorry i couldn't help myself
Old 04-Jun-2003, 05:53   #16
Colourless
Monochrome wench
 
Join Date: Feb 2002
Location: Somewhere in outback South Australia
Posts: 1,257
Default

Quote:
Originally Posted by gkar1
Code:
//---------------------------------
function bool nVidia driver optimizations 
//---------------------------------

if ( bUsingBF1942=1 )
{
OutputQuality = rofl;
ClipPlane = 0.1;
}
else if ( bUsingUnreal2=1 )
{
OutputQuality = kekekeke;
ClipPlne = 0.002;
}
else
{
OutputQuality = LOL;
ClipPlane = 0.34;
}
Sorry i couldn't help myself
No wonder nvidia's drivers have so many problems. The comparisons are broken. They are using the assignment operator, not the equality operator. And as such, they always think they are running Battlefield 1942.
__________________
-Colourless

D3D FSAA Viewer 5.4
Words by Cat - Truely Intelligent Viewing
Old 04-Jun-2003, 06:29   #17
Himself
Member
 
Join Date: Sep 2002
Posts: 381
Default

Does one benchmark have to be all things for all people? Instead of trying to create the ultimate all in one test, what's wrong with just testing one specific thing and doing it well?
Old 04-Jun-2003, 07:05   #18
Chalnoth
 
Join Date: May 2002
Location: New York, NY
Posts: 12,679
Default

Quote:
Originally Posted by demalion
Chalnoth,
We're not talking about AA and AF, though, we're talking about shader output. That is more readily achievable as long as there is a standardized and wholly applicable basic reference available, which there is because there is more detailed specification than for AF and AA. The problem with fp16/fp24/fp32 becomes a lot more manageable in that context, though the IEEE discussion points out some hurdles that would still hinder purely mathematical comparison.
It still applies, only it's less meaningful.

That is, that FP32 is more precise than FP24 is obvious (and so on and so on). It is easy to create a benchmark to show these precision differences.

It is not easy to show how these precision differences will relate to real games.
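
A synthetic precision test of that kind could be sketched as follows. It uses Python's `struct` formats `e` (IEEE half, FP16) and `f` (IEEE single, FP32) to round intermediate results. FP24 has no `struct` format, so it is left out here, and the helper names are assumptions for this example.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the nearest value representable in `fmt`."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(fmt, step, count):
    """Add `step` to a running total `count` times, rounding after every add."""
    total = 0.0
    rounded_step = round_to(fmt, step)
    for _ in range(count):
        total = round_to(fmt, total + rounded_step)
    return total


fp16_total = accumulate("<e", 0.001, 1000)  # half precision
fp32_total = accumulate("<f", 0.001, 1000)  # single precision
print("FP16 error:", abs(fp16_total - 1.0))
print("FP32 error:", abs(fp32_total - 1.0))
```

The FP16 total drifts much further from 1.0 than the FP32 one. As the post says, the hard part is not producing such a gap but deciding how much it matters for the shaders real games actually run.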
Old 04-Jun-2003, 09:08   #19
Dio
Senior Member
 
Join Date: Jul 2002
Location: UK
Posts: 1,758
Default

Quote:
Originally Posted by Himself
Does one benchmark have to be all things for all people? Instead of trying to create the ultimate all in one test, what's wrong with just testing one specific thing and doing it well?
It's not generally useful as a performance metric. Most of the people here are focused on the web media, but although those in-depth benchmarks that run 40 different tests are all very well, print media doesn't have the time or page space to run thousands of tests. There, there is a desire for a single number that can compare one piece of hardware to another - originally just the Quake2 score.

As a result, for a benchmark to gain wide acceptance it needs to test a wide range of things and boil them down to one number. 3dmark and GameGauge are the two main attempts at this so far.
Old 05-Jun-2003, 01:50   #20
Babel-17
Member
 
Join Date: Apr 2002
Posts: 445
Default

http://www.benchemall.com

Would this utility be of any help to anyone? Sorry if I'm going OT with this.
Old 05-Jun-2003, 02:12   #21
WaltC
Senior Member
 
Join Date: Jul 2002
Location: BelleVue Sanatorium, Billary, NY. Patient privileges: Internet access
Posts: 2,694
Default

I like the seed value idea for camera paths very much. I think this would eliminate the possibility of a company doing what nVidia did with 3DM 320 pretty well. It would also allow people at home to use the same seed values as reviewers for camera tracks and compare results. It would also allow reviewers to use multiple camera tracks in a review to get a better overall performance picture of each individual scene representing the benchmarking of a certain API feature.

As far as what one sees on the screen while the benchmark is running, I see nothing wrong whatever with the way 3D Mark currently does that. I mean, simply because an image might be "abstract" doesn't mean there is any increase in the reliability of the benchmark. The fill rate tests in 3DMark are visually abstract. So what?... I see nothing wrong with a "3D game" interface visually of the type 3D Mark currently employs. Seems far more fitting for the subject matter.

The important point to me is that the benchmark avoid vendor-specific shader paths and other vendor-specific code like the plague and attempt to be as generically close to the API standards as the authors are capable of making it. If benchmarks started using vendor-specific, optimized paths and so on we'd descend to the level of the game engine as opposed to the generic API feature benchmark. The fact is that most D3d games do not use vendor specific paths and the 3D hardware manufacturers do not build maximally optimized routines into their drivers for the majority of D3d games that ship.
Old 05-Jun-2003, 05:07   #22
SA
Member
 
Join Date: Feb 2002
Posts: 100
Default

The idea of a seed is very useful of course.

However, I think the problem with current benchmarks is not whether particular code paths are modified. I think that is a red herring.

I think the real problem is what benchmarks currently measure in the first place.

I will give an example to clarify my point. Suppose there is a CPU floating point benchmark. It performs typically used floating point algorithms such as solutions to large numbers of linear equations. It is written in C and compiled using standard platform compilers.

Now suppose there are three new processors. The first has a new floating point vector processor that can run 10 times faster than the previous version; however, its standard floating point is a bit slower than the others', and the vector processor requires assembler. A competing processor has no vector co-processor but has slightly faster standard floating point. A third processor can remove all latencies for a 2x performance increase, but this also requires a small amount of assembler.

Software vendors would definitely rewrite their inner loops in assembler to take advantage of the new floating point features on the first and third processors.

The question is: is the benchmark a good measure of how real-world floating point applications are going to perform on these platforms in a year or so? The answer is obviously not. The platform that scores the highest will likely perform the worst, since software developers will not constrain their code to be written only in compiled C if they can achieve major performance benefits by making a few vendor-specific optimizations.

Benchmark writers typically do not solve this dilemma since they are not motivated to continually keep their benchmarks optimized for vendor specific hardware. They should be unbiased which means they shouldn't care how well any specific hardware performs on their benchmarks. Only the hardware vendor cares about this and, in the future, application developers. As a result benchmarks often do not reflect how hardware will actually perform which reduces their benefits.

This motivates hardware vendors to try and find other approaches to optimize the benchmark for their hardware, since the benchmark is not written this way. However, since this is not an open practice, it creates an unlevel playing field and the benchmark becomes useless to everyone.

Now suppose that the benchmark developer creates a new benchmark using a new benchmark paradigm. Using this paradigm, the benchmark is split into two programs: an open source model program (since it models the applications that will be used) that will run on the destination hardware and can be freely modified by anyone as desired, and a reference program that ships as a binary executable and can be run on any machine at any time after the initial program is run (meaning there is nothing on the benchmarked hardware that can possibly affect or modify it; the reference program is not allowed to be modified). Benchmarked data can be communicated between the two either on disk or via the network.
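
As a rough sketch of the hand-off between the two programs, the model program could serialize its results to text that is either written to disk or sent over the network. The record layout and the names `encode_model_run` and `decode_model_run` are hypothetical; the post does not specify any format.

```python
import json


def encode_model_run(seed, frame_times_ms, sampled_pixels):
    """Serialize one run of the model program to a JSON string."""
    record = {
        "seed": seed,                                   # lets the reference replay the same run
        "frame_times_ms": list(frame_times_ms),         # measured performance
        "sampled_pixels": [list(pixel) for pixel in sampled_pixels],
    }
    return json.dumps(record, sort_keys=True)


def decode_model_run(text):
    """Parse a record produced by encode_model_run back into Python values."""
    record = json.loads(text)
    # JSON has no tuple type, so restore pixels to (r, g, b) tuples.
    record["sampled_pixels"] = [tuple(pixel) for pixel in record["sampled_pixels"]]
    return record
```

The reference program only ever reads this record; it never runs on the benchmarked machine's drivers.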

Hardware vendors are allowed, in fact encouraged, to modify the benchmark model program to their heart's content to optimize it for their platforms. There is one stipulation: they must make their optimizations publicly available on their website and allow others to freely use them (much like today's software development demos). In this way, the optimizations can be communicated to software developers, allowing the lessons learned from them to find their way into actual applications.

When the first (model) program is run it saves its output to disk. In the case of 3d graphics, the output should consist of randomly sampled pixels from the screen images (random sampling greatly minimizes the amount of data that needs to be collected and stored, while giving essentially the same results as keeping all the pixels, which would be prohibitive). The performance of the model program is measured and also stored. Then the reference program is run, which runs the same application in an idealized fashion. In the case of a 3d program it uses a maximum precision software renderer with essentially ideal AA for the entire image (textures and geometry), much like CG software. It saves precisely the same set of random pixels as the first.
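
For both programs to save "precisely the same set of random pixels", the sample positions have to come from a shared seed. A minimal sketch, assuming a frame is stored as `frame[y][x] = (r, g, b)` and using the made-up names `sample_pixel_coords` and `sample_frame`:

```python
import random


def sample_pixel_coords(seed, width, height, count):
    """Pick `count` distinct (x, y) pixel positions, reproducibly from `seed`."""
    rng = random.Random(seed)
    # Sample flat pixel indices without replacement, then convert to coordinates.
    indices = rng.sample(range(width * height), count)
    return [(index % width, index // width) for index in indices]


def sample_frame(frame, coords):
    """Read the sampled pixels from a frame stored as frame[y][x] = (r, g, b)."""
    return [frame[y][x] for (x, y) in coords]
```

Both the model program and the reference program would call `sample_pixel_coords` with the same seed and resolution, so neither has to store a full frame.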

After the reference program is run, the benchmark compares the outputs of the two programs and measures the differences. With 3d, the sum of the square of the differences between the pixels (divided by the number of pixels to normalize it) gives a measure of image quality compared to the reference. Since the reference is run with idealized software AA, AA and anisotropic filtering are automatically included in the image quality metric. The better the AA, the closer it will be to the reference and the better the image quality score. Shaders, shader precision, subpixel precision, and all other image quality variables automatically show up in the image quality score.
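
The comparison described above is a mean squared error over the sampled pixels. A small sketch follows; summing the squared differences over the three colour channels of each pixel is an assumption, since the post does not say how channels are combined, and `image_quality_error` is a made-up name.

```python
def image_quality_error(model_pixels, reference_pixels):
    """Mean squared difference between two equally long lists of (r, g, b) pixels."""
    if len(model_pixels) != len(reference_pixels):
        raise ValueError("both programs must sample the same pixels")
    if not model_pixels:
        raise ValueError("no pixels to compare")
    total = 0.0
    for model, reference in zip(model_pixels, reference_pixels):
        # Squared difference per channel, summed for this pixel.
        total += sum((m - r) ** 2 for m, r in zip(model, reference))
    # Divide by the number of pixels to normalize.
    return total / len(model_pixels)
```

With this definition a lower number means the hardware output is closer to the reference, so 0.0 would be a perfect match.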

The final score would show both the average frame rate and the image quality score.
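
A sketch of what reporting both numbers side by side could look like, assuming frame times are recorded in milliseconds. The names `average_fps` and `final_score` and the dictionary layout are hypothetical.

```python
def average_fps(frame_times_ms):
    """Average frame rate: frames rendered divided by total time in seconds."""
    if not frame_times_ms:
        raise ValueError("no frames were timed")
    return 1000.0 * len(frame_times_ms) / sum(frame_times_ms)


def final_score(frame_times_ms, quality_error):
    """Report speed and image quality separately rather than merging them."""
    return {
        "average_fps": average_fps(frame_times_ms),
        "image_quality_error": quality_error,
    }
```

Keeping the two values separate means a vendor cannot hide a drop in image quality behind a higher frame rate.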

There could be no such thing as cheating with such a benchmark. Any and all modifications to the first program are allowed, and there is no way for the benchmark hardware or its drivers to modify the reference program. If one hardware vendor found a way to optimize the benchmark by adding clip planes or by some other trick, the other hardware vendors would quickly see the published results and add the same trick to their own optimizations. Even more interesting is that game developers would also see the various tricks and optimizations and add those to actual games whenever those situations arise.

What would quickly result is a benchmark that runs optimally on all the hardware that it benchmarks. What's more, since measured image quality is an important part of the final score, hardware vendors would be motivated to increase image quality as well as performance. What is even more interesting is that such a benchmark could become an R&D arena for finding new and interesting 3d graphics optimizations, since hardware vendors would be highly motivated to be the first to find such.

Note that while an initial seed that creates nondeterministic code paths would make the benchmark more beneficial by requiring optimizations that are more likely to benefit interactive applications, it is not strictly needed to create a level playing field. Even static scenes and scenes on a rail would work okay as benchmarks. In this case, the hardware vendors would all simply fully optimize for those static or constrained cases.
SA is offline  
Old 05-Jun-2003, 11:41   #23
LeStoffer
Senior Member
 
Join Date: Feb 2002
Location: Somewhere not *that* rotten in Denmark
Posts: 1,197
Default

Quote:
Originally Posted by SA
Hardware vendors are allowed, in fact encouraged, to modify the benchmark model program to their heart's content to optimize it for their platforms. There is one stipulation: they must make their optimizations publicly available on their website and allow others to freely use them (much like today's software development demos). In this way, the optimizations can be communicated to software developers, allowing the lessons learned from them to find their way into actual applications.
Interesting idea indeed.

My first thought, however, is: how useful would this actually be in reflecting common everyday performance in plain games, when all it really shows is how clever and good the IHVs are at getting every bit of performance without sacrificing image quality?

Would we be benchmarking the cards - or the driver crew?

Having said that I can't come up with a better idea.
__________________
Best regards, LeStoffer
LeStoffer is offline  
Old 05-Jun-2003, 11:48   #24
pcchen
Moderator
 
Join Date: Feb 2002
Location: Taiwan
Posts: 2,358
Default

I think this depends on what you want to benchmark.

SA's suggestion aims to find what the hardware can do, that is, to uncover the full potential of the hardware. Of course, existing or near-future applications are not necessarily able to exploit the hardware. This is a bit like SPEC CPU. However, IHVs are not allowed to modify the source of SPEC CPU benchmarks.
pcchen is offline  
Old 05-Jun-2003, 12:00   #25
Sharkfood
Member
 
Join Date: Feb 2002
Location: Bay Area, California
Posts: 702
Default

I'm very impressed with the thought that went into the proposed idea. It takes a wide number of factors into consideration that might help the current situation for performance measurement.

I'd add one additional caveat to such a proposed standard for making a more robust benchmark:

The binaries for such benchmark should be distributed in link-lib form and include a mini-linker with elementary symbol remapping, much like commercial Unix kernel linkers provide. This would allow such a benchmark suite to be generated with a different executable name and internal symbols per run. The "seed" for other factors can be rolled into this process to help increase the difficulty of application detection by video drivers.
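
A toy sketch of the renaming step, to show how a seed could drive it. This only builds a name mapping; it is not a linker, and `remap_symbols`, the name length, and the lowercase alphabet are all assumptions for illustration.

```python
import random
import string


def remap_symbols(seed, symbols, name_length=12):
    """Map each original symbol to a pseudo-random name derived from `seed`."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase
    mapping = {}
    used = set()
    # Sort first so the result does not depend on the order symbols are passed in.
    for symbol in sorted(symbols):
        while True:
            candidate = "".join(rng.choice(alphabet) for _ in range(name_length))
            if candidate not in used:  # never hand out the same name twice
                break
        used.add(candidate)
        mapping[symbol] = candidate
    return mapping
```

The same routine could name the executable itself, so a driver looking for a fixed file name or a fixed exported symbol would find neither.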

At the end of the day, I don't think it's possible to make a perfect benchmark that is completely secure from IHV mischief... after all, build a better mousetrap and nature builds a better mouse. But if enough steps are taken, the effort involved in getting around such pitfalls may become prohibitive enough to discourage such behavior.
Sharkfood is offline  

 
