The state of videocard reviews

Nite_Hawk

Veteran
Hi guys,

So I've been working on something of a research project lately, and this research project has required that I go through about 50 different reviews from various review sites and record several thousand benchmark scores. After all this, I've come to a couple of conclusions:

1) Almost every reviewer in this industry save a very select few seems incapable of listing whether or not they benchmarked with trilinear, bilinear, or brilinear filtering.

2) Fairly often reviewers neglect to mention what map or demo they used with a specific benchmark. UT2003 and Quake3 benchmarks specifically suffer from this. Many reviews of UT2003, for example, will say "botmatch" or "flyby", but not specify the map. Others will give the map, but not specify whether it was benchmarked with botmatch, flyby, or in some cases, HardOCP's benchmarking utility.

3) Some reviewers don't mention what kind of memory their system is running. Of the ones that actually report the memory type (i.e. PC2700), most neglect to state what speed that memory is actually running at (i.e. whether the PC2700 is actually running at 2.7GB/s or, say, 2.1GB/s).

4) Some reviewers reuse benchmark scores from previous reviews without stating that they are doing so. On the surface this seems reasonable, but when looking for distinct recordings, the same score counted more than once can skew comparisons.

5) It's pretty tough to get a good impression of how older cards (GF4 series) fare with newer drivers/CPUs in comparison to newer cards. R300 based cards are in better shape simply because they have been so resilient over the last 18 months.

6) The same argument can be made for benchmarks. UT2003 (and to a certain extent SS:SE) seem to be the only ones that have really stood the test of time.

7) Entering the attributes (CPU, CPU speed, RAM speed, GPU, GPU speed, VRAM speed, driver, display settings, benchmark, scores, etc.) for about 4000 scores is really, really tiring.

Nite_Hawk
 
Yeah, some benchmarks become kind of useless in such conditions IMO. The worst thing amongst those IMO is that reviewers so often don't mention the map - it seems quite frequent in NV40 reviews...
If you still have the data available, it might be interesting to see which sites took the most care in those details and which did not, to make sure not to believe all of those less-than-serious sites around. :)

Uttar
 
I'm not surprised at all. Web review sites don't seem to be made up of professionals who are qualified in technical journalism, tech product consultancy, electronic engineering, etc., just some Joe Bloggs who decided to open his PC case and create a web page.
They get snippets of gossip and convert them into headline news, copy reviews from other websites and just add a few twists to make it look a bit different, and go over and over the same info time and time again so as to fill up pages. (Nodding dog journalism)

One of the exceptions is B3D ;) (and I think HardOCP's new format is ideal for the end user who just wants to know what it would be like in his PC)

Personally, I feel the "least" that should be done is for the author to put not only his name to it but also his credentials. I'd also like to see journalism that digs into the failings of the technologies.
 
I'm pretty disappointed actually. I had forgotten how bad many review sites are about specifying how they did their testing (I've been spoiled by B3D). Even here though, there are reviews that are missing information. Many of Marco's old reviews for example don't state whether or not trilinear was used, and some of Dave's reviews seem to reuse the same numbers (I can understand why he'd want to do this as it's a lot of work, but it would be *very* helpful if it was labelled when this happened!)

Overall, the reviewing industry seems to be in a pretty sorry state. Do you guys happen to know any other sites that actually post all of the settings they use? Dave and Brent's articles have so far been head and shoulders better than almost all of the others I've read, though I have picked up a couple of other decent ones.

Nite_Hawk
 
Nite_Hawk said:
Hi guys,

So I've been working on something of a research project lately, and this research project has required that I go through about 50 different reviews from various review sites and record several thousand benchmark scores. After all this, I've come to a couple of conclusions:

1) Almost every reviewer in this industry save a very select few seems incapable of listing whether or not they benchmarked with trilinear, bilinear, or brilinear filtering.

Many sites are unfamiliar with complex technical terms.

2) Fairly often reviewers neglect to mention what map or demo they used with a specific benchmark. UT2003 and Quake3 benchmarks specifically suffer from this. Many reviews of UT2003, for example, will say "botmatch" or "flyby", but not specify the map. Others will give the map, but not specify whether it was benchmarked with botmatch, flyby, or in some cases, HardOCP's benchmarking utility.

Many sites think that "botmatch" and "flyby" are actually maps themselves, or that the name of the map is the actual test itself, or vice-versa.

3) Some reviewers don't mention what kind of memory their system is running. Of the ones that actually report the memory type (i.e. PC2700), most neglect to state what speed that memory is actually running at (i.e. whether the PC2700 is actually running at 2.7GB/s or, say, 2.1GB/s).

Many sites are unfamiliar with complex technical terms.


4) Some reviewers reuse benchmark scores from previous reviews without stating that they are doing so. On the surface this seems reasonable, but when looking for distinct recordings, the same score counted more than once can skew comparisons.

Many sites use whatever data is handy: data which has resisted deletion, or data conveniently obtained by cutting and pasting from other sites.

5) It's pretty tough to get a good impression of how older cards (GF4 series) fare with newer drivers/CPUs in comparison to newer cards. R300 based cards are in better shape simply because they have been so resilient over the last 18 months.

Many sites over the past 18 months have been paid by nVidia to "forget" the GF4 in favor of GF FX.

6) The same argument can be made for benchmarks. UT2003 (and to a certain extent SS:SE) seem to be the only ones that have really stood the test of time.

Yes, many sites think that <400 fps indicates an inferior product, and that resolutions >512x384 are a waste of good cpu power.

7) Entering the attributes (CPU, CPU speed, RAM speed, GPU, GPU speed, VRAM speed, driver, display settings, benchmark, scores, etc.) for about 4000 scores is really, really tiring.

Many sites are unfamiliar with complex technical terms. Perhaps you are being too demanding?

:D
 
...

WaltC: You aren't bitter or anything, are you? :)

Hrm... It should be interesting to run Apriori and association analysis on the data I've got. I wonder what interesting rules will pop up?

hrm hrm rhm...
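
For what it's worth, an association-analysis pass over a table like that might look roughly like the sketch below; mlxtend, the scores.csv file, and the column names are placeholder assumptions, not what is actually being used here.

```
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# scores.csv: one row per recorded benchmark result, with discretised attributes.
# The file name and columns are hypothetical placeholders.
df = pd.read_csv("scores.csv")
onehot = pd.get_dummies(
    df[["gpu", "cpu_class", "resolution", "filtering", "score_band"]]
).astype(bool)

# Frequent attribute combinations, then rules along the lines of
# {gpu_9700Pro, resolution_1600x1200} -> {score_band_high}
frequent = apriori(onehot, min_support=0.05, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())
```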

Nite_Hawk
 
Not only do most review sites lack the technical knowledge to do a proper analysis of scores (I sometimes still see 3DMark01 fillrate numbers presented as if they actually measured fillrate, for example), but they also lack knowledge about proper scientific method.

For example, if you're doing an image quality analysis, the proper way to do it is to have one guy take screenshots from the different cards, number them and then have some other guy judge the IQ without knowing which shot belongs to which card. I've never seen this done anywhere.
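
A minimal sketch of how that handoff could be set up, assuming Python and purely hypothetical file names; the only point is that the person doing the judging never learns which card a shot came from:

```
import random
import shutil
from pathlib import Path

def make_blind_set(shots, out_dir="blind"):
    """Copy screenshots under anonymous numbered names; return the secret
    anonymous-name -> card mapping, to be opened only after judging."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    cards = list(shots)
    random.shuffle(cards)
    key = {}
    for i, card in enumerate(cards, start=1):
        anon = "shot_%02d.png" % i
        shutil.copy(shots[card], out / anon)
        key[anon] = card
    return key

# Hypothetical inputs: card name -> screenshot taken on that card.
key = make_blind_set({"card_A": "a_ut2003.png", "card_B": "b_ut2003.png"})
# Hand the 'blind' folder to the second person; reveal 'key' only afterwards.
```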

Another example is judging things that probably vary a lot between individual cards like "overclockability" by using a single sample. You need many samples in order to make any sort of conclusion regarding overclockability. A great example of this (which doesn't involve videocards) was when Tom's or Anand reviewed a whole bunch of PSUs and judged reliability from how well they worked under load. One of the PSUs failed and that manufacturer was determined to have faulty products. Of course, it's possible that the other manufacturers had a higher rate of failure but just got lucky. But of course, determining that would take more work so they just *guessed* that their sample of one was representative.
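
To put a rough number on that last point: with a sample of one, a confidence interval on the failure rate spans nearly the whole range either way. A quick sketch, assuming scipy is available; the scenario and figures are purely illustrative:

```
from scipy.stats import beta

def failure_rate_interval(failures, n, conf=0.95):
    """Clopper-Pearson (exact) confidence interval for a failure rate."""
    alpha = 1 - conf
    lo = 0.0 if failures == 0 else beta.ppf(alpha / 2, failures, n - failures + 1)
    hi = 1.0 if failures == n else beta.ppf(1 - alpha / 2, failures + 1, n - failures)
    return lo, hi

# One PSU tested and it failed: the 95% interval is roughly 2.5%..100%,
# i.e. the data are consistent with almost any failure rate.
print(failure_rate_interval(1, 1))
# One PSU tested and it survived: the interval still spans roughly 0%..97.5%.
print(failure_rate_interval(0, 1))
```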
 
Gamecat:

Doesn't it drive you nuts? Given how technical this field is, I'm really surprised that more excellence isn't demanded of it. I guess that many gamers are probably younger, haven't learned about stats or proper testing methods, and don't really care. It's too bad though, because things could be a lot better.

Despite this, I'm actually managing to get semi-interesting results working with the data. We'll have to see how it goes.

Nite_Hawk
 
You have to remember I haven't done a review in almost 8 months, so the overall quality of reviews on the Net is bound to fall.

8) :LOL:
 
GameCat said:
For example, if you're doing an image quality analysis, the proper way to do it is to have one guy take screenshots from the different cards, number them and then have some other guy judge the IQ without knowing which shot belongs to which card. I've never seen this done anywhere.

I have to kind of agree and disagree with you all at once here. Although I see where you're coming from and kind of like the idea, I would argue a couple of points:

1. Provided you are publishing the screenshots taken for the comparison in their full form, you can basically let the reader decide for themselves. From a simple 'Does it look good?' perspective, that simply can't be beaten; letting a potential buyer decide for himself is always the best way in such a subjective field.

2. From a technical perspective, I would argue that it helps, not hinders, to know which card each shot came from. If you understand how the architecture and drivers for a board work, then you know what to look out for regarding possible caveats in its image quality.

GameCat said:
Another example is judging things that probably vary a lot between individual cards like "overclockability" by using a single sample. You need many samples in order to make any sort of conclusion regarding overclockability. A great example of this (which doesn't involve videocards) was when Tom's or Anand reviewed a whole bunch of PSUs and judged reliability from how well they worked under load. One of the PSUs failed and that manufacturer was determined to have faulty products. Of course, it's possible that the other manufacturers had a higher rate of failure but just got lucky. But of course, determining that would take more work so they just *guessed* that their sample of one was representative.

You're totally right here, but how many review sites can demand a dozen boards and have the time to overclock each one? Hell, even getting a single board is difficult enough! ;) I think the best any site can do here is simply remind users that this is showing the overclockability of a single sample, and is not necessarily representative of all boards on sale.
 
Of course, letting the reader judge IQ is great. But most often, reviewers also give their own opinions.

In that last case, a blind test is certainly preferable. As soon as someone knows which card it is, he will probably 'hunt' for its specifics. For example, he might look for that one triangle which he knows will get a little less AF because of the angle, and maybe 'forget' to notice blurring on lots of other polygons.

It is very easy to start to focus the reader on extremely minor issues while neglecting big IQ issues. It doesn't even have to be done on purpose.
And of course, especially when two images look different, a lot of bias can creep into deciding which one is correct!


As to statistics... As soon as you post a review where you list power supplies that did and did not fail, you immediately suggest to the reader that this is typical. It doesn't even matter if you include a disclaimer saying it's not necessarily representative. If you really believe it might not be representative, then you simply should not publish it!!
Actually, in the case of testing only one PSU per brand, you know NOTHING about its reliability. It doesn't matter if it failed or not.


Speaking of statistics, it would also be nice to have a reviewer comment on the typical error of a benchmark. How reliable are the numbers? Ever seen error bars in a reviewer's graphs? What happens if you run the benchmarks a second time? Are they within 1%? 5%? It makes a big difference when videocards are pronounced king while leading by 1 fps...

You might extend that a little... While doing a videocard review, you might think about what happens if you take an AMD 3200+ in place of the P4 3.2GHz, or Windows 9x instead of XP. While I have seen CPU scaling, you almost never see CPU replacement.

You don't have to do that for every review of course, but doing it a couple of times should give a feel for the typical error of the benchmark suite. And then you can say that, for example, videocards that perform within 5% of each other are considered equal.
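
A back-of-the-envelope version of that check might look like the sketch below; the run counts, FPS numbers, and the 5% threshold are made-up assumptions rather than real measurements:

```
import statistics

def summarize_runs(fps_runs):
    """Mean, sample standard deviation, and spread as a percentage of the mean."""
    mean = statistics.mean(fps_runs)
    stdev = statistics.stdev(fps_runs)
    return mean, stdev, 100.0 * stdev / mean

def roughly_equal(runs_a, runs_b, threshold_pct=5.0):
    """Call two cards equal if their means differ by less than the noise threshold."""
    mean_a = statistics.mean(runs_a)
    mean_b = statistics.mean(runs_b)
    return abs(mean_a - mean_b) / max(mean_a, mean_b) * 100.0 < threshold_pct

# Hypothetical UT2003 botmatch numbers, three runs per card:
card_a = [87.3, 88.1, 86.9]
card_b = [88.0, 87.5, 88.9]
print(summarize_runs(card_a))         # mean ~87.4, stdev ~0.6, spread ~0.7%
print(roughly_equal(card_a, card_b))  # True: the ~0.8% gap is inside the 5% band
```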


The big problem in this discussion is, of course, that we assume reviewers want to be objective...

Frankly, I don't think that's the case. There are lots that are extremely biased. Then there's a group that doesn't give a shit; they're only interested in mass-producing reviews.
We've already established that a big group simply lacks the knowledge...

What's left is a very tiny group that really wants to be objective. They'll be hampered by a lack of resources (time, money) to accomplish their goal completely.
 
Ylandro said:
What's left is a very tiny group that really wants to be objective. They'll be hampered by a lack of resources (time, money) to accomplish their goal completely.

Heh, you certainly won't hear me arguing that point. :p
 
Hanners said:
1. Provided you are publishing the screenshots taken for the comparison in their full form, you can basically let the reader decide for themselves. From a simple 'Does it look good?' perspective, that simply can't be beaten; letting a potential buyer decide for himself is always the best way in such a subjective field.

While this is a good point, to make a valid, unbiased decision, the potential buyer really needs to judge the quality of the respective cards without knowing which is which. Once they have deemed Image 1 to have superior quality compared to Image 2, they can find out which image belongs to which card. If you don't do this, the comparison will be biased. Since people have heard about ATi's superior AA quality, they will conclude that the ATi pic probably looks better. Of course, if the readers want labelled images (and my guess is they do), then by all means supply that, but don't use that flawed method for your own testing.

Hanners said:
2. From a technical perspective, I would argue that it helps, not hinders, to know which card each shot came from. If you understand how the architecture and drivers for a board work, then you know what to look out for regarding possible caveats in its image quality.

The problem is that you shouldn't "know what to look out for". If one card has better IQ than another, that should be judged based on... IQ! Not something else. After the comparison it's perfectly possible to try and rationalise your decision by a technical analysis, but if it really is IQ you want to compare, you can't let your preconceived notions affect the comparison.

This really isn't rocket science; thousands of scientific experiments and studies have shown that well-meaning, serious people who really try to be objective still show observer bias. There is a simple way around it that for some reason people don't use. I know lots of audiophiles who claim that expensive, high-quality digital cables from their CD deck (with a very expensive gazillion-bit DAC, mind you) to their amp give more "dynamic sound". This is obviously bogus, but they trick themselves because they know when they're using the $100 digital cable and when they're using the $10 one.


Hanners said:
You're totally right here, but how many review sites can demand a dozen boards and have the time to overclock each one? Hell, even getting a single board is difficult enough! ;) I think the best any site can do here is simply remind users that this is showing the overclockability of a single sample, and is not necessarily representative of all boards on sale.

Well yeah, of course, performing a proper analysis with large samples of each board is impractical. But the problem is that trying it on a single board and adding a disclaimer is sort of like saying "So, I measured the height of my Dutch friend the other day and he was only 165cm, so I conclude Dutch men are pretty short. Of course, this was only a single sample, so take the above with a grain of salt, I'm just throwing it out there." It is kind of silly, no?

Talking about overclockability when retail boards are out and lots of people have tried overclocking them is fine with me. I also understand why sites test stuff like this and show a disclaimer, it's just that for anyone that knows anything about statistics and scientific method the results are pretty meaningless.

But I'm probably the only one anal enough to care about this, so I doubt things will improve much ;)
 
GameCat said:
While this is a good point, to make a valid, unbiased decision, the potential buyer really needs to judge the quality of the respective cards without knowing which is which. Once they have deemed Image 1 to have superior quality compared to Image 2, they can find out which image belongs to which card. If you don't do this, the comparison will be biased. Since people have heard about ATi's superior AA quality, they will conclude that the ATi pic probably looks better. Of course, if the readers want labelled images (and my guess is they do), then by all means supply that, but don't use that flawed method for your own testing.

You know, I really like that idea of allowing the reader to view the images without labels, and then turn them on after comparing them. Not sure about the technical feasibility of it, but I think I'll look into it.

As for the reviewer not comparing labelled images, I've always fired up both images and flipped between the two without labels (and without looking at the filenames to start with), so I'm with you on that point.

GameCat said:
Talking about overclockability when retail boards are out and lots of people have tried overclocking them is fine with me. I also understand why sites test stuff like this and show a disclaimer, it's just that for anyone that knows anything about statistics and scientific method the results are pretty meaningless.

But I'm probably the only one anal enough to care about this, so I doubt things will improve much ;)

Again, a good point, can't really argue with that.

I don't think you're the only one who's a bit 'anal' about this kind of stuff. Certainly nothing wrong with being a perfectionist. ;)
 
GameCat said:
But I'm probably the only one anal enough to care about this, so I doubt things will improve much ;)

Comparing our comments, I think I've just proved you're not the only one. :D


But improving things...... :|
 
Ylandro said:
GameCat said:
But I'm probably the only one anal enough to care about this, so I doubt things will improve much ;)

Comparing our comments, I think I've just proved you're not the only one. :D


But improving things...... :|

There are others too... :)

Btw, I might have something semi-publishable coming down the pipe soon. I'm taking scores from many different reviews (at least trying to) and performing classification and association analysis on them. I'm hoping to get something of a generalized model for predicting scores given a number of attributes (my current model predicts with an 8% relative absolute error and a root relative squared error of 10.6%). The eventual goal with this is to make an online system in which people can enter configurations and get a predicted score for a certain benchmark.

P.S. This is using a cross-validation technique with 10 folds. A 66% split (training/testing) does slightly worse (I think something like 11%/14%).
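
For anyone curious what that kind of evaluation can look like, here is a rough sketch of 10-fold cross-validation with those two error measures. The post doesn't say what tooling was used, so scikit-learn, the random forest model, and the attribute encoding are all assumptions:

```
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def rae(y_true, y_pred):
    """Relative absolute error: total |error| relative to always guessing the mean."""
    return np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()

def rrse(y_true, y_pred):
    """Root relative squared error: squared error relative to always guessing the mean."""
    return np.sqrt(((y_true - y_pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum())

def evaluate(X, y, folds=10):
    """Average RAE and RRSE over k folds for a score-prediction model."""
    raes, rrses = [], []
    for train_idx, test_idx in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        model = RandomForestRegressor(random_state=0)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        raes.append(rae(y[test_idx], pred))
        rrses.append(rrse(y[test_idx], pred))
    return np.mean(raes), np.mean(rrses)

# X: numerically encoded attributes (CPU speed, GPU, VRAM speed, driver, settings, ...),
# y: the recorded benchmark scores.
# mean_rae, mean_rrse = evaluate(X, y)
```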

Nite_Hawk
 