New Slide Leaked!!!

Nite_Hawk

.... From my 3D graphics data mining presentation! :LOL:

Seriously, I'm hoping to have a paper and my slides published within the next two weeks. It's taken a long time, but I have a working model and some interesting results. Hopefully sometime in the near future we'll be able to start predicting scores for video cards. :)

[Attached image: Slide14.png]


Nite_Hawk
 
i hope your PC dies before the official reviews come out!!!!! :devilish:

jk! i don't really :D but that's such a lame topic!!! :D grrrrr
 
Heh, at least you replied!

So many people are getting so wound up in the R420 vs NV40 thing that I thought I would have a little fun. :D

Anyway, the slide shows about 1000 sample points for SS:SE scores taken from about 40 different reviews. Each point can be any combination of AA/AF/resolution/CPU/GPU/memory/etc. settings, and each graph plots the actual score that the review recorded against a predicted score generated from a model (each graph shows a different model). The closer a dot is to the dashed line, the closer the model came to predicting the true score recorded in the review. The idea is that I want to be able to predict how a given computer configuration should score on a given benchmark at various quality settings.
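If you want to eyeball a model the same way, a plot like the one in the slide is easy to throw together. A minimal Python sketch, where both the data and the "model" are made-up stand-ins:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-ins for the real data: recorded scores and model predictions.
rng = np.random.default_rng(0)
actual = rng.uniform(20.0, 150.0, 1000)           # scores from the reviews (fps)
predicted = actual + rng.normal(0.0, 10.0, 1000)  # hypothetical model output

plt.scatter(actual, predicted, s=4)
lims = [actual.min(), actual.max()]
plt.plot(lims, lims, "k--")          # the dashed y = x line: perfect prediction
plt.xlabel("Actual score (fps)")
plt.ylabel("Predicted score (fps)")
plt.show()
```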

Nite_Hawk
 
Heh.
I did something similar but not as focussed three years ago or so.
I used games and 3DMark data, and created a model using 3DMark synthetic scores as a hypothetical set of predictors (Principal Component Analysis - it does a good job with sparse matrices). It turned out that at the time, performance was dominated almost completely by fillrate.
Since reviewing systems varied, host system performance was an awkward factor.

Never did anything serious with it - it simply confirmed my opinion at the time that fill rate ruled supreme. Had no idea such studies could be published anywhere - too quick and simplistic to be called research, and too complex to be suitable for hardware sites. Plus, I was a chemist playing Q3 with no interest in graphics other than utilitarian.
Where's this going up?
 
Entropy:

I'm planning on putting it on subpixel.org (which actually doesn't exist yet) once I give my presentation next week. This is a semester project I did for my data mining class, and also an independent study project. My goal is to actually re-implement everything I did as a webpage with a DB backend so that you can do on-the-fly analysis over the web. What I have found so far:

Trying to lump benchmarks together to create a "holy grail" model of sorts does not work well with either REP Trees or M5 Rules (otherwise known as model trees, which are basically decision trees with linear functions at the leaves) unless the benchmarks are fairly close together in range. For this reason, including 3DMark scores with framerate scores ends up being a bad idea (though I would have guessed M5 Rules would have dealt with it better). Just lumping games that give results in FPS produced a fairly good model, but splitting the data into separate inputs based on benchmark produced the best models. The kind of silly thing is that you could build a single model by combining these models with a small tree at the root.
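To illustrate that last idea, here's a toy sketch of a "root split on benchmark" that dispatches to one model per benchmark. The class is hypothetical and the plain least-squares fits are stand-ins for the real M5 Rules / REP Tree models:

```python
import numpy as np

class PerBenchmarkModel:
    """Toy 'root split' on benchmark name, dispatching to per-benchmark fits."""

    def __init__(self):
        self.models = {}  # benchmark name -> fitted coefficients

    def fit(self, benchmark, X, y):
        # Plain least squares per benchmark; a stand-in for M5 Rules/REP Trees.
        X1 = np.column_stack([X, np.ones(len(X))])  # append an intercept column
        coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
        self.models[benchmark] = coef

    def predict(self, benchmark, x):
        coef = self.models[benchmark]  # the one-level root split
        return float(np.append(x, 1.0) @ coef)

# Made-up usage: GPU clock (MHz) -> SS:SE fps.
m = PerBenchmarkModel()
m.fit("SS:SE", np.array([[325.0], [400.0]]), np.array([70.0, 85.0]))
print(m.predict("SS:SE", np.array([350.0])))  # 75.0
```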

The other thing I noticed is that if you discount 3DMark scores, a number of different attributes all play a reasonably important role in determining the score:

Data Set: FPS subset of the subpixel dataset.

Single Attribute Run, 10-fold cross-validation

Attribute | Model | Corr | Mean Abs Err | Root Mean Sqrd Err
------------------------------------------------------------------------
CPU | M5R | 0.0564 | 99.7596 | 99.7596
CPU | Rep | 0.021 | 99.9694 | 99.9894
------------------------------------------------------------------------
CPU Speed | M5R | 0.2019 | 97.9157 | 97.9005
CPU Speed | Rep | 0.2244 | 97.6269 | 97.4192
------------------------------------------------------------------------
RAM Speed | M5R | 0.181 | 98.4306 | 98.3067
RAM Speed | Rep | 0.222 | 97.4102 | 97.4651
------------------------------------------------------------------------
Chipset | M5R | 0.281 | 96.2699 | 95.9391
Chipset | Rep | 0.261 | 96.0951 | 95.7875
------------------------------------------------------------------------
GPU | M5R | 0.2719 | 95.6033 | 96.1946
GPU | Rep | 0.2779 | 95.3838 | 96.0191
------------------------------------------------------------------------
GPU Speed | M5R | 0.3492 | 93.3441 | 93.6861
GPU Speed | Rep | 0.3503 | 93.3131 | 93.6448
------------------------------------------------------------------------
VRAM Speed | M5R | 0.3765 | 92.1799 | 92.6464
VRAM Speed | Rep | 0.3831 | 91.8541 | 92.365
------------------------------------------------------------------------
Driver | M5R | 0.3263 | 94.8986 | 94.4894
Driver | Rep | 0.3428 | 94.1999 | 93.9053
------------------------------------------------------------------------
Resolution | M5R | 0.3848 | 91.5119 | 92.2587
Resolution | Rep | 0.3848 | 91.5018 | 92.261
------------------------------------------------------------------------
Filtering | M5R | -0.0616 | 100 | 100
Filtering | Rep | 0.0657 | 99.6141 | 99.7404
------------------------------------------------------------------------
AntiAliasing | M5R | 0.4192 | 90.502 | 90.7483
AntiAliasing | Rep | 0.419 | 90.5026 | 90.7561
------------------------------------------------------------------------
Anisotropic | M5R | 0.3531 | 93.3136 | 93.5177
Anisotropic | Rep | 0.3533 | 93.3129 | 93.5093
------------------------------------------------------------------------
Benchmark | M5R | 0.5615 | 82.8817 | 82.7143
Benchmark | Rep | 0.5611 | 82.8895 | 82.7369
------------------------------------------------------------------------
Map | M5R | 0.5824 | 83.2276 | 81.2546
Map | Rep | 0.5777 | 84.971 | 82.7494
------------------------------------------------------------------------

All Attributes, 10-fold cross-validation

Attribute | Model | Corr | Mean Abs Err | Root Mean Sqrd Err
------------------------------------------------------------------------
All Attributes | M5R | 0.9699 | 22.4636 | 24.3429
All Attributes | Rep | 0.9111 | 36.5765 | 41.4929
------------------------------------------------------------------------
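For anyone who wants to reproduce this kind of single-attribute screen, here's a rough sketch with scikit-learn. Everything here is an assumption: 'df' is a hypothetical DataFrame of the FPS subset with a 'Score' column, and a plain pruned regression tree stands in for both Weka learners (M5 Rules has no direct sklearn equivalent):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

def single_attribute_screen(df, attributes, target="Score"):
    """Score each attribute alone with 10-fold CV, like the table above."""
    y = df[target].to_numpy()
    for attr in attributes:
        X = pd.get_dummies(df[[attr]])  # one-hot encode categorical values
        # A pruned regression tree stands in for both Weka learners.
        pred = cross_val_predict(
            DecisionTreeRegressor(min_samples_leaf=5), X, y, cv=10)
        corr = np.corrcoef(y, pred)[0, 1]
        mae = np.mean(np.abs(y - pred))
        print(f"{attr:>12}  corr={corr:.4f}  MAE={mae:.2f}")
```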


Nite_Hawk
 
In the above post, each attribute has a correlation coefficient (along with some error figures). What this table tells you is how well each attribute by itself predicts the score. A correlation coefficient of 0 means there is no correlation between the attribute and the score (i.e. it doesn't really predict it at all), a value of 1 means a perfect positive correlation, and a value of -1 means a perfect negative correlation. As you can see from the table, out of all of those attributes the benchmark being run most strongly predicts the score, but a number of other factors (such as AA, anisotropic filtering, GPU speed, VRAM speed, etc.) also play important roles.
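For reference, the correlation coefficient here is just Pearson's r between the actual and predicted scores, e.g. (made-up numbers):

```python
import numpy as np

actual = np.array([60.0, 85.0, 120.0, 45.0])     # made-up recorded scores
predicted = np.array([58.0, 90.0, 110.0, 50.0])  # made-up model predictions

# 1 = perfect positive, 0 = no correlation, -1 = perfect negative.
print(np.corrcoef(actual, predicted)[0, 1])
```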

Nite_Hawk
 
Chalnoth: Six different models were built. Three use M5 Rules, and three use Reduced Error Pruning Trees. The three different models are:

Full Data Model - All samples (scores) from all benchmarks are used including 3DMark01 and 3DMark03 scores. This means that there is a mix of FPS type scores along with 3DMark "scores".

FPS Data Model - All FPS-type samples are used, i.e. every score except those from 3DMark01 and 3DMark03.

SS:SE Data Model - Only sample points pertaining to SS:SE are used.

The FPS Data Model and the SS:SE Data Model do pretty well when using M5 Rules, but it appears that the SS:SE model worked the best (and the same holds for the other benchmark-specific models as well).
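In pandas terms the three data sets would just be filters on one table; a tiny sketch with made-up rows (the real table has ~1000 rows and many more columns):

```python
import pandas as pd

# Made-up sample points standing in for the real scraped review data.
df = pd.DataFrame({
    "Benchmark": ["SS:SE", "Quake3", "3DMark01", "SS:SE"],
    "Score":     [85.0,    120.0,    9500.0,     78.0],
})

full_data = df                                                  # everything
fps_data = df[~df["Benchmark"].isin(["3DMark01", "3DMark03"])]  # FPS scores only
ssse_data = df[df["Benchmark"] == "SS:SE"]                      # one benchmark
```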

Nite_Hawk
 
Nite_Hawk said:
Full Data Model - All samples (scores) from all benchmarks are used including 3DMark01 and 3DMark03 scores. This means that there is a mix of FPS type scores along with 3DMark "scores".
Thanks. But I'm wondering: how in the world could any data tree predict (nearly) the exact same score for many different systems? It seems to me that something is throwing off the results. Perhaps the 3DMark "scores"?
 
Hi,

Actually, you hit the nail pretty much on the head. The range of scores when 3DMark01 is included goes from about 8 to 20,000. When that's the case, variances of, oh say, 60fps appear to be small (less than 1%!). But if you look at it from the perspective of Serious Sam, where the average framerate is somewhere around 80-90fps, a 60fps variance is huge. For this reason, the model generated for the Full Data Set is likely sub-optimal. The FPS and SS:SE Data Sets by comparison do much better, even though they have fewer sample points. Another good illustration of this is that for the Full Data Set there is a correlation coefficient of something like .91 when predicting the score just from the benchmark attribute. That appears to be really good until you actually look at graphs like the one in that slide and realize that the model being built is junk.

P.S. To answer your question a bit more specifically, I think what's happening is that it's taking the average SS:SE framerate as a base from which to build the model; normally it would then add or subtract weights depending on certain attributes. But because the range gets so large when the 3DMark scores are added, it likely thinks it has already done an incredibly good job (the variance is under 1%!) and just stops without doing much else.
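The scale problem is easy to see with a couple of made-up numbers:

```python
import numpy as np

# Made-up scores: SS:SE frame rates vs 3DMark01 point totals.
ssse = np.array([80.0, 90.0, 85.0])
threedmark = np.array([12000.0, 15000.0, 18000.0])
combined = np.concatenate([ssse, threedmark])

err = 60.0  # a 60fps miss
print(err / (ssse.max() - ssse.min()))          # 6.0: six times the fps spread
print(err / (combined.max() - combined.min()))  # ~0.003: well under 1%
```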

Nite_Hawk
 
Entropy said:
Heh.
I did something similar but not as focussed three years ago or so.
I used games and 3DMark data, and created a model using 3DMark synthetic scores as a hypothetical set of predictors (Principal Component Analysis - it does a good job with sparse matrices). It turned out that at the time, performance was dominated almost completely by fillrate.
Interesting. Thanks!

I must bookmark the link to your post so I can reference it the next time some zealot claims the Kyro was no good because it didn't have a T&L unit :)
 
To mess with your research a bit, Nite_Hawk - aren't you bound to get some oddities with this kind of model? Such as chipsets seemingly having a greater impact on the score than CPU speed? (Which makes sense from a modelling standpoint, since, for instance, an Opteron chipset will be a more solid predictor than CPU GHz across CPU families at this point.)

I don't know if this is useful to you in any way, but back then, results basically depended on two parameters:
* Fillrate
* Host system performance

The main conclusion I could draw at the time was that all those graphs, in all those reviews, could basically have been replaced by a single run of a game at low resolution, and one run at the highest possible, and you would be able to make OK predictions about untested games. Adding the data from 5-7 additional games at different resolutions would contribute very little to the predictive power for games other than these specific titles. Of course there would have been fewer pretty graphs to look at in the reviews, so the entertainment value would have been limited.

Today, the shader abilities of the GPUs start to matter, and that means that transferability of results between applications will get worse than it was. For reviewers, this means that testing a larger number of titles will get more relevant, sadly because the conclusions to be drawn from the results will be more tightly confined to that particular set of applications.

I still feel that making statistical studies of results can be useful. They can give an indication of what matters and what doesn't overall, and could perhaps help reviewers focus on the areas where their efforts pay off better from an informational point of view. I doubt reviewers and readers are very interested though. Computer reviewing is typically mostly "hardware entertainment" for the readers. Information density is pretty irrelevant in that context.
 
Simon F said:
I must bookmark the link to your post so I can reference it the next time some zealot claims the Kyro was no good because it didn't have a T&L unit :)

Interesting that you should mention the Kyro, because that chip was what made it clear to me that it made more sense to use a high-resolution benchmark run as an indicator of fillrate rather than a synthetic fillrate number, due to differences in how chips handled overdraw. The Kyro was an outlier in that respect of course, but in my opinion the conclusion still holds, even though the differences in the handling of overdraw between cards on the market today are probably smaller.
 
Entropy: Hi,

You make a good observation; a number of the attributes in this data set are not independent. Things that otherwise wouldn't be good predictors are actually encoding information from other attributes (like your example of chipsets encoding information about CPU type/speed). This is actually the primary reason why I haven't been using Bayesian filtering, as it can't deal with these kinds of dependencies.

To get around this and "really" tell how well the chipset predicts the score, you could first examine how well CPU and chipset predict the score individually (probably along with memory speed!), and then see how well they predict it together. You should then be able to see how much redundant information is encoded in each attribute.
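A rough sketch of that individual-vs-joint comparison, with the same hypothetical 'df' and column names as before:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def cv_r2(df, cols, target="Score"):
    """Mean 10-fold cross-validated R^2 for a regression tree on 'cols'."""
    X = pd.get_dummies(df[cols])  # one-hot encode categorical attributes
    return cross_val_score(DecisionTreeRegressor(min_samples_leaf=5),
                           X, df[target], cv=10, scoring="r2").mean()

# If the joint score is barely above the best individual one, the second
# attribute is mostly encoding redundant information (df is assumed to exist):
# print(cv_r2(df, ["CPU"]), cv_r2(df, ["Chipset"]), cv_r2(df, ["CPU", "Chipset"]))
```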

As for prediction, I'm really curious how a multivariate regression model would work out. Unfortunately my statistics background is pretty limited, so I need to take some more classes or at least study up on things a bit more. I'm actually wondering if M5 Rules is doing something like this (I believe you can accomplish multivariate regression by using a number of linear regression models in tandem?)
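For what it's worth, a plain multiple linear regression (several predictors, one response - which is probably what's meant here) is a few lines of numpy; all numbers below are made up:

```python
import numpy as np

# Made-up predictors: GPU clock (MHz), VRAM clock (MHz), resolution (Mpixels).
X = np.array([[325, 310, 1.3],
              [400, 350, 1.9],
              [275, 270, 0.8],
              [500, 400, 2.3]], dtype=float)
y = np.array([78.0, 85.0, 60.0, 95.0])  # made-up fps scores

# Least-squares fit of y = X @ w + b over all predictors at once.
X1 = np.column_stack([X, np.ones(len(X))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(X1 @ w)  # fitted scores (an exact fit here: 4 points, 4 parameters)
```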

Something I've noticed is that across the benchmarks I've studied (UT2003, SS:SE, Quake3, RTCW, 3DMark01, 3DMark03), some are CPU dependent and others are GPU dependent. Memory speed plays into things too, and of course the various quality settings have an effect. When I have time, it would be nice to go through and exhaust the total possible pairings of all attributes (15!) and then see when certain attributes start becoming less important at predicting the score. This should help tell when information is becoming redundant, and which attributes really are the good predictors.
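Enumerating the pairings is cheap; a sketch using the 14 attributes from the single-attribute table above (with 15 attributes it would be 105 pairs):

```python
import itertools

# The attributes from the single-attribute table above.
attrs = ["CPU", "CPU Speed", "RAM Speed", "Chipset", "GPU", "GPU Speed",
         "VRAM Speed", "Driver", "Resolution", "Filtering", "AntiAliasing",
         "Anisotropic", "Benchmark", "Map"]

pairs = list(itertools.combinations(attrs, 2))
print(len(pairs))  # C(14, 2) = 91 pairs; each could be scored with cv_r2 above
```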

Nite_Hawk
 