G80 & R600 Unified Architectures, what's the LCD?

^eMpTy^

Newcomer
I'm the owner/operator of GPUReview.com and I wanted to get some input from the B3D crowd on how best to represent the latest generation of video cards in my database.

Up until now it's been easy: everything could be described in terms of ROPs, fragment pipelines, and texture units. There were, of course, architectural differences that affected performance not described by these numbers, but the effect was not so great as to make the numbers lose all meaning. So having all the specifications to look at was useful.

Now this all seems to have changed: we have R600 with 320 stream processors losing (?) to G80 with 128. So obviously I need to adapt my database to this change, but I'm unsure of how to do so.

So my question is: what's the new least common denominator? What specifications can I use to describe these new cards in a way that conveys some information about their actual power? Is the number of stream processors and their clock speeds sufficient to describe DX10-generation cards at least as well as the number of fragment pipes and clock speeds described DX9 cards? Or are R600 and G80 so different that this kind of comparison just doesn't make sense anymore? Or should I go down to a lower level?
 
In this regard, you have to distinguish things in two major "domains": what is the output capacity and what is the processing power, and then subdivide further from there as needed.
That should cut down on the confusion between different architectures and streamline the comparison.
 
I don't think there's a generic and perfect solution to this. What I thought we might want to do for Beyond3D's 3D Tables is dissociate the chips and the architectures, and what we'd have to enter for the chip info would be architecture-dependent. In addition to that, it would be possible to create "architecture revisions" with only slight changes (such as an increased stencil rate for G84). For example, you'd have NV3x as an architecture, with NV30/NV31/NV34 as the main derivatives, while there would be a revision (which shares most characteristics) for NV35/NV36.

In the end, it's practically impossible to compare two different architectures based on their unit counts. You could compare them based on some rates, though. That's very feasible for TMUs and ROPs (although, as you will find out on Monday, different formats might be handled quite differently on two different architectures!) but not so much for ALUs, where organization and efficiency can vary even more wildly.
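
To make that concrete, here's a rough sketch of what the chip/architecture split could look like on the data side. All the class and field names below are just my own illustration (not an actual B3D or GPUReview schema); the point is that a chip entry references an architecture (or a revision of one), and its unit counts are stored in an architecture-dependent way rather than as fixed columns:

```python
# Minimal sketch of the "architecture vs. chip" split suggested above.
# Class and field names are hypothetical, purely for illustration.
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Architecture:
    name: str                      # e.g. "NV3x", "G8x", "R600"
    notes: str = ""                # free-form description of the ALU/TMU/ROP layout

@dataclass
class ArchitectureRevision:
    base: Architecture
    name: str                      # e.g. "NV35/NV36 revision"
    changes: str = ""              # e.g. "increased stencil rate"

@dataclass
class Chip:
    name: str                      # e.g. "NV31", "G80", "R600"
    arch: Architecture
    revision: Optional[ArchitectureRevision] = None
    units: Dict[str, int] = field(default_factory=dict)       # architecture-dependent unit counts
    clocks_mhz: Dict[str, int] = field(default_factory=dict)  # whichever clock domains apply

# Example entry, using only numbers mentioned in this thread:
g80 = Chip(
    name="G80",
    arch=Architecture("G8x", "scalar stream processors on a separate shader clock"),
    units={"stream_processors": 128},
    clocks_mhz={"shader": 1350},
)
```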
 
So my question is: what's the new least common denominator? What specifications can I use to describe these new cards in a way that conveys some information about their actual power? Is the number of stream processors and their clock speeds sufficient to describe DX10-generation cards at least as well as the number of fragment pipes and clock speeds described DX9 cards? Or are R600 and G80 so different that this kind of comparison just doesn't make sense anymore? Or should I go down to a lower level?

IMO, the best way to compare two architectures is by benchmarks; in fact, that is what they're for!
 
IMO, the best way to compare two architectures is by benchmarks; in fact, that is what they're for!

No, that's the best way to compare the performance of two architectures, which is obviously not what he's interested in.

I think the LCD is somewhat dependent on how you define a shader. If a shader is the unit that works on one object at a time, as we're accustomed to doing, then G80 has 128 of them and R600 has 64. But then you would have to qualify that with the number of components per shader - G80 is 1, R600 is 5. B3D has done this in the past with the additional qualification of co-issue capability. Granted, R600's co-issue capabilities are far beyond the limited set of options available to prior architectures, so it might be difficult to succinctly communicate that fact.

An easier approach might be to consider a shader as a unit that executes a scalar instruction. That way you count 128 for G80 and 320 for R600. But then do you want to count G80's missing MUL, making it 256 for G80? Or do you want to count just MADD instructions? Also, G80 has a separate special function unit, and I think one of R600's 5 slots serves that purpose. So will you end up understating or overstating theoretical performance by using naive unit counts? It will get pretty hairy if you're trying to be overly accurate.

It's going to be hard to find a homogeneous set of categories, like Arun said. Maybe you can have different charts for different architectures with annotations explaining their peculiarities.
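
To see how much the counting convention alone moves the numbers, here's a tiny sketch using only the figures mentioned above (128 scalar SPs for G80, 64 × 5-wide for R600, and 256 if you also count G80's questionable MUL). This is just arithmetic over the counts, nothing more:

```python
# Unit counts under different "what is a shader?" conventions,
# using only the numbers quoted in this thread.

# Convention 1: a shader = the unit that works on one object at a time
vector_processors = {"G80": 128, "R600": 64}          # R600's are 5-wide, G80's are scalar

# Convention 2: a shader = a unit that executes one scalar instruction
scalar_lanes = {"G80": 128, "R600": 64 * 5}           # 320 for R600

# Convention 2b: also counting G80's (possibly missing) MUL as a second lane
scalar_lanes_with_mul = {"G80": 128 * 2, "R600": 64 * 5}

for label, counts in [("vector processors", vector_processors),
                      ("scalar lanes", scalar_lanes),
                      ("scalar lanes incl. G80 MUL", scalar_lanes_with_mul)]:
    print(label, counts)
```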
 
IMO, the best way to compare two architectures is by benchmarks; in fact, that is what they're for!

Oh I completely agree...that's why I go to great pains to list every review I can find for any given card on its associated page in my database. The idea being that you can get the at-a-glance specifications to give you a ballpark figure of the card's performance...and then you can go digging through the reviews to get more specific information.

When you're trying to make a decision between two cards, often you don't need to read several 10 page reviews to figure out what's better. The specs are often more than enough to figure things out.
 
No, that's the best way to compare the performance of two architectures, which is obviously not what he's interested in.

I think the LCD is somewhat dependent on how you define a shader. If a shader is the unit that works on one object at a time, as we're accustomed to doing, then G80 has 128 of them and R600 has 64. But then you would have to qualify that with the number of components per shader - G80 is 1, R600 is 5. B3D has done this in the past with the additional qualification of co-issue capability. Granted, R600's co-issue capabilities are far beyond the limited set of options available to prior architectures, so it might be difficult to succinctly communicate that fact.

An easier approach might be to consider a shader as a unit that executes a scalar instruction. That way you count 128 for G80 and 320 for R600. But then do you want to count G80's missing MUL, making it 256 for G80? Or do you want to count just MADD instructions? Also, G80 has a separate special function unit, and I think one of R600's 5 slots serves that purpose. So will you end up understating or overstating theoretical performance by using naive unit counts? It will get pretty hairy if you're trying to be overly accurate.

It's going to be hard to find a homogeneous set of categories, like Arun said. Maybe you can have different charts for different architectures with annotations explaining their peculiarities.

Yeah, I'm beginning to question if what I'm trying to accomplish is even possible. If I have to describe things in too fine a level of detail to retain accuracy, then it will take longer for people to understand the specs than it would to read a full review...and that hardly makes sense.

Thanks for the excellent feedback!
 
So, after staring at this for a while, it seems like G80 has 128 scalar shader processors which do 1 MADD and 1 MUL per clock. And then each shader processor can also do a special op every 4 clock cycles (sin, cos, log, exp, ?).

R600 seems to have 64 5-way superscalar shader processors that can do 5 MADDs per clock plus 1 special op (presumably also every 4 clocks).

So, breaking this down into the number of MADDs, MULs, and special ops per second, based on the clock speeds of the 8800 GTX and the 2900 XT, we get:

R600
743MHz * 5 MADDs/SP * 64 SPs = 237,760 million MADDs/sec
743MHz * 1 spec-op/SP * 64 SPs = 47,552 million spec-ops/sec

G80
1350MHz * 1 MADD/SP * 128 SPs = 172,800 million MADDs/sec
1350MHz * 1 spec-op/SP * 128 SPs = 172,800 million spec-ops/sec
1350MHz * 1 MUL/SP * 128 SPs = 172,800 million MULs/sec
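
For anyone who wants to poke at these numbers, here's the same arithmetic as a few lines of Python. It uses the same per-clock issue assumptions as above, so it reproduces those figures exactly; whether those assumptions actually hold for the special ops is a separate question:

```python
# Theoretical per-clock throughput, reproducing the arithmetic above.
# Clocks are in MHz, so the results are in millions of ops per second.

def mops(clock_mhz, ops_per_clock_per_sp, num_sps):
    """Millions of operations per second for one op type."""
    return clock_mhz * ops_per_clock_per_sp * num_sps

# R600 (2900 XT): 64 SPs, 5 MADDs + 1 special op issued per clock (assumed)
print("R600 MADDs:    ", mops(743, 5, 64))    # 237,760 million/sec
print("R600 spec-ops: ", mops(743, 1, 64))    # 47,552 million/sec

# G80 (8800 GTX): 128 SPs at the 1350MHz shader clock
print("G80 MADDs:     ", mops(1350, 1, 128))  # 172,800 million/sec
print("G80 spec-ops:  ", mops(1350, 1, 128))  # 172,800 million/sec (ignores the every-4-clocks rate)
print("G80 MULs:      ", mops(1350, 1, 128))  # 172,800 million/sec (if the MUL is usable at all)
```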

From what I've read, the MUL on G80 may not even be operational, so that leads to one of two conclusions to explain R600 being slower than G80:

1. Special operations are a bottleneck (seems unlikely)
2. The scalar processors just really are that much more efficient (also seems unlikely)

What do you guys think? Assuming that performance is mostly based on MADDs, R600 would have to be close to 50% efficient on average to lose to G80 the way we've been seeing in leaked benchmarks. Does this effectively prove that the MUL is doing something? Or is this just really poor conjecture on my part?
 
Naive counts of "things X clockspeed" haven't given a really solid indication of performance since the DX7 level. DX8 cards still followed the general trends well enough to be useful to look at, but by DX9 hardware, with math processing disconnected from texturing and ROPs, there was no easy way to count, add, multiply, and predict performance.
 
Naive counts of "things X clockspeed" haven't given a really solid indication of performance since the DX7 level. DX8 cards still followed the general trends well enough to be useful to look at, but by DX9 hardware, with math processing disconnected from texturing and ROPs, there was no easy way to count, add, multiply, and predict performance.

At this point, I'd be satisfied to just have the numbers loosely make sense. I gave up on predicting performance using specs long ago, I just need to represent R600 on my site in 'naive' terms, and have it not be complete nonsense.

Right now it seems to make sense to just list the pipeline layout and then give a rough count of MADDs/second and that should be sufficient from my standpoint. Except when I do the math there, the numbers seem way off, so I'm wondering if there is a better way to represent R600/G80.

I guess what I'm really asking is, what do you guys perceive to be the bottleneck on R600? It seems to have the shader power and memory bandwidth to beat G80, but it's not happening...any idea why?

I realize I may be asking the impossible...
 
Hard to say. A knee-jerk reaction/answer with no real investigation of the hardware will have some people saying it has insufficient texturing abilities.

Early pre-NDA benchmarks and the frequent alpha/beta driver releases might have some calling out immature drivers, with ATI not yet having a full grasp of the new architecture.

In-depth post-NDA investigations might uncover something completely different.

At this point your guess is as good as anyone's. I'm sure we'll have a slightly better grasp of the situation after reviews come out. However, I suspect a lot of reviews are going to point us at red herrings and be quick to blame one thing or another.

It might be months before it's truly known what is holding it back, or whether anything at all is holding it back. After all, just look at G80: we still don't know if we've seen its full capabilities, or whether it will be a marginal or decent first-gen DX10 part. Not to mention Nvidia still doesn't seem to have a complete grasp of it, or at least of how to make a fully stable Vista 64 driver for it.

Regards,
SB
 
Hard to say. A knee-jerk reaction/answer with no real investigation of the hardware will have some people saying it has insufficient texturing abilities.

Early pre-NDA benchmarks and the frequent alpha/beta driver releases might have some calling out immature drivers, with ATI not yet having a full grasp of the new architecture.

In-depth post-NDA investigations might uncover something completely different.

At this point your guess is as good as anyone's. I'm sure we'll have a slightly better grasp of the situation after reviews come out. However, I suspect a lot of reviews are going to point us at red herrings and be quick to blame one thing or another.

It might be months before it's truly known what is holding it back, or whether anything at all is holding it back. After all, just look at G80: we still don't know if we've seen its full capabilities, or whether it will be a marginal or decent first-gen DX10 part. Not to mention Nvidia still doesn't seem to have a complete grasp of it, or at least of how to make a fully stable Vista 64 driver for it.

Regards,
SB

Excellent points. I think I'll just let it slide for now and see how things shake out over the next couple of weeks.
 
In my opinion, the answer is to neglect theoretical specs altogether, and produce a single composite benchmark score. While this will, unfortunately, be subject to the specific benchmarks chosen, it would be a nice way to compress the information.

One way to do this would be to take one card as the "reference" hardware, and just use the % difference from this reference in each game.

For the 2900 XT benchmarks, for instance, given the price, a good reference hardware choice would be the 8800 GTS. So, let's imagine, for a moment, that the benchmark results are as follows (in 3 different games):

8800 GTS: 24fps, 52fps, 32fps
2900 XT: 32fps, 42fps, 20fps

(this is totally made up, by the way: this is just for the purpose of illustration)

Here, if we normalize the scores to the GTS result, we get:
8800 GTS: 1.0, 1.0, 1.0
2900 XT: 1.33, 0.81, 0.63

Averaged, we get:
8800 GTS: 1.0
2900 XT: 0.92

One could compile similar summary scores for various AA and AF settings.

And yes, this is going to be arbitrary and inaccurate, but there is no way around this problem, so, given the difficulties in picking out one number to use to compare different architectures, this seems, to me, to be a decent way to go about it.
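
As a sketch, the normalize-then-average idea above is only a few lines of code. This just reproduces the made-up example numbers; using a geometric mean instead of the arithmetic mean is a common tweak, but that would be my addition, not part of the suggestion:

```python
# Normalize each card's results to a reference card, then average the ratios.
# The fps numbers are the made-up ones from the example above.

results = {
    "8800 GTS": [24, 52, 32],   # fps in three games
    "2900 XT":  [32, 42, 20],
}
reference = "8800 GTS"

for card, fps in results.items():
    ratios = [f / r for f, r in zip(fps, results[reference])]
    score = sum(ratios) / len(ratios)     # arithmetic mean, as in the example
    print(f"{card}: {score:.2f}")         # 8800 GTS -> 1.00, 2900 XT -> 0.92
```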
 
Well my problem is I don't want to get into the hardware review business, I simply don't have time. My ideal solution would be to set up 3 different test rigs of various CPU horsepower, upgrade them infrequently, and write a program that can run game benchmarks in batches. Then I could just pop in a card, set up the driver, run the batch scripts, collect the results, and have tons of relevant data to compare to every other card I've tested on that platform.

I've really wanted to do something like this for a while, but it would take months of full time work to get it working smoothly, and I just have too much other stuff going on. *sigh* maybe someday though.
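
If I ever get around to it, the batch-runner part at least isn't much code; something along these lines would do it. Everything here is hypothetical: the benchmark executables, command-line flags, and the way each game reports its result would all have to be filled in per title:

```python
# Hypothetical batch benchmark runner: run each configured benchmark,
# grab whatever score it reports, and append a row to a CSV.
# Executable paths, flags, and result parsing are placeholders.
import csv
import subprocess
from datetime import datetime

BENCHMARKS = [
    # (label, command) -- commands are made up for illustration
    ("game_a_timedemo", ["benchmarks/game_a.exe", "-timedemo", "demo1"]),
    ("game_b_flyby",    ["benchmarks/game_b.exe", "-benchmark"]),
]

def run_all(card_name, results_csv="results.csv"):
    with open(results_csv, "a", newline="") as f:
        writer = csv.writer(f)
        for label, cmd in BENCHMARKS:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            lines = proc.stdout.strip().splitlines() if proc.stdout else []
            score = lines[-1] if lines else "n/a"   # placeholder: assume fps is the last line printed
            writer.writerow([datetime.now().isoformat(), card_name, label, score])

if __name__ == "__main__":
    run_all("2900 XT")
```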

For now I think I'm just going to present it in terms of shader processors and when anyone wonders why 128 SPs on G80 are beating 320 SPs on R600, I'll tell them it's because R600's superscalar architecture is far less efficient, and G80's shader clock is nearly double that of R600.

Hopefully this will be at least mildly useful.
 
Or, you could just not worry about running the benchmarks yourself, but instead use the results of other websites (with any relevant citations, of course...don't want to plagiarize here).
 
Or, you could just not worry about running the benchmarks yourself, but instead use the results of other websites (with any relevant citations, of course...don't want to plagiarize here).

I've considered this as well, but then you run into the problem of different drivers, different hardware, different settings, different games, and all kinds of other random factors which would make the information difficult to present. Though I am very seriously considering doing precisely this.

Also I would assume it would be necessary to seek permission from any site I wanted to borrow data from, which could prove problematic.
 
Well, if the point of the original chart was to give some guidance on theoretical betterness, then you could presumably continue to talk about theoretical peaks without discussing the number and speed of the internal units. Fillrate, bandwidth, shader-ops, filtering, etc.
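
A handful of those peaks are straightforward to derive from published specs. A rough sketch with placeholder inputs, using the usual formulas (pixel fillrate from ROPs × core clock, texel rate from TMUs × core clock, bandwidth from bus width × effective memory clock); the example numbers are not any particular card's:

```python
# Theoretical peak rates from basic published specs.
# The input numbers below are placeholders, not a specific card.

def peaks(core_mhz, rops, tmus, mem_mhz_effective, bus_width_bits):
    return {
        "pixel_fillrate_gpix_s": core_mhz * rops / 1000,
        "texel_rate_gtex_s":     core_mhz * tmus / 1000,
        "bandwidth_gb_s":        mem_mhz_effective * bus_width_bits / 8 / 1000,
    }

print(peaks(core_mhz=600, rops=16, tmus=16, mem_mhz_effective=1600, bus_width_bits=256))
# -> {'pixel_fillrate_gpix_s': 9.6, 'texel_rate_gtex_s': 9.6, 'bandwidth_gb_s': 51.2}
```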
 
Also I would assume it would be necessary to seek permission from any site I wanted to borrow data from, which could prove problematic.

Well, no, not at all. Just cite the source. You aren't, after all, copying their results. You're using their results to do your own analysis.
 
I think Damien does a nice job of breaking it down into very basic shader, sample, and filtering rates and bandwidth. But there's still fillrate (pixels) and I guess Z/stencil rate, too. Pretty hairy stuff.

Chal's suggestion is probably the most accurate single reference, but it's still very general and requires a lot of legwork, whereas theoretical #s aren't as labor-intensive but leave more to the (educated) reader's imagination (but even they don't know where every game is bottlenecked).
 
I think Damien does a nice job of breaking it down into very basic shader, sample, and filtering rates and bandwidth. But there's still fillrate (pixels) and I guess Z/stencil rate, too. Pretty hairy stuff.

Chal's suggestion is probably the most accurate single reference, but it's still very general and requires a lot of legwork, whereas theoretical #s aren't as labor-intensive but leave more to the (educated) reader's imagination (but even they don't know where every game is bottlenecked).

Hmm, interesting chart. Why does it show the 7900 GTX and the X1950 XTX as having 8 vec5 pipelines? I thought the X1900 series was vec4 and the 7900s were 4-way superscalar...not to mention R600 isn't vec5, it's 5-way superscalar...or am I just tripping over semantics?

As for compiling benchmarks, it's not that I think I couldn't do it. I've had very good results with all the different scripts I've written to compile data for GPUReview thus far. And I think I could successfully extract benchmark data relatively efficiently. The problem is, in the end, I don't think it would really be all that useful, because none of the data would be comparable since it would all be very heterogeneous in terms of drivers/software/system specs.
 