G7x vs R580 Architectural Efficiency

Razor1

Veteran
Moderator note: The discussion in this thread was split from the discussion occurring in this thread

Well, the thing is, not all 48 ALUs come free, because of the texture units. If you go with the 1:7 ratio for some of the newer games coming out (overall, games today are running around a 1:3 ratio), nV's ratio seems to be a bit more relaxed, which won't really hurt them for another 6 or so months. And even then nV is keeping up with ATi handily, with a much weaker memory controller. That's the scary part.
 
Bear in mind that, for example, if you're doing nothing but MAD's, the R580 and the G70 are on par. If you make use of more ADD instructions, then the R580 pulls ahead (to the tune of double, if you use enough of 'em). If you instead make use of some FP16 normalize instructions (to the tune of no more than 1 per 2 MAD's) the G70 pulls ahead (quite possibly also to the tune of double: nrm's are fairly expensive).
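To put rough numbers on that, here's a toy issue-rate sketch. The per-clock capabilities below, the "free" FP16 normalize, and the ~3-op cost of a normalize on R580 are simplifying assumptions on my part, not vendor figures:

```python
# Toy issue-rate model for the three mixes described above.
# All capabilities are simplifying assumptions, not official specs:
#   G70 : 24 pipes, 2 vec4 MAD issues per pipe per clock, plus one
#         "free" FP16 normalize per pipe per clock
#   R580: 48 ALUs, each issuing a vec4 MAD and co-issuing a vec4 ADD
#         per clock; a normalize is assumed to cost ~3 ALU issues

G70_PIPES, G70_MADS_PER_PIPE, G70_FREE_NRM = 24, 2, 1
R580_ALUS, NRM_COST = 48, 3

def mads_per_clock(chip, adds_per_mad=0.0, nrms_per_mad=0.0):
    """Useful vec4 MADs retired per clock for a given instruction mix."""
    if chip == "G70":
        # ADDs, and any normalizes beyond the free one, occupy MAD slots
        free = min(nrms_per_mad, G70_FREE_NRM / G70_MADS_PER_PIPE)
        cost = 1 + adds_per_mad + (nrms_per_mad - free) * NRM_COST
        return G70_PIPES * G70_MADS_PER_PIPE / cost
    else:  # R580
        # ADDs ride along on the co-issue port; normalizes do not
        cost = 1 + max(0.0, adds_per_mad - 1) + nrms_per_mad * NRM_COST
        return R580_ALUS / cost

for label, adds, nrms in [("pure MADs", 0, 0),
                          ("1 ADD per MAD", 1, 0),
                          ("1 FP16 nrm per 2 MADs", 0, 0.5)]:
    g = mads_per_clock("G70", adds, nrms)
    r = mads_per_clock("R580", adds, nrms)
    print(f"{label:22s}  G70 {g:5.1f}   R580 {r:5.1f}   R580/G70 {r/g:4.2f}")
```

With these particular assumptions the pure-MAD case comes out even, the ADD-heavy case comes out 2x in R580's favour, and the nrm-heavy case comes out somewhat more than 2x in G70's favour; that last figure obviously hinges on how many ALU slots a normalize really costs on R580, which is guesswork here.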

So it's not horribly misleading. It is, in fact, how nVidia has characterized their own pipelines (even before ATI put forward the R5xx), as having ALU0 and ALU1 in each pipeline.

But all of this stuff really is unimportant. What's important is real-world performance.
 
radeonic2 said:
Phew.. I was a bit worried nvidia really did have 48 alus.
Nvidia marketing never ceases to amaze me.
First the NV30 had 8 pipes and now the G71 has 48 alus.

For G7x you can either say it has 24 ALUs with 8 MADDs/16 FLOPs each, or 48 ALUs with 4 MADDs/8 FLOPs each.

The claim that it has 48 ALUs (which would be valid for G70 as well) isn't wrong per se; the trouble is that they label R580 with 48 ALUs as well, whereby each ALU there shoots out 12 FLOPs instead of the 8 of a G7x ALU.
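Written out as arithmetic, that per-ALU accounting looks like this (just a sketch; the 550/650 MHz clocks are the ones used for the clock-scaling comparisons later in the thread):

```python
# Peak programmable-shader FLOPs from the per-ALU figures above.
# Both ways of counting G7x give the same total per clock:
g7x_per_clock  = 24 * 16        # 24 ALUs x 16 FLOPs = 384 FLOPs/clock
g7x_alt_count  = 48 * 8         # 48 ALUs x  8 FLOPs = 384 FLOPs/clock
r580_per_clock = 48 * 12        # 48 ALUs x 12 FLOPs = 576 FLOPs/clock
assert g7x_per_clock == g7x_alt_count

# Illustrative clocks (550 MHz 7800 GTX-512, 650 MHz X1900 XTX):
for name, per_clock, mhz in [("G7x  @ 550 MHz", g7x_per_clock, 550),
                             ("R580 @ 650 MHz", r580_per_clock, 650)]:
    print(f"{name}: {per_clock} FLOPs/clock -> "
          f"{per_clock * mhz / 1000:.0f} GFLOPS peak")
```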
 
Ailuros said:
For G7x you can either say it has 24 ALUs with 8 MADDs/16 FLOPs each, or 48 ALUs with 4 MADDs/8 FLOPs each.

The claim that it has 48 ALUs (which would be valid for G70 as well) isn't wrong per se; the trouble is that they label R580 with 48 ALUs as well, whereby each ALU there shoots out 12 FLOPs instead of the 8 of a G7x ALU.
I know, but like you said, they were saying the R580 has 48 ALUs as well.
I knew something was up with the transistor count they quoted...
 
trinibwoy said:
Well, 24-pipelines was fine for marketing when they were up against 16 pipelines from ATi. But now that ATi has jumped to 48-pipes/alu's, marketing jumped all over G70's second ALU improvements.

There was a discussion between myself and andypski on this subject a couple days back. It is no more fair to compare G70's 24-pipes to R580's 16-pipes than it is to compare G70's 24-pipes to R580's 48-ALU's or G70's 48-ALU's to R580's 48-ALU's or G70's 48-ALUs to R580's 96. There's no perfect match there due to architectural differences and from what I've seen, what you consider more "right" depends on which side you're batting for.

Good thing none of that matters and it's the performance that counts in the end.
It is very misleading though. Most people will think the G71 doubles shader performance when they see that number. Nvidia is trying to compare their 48 ALUs to ATI's 48 shader processors.

As you mentioned, performance (as well as image quality) is all that matters but it's become common to compare these skewed architectural "advantages."
 
Bear in mind that, for example, if you're doing nothing but MAD's, the R580 and the G70 are on par. If you make use of more ADD instructions, then the R580 pulls ahead (to the tune of double, if you use enough of 'em). If you instead make use of some FP16 normalize instructions (to the tune of no more than 1 per 2 MAD's) the G70 pulls ahead (quite possibly also to the tune of double: nrm's are fairly expensive).

Umm, this makes it seem like they are equal. That's false.

ATI pipes are weaker than NVidia pipes, but they're often pretty close. One figure I've seen batted around on B3D is 86%; in other words, an R580 pipe is about 86% as good as a G70 pipe. That sounds about right to me.

For example, clock for clock the 16-pipe X800 held up pretty well to the 6800 Ultra. The pipes are comparable, though Nvidia does have an edge.

A site also compared R520 and G70 by disabling two quads in the G70, clocking them the same, and benching games. The idea was to get a clock-for-clock comparison. Again, in Direct3D games (actually Half-Life 2) they were pretty close, maybe a 10-20% difference tops. In OpenGL Nvidia pulled away, but that's the case anyway.

So 48 pipes in R580 will trounce 24 in G71 in shaders. Not double, but close.
 
satein said:
It sounds like a fail-proof design for chip manufacturing :cool:. I wonder whether, during the six-month delay of R520, ATi came up with this idea as a way to make yields satisfactory (adding the condition that the final chip should work at a high clock rate to the total quantities passing per wafer). It may sound cumbersome at first, but it would get better with the next design.
This kind of redundancy has to be designed-in right from the beginning. Perhaps there are tweaks (say at the RAM level, because RAM's relatively simple) once statistics on a new process become available.

But the overall redundancy architecture needs to be in place so I can't see how it could be "tacked-on" in 6 months.

R580/RV530, with their 3:1 design, could present an extra level of redundancy, though, if the physical design is really 4:1.

Xenos, as I like to point out, appears to be a 4:1 design with 1 shader array lost to redundancy:

[Image: Xenos die shot (b3d34.jpg)]


In this case redundancy is something like 8% extra die area for one entire shader array. What sort of redundancy there is over the remaining 66% of the die, who knows?... Lots of RAM (those black bits), which is another 8-10% I suppose.

Obviously, these continue to be my guesses...

Jawed
 
Razor1 said:
...and even then nV is keeping up with ATi handily, with a much weaker memory controller. That's the scary part.

Scary???? Given the fillrate (and memory bandwidth) advantage that these new parts have over the ATI parts, it's not scary at all...



geo said:
Where did you get that, please? I'm aware of exactly one credible source for that, and frankly that one struck me as an ATI rep having a "bad speculation day" during the post-NV40 transistor-count head-scratching.

Geo, Dave said that ATI and NV counted transistors differently. That seemed to be some time ago, though, so it may or may not still be relevant...
 
jb said:
Scary???? Given the fillrate (and memory bandwidth) advantage that these new parts have over the ATI parts, it's not scary at all...

The scary part is that nV is still more efficient in shader performance per MHz while having fewer ALUs to work with and less ADD/MUL capability. I don't think ATi has a single win if no AA and AF are involved. Well, maybe one, HL2, and even that is essentially a tie without AA and AF. With AA and AF in the mix it ends up a slight win for the X1900 XTX, and this is the 7800 GTX 512 vs. the X1900 XTX. With ATi focusing on a new memory controller and nV not having changed their memory controller for quite some time, that tells you how efficient nV's pipelines are, or how well they're programmed.

For R600, is ATi going to go through another overhaul of their pipelines and leave their new memory controller as it is, with some tweaks of course? For G80, is nV going to increase shader power per pipe again and improve the efficiency of their memory controller?
 
Razor1 said:
The scary part is that nV is still more efficient in shader performance per MHz. I don't think ATi has a single win if no AA and AF are involved. Well, maybe one, HL2, and even that is essentially a tie without AA and AF.
Shader performance per MHz? So you are saying that R580 is shader-limited in current games? If R580 isn't shader-limited but fillrate-limited, your findings are nonsense, because we usually don't measure shader efficiency in fillrate-limited situations...
 
no-X said:
Shader performance per MHz? So you are saying that R580 is shader-limited in current games? If R580 isn't shader-limited but fillrate-limited, your findings are nonsense, because we usually don't measure shader efficiency in fillrate-limited situations...

How can it be fillrate-limited if the resolution is 800x600?

http://www.xbitlabs.com/articles/video/display/radeon-x1900xtx.html

In all games other than COD2, HL2: Lost Coast with HDR, and SCCT, nV leads with no AA and AF. There is no limitation there other than a possible shader limit. Now think about a 7800 GTX 512 clocked at 650; the lead should be more pronounced, don't you think, at least where it's not CPU-limited. In FEAR, which so far is the most shader-intensive game, nV leads all the way up to 1600x1200, and there is a similar effect in SCCT. So yeah, to some degree ATi has shown a bit less shader limitation, I would agree, but that won't count for much against an increase in clocks from nV's side. A 5% lead in shader-limited titles isn't much considering nV is about to get an 18% boost in clocks. And add to that the fact that their new drivers are going to give quite a big boost in overall shader performance.
 
Razor1 said:
The scary part is that nV is still more efficient in shader performance per MHz while having fewer ALUs to work with and less ADD/MUL capability.
Uh-huh?

If you want to look at some aspects of "shader efficiency" I'll first refer you back to this thread -
http://www.beyond3d.com/forum/showthread.php?t=28497

...all sorts of interesting discussions there as to how to view shader execution and pipeline execution, and how difficult it can actually be to reach a 'fair' conclusion. There's many different ways of looking at that, as we explored.

Nevertheless, I will point out one example from that thread in particular, and ignore all the interesting side issues of "how do we scale this" and "how many ALUs does this architecture really have", to look at one example of per-clock shader performance.

paraphrased from somewhere on page 3 said:
Cook-Torrance with partial precision, scaling for difference in clocks -

R580 performance = 332.4 * 550/650 = 281.3 fps
Per clock performance for R580 versus G70 = 281.3/226.1 * 100 = 124%
So X1900's performance (per clock) seems to be about 24% faster than G70

Now with full precision:
R580 versus G70 = 281.3/162.3 * 100 = 173.3%
So X1900's performance (per clock) seems to be about 73% faster than G70.

So, even when allowed to run entirely in 16-bit precision a G70 at the same clock on this highly ALU-limited shader case loses by around 25%, and when running at the same 32-bit precision as R580 it loses by nearly 75%.

Remarkable/scary? Maybe.

Hmm... maybe I should go back and look at some more of the "texture heavy" cases in the article that is referenced from that thread, with a view of investigating the relative per-clock efficiency of 16 texture units versus 24. That might also be an interesting exercise.
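For reference, the scaling in the quoted Cook-Torrance numbers is nothing more exotic than this (a quick sketch, plugging in the figures from the quote above):

```python
# Clock-for-clock scaling used in the quoted Cook-Torrance comparison:
# bring the R580 score down to G70's clock, then compare directly.

def per_clock_comparison(fps_r580, fps_g70, clk_r580=650.0, clk_g70=550.0):
    """Return R580's score scaled to G70's clock, and the ratio in %."""
    scaled = fps_r580 * clk_g70 / clk_r580
    return scaled, scaled / fps_g70 * 100

scaled, pct = per_clock_comparison(332.4, 226.1)   # G70 at partial precision
print(f"partial precision: {scaled:.1f} fps scaled, {pct:.0f}% of G70")  # ~281.3, ~124%

scaled, pct = per_clock_comparison(332.4, 162.3)   # G70 at full precision
print(f"full precision:    {scaled:.1f} fps scaled, {pct:.0f}% of G70")  # ~281.3, ~173%
```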

I don't think ATi has a single win if no AA and AF are involved. Well, maybe one, HL2, and even that is essentially a tie without AA and AF. With AA and AF in the mix it ends up a slight win for the X1900 XTX, and this is the 7800 GTX 512 vs. the X1900 XTX.
You seem to find it remarkable when the 7800 manages to do well with no AA or AF (when in many current benchmarks these cards are actually limited by the CPU more than anything else), believing that this somehow points to great shader efficiency (why...?), and yet, when anisotropic filtering is involved (and texturing therefore becomes a much bigger component) you simultaneously don't find it remarkable that a design that apparently has 50% fewer texture units actually pulls ahead?

Okay...

And then when AA is involved (and memory bandwidth becomes a bigger component) you also seemingly don't find it "scary" or remarkable that an X1900XTX (which theoretically has 10% _less_ memory bandwidth) pulls ahead of the 7800GTX-512.

Intriguing...

Anyway, really I'm just trying to point out here that there are different ways of viewing things, and there are certainly many factors that contribute to current benchmark performance, shaders being just one. The X1900 shader design is meant to be forward-looking, and our belief is that shaders in games are only going to become increasingly ALU-heavy moving forwards, and that they will also tend to become a more frequent limiting factor. Under those circumstances I think that the X1900's overall per-clock shader power is largely unrivalled.
 
andypski said:
Uh-huh?

If you want to look at some aspects of "shader efficiency" I'll first refer you back to this thread -
http://www.beyond3d.com/forum/showthread.php?t=28497

...all sorts of interesting discussions there as to how to view shader execution and pipeline execution, and how difficult it can actually be to reach a 'fair' conclusion. There's many different ways of looking at that, as we explored.

Nevertheless, I will point out one example from that thread in particular, and ignore all the interesting side issues of "how do we scale this" and "how many ALUs does this architecture really have", to look at one example of per-clock shader performance.



So, even when allowed to run entirely in 16-bit precision a G70 at the same clock on this highly ALU-limited shader case loses by around 25%, and when running at the same 32-bit precision as R580 it loses by nearly 75%.

Remarkable/scary? Maybe.

Hmm... maybe I should go back and look at some more of the "texture heavy" cases in the article that is referenced from that thread, with a view of investigating the relative per-clock efficiency of 16 texture units versus 24. That might also be an interesting exercise.


You seem to find it remarkable when the 7800 manages to do well with no AA or AF (when in many current benchmarks these cards are actually limited by the CPU more than anything else), believing that this somehow points to great shader efficiency (why...?), and yet, when anisotropic filtering is involved (and texturing therefore becomes a much bigger component) you simultaneously don't find it remarkable that a design that apparently has 50% fewer texture units actually pulls ahead?

Okay...

And then when AA is involved (and memory bandwidth becomes a bigger component) you also seemingly don't find it "scary" or remarkable that an X1900XTX (which theoretically has 10% _less_ memory bandwidth) pulls ahead of the 7800GTX-512.

Intriguing...

Anyway, really I'm just trying to point out here that there are different ways of viewing things, and there are certainly many factors that contribute to current benchmark performance, shaders being just one. The X1900 shader design is meant to be forward-looking, and our belief is that shaders in games are only going to become increasingly ALU-heavy moving forwards, and that they will also tend to become a more frequent limiting factor. Under those circumstances I think that the X1900's overall per-clock shader power is largely unrivalled.

Well, since you work at ATi... and I already noted that ATi has spent a good deal of resources improving their memory controller; it seems that went over your head when you read my post. How old is nV's current memory controller, btw? Has it changed much in the last 3 or 4 years?

Also, to what extent is the X1900 shader design forward-looking? It can't compete in games today, or with the shaders that are going to be used in the next year or so, when the X1900 will no longer be around. You tell me how a full screen of occlusion parallax mapping affects the X1900, and then tell me the frame rates achieved when doing this, with every single pixel covered, from low res to high res. And tell me whether the X1900 is capable of pushing this kind of high-level shader in real time, in games that will use a full screen of this shader and then add overdraw to it, particle effects, shadows, etc.

Are you saying ATi's new shader arrays are weaker in older games but will show their power in newer games? That doesn't really make sense, does it? A shader is a shader: if one is being used and certain hardware is more powerful at pushing it, it shows. Well, if that's the case, I would think we would have seen hints of it in FEAR and SCCT; didn't I mention that? Of course you overlooked that. But then you have to factor in the clock deficit, which won't be there with the 7900 GTX, will it?

This is the whole SM 3.0 thing with the GF6 all over again, isn't it?

ATi has one advantage that we have seen so far, and only one: its new memory controller. It still hasn't figured out a way to overcome nV's pixel shader performance, even though it has the ability on paper. And that's all it is so far: on paper. Just like the FX looked good on paper.

Also, the examples you gave in the other thread, steep parallax and fur (I think that one uses it too), use dynamic branching. Let's leave that out for now, since we already know ATi spent a good deal of resources on it and nV hasn't. So it all comes down to this: ATi spent a good deal of effort on improving dynamic branching but didn't pay much attention to anything else, and dynamic branching shaders won't be used in the short term.

Also, how can xbitlabs' results be CPU-limited when they scale with resolution?

Edit: So are you saying ATi has created a new GPU that runs synthetic shader tests great but can't match up with the competing product in a real-world game? Thinking long term is good, but not when you can't compete with what's out there now; I don't think that is ATi's goal at all. I'm talking about real-world games, you are talking about shader tests. nV has the lead in shader-limited situations in real-world games; why is that?

I see how you got your figures: you are considering the X1900 as a 16-pipe card. Yes, it is, but each of its pipes does 3 times the work. So going by 48 ALUs vs. 48 ALUs would be a better estimate of power. Taking your values and cutting them down by a third would be a better assessment of overall efficiency per clock, since you can't compare the pipelines the way you did as if they were doing an equal amount of work.

Going by your calculations, but using the number of ALUs instead of pipelines, this is what I get:

PS2 parallax mapping (partial precision)
X1800 at same clock rate as G70 with same pipe count = 291 * 550 / 625 * 24 / 16 = 384.1 fps
Per pipe performance for X1800 compared to 7800GTX = 384.1 / 462 * 100 = 83.1%
X1900 at same clock rate as G70 with same pipe count = 373 * 550/650 * 24/16 = 473.4 fps
Per pipe performance for X1900 compared to 7800GTX = 473.4 / 462 * 100 = 102.5% (per ALU: 68.3%)

Frozen Glass (partial precision)
X1800 at same clock rate as G70 with same pipe count = 632 * 550 / 625 * 24 / 16 = 834.2 fps
Per pipe performance for X1800 compared to 7800GTX = 834.2/766* 100 = 109%
X1900 at same clock rate as G70 with same pipe count = 683 * 550/650 * 24/16 = 866.9 fps
Per pipe performance for X1900 compared to 7800GTX = 866.9 / 766 * 100 = 113% (per ALU: 75.3%)

G70 wins one test at partial precision by about 20% and loses the other by 9% against an X1800 per-clock per-pipe
By the same metric it loses by 2.5% in one test and 13% in the other against X1900

PS2 parallax mapping (full precision)
Per pipe performance for X1800 compared to 7800GTX = 384.1 / 412 * 100 = 93.2%
Per pipe performance for X1900 compared to 7800GTX = 473.4 / 412 * 100 = 114.9% (per ALU: 76.6%)

Frozen Glass (full precision)
Per pipe performance for X1800 compared to 7800GTX = 834.2/713* 100 = 117%
Per pipe performance for X1900 compared to 7800GTX = 866.9 / 713 * 100 = 121% (per ALU: 80.7%)

G70 wins one test by 7% over X1800 and loses the other by 17% per-pipe per clock
By the same metric it loses to X1900 by 15% and 21% respectively.

And now the "ALU intensive" versions

PS2 parallax mapping (partial precision)
X1800 at same clock rate as G70 with same pipe count = 256 * 550 / 625 * 24 / 16 = 338 fps
Per pipe performance for X1800 compared to 7800GTX = 338.1 / 470* 100 = 71.9%
X1900 at same clock rate as G70 with same pipe count = 619 * 550/650 * 24/16 = 785.7 fps
Per pipe performance for X1900 compared to 7800GTX = 785.7 / 470 * 100 = 167.2% (per ALU: 111.46%)

Frozen Glass (partial precision)
X1800 at same clock rate as G70 with same pipe count = 663 * 550 / 625 * 24 / 16 = 875.2 fps
Per pipe performance for X1800 compared to 7800GTX = 875.2/877* 100 = 99.8%
X1900 at same clock rate as G70 with same pipe count = 1035 * 550/650 * 24/16 = 1313.7 fps
Per pipe performance for X1900 compared to 7800GTX = 1313.7 / 877 * 100 = 149.8% (per ALU: 99.8%)

At partial precision per-pipe per-clock G70 wins one test against X1800 (which runs at full precision) by around 40%, and basically ties the other case.
By the same metric it loses both tests against X1900 by 67% in one test and 50% in the other.

PS2 parallax mapping (full precision)
Per pipe performance for X1800 compared to 7800GTX = 338.1 / 353 * 100 = 95.8%
Per pipe performance for X1900 compared to 7800GTX = 785.7 / 353 * 100 = 222.6% (per ALU: 148.4%)

Frozen Glass (full precision)
Per pipe performance for X1800 compared to 7800GTX = 875.2/773* 100 = 113%
Per pipe performance for X1900 compared to 7800GTX = 1313.7 / 773 * 100 = 170% (per ALU: 113%)
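To make explicit what the two sets of percentages above are doing, here is the same arithmetic side by side for one of the quoted cases (the only difference between the two views is whether the X1900 score is normalized per pipe, 24 vs 16, or per ALU, 48 vs 48):

```python
# Two normalizations of the same X1900 score, using the
# 'PS2 parallax mapping (partial precision)' case quoted above:
# X1900 373 fps at 650 MHz vs a 7800 GTX score of 462 fps
# (taken as the 550 MHz reference).

def normalize(fps, clk, units, ref_clk=550.0, ref_units=24):
    """Scale a score to the reference clock and unit count."""
    return fps * (ref_clk / clk) * (ref_units / units)

g70_fps = 462.0

per_pipe = normalize(373.0, 650.0, units=16)                 # ~473.4 fps
per_alu  = normalize(373.0, 650.0, units=48, ref_units=48)   # ~315.6 fps

print(f"per pipe (24 vs 16): {per_pipe:.1f} fps -> {per_pipe / g70_fps * 100:.1f}% of G70")
print(f"per ALU  (48 vs 48): {per_alu:.1f} fps -> {per_alu / g70_fps * 100:.1f}% of G70")
```

Which of the two normalizations is the "fair" one is exactly what the rest of this exchange is arguing about.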


Shader power is a relative term, since there can be bottlenecks that involve texture ops as well. Looking at these numbers, you can see why nV leads overall in real-world gaming: ATi loses all the tests that are texture-limited, leads marginally in two of the four ALU-intensive shaders, wins one of them outright, and ties the other. Shaders won't change much in the baseline number of texture ops they have to do from this point on, so that will be our shader bottleneck until more ALU-intensive shaders overcome it, and I don't see that happening anytime soon. So nV's 24 TMUs are doing the work necessary to hold on to shader performance. And by the time ALU-intensive shaders overcome the texture bottleneck, these cards won't be powerful enough to handle them.
 
Razor1 said:
...if you go with the 1:7 ratio for some of the newer games coming out...
What games? I just wonder where you got the 1:7 from. Any links?

Razor1 said:
...It can't compete in games today...
I don't understand. The link to xbit you gave has the 1900 XTX as top dog? I want your crystal ball...
 
karlotta said:
What games? I just wonder where you got the 1:7 from. Any links?


I don't understand. The link to xbit you gave has the 1900 XTX as top dog? I want your crystal ball...

Read the edited post ;) 1:7 is pretty close to what Crysis is doing, from what I'm hearing.

That the X1900 XTX is top dog, I would disagree with; when it comes to overall shader power per clock it isn't. It comes down to texture ops vs. ALU calculations, as noted in the edited section of my post.
 
Razor1 said:
Well, since you work at ATi... and I already noted that ATi has spent a good deal of resources improving their memory controller; it seems that went over your head when you read my post. How old is nV's current memory controller, btw? Has it changed much in the last 3 or 4 years?
I'm not privy to inside information about nVidia's memory controller design - I work for ATI, not nVidia, as apparently you have noticed.

Also, to what extent is the X1900 shader design forward-looking? It can't compete in games today, or with the shaders that are going to be used in the next year or so, when the X1900 will no longer be around.
Seems to compete pretty well as far as I can see.
You tell me how a full screen of occlusion parallax mapping affects the X1900, and then tell me the frame rates achieved when doing this, with every single pixel covered, from low res to high res. And tell me whether the X1900 is capable of pushing this kind of high-level shader in real time, in games that will use a full screen of this shader and then add overdraw to it, particle effects, shadows, etc.
Well, let's try for a more interesting case shall we?

Our Toyshop demo makes extensive use of OPM along with other highly complex shaders, as I'm sure you're aware. In addition to this it has particle effects, transparencies, shadows, overdraw, physical simulation on the GPU, and renders it all with extended dynamic range - all things that I would regard as relevant. (It also manages to be just about the most impressive real-time rendering demonstration of which I am aware, but I'm naturally biased towards the great work of our demo team).

On my single X1900XTX here, using a FRAPS runthrough I get an average of 36 fps at 1600x1200 resolution. With 4xAA at the same resolution this drops to 33.5 fps. The minimum frame rate is 25.

I would say that that is eminently playable for many game genres, although perhaps not a twitch-shooter. If you want to include Crossfire in the mix then I would expect we could scale to 60fps+, which would seem enough for twitchy FPS play as well.

I believe that it would be intriguing and enlightening to see the frame rate of a 7800GTX-512 running this scene with the same level of quality, with a version of the demo as optimised as possible for that architecture. It might be less enlightening to see it run our version of the shaders - they do use dynamic branching after all, so...

Are you saying ATi's new shader arrays are weaker in older games but will show their power in newer games? That doesn't really make sense, does it? A shader is a shader: if one is being used and certain hardware is more powerful at pushing it, it shows. Well, if that's the case, I would think we would have seen hints of it in FEAR and SCCT; didn't I mention that?
It makes perfect sense as far as I can see.

"A shader is a shader" - what a comment - two different architectures will have very different characteristics running two different programs - as I recall from the way-back machine if you took a Pentium and ran it against a Pentium Pro(/2) on pure hand-tuned floating-point math code, guess what - the Pentium was often faster than the newer CPU at the same clock. Why? Because the latencies of FP instructions increased on the Pentium Pro, but the throughput remained the same. But then you run them on a more typical code mix, or code not specifically hand-tuned for either and the Pentium was often heavily beaten. Why? Because the Pentium Pro had out-of-order execution, and better branch prediction etc. so on general code it won convincingly.

A shader is not just a shader. Context and instruction mix are very important, and as shaders get longer then you may get to see more elements of pure shader performance coming through in final benchmarks.

That being said I don't think that I really overlooked that much - I mentioned that shader performance is only one of many factors dictating performance in current benchmarks, and then made a statement of my beliefs of how the balance will change in the future (beliefs, not a statement of fact - I think I was reasonably clear on this point).

Of course you overlooked that. But then you have to factor in the clock deficit, which won't be there with the 7900 GTX, will it?
In the example that I used above I scaled the performance by the differences in clock-rate for the purpose of the comparison. Maybe you overlooked that?
Also, the examples you gave in the other thread, steep parallax and fur (I think that one uses it too), use dynamic branching. Let's leave that out for now, since we already know ATi spent a good deal of resources on it and nV hasn't. So it all comes down to this: ATi spent a good deal of effort on improving dynamic branching but didn't pay much attention to anything else, and dynamic branching shaders won't be used in the short term.
How convenient to just leave it out - after all, "a shader is just a shader". How convenient to suddenly choose to think only in the short term, and ignore the future at the same time as implying that we are not being forward looking with X1900.

Non-sequitur.

The Cook-Torrance example is pure ALU - no branching whatsoever, so I would like to believe that we paid plenty of attention to arithmetic performance as well. And you yourself indicated that we do well with anisotropic filtering, which would seem to mean that we paid reasonable attention to that. Remind me what's left in terms of shading performance again, and where didn't we pay attention?
Also, how can xbitlabs' results be CPU-limited when they scale with resolution?
If the benchmarks in question are scaling then they are obviously not completely CPU limited (I didn't quote any specific benchmark in my previous post). However, if a benchmark scales then is it necessarily highly shader limited? Doom3 scales at very high-res, but is not really heavily shader-limited, it's heavily shadow-volume rendering limited.

There can be, and are, many other potential limitations that come into play. We believe, based on our research, that shading will become more important over the lifetime of the X1900 architecture. Time will tell if we were right.

[edit] I see you later added a whole load of the numbers from that thread. I said in my original post that I was ignoring the cases from the original thread where we were discussing "What counts as an ALU" or "How many pipelines do they have" due to the aforementioned difficulties in reaching a common consensus on those points. I believe that the Cook-Torrance example that I used from that thread was simply X1900 versus G70, scaling for clockrate alone - X1900 versus G70 on shader execution clock-for-clock.[/edit]
 
I'm not privy to inside information about nVidia's memory controller design - I work for ATI, not nVidia, as apparently you have noticed.


Seems to compete pretty well as far as I can see.

Well, let's try for a more interesting case shall we?

Our Toyshop demo makes extensive use of OPM along with other highly complex shaders, as I'm sure you're aware. In addition to this it has particle effects, transparencies, shadows, overdraw, physical simulation on the GPU, and renders it all with extended dynamic range - all things that I would regard as relevant. (It also manages to be just about the most impressive real-time rendering demonstration of which I am aware, but I'm naturally biased towards the great work of our demo team).


On my single X1900XTX here, using a FRAPS runthrough I get an average of 36 fps at 1600x1200 resolution. With 4xAA at the same resolution this drops to 33.5 fps. The minimum frame rate is 25.

So that's fast enough to run on today's hardware alongside everything else a game will need? With Crossfire in the mix it is, but I don't see that happening with everything else going full blown in a real-world game. We are in the process of making a demo with the Cry engine, and even regular parallax bump mapping takes its toll.

It makes perfect sense as far as I can see.

"A shader is a shader" - what a comment - two different architectures will have very different characteristics running two different programs - as I recall from the way-back machine if you took a Pentium and ran it against a Pentium Pro(/2) on pure hand-tuned floating-point math code, guess what - the Pentium was often faster than the newer CPU at the same clock. Why? Because the latencies of FP instructions increased on the Pentium Pro, but the throughput remained the same. But then you run them on a more typical code mix, or code not specifically hand-tuned for either and the Pentium was often heavily beaten. Why? Because the Pentium Pro had out-of-order execution, and better branch prediction etc. so on general code it won convincingly.

A shader is not just a shader. Context and instruction mix are very important, and as shaders get longer then you may get to see more elements of pure shader performance coming through in final benchmarks.

Also, I mentioned that if a GPU is made for a certain shader, it will show ;)

If the benchmarks in question are scaling then they are obviously not completely CPU limited (I didn't quote any specific benchmark in my previous post). However, if a benchmark scales then is it necessarily highly shader limited? Doom3 scales at very high-res, but is not really heavily shader-limited, it's heavily shadow-volume rendering limited.

There can be, and are, many other potential limitations that come into play. We believe, based on our research, that shading will become more important over the lifetime of the X1900 architecture. Time will tell if we were right.

The problem is you did all your numbers on a per-pipeline basis, and they can't be taken that way, since each pipeline is doing a different amount of work. And it will come down to this: newer GPUs will need both more TMUs/ALUs that can do texture ops and more ALUs. So the X1900 pipelines, although they look great on paper, don't have the ability to perform enough texture ops to really outpace the G70's per clock. Not every game engine is made like the Toyshop demo ;). They will be using many more effects than what was shown there.
 
Razor, just because nvidia's memory controller is "old" it should be ignored, and just because nvidia hasn't spent a lot of time on dynamic branching, that should be ignored too???

If you're going to argue, at least be reasonable enough to take everything applicable into consideration. If nvidia has a lead without AA/AF and that lead turns into a loss with AA/AF applied, that's not a convincing argument for an efficient architecture. Your idea of efficiency is to pick and choose areas where Nvidia does well, but you're quick to disregard its flaws or Ati's strong points since they don't support your efficiency theories.

We all know that nvidia helped your company with development whereas Ati didn't even reply to the request, but you gotta let that go at some point...
 
Razor1 said:
The problem is you did all your numbers on a per-pipeline basis, and they can't be taken that way, since each pipeline is doing a different amount of work. And it will come down to this: newer GPUs will need both more TMUs/ALUs that can do texture ops and more ALUs. So the X1900 pipelines, although they look great on paper, don't have the ability to perform enough texture ops to really outpace the G70's per clock.
Read it again - The Cook-Torrance case that I used in this thread is not scaled by pipeline count, only clocks. You are the one who quoted all the results from the other thread where scaling was being done for pipeline count, which I explicitly left out of this discussion for reasons explained above.

If you would like to look at another example of pure ALU performance you can do the same scaling experiment with the Perlin Noise test from the same review.
 
andypski said:
Read it again - The Cook-Torrance case that I used in this thread is not scaled by pipeline count, only clocks. You are the one who quoted all the results from the other thread where scaling was being done for pipeline count, which I explicitly left out of this discussion for reasons explained above.

If you would like to look at another example of pure ALU performance you can do the same scaling experiment with the Perlin Noise test from the same review.

OK, then don't multiply and divide your results by the pipeline count to get your percentages, and see what you get. You will end up with the same percentages I just got ;) since I used 48/48 where you used 24/16, and multiplying by 1 won't change the outcome ;)
 