Mind helping me interpret AoE3 #s?
I'm losing my bearings again, b/c I can't figure out why ATI does so much worse than NV at AoE3. I thought the X1900 proved that AoE3 liked shader power, given that it outperformed the X1800 at the same clocks. Say the X1800 was maxing out its shader units but not its texture units; surely the situation would be reversed with the X1900? If so, won't the X1600, with its similar 3:1 ALU:TMU ratio, leave some shaders dangling while maxing out its pittance of texture units? Yet we see the 6600GT outperforming the X1600XT and the 7600GT doing the same to the X1800GTO.
Looking at HW.fr's GPU table for clues, how can I mesh the math and texture ops #s to determine AoE3's bottleneck(s) WRT mainstream parts? If the X1900 (per HW.fr's table) gains fps just by tripling math ops, can we assume the X1600XT's shaders are well fed and all of its texture units are in use, i.e. it's TMU constrained? So let's compare the X1600XT with the 6600GT. The X1600XT gets 2k tex ops/s and 7k math ops/s; scaled to the same tex rate, the 6600GT would get 2k tex ops/s and 4k math ops/s.
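For anyone checking my arithmetic, here's the units × clock math I'm doing, as a sketch. Unit counts are from public spec sheets rather than HW.fr's exact table, and counting NV43 as two math ops per pipe per clock is a simplification:

```python
# Back-of-envelope throughput = functional units x core clock (MHz).
# Unit counts below are spec-sheet approximations, not HW.fr's exact figures.
cards = {
    # name: (tex units, math ops per clock, core MHz)
    "X1600XT": (4, 12, 590),  # RV530: 4 TMUs, 12 pixel-shader ALUs
    "6600GT":  (8, 16, 500),  # NV43: 8 TMUs, ~2 math ops/pipe/clock (simplified)
}

for name, (tmus, mops, mhz) in cards.items():
    tex, math = tmus * mhz, mops * mhz  # Mops/s
    print(f"{name}: {tex} Mtex/s, {math} Mmath/s, math:tex = {math / tex:.1f}:1")
```

That gives the X1600XT roughly 2.4k tex / 7.1k math (a 3:1 ratio) and the 6600GT a 2:1 ratio, which is where the "2k vs. 4k at the same tex rate" comparison comes from.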
Keeping Xbit's X1900 vs. X1800 #s in mind (roughly similar improvements both w/ and w/o AA+AF), how is the X1600XT's extra shader power not putting on a better show in HW.fr's #s? Is it down to NV's more flexible shader pipe allowing more tex ops, packing in more shader ops with more flexible co-issue (3+1/2+2 vs. ATI's 2+2), FP16 optimizations, better shader compiler, (better) app-specific optimizations, or what?
Let's say the 6600GT is faster b/c of more tex fetches, which is possible with 16xAF. Let's scale according to framerates. The GT is 40-50% faster than the XT, so let's say 3k tex ops/s for the GT. That leaves just 2k math ops/s. If tex ops are the limitation and we're seeing fewer math than tex ops, then why isn't the 16*625MHz=10k X1800XT as fast as the 24*430=10k 7800GTX, and why is it slower than the X1900XT/X?
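For the record, the peak-tex-fetch arithmetic behind that last comparison (TMU count × core clock, again spec-sheet numbers):

```python
# Peak texture fetches per second = TMUs x core clock (MHz).
x1800xt  = 16 * 625  # 10000 Mtex/s -- the "10k" above
gtx_7800 = 24 * 430  # 10320 Mtex/s -- effectively the same peak tex rate
print(x1800xt, gtx_7800)
```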
The 4xAA #s further confound me, as NV loses up to three times as much performance as ATI. ATI's 256b X1800GTO and 128b X1600XT each drop a curiously meager 10%, while NV's 128b 7600GT and 6600GT each drop 30% and its 256b 6800GS drops 16-20%. This is a clue, I'm sure; I just don't know how to interpret it. Is the fact that 16xAF is on for all tests a problem? I could guess that helps NV's more TMU-endowed GPUs shine w/o AA, but that doesn't explain NV dropping more with AA--unless this is ATI's memory controller showing off.
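To pin down "three times as much," here's the quoted-drop comparison spelled out (the 6800GS figure is just the midpoint of its 16-20% range):

```python
# Relative 4xAA performance drops as read off HW.fr's charts.
aa_drop = {"X1800GTO": 0.10, "X1600XT": 0.10,
           "7600GT": 0.30, "6600GT": 0.30,
           "6800GS": 0.18}  # midpoint of the 16-20% range
ratio = aa_drop["7600GT"] / aa_drop["X1800GTO"]
print(f"NV's worst AA hit is {ratio:.0f}x ATI's best case")
```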
Maybe there's just something wrong with the game engine, as Xbit said.