ATI's decision concerning TMUs

Gateway2 said:
And that's one reason it has a die nearly HALF the size.

If you have twice as big a die, you ought to be KICKING BUTT in performance, plain and simple. You can fit double the pipelines in double the size.

If Nvidia had 32 pipes they'd probably be spanking R580 at a still far smaller size. They just miscalculated their refresh cycle needs, which was a mistake they can't do anything about now, but it probably won't ever happen again. A missed opportunity for ATI, and you do not get many of those.
Die size is mostly a result of ATI's design goal to make dynamic branching fast. ATI had a very compact shader pipeline in R300 and R420, and adding NV-level PS3.0 functionality wouldn't have cost that much more. Instead, ATI completely revamped the way they did pixel shading for dynamic branching. Doing different things on small batches is much less efficient than doing the same thing on one large batch, whether you're shading pixels or running a manufacturing business.
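(A toy way to see the batch-size point, if it helps anyone: assume every pixel in a batch pays the cost of the most expensive branch path taken by any pixel in that batch, and that branching is spatially coherent, e.g. an expensive path only inside shadowed regions. The cycle counts and the 10% figure below are made up purely for illustration.)

Code:
# Toy model of why batch size matters for dynamic branching.
# Assumption (not from this thread): the whole batch pays the worst-case path.
CHEAP, EXPENSIVE = 4, 40          # hypothetical ALU cycles per branch path
PIXELS = 1 << 20                  # ~1M pixels
RUN = 1000                        # expensive pixels come in contiguous runs...
PERIOD = 10000                    # ...one run every 10000 pixels (10% of screen)

cost = [EXPENSIVE if (i % PERIOD) < RUN else CHEAP for i in range(PIXELS)]
ideal = sum(cost)                 # cost if every pixel branched independently

for batch in (16, 256, 1024, 4096):              # small (R5xx-like) vs large batches
    total = 0
    for i in range(0, PIXELS, batch):
        total += max(cost[i:i + batch]) * batch  # whole batch pays the worst case
    print(f"batch={batch:5d}  cost = {total / ideal:.2f}x the ideal")

The smaller the batch, the less often a cheap pixel gets dragged down the expensive path with its neighbours.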

This would be great if dynamic branching were a feature that could be quickly thrown into pixel shaders. That isn't the case, though. It's very different from all the other features in the past.

We've been doing DP3 lighting since the NV10, and using FP precision and longer shaders to extend the lighting models was quite simple. We've seen post-processing techniques for a long time too, especially in the console space, which translated into HDR techniques. That's why VS1.1, PS1.1, PS2.0, and FP blending were very important.
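(For anyone who never wrote one of these: "DP3 lighting" is just a per-pixel N.L dot product, and the FP-era extensions were mostly a few extra ALU instructions on top of it. A rough sketch, with made-up vectors and a generic Blinn-Phong term rather than any particular game's model:)

Code:
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

N = normalize((0.1, 0.2, 1.0))    # per-pixel normal, e.g. from a bump map
L = normalize((0.5, 0.5, 1.0))    # light direction
V = normalize((0.0, 0.0, 1.0))    # view direction

# DP3-era result: a single dot product, clamped (8-bit fixed point back then).
diffuse = max(0.0, dot(N, L))

# The FP-precision "extension": a Blinn-Phong specular term is only a handful
# of extra ALU instructions on top of the same DP3.
H = normalize(tuple(l + v for l, v in zip(L, V)))
specular = max(0.0, dot(N, H)) ** 32

print(f"diffuse={diffuse:.3f}  specular={specular:.3f}")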

SM3.0 brings techniques that enable very new algorithms with dynamic branching and vertex texturing. NVidia is very slow at these, but hey, it doesn't matter. They made the right decision, because they've got the checkmark, and nobody's using these features - the hallmarks of SM3.0 - in games.
 
geo said:
Well, then that begs the question, Jawed --what is R580's performance constraint? If xbit is right, and gddr4 is only going to add 15%? (which ain't bad for a refresh, but that's not the point I'm after here)
I think it's about the bandwidth mix between the TMUs and ROPs. It's an architecture designed so that everything can be turned on.

ATI seems to show a preference for sacrificing low-eye-candy (no AA/AF) performance in favour of maximising eye-candy (AA/AF) performance :p

Jawed
 
Dynamic branching is included in SM4, yes?

We've heard speculation that NV might encourage devs to compile as many shaders as they can as SM3 in order to continue taking advantage of _pp hints even in putative "sm4" games.

Might ATI do the same in order to get dynamic branching advantages for R5xx-class cards?

Is that even practical?
 
I am sure I am wrong, but it seems to me that ATI decides like the engineer and nVidia like the salesman or marketeer. Regarding SM3, for example, ATI does like the engineer and makes a good implementation, whereas nVidia does like the salesman and barely implements it, then goes out and makes banners saying they have the feature.

LOL am I that off?
 
geo said:
Yeah? Since Cat5.13? Linkage, please?

You asking me? Go read Anand's DVD codec comparisons, complete with screenshots showing the artifacts that occur. Even if they have improved in recent drivers, where's the evidence that their video engine is "much more competent" than NVidia's? That seems like a pretty bold statement to make, especially since they were playing catch-up with NVidia's PureVideo codec. Would you care to explain the competency differences between ATI's and NVidia's HW and SW video processing? (not you geo, but Tahir)
 
DemoCoder said:
You asking me? Go read Anand's DVD codec comparisons, complete with screenshots showing the artifacts that occur. Even if they have improved in recent drivers, where's the evidence that their video engine is "much more competent" than NVidia's? That seems like a pretty bold statement to make, especially since they were playing catch-up with NVidia's PureVideo codec. Would you care to explain the competency differences between ATI's and NVidia's HW and SW video processing? (not you geo, but Tahir)

Link to Anand's article? All the up-to-date reviews of multimedia video quality I've seen put ATI's Avivo far ahead of Nvidia's PureVideo.
 
BRiT said:
Link to Anand's article? All the up-to-date reviews of multimedia video quality I've seen put ATI's Avivo far ahead of Nvidia's PureVideo.

Well, that's after they updated their driver. Back when I bought a 6600GT for a cheap HTPC, this or this or this review page, for example, showed that PureVideo did better than AVivo. Probably as a result of reviews like this, ATI fixed their codec and now does better in, for example, the color test and cadence test. No doubt NVidia will fix theirs. This is a software issue (cadence detection is not done by some fixed-function HW unit that forever dooms you as unfixable).

This is an area where quality is going to flip-flop. Just 4 months ago, people on AVSForum were praising PureVideo over ATI. I honestly haven't kept up, since I already bought my 6600GT back then. But most people understood the problem to be software, and that it would obviously be fixed.

With experience building H.264 codecs still in its infancy, there are going to be numerous issues with the first batch of HDDVD/BR movies, and you are going to see ATI and nVidia go back and forth.

That's why I object to the claim of a "much more competent" video engine, since it seems to suggest a fundamental hardware inferiority. This is not at all like quality differences in AF or AA. If, back in November, I had said that NVidia's video engine is much superior to ATI's, there would have been tons of "but but... wait for updated drivers!" posts. And of course, B3D moderators would rush in with quotes from ATI engineers to assure people. :)

To me, when you talk about a video engine, you're talking about the performance and accuracy of the HW acceleration parts, like motion comp, DCT/iDCT, etc. These fixed-function units can have precision differences which, if a problem is found, can't be fixed except by switching to software decoding for that part of the codec pipeline, or by some other "workaround" like maybe using GPGPU techniques.
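(To make the precision point concrete, here's a rough sketch comparing a floating-point 1-D 8-point iDCT against one whose cosine table is rounded to a fixed-point width. The 9-bit width is arbitrary and does not reflect either vendor's actual hardware; it only shows how a baked-in table precision produces errors you can't patch in a driver.)

Code:
import math, random

def cos_table(bits=None):
    # 8-point iDCT basis; optionally rounded to 'bits' fractional bits.
    table = []
    for k in range(8):
        row = []
        for n in range(8):
            c = (1 / math.sqrt(2)) if k == 0 else 1.0
            v = 0.5 * c * math.cos((2 * n + 1) * k * math.pi / 16)
            if bits is not None:
                v = round(v * (1 << bits)) / (1 << bits)
            row.append(v)
        table.append(row)
    return table

def idct8(X, table):
    return [sum(table[k][n] * X[k] for k in range(8)) for n in range(8)]

float_tbl = cos_table()           # "software decode": full float precision
fixed_tbl = cos_table(bits=9)     # "fixed-function unit": 9-bit table (made up)

random.seed(0)
worst = 0.0
for _ in range(1000):
    X = [random.randint(-256, 255) for _ in range(8)]   # random coefficient rows
    a, b = idct8(X, float_tbl), idct8(X, fixed_tbl)
    worst = max(worst, max(abs(x - y) for x, y in zip(a, b)))
print(f"worst-case output difference over 1000 rows: {worst:.3f}")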
 
I think it's about the bandwidth mix between the TMUs and ROPs. It's an architecture designed so that everything can be turned on.

ATI seems to show a preference for sacrificing low-eye-candy (no AA/AF) performance in favour of maximising eye-candy (AA/AF) performance

Jawed

Huh?

This basically overcomplicates things... I don't even know what it is supposed to mean.

"Bandwidth mix between TMUs and ROPs". What does that mean, specifically?

As far as I'm concerned the problem is a lack of TMUs. I've seen no evidence to suggest otherwise, whereas many, including Xbit Labs, suggest that is the problem.

I'm going to look at some X1900 no-AA benches now... if I'm not mistaken it does a lot better than the X1800 there, which I figured was due to the X1800 being pretty light on brute power. It's true the architecture was designed for AA, but hey, they put a huge memory controller in there, so that's paid for too, and it has nothing to do with the lack of TMUs.

When you're 85% bigger than the other guy, there's plenty of extras to go around.

The sad thing is, I look at all these guys at HardOCP, granted a bit of an Nvidia haven probably, and they all love their big stupid SLI rigs by the thousands. In order to get those guys, you have to be a LOT better if you're ATI, not a little, or Nvidia is "close enough". You beat Nvidia by 40% and those guys HAVE to take notice. You beat 'em by 5% here and lose by 2% there, and they don't care.

ATI had the perfect opportunity to open up a huge performance lead and blew it.
 
ATI seems to show a preference for sacrificing low-eye-candy (no AA/AF) performance in favour of maximising eye-candy (AA/AF) performance

Jawed

And even though this is true... so what?

Nvidia comes within a few percent most of the time with AA and AF.

You can't even say ATI owns FEAR, Call of Duty 2, etc. anymore. Nvidia polishes up some drivers and it's... close enough, win a few, lose a few.

So if ATI maximised for AA and AF... they don't have enough to show for it, because Nvidia is right there trading blows with them with AA and AF turned on. And of course, still easily beating them with it off.

Sure, ATI has the image quality lead... but I'd say at a certain point "good enough" is a better strategy, aka what Nvidia has done. The extra image quality is not worth an 85% bigger die. It's worth maybe, I dunno, 20%?

I don't see ATI doing gangbuster sales over image quality, so if the consumer doesn't care that much, they need to give the consumer what the consumer wants: more speed at the expense of a little bit of image quality, apparently.

I don't know what the sales figures are, but HardOCP's Nvidia forum is much busier, the Valve Steam surveys indicate Nvidia is doing well, and the top-ten video card sales list at Tiger Direct is mostly Nvidia cards.
 
Gateway2 said:
Texture requirements in future games will go up as well, which is bottlenecking R580 currently. So the nicest thing you can say is that R580 will slow down more slowly than the other guy (because it will be less bottlenecked by shaders). Not exactly a ringing endorsement.

Rings good to me. You may want to go back and try playing some new games on R300 and NV30. The more forward-looking architecture of the R300 made it slow down more slowly than the plays-current-games-well NV30. If you got the R300, you can still enjoy most new games.

Of course texture requirements will go up, but not nearly as fast as ALU. Just look back a few years. You may remember a time when "multitexturing" was the latest and greatest. That was when most games did 'texture * lightmap', which would be a 1:2 ALU:TEX ratio. Since then lighting has become more advanced and requires more textures (bumpmap, gloss, shadow etc), but the ALU ratio has gone up much faster, and the typical shader now has more ALU than texture instructions. This trend will most likely continue. The original Radeon had 3 TMUs per pipe, the 8500 had 2, the 9700 had one. Now we're essentially at 1/3 TMU per ALU. The texturing capacity has gone up too, but ALU has increased way more. This is all due to how shading has progressed over the years.
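(Back-of-the-envelope version of that shift, with rough, hypothetical instruction counts just to show the trend rather than any real shader:)

Code:
# Rough ALU:TEX bookkeeping for two generations of pixel shaders.
# The instruction counts are estimates for illustration, not measured data.
shaders = {
    "texture * lightmap (multitexturing era)": {"tex": 2, "alu": 1},
    # per-pixel lighting: normal map + diffuse + gloss + shadow map fetches,
    # plus normalizes, dot products, a pow() and the usual combining math
    "bumped Blinn-Phong (SM2/3 era)":          {"tex": 4, "alu": 14},
}
for name, ops in shaders.items():
    print(f"{name}: {ops['alu']} ALU / {ops['tex']} TEX "
          f"= {ops['alu'] / ops['tex']:.1f} ALU per TEX")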
 
Humus said:
Rings good to me. You may want to go back and try playing some new games on R300 and NV30. The more forward-looking architecture of the R300 made it slow down more slowly than the plays-current-games-well NV30. If you got the R300, you can still enjoy most new games.

Of course texture requirements will go up, but not nearly as fast as ALU. Just look back a few years. You may remember a time when "multitexturing" was the latest and greatest. That was when most games did 'texture * lightmap', which would be a 1:2 ALU:TEX ratio. Since then lighting has become more advanced and requires more textures (bumpmap, gloss, shadow etc), but the ALU ratio has gone up much faster, and the typical shader now has more ALU than texture instructions. This trend will most likely continue. The original Radeon had 3 TMUs per pipe, the 8500 had 2, the 9700 had one. Now we're essentially at 1/3 TMU per ALU. The texturing capacity has gone up too, but ALU has increased way more. This is all due to how shading has progressed over the years.

I understand that.

R580 needs more than 16 TMUs.

It has 48 pixel shaders, and is basically on par with the previous part that had 16 pixel shaders.
 
It's more like 20-30% faster on average. By the end of its lifetime it will probably be something like 2-3x faster.
 
Humus said:
It's more like 20-30% faster on average. By the end of its lifetime it will probably be something like 2-3x faster.


Aaanndd nobody's gonna care then. Nobody will buy them then either... they will buy them or not based on today's benchmarks. Actually, most of the benchmarking is already done at release.

I mean, it's nice and all to have future performance, but again, especially when it can only go down from here, it's not that exciting, and it's a minor consideration compared to current performance.

The mere fact that there is 2x-3x in there, and by your numbers we see 20-30% of it (which might be on the high side), tells you something is very wrong!
 
I haven't seen a "crappy architecture" for quite a long time from the two major IHVs. One would have an extremely hard time proving that either R580 or G71 is an unbalanced architecture. Both have their advantages and disadvantages (well uhhmm what a huge surprise heh...), meaning that the R580 can at times get limited by multitexturing fillrates, while the G71 can at times get bandwidth limited by its way larger MT fillrates. If that oversimplification doesn't work for you, replace it with something else that fits; at the end of the day what matters is what comes out at the other end, and from that perspective both seem to do more than just fine.

Recently a fellow user sent me performance results of two S3 boards vs. a 6600 @ 450MHz, for which I wrote a write-up. Guess what: the S27 loses only where the NV43's MT fillrate can push ahead; other than that the S27 makes a pretty decent showing due to its 700MHz clock frequency:

http://www.3declipse.com//content/view/17/2/
 
Gateway2 said:
The mere fact that there is 2x-3x in there, and by your numbers we see 20-30% of it (which might be on the high side), tells you something is very wrong!
And why is that? ATI increased die size by 20% and board cost by maybe 10-15%. 20%-30% performance gain is awesome. I don't know why you keep harping on the 3x ALU when the transistor cost is a fraction of that.

Get this through your head, Gateway2: Texture units cost much more than ALUs. No way do 8 more R520-like shader units fit in 60M transistors.
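(Running just the figures already quoted in this thread, +20% die and +10-15% board cost for 1.2-1.3x the performance, the perf per mm^2 and per board dollar come out at break-even or better, which is the whole point:)

Code:
# Sanity check using only the figures quoted in this thread: ~20% more die
# area and ~10-15% more board cost than R520, for ~1.2-1.3x the performance.
# R520 is normalized to 1.0 on every axis.
r580_area = 1.20
for perf, board_cost in ((1.20, 1.10), (1.30, 1.15)):
    print(f"at {perf:.2f}x perf: "
          f"{perf / r580_area:.2f}x perf per mm^2, "
          f"{perf / board_cost:.2f}x perf per board dollar")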
 
One would have an extremely hard time proving that either R580 or G71 is an unbalanced architecture.

It's already been proved over and over.

R580 = 3x shader power = 1.15-1.3X performance of R520.
 
Mintmaster said:
And why is that? ATI increased die size by 20% and board cost by maybe 10-15%. 20%-30% performance gain is awesome. I don't know why you keep harping on the 3x ALU when the transistor cost is a fraction of that.

Get this through your head, Gateway2: Texture units cost much more than ALUs. No way do 8 more R520-like shader units fit in 60M transistors.

This means little... does die size normally correlate with a percentage performance increase?

Fine, texture units cost more than ALUs... then either do an Nvidia-like architecture, or don't bother with the ALUs.

36 pixel shaders would have been more than enough... actually, 20 would have been.

This is all irrelevant anyway... The R580 is TMU-bound to a ridiculous extent. That's a fact. The benchmarks do not lie. It is sitting on FORTY-EIGHT of the same pixel pipes that hold up reasonably well clock for clock against every Nvidia pipeline in the past.

This shouldn't be a fight, it should be R580 winning by 60% or more, in every single benchmark on the market today.

Honestly, ATI may have been better off releasing an 80nm speed-bumped R520 and forgoing R580 altogether. The gains would have been the same 15% in performance, and you'd save a lot of trouble and engineering cost along with getting a smaller die.

Put it in perspective for you?
 
Gateway2 said:
It's already been proved over and over.

R580 = 3x shader power = 1.15-1.3X performance of R520.
R580 = same texturing power as R520 = 1.15-1.3X performance of R520
R580 = same pixel fillrate as R520 = 1.15-1.3X performance of R520
R580 = 10-15% more board cost = 1.15-1.3X performance of R520
G71 = 50% faster texturing than R520 = a bit slower performance than R580

And these numbers are improving all the time. Plus, you have an enormous marketing advantage over R520, which would have been toast compared to G71.
 
Get this through your head, Gateway2: Texture units cost much more than ALUs. No way do 8 more R520-like shader units fit in 60M transistors.

How do you know this?

Even so, something like 20 TMUs / 36 shaders would have been a better balance.

Anything would have been a better balance.

And when your die is that huge, adding a little more to actually make it see anywhere near the performance it should is more than worth it. What's 60 million transistors? 20%? That 20% should give R580 a 250% speed increase.
 
G71 = 50% faster texturing than R520 = a bit slower performance than R580

G71 = ~60% shader power of R580
G71 = ~55% die size of R580
G71 = ~performance of R580

Hmm... maybe because it's not texture bottlenecked?

And these numbers are improving all the time. Plus, you have an enormous marketing advantage over R520, which would have been toast compared to G71.

They're actually getting worse over time. Not long ago Nvidia released drivers that made FEAR a toss-up, with Nvidia typically winning at 1280X with 4xAA/8xAF. Oblivion is almost a toss-up.

Seeing as OpenGL belongs to Nvidia, older games belong to Nvidia, and non-AA benches belong to Nvidia, it probably wins 80% of benchmarks against R580. The 20% R580 wins are probably more important, true, but it doesn't change the results.
 