ATI's decision concerning TMUs

Gateway - you're rapidly turning this discussion into a childish debate; either reason your arguments sensibly or don't bother at all.
 
Gateway2 said:
This is all irrelevant anyway... The R580 is TMU bound to a ridiculous extent. That's fact. The benchmarks do not lie. It is sitting on FORTY-EIGHT of the same pixel pipes that hold up reasonably well, clock for clock, against every Nvidia pipeline in the past.
NO IT DOES NOT, YOU FOOL. Only the ALU units are increased. They are NOT the same pixel pipes.

Not every shader is the same. Example: If a third of a game's rendering time on R520 is in shaders that are heavily math-bound, then R580 boosts those shaders by a factor of 3. Your framerate goes up 29%. If you had only 32 math units, it's a 20% increase.

Translation: Just because framerate isn't increased 200% doesn't mean you're never using all 48 math units. Every single component in a graphics chip will have times it is sitting idle and performing at less than its peak rate, even within one scene of one game.
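As a side note, here is a minimal sketch of that arithmetic (illustrative only; it assumes R520's 16 pixel ALUs as the baseline and that only the math-bound portion of the frame scales linearly with ALU count):

```python
# Rough Amdahl's-law style check of the 29% / 20% figures above.
# Assumptions (not from the post): R520 has 16 pixel ALUs, the math-bound
# third of the frame scales linearly with ALU count, the rest is unchanged.

def fps_gain(math_fraction, alu_speedup):
    """Framerate increase when only the math-bound fraction of the
    frame time is sped up by alu_speedup."""
    new_time = (1.0 - math_fraction) + math_fraction / alu_speedup
    return 1.0 / new_time - 1.0

print(f"48 ALUs (3x math): {fps_gain(1/3, 3.0):.0%}")  # ~29%
print(f"32 ALUs (2x math): {fps_gain(1/3, 2.0):.0%}")  # ~20%
```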

Honestly, ATI may have been better off releasing an 80nm speed-bumped R520 and forgoing R580 altogether. The gains would have been the same 15% in performance, and you'd save a lot of trouble and engineering cost along with getting a smaller die.
And what if the process isn't ready? Both NVidia and ATI have access to the same processes. You can put R580 on the same process too if you want. This is a completely useless argument that does nothing to support your case.

Bottom line is this: R580 is more competitive in every aspect than R520, and the increase in ATI's cost for a 1900XT is far less than the increase in performance (present and future), marketability, technology, efficiency, etc. Every one of your points is completely irrelevant in these contexts, which are the only things that matter to ATI.
 
Gateway2 said:
How do you know this?

Even so, something like 20 TMUs and 36 shaders would have been a better balance.

Anything would have been a better balance.

And when your die is that huge, adding a little more to actually make it reach anywhere near the performance it should is more than worth it. What's 60 million transistors? 20%? That 20% should see R580 get a 250% speed increase.

Anything, of course, from a layman's perspective. I'm sure any ATI HW engineer would be able to break down those theories, and he would even know what he's talking about.

Pardon me if the above sounds weird at best, but you're proposing different scenarios without obviously having the slightest idea how to build such a thing. While I won't disagree with you that R5x0 could have used higher MT fillrates overall, the question remains whether it's an unbalanced architecture or not. IMHLO it isn't, and that from a global perspective, counting in the competition's disadvantages and the R5x0's advantages as well.
 
Neeyik said:
Gateway - you're rapidly turning this discussion into a childish debate; either reason your arguments sensibly or don't bother at all.


Err, how have my arguments not been sensible?

I'm stating a point then defending it with benchmarks.
 
NO IT DOES NOT, YOU FOOL. Only the ALU units are increased. They are NOT the same pixel pipes.

Not every shader is the same. Example: If a third of a game's rendering time on R520 is in shaders that are heavily math-bound, then R580 boosts those shaders by a factor of 3. Your framerate goes up 29%. If you had only 32 math units, it's a 20% increase.

Translation: Just because framerate isn't increased 200% doesn't mean you're never using all 48 math units. Every single component in a graphics chip will have times it is sitting idle and performing at less than its peak rate, even within one scene of one game.

Uhh yeah that's great.

R580 is still OBVIOUSLY TMU bound.

When you have an argument, come back. Fact is you can't, because what I'm stating is simple fact, and Xbit Labs, PC Perspective, and many others have pointed out these facts before me.

Or let's go back to X1600... what's your excuse there? Why are those 12 "pixel ALUs" doing so poorly compared to the 12 "pixel ALUs" in the X1800GTO?

What you're saying in the quote is "pixel shaders only help sometimes"... then ATI should have focused on making something else faster. Like, I dunno, texturing.

We already know ROPs aren't the bottleneck, though I suppose you'll get to that claim next.

Why is it that every time Nvidia adds pipes (not pipes, because they are NOTHING like NV40 pipes, but rather pixel ALUs, to borrow your term) they get huge, linear increases? ATI adds 32 "pixel ALUs" and gets nothing. Nvidia adds 8 pixel ALUs (NOTHING like the old pipes, by the way) and gets a 50% speedup or more.

Whatever they are doing is working a lot better.
 
Gateway2 said:
G71=~60% shader power of R580
For the last f***ing time, no. It's never that clear-cut a generalization. In pure math shaders it has anywhere between 50% and 100% of R580's performance. In pure texture shaders it has 150% in theory.
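A back-of-the-envelope sketch of where those percentages come from (illustrative only; it assumes the reference clocks and unit counts of the X1900 XTX and 7900 GTX, i.e. both at 650 MHz core, R580 with 48 pixel ALUs and 16 TMUs, G71 with 24 pixel ALUs and 24 TMUs):

```python
# Theoretical unit-count ratios at equal core clocks (assumed specs, see above).
CLK = 0.65  # GHz, identical for both parts at stock, so it cancels out

r580_alus, r580_tmus = 48, 16
g71_alus,  g71_tmus  = 24, 24

print(f"G71 ALU throughput vs R580: {g71_alus / r580_alus:.0%}")              # 50% lower bound
print(f"G71 texture rate vs R580:   {(g71_tmus * CLK) / (r580_tmus * CLK):.0%}")  # 150%
# G71's second ALU per pipe can narrow the math gap on some instruction
# mixes, which is why the post gives a 50-100% range rather than a flat 50%.
```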

G71=~55% die size of R580
G71=~performance of R580

Hmm... maybe 'cause it's not texture bottlenecked?
I already told you why the die size is smaller. ATI can easily have 10 times the performance of G71 in a shader with dynamic looping. But this won't make it into games for a while. That's where ATI made its mistake.

They're actually getting worse over time. Not long ago Nvidia released drivers that made FEAR a toss-up, Nvidia typically winning at 1280x with 4xAA/8xAF. Oblivion is almost a toss-up.
And did these drivers help R520 as well? :LOL: :LOL: :LOL:
Yet you think ATI should have stuck with that design, maybe just die-shrunk it.

That game was built for NV40/G70 right from the beginning. It's a wonder NVidia did so poorly in the first place, as its shaders have many normalizations (hence they can be tweaked with free FP16 versions), and 50% of the rendering time is in stencil shadows.

OpenGL belongs to Nvidia, older games belong to Nvidia, non-AA benches belong to Nvidia; it probably wins 80% of benchmarks against R580. The 20% R580 wins are probably more important, true, but it doesn't change the results.
More useless points. A different architecture wouldn't help ATI in the OpenGL games, except Doom3, which relates to the stencil stuff again. No-AF benches are completely useless, because enabling AF gives you gobs more detail than upping the resolution. No-AA benchmarks are almost irrelevant to the target market.


Neeyik, I'm inclined to agree with you. Just lock this thread, as it's completely pointless.
 
Gateway2 said:
Err, how have my arguments not been sensible?

I'm stating a point then defending it with benchmarks.

Ironically, the R580 wins against G71 in FEAR by 9% at 1600 and by 14% at 2048 (always at default frequencies) in that FS link you posted :rolleyes:
 
Gateway2 said:
Uhh yeah that's great.

R580 is still OBVIOUSLY TMU bound.

When you have an argument, come back. Fact is you can't, because what I'm stating is simple fact, and Xbit Labs, PC Perspective, and many others have pointed out these facts before me.
And those sites make BS claims all the time. They don't know the workloads. They can't disable the other pipes. They don't know how much more texturing ability could be afforded in exchange for math ability.

Look at how often the X1600XT beats the X700 and the 6600GT, which have double the texture units.

Or let's go back to X1600... what's your excuse there? Why are those 12 "pixel ALUs" doing so poorly compared to the 12 "pixel ALUs" in the X1800GTO?
The X1800GTO is twice the die size, has 40% more bandwidth, and has 4x the fillrate. Great example, genius.

Let me ask you this: why does the X1600XT hand the 6600GT its ass on a platter when it has half the texture units? Other games are closer, but that's precisely my point. Things are never as clear-cut as you're making them.

Read carefully, Gateway2: SILICON ISN'T FREE. If you want them to do more with texturing, then they have to do less elsewhere. What they did in R580 was much better than simply expanding R520 to 20 math+texture shader pipes. It will also serve them better for every game released until R600.

Regarding die size vs. NVidia, texturing has nothing to do with this. Period. We've covered that in many other threads, too.
 
Look at how often the X1600XT beats the X700 and the 6600GT, which have double the texture units.

And they have much less shader power, too. That would be my guess. They're being held back more by lack of shading than the X1600 is by lack of textures. Anyway, that is only one game. I bet a 6600GT gives the X1600 a run for its money a lot of the time, knowing how strong Nvidia is.

Wouldn't you agree the X1600 has been a disappointing part for having 12 shaders? Many websites have commented on this. So disappointing that ATI ended up pushing R520 dies into the mid-range.

I happen to think that's MOSTLY because it only has 4 TMUs, and that seems pretty obvious to me.
 
Gateway2 said:
Texture requirements in future games will go up as well, which is bottlenecking R580 currently. So the nicest thing you can say is that R580 will slow down slower than the other guy (because it will be less bottlenecked by shaders). Not exactly a ringing endorsement.

And by then, R600 and G80 will be merrily doubling its performance anyway. You only get a short time where people actually care about R580, and that is now.
So you're saying the "other guy's" card is going to slow down a lot faster than the R580, and (presumably) will need replacing a lot sooner if you want to play games? Not exactly a ringing endorsement. Especially at the price these cards cost.

By then, R600 and G80 will be merrily doubling its performance anyway, and the "other guy" won't care that he can't play games and has to replace his card sooner.
 
Mintmaster said:
Not every shader is the same. Example: If a third of a game's rendering time on R520 is in shaders that are heavily math-bound, then R580 boosts those shaders by a factor of 3. Your framerate goes up 29%. If you had only 32 math units, it's a 20% increase.

Translation: Just because framerate isn't increased 200% doesn't mean you're never using all 48 math units. Every single component in a graphics chip will have times it is sitting idle and performing at less than its peak rate, even within one scene of one game.

I think what he's saying is that if, on average, the ALUs are idle more than the TMUs (which seems to be the case today), then the architecture would have been better balanced with fewer ALUs and more TMUs (disregarding whether it was practical or feasible for ATi at the time). There's always the "ROPs still tied to TMUs" excuse, but I'm sure ATi could've figured out a way to decouple them if they wanted to.

I constantly wonder whether we'll ever see a game that takes full advantage of R580's shader power while remaining playable in spite of the limited TMU power.
 
Mintmaster said:
Well, die space would explain it (BTW, BW efficiency doesn't make much sense as an explanation IMO). NV40 and G71 have much higher shader and texture unit density than NV30, and you can't do that for free.

Why not? Is aniso sampling usually all in cache?
 
trinibwoy said:
I think what he's saying is that if, on average, the ALUs are idle more than the TMUs (which seems to be the case today), then the architecture would have been better balanced with fewer ALUs and more TMUs (disregarding whether it was practical or feasible for ATi at the time). There's always the "ROPs still tied to TMUs" excuse, but I'm sure ATi could've figured out a way to decouple them if they wanted to.

I don't want to know what happens when any of the crucial units of a GPU remain idle.

I constantly wonder whether we'll ever see a game that takes full advantage of R580's shader power while remaining playable in spite of the limited TMU power.

I personally don't even take that perspective; the part is more than just competitive overall and has quite sound advantages for the time it was designed. According to ATI (if memory serves well), one of the goals for R520 was to increase image quality, and then later on to add a healthy portion of the ALU processing power that R520 actually lacked.

Sadly enough, R520 saw delays for well-known reasons. From that point on, one way (and probably the easiest and cheapest) to further increase R520's performance was to add more ALUs per SIMD channel in R580, and from what I can see that goal has been achieved.
Does ATI really need more performance for the time being?

With their next generation it seems they'll move to a USC, but some of the foundations have already been laid with R580. I wouldn't be surprised if we see some of them in NVIDIA's next-generation parts as well.

Now that leaves one question open: whether architecture X or Y is ahead of its time or not. I'll call it irrelevant as long as X or Y is more than just competitive.
 
Ailuros said:
Does ATI really need more performance for the time being?

Every time this topic comes up, somebody asks this question. But that's too easy an escape. People are simply asking whether R580 would have been more impressive for its time with fewer ALUs and more TMUs. Given the ginormous die size it already has, maybe the trade-off would have resulted in an even bigger and impractical die size, which is fine.

All speculative of course, but it seems to me that R580 will never be a "cheap" card to make and won't gracefully fall into the mainstream - it will just die a high-end death. I really do hope ATi is going to incorporate much of R580 into their USC from a financial standpoint.

The difference between NV40 and R580 is that NV40's checkbox advantages didn't cost Nvidia much transistor real estate - their recent boom was started with the NV40 and they were at a considerable die size disadvantage! Unfortunately for ATi, SM3.0 and SLI were marketable enough to make up the difference in sales - dynamic branching and ring-bus memory controllers are not :|
 
Gateway2 said:
Uhh yeah that's great.

R580 is still OBVIOUSLY TMU bound.
Sometimes it's texture bound.
Sometimes it's bandwidth bound.
Sometimes it's ROP bound.
Sometimes it's vertex shader bound.
Sometimes it's triangle setup bound.
Sometimes it's even ALU bound.

All of these situations happen sometimes. In games. Even during a single frame. But they differ in how often they occur, how much it costs to alleviate the bottleneck, and when you hit the next bottleneck if you remove that limitation.

R520 is ALU bound quite often. The obvious thing for ATI to do was to add more ALUs. It absolutely does not matter that increasing ALU count by 200% doesn't increase performance equally. What matters is whether more ALUs increase performance enough to justify the cost.

Why is it that every time Nvidia adds pipes (not pipes, because they are NOTHING like NV40 pipes, but rather pixel ALUs, to borrow your term) they get huge, linear increases? ATI adds 32 "pixel ALUs" and gets nothing. Nvidia adds 8 pixel ALUs (NOTHING like the old pipes, by the way) and gets a 50% speedup or more.
From NV40 to G70 they added two complete quad-pipelines, extended the secondary ALU of every pipeline, and made them execute instructions independently.
 
Just to let people know you won't be getting any more replies from Gateway2, having just realised who that person actually is.
 
trinibwoy said:
Every time this topic comes up, somebody asks this question. But that's too easy an escape. People are simply asking whether R580 would have been more impressive for its time with fewer ALUs and more TMUs. Given the ginormous die size it already has, maybe the trade-off would have resulted in an even bigger and impractical die size, which is fine.

And it hasn't crossed anyone's mind yet that adding a couple of ALUs was the quickest and cheapest way overall?

How would one add "just" TMUs to an R520 without adding more quads, and what exactly would that have meant in terms of transistor budgets and overall resources? R580 is just a refresh, and while I can understand the reasoning why these kinds of topics appear, it's as nonsensical as asking why G71 hasn't changed A or B compared to G70. Uhmmm, they're refreshes.

All speculative of course, but it seems to me that R580 will never be a "cheap" card to make and won't gracefully fall into the mainstream - it will just die a high-end death. I really do hope ATi is going to incorporate much of R580 into their USC from a financial standpoint.

Well, NV is trumpeting that performance-per-watt advantage because it'll most likely show in the mobile market for high-end GPUs, and that in my mind was the primary reason NVIDIA decided not to further increase the number of quads on the G70 successor.

The difference between NV40 and R580 is that NV40's checkbox advantages didn't cost Nvidia much transistor real estate - their recent boom was started with the NV40 and they were at a considerable die size disadvantage! Unfortunately for ATi, SM3.0 and SLI were marketable enough to make up the difference in sales - dynamic branching and ring-bus memory controllers are not :|

Careful, that's a different and quite large topic; its roots go way back to debates over inflection/deflection points with SM2.0 and SM3.0.

I'll overlook that one too and be so bold as to claim that we'll see "SM3.0 done right" only during the D3D10 era.
 
Neeyik said:
Just to let people know you won't be getting any more replies from Gateway2, having just realised who that person actually is.

Who cares? (OK, OK, call me a hypocrite LOL.) This conversation is still interesting ;)
 
And they have much less shader power, too. That would be my guess. They're being held back more by lack of shading than the X1600 is by lack of textures. Anyway, that is only one game. I bet a 6600GT gives the X1600 a run for its money a lot of the time, knowing how strong Nvidia is.
So now you're saying the 6600 GT is more unbalanced than the X1600 XT, which is less limited by its lack of texture units? The 6600 is just a conventional 8-pipeline card, so if anything you're justifying ATI's choices with the X1600 XT - the 6600 GT's per-pipeline shading performance isn't enough to make the most of its per-pipeline TMU performance, according to you!

The X1600 beats the 6600 by a good 20-50% in DirectX games:

http://www.firingsquad.com/hardware/nvidia_geforce_7600_gt_performance/page6.asp

Wouldn't you agree the X1600 has been a disappointing part for having 12 shaders?
I don't see why we should be making comparisons based on what the card could have been. The X1600 was designed as a replacement for the X700 - it wasn't a true mid-range card, and ATI worked within their transistor budget by going for a 3:1 ALU:TMU ratio. The card wouldn't have fitted within its current price bracket had they gone for a full 12-pipeline design. So for what they wanted to do - replace the X700 and deliver a substantial improvement over the 6600 - the X1600 succeeded, and I think it was an excellent compromise.

The issue is that they never designed a true mid-range chip, for whatever reason, leaving a huge 'price gap' in their lineup, one that has only now been filled with the X1800 GTO. I think they needed an 8-pipeline version of the X1600, ready to go head-to-head with the 7600 GT.

With regard to the R580, as others have pointed out, it only increased the die size by 20%, and yet performance increased by 30% in shader-intensive titles, which made it an excellent investment. The R520, on the other hand, doubled the number of transistors (from the R420) for the same 30% performance boost! So in terms of current real-world performance, the R580 is arguably a far more balanced use of transistors than the R520.
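For what it's worth, a crude sketch of that transistors-versus-performance comparison, using the approximate figures quoted above (illustrative only, not measured data):

```python
# "Performance added per transistor added" using the post's rough numbers:
# R420 -> R520: ~100% more transistors for ~30% more performance
# R520 -> R580: ~20% more transistors for ~30% more performance

steps = {
    "R420 -> R520": {"extra_transistors": 1.00, "extra_perf": 0.30},
    "R520 -> R580": {"extra_transistors": 0.20, "extra_perf": 0.30},
}

for step, d in steps.items():
    ratio = d["extra_perf"] / d["extra_transistors"]
    print(f"{step}: {ratio:.2f} perf gained per unit of transistors added")

# R520 -> R580 comes out roughly five times more "efficient" by this crude
# metric, which is the point about a more balanced use of transistors.
```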

I don't think it's reasonable to expect greater gains, considering the X1900 has the same bandwidth as the X1800. At a certain point you run into diminishing returns - this is what too many X1900 analyses missed: the role of bandwidth, which I'd say is the primary limiting factor, not the number of TMUs.

When the X1900 launched, review sites assumed the TMUs were the principal cause of the holdup, and that the X1900 was obviously unbalanced because it couldn't deliver on its 'paper increases'. And yet when the 7900 GTX launched, what did we see? A card with 18% more clockspeed and yet only 8-10% more performance.
 