Upcoming ATI Radeon GPUs (45/40nm)

I guess I'll pull out the analogy:

If we have a 1 lane highway where the first car in the lane is running at 30MPH then the overall flow of traffic is 30MPH.

If we have a 4 lane highway where the first car in each lane is running at 30MPH then the overall flow of traffic is still 30MPH, but with 4x the throughput.

Given the organisation of the parallel engine of R600/RV770 then this is a reasonable analogy.
 
Oh, I think Amdahl's Law would apply to GPUs too, but you'd have to look at the non-parallelized portions of the chip to find and point at your bottleneck.
Indeed. This actually points to altering the ratio within each SIMD, which is precisely the thing that hasn't changed in this discussion.
 
RV770 increases parallelism; it does nothing different for sequential processing in relation to R600. The only thing RV770 does differently here is increase the number of completely separate and unrelated threads that can be executed simultaneously.
I was using the more generalized variant of Amdahl's law, which I was pointed to somewhere else on B3D's forum. That variant is only a more formal way of stating diminishing returns. Improving any given component of run time means future improvements there are proportionately smaller.

I contend that R6xx already supplied such a significant amount of ALU resources that more than doubling its capacity is improving on something that was already very good. Even a matching relative improvement on wall clock time devoted to ALU work is a fraction of an already smaller amount.

The TEX component of the workload was not that stellar for R6xx, so its proportion of the overall workload increased.
Improving on a wider slice of the pie has a greater overall effect than improving what is already quite good.
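
To put rough numbers on the diminishing-returns argument, here is a minimal Python sketch of the generalized form of Amdahl's law. The 20/50/30 split of frame time between ALU, TEX and everything else is purely an assumption for illustration, not a measurement of R6xx, and `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(fractions, speedups):
    """Generalized Amdahl's law.

    fractions: share of original run time per component (must sum to 1).
    speedups:  per-component speedup factor; components not listed stay at 1x.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    # New run time is the sum of each slice divided by its own speedup.
    new_time = sum(share / speedups.get(name, 1.0)
                   for name, share in fractions.items())
    return 1.0 / new_time


# Assumed (made-up) split of frame time: ALU is the small slice, TEX the big one.
workload = {"alu": 0.2, "tex": 0.5, "other": 0.3}

alu_only = overall_speedup(workload, {"alu": 2.5})
tex_only = overall_speedup(workload, {"tex": 2.5})
both = overall_speedup(workload, {"alu": 2.5, "tex": 2.5})

print(f"ALU x2.5 only: {alu_only:.2f}x overall")
print(f"TEX x2.5 only: {tex_only:.2f}x overall")
print(f"both x2.5:     {both:.2f}x overall")
```

With this assumed split, the same 2.5x factor applied to the larger TEX slice gives about 1.43x overall, against about 1.14x when applied to the smaller ALU slice. A different split would of course give different numbers.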

We have example designs that have done very well with a lower ALU:TEX ratio, though they show signs of hitting diminishing returns on the TEX side.
 
AA was fixed in RV770. Z-fill rate was fixed in RV770. AF may have been fixed too, although that's less clear.
I expect there are kinds of texturing that run much better on RV770 due to L1 being localised per SIMD - but proving this is pretty hard.

Jawed
 
I contend that R6xx already supplied such a significant amount of ALU resources that more than doubling its capacity is improving on something that was already very good. Even a matching relative improvement on wall clock time devoted to ALU work is a fraction of an already smaller amount.

The TEX component of the workload was not that stellar for R6xx, so its proportion of the overall workload increased.
Improving on a wider slice of the pie has a greater overall effect than improving what is already quite good.
Because of the nature of the execution units' organization and operation, proportionally the workload execution is completely unchanged.

What you are asking for is an increase in Tex processing at a different rate to ALU, which in itself may be a valid opinion. It is not, however, what has happened and this element remains unchanged from the principles of R600, ergo there is no "correction" in this regard.
 
So what I gather from this is that TMU performance really isn't a bottleneck on R600 and RV670... which really turns a lot of things on its head. Given the huge perf increase we're seeing from RV670->RV770 it looks like the engine was the bottleneck. This raises some other questions though. If that was the case, why was R600's shader core limiting performance so drastically? On paper it soundly matched G80 (I'm thinking pure flops here) and yet in apps it fell way short.
 
Question: was RV670 tex-bound a majority of the time (except for extremely AA-heavy or CFAA scenarios)?
R6xx is, in my opinion, more Z-rate bound than TEX bound. Shadowing and MSAA really kill this architecture.

Answer: definitively, resoundingly, yes.
Now apply the same question to RV770 and what happens? We see the opposite. RV770 is far from tex-bound.
RV770 is prolly more TEX bound than R600 simply because it's less likely to be Z-rate bound.

Additionally, the bandwidth increase given to RV770 is theoretically a bit marginal - but the seriously different memory tiling in RV770 makes this extremely hard to evaluate. And besides that, it's proving very hard to find situations when RV770 is significantly bandwidth-bound - though available reviews aren't helping here.

Jawed
 
Because of the nature of the execution units' organization and operation, proportionally the workload execution is completely unchanged.
I'm not sure what you mean by that, can you clarify?
The ALUs can still do work while the TMUs are doing something, right?
In a TEX-limited shader, the ALUs may not be doing much, but by virtue of there being more than double the number of TEX units, the chip can make sure that it can churn through those stretches of the application faster.
For math-limited shaders, the TMU idles when the ALUs are busy.

If the number of ALU-limited game workloads equaled the number of TEX-limited workloads, workload execution would look the same.
I'm not sure that equality exists in current applications, and the uptake in math appears to only be catching up to R6xx in some more recent titles.
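
A toy model of the overlap being discussed, as a Python sketch. It assumes ALU and TEX work overlap perfectly, so a shader's time is set by whichever unit finishes last. The unit counts (320/16 and 800/40) are the ones quoted in this thread; the per-shader op counts are invented for illustration.

```python
def shader_time(alu_ops, tex_ops, alu_units, tex_units):
    """Return (time, limiter) for one shader, assuming perfect ALU/TEX overlap."""
    alu_time = alu_ops / alu_units
    tex_time = tex_ops / tex_units
    limiter = "ALU" if alu_time >= tex_time else "TEX"
    return max(alu_time, tex_time), limiter


OLD = {"alu_units": 320, "tex_units": 16}   # R6xx-style counts from the thread
NEW = {"alu_units": 800, "tex_units": 40}   # RV770-style counts from the thread

# Hypothetical shaders: one texture-heavy, one math-heavy.
SHADERS = {
    "tex_heavy": {"alu_ops": 3200, "tex_ops": 640},
    "alu_heavy": {"alu_ops": 32000, "tex_ops": 320},
}

results = {}
for name, ops in SHADERS.items():
    old_time, old_limiter = shader_time(**ops, **OLD)
    new_time, new_limiter = shader_time(**ops, **NEW)
    results[name] = (old_time, old_limiter, new_time, new_limiter)
    print(f"{name}: {old_time:.0f} ({old_limiter}-limited) -> "
          f"{new_time:.0f} ({new_limiter}-limited), {old_time / new_time:.1f}x")
```

In this model both shaders get 2.5x faster and neither changes its limiter. How much the frame as a whole gains then depends on how the workload is split between the two kinds of shader, which is the part under dispute here.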

What you are asking for is an increase in Tex processing at a different rate to ALU,
I'm not asking that it happen, just that the relative impact of the increase of Tex processing, even if ALU capacity increased by the same factor, is greater when the TMUs started at 16 and the ALUs started at 320.

what has happened and this element remains unchanged from the principles of R600, ergo there is no "correction" in this regard.
I guess we could consider it a happy coincidence, then.
 
(where's jawed and humus when you need them?)

Well, I'm on Dave's side on this. ;)

This is precisely what I'm saying. In order to utilize all those SPs, the TMU count needed to increase. And yes, the utilization rate between RV670 and RV770 will show this if someone would be so kind as to take the time to do the testing.

No, the utilization rate will stay the same. There was no "fix" here, the performance increased because it's just plainly a faster chip, not because it's using its power better. Otherwise you might just as well argue that the 3870x2 has better utilization than the 3870 because it has twice the number of TEX. The only difference is the total performance.

If you have 16 bowls of rice to feed 320 starving kids, you may have trouble feeding them all. If you have 40 bowls of rice and 800 starving kids you can't exactly write to the UN and say "the food shortage is over, that 2.5x increase in rice really did the trick!!", each kid still got the same amount of rice.
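
The rice arithmetic, spelled out as a small Python sketch. The unit counts are the ones used in the analogy (16/320 and 40/800); the dictionary layout and names are just for illustration.

```python
from fractions import Fraction

# TMU ("bowls") and ALU ("kids") counts as used in the analogy above.
chips = {
    "RV670": {"tmus": 16, "alus": 320},
    "RV770": {"tmus": 40, "alus": 800},
}

# Texture units available per ALU on each chip.
tex_per_alu = {name: Fraction(c["tmus"], c["alus"]) for name, c in chips.items()}

# How much each resource grew going from RV670 to RV770.
tex_scale = Fraction(chips["RV770"]["tmus"], chips["RV670"]["tmus"])
alu_scale = Fraction(chips["RV770"]["alus"], chips["RV670"]["alus"])

for name, ratio in tex_per_alu.items():
    print(f"{name}: {ratio} TMU per ALU")
print(f"TEX grew {float(tex_scale)}x, ALU grew {float(alu_scale)}x")
```

Both chips come out at 1 TMU per 20 ALUs, with TEX and ALU each scaled by 2.5x.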

Now, if you argue that the reason RV770 is performing so well is because it's got 2.5x the texturing power, there's probably a good chunk of truth in that with current games. But from what you're saying it sounds like you think this increase magically boosted ALU utilization so that this increase is rather because the ALU power finally is showing through, which I can't make any sense out of. In a mostly texture limited game, an RV770 at 40/800 or for instance 40/600 should perform about the same, don't you agree?
 
This is precisely what I'm saying. In order to utilize all those SPs, the TMU count needed to increase. And yes, the utilization rate between RV670 and RV770 will show this if someone would be so kind as to take the time to do the testing.
ShaidarHaran, I'm not following this very well :) but are you perhaps saying that TMUs do a certain amount of stuff which doesn't involve feeding ALUs? (I don't know what - old-fashioned DirectX5-style texture mapping, perhaps).

If we pretend that there's enough non-ALU-related activity going on in a game to keep 8 TMUs busy, then in R600 we end up with 64 ALU clusters being fed by (effectively) only 8 TMUs (because the other 8 are busy doing something else). That gives you an effective ALU:TEX ratio of 8:1, which may mean that things become ALU-limited as too many ALUs are sitting around waiting for non-existent TMU resources to feed them. But in RV770, with 8 TMUs busy, we have 160 clusters being fed by 32 remaining TMUs, which gives an effective ratio of only 5:1 and thus allows more of the ALUs to actually contribute.
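
The same thought experiment as a Python sketch. The "8 TMUs busy with non-ALU work" figure is the pretend number from the paragraph above, and `effective_alu_tex_ratio` is a hypothetical helper, so treat this as arithmetic on an assumption rather than a description of how the hardware schedules work.

```python
def effective_alu_tex_ratio(alu_clusters, tmus, busy_tmus):
    """ALU clusters per TMU that is actually free to feed them."""
    free_tmus = tmus - busy_tmus
    if free_tmus <= 0:
        raise ValueError("no TMUs left to feed the ALUs")
    return alu_clusters / free_tmus


BUSY_TMUS = 8  # pretend amount of non-ALU texture work, as assumed above

r600_ratio = effective_alu_tex_ratio(alu_clusters=64, tmus=16, busy_tmus=BUSY_TMUS)
rv770_ratio = effective_alu_tex_ratio(alu_clusters=160, tmus=40, busy_tmus=BUSY_TMUS)

print(f"R600:  {r600_ratio:.0f}:1 effective ALU:TEX")
print(f"RV770: {rv770_ratio:.0f}:1 effective ALU:TEX")
```

With zero busy TMUs both chips come out at 4:1, so any difference in the effective ratio comes entirely from the assumed fixed amount of non-ALU texture work.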

It's quite possible that everything I've just said is complete gibberish, of course. :oops: But if not then it means that the practical ALU:TEX ratio is actually substantially lower in RV770 than it was in R600. If what I just said is complete gibberish (and TMUs don't do anything except feed ALUs) then Dave B's take on it would seem to be more accurate: the ratio hasn't changed between R600 and RV770.
 
There is one place where the TEX ratio did change in RV770... TEX:ROP. Since RV770 has 2.5x the texturing power per clock of R600, shaders that were texture-bound would run ~2.5x as fast. Similarly, ALU:ROP ratio has increased by a similar amount so ALU-bound shaders will also run ~2.5x as fast. In other words, the shaders run 2.5x as fast, so you see a big performance gain when you're shader-bound. Surprising, eh? ;)

Note that I am neglecting any improvements to ROP performance in RV770. We can choose a scenario where both chips would be able to output 16 pixels per clock, when not shader-bound.
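
A rough Python sketch of that scenario, assuming output rate is simply the lower of the shader rate and the ROP limit. The 16 pixels per clock ROP limit and the 2.5x shader scaling come from the post above; the per-shader pixel rates are invented.

```python
ROP_LIMIT = 16.0      # pixels per clock for both chips in this scenario
SHADER_SCALE = 2.5    # shader throughput gain per clock, as stated above


def output_rate(shader_rate, rop_limit=ROP_LIMIT):
    """Pixels per clock actually delivered: the slower stage wins."""
    return min(shader_rate, rop_limit)


def speedup(old_shader_rate):
    """Gain from scaling shader throughput while the ROP limit stays fixed."""
    old = output_rate(old_shader_rate)
    new = output_rate(old_shader_rate * SHADER_SCALE)
    return new / old


heavy_shader = speedup(4.0)   # hypothetical: 4 px/clk before, still shader-bound after
light_shader = speedup(8.0)   # hypothetical: 8 px/clk before, hits the ROP limit after

print(f"heavy shader: {heavy_shader}x, light shader: {light_shader}x")
```

The heavy shader sees the full 2.5x, while the lighter one tops out at 2x because it runs into the 16 pixels per clock ceiling.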
 
Are there any numbers to look at?

Jawed
Well, after some brief testing with 3DMark's single-texture feature, it looks like GDDR5 pretty much feeds the RBE's blenders to their near-theoretical maximum, that is ~11300 of 12000 MPix (16*750MHz). Some 200MHz more on the memclock yields ~11500 MPix.

On the other hand, G200's rates are way off for its mammoth array of 32 ROPs. ;)
 
Well, I'm on Dave's side on this. ;)



No, the utilization rate will stay the same. There was no "fix" here, the performance increased because it's just plainly a faster chip, not because it's using its power better. Otherwise you might just as well argue that the 3870x2 has better utilization than the 3870 because it has twice the number of TEX. The only difference is the total performance.

If you have 16 bowls of rice to feed 320 starving kids, you may have trouble feeding them all. If you have 40 bowls of rice and 800 starving kids you can't exactly write to the UN and say "the food shortage is over, that 2.5x increase in rice really did the trick!!", each kid still got the same amount of rice.

Now, if you argue that the reason RV770 is performing so well is because it's got 2.5x the texturing power, there's probably a good chunk of truth in that with current games. But from what you're saying it sounds like you think this increase magically boosted ALU utilization so that this increase is rather because the ALU power finally is showing through, which I can't make any sense out of. In a mostly texture limited game, an RV770 at 40/800 or for instance 40/600 should perform about the same, don't you agree?

How many times do I have to throw "it's not the ratio" at the wall before it sticks?
 
Well, after some brief testing with 3DMark's single-texture feature, it looks like GDDR5 pretty much feeds the RBE's blenders to their near-theoretical maximum, that is ~11300 of 12000 MPix (16*750MHz). Some 200MHz more on the memclock yields ~11500 MPix.
So a memory clock of 2GHz (22% faster) increases blending rate by less than 2%.
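
The arithmetic behind that, as a quick Python sketch using only the figures quoted above (16 pixels per clock at 750MHz, ~11300 and ~11500 MPix measured, 22% more memory clock). The variable names are just for illustration.

```python
PIXELS_PER_CLOCK = 16
CORE_CLOCK_MHZ = 750

peak_mpix = PIXELS_PER_CLOCK * CORE_CLOCK_MHZ   # theoretical blend rate
stock_mpix = 11300                              # measured at stock memory clock
overclocked_mpix = 11500                        # measured with the faster memory clock

efficiency = stock_mpix / peak_mpix             # fraction of peak reached at stock
blend_gain = overclocked_mpix / stock_mpix - 1  # relative gain in blend rate
memory_gain = 0.22                              # memory clock increase, as stated above
scaling = blend_gain / memory_gain              # share of the extra bandwidth that shows up

print(f"peak: {peak_mpix} MPix, stock efficiency: {efficiency:.1%}")
print(f"blend gain: {blend_gain:.1%} for {memory_gain:.0%} more memory clock "
      f"({scaling:.0%} scaling)")
```

That is roughly 94% of peak at stock and under 2% gained from the faster memory, which suggests this test is close to RBE-limited rather than bandwidth-limited.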

Is this way ahead of HD4850?

Jawed
 