Upcoming ATI Radeon GPUs (45/40nm)

ShaidarHaran · Jul 20, 2008

Sound_Card said:
So then you would agree that z fill was the main bottleneck?

Sure, if you or Mintmaster or someone else wants to come up with definitive numbers that show this, but I have yet to see it, and so far the largest single bottleneck appears to be texture rate (38%). Feel free to prove me wrong.

no-X · Jul 20, 2008

I'd say the main bottleneck was caused by absence of properly working ROP-based resolve. HD2900XT is 50% faster than X1950XTX, but only 15% faster when AA/AF applied (AF isn't a problem when comparing to R580; both of them have similar performance drops). Given the BW of 512bit bus, HD2900XT with HW resolve would be about 50-65% faster than X1950XTX when using AA/AF. I don't think that any other architectural change would boost performance in like manner. The second bottleneck seems to be Z-ops and the third (maybe) pure fillrate. Anyway, ROP-based resolve would be enough to bring R600 to G80 level, imo.

Pantagruel's Friend · Jul 20, 2008

ShaidarHaran said:
Sounds like a reasonable test case to me. Would you mind posting that in the other thread?

You mean the one called "Call for testing"?

ShaidarHaran · Jul 20, 2008

Pantagruel's Friend said:
You mean the one called "Call for testing"?

Yes, please.

Mintmaster · Jul 20, 2008

ShaidarHaran said:
I don't know what results to expect. If I did, I would've simply provided them already.

Do you know anything about science? What's the point in any test if you don't know what results would support or refute your hypothesis?

Were I able to do so, there would be no need for the thread I have created.

So then you have zero basis for suggesting RV670 is texture limited.

A bottleneck that is present 38% of the time is hugely significant, given just how many potential bottlenecks there are in RT graphics rendering. I find your omission of percentages of other bottlenecks to be rather curious, given this fact, and believe your analysis to be disingenuous in this regard.

First of all, I'm saying that 38% is the absolute maximum, and only holds if you assume that the 3870 can do all non-texturing related tasks as fast as the 4850. That just isn't the case, because games where AF has near-zero impact (and hence are probably rarely texture limited) still show huge gains for the 4850. My guess is that it's something like 20% for texturing.

Second of all, I don't have the data to list other percentages. I used only two data points - 3870 and 4850 no-AA scores - and made two assumptions: the 4850 is equal to or faster than the 3870 in the cumulative non-texturing-limited loads, and 2 times faster in the texture limited loads. Actually, I made a mistake last time (I used 2.5x), so the figure I get is <46%. Just to show you how conservative that is, if we assume the 4850 is 20% faster in the total non-texturing-limited load (which is quite reasonable), then the texturing-limited figure drops to 19%.
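The arithmetic behind these bounds can be sketched with a two-part frame time model. This is only an illustration: the function name is made up, and the 0.77 time ratio is an assumption back-solved from the <46% figure above, not a value read off the hardware.fr graphs.

```python
def texture_limited_share(time_ratio, tex_speedup=2.0, other_speedup=1.0):
    """Share of 3870 frame time that is texture limited.

    Assumed model (not measured data):
        time_ratio = (1 - t) / other_speedup + t / tex_speedup
    where time_ratio is 4850 frame time divided by 3870 frame time
    and t is the texture-limited share being solved for.
    """
    a = 1.0 / other_speedup
    b = 1.0 / tex_speedup
    return (a - time_ratio) / (a - b)

# Hypothetical ratio, back-solved from the <46% bound rather than measured.
TIME_RATIO = 0.77

# Non-texturing loads no faster on the 4850: the upper bound.
upper_bound = texture_limited_share(TIME_RATIO)
# Non-texturing loads 20% faster on the 4850.
with_20pct = texture_limited_share(TIME_RATIO, other_speedup=1.2)
# The earlier (mistaken) 2.5x texturing speedup.
old_2_5x = texture_limited_share(TIME_RATIO, tex_speedup=2.5)

print(f"upper bound: {upper_bound:.0%}")
print(f"with 20% faster non-texturing: {with_20pct:.0%}")
print(f"with the 2.5x assumption: {old_2_5x:.0%}")
```

The faster the 4850 is assumed to be outside texturing, the less of its overall gain is left to attribute to texturing, which is why the figure falls so quickly.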

Anyway, I figured something else out below.

(BTW, stop misquoting me. I never said texturing-limited load is 38%.)

Incorrect. RV770 does not have "faster math speed", it has more math resources working in parallel. This is an important distinction.

No, it isn't important at all. RV770 can crunch through math loads way faster than RV670, hence it has way faster math ability. That's all that matters, especially in the sentence it was used in.

Pardon? If bottleneck A is a bottleneck for less time than bottleneck B, would that not make bottleneck A less significant?

Remember that I don't share your opinion that bottleneck A (e.g. setup or even BW) is a bottleneck for less time than bottleneck B (texturing).

Feel free to show me another bottleneck that accounts for > 38% and I'll shut my big fat mouth

Like I said, it doesn't have to be 38% (or 46%) to be bigger than texturing because that's an upper bound, but I'll show one anyway.

Per-frame loads that are independent of resolution - which are mostly setup limited - account for 40% of the render time on the 3870 according to the same hardware.fr graphs (no-AA). Some of that will be shadow map rendering which is always either setup or z-fill limited.

I'm assuming pixels take about the same time to process at both resolutions, which is pretty reasonable considering that there's only 14% more pixels in each direction. With only two resolutions in the data I can't do much else anyway.
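A two-resolution fit of this kind can be sketched as follows. The frame times are made-up placeholders, not the hardware.fr numbers, and the linear model (frame time = fixed cost + cost per pixel) is the assumption described above.

```python
def fixed_share(t_low, t_high, pixel_ratio):
    """Fraction of the low-resolution frame time that does not scale with pixels.

    Assumes frame time = fixed + per_pixel * pixels at both resolutions,
    with pixel_ratio = pixels at high res / pixels at low res.
    """
    # Pixel-bound time at the low resolution, from the two data points.
    pixel_time_low = (t_high - t_low) / (pixel_ratio - 1.0)
    return (t_low - pixel_time_low) / t_low

PIXEL_RATIO = 1.14 ** 2      # 14% more pixels in each direction
T_LOW, T_HIGH = 10.0, 11.8   # hypothetical frame times in ms

share = fixed_share(T_LOW, T_HIGH, PIXEL_RATIO)
print(f"resolution-independent share: {share:.0%}")
```

With only two resolutions there is exactly one line through the data, so the fit cannot tell whether the per-pixel cost really is constant.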

It's an IQ enhancer. If you lack AA capability you can simply choose not to use it. The same can't be said for texturing capabilities.

Yes, it can. Texturing ability is also just "an IQ enhancer", because you can simply choose a lower resolution if you lack texturing or shader ability.

R300 was primarily math-bound.

How do you know?

resulting in a chip which was bottlenecked by its math rate less often than the previous chip.

How do you know?

Jawed · Jul 20, 2008

hoom said:
I think (apart from the unit shrinking & AA changes which themselves are very impressive) the real big story with RV770 is in the cache changes, both that they get their own dedicated bandwidth and I think also the separation of the vertex cache.

R6xx has a dedicated L1 vertex cache, i.e. separate from the L1 texel cache.

What I don't understand is whether both L1s fetch data through the L2. Or if L2 only serves texels.

What's interesting about the caches in RV770 is that the increase in internal cache bandwidth (~2x) is less than the increase in capability (2.5x). I think RV770 has double the size of L1 texel cache but I'm not sure. I think the coherency of texels in RV770's L1s is far better - which I think lowers the count of L2 fetches.

Jawed

Jawed · Jul 20, 2008

Mintmaster said:
The same thing goes for shaders with texture heavy parts and math heavy parts. The overall ratio is all that matters because there are enough batches in flight to statistically even this out.

No. If you have a shader that uses 10 registers then you only have 25 batches in flight, which is 100 clocks of latency hiding - about half the number of threads required to hide memory latency. So if there's a section of the shader with 2-level dependent texturing coupled with a low ALU:TEX ratio, then that part of the shader is going to bottleneck in a way that's not represented by the shader as a whole - the cluster simply runs out of threads.
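The numbers in this reply can be reproduced with a small sketch. All three constants are assumptions inferred from the post rather than confirmed hardware figures: a budget of 256 registers per thread slot, 4 clocks to issue one batch, and roughly 200 clocks of memory latency (implied by "about half").

```python
REGISTER_BUDGET = 256         # assumed registers available per thread slot
CLOCKS_PER_BATCH = 4          # assumed clocks to issue one batch on a SIMD
MEMORY_LATENCY_CLOCKS = 200   # assumed, implied by "about half" in the post

def batches_in_flight(registers_per_thread):
    """Batches that fit when each thread needs this many registers."""
    return REGISTER_BUDGET // registers_per_thread

def latency_hidden(registers_per_thread):
    """Clocks of latency the resident batches can cover."""
    return batches_in_flight(registers_per_thread) * CLOCKS_PER_BATCH

batches = batches_in_flight(10)
hidden = latency_hidden(10)
coverage = hidden / MEMORY_LATENCY_CLOCKS
print(batches, hidden, coverage)
```

Under these assumptions a 5-register shader would have twice as many batches resident and cover the full assumed latency, which is the dependence on register count the reply is pointing at.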

Jawed

Jawed · Jul 20, 2008

no-X said:
I'd say the main bottleneck was caused by absence of properly working ROP-based resolve. HD2900XT is 50% faster than X1950XTX, but only 15% faster when AA/AF applied

HD2900XT has 2x the non-MSAA Z fillrate of X1950XTX per clock, so comparing 50% and 15% is pointless.

MSAA resolve is a tiny proportion of frame rendering time (or the time per render target if there are multiple targets produced per frame). Admittedly if multiple MSAA'd shadow maps are generated per frame then any lack of performance caused by a software resolve is magnified as a proportion of the entire frame rendering time.

Jawed

ShaidarHaran · Jul 21, 2008

Mintmaster said:
Do you know anything about science? What's the point in any test if you don't know what results would support or refute your hypothesis?

That was rather uncalled for... I don't know what RESULTS to expect. If I did, it wouldn't be a hypothesis but fact. I've already stated my hypothesis, more times than I care to count.

Mintmaster said:
So then you have zero basis for suggesting RV670 is texture limited.

Well, I have the fact that ATi has utilized only 16 TMUs since R420. Even without correlating results, one can look at this fact and speculate (after all, what is a hypothesis?) that R6xx-generation parts were very much texture-bound.

Mintmaster said:
No, it isn't important at all. RV770 can crunch through math loads way faster than RV670, hence it has way faster math ability. That's all that matters, especially in the sentence it was used in.

It is capable of processing more math instructions per clock cycle. That is not the same as getting an instruction done in fewer clock cycles. I know you can see the difference here.

Mintmaster said:
Yes, it can. Texturing ability is also just "an IQ enhancer", because you can simply choose a lower resolution if you lack texturing or shader ability.

You've got to be kidding me. If a chip is texture-bound you can't just turn off textures like you can AA. You can minimize the effects of poor texturing performance, but to what degree? You're still going to have some bottleneck there. Turning off AA means no AA perf bottleneck.

Mintmaster said:
How do you know?

Because I've followed this industry very closely over the years and remember specific comments by ATi employees at the time referring to this very fact.

Mintmaster said:
How do you know?

Because that's why they made the changes they made moving from R300 to R420...

ShaidarHaran · Jul 21, 2008

Sharkfood posted some interesting AF/no AF numbers from 3dmark03 here.

What do you all make of this?

Dave Baumann · Jul 21, 2008

You've not actually paid attention to this thread, have you?

BTW - Comparing no-AF/AF relative performance differences doesn't really tell you much. For one, the caching mechanisms of the chips are completely different, and this affects AF significantly. Additionally, RV770 has 32 texture interpolators and 40 texture units, meaning, again, that it does a max of 32 bilinear filters, while it can use all 40 in AF scenarios; so comparing no-AF/AF results between this and R6xx doesn't really give you a comparison.

OpenGL guy · Jul 21, 2008

Dave Baumann said:
You've not actually paid attention to this thread, have you?

BTW - Comparing no-AF/AF relative performance differences doesn't really tell you much. For one, the caching mechanisms of the chips are completely different, and this affects AF significantly. Additionally, RV770 has 32 texture interpolators and 40 texture units, meaning, again, that it does a max of 32 bilinear filters, while it can use all 40 in AF scenarios; so comparing no-AF/AF results between this and R6xx doesn't really give you a comparison.

It can also use all 40 texture units for dependent lookups, or in other cases where you are not interpolator limited.

Sound_Card · Jul 21, 2008

So why was RV770 limited to 32 interpolators instead of 40 to make it even with the texture units?

fellix · Jul 21, 2008

Attrib interpolation rate is not really an issue in the RV770 case. Someone mentioned earlier that the engineering team opted to add the two additional texturing quads as kind of a "bonus" due to leftover area on the die.

Pantagruel's Friend · Jul 21, 2008

Dave Baumann said:
BTW - Comparing no-AF/AF relative performance differences doesn't really tell you much. For one, the caching mechanisms of the chips are completely different, and this affects AF significantly. Additionally, RV770 has 32 texture interpolators and 40 texture units, meaning, again, that it does a max of 32 bilinear filters, while it can use all 40 in AF scenarios; so comparing no-AF/AF results between this and R6xx doesn't really give you a comparison.

Does this also mean that any performance hit with 16xAF may (or may not) be attributed to caching instead of filtering capacity?

CarstenS · Jul 21, 2008

Sound_Card said:
So why was RV770 limited to 32 interpolators instead of 40 to make it even with the texture units?

Moreover (and I admit that I don't fully grasp this concept), this means that attribute interpolation is done neither in "the texture unit" itself nor in the shader core, assuming that the SIMDs are functionally identical.

That is, if this really is a matter of units present and not being able to feed them with according data in time.

Jawed · Jul 21, 2008

Since R300 at least, ATI GPUs have had dedicated interpolators. Prolly all of them.

Since not all attributes are for texture coordinates, it wouldn't make sense to do interpolation in the TUs.

Jawed

3dilettante · Jul 21, 2008

The idea that the last 2 SIMDs are "bonus" SIMDs is also supported by the discrepancy in the L1 and L2 bandwidths stated by AMD.

480 GiB/sec L1 and 384 GiB/sec L2

480/10 = 48

2*48 = 96 and 480 - 384 = 96

If the extra SIMDs had not been laid down, the L1 and L2 bandwidths would have been matched.

Is there any data on the size of L2 transfers? The numbers seem to indicate each L2 cache quadrant can transfer 128 bytes/cycle.
If it's one transfer per section, that means 4 SIMDs can be fed 128 bytes a cycle.
If the sections are dual ported, it's 8 SIMDs that can be fed 64 bytes a cycle.

Either way, the last two SIMDs are a minor source of asymmetry.
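The arithmetic above can be checked with a short sketch. The 480 and 384 figures are the ones quoted in this post; the 750 MHz core clock, the count of 4 L2 quadrants, and treating the bandwidths as decimal GB/s are assumptions made for illustration.

```python
L1_TOTAL = 480         # aggregate L1 bandwidth quoted above
L2_TOTAL = 384         # aggregate L2 bandwidth quoted above
SIMDS = 10
L2_QUADRANTS = 4       # assumed number of L2 sections
CORE_CLOCK_HZ = 750e6  # assumed core clock

per_simd_l1 = L1_TOTAL / SIMDS        # L1 bandwidth per SIMD
surplus = L1_TOTAL - L2_TOTAL         # L1 bandwidth not matched by L2
bonus_simds = surplus / per_simd_l1   # SIMDs that surplus corresponds to

def bytes_per_clock(bandwidth_gb_s, units):
    """Bytes moved per clock by each unit, treating the figure as 1e9 bytes/s."""
    return bandwidth_gb_s * 1e9 / CORE_CLOCK_HZ / units

l2_per_quadrant = bytes_per_clock(L2_TOTAL, L2_QUADRANTS)
l1_per_simd = bytes_per_clock(L1_TOTAL, SIMDS)
print(per_simd_l1, surplus, bonus_simds, l2_per_quadrant, l1_per_simd)
```

Under these assumptions each quadrant moves 128 bytes per clock and each SIMD's L1 moves 64, so 8 SIMDs at 64 bytes would exactly consume the L2 bandwidth.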

ShaidarHaran · Jul 21, 2008

Dave Baumann said:
You've not actually paid attention to this thread, have you?

BTW - Comparing no-AF/AF relative performance differences doesn't really tell you much. For one, the caching mechanisms of the chips are completely different, and this affects AF significantly. Additionally, RV770 has 32 texture interpolators and 40 texture units, meaning, again, that it does a max of 32 bilinear filters, while it can use all 40 in AF scenarios; so comparing no-AF/AF results between this and R6xx doesn't really give you a comparison.

Fine.

I'm done with this discussion. If you don't want to admit ATi fixed their lack of texture capability with RV770 - ok. What's so bad about acknowledging you've addressed your weaknesses anyway? Most people would consider that a GOOD thing.

Sound_Card · Jul 21, 2008

Jawed said:
Since R300 at least, ATI GPUs have had dedicated interpolators. Probably all of them.

Since not all attributes are for texture coordinates, it wouldn't make sense to do interpolation in the TUs.

Jawed

So it would be a combination of two things: the TUs are not attached to the interpolators but scale evenly with them anyway, and two extra SIMDs were added, which is why the interpolators stayed at 32?
