AMD: R8xx Speculation

OpenGL guy · Aug 19, 2009

no-X said:
I'd like to know what the major cause of the R600 TMUs' performance is (compared to today's TMUs): the native FP16 support or the point samplers. I'm still thinking about the low (60%) performance difference between HD2900XT and HD4890 in current games w/o FSAA.

You can't take one example and try to extrapolate from that. For example, are you certain the HD4890 is not CPU-limited at the chosen settings? 1680x1050 is not a very high resolution.

w0mbat · Aug 19, 2009

Jawed said:
http://techpulse360.com/2009/08/12/...en-graphics-chips-you-wont-believe-your-eyes/

Hahaha. What are they trying to prove with that?

They see me rollin', they hatin'

Mintmaster · Aug 19, 2009

mczak said:
Maybe never, because at this point it would make more sense to just use shader alus for filtering?

Filtering is a lot cheaper than you think. It's fixed function, fixed data flow arithmetic with one operand being low precision.

Davros said:
@jawed
why do you care about 8-bit textures in this day and age?

Maybe because 95%+ of textures are still 8-bit? Even when you need more range there's a host of solutions out there that don't need FP16 textures.

Jawed said:
RV730's 32 TUs were squandered

Why do you say that? Just because of the ratio compared to RV770? NVidia has always emphasized perf/mm2, and they doubled 8-bit bilinear throughput from G80 to G90, despite G80 already having higher TEX:ALU than RV730.

It's not squandered by any means. Go look at digit-life's shader tests, and then consider real-world conditions with AF enabled.

Jawed · Aug 19, 2009

CarstenS said:
Thanks, but no, that's not what I meant. I was merely referring to the "rumor" that started with Olick's presentation on id Tech 6 and wormed its way through to a doubled ROP count on all DX11 chips compared to their DX10 predecessors (which may very well be more than just a rumor).

Ha, well I've even less idea what you're referring to now.

Would the TA have to be beefed up significantly, or just a bit, in order to support the required 16k textures?

It's only an extra bit in address computations that are already 13-bit. I can't see anything significant there.
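
To put numbers on that: assuming the limits in question are the 8192-texel maximum texture dimension of DX10 and the 16384-texel maximum of DX11, the integer part of a texel coordinate grows from 13 to 14 bits. A minimal Python sketch (the helper name `address_bits` is made up for illustration, and real address hardware also carries sub-texel fraction bits that are not modelled here):

```python
def address_bits(max_dim):
    """Bits needed to index texels 0 .. max_dim - 1 along one axis."""
    return (max_dim - 1).bit_length()


# Assumed per-axis limits: 8192 for DX10-class parts, 16384 for DX11-class parts.
for dim in (8192, 16384):
    print(f"{dim:5d} texels -> {address_bits(dim)} integer address bits")
```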

Jawed

no-X · Aug 19, 2009

OpenGL guy said:
You can't take one example and try to extrapolate from that. For example, are you certain the HD4890 is not CPU-limited at the chosen settings? 1680x1050 is not a very high resolution.

These results are based on 8 games, not one. The performance delta between HD4890 and GTX285 in other CB tests is only 4% higher when going from 1680 to 2560, so I'm quite sure that CPU limitation affects this result by less than 4%.

Jawed · Aug 19, 2009

Mintmaster said:
It's not squandered by any means.

All the evidence is in HD4770, which has the same number of TUs as HD4670 at the same clock, and radically more performance. HD4770's mixture of TUs, ALUs, RBEs and bandwidth is "perfect" in many ways.

HD4770 appears to be bandwidth constrained, but one test I looked at showed no signs of bandwidth limitation, i.e. when bandwidth was raised by more than the core clock, scaling was limited by the core clock.

As I hinted earlier, the fp16 throughput of HD4670 may be the justification for 32 TUs, as HD4670 with only 16 TUs (i.e. 4 clusters, 4:1 ALU:TEX), and therefore half the fp16 rate, might well have been just too miserable. I don't know.

Jawed

CarstenS · Aug 19, 2009

Jawed said:
Ha, well I've even less idea what you're referring to now.

It's on page 110 ff., also 118 in this: http://s08.idav.ucdavis.edu/olick-current-and-next-generation-parallelism-in-games.pdf

Jawed said:
It's only an extra bit in address computations that are already 13-bit. I can't see anything significant there.

Jawed

13 is already closer to 16 than to 8.

OpenGL guy · Aug 19, 2009

no-X said:
These results are based on 8 games, not one. The performance delta between HD4890 and GTX285 in other CB tests is only 4% higher when going from 1680 to 2560, so I'm quite sure that CPU limitation affects this result by less than 4%.

I don't read German but here's what I see:
HD4890 is 77% faster in Anno 1404, 100% faster in CoD5, 84% faster in Crysis Warhead, 61% faster in F.E.A.R., 25% faster in Half-life 2 (likely CPU-limited), 38% faster in Lost Planet: Colonies, 41% faster in Oblivion, and 82% faster in World in Conflict, so what's the problem? You can't expect everything to scale equally.

Davros · Aug 19, 2009

Eh, I thought all textures these days were 32-bit?

no-X · Aug 19, 2009

OpenGL guy: There is no problem, I appreciate your response. I'm trying to point out that HD4890, despite almost 3 times higher ALU power, is only 1.6 times faster on average. Maybe slightly more if we skip HL2 as a CPU-limited game, or maybe slightly less if we skip CoD5, which could be affected by an R6xx-related driver bug (at least, in the R600 launch interview, Eric Demers mentioned that any game which runs slower on R600 than on R580, even with MSAA enabled, is more likely affected by a driver issue than by a hardware limit).
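
For reference, the per-game gains OpenGL guy listed can be averaged with a few lines of Python. This is just a sketch of the arithmetic: the percentages are the ones quoted in the thread, the helper names `mean_speedup` and `without` are made up here, and a plain arithmetic mean is an assumption about how the average was formed.

```python
# Per-game gains of HD4890 over HD2900XT in percent, as quoted in the thread.
gains = {
    "Anno 1404": 77,
    "CoD5": 100,
    "Crysis Warhead": 84,
    "F.E.A.R.": 61,
    "Half-life 2": 25,
    "Lost Planet: Colonies": 38,
    "Oblivion": 41,
    "World in Conflict": 82,
}


def mean_speedup(percent_gains):
    """Arithmetic mean of the percentage gains, expressed as a speedup factor."""
    values = list(percent_gains)
    return 1.0 + sum(values) / len(values) / 100.0


def without(game):
    """Gains for every game except the named one."""
    return [gain for name, gain in gains.items() if name != game]


print(f"all eight games:     {mean_speedup(gains.values()):.2f}x")
print(f"without Half-life 2: {mean_speedup(without('Half-life 2')):.2f}x")
print(f"without CoD5:        {mean_speedup(without('CoD5')):.2f}x")
```

Dropping Half-life 2 nudges the mean up and dropping CoD5 nudges it down, which is the "slightly more" and "slightly less" described above.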

Anyway, if a GPU with 3 times the ALU power performs about 1.6 times faster in real-world situations, it has to be caused by something. The first reason could be the number of ROPs, which is the same for both R600 and RV770. (I chose the non-MSAA results for comparison to avoid the impact of the broken resolve hardware.) The second reason could be the different TMUs, which are more capable on R600. And I'd like to know which feature has more impact in these games: the better FP16 performance or the additional point sampling units.

I believe that more capable TMUs and/or a higher number of ROPs could boost R7xx performance significantly. Because of that, I think R8xx will bring at least one of these changes.

Davros: 8-bit per component?

Jawed · Aug 19, 2009

CarstenS said:
It's on page 110 ff., also 118 in this: http://s08.idav.ucdavis.edu/olick-current-and-next-generation-parallelism-in-games.pdf

It looks like he was taking the piss out of multi-GPU... He's basically saying that setup/rasterisation's days are numbered, it's a horrible constraint on parallelism.

Jawed

Mintmaster · Aug 19, 2009

Jawed said:
All the evidence is in HD4770, which has the same number of TUs as HD4670 at the same clock, and radically more performance. HD4770's mixture of TUs, ALUs, RBEs and bandwidth is "perfect" in many ways.

HD4770 has 60% more BW and twice the RBEs. To conclude that the ALUs are responsible for that performance jump is nonsense.

If you want to see the impact of RBE and BW, look at the G92 vs G94. An 8% clock advantage (but equal BW) nearly wipes out 75% more ALUs and TUs in the 8800GT.

If you really wanted to know how much different parts of the GPU affect game performance, I can crunch some numbers for you, but I need some help. In this TR review, list some scores at different resolutions from the 8800GT, 8800 GTS 512, and 9600GT in a table and I can do a regression (like I did before with BW) to figure out per-frame time (CPU + vertex), cycles per pixel limited by RBE or rasterizing (i.e. independent of ALU/TU count), and multiprocessor limited cycles per pixel.

It won't tell you the impact of the TU vs ALU, but it will tell you how much frame time an 8800GT class GPU will spend limited by the RBE and give you much better perspective on RV730 vs RV740.
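
A regression like the one described could be set up roughly as follows. This is a minimal sketch, assuming a simple linear model that is not spelled out in the thread: frame time = fixed per-frame time + pixels * (RBE/raster-limited cycles per pixel) / clock + pixels * (multiprocessor-limited cycles per pixel) / (multiprocessor count * clock). The card configurations and timings are synthetic placeholders, not numbers from any review, and a single clock domain is assumed for simplicity.

```python
def solve_linear(a, b):
    """Solve a small dense system a*x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_frame_time(samples):
    """Least-squares fit of [t0_ms, rbe_cycles_per_pixel, mp_cycles_per_pixel].

    samples: (megapixels, clock_ghz, multiprocessors, frame_ms) tuples.
    Megapixels divided by GHz is milliseconds per (cycle per pixel).
    """
    rows = [[1.0, mp / ghz, mp / (units * ghz)] for mp, ghz, units, _ in samples]
    times = [s[3] for s in samples]
    k = 3
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve_linear(ata, aty)


# Hypothetical cards: (clock in GHz, multiprocessor count). Placeholder values.
cards = [(0.60, 7), (0.65, 8), (0.65, 4)]
resolutions = [(1280, 1024), (1680, 1050), (1920, 1200), (2560, 1600)]

# Synthetic "measurements" generated from known coefficients, so the fit
# can be checked against the values that produced them.
TRUE_T0_MS, TRUE_RBE, TRUE_MP = 4.0, 6.0, 40.0
samples = []
for ghz, units in cards:
    for width, height in resolutions:
        mpix = width * height / 1e6
        frame_ms = TRUE_T0_MS + TRUE_RBE * mpix / ghz + TRUE_MP * mpix / (units * ghz)
        samples.append((mpix, ghz, units, frame_ms))

fit = fit_frame_time(samples)
print("t0 (ms), RBE cycles/pixel, multiprocessor cycles/pixel:", fit)
```

With real benchmark scores the frame time would come from 1000 / fps, and the fit would only be as good as the assumption that the three terms simply add.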

Mintmaster · Aug 19, 2009

Davros said:
Eh, I thought all textures these days were 32-bit?

Most textures are 8-bit per channel. Sometimes that means 32 bits per pixel, but often they're compressed to far less. The per-channel width after decompression is what's relevant to the arithmetic logic that everyone is talking about.
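
To make the distinction concrete, here is a small Python sketch of storage cost versus filtered channel width. The DXT1/DXT5 block sizes (8 and 16 bytes per 4x4 texel block) are the standard S3TC figures rather than anything stated in the thread, and the table layout and helper name `block_bpp` are made up for illustration.

```python
def block_bpp(block_bytes, block_w=4, block_h=4):
    """Stored bits per pixel for a block-compressed format."""
    return block_bytes * 8 / (block_w * block_h)


# (name, stored bits per pixel, channel width seen by the filter after decode)
formats = [
    ("RGBA8 (uncompressed)", 4 * 8, 8),
    ("DXT1", block_bpp(8), 8),
    ("DXT5", block_bpp(16), 8),
    ("RGBA16F (uncompressed)", 4 * 16, 16),
]

for name, stored, filtered in formats:
    print(f"{name:24s} {stored:5.1f} bits/pixel stored, {filtered}-bit channels filtered")
```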

aaronspink · Aug 19, 2009

Davros said:
@jawed
why do you care about 8-bit textures in this day and age?

cause like 99.9999999% of all textures are 8b per channel textures. There really isn't a need for anything higher than true color.

8bpc is the standard 24b/32b per pixel image format used by just about everybody.

16bpc and higher (48b/64b per pixel) is mainly used as an intermediate during HDR rendering.

Jawed · Aug 19, 2009

HD4770 v HD4670

http://www.xbitlabs.com/articles/video/display/call-of-juarez-2_6.html#sect0

http://www.computerbase.de/artikel/..._grafikkarten/18/#abschnitt_performancerating

HD4770 is 60-70% faster than HD4670, courtesy of 60% more bandwidth, 100% more fillrate and 100% more GFLOPs and despite identical texturing rate.

Now the problem with reviews is they tend to push a budget graphics card into rendering options that no user would actually choose, i.e. maxed in-game graphics settings, optimisations turned off, with control panel settings such as transparency AA.

Arguably the case is still open - the 2:1 ALU:TEX is a nod to the "weak" configuration budget gamers will use (less AF). But I'd like some decent evidence that it isn't simply unbalanced.

Jawed

Jawed · Aug 19, 2009

Mintmaster said:
To conclude that the ALUs are responsible for that performance jump is nonsense.

When did I do that?

Jawed

Mintmaster · Aug 20, 2009

Jawed said:
When did I do that?

Jawed

When you said that HD4770's performance is all the evidence we need that RV730's TUs were squandered.

I think that if RV740 had only 320 SPs but the same number of TUs, then it would only lose 5-10% of performance or so with AA/AF enabled.

mczak · Aug 20, 2009

Mintmaster said:
Filtering is a lot cheaper than you think. It's fixed function, fixed data flow arithmetic with one operand being low precision.

I didn't want to imply it's expensive. But fact is, if you want full-rate FP16 filtering (and I didn't bring this up...) it's going to cost you some die space. To counter this, Jawed suggested to increase ALU:TEX ratio again, but it looks to me like there's some point where it doesn't really make sense to have dedicated filtering units if the ratio of ALU:TEX is high enough.

Mintmaster · Aug 20, 2009

mczak said:
I didn't want to imply it's expensive. But fact is, if you want full-rate FP16 filtering (and I didn't bring this up...) it's going to cost you some die space. To counter this, Jawed suggested to increase ALU:TEX ratio again, but it looks to me like there's some point where it doesn't really make sense to have dedicated filtering units if the ratio of ALU:TEX is high enough.

I don't think ALU:TEX is going to matter because doing proper filtering (including aniso and filtering weights) in a shader is too complicated and slow. The decision of whether to have full-rate FP16 filtering will depend entirely on the workload. I don't think it will ever be worth it, because there's barely any need for high-speed FP16 filtering: most applications look just as good with some 32bpp hackery like LogLuv or RGBexp.

The only really good application I've seen of high precision filtering is VSM/ESM. It's only going to be a small percentage of total texture accesses, though, and needs 32-bit filtering if not more.
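
As an illustration of the kind of 32bpp encoding mentioned above, here is a minimal Python sketch of a shared-exponent RGBE pack/unpack in the style of Greg Ward's Radiance format. Whether "RGBexp" in the post refers to exactly this layout is an assumption, and the function names are made up for illustration.

```python
import math


def encode_rgbe(r, g, b):
    """Pack non-negative float RGB into four bytes: 3 mantissas + shared exponent."""
    v = max(r, g, b)
    if v < 1e-32:
        return (0, 0, 0, 0)
    mantissa, exponent = math.frexp(v)  # v == mantissa * 2**exponent, mantissa in [0.5, 1)
    scale = mantissa * 256.0 / v
    return (int(r * scale), int(g * scale), int(b * scale), exponent + 128)


def decode_rgbe(r8, g8, b8, e8):
    """Unpack four RGBE bytes back to float RGB."""
    if e8 == 0:
        return (0.0, 0.0, 0.0)
    f = math.ldexp(1.0, e8 - (128 + 8))  # 2**(exponent - 8)
    return (r8 * f, g8 * f, b8 * f)


for rgb in [(1.0, 0.5, 0.25), (1000.0, 20.0, 3.0)]:
    packed = encode_rgbe(*rgb)
    print(rgb, "->", packed, "->", decode_rgbe(*packed))
```

Because the exponent is shared, a dim channel next to a very bright one loses precision (the blue value in the second example decodes to zero), so how well this substitutes for FP16 depends on the content.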

CarstenS · Aug 20, 2009

How expensive would texture addressing be on its own?

