R6XX Performance Problems

I see now you're talking 4*8*(4+1) = 160, but at 2× clock = 320.
Still the same scheduling issue; RV670 is already smaller in transistors than G92, and I think you'd be down to 8 texture units?
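A quick sketch of the arithmetic being tossed around. The cluster/quad mapping and the doubled shader clock are assumptions for illustration only, not confirmed die details:

```python
# Back-of-envelope for the "4*8*(4+1) = 160, at 2x clock = 320" figure.
# The exact cluster/quad mapping here is a guess for illustration.
clusters = 4
units_per_cluster = 8      # could be 8 units or 8 quads per cluster
alus_per_unit = 4 + 1      # 4 simple ALUs + 1 complex/transcendental ALU

shader_alus = clusters * units_per_cluster * alus_per_unit
print(shader_alus)         # 160 physical ALUs

clock_multiplier = 2       # hypothetical double-pumped shader domain
effective_alus = shader_alus * clock_multiplier
print(effective_alus)      # equivalent to 320 ALUs at base clock
```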

Yes, though not down to 8 units but rather up to 32 (I was thinking 8 units per cluster, but obviously it can be 8 quads as well). I was trying to "save" transistors to make that possible, but as mczak writes, this may not be the case.

Btw. mczak, I understand this much of multiplier design, but to be honest I think it's much more heavily laden with lookup tables, and that's why I didn't assume a heavy increase in transistor count. The point may be, though, that it works quite differently in the GHz domain than I'd think, so some reading definitely won't hurt :smile:

Thanks for the responses regarding the ring bus, looks like I'll invest some reading here also.
 
Btw. mczak, I understand this much of multiplier design, but to be honest I think it's much more heavily laden with lookup tables, and that's why I didn't assume a heavy increase in transistor count. The point may be, though, that it works quite differently in the GHz domain than I'd think, so some reading definitely won't hurt :smile:
Well, I pulled that "2 times the transistor count for two times the frequency" figure out of my ass :). You're right, this is definitely more complicated than that (you could also increase clock frequency by splitting the calculation into more pipeline stages, but that has other disadvantages), but in any case I really would expect a significant increase in transistor count if your target clock frequency is that much higher.
 
I'm a bit late to the discussion here, but I feel inclined to comment that you don't need to find something positive when trying to figure out why an architecture isn't performing as expected. Something that doesn't meet expectations can be just as fascinating technologically as something that exceeds them.

I suppose from the same perspective as car/train/plane crashes are interesting.
 
Anyone wants to elaborate a bit on exactly how important Z-fill is for current games? It seems to be the only aspect where the GeForce 8x00-series have a significant advantage in theoretical benchmarks.
Thanks for pointing that graph out. I've always wondered whether anyone does those tests with AA enabled.

The strange thing is that with 4xAA, G80 and G92 halve their fillrate, even though they should theoretically run at full speed. It's as though they can only do 2 MSAA samples per clock, just like all other architectures.

As for Z-fill (and stencil fill), I don't think it's too important in today's games. 5-10 GPix/s is pretty big as it is. It was more important in Doom3-era games, but extensive shadow volumes are dead now. Some future techniques could use it, though.
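If the ROPs really are limited to 2 MSAA samples per clock, the halving at 4xAA follows directly. A minimal sketch, using made-up ROP count and clock figures purely for illustration:

```python
# Sketch of the observed fillrate halving with 4xAA: if the hardware can
# only write 2 MSAA samples per clock (an assumption matching the bench),
# a 4-sample mode needs 2 clocks per pixel. All numbers are illustrative.
rops = 16
clock_mhz = 600
samples_per_clock = 2

def pixel_fill_gpix(aa_samples):
    # clocks needed per pixel = samples / samples-per-clock (at least 1)
    clocks_per_pixel = max(1, aa_samples // samples_per_clock)
    return rops * clock_mhz / 1000 / clocks_per_pixel  # GPixels/s

no_aa = pixel_fill_gpix(1)
aa4x = pixel_fill_gpix(4)
print(no_aa, aa4x)  # 4xAA delivers exactly half the no-AA throughput
```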
 
Anyone wants to elaborate a bit on exactly how important Z-fill is for current games? It seems to be the only aspect where the GeForce 8x00-series have a significant advantage in theoretical benchmarks.

[attached graph: 63.png, theoretical fillrate benchmark]

Can anybody explain why R580 is outperforming R600 on the single texture fillrate bench?
 
Well, I pulled that "2 times the transistor count for two times the frequency" figure out of my ass :). You're right, this is definitely more complicated than that (you could also increase clock frequency by splitting the calculation into more pipeline stages, but that has other disadvantages), but in any case I really would expect a significant increase in transistor count if your target clock frequency is that much higher.

OK, let's assume you can save half the transistor count of the eliminated units - I think that's a safe bet. That means 16 complex ALUs and 64 MADD ALUs. The transistors from the 16 complex ALUs alone would probably cover the addition of the 16 texture filtering units. While that would still be a lot less capacity than G80 or G92, the AF performance would probably be nearly identical in all but the more extreme cases. I wonder why ATI so firmly believes in drastically rising ALU:TEX ratios :???:
 
ATI seems to have timing issues with market expectations. The X1xxx-generation parts now perform better in games like Rainbow Six Vegas and UT3 than their 7xxx counterparts, but given that these titles only appeared recently, the marketing of the parts was suspect from a consumer perspective (everyone claims that their parts will be better at future titles). So ATI's ALU:TEX equation seems to be out-of-whack with game release timing. But if you want a card with below-par performance now that will become a decent mid-range part later, ATI's your company?

Maybe I'm way out on a limb here, but the L'Inq article about MRT, Global Illumination, and DX10.1 got me thinking that all the R600-generation parts have AA oriented towards DX10.1 deferred MRT scheduling, and that the extra functionality is crippling AA performance...
 
Can anybody explain why R580 is outperforming R600 on the single texture fillrate bench?
For some reason R580 is doing more than 2 samples per clock except in the Z-fill test. Maybe they don't have Z-buffering enabled in the test, as I'm pretty sure R300-R580 is capable of 6 samples per clock when depth testing is disabled. Not a particularly useful feature, but a possible explanation nonetheless.
 
You can't just tack 8 extra TMUs into this architecture

Err, yes you can.

Anyways, I've been harping about ATI's lack of TMUs for ages. It's amazing that I can see the problem more easily than their highly paid engineers who do this for a living.

They have not increased base texturing ability in 3 generations (while Nvidia has upped the same metric at a breakneck pace). I would not be surprised at this point to see AMD's next product contain 8 TMUs, be 1.5 billion transistors, and run 70% slower than R600. I really wouldn't. Heck, I wouldn't be surprised to see AMD eliminate texturing altogether and just produce a non-functional product with 900 shader ALUs that blue-screens upon launching a game.
 
Well, isn't that the next logical step of unification in GPUs -- dumping dedicated hardware altogether for even more general computational resources? Then just don't ask for rapid driver updates. :LOL:
 
Err, yes you can.

After reading Dave's post here: http://forum.beyond3d.com/showpost.php?p=1096209&postcount=32

it doesn't sound that easy to me.

Anyways, I've been harping about ATI's lack of TMUs for ages. It's amazing that I can see the problem more easily than their highly paid engineers who do this for a living.

They have not increased base texturing ability in 3 generations (while Nvidia has upped the same metric at a breakneck pace). I would not be surprised at this point to see AMD's next product contain 8 TMUs, be 1.5 billion transistors, and run 70% slower than R600. I really wouldn't. Heck, I wouldn't be surprised to see AMD eliminate texturing altogether and just produce a non-functional product with 900 shader ALUs that blue-screens upon launching a game.

I personally can't know what they're up to for the "real" next generation, yet considering that they already have a programmable tessellator in their current GPUs (which their competition still has to add), it's more likely that their TMU count will increase than decrease (your obvious exaggeration aside, of course).
 
If you want to cross a ring from the middle of the top edge to the middle of the bottom edge, you'll travel 1/2L + L + 1/2L = 2 die lengths.

With a crossbar, you could go straight through the chip -> 1L.
That's worst case compared to best case, though.
I'm no hardware engineer, but surely the best case on a ringbus is at worst the same as a crossbar. Being smaller individually, each ringstop should be physically closer to both the start point and the RAM chips it services than a monolithic crossbar is, so the best case on a ringbus should actually be better.

Assuming both random data distribution between RAM chips and random access (which is presumably worse than what could be expected as a norm), with 4 ringstops on a bi-directional ring there should be about:
25% chance the data needed is on that ring stop (0.5L?)
50% chance the data is one ring stop away (1L)
25% chance the data is 2 ring stops away (2L)
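Those three cases work out to an expected traversal of a bit over one die length. A quick check, taking the probabilities and per-case distances above at face value:

```python
# Expected traversal distance on a 4-stop bi-directional ring, using the
# (probability, distance-in-die-lengths) pairs quoted in the post above.
cases = [
    (0.25, 0.5),  # data on the local ring stop
    (0.50, 1.0),  # one stop away (either direction)
    (0.25, 2.0),  # two stops away (opposite side of the ring)
]
expected_length = sum(p * d for p, d in cases)
print(expected_length)  # 1.125 die lengths on average
```

So on these assumptions the average trip is 1.125L, versus the 2L worst case quoted for crossing the whole ring.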

Anyway, wasn't this done to death way back before R520 came out?
 
Well, isn't that the next logical step of unification in GPUs -- dumping dedicated hardware altogether for even more general computational resources? Then just don't ask for rapid driver updates. :LOL:

I might be completely off track here, but my layman's sense tells me that it might be too early even for the D3D11 generation of GPUs. It's nice to have a "forward-looking" GPU from a technological perspective, but the consumer still wants a healthy boost in performance over the former generation at all times.
 
Err, yes you can.

Anyways, I've been harping about ATI's lack of TMUs for ages. It's amazing that I can see the problem more easily than their highly paid engineers who do this for a living.

Many have. And I'm quite convinced that ATI engineers (and marketeers, too) see the problem very well, and they have some bloody good reason why they don't increase the TMU count - a reason they're less than willing to share with the public.
 
Many have. And I'm quite convinced that ATI engineers (and marketeers, too) see the problem very well, and they have some bloody good reason why they don't increase the TMU count - a reason they're less than willing to share with the public.

I don't think they do.

They designed the part this way.. obviously they didn't have a good reason, period.

They continue squandering enormous opportunities (R520 was twice as big as G70.. by all reasonable measures it should therefore have been twice as fast). Instead it was massively texture limited.

R600.. much more shader power than G80.. yet for lack of 4-8 TMUs it's a huge disaster.

Now we have RV670.. half as big as G92.. it should be the same size instead, but twice as fast.. AMD's 55nm edge = totally squandered, yet again.
 
...They have not increased base texturing ability in 3 generations (while Nvidia has upped the same metric at a breakneck pace). I would not be surprised at this point to see AMD's next product contain 8 TMUs, be 1.5 billion transistors, and run 70% slower than R600. I really wouldn't. Heck, I wouldn't be surprised to see AMD eliminate texturing altogether and just produce a non-functional product with 900 shader ALUs that blue-screens upon launching a game.

Speaking of blue, that reminds me:

"Blue river, it can't be found on any map that I know
But it's the place where lonely lovers all go
To cry their tears, blue river
It winds along a path of heartache and pain
Of broken dreams from lovin' someone in vain
Like I loved you, and baby I still do

I held you so tight
You were out of my sight
I'm feelin' so low
But I gotta go
Blue river because you never really cared about me
From now on baby that's where I'm gonna be
Cryin' over you, by the river of blue"

Kinda sums up my feelings towards AMD/ATI right now. *sob* ;)
 