AMD: R9xx Speculation

Yes. SPs can be fully utilized, but only by synthetic tests, not in games.
Well, they are being used in games, as that's where the increase is coming from.

And just looking at the graphs you linked, now that I can see them, we have a 52% increase in performance from Juniper to (full) Cypress on Crysis and a 77% increase on Just Cause; to me that indicates that Crysis is fairly CPU bound. Taking a closer look at the Just Cause numbers shows that Juniper SIMD scaling is at <2% while Cypress is actually at 3% - given that the only element that doesn't scale by 2x between Juniper and Cypress is the front-end geometry, these results would indicate that Just Cause is not particularly bound by that.
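(A minimal sketch of how that sort of SIMD-scaling figure can be read off two data points, treating full Cypress as roughly a doubled Juniper; the fps numbers below are made-up placeholders, not the values from the linked graphs.)

```python
# Scaling efficiency between two parts that differ (roughly) only in SIMD count:
# 1.0 = perfect scaling with SIMDs, 0.0 = no benefit at all.
def scaling_efficiency(fps_small, fps_big, units_small, units_big):
    observed = fps_big / fps_small      # e.g. 1.52 for a 52% gain
    ideal = units_big / units_small     # 2.0 for Juniper (10 SIMDs) -> Cypress (20)
    return (observed - 1.0) / (ideal - 1.0)

# Hypothetical fps pairs reproducing a 52% and a 77% gain:
print(scaling_efficiency(30.0, 45.6, 10, 20))  # ~0.52
print(scaling_efficiency(30.0, 53.1, 10, 20))  # ~0.77
```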
 
So the question is simple. Why doesn't Cypress's gaming performance scale well with the number of SIMDs, while it scales well with frequency?

HD5850 (1440SPs/72TMUs) clocked 3% higher than HD5870 (1600SPs/80TMUs - typo, thx to neliz) would outperform it. That's quite strange, isn't it? :)
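As a rough back-of-the-envelope check of that scenario (stock HD5870 at 850 MHz, MADD counted as 2 flops per SP per clock):

```python
# Theoretical shader throughput in GFLOPS: SPs * 2 flops/clock * clock (GHz).
def gflops(sps, ghz):
    return sps * 2 * ghz

hd5870   = gflops(1600, 0.850)          # ~2720 GFLOPS
hd5850oc = gflops(1440, 0.850 * 1.03)   # ~2521 GFLOPS (clocked 3% above the 5870)

print(hd5870, hd5850oc, hd5850oc / hd5870)  # the overclocked 5850 has ~93% of
                                            # the 5870's theoretical ALU rate
```

If such a card really did come out ahead despite ~7% less theoretical ALU/texture throughput, the limiter would have to be something that scales with clock rather than with SIMD count.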
 
So the question is simple. Why doesn't Cypress's gaming performance scale well with the number of SIMDs, while it scales well with frequency?

HD5850 (1440SPs/72TMUs) clocked 3% higher than HD5870 (1600SPs/72TMUs) would outperform it. That's quite strange, isn't it? :)

5870's don't have 72TMUs?!

On the other hand, with SIMDs disabled, your L2 cache remains intact. I'm not sure if that's a bottleneck there, but in some cases it might be better to have more cache available per SIMD. That, and your geometry/vertex/Z assembler performance is also increased, since that's equal between the 5850 and 5870 as well. So if the app is limited by anything but the SIMDs, you'd benefit from increased clocks.

(Dave has that data.. spill it! :D )
 
neliz: Fixed, thanks

CarstenS: Yes. I think neliz's point was that disabling SIMDs doesn't disable L2 (which isn't part of SIMDs) and because of that, HD5850 has more L2 per SIMD (ratio).
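A quick illustration of that ratio, assuming the commonly quoted 4 x 128 KB of L2 on Cypress (the L2 sits with the memory controllers, so it isn't cut down along with the SIMDs):

```python
# L2 per active SIMD on the two Cypress configurations (assumed 512 KB total).
l2_kb = 4 * 128

for name, simds in (("HD5870", 20), ("HD5850", 18)):
    print(f"{name}: {l2_kb / simds:.1f} KB of L2 per SIMD")
# HD5870: 25.6 KB per SIMD, HD5850: 28.4 KB per SIMD
```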
 
neliz: Fixed, thanks
np

CarstenS: Yes. I think neliz's point was that disabling SIMDs doesn't disable L2 (which isn't part of SIMDs) and because of that, HD5850 has more L2 per SIMD (ratio).

Indeed. L2 and the whole "Graphics Engine (Setup)" would enjoy the benefits of higher clocks there.
 
So the question is simple. Why doesn't Cypress's gaming performance scale well with the number of SIMDs, while it scales well with frequency?

HD5850 (1440SPs/72TMUs) clocked 3% higher than HD5870 (1600SPs/80TMUs - typo, thx to neliz) would outperform it. That's quite strange, isn't it? :)
Isn't the 5870 limited by memory bandwidth in quite a few situations? That could explain it. The cache-to-shader ratio difference even gives an equally clocked 5850 a slightly higher effective bandwidth in theory.
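The same per-SIMD argument can be put in numbers for raw bandwidth, assuming both boards are run at the stock HD5870 memory clock (256-bit bus, 4.8 Gbps GDDR5):

```python
# Bandwidth per SIMD at equal memory clocks (256-bit bus, 4.8 Gbps GDDR5).
bw_gbs = 256 / 8 * 4.8   # 153.6 GB/s

for name, simds in (("HD5870", 20), ("HD5850", 18)):
    print(f"{name}: {bw_gbs / simds:.2f} GB/s per SIMD")
# HD5870: 7.68 GB/s per SIMD, HD5850: 8.53 GB/s per SIMD
```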
 
Isn't the 5870 limited by memory bandwidth in quite a few situations?

That's quite simple for you to test (just play with the memory frequencies).
People have done it before, and they discovered that, no, the HD5870 is not limited by bandwidth.
 
I remember several tests showing that increasing bandwidth by 10% brings 1-3% more performance.

Btw. bandwidth limitation wouldn't explain why Cypress's gaming performance scales well with GPU clock frequency, but not so well with the number of SIMDs :)
 
Btw. bandwidth limitation wouldn't explain why Cypress's gaming performance scales well with GPU clock frequency, but not so well with the number of SIMDs :)

Isn't that the whole idea behind the clock domain rumours for R9xx? If it's setup-limited, you could rework it or make it run out of sync (at a higher frequency) and gain the benefits there simply through clocks, without having your whole core running at 1.2GHz.
 
I remember several tests showing that increasing bandwidth by 10% brings 1-3% more performance.

With Cypress it's harder to test that way, right? Because of the error detection they have going on, or something?
 
Then it's hard to test bandwidth limitation on all GDDR5 boards?

What I'm referring to is this:
AnandTech said:
Should an error be found, the GDDR5 controller will request a retransmission of the faulty data burst, and it will keep doing this until the data burst finally goes through correctly. A retransmission request is also used to re-train the GDDR5 link (once again taking advantage of fast link re-training) to correct any potential link problems brought about by changing environmental conditions. Note that this does not involve changing the clock speed of the GDDR5 (i.e. it does not step down in speed); rather it’s merely reinitializing the link. If the errors are due the bus being outright unable to perfectly handle the requested clock speed, errors will continue to happen and be caught. Keep this in mind as it will be important when we get to overclocking.
http://www.anandtech.com/show/2841/12
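In other words, past the point where the bus stops running error-free, retransmissions quietly eat into the extra raw bandwidth instead of producing artifacts, which is what makes memory overclocking hard to interpret. A toy model (numbers purely illustrative):

```python
# Toy model of GDDR5 error detection and retransmission: if a fraction f of
# bursts has to be resent once, effective bandwidth drops to raw / (1 + f).
def effective_bandwidth(raw_gb_s, retransmit_fraction):
    return raw_gb_s / (1.0 + retransmit_fraction)

print(effective_bandwidth(153.6, 0.00))  # stable clock: 153.6 GB/s
print(effective_bandwidth(168.0, 0.15))  # "overclocked" but erroring: ~146 GB/s
```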
 
It would make it difficult to test overclocking of memory certainly. You could test downclocking the memory theoretically, although I'm not sure whether that data would be useful when trying to determine bandwidth limitations.

Regards,
SB

This was what I was going to suggest as well. Say you downclock the memory to 4 GHz; the stock clock of 4.8 GHz is then 20% higher. Depending on the performance increase from 4 GHz to 4.8 GHz, we can determine whether it is B/W limited. Say the performance increases only by 5%, then bandwidth doesn't have quite so much of an effect. But say it increases by 15%, then it certainly is B/W limited. Carsten, care to test? :smile:
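A minimal way to put numbers on that proposal; only the 4.0 vs 4.8 GHz ratio comes from the post, the fps values are placeholders:

```python
# How much of a 20% bandwidth increase (4.0 -> 4.8 GHz memory) a game turns
# into performance: ~1.0 means strongly bandwidth-limited, ~0 means not at all.
def bw_scaling(fps_at_4ghz, fps_at_4_8ghz):
    perf_gain = fps_at_4_8ghz / fps_at_4ghz - 1.0
    bw_gain = 4.8 / 4.0 - 1.0                  # 0.20
    return perf_gain / bw_gain

print(bw_scaling(50.0, 52.5))  # +5%  -> 0.25, mostly not bandwidth-limited
print(bw_scaling(50.0, 57.5))  # +15% -> 0.75, largely bandwidth-limited
```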
 
Then it's hard to test bandwidth limitation on all GDDR5 boards?

What I'm referring to is this:

http://www.anandtech.com/show/2841/12

Couldn't you just do the testing on a 5850 and extrapolate some results from that? As far as I know, the 5850 has the same memory modules as the 5870, all of which are perfectly capable of 1250MHz (physical).

I have done many benchmarks on my own 5850, and I found that by overclocking the core by 38% and the memory by 25%, I got a solid 25% gain on average. The maximum was a full 35% though, which should also count for something.

Results can be found here.
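Taking those figures at face value, the two overclocks bracket the possible gains: a purely core-bound game could pick up ~38% at most, a purely bandwidth-bound one ~25% at most. A quick sanity check:

```python
# Bracketing the reported HD5850 overclock results.
core_oc, mem_oc = 0.38, 0.25          # core and memory overclocks
observed_avg, observed_max = 0.25, 0.35

print(f"core-bound ceiling: {core_oc:.0%}, bandwidth-bound ceiling: {mem_oc:.0%}")
print(f"observed: avg {observed_avg:.0%}, max {observed_max:.0%}")
# The 35% case exceeds the 25% bandwidth ceiling, so it cannot be purely
# bandwidth-limited; the 25% average is consistent with a mix of limits.
```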
 
This was what I was going to suggest as well. Say you downclock the memory to 4 GHz; the stock clock of 4.8 GHz is then 20% higher. Depending on the performance increase from 4 GHz to 4.8 GHz, we can determine whether it is B/W limited. Say the performance increases only by 5%, then bandwidth doesn't have quite so much of an effect. But say it increases by 15%, then it certainly is B/W limited. Carsten, care to test? :smile:

But wouldn't 15% only prove that it would be limited at less than 4.8GHz? 5% would disprove a limitation, though.
 

Maybe running in lower resolutions would show a bigger difference. 1920*1200 with maxed settings in Far Cry 2, Crysis Warhead and Stalker: CoP could be quite shader limited, and there the extra memory bandwidth should not help. Maybe those games are clearly within fillrate and texture bandwidth limits at just double-digit fps.

It could be better to test it the other way with those games: downclock the memory to something like 3 GHz and then increase the frequency from there.
 