AMD: R9xx Speculation

Looking at computerbase, real-world results seem to be different. Comparing the HD4890 (16 ROPs) and the HD5850 (32 ROPs) at:
2560*1600: HD5850 is 48% faster
2560*1600 + AA 4x / AF 16x: HD5850 is 32% faster
2560*1600 + AA 8x / AF 16x: HD5850 is 22% faster

1920*1200: HD5850 is 43% faster
1920*1200 + AA 4x / AF 16x: HD5850 is 31% faster
1920*1200 + AA 8x / AF 16x: HD5850 is 27% faster

it doesn't seem that the HD5850 is able to exploit the advantage of twice as many ROPs compared to the HD48xx
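A quick back-of-the-envelope check in plain Python, using only the computerbase numbers listed above, makes the trend explicit: if the doubled ROPs were the deciding factor, the HD5850's lead should grow under heavier AA, but it shrinks at both resolutions.

```python
# Observed HD5850 speedup over HD4890 (computerbase numbers above), in percent.
speedups = {
    ("2560x1600", "no AA"): 48,
    ("2560x1600", "4xAA"):  32,
    ("2560x1600", "8xAA"):  22,
    ("1920x1200", "no AA"): 43,
    ("1920x1200", "4xAA"):  31,
    ("1920x1200", "8xAA"):  27,
}

# Heavier AA means more ROP work, so a ROP-bound comparison should widen
# the gap with AA enabled; instead the lead narrows in every case.
for res in ("2560x1600", "1920x1200"):
    no_aa, aa8 = speedups[(res, "no AA")], speedups[(res, "8xAA")]
    print(res, "no-AA lead:", no_aa, "% -> 8xAA lead:", aa8, "%")
    assert aa8 < no_aa
```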

Is the glass half full or half empty? You could turn it around, and from the very same numbers argue that the 5850 shows clear benefits - the size of which is going to depend on the particular application, settings and scene.

I keep seeing statements on these forums that GPU such and such "is XX limited" as if it were some universal truth. In reality different applications/settings and scenes are going to have different requirements and bottlenecks - which is as it should be. GPUs strive to achieve some reasonable balance between resources taking both usage patterns and cost into account.

Since settings play such a big role, I'd say that the take home message is that reviewers need to test a wide variety of settings as well as applications, and that customers need to pay attention to the tests that target their particular set of needs and wants. (Regrettably, I find that most reviews test applications I do not use, at settings I wouldn't use.)
 
Is the glass half full or half empty? You could turn it around, and from the very same numbers argue that the 5850 shows clear benefits
You say that these numbers show that the additional 16 ROPs result in better MSAA performance? They show exactly the opposite: utilization of the additional 16 ROPs seems to be pretty poor. The reason isn't obvious, but until ATi solves it, 32 ROPs seem to be pretty much overkill for a GPU of this performance level.
 
You say that these numbers show that the additional 16 ROPs result in better MSAA performance? They show exactly the opposite: utilization of the additional 16 ROPs seems to be pretty poor. The reason isn't obvious, but until ATi solves it, 32 ROPs seem to be pretty much overkill for a GPU of this performance level.

Well, yes - for this particular anonymous benchmark. Then again, the fact that the ROPs weren't the only limiting factor in this particular example doesn't make a strong argument that the additional ROPs are generally a waste of die space. It's only natural that if you alleviate one bottleneck, other features become limiting relatively more often. Does that imply that reducing bottlenecks is pointless? Of course not.

My point is that if you're too narrow in how you evaluate an architecture, you can end up with a very skewed impression of where the limitations lie. And that as a consumer, you would do well to find tests that resemble your preferences. YMMV applies, as usual.
 
You say that these numbers show that the additional 16 ROPs result in better MSAA performance? They show exactly the opposite: utilization of the additional 16 ROPs seems to be pretty poor. The reason isn't obvious, but until ATi solves it, 32 ROPs seem to be pretty much overkill for a GPU of this performance level.

Assuming the performance level/target is that of the HD 5850/70 or in between, I don't see how they could really get it done without 32 ROPs, nor do I think that's overkill for a GPU with such a performance target. If you think it's overkill for a GPU that is supposed to be $200, then take a look at the GTX460.
 
Older samples with 6+8pin "just to be sure"?
The other pic clearly has the 6+6pin soldered, not 6+8pin
Possible, yes. But then, no-x already mentioned Pro and XT…

We have quite an old HD 5870 here, one of the first batch of samples, and it also has solder points for 8+6 pin, but only 6+6 was used - even though it was an early model.
 
These numbers are taken from the computerbase review, based on 10-12 games. I think it is a very good example of real-world performance.

Then again, the fact that the ROPs weren't the only limiting factor in this particular example doesn't make a strong argument that the additional ROPs are generally a waste of die space.
If there are other limitations which prevent utilization of the additional 16 ROPs (in the majority of games), wouldn't it be more meaningful to stay with 16 ROPs? At least until those limitations are removed?
 
I don't see how they could really get it done without 32 ROPs
These results show that RV790 was in fact significantly less limited by ROP performance (if at all) than lots of people suggested. You can also notice that the HD4890 is as fast as the GTX460-768... would it really be hard to get about 25% more performance out of it by adding just SIMDs? I don't think so. Especially considering that Barts isn't a high-end model, which would be rated by its Eyefinity performance, ultra-high-resolution performance and MSAA 8x performance - the aspects which are mostly related to ROPs.

If we leave aside ultra-high resolutions, where the GTX460-768 can be limited by VRAM capacity, it performs only 10% slower than the 1GB model despite having 25% fewer ROPs and 25% less bandwidth. That doesn't seem to be a factor that would prevent a 25-30% performance gain.
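As a sanity check on those ratios (a sketch; the 10% figure is from the comparison above, and the 24 vs. 32 ROP counts follow from the GTX460-768's three enabled 64-bit memory partitions vs. the 1GB model's four):

```python
# GTX460-768 vs GTX460-1GB: ROPs and bandwidth both scale with the
# number of 64-bit memory partitions (3 vs 4), so both drop by 25%.
rops_768, rops_1g = 24, 32
resource_ratio = rops_768 / rops_1g   # 0.75 -> 25% fewer ROPs / less bandwidth
observed_perf_ratio = 0.90            # "only 10% slower" (claim above)

# If ROPs/bandwidth were the hard limit, performance should track the
# resource cut; instead most of the deficit is hidden by other limits.
deficit_recovered = (observed_perf_ratio - resource_ratio) / (1 - resource_ratio)
print(f"{deficit_recovered:.0%} of the 25% resource cut does not show up as lost performance")
```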

Anyway, I can imagine that ATi will use 32 ROPs for Barts if the other limiting factors are removed and its performance is at least comparable to the HD5870 (or higher). That said, I still believe the original 32nm model had 16 ROPs with a 128-bit bus. I don't believe they planned to use 32 ROPs for both the mainstream (128-bit) and the high-end (256-bit) part.

Jawed: It would be nice, but it could also boost Barts' performance quite a bit above the performance level we are talking about (~HD5850). It would make sense then.
 
If we leave aside ultra-high resolutions, where the GTX460-768 can be limited by VRAM capacity, it performs only 10% slower than the 1GB model despite having 25% fewer ROPs and 25% less bandwidth. That doesn't seem to be a factor that would prevent a 25-30% performance gain.
Hmm, I thought it was more like a 5% difference. Anyway, I've long argued the GTX460 has "too many" ROPs, given the pathetic number of pixels other parts of the chip can push. That is, for GF104, if you're somehow limited by pixel throughput, it won't be because of the ROPs. This should be untrue for Barts.
That said, it doesn't sound impossible (just harder) to reach HD5850 performance with only 16 ROPs to me. What's the size of a Juniper/Cypress-style MC partition (with 2 quad-rops) vs a Redwood-style (with 1 quad-rop) one?
 
That is, for GF104, if you're somehow limited by pixel throughput, it won't be because of the ROPs.

This is true but the SM->ROP bottleneck doesn't seem to be much of an issue either if you compare the 460's 9.5Gp/s to the 285's 20.7Gp/s. Of course it's impossible to know if it would be significantly faster than it is now if it did have a higher fillrate.
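Those two fillrate figures can be reconstructed from public clocks. Note the assumptions here: the commonly cited 2 pixels/clock/SM export limit for GF104, 7 enabled SMs on the GTX460 at 675 MHz, and the GTX285's 32 ROPs at its 648 MHz ROP-domain clock.

```python
# GTX460 (GF104, 7 SMs enabled): pixel output is capped by the SM export
# rate, assumed to be 2 pixels/clock per SM, at a 675 MHz core clock.
gtx460_sm_limit = 7 * 2 * 675e6 / 1e9   # ~9.45 Gp/s (the "9.5" above)

# GTX285 (GT200b): 32 ROPs at a 648 MHz ROP-domain clock.
gtx285_rop_rate = 32 * 648e6 / 1e9      # ~20.7 Gp/s

print(f"460 SM->ROP cap: {gtx460_sm_limit:.1f} Gp/s, "
      f"285 ROP rate: {gtx285_rop_rate:.1f} Gp/s")
```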
 
Looking at computerbase, real-world results seem to be different. Comparing the HD4890 (16 ROPs) and the HD5850 (32 ROPs) at:
2560*1600: HD5850 is 48% faster
2560*1600 + AA 4x / AF 16x: HD5850 is 32% faster
2560*1600 + AA 8x / AF 16x: HD5850 is 22% faster

1920*1200: HD5850 is 43% faster
1920*1200 + AA 4x / AF 16x: HD5850 is 31% faster
1920*1200 + AA 8x / AF 16x: HD5850 is 27% faster

it doesn't seem that the HD5850 is able to exploit the advantage of twice as many ROPs compared to the HD48xx

Bandwidth is almost the same for the 5850 and the 4890. Even though the 4870 had quite high bandwidth, the increase for the 5800 series was quite low if you take into account the doubled ROPs. Also, almost every new game uses some kind of deferred rendering, which adds additional write bandwidth.
 
This is true but the SM->ROP bottleneck doesn't seem to be much of an issue either if you compare the 460's 9.5Gp/s to the 285's 20.7Gp/s. Of course it's impossible to know if it would be significantly faster than it is now if it did have a higher fillrate.

If someone had both a 460 and a 285 (or any other similar GT200 derivative you care to compare), they could use a tool like RivaTuner, which allows independent adjustment of the core and shader clock domains, to play with both and see how fillrate- or shader-limited each card is.
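Interpreting such an experiment is simple enough to sketch. This is a hypothetical helper, not a RivaTuner API; you would feed it the frame rates you measured at two settings of one clock domain, and the closer the result is to 1.0, the more that domain is the bottleneck.

```python
def clock_sensitivity(fps_base, fps_oc, clk_base, clk_oc):
    """Fraction of a clock increase that shows up as extra frame rate.

    ~1.0 -> that clock domain is the bottleneck for this workload;
    ~0.0 -> performance is limited elsewhere.
    """
    return (fps_oc / fps_base - 1) / (clk_oc / clk_base - 1)

# Hypothetical numbers: a +10% core clock yields +8% fps -> mostly core-bound.
print(f"{clock_sensitivity(60.0, 64.8, 675, 742.5):.2f}")
```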
 
That said, it doesn't sound impossible (just harder) to reach HD5850 performance with only 16 ROPs to me. What's the size of a Juniper/Cypress-style MC partition (with 2 quad-rops) vs a Redwood-style (with 1 quad-rop) one?

No, not impossible. Just really hard, presuming Barts packs fewer shaders and texture units and has to rely purely on efficiency to get to Cypress performance (and it's not like Cypress is some hugely inefficient chip where such gains are realistic just by changing and moving things around). I have doubts about the viability of a chip that would be marginally faster than Juniper and marginally slower than Cypress an entire year after the Evergreen launch. I know we are still on 40nm here, but I doubt there is no wiggle room even for a $200 target. Won't what is looking to be a 320 ALU part (let's roll with this) be choked by 16 ROPs?

I guess that is a good question though. Do we know how much die space a single Cypress RBE takes up?
 
Looking at computerbase, real-world results seem to be different. Comparing the HD4890 (16 ROPs) and the HD5850 (32 ROPs) at:
2560*1600: HD5850 is 48% faster
2560*1600 + AA 4x / AF 16x: HD5850 is 32% faster
2560*1600 + AA 8x / AF 16x: HD5850 is 22% faster

1920*1200: HD5850 is 43% faster
1920*1200 + AA 4x / AF 16x: HD5850 is 31% faster
1920*1200 + AA 8x / AF 16x: HD5850 is 27% faster

it doesn't seem that the HD5850 is able to exploit the advantage of twice as many ROPs compared to the HD48xx

You're forgetting the clock speed difference: the HD5850 is clocked at 725 MHz, while the 4890 is clocked at 850 MHz. Effectively the 5850 has only 1.7 times the ROP throughput of the 4890, not 2 times.

But I agree with the point you are making, and this was discussed in detail when Cypress launched. The performance increase over the HD 4890 was not in line with the doubling of specs (excluding memory bandwidth).
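The clock-corrected ratio is easy to verify with plain arithmetic from the clocks above (assuming both chips output the same pixels per ROP per clock, so that factor cancels):

```python
# Raw ROP throughput scales with ROP count * clock; the per-ROP rate is
# the same on both chips, so it drops out of the ratio.
hd4890 = 16 * 850   # ROPs * MHz
hd5850 = 32 * 725
ratio = hd5850 / hd4890
print(f"HD5850 / HD4890 raw ROP throughput: {ratio:.2f}x")   # ~1.71x, not 2x
```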
 
Bandwidth is almost the same for the 5850 and the 4890. Even though the 4870 had quite high bandwidth, the increase for the 5800 series was quite low if you take into account the doubled ROPs. Also, almost every new game uses some kind of deferred rendering, which adds additional write bandwidth.
It was shown that RV870 is slightly more bandwidth-limited than RV770/790, but this limitation is not big enough to explain the low performance gain.

I still believe the GPU isn't able to utilize the ROPs effectively, because other parts of the chip aren't very efficient. I'm not sure if these problems are related to the front-end, but I remember that disabling two SIMDs per block barely affects gaming performance...
You're forgetting the clock speed difference: the HD5850 is clocked at 725 MHz, while the 4890 is clocked at 850 MHz. Effectively the 5850 has only 1.7 times the ROP throughput of the 4890, not 2 times.

But I agree with the point you are making, and this was discussed in detail when Cypress launched. The performance increase over the HD 4890 was not in line with the doubling of specs (excluding memory bandwidth).
You can replace the HD5850 with the HD5870 in this comparison - same clock as the HD4890, but only ~50% better performance.
 
It was shown that RV870 is slightly more bandwidth-limited than RV770/790, but this limitation is not big enough to explain the low performance gain.

I still believe the GPU isn't able to utilize the ROPs effectively, because other parts of the chip aren't very efficient. I'm not sure if these problems are related to the front-end, but I remember that disabling two SIMDs per block barely affects gaming performance...

The 5850 has not just the same 256-bit bus for 32 ROPs, but also just four L2 blocks like the old RV770 (only doubled in size). And the L2-to-L1 bandwidth went up only from 384 GB/s (RV770) to 435 GB/s (and that at 5870 clocks). A 13% increase sounds quite cheap.
So I think the additional 16 ROPs were just a bonus; they shouldn't take up too much space. The AA performance shows it.

Maybe this time the 6000 cards will have 32 ROPs with some major cache changes. (And I was also expecting 384-bit :cry:)
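Putting the quoted cache figures side by side (the 384 and 435 GB/s numbers are as stated above, not official specs):

```python
rv770_l2_l1 = 384.0     # GB/s, aggregate L2->L1 bandwidth (RV770)
cypress_l2_l1 = 435.0   # GB/s, at HD5870 clocks

# Internal bandwidth grew far more slowly than ROP count (2x) did.
increase = cypress_l2_l1 / rv770_l2_l1 - 1
print(f"L2->L1 bandwidth up only {increase:.0%} for twice the ROPs")
```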
 
Well, so what will/could/should be improved over R8xx?

1. utilization of SPs (5D -> 4D ALUs)
2. L1-L2 texture cache bandwidth
3. hard mip-map transition when using AF (maybe it's related to #2)
4. higher triangle rate, or tessellation throughput better than 1/3 triangle per clock per GPU
5. conflicts in triangle distribution (?)
6. utilization of additional SIMDs (maybe related to #2, 4, 5)
7. GDDR5 undervolting
8. performance of ROP-disabled products (per-ROP bandwidth)?
9. Eyefinity limitation (2 clock generators)
10. UVD stuff

did I miss anything?
 
I'm not holding my breath for #8, it doesn't really seem worth the engineering effort. Everything else sounds plausible.
 
Good summary no-X.

8 seems irredeemable though.

I'd add single-card CrossFire that doesn't use a bridge chip - and perhaps doesn't suck, though that seems unlikely.

Also rasterisation that doesn't gum up when faced with triangles of 8 pixels or less.
 
Well, so what will/could/should be improved over R8xx?

1. utilization of SPs (5D -> 4D ALUs)
2. L1-L2 texture cache bandwidth
3. hard mip-map transition when using AF (maybe it's related to #2)
4. higher triangle rate, or tessellation throughput better than 1/3 triangle per clock per GPU
5. conflicts in triangle distribution (?)
6. utilization of additional SIMDs (maybe related to #2, 4, 5)
7. GDDR5 undervolting
8. performance of ROP-disabled products (per-ROP bandwidth)?
9. Eyefinity limitation (2 clock generators)
10. UVD stuff

did I miss anything?
wrt 2, I think it might not improve with Barts (if it has the same number of TMUs as Cypress), only with Cayman (e.g. there could be a 2x cache bandwidth improvement for 1.5x the TMUs).
As for 8, I don't think there will be much need for ROP-disabled products, at least if Barts has 16 ROPs and Cayman 32, unless it's really needed for salvage parts. Or maybe doubling the internal bandwidth (see number 2) would solve that anyway.
 