AMD: Southern Islands (7*** series) Speculation/Rumour Thread

Yeah, and my theory is that neither NI nor SI was part of the initial plan. Perhaps there was something different, with a different name, for example Hecatonchires :mrgreen:. Later, they split it and NI and SI were born. Also, I think the architecture for SI will be further improved, not just a simple shrink to 28 nm.

I'm pretty much thinking they will skip the original specs that were meant for 32nm and go straight to the "refresh" specs on 28nm (for reference, NI being the gimped version on 40nm).

"NI Cayman XT" = 24 SIMD/40nm
"Original Plan" = ?28 SIMD/32nm?
"SI Cayman XT" = ?32 SIMD/28nm?

If SI is as closely related to NI as I think, it should not take anywhere close to a year for the HD 7 series to pop up. Summer release? Out before back-to-school?
 
AMD desperately needs to address some architecture bottlenecks; otherwise an increase from 24 simds to 32 simds (at the same clock) would give a 10% perf increase at best. So IMHO SI can't really be almost the same as NI. For that matter, I doubt the original plan actually had more simds, unless there were other changes (which seems unlikely given that Dave said the design was just back-ported to 40nm).
 
Where does this rumour about 32 SIMDs @ 28 nm come from? It is so conservative, on the low side of expected specifications. Such a chip would be both very small and weak against what NVidia will offer. :oops:
 
I think it's rather the opposite: everybody assumes AMD will go for a somewhat small chip again (as they did for RV770 and initially Cayman; only Cypress got "too big"), and then figures the simd count can't be that much higher. If the chip is really only about 270mm², 32 simds sounds plausible, if there are other changes too. And fwiw, even with "only" 32 simds that would be more than twice the texture and peak ALU rate of GTX 580, so you can't automatically conclude from that "small" number of simds that the chip is slow - if other bottlenecks are addressed it could potentially be twice as fast as GTX 580 (ok, not going to happen, but you get the point).
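A quick check of the "more than twice the texture and peak ALU rate of GTX 580" figure, as a minimal sketch. It assumes the hypothetical 32-SIMD part keeps Cayman's per-SIMD layout (16 VLIW4 units = 64 ALUs and 4 TMUs per SIMD) and the HD 6970's 880 MHz clock; the GTX 580 side uses its stock 512 ALUs at a 1544 MHz shader clock and 64 TMUs at 772 MHz.

```python
# Per-SIMD layout assumed from Cayman (VLIW4): 16 units x 4 lanes, 4 TMUs.
ALUS_PER_SIMD = 64
TMUS_PER_SIMD = 4


def gflops(alus, mhz):
    """Peak single-precision GFLOPS, counting a multiply-add as 2 FLOPs."""
    return alus * 2 * mhz / 1000


def gtexels(tmus, mhz):
    """Peak texel rate in GTexels/s (one texel per TMU per clock)."""
    return tmus * mhz / 1000


# Hypothetical 32-SIMD chip at the HD 6970's 880 MHz (assumption).
si_alu = gflops(32 * ALUS_PER_SIMD, 880)
si_tex = gtexels(32 * TMUS_PER_SIMD, 880)

# GTX 580: 512 ALUs at a 1544 MHz shader clock, 64 TMUs at 772 MHz.
gf110_alu = gflops(512, 1544)
gf110_tex = gtexels(64, 772)

alu_ratio = si_alu / gf110_alu
tex_ratio = si_tex / gf110_tex
print(f"ALU: {si_alu:.0f} vs {gf110_alu:.0f} GFLOPS ({alu_ratio:.2f}x)")
print(f"TEX: {si_tex:.1f} vs {gf110_tex:.1f} GT/s ({tex_ratio:.2f}x)")
```

Both ratios come out around 2.28x under these assumptions; a different SI clock would scale them linearly.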
 

I would expect something like that, as it would take about as much area as RV770 or Barts.
 
Ok, let's assume that the new chip will be small enough. But where did the doubling of SPs go? 320 to 800 (that's even 2.5 times), 800 to 1600, etc. Only Cayman seems quite beefy, containing stuff that should be removed, like Fermi 1.0. So, my bet is at least 2500 SPs; 32 simds is only 2048 SPs, if my calculations are right. The next-generation Nvidia GPU could very easily be a 1024-SP part. :(
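The SP arithmetic checks out. A minimal sketch, assuming VLIW4 SIMDs of 16 units × 4 lanes = 64 SPs each (Cayman's layout); `sps` and `simds_needed` are hypothetical helper names for illustration only.

```python
import math

SPS_PER_SIMD = 16 * 4  # VLIW4: 16 units per SIMD, 4 lanes per unit


def sps(simds):
    """Total stream processors for a given VLIW4 SIMD count."""
    return simds * SPS_PER_SIMD


def simds_needed(target_sps):
    """Smallest whole number of VLIW4 SIMDs reaching target_sps."""
    return math.ceil(target_sps / SPS_PER_SIMD)


# Past generational jumps quoted in the thread.
jumps = {"320 -> 800": 800 / 320, "800 -> 1600": 1600 / 800}

print(sps(32))             # 2048
print(simds_needed(2500))  # 40
print(jumps)
```

So a 2500-SP bet implies roughly 40 SIMDs under the same layout.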
 
Doubling was easy for RV770 because R600/RV670 actually spent a lot of their transistor budget on stuff like the ring bus, and their SIMDs weren't that size-optimized either.
Cypress could double SP count because it was both on a smaller process node and had more die area at its disposal.

And like mczak already said, it's pointless to double SPs unless some bottlenecks that are obviously still in place are removed first.

For example, it's possible that 4 schedulers feeding 8 SIMDs each would be far more efficient than the current 2x12 design. In a best-case scenario, the increased efficiency would result in real-world scaling better than the theoretical 33% increase in shader/TMU power.
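To make the 4x8 vs. 2x12 comparison concrete, here is a minimal sketch that only counts SIMDs per scheduler and the theoretical shader/TMU uplift at equal clocks. Scheduler efficiency itself is not modelled; whether fewer SIMDs per scheduler actually helps is exactly the speculation above.

```python
from fractions import Fraction

# (schedulers, SIMDs per scheduler); the 4x8 layout is hypothetical.
layouts = {"Cayman 2x12": (2, 12), "hypothetical 4x8": (4, 8)}
totals = {name: s * n for name, (s, n) in layouts.items()}

for name, (schedulers, per_sched) in layouts.items():
    print(f"{name}: {totals[name]} SIMDs, {per_sched} per scheduler")

# Theoretical shader/TMU uplift from 24 to 32 SIMDs at equal clocks.
uplift = Fraction(totals["hypothetical 4x8"], totals["Cayman 2x12"]) - 1
print(f"theoretical increase: {float(uplift):.1%}")
```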

Only Cayman seems quite beefy, containing stuff that should be removed, like Fermi 1.0.
Would you mind telling us what "beefy stuff" exactly is superfluous in your opinion and should be removed?

afaik, most of the additional transistors were spent on
- upgraded ROPs = better AA performance
- more TMUs [more SIMDs] = better AF performance
- doubled 'graphics engine' = better tessellation performance

As you can see, the problem is just that most of the additional transistors were spent on things that increase performance only in specific scenarios.

Nevertheless, I think those changes were necessary to make Cayman more future-proof than Cypress.


if my calculations are right. The next-generation Nvidia GPU could very easily be a 1024-SP part. :cry:
And it would likely be at least twice as large as a 32-SIMD chip from AMD.
 
AMD desperately needs to address some architecture bottlenecks; otherwise an increase from 24 simds to 32 simds (at the same clock) would give a 10% perf increase at best. So IMHO SI can't really be almost the same as NI. For that matter, I doubt the original plan actually had more simds, unless there were other changes (which seems unlikely given that Dave said the design was just back-ported to 40nm).

I pretty much agree, yet vehemently disagree.

I agree Cayman would be pretty well maxed out at 1792sp (28 SIMDs; <10% more perf from SIMDs), but that is more than 24 SIMDs. Anything above that appears pointless as far as I can tell... and I've come at it from A LOT of different angles. I don't rule out that I'm overlooking something, but at this moment it looks like the buck stops there for absolute efficiency. I hardly believe that makes the architecture broken, though, or non-scalable.

That said, I think the 6970 is a woefully damaged product compared to what it may have been (on 32nm or not). I have little doubt that at 900/6000 (core/memory MHz), a quad more SIMDs would have put 'RV970' eye-to-eye with GTX580. GTX580 seems clocked as if they were still expecting that product; it's that close (according to my math anyway).

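Going by TPU's latest graph at 1920x1200:

GTX580 is on average 1.11363636_ times as fast as the 6970 (±1%).

The 580 has 9.2% more bandwidth.
Assuming ~16% of performance tracks bandwidth: 9.2% × 0.16 = 1.472% of the performance difference comes from BW.
Remaining difference: 1.11363636_ − 0.01472 = 1.09891636_

On Cayman, a 1% theoretical SIMD increase = a 0.5% real-world (RW) performance increase on average. (Judging by overclocking results and the 6970/6950 difference, adjusting for BW.) So closing a 9.89% RW gap takes 2 × 9.89% = 19.78% more theoretical throughput:

2703360 MFLOPS (6970: 1536 SPs × 2 × 880 MHz) × 1.19783272_ ≈ 3238173 MFLOPS

3238173 / (1792 × 2) ≈ 903.5 MHz

903.5/900 = +0.4% theoretical, /2 = 0.2% RW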

SRSLY.

Now, is that PERFECTLY EXACT & REFLECTIVE FOR ALL SCENARIOS? No. Is there possibly a bottleneck between 1536-1792 (and probably 1664-1792)? Sure. I think it's pretty close though, as-in within a couple/few percent.
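The estimate above can be reproduced step by step. A minimal sketch: the 16% bandwidth sensitivity and the 0.5 real-world-per-theoretical scaling factor are the poster's own assumptions, and the 6970's peak is computed directly as 1536 SPs × 2 × 880 MHz.

```python
# Inputs taken from the post (the poster's assumptions, not new measurements).
gtx580_vs_6970 = 1.11363636   # TPU average at 1920x1200, +-1%
bw_advantage = 0.092          # GTX 580 has 9.2% more bandwidth
bw_sensitivity = 0.16         # share of performance assumed to track BW
simd_scaling = 0.5            # 1% theoretical -> 0.5% real-world on Cayman

# Remove the part of the gap attributed to bandwidth.
shader_gap = gtx580_vs_6970 - bw_advantage * bw_sensitivity

# Theoretical increase needed to close the remaining real-world gap.
theoretical_needed = 1 + (shader_gap - 1) / simd_scaling

hd6970_mflops = 1536 * 2 * 880
target_mflops = hd6970_mflops * theoretical_needed

# Clock a 1792-SP part would need to reach that throughput.
mhz_1792 = target_mflops / (1792 * 2)
print(f"{mhz_1792:.1f} MHz")

# Shortfall at 900 MHz, theoretical and real-world.
theoretical_short = mhz_1792 / 900 - 1
print(f"{theoretical_short:.2%} theoretical, "
      f"{theoretical_short * simd_scaling:.2%} real-world")
```

The result lands at roughly 903.5 MHz, i.e. a 900 MHz 1792-SP Cayman would sit within a fraction of a percent of GTX 580 under this model.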
_____________________________________________________________________

I think the next stop is 48 ROPs and 40-42 SIMDs. 40 makes the most sense for lots of reasons, but 42 meshes nicely with 1792/32 (the same 56 SPs per ROP), plus it's the answer to everything in the universe, so why not a future GPU SIMD count? At this point my opinion (which seems to change randomly) is that I would be surprised if neither is the case. If '32nm Cayman' was going to replace Cypress at a similar size, something has to do it in turn on 28nm. As such, something was always going to replace Barts, and that previous 32nm part was probably going to be shrunk to 28nm. A Cypress-size die going from 32->28nm would fill that spot nicely.
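Why 42 SIMDs pairs well with 48 ROPs: assuming VLIW4 SIMDs of 64 SPs, 42 SIMDs with 48 ROPs keeps exactly the SPs-per-ROP ratio of a 1792-SP, 32-ROP part, while 40 SIMDs lands slightly lower. A minimal sketch; `sps_per_rop` is a hypothetical helper name.

```python
from fractions import Fraction

SPS_PER_SIMD = 64  # VLIW4 SIMD (assumption carried over from Cayman)


def sps_per_rop(simds, rops):
    """Exact SPs per ROP for a given SIMD and ROP count."""
    return Fraction(simds * SPS_PER_SIMD, rops)


print(sps_per_rop(28, 32))  # 1792 SPs / 32 ROPs
print(sps_per_rop(42, 48))  # 2688 SPs / 48 ROPs
print(sps_per_rop(40, 48))  # 2560 SPs / 48 ROPs
```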

I hope the 28nm stack has both those discrete chips TBH; I think they'd be shoo-in replacements for Cypress and Barts. I'd be happy with half-specs as Juniper and Redwood replacements too. You'll notice they would oddly pair up with past parts (896 VLIW4 = 1120 VLIW5, 1280 VLIW4 = 1600 VLIW5). Strange, that... only not really at all.
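The VLIW4/VLIW5 pairing works because both counts describe the same number of VLIW units: divide by 4 for VLIW4, multiply by 5 for VLIW5. A minimal sketch (`vliw4_to_vliw5` is a hypothetical helper name); 1120 and 1600 happen to be the SP counts of Barts XT and Cypress XT.

```python
def vliw4_to_vliw5(sps_vliw4):
    """SP count of a VLIW5 part with the same number of VLIW units."""
    units, rem = divmod(sps_vliw4, 4)
    if rem:
        raise ValueError("VLIW4 SP count must be a multiple of 4")
    return units * 5


for count in (896, 1280):
    print(count, "VLIW4 =", vliw4_to_vliw5(count), "VLIW5")
```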
 
On Cayman, 1% SIMD theoretical increase = .5% RW performance increase on avg. (Judging by overclocking results & 6970/6950 difference adjusting for BW)
You are way overestimating simd scaling. On Cayman (from Pro to XT), 9% more simds is good for only a 2% perf increase (results taken from here, http://ht4u.net/reviews/2010/amd_radeon_hd_6970_6950_cayman_test/index43.php HD6970 at HD6950 clocks). Ok, there are some inaccuracies here, but still, scaling to even more simds is only going to make things worse. 4 more simds? 5% increase at best.
FWIW, Barts showed much better scaling - 7% faster for 17% more simds (14 vs. 12 simds - http://ht4u.net/reviews/2010/amd_radeon_hd_6850_hd_6870_test/index34.php). Given the same number of ROPs and not that much less memory bandwidth etc., though, that is certainly expected.
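Expressed as "percent performance per percent more SIMDs", the two data points look like this. A minimal sketch using only the numbers quoted above (HD 6950 to 6970 SIMD count at equal clocks, Barts 12 to 14 SIMDs); `scaling` is a hypothetical helper, and the last line assumes Cayman's efficiency extrapolates linearly, which is itself a guess.

```python
def scaling(simds_before, simds_after, perf_gain):
    """Return (SIMD increase, fraction of it that shows up as performance)."""
    simd_gain = simds_after / simds_before - 1
    return simd_gain, perf_gain / simd_gain


cayman_gain, cayman_eff = scaling(22, 24, 0.02)
barts_gain, barts_eff = scaling(12, 14, 0.07)

print(f"Cayman: +{cayman_gain:.1%} SIMDs -> +2% perf, efficiency {cayman_eff:.2f}")
print(f"Barts:  +{barts_gain:.1%} SIMDs -> +7% perf, efficiency {barts_eff:.2f}")

# Linear extrapolation of Cayman's efficiency to 24 -> 28 SIMDs (assumption).
extrapolated = (28 / 24 - 1) * cayman_eff
print(f"24 -> 28 SIMDs: ~{extrapolated:.1%}")
```

Under that linear assumption, 4 more SIMDs lands a little under 4%, consistent with "5% at best".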
But the point is, Cayman is past the point where just increasing the number of simds is economical - that is, adding simds increases die size more than it increases performance. Now, it is possible that fixing other bottlenecks would also increase die size more than performance, but at this point simd scaling is so minimal it seems AMD really will have to find something else to improve performance. I'm not sure increasing ROPs is going to help a lot either, since AMD apparently was looking at a 16-ROP Barts (with 2 more simds instead) with very similar performance - so it doesn't look like 32 ROPs are very limiting for Cayman (though 48 ROPs are probably a viable option - just 3 quad-ROP partitions instead of 2 per 64bit channel).
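The 48-ROP option follows from the memory layout. A minimal sketch, assuming Cayman's 256-bit bus is split into four 64-bit channels with two quad-ROP partitions each (32 ROPs); a third partition per channel gives 48. `rop_count` is a hypothetical helper name.

```python
def rop_count(bus_width_bits, partitions_per_channel, rops_per_partition=4):
    """ROPs for a bus of 64-bit channels with quad-ROP partitions."""
    channels = bus_width_bits // 64  # one channel per 64 bits of bus
    return channels * partitions_per_channel * rops_per_partition


print(rop_count(256, 2))  # Cayman layout
print(rop_count(256, 3))  # one extra quad-ROP partition per channel
```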
I'm still thinking 32 simds (4x8) would be enough for a next gen chip - that's twice the "simd performance" of Barts, after all. It just needs to scale better with simd count than Cayman (or Cypress...).
 
:mrgreen: Ok, if scaling is the problem, then in such a case simply add.... :mrgreen: .... uhhhhhhh....grrrrrrrrr............. :mrgreen:............ more cards. It's so simple. :mrgreen: CF scaling looks like a very good option. :mrgreen:
 
But the point is, Cayman is past the point where just increasing amount of simds is economical...
I think more analysis than a few rounded, averaged benchmarks needs to be done to really conclude anything; I'd say that probably one of the titles in the list there would demonstrate much in the way of performance improvement from additional SIMDs.
 

Oh you're right more detailed analysis would be nice. In particular the average not only includes quite a few titles, but also quite low resolutions (with and without AA/AF). It is entirely possible in some titles and at the higher resolutions scaling was better.
Still, that the average is only 2% higher (while on the same mix, 10% more clock gains 7% more performance) is quite telling imho.
And unfortunately you didn't give us a die shot of Cayman so we don't know the die size of a simd :). So those 9% more simds might only really cost 4% die area or so in which case scaling wouldn't really look that bad (but still not quite good).
In any case, if SI really turns out to be mostly a shrink with just more simds, color me surprised.
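Putting the clock and SIMD scaling figures side by side, along with the guessed area cost, as a minimal sketch. The 4% die-area figure is only the hypothetical from the post (no die shot was available), so the perf-per-area number is illustrative, not measured.

```python
clock_gain, clock_perf = 0.10, 0.07   # 10% more clock -> 7% more perf
simd_gain, simd_perf = 0.09, 0.02     # 9% more SIMDs -> 2% more perf
simd_area_cost = 0.04                 # hypothetical extra die area for them

clock_eff = clock_perf / clock_gain
simd_eff = simd_perf / simd_gain
# Perf gained per unit of extra area: 0.5 means 1% area buys 0.5% perf.
perf_per_area = simd_perf / simd_area_cost

print(f"clock scaling: {clock_eff:.2f}")
print(f"SIMD scaling:  {simd_eff:.2f}")
print(f"perf per area for the extra SIMDs: {perf_per_area:.2f}")
```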
 
I think more analysis than a few rounded, averaged benchmarks needs to be done to really conclude anything; I'd say that probably one of the titles in the list there would demonstrate much in the way of performance improvement from additional SIMDs.

Question for you, sir.

Good or bad way to look at things (on average):

After special function, this arch should be shooting for 3.25-3.4 shaders per ROP per cluster. That, to me, seems to be the average shader usage on both architectures after SF.

I.e., if we figure doing special function at the rate of GF100, the leftover rate is 3.5/4. If we figure GF10x, it's 3.33_. Obviously that's a comparative generalization, and usage varies, but they're constants and the competition, so it's worth using them.

((4+4+4+2)/4 = 3.5 and (4+4+2)/3 = 3.33_)

Therefore,

32 ROPs:
3.5 = 1792 ((3.5+3.5)/2 = 3.5, (3.33_+3.5)/2 = 3.416_) --- Excellent... if not over-equipped.
3.25 = 1664 ((3.5+3.25)/2 = 3.375, (3.33_+3.25)/2 = 3.2916_)... Optimal design.
3 = 1536 ((3.5+3)/2 = 3.25, (3+3.33_)/2 = 3.16_) - What we call 6970.
2.75 = 1408 ((3.5+2.75)/2 = 3.125, (3.33_+2.75)/2 = 3.0416_) - What we call 6950.

48 ROPs:
3.5 = 2688... same as 1792 with 32 ROPs.
3.33_ = 2560 ((3.33_+3.5)/2 = 3.416_, (3.33_+3.33_)/2 = 3.33_)... Optimal design.
3.16_ = 2432 ((3.16_+3.5)/2 = 3.33_, (3.16_+3.33_)/2 = 3.25)
3 = 2304... same as 6970.
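The table can be regenerated with exact fractions. A minimal sketch: the per-ROP ratio appears to be SPs ÷ ROPs ÷ 16 (an inference that reproduces every row, not something the post states), and each pair averages that ratio with the GF100 (3.5) and GF10x (3.33_) reference rates. `ratio` and `row` are hypothetical helper names.

```python
from fractions import Fraction

GF100 = Fraction(4 + 4 + 4 + 2, 4)  # 3.5
GF10X = Fraction(4 + 4 + 2, 3)      # 3.33_


def ratio(sps, rops):
    # Inferred: shaders per ROP per cluster, with 16 as the divisor.
    return Fraction(sps, rops * 16)


def row(sps, rops):
    """(ratio, average with GF100, average with GF10x)."""
    r = ratio(sps, rops)
    return r, (GF100 + r) / 2, (GF10X + r) / 2


for rops, sp_counts in ((32, (1792, 1664, 1536, 1408)),
                        (48, (2688, 2560, 2432, 2304))):
    print(f"{rops} ROPs")
    for sps in sp_counts:
        r, vs_gf100, vs_gf10x = row(sps, rops)
        print(f"  {sps}: {float(r):.3f} "
              f"({float(vs_gf100):.3f}, {float(vs_gf10x):.3f})")
```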

Terrible generalization, or barking up the right tree?

(Edit: Wait, did I ask this before? If I did, I apologize...I have a terrible memory...and don't follow threads and keep up on things as well as I should.)
 
Oh you're right more detailed analysis would be nice. In particular the average not only includes quite a few titles, but also quite low resolutions (with and without AA/AF). It is entirely possible in some titles and at the higher resolutions scaling was better.
Still, that the average is only 2% higher (while on the same mix, 10% more clock gains 7% more performance) is quite telling imho.
And unfortunately you didn't give us a die shot of Cayman so we don't know the die size of a simd :). So those 9% more simds might only really cost 4% die area or so in which case scaling wouldn't really look that bad (but still not quite good).
In any case, if SI really turns out to be mostly a shrink with just more simds, color me surprised.

Yes, I would implore you to look at TPU, for example, which shows the array of resolutions and the differences between them, while also covering a much wider array of games (including pretty much everything from your link) that you can look at individually. Cayman actually performs comparatively badly at low res. That said, at 1920x1200 it shouldn't really have any massive advantage from memory size over the 580's 1.5GB (1536MB)... and it doesn't when you compare it to the 580 at 1680x1050. Yet it should scale accurately at that resolution... which it does (the 6970 is 11.39% faster than the 6950 on average, ±1%). That's why I picked it... well, that and it's the most common res for its market. Likewise, I don't compare integrated gfx at 2560x1600. Not trying to pick on ya though. I love TPU's reviews for all that info... They must take W1z freaking forever... but I love him for it.
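As a rough cross-check of the 0.5 scaling factor: the HD 6970 (1536 SPs at 880 MHz) has 20% more theoretical ALU throughput than the HD 6950 (1408 SPs at 800 MHz). A minimal sketch, ignoring bandwidth entirely, shows what the quoted 11.39% average implies. Part of the excess over 0.5 would be bandwidth, since the 6970 also runs its memory about 10% faster (5.5 vs 5.0 Gbps), which is what a BW adjustment accounts for.

```python
theoretical = (1536 * 880) / (1408 * 800) - 1   # 6970 vs 6950 peak ALU
observed = 0.1139                                # quoted TPU average, +-1%

print(f"theoretical +{theoretical:.0%}, observed +{observed:.2%}")
print(f"implied scaling before any BW adjustment: {observed / theoretical:.2f}")
```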

As for being surprised, I guess I think you will be. That said, there might be a 1664sp product (if not 1792) on 28nm... and really... would that be more SIMDs than Cayman has on die? The world may never know. Considering that whole "6950s became 6970s... blahblahblah BIOS change... ZOMG YIELDS" snafu, I wouldn't take that bet.
 
Yes, I would implore you to look at TPU, for example, that show the array of resolutions and the differences between them, while also being a much wider array of games (including pretty much everything from your link) that you can look at individually. Cayman actually (comparably) performs pretty badly at low-rez.
The problem is, there's no HD6970 at HD6950 clocks there. I'm willing to believe, though, that at higher resolutions there is more of a performance difference from these 2 extra simds, but there's not enough data to even guess how much more.

As for being surprised, I guess I think you will be.
I'd be surprised if I'll be surprised :) We'll see ;-).
 

There have been benchmarks done by 6950 SIMD unlockers, and the increase from only unlocking the additional SIMDs varies greatly by application and benchmark. Some applications see very little increase in performance while others see quite a bit more.

So any potential increase or average increase is going to be determined quite a bit by the applications chosen by a person benching his card or a review site benching the card.

Regards,
SB
 