AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

So in Cayman, 64 transcendentals would be executed in groups of 16, taking four cycles total (full pipelining?), while in Tahiti, they would be executed all together, but would have to go through at least three loops? Is that right?
Cayman:
An instruction containing a transcendental (instructions always operate on a full warp of 64 elements) is issued over 4 cycles, same as all instructions. A transcendental operation just takes 3 slots of the VLIW instruction (leaving the 4th one for another operation).
The transcendental itself is computed as some kind of 3 dependent multiply-adds (using lookup tables and the inter-slot communication channels also used for double precision, the DOT instruction, and for co-issue of dependent operations [add_prev, mul_prev, madd_prev, sad_prev and so on]). So the latency is the same as with all other operations (8 cycles on all VLIW GPUs), just the throughput is lower.

With GCN, the same algorithm would probably map to an instruction computing and combining the result of the 3 dependent (internal) operations. This is done by looping over the MUL and ADD stages in the pipeline to compute the intermediate results in series (instead of in parallel in 3 slots of a VLIW unit). The wide adder circuit at the end of the pipeline (for FMA) combines the intermediate results into the final one. The throughput would be lowered to 1/3 and the latency would triple (12 cycles). AMD may save on some hardware somewhere, so it may be 1/4 or 1/6 (1/3 would be quite fast for transcendentals).
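Just to make that concrete, here's a rough sketch (my own toy, with a made-up table size and coefficients from a polynomial fit, not AMD's actual hardware tables) of how a transcendental like 1/x reduces to one lookup plus a chain of 3 dependent multiply-adds:

```python
# Toy sketch of table-based transcendental evaluation: one lookup, then a
# chain of 3 dependent multiply-adds (Horner form of a per-segment cubic).
# Table size and coefficients are made up for illustration; this is not
# AMD's actual hardware algorithm.
import numpy as np

SEGMENTS = 32
edges = 1.0 + np.arange(SEGMENTS + 1) / SEGMENTS   # split [1, 2) into segments

# Precompute a cubic fit of 1/x on each segment (the "lookup table").
table = []
for lo, hi in zip(edges[:-1], edges[1:]):
    xs = np.linspace(lo, hi, 64)
    table.append(np.polyfit(xs - lo, 1.0 / xs, 3))  # returns [c3, c2, c1, c0]

def recip(x):
    """Approximate 1/x for x in [1, 2) with one table lookup + 3 MADs."""
    seg = min(int((x - 1.0) * SEGMENTS), SEGMENTS - 1)
    c3, c2, c1, c0 = table[seg]
    dx = x - edges[seg]
    r = c3 * dx + c2   # MAD 1
    r = r * dx + c1    # MAD 2
    r = r * dx + c0    # MAD 3
    return r

print(recip(1.37), 1.0 / 1.37)   # both ~0.7299
```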
 
The die size argument is simply sticking your head in the sand. Just looking at the die sizes of say the 580 vs 5870 won't tell you very much about anything.

Die sizes don't matter at all to me as a customer. I'm not sure how you can start talking about architectural efficiency without first understanding the architectures you're discussing ;)

Die size doesn't matter to end customers, but it—and manufacturing cost in general—matters to IHVs. To customers, power matters. Well it matters to me anyway.

It's better if you understand the architectures but actually, you can talk about efficiency without understanding much. All you need is i) the cost, ii) the performance, iii) the power draw. Defining efficiency with theoretical numbers like peak flops or texels/s makes for interesting discussion, but in the end it's not useful to anyone.

Say you're a manager, your architects come to you with two designs, and you have to pick one:

Design A:
Peak FMA rate: 2TFLOPS
Manufacturing cost: $150
Power: 250W
Average performance: 100FPS
FPS/TFLOPS: 50

Design B:
Peak FMA rate: 20TFLOPS
Manufacturing cost: $140
Power: 240W
Average performance: 110FPS
FPS/TFLOPS: 5.5

Which one do you pick? Does the FPS/TFLOPS ratio factor in at all? In what way is it even useful information, beyond geeky curiosity?
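To make the point concrete, a trivial sketch running the numbers from the two hypothetical designs above on the metrics a buyer (or a manager) actually cares about:

```python
# Run the numbers for the two hypothetical designs above: the metrics that
# matter (FPS per dollar, FPS per watt) versus the "geeky" FPS/TFLOPS ratio.
designs = {
    "A": {"tflops": 2.0,  "cost": 150, "power": 250, "fps": 100},
    "B": {"tflops": 20.0, "cost": 140, "power": 240, "fps": 110},
}

for name, d in designs.items():
    print(name,
          f"FPS/$ = {d['fps'] / d['cost']:.2f}",
          f"FPS/W = {d['fps'] / d['power']:.2f}",
          f"FPS/TFLOPS = {d['fps'] / d['tflops']:.1f}")

# Design B wins on every metric that matters despite a ~10x "worse" FPS/TFLOPS.
```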

We'll see soon enough, I'm not sure how much "compute" Southern Islands can drop for lower cost parts. It seems like everything is already well balanced.

I'm guessing they can drop DP (or go to 1/8 rate at least) but little more. I don't think they'd want to drop more either: after all, they can't stop talking about Fusion these days.
 
I'm guessing they can drop DP (or go to 1/8 rate at least) but little more. I don't think they'd want to drop more either: after all, they can't stop talking about Fusion these days.
Additionally, they will probably drop the ECC implemented in Tahiti for the smaller GPUs, and they can make the LDS slower by using fewer banks (they did that already for the small members of the HD 5000/6000 families).
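For what it's worth, a back-of-the-envelope sketch of why cutting LDS banks is a cheap way to scale down - the bank count, bank width and clock below are assumptions for illustration, not confirmed Tahiti figures:

```python
# Back-of-the-envelope LDS bandwidth per unit: banks x bytes/bank x clock.
# All numbers here are assumptions for illustration, not confirmed specs.
def lds_bandwidth_gbytes_s(banks, bytes_per_bank=4, clock_ghz=0.925):
    return banks * bytes_per_bank * clock_ghz

print(lds_bandwidth_gbytes_s(32))  # full bank count: ~118 GB/s per unit
print(lds_bandwidth_gbytes_s(16))  # halved bank count: ~59 GB/s per unit
```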
 
Cayman:
An instruction containing a transcendental (instructions always operate on a full warp of 64 elements) is issued over 4 cycles, same as all instructions. A transcendental operation just takes 3 slots of the VLIW instruction (leaving the 4th one for another operation).
The transcendental itself is computed as some kind of 3 dependent multiply-adds (using lookup tables and the inter-slot communication channels also used for double precision, the DOT instruction, and for co-issue of dependent operations [add_prev, mul_prev, madd_prev, sad_prev and so on]). So the latency is the same as with all other operations (8 cycles on all VLIW GPUs), just the throughput is lower.

With GCN, the same algorithm would probably map to an instruction computing and combining the result of the 3 dependent (internal) operations. This is done by looping over the MUL and ADD stages in the pipeline to compute the intermediate results in series (instead of in parallel in 3 slots of a VLIW unit). The wide adder circuit at the end of the pipeline (for FMA) combines the intermediate results into the final one. The throughput would be lowered to 1/3 and the latency would triple (12 cycles). AMD may save on some hardware somewhere, so it may be 1/4 or 1/6 (1/3 would be quite fast for transcendentals).

Thanks!
 
AMD may save on some hardware somewhere, so it may be 1/4 or 1/6 (1/3 would be quite fast for transcendentals).
Not sure it really needs to be the same performance for all ops either. Intel had quite different rates for different ops, at least when they still had the separate MathBox (with rcp being the fastest). Not sure though about the SNB/IVB IGP (could probably look it up...).

We'll see soon enough, I'm not sure how much "compute" Southern Islands can drop for lower cost parts. It seems like everything is already well balanced.
In addition to what others have mentioned (lower-rate DP, no ECC, ...), which I think don't really reduce transistor count that much, I somewhat suspect the architecture still (like previous ones) doesn't scale that well with more shader resources at the high end: for 10% more SIMDs on Cayman you got something like maybe 3% more performance on average, whereas for GF110 or GF114 it was more like 7% when not changing anything else (though in contrast NVIDIA didn't really seem to scale with ROPs, for instance). So I think something with quite a few fewer CUs (but the same amount of "other stuff", including ROPs and geometry units) might indeed be slightly better balanced for games. Even the loss of bandwidth might not be that bad (I'm not quite sure how much area this saves; maybe significant if internal busses can also be narrower etc.).
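A crude Amdahl-style sketch (my own toy model, not anything official) of what those scaling figures would imply about how often the shader array is actually the bottleneck:

```python
# Toy Amdahl-style model: if 10% more SIMDs only buys X% more FPS, estimate
# the fraction of frame time that was shader-limited. My own assumption, not
# measured data: speedup = 1 / ((1 - f) + f / simd_ratio), solved for f.
def shader_bound_fraction(speedup, simd_ratio=1.10):
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / simd_ratio)

print(shader_bound_fraction(1.03))  # ~0.32 for the Cayman-like 3% case
print(shader_bound_fraction(1.07))  # ~0.72 for the GF110/GF114-like 7% case
```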
 
:???:I still don't understand (because I'm a dumb cave dweller) why DP rate is artificially limited when compute adoption is needed.:???:

I've heard the argument about cannibalizing professional segment products... but I don't think it would affect those segments sales because the main reason the pro segment stuff sells is the validation and driver/application stability, not theoretical performance.
 
[attached image: rumoured Southern Islands specifications table]
1536? x2 for New Zealand? At those specs, 2x Pitcairn XTs come close. After Hemlock and Antilles I'd be surprised if they drop that low.

That's also a really wide range of variants of the Cape Verde chip. Comparing Cape Verde Pro to the full chip:

43% of the shaders
43% of the TMUs
25% of the ROPs
50% of the bus width

Me too...just wouldn't put it as the best place I've lived :)
Agreed.
 
I see no way New Zealand is going to have 1536x2 with no other Southern Islands part using 1536. It makes no sense.

Yeah. Antilles features two full Caymans, albeit at lower clocks, and Cayman is a 250W part, just like Tahiti. I see no reason for New Zealand to be any different.
 
:???:I still don't understand (because I'm a dumb cave dweller) why DP rate is artificially limited when compute adoption is needed.:???:

I've heard the argument about cannibalizing professional segment products... but I don't think it would affect those segments sales because the main reason the pro segment stuff sells is the validation and driver/application stability, not theoretical performance.

DP, at least currently, isn't much used in the consumer space. Thus it's not a high priority when you get to consumer only parts. Or to put it another way, it's an area to increase margins (presumably smaller die size) without impacting performance in the consumer space.

For chips that are used in both professional and consumer-level products, the distinction would mostly be to give professional users a reason to pay 4x (or more) as much for the professional card versus the consumer card.

Regards,
SB
 
I've heard the argument about cannibalizing professional segment products... but I don't think it would affect those segments sales because the main reason the pro segment stuff sells is the validation and driver/application stability, not theoretical performance.

In addition to what SB said, I also think it would lead many people (probably university students) to try some of the game-grade SKUs for their semi-professional needs. But without the appropriate drivers for those applications (I'm thinking mainly CAD/CAM/CAE here), the experience would probably be less than expected and turn people away from the pro-SKUs based on their (false) conclusions.
 
DP, at least currently, isn't much used in the consumer space. Thus it's not a high priority when you get to consumer only parts. Or to put it another way, it's an area to increase margins (presumably smaller die size) without impacting performance in the consumer space.

In 2009 there wasn't a lot of adoption for AO implementations, or DoF, etc. It needed new hardware capabilities to gain traction. Would developers use DP more if it were pervasive throughout the range? Would there be a benefit? I don't know; nobody does it, because there's no hardware to run it on to determine if it would be beneficial at the lower SKUs.

For chips that are used in both professional and consumer-level products, the distinction would mostly be to give professional users a reason to pay 4x (or more) as much for the professional card versus the consumer card.

Regards,
SB

Professional users don't buy for the absolute performance, they buy for the stability and support as well as the perf/$. Enabling the top-end cards on the consumer side to have full-rate DP wouldn't suddenly cause a stall in top-end FirePro sales. Especially not with the lag to introduction the FirePros have vs. consumer cards. Is there even going to be a Cayman FirePro, at this point? For being better at compute, it sure got its ass handed to it by Cypress when it came to being implemented in servers and workstations, even though it had a nigh-on essential feature for high-performance compute use - PowerTune.

In addition to what SB said, I also think it would lead many people (probably university students) to try some of the game-grade SKUs for their semi-professional needs. But without the appropriate drivers for those applications (I'm thinking mainly CAD/CAM/CAE here), the experience would probably be less than expected and turn people away from the pro-SKUs based on their (false) conclusions.

But that's the case now, there are plenty of people who run their pro apps on consumer cards and then try to mod their Radeon to FirePro (and GeForce to Quadro) to run the pro drivers.

I agree that for products that will never be used for DP compute there is no reason to include it, but I'm cautious of the claims about which segments need that capability. I feel like DP should be full range and full speed in all products, from APUs to top-end dGPUs. It seems to me, backed by two or even three Google search queries, [folds arms, nods smugly] that providing that capability throughout the range, with all the compute accessibility we have now, would be beneficial for the apps that can leverage it and would lead to greater adoption of those products' variants - more sales.

By apps, I'm thinking photo/image/video post-processing filters, cryptography and compression, but I welcome correction from those who know whether there would be genuine rewards for adding that kind of capability support to real-world apps. In my dream fantasy world I'm thinking of an APU-based notebook with a software-encrypted hard drive using the GPU to perform the encrypt/decrypt for faster performance, or automatically encrypting emails.
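For what it's worth, one toy case where precision visibly bites (my own illustration, not tied to any particular app): a single-precision accumulator simply stops counting once it gets large enough, which is the kind of thing a long-running filter or statistics pass could hit.

```python
# Minimal illustration of where single precision runs out: beyond 2**24 a
# float32 accumulator can no longer represent +1 increments at all.
import numpy as np

acc32 = np.float32(2**24)                # 16,777,216
print(acc32 + np.float32(1) == acc32)    # True: the increment is silently lost
print(np.float64(2**24) + 1.0)           # 16777217.0: double keeps counting
```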
 
In the internal gubbins bits. Maybe I'm wrong and there will be no/is no discrepancy between FirePro/Stream card DP rate and consumer cards? Corrections and education welcomed!


[Hi Dave! Shame you weren't at the tech day. Devon did a great job :)]
 
In 2009 there wasn't a lot of adoption for AO implementations, or DoF, etc. It needed new hardware capabilities to gain traction. Would developers use DP more if it were pervasive throughout the range? Would there be a benefit? I don't know; nobody does it, because there's no hardware to run it on to determine if it would be beneficial at the lower SKUs.

I'm unsure if there will ever be a need for DP in the purely consumer space. Sure it's possible that some prosumer applications that straddle the line between consumer and professional markets might benefit but there aren't a lot of scenarios where single precision is that lacking. Again, in the consumer space. And, of course, for pro-sumers you always have the performance or enthusiast level of cards which will have DP whether full or castrated.

It isn't like AO (SSAO, HBAO, etc.) where you have a sometimes subtle but noticeable increase in visuals while at other times a much more noticeable increase in visuals.

Professional users don't buy for the absolute performance, they buy for the stability and support as well as the perf/$. Enabling the top-end cards on the consumer side to have full-rate DP wouldn't suddenly cause a stall in top-end FirePro sales. Especially not with the lag to introduction the FirePros have vs. consumer cards. Is there even going to be a Cayman FirePro, at this point? For being better at compute, it sure got its ass handed to it by Cypress when it came to being implemented in servers and workstations, even though it had a nigh-on essential feature for high-performance compute use - PowerTune.

Performance may or may not be a top priority when buying a product in the professional space as it all depends on what use or workload is expected. Support is obviously going to be consistently high on the list as is stability. But performance isn't going to be far behind if the application demands it.

Cayman faced a lot of uphill battles compared to Cypress. Not least of which was people in the industry thinking (and rightly so it appears) that it was a transitional product. There's other factors, of course, but it was what it was.

Having full DP in consumer cards (and we don't even know if 7970 is artificially limited in any way) obviously won't undermine all or even a majority of professional card sales, but it has the potential to reduce a not insignificant number of sales if it is too easy to modify the consumer level product into a professional level product in the level of performance if not in the level of support. Hence why Nvidia has made it almost impossible to convert a GTX 580 into a Quadro FX 580.

Regards,
SB
 
I'm unsure if there will ever be a need for DP in the purely consumer space. Sure it's possible that some prosumer applications that straddle the line between consumer and professional markets might benefit but there aren't a lot of scenarios where single precision is that lacking. Again, in the consumer space. And, of course, for pro-sumers you always have the performance or enthusiast level of cards which will have DP whether full or castrated.

Sure, I just don't know - if you unlock the capability, what would the unwashed masses do with it? Like you say, it could be very little benefit. But all it takes is one big innovation, different thinking, a new way to crack a nut.



Having full DP in consumer cards (and we don't even know if 7970 is artificially limited in any way) obviously won't undermine all or even a majority of professional card sales, but it has the potential to reduce a not insignificant number of sales if it is too easy to modify the consumer level product into a professional level product in the level of performance if not in the level of support. Hence why Nvidia has made it almost impossible to convert a GTX 580 into a Quadro FX 580.

Regards,
SB

I really disagree with this statement of potential. I think there will certainly be guys who use consumer instead of pro cards for their work. But I don't think it will be enough to make a huge dent in sales numbers, unless it gets leveraged like the Llano APU supercomputer. The number of guys that would do that, and the number of cards they'd buy, just wouldn't be a huge impact in my opinion. No business making real revenue is going to want to wander that far off the support matrix reserve, unless they're really pushing the envelope like being a valley startup in venture stage 2 or something.

While it's possible to point to a 500-GPU deployment using Radeons instead of FirePros as a loss for the pro business, it's also 500 GPUs that weren't NVIDIA's either. In that case, I can see it making a dent... but honestly, it's really not cannibalizing AMD FirePro/Stream sales, it would be taking Quadro/Tesla sales.

Limiting h/w features feels like the wrong response to a trivial problem, treating customers as if they're going to do wrong unless you prevent them. Rather, enable them to build what they need and sell support packages - offer tiers, so that if a professional business wants to use Radeon cards and get advanced support for driver issues, they can buy a package from AMD (or a partner company) to match FirePro levels. For FirePros they can get the 'free' support they get now and upgrade it to Enterprise grade with more options for support, problem fixes, etc. That way the solution becomes a revenue source, a feature becomes a value-add.

But I digress.

I think the next iteration of GCN-based chips (Central Islands? Tropical Islands? Archipelagos?) will need to have an increased prim rate, for performance at Eyefinity-size resolutions and on 4K screens/projectors.
 
But that's the case now, there are plenty of people who run their pro apps on consumer cards and then try to mod their Radeon to FirePro (and GeForce to Quadro) to run the pro drivers.
And you would like to make the appeal to people like this even larger, thus reducing your sales of high-margin products? Good luck selling that at the next shareholder meeting as company policy. ;)

I agree that for products that will never be used for DP compute there is no reason to include it, but I'm cautious of the claims about which segments need that capability. I feel like DP should be full range and full speed in all products, from APUs to top-end dGPUs. It seems to me, backed by two or even three Google search queries, [folds arms, nods smugly] that providing that capability throughout the range, with all the compute accessibility we have now, would be beneficial for the apps that can leverage it and would lead to greater adoption of those products' variants - more sales.

By apps, I'm thinking photo/image/video post-processing filters, cryptography and compression, but I welcome correction from those who know whether there would be genuine rewards for adding that kind of capability support to real-world apps.
I really am not so sure where we would really need DP for better results. I hope someone else (edit: someone other than me!) can come up with reasonable examples.


In my dream fantasy world I'm thinking of an APU-based notebook with a software-encrypted hard drive using the GPU to perform the encrypt/decrypt for faster performance, or automatically encrypting emails.
FWIW, I'm running an encrypted SSD in my Atom-based netbook and it's dirt slow (Serpent slows the SSD perf from around 70-ish MB/s to just under 30 MB/s). That said, I don't think normal CPUs, especially with more than one core, would show such a degree of slowness. :)
 
I think the next iteration of GCN-based chips (Central Islands? Tropical Islands? Archipelagos?) will need to have an increased prim rate, for performance at Eyefinity-size resolutions and on 4K screens/projectors.
At such screen resolutions the prim setup rate is the last problem -- limitations will be elsewhere in the pipeline. ;)
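A quick back-of-the-envelope sketch of why - the per-frame triangle count here is an assumption purely for illustration:

```python
# Raising the resolution multiplies the pixel-side work but leaves the
# per-frame triangle count (an assumed figure here) roughly unchanged.
TRIS_PER_FRAME = 2_000_000   # hypothetical scene complexity
FPS = 60

for name, w, h in [("1080p", 1920, 1080),
                   ("Eyefinity 3x1080p", 5760, 1080),
                   ("4K", 3840, 2160)]:
    print(f"{name}: {w * h * FPS / 1e9:.2f} Gpix/s to shade and fill, "
          f"{TRIS_PER_FRAME * FPS / 1e6:.0f} Mtris/s to set up")

# Setup demand stays at ~120 Mtris/s regardless of resolution (well under a
# ~1 GHz, 1-2 tri/clock front end), while pixel demand quadruples from 1080p to 4K.
```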
 