AMD: Southern Islands (7*** series) Speculation/Rumour Thread

APUs are light years away from RV770-level performance, let alone a modern ~250 mm² discrete GPU.

I'm not suggesting that APUs are going to catch up to upper-midrange or high-end GPUs; in fact, I said nothing of the sort. I just said I don't expect AMD's next high-end GPU to be the size of RV770. My point was rather that the gap between the next-gen APUs and the next-gen midrange GPUs will grow. There won't be any more low-end GPUs (which of course is a known fact), and I expect a three-chip lineup going forward with die sizes of roughly 150, 250, and 350 mm².

And as for "light years away": I expect a 22nm Fusion with GCN could get quite close to RV770 (memory bandwidth might be a limiting factor, though; maybe they'll go triple-channel to solve that problem). And we should see it in about two years' time, say mid-2013.
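To put some rough numbers on the bandwidth concern, here's a back-of-the-envelope sketch in Python; the DDR3 speed grade and the RV770 board figures are illustrative assumptions, not predictions:

```python
# Peak bandwidth in GB/s: channels * bus width (bits) * transfer rate (MT/s) / 8 bits per byte.
def peak_bandwidth_gbs(channels, mt_per_s, bus_bits=64):
    return channels * bus_bits * mt_per_s / 8 / 1000

# Hypothetical APU system memory configurations (DDR3-1866 assumed).
dual_channel   = peak_bandwidth_gbs(2, 1866)            # ~29.9 GB/s
triple_channel = peak_bandwidth_gbs(3, 1866)            # ~44.8 GB/s

# RV770 board (HD 4870): 256-bit GDDR5 at 900 MHz (3600 MT/s effective).
rv770 = peak_bandwidth_gbs(1, 3600, bus_bits=256)       # ~115.2 GB/s

print(f"Dual-channel DDR3-1866:   {dual_channel:6.1f} GB/s")
print(f"Triple-channel DDR3-1866: {triple_channel:6.1f} GB/s")
print(f"HD 4870 (RV770) GDDR5:    {rv770:6.1f} GB/s")
```

Even triple-channel DDR3 lands well short of an HD 4870's memory bandwidth, which is why I'd expect bandwidth rather than ALU count to be the limiter.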
 
APUs are light years away from RV770-level performance, let alone a modern ~250 mm² discrete GPU.

"Light years away" isn't the term I'd use when the desktop versions of the APUs are 1/3rd to 1/4th the performance of a discrete RV770.
I'm expecting Trinity to come really close to an HD 4850, for example.

BTW, xbitlabs is claiming there'll be 28nm NGC GPUs in 2011, and those will be made using HKMG.
 
Otherwise there would be no reason to purchase midrange discrete GPUs.
Maybe they would buy one because they need a discrete card? Or they don't want an APU? Or they have an older system?

I think they'll have to have a three-chip lineup going forward, similar to the current generation with Cayman, Barts, and Turks. A chip like Caicos will not be required in 2012, when we have Ivy Bridge and Trinity. So for the three-chip lineup I expect something like a ~350 mm² GPU, a ~250 mm² GPU, and a ~150 mm² GPU.
It will be 4 chips again. They still need a ~100 mm², 8-12 SIMD VLIW4 discrete card to buddy up with Trinity and take care of the <$100 and HTPC market.
I would consider it more of a low-to-midrange chip. I think we have seen the last of the 64-bit cards from AMD, though.

The chips as I see them:
High end: GCN, 32 CUs
Performance: GCN, 24 CUs (matching GTX 580)
Midrange: GCN, 16 CUs (somewhere between 6950/6870, maybe a 5870?)
Low end: VLIW4, 8-12 SIMDs (around a 6790/6770)
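For a rough sense of scale, here's a quick sketch of peak FP32 throughput for those tiers, assuming a GCN CU is 64 ALUs (4 × 16-wide SIMDs) and a ~900 MHz clock; both figures are assumptions, since nothing is announced:

```python
# Peak FP32 TFLOPS = CUs * 64 ALUs * 2 ops/clock (FMA) * clock (GHz) / 1000.
ALUS_PER_CU = 64      # assumed: 4 x 16-wide vector SIMDs per CU
CLOCK_GHZ   = 0.9     # assumed clock, nothing announced

def peak_tflops(cus, clock_ghz=CLOCK_GHZ):
    return cus * ALUS_PER_CU * 2 * clock_ghz / 1000

for tier, cus in [("High end", 32), ("Performance", 24), ("Midrange", 16)]:
    print(f"{tier:11s} ({cus:2d} CUs): {peak_tflops(cus):.2f} TFLOPS")
# High end    (32 CUs): 3.69 TFLOPS
# Performance (24 CUs): 2.76 TFLOPS
# Midrange    (16 CUs): 1.84 TFLOPS
```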
 
BTW, xbitlabs is claiming there'll be 28nm NGC GPUs in 2011, and those will be made using HKMG.
HKMG was clear. What is more interesting in my eyes is that Thomas Seifert's wording implies there will be GPUs from both TSMC and GF. So maybe the performance model from GF, and high end and mainstream from TSMC, or something like that.
Thomas Seifert said:
"At the 28nm node, all of our products will be based on bulk process technology, providing increased flexibility to work across our two committed and valued partner. [...] [With the introduction of 28nm process technologies], our flexibility to manage risk across the foundry partner ecosystem that we have has significantly increased," said Mr. Seifert.
Or perhaps that is merely a prospect for later in 2012.
 
Sorry, that's a bit vague. It does not mention clocks, for example. Nor does it take into account the observably worse ROP performance in salvage products like the HD 5830 and HD 6790, compared to what one would expect.
Put another way, Cypress is hugely inefficient per mm² by comparison: 31% bigger for ~15% more performance?
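Worked through with the commonly cited die sizes (Cypress ~334 mm², Barts ~255 mm²); the ~15% performance delta is my rough assumption for typical games:

```python
cypress_mm2, barts_mm2 = 334.0, 255.0  # approximate published die sizes
perf_ratio = 1.15                      # assumed: Cypress ~15% faster in games

area_ratio = cypress_mm2 / barts_mm2   # ~1.31x, i.e. ~31% bigger
perf_per_area = perf_ratio / area_ratio  # ~0.88x
print(f"Cypress is {area_ratio - 1:.0%} bigger for {perf_ratio - 1:.0%} more performance")
print(f"-> roughly {1 - perf_per_area:.0%} worse performance per mm^2")
```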
 
A little bit off-topic, but this whole thing is getting confusing for me: so when is the first GCN-based product coming out? I thought it was SI, but it looks like I was confused?
 
A little bit off-topic, but this whole thing is getting confusing for me: so when is the first GCN-based product coming out? I thought it was SI, but it looks like I was confused?

SI discrete cards will be GCN-based, at least most of them.
But as far as I understand, an exception will be made for the lower-end cards.

Trinity (Llano's successor) will sport a VLIW4 iGPU. AMD has been betting on Hybrid CrossFire between the APU's iGPU and discrete GPUs, mainly because of the performance value it represents in mid-range laptops.
To achieve this for Trinity, AMD will have to launch a GPU that is a discrete version of Trinity's iGPU (let's say, HD 75xx).

That said, Southern Islands GPUs should be GCN, except for the HD 75xx/76xx parts, which should be VLIW4.

Since there's really no need for better performance in that segment, Caicos will probably be rebranded to HD 73xx, with this discrete card segment eventually disappearing during 2013-2014, when top-end AMD CPUs become APUs.

So most probably:

HD 77xx and up: GCN architecture
HD 75xx and/or HD 76xx: VLIW4 architecture
HD 74xx: probably reserved for iGPUs in lower-end Trinity APUs: VLIW4 architecture
HD 73xx and down: probably rebadged Caicos discrete cards, along with Krishna/Wichita APUs (which may very well bring integrated Caicos): VLIW5 architecture (160 SPs, 8 TMUs, 4 ROPs)
 
SI discrete cards will be GCN-based, at least most of them.
But as far as I understand, an exception will be made for the lower-end cards.

Trinity (Llano's successor) will sport a VLIW4 iGPU. AMD has been betting on Hybrid CrossFire between the APU's iGPU and discrete GPUs, mainly because of the performance value it represents in mid-range laptops.
To achieve this for Trinity, AMD will have to launch a GPU that is a discrete version of Trinity's iGPU (let's say, HD 75xx).

That said, Southern Islands GPUs should be GCN, except for the HD 75xx/76xx parts, which should be VLIW4.

Since there's really no need for better performance in that segment, Caicos will probably be rebranded to HD 73xx, with this discrete card segment eventually disappearing during 2013-2014, when top-end AMD CPUs become APUs.

So most probably:

HD 77xx and up: GCN architecture
HD 75xx and/or HD 76xx: VLIW4 architecture
HD 74xx: probably reserved for iGPUs in lower-end Trinity APUs: VLIW4 architecture
HD 73xx and down: probably rebadged Caicos discrete cards, along with Krishna/Wichita APUs (which may very well bring integrated Caicos): VLIW5 architecture (160 SPs, 8 TMUs, 4 ROPs)

Thanks for the clarification :)
 
Individual power gating of CUs seems hard. The thread scheduler is probably hardwired for the number of CUs. The data paths between the fixed-function hardware and the ALUs are probably hardwired. Gating a few CUs will probably lead to load imbalance.
I stated gating groups of 4. If Llano is any guide, 4 CUs should have at least the area of an x86 core that is individually power gated.

Which thread scheduler are you speaking of?
The per-CU scheduler within the CU, or a thread scheduler that is not shown at the level of the command processor, ACEs, and primitive pipes?

The likely minimum granularity of any link is probably a quad of CUs, since the CU would be in charge of doing the necessary instruction fetch to perform whatever work its compute client requests, and finer granularity would thrash the Icache. Other than tracking occupancy and readiness information, a global thread scheduler has much less to do than it did in earlier designs.

More detail is needed to flesh out the links, since the only mentioned connection at all between the CUs and the specialized pipes at the chip level is the message bus.

To achieve that, you'll need some kind of software-managed scheduling.
Microcontroller-managed gating and thread movement is found in shipping products.
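To illustrate why a quad of CUs would be the natural gating unit, here's a toy Python model; the class and its fields are hypothetical, purely for illustration:

```python
# Toy model: a quad of CUs shares fetch/Icache resources, so it can only be
# power gated once all four CUs have drained their wavefronts.
class CUQuad:
    def __init__(self, wavefronts_per_cu):
        self.wavefronts = list(wavefronts_per_cu)  # outstanding wavefronts per CU
        self.gated = False

    def try_gate(self):
        # Gating a single CU inside the quad is assumed impractical; the whole
        # group must be idle before the power gate can close.
        if all(w == 0 for w in self.wavefronts):
            self.gated = True
        return self.gated

quads = [CUQuad([0, 0, 0, 0]), CUQuad([3, 0, 1, 0]), CUQuad([0, 0, 0, 0])]
for i, q in enumerate(quads):
    print(f"quad {i}: gated={q.try_gate()}")
# quad 0: gated=True
# quad 1: gated=False  (one busy CU keeps the whole quad awake)
# quad 2: gated=True
```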
 
Put another way, Cypress is hugely inefficient per mm² by comparison: 31% bigger for ~15% more performance?
It depends on the workload. Cypress is far faster than Barts on compute workloads, for example. Also, Cypress has support for doubles and Barts does not.

One other thing to keep in mind is that Barts has a higher clock speed than Cypress, which increases the performance of the non-shader parts (triangle rate, fill rate, etc.).
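A quick worked example with the reference clocks (HD 6870's Barts at 900 MHz vs HD 5870's Cypress at 850 MHz, both with 32 ROPs):

```python
# Peak pixel fill rate in Gpixels/s = ROPs * core clock (MHz) / 1000.
def fill_rate_gpix(rops, clock_mhz):
    return rops * clock_mhz / 1000

barts   = fill_rate_gpix(32, 900)   # 28.8 Gpix/s
cypress = fill_rate_gpix(32, 850)   # 27.2 Gpix/s
print(f"Barts:   {barts:.1f} Gpix/s")
print(f"Cypress: {cypress:.1f} Gpix/s ({barts / cypress - 1:+.1%} in favour of Barts)")
```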
 
It depends on the workload. Cypress is far faster than Barts on compute workloads, for example. Also, Cypress has support for doubles and Barts does not.

One other thing to keep in mind is that Barts has a higher clock speed than Cypress, which increases the performance of the non-shader parts (triangle rate, fill rate, etc.).

Even then, a ~30% area increase for ~15% more performance seems like an inefficient design.
 
I can't see what the big deal is here. It has been possible to disable SIMDs (both in hardware and by software) for ages, and I don't see that changing with the new-style CUs. I'm not quite sure how dynamic that switching can currently be, but clearly those paths can't be that hardwired. So adding power gating should be quite easy from that point of view (whether it's actually that helpful is another matter).
I think a bigger problem might be that bulk power-gate transistors for gating several watts aren't going to be easy in the first place. Only Intel does that, and Intel is, well, Intel.
Or are you referring to truly individual CUs? Then yes, I don't see that either; you should always disable groups of 4.
I had actually not considered that. But obviously, you would have to disable CUs in multiples of 4.
 
I stated gating groups of 4. If Llano is any guide, 4 CUs should have at least the area of an x86 core that is individually power gated.

Which thread scheduler are you speaking of?
The per-CU scheduler within the CU, or a thread scheduler that is not shown at the level of the command processor, ACEs, and primitive pipes?

The likely minimum granularity of any link is probably a quad of CUs, since the CU would be in charge of doing the necessary instruction fetch to perform whatever work its compute client requests, and finer granularity would thrash the Icache. Other than tracking occupancy and readiness information, a global thread scheduler has much less to do than it did in earlier designs.

More detail is needed to flesh out the links, since the only mentioned connection at all between the CUs and the specialized pipes at the chip level is the message bus.
I guess the issue of links is surmountable. But there has been little incentive to pursue this since, AFAIK, gating ~10 W on TSMC bulk has not been possible so far.

Microcontroller-managed gating and thread movement is found in shipping products.
Do existing GPUs have software/hardware/firmware/microcontroller-managed graphics migration? AMD is targeting GPU graphics preemption in its 4th generation of Fusion, so they can't be doing it yet. I think preemption in graphics mode is needed for this kind of fine-grained power consumption modulation.
 
I think a bigger problem might be that bulk power-gate transistors for gating several watts aren't going to be easy in the first place. Only Intel does that, and Intel is, well, Intel.
I had actually not considered that. But obviously, you would have to disable CUs in multiples of 4.

That's not entirely true. AMD is doing that sort of power gating for Bulldozer and other designs.

The problem is that unless you have a huge upper metal layer (e.g. Intel's M9), you don't have enough interconnect to handle the power. AMD is using the package for power distribution IIRC, which probably has some advantages and disadvantages compared to Intel's approach.
 
Do existing GPUs have software/hardware/firmware/microcontroller-managed graphics migration? AMD is targeting GPU graphics preemption in its 4th generation of Fusion, so they can't be doing it yet. I think preemption in graphics mode is needed for this kind of fine-grained power consumption modulation.

I'm not aware of migration capability in current GPUs. I meant that, in a general sense, it is possible, since there are CPUs that have done this for the last few generations.

Even without migration, a more conservative scheme could run based on the work distributor. In low-load situations, the distributor could see that the full set of CUs have kernels whose wavefronts finish quickly and are not taking up full occupancy.
The distributor could then favor assigning new kernels to a subset of the compute engine.
Once the ignored CUs complete their wavefronts, they could go to sleep.
I'm not saying this would be done, but that it could be.
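A minimal sketch of what such a distributor policy could look like; the threshold, structures, and function names are all made up for illustration:

```python
# Hypothetical work-distributor policy: under low occupancy, steer new kernels
# to a preferred subset of CUs so the rest can drain their wavefronts and sleep.
LOW_LOAD_THRESHOLD = 0.25  # assumed fraction of peak occupancy

def pick_target_cus(cus, occupancy):
    if occupancy < LOW_LOAD_THRESHOLD:
        return cus[:4]  # favour one quad; the remaining CUs get no new work
    return cus

def sleep_drained(cus, targets):
    for cu in cus:
        if cu not in targets and cu["wavefronts"] == 0:
            cu["asleep"] = True  # drained and ignored -> power it down

cus = [{"id": i, "wavefronts": 0, "asleep": False} for i in range(8)]
cus[0]["wavefronts"] = 2                  # a single lightly loaded CU
targets = pick_target_cus(cus, occupancy=0.1)
sleep_drained(cus, targets)
print([cu["id"] for cu in cus if cu["asleep"]])   # [4, 5, 6, 7]
```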
 
That's not entirely true. AMD is doing that sort of power gating for Bulldozer and other designs.

The problem is that unless you have a huge upper metal layer (e.g. Intel's M9), you don't have enough interconnect to handle the power. AMD is using the package for power distribution IIRC, which probably has some advantages and disadvantages compared to Intel's approach.
BD is SOI, and wasn't SOI supposed to be better for this anyway? IIRC, AMD touted p-channel power-gate transistors on SOI as one of SOI's advantages.

And even then, this is another example of Intel's process prowess showing through.
 
Even without migration, a more conservative scheme could run based on the work distributor. In low-load situations, the distributor could see that the full set of CUs have kernels whose wavefronts finish quickly and are not taking up full occupancy.
The distributor could then favor assigning new kernels to a subset of the compute engine.
Once the ignored CUs complete their wavefronts, they could go to sleep.
I'm not saying this would be done, but that it could be.

This might be feasible. Even clock gating idling CU quads could be a win.
 