AMD: Southern Islands (7*** series) Speculation/ Rumour Thread

Good points all around.

I wonder if AMD is still targeting a doubling of compute, but will scale the design with clock speed to achieve it? What improvement in clock speed can we expect from TSMC's 28nm HKMG process? Most importantly, how does clock scaling on this process affect power consumption, and where is the best balance point?

EDIT: Remember, a new arch can go wrong initially because it can be too far ahead of its time and process, as with R600. Just to be clear, I don't expect an R600 repeat this time, more of an R300 type of launch.
 
It's not that simple. ALUs (unified core) take less than 1/2 of the die (I'd say it can be closer to 1/3 than 1/2) and I doubt they'll double the number of TMUs, ROPs and memory controllers as well.
AMD might not double TMUs, ROPs, MCs (though in the case of TMUs I wouldn't be surprised if there are still 4 per CU), but there are certainly other things they can do which will need die area.
Actually, the GCN presentation was pretty clear that there will be 4 TMUs per CU (even if it wasn't named as such).
 
I wonder if AMD is still targeting a doubling of compute, but will scale the design with clock speed to achieve it? What improvement in clock speed can we expect from TSMC's 28nm HKMG process? Most importantly, how does clock scaling on this process affect power consumption, and where is the best balance point?
As it looks like AMD opted to keep the scheduling overhead to a bare minimum, I wouldn't be surprised if they actually reduced the latency for arithmetic instructions compared to the VLIW offerings (as suggested in the GCN presentation). From that one cannot expect a very significant rise of the clockspeeds, even if they should use the HP process.

If Charlie is right and they use HPL, I would expect about the same or only slightly better transistor performance than 40G at half the power consumption per transistor (and dramatically reduced leakage). So roughly the same or slightly lower clockspeed, doubled transistor count and roughly the same (or slightly lower?) power consumption for the complete GPU.
On HP, I would expect somewhat fewer units than with a potential HPL offering, but at an increased clockspeed and still roughly the same power consumption. Those 200-something watts are basically a hard limit. But I expect an aggressive Powertune feature to push that limit.
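To put the HPL reasoning above into numbers, here is a minimal Python sketch. It assumes total power is simply transistor count times power per transistor times clock, which ignores voltage scaling and leakage details, so it is only a back-of-the-envelope model. The function name and the linear clock term are my own assumptions, not anything from AMD or TSMC.

```python
def relative_gpu_power(transistor_scale, power_per_transistor_scale, clock_scale=1.0):
    """Total GPU power relative to the 40G baseline.

    Assumes power is linear in transistor count, per-transistor power
    and clock (a simplification; voltage changes are ignored).
    """
    return transistor_scale * power_per_transistor_scale * clock_scale


# HPL guess from the post: twice the transistors, half the power per
# transistor, roughly unchanged clock.
hpl = relative_gpu_power(2.0, 0.5, 1.0)
print(f"HPL power vs. 40G baseline: {hpl:.2f}x")
```

With those inputs the doubling and the halving cancel out, which is why the total stays around the same 200-something watts.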
 
Good points all around.

I wonder if AMD is still targeting a doubling of compute, but will scale the design with clock speed to achieve it? What improvement in clock speed can we expect from TSMC's 28nm HKMG process? Most importantly, how does clock scaling on this process affect power consumption, and where is the best balance point?

EDIT: Remember, a new arch can go wrong initially because it can be too far ahead of its time and process, as with R600. Just to be clear, I don't expect an R600 repeat this time, more of an R300 type of launch.
I am expecting ~250mm2 die, with 32 CU's ~@800MHz.
 
I am expecting ~250mm2 die, with 32 CU's ~@800MHz.
That's pretty optimistic, don't you think?
Cayman is almost 400mm² in 40nm with 24 SIMDs. So even a straight Cayman shrink with 32 SIMDs would probably be close to that. Factor in the (moderately) increased transistor count per ALU from the more flexible scheduling, the doubled shared memory, the doubled L1 cache, a whole cache hierarchy supporting some coherency protocol, ECC-protected caches and registers, and half-rate DP (for Tahiti), combined with a beefed-up parallel setup and compute features, and I find that a bit hard to believe. Especially if you take into account that TSMC's 28nm HKMG processes don't reach twice the transistor density of 40G (something between a factor of 1.8 and 1.95, I think) because of more restrictive layout rules.
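As a rough sanity check on the numbers above, here is a minimal Python sketch of the straight-shrink estimate. It assumes the whole die scales linearly with SIMD count and shrinks uniformly by the density factor. Neither is really true (ALUs are only part of the die, and I/O doesn't shrink well), so treat the result as a ballpark. The function name is made up for illustration.

```python
def shrunk_area_mm2(base_area_mm2, base_units, target_units, density_gain):
    """Estimate die area after scaling unit count and shrinking the process.

    Assumes area is proportional to unit count and that every part of
    the die benefits equally from the density gain (both simplifications).
    """
    return base_area_mm2 * (target_units / base_units) / density_gain


# Cayman: ~400 mm^2 at 40nm with 24 SIMDs, scaled to 32 SIMDs on 28nm.
best_case = shrunk_area_mm2(400, 24, 32, 1.95)   # about 274 mm^2
worst_case = shrunk_area_mm2(400, 24, 32, 1.8)   # about 296 mm^2
print(f"Straight shrink with 32 SIMDs: {best_case:.0f} to {worst_case:.0f} mm^2")
```

That range is before any of the GCN additions listed above, which is the point: a plain shrink already lands above 250mm².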
 
I am expecting ~250mm2 die, with 32 CU's ~@800MHz.
Ahh, someone who is in the same boat :).
I have to agree with Gipsel though on the die size - I don't expect a ~400mm^2 chip but more than ~250mm^2 with the new arch. Maybe around or slightly below Cypress die size.
 
I have to agree with Gipsel though on the die size - I don't expect a ~400mm^2 chip but more than ~250mm^2 with the new arch. Maybe around or slightly below Cypress die size.
If they shoot for a 384bit memory interface (and 40 CUs) I would expect 400mm² for sure.
 
Maybe it's stupid, but could they do a 384bit memory interface with Redwood's MCs at half size, like Barts? 384bit @4.2GHz should have enough bandwidth, no?
 
Maybe it's stupid, but could they do a 384bit memory interface with Redwood's MCs at half size, like Barts? 384bit @4.2GHz should have enough bandwidth, no?

Not likely given that it would be more expensive, especially since they would have to put 3GB of RAM, so as not to move backwards from Cayman.
 
Maybe it's stupid, but could they do a 384bit memory interface with Redwood's MCs at half size, like Barts? 384bit @4.2GHz should have enough bandwidth, no?
That's only halfsize for the PHY which probably isn't all that much really. Plus I'm quite sure even assuming better memory efficiency AND a 384bit memory interface it could still benefit substantially from faster memory, so if going 384bit it doesn't seem like it would make sense to limit memory frequency "artificially". No matter the bus width I'm quite confident we're going to see memory frequency higher than ever seen before (or if it's really 384bit, at least as high as on Cayman).
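For reference, the bandwidth and capacity arithmetic behind this discussion can be sketched in a few lines of Python. Assumptions that are mine and not from the thread: "4.2GHz" is read as a 4.2 Gbps effective data rate per pin, Cayman is taken as a 256bit bus at 5.5 Gbps, and memory is built from 32bit-wide chips of 1 or 2 Gbit each. The function names are hypothetical.

```python
def bandwidth_gb_s(bus_width_bits, gbps_per_pin):
    """Peak memory bandwidth in GB/s: bytes per transfer times data rate."""
    return bus_width_bits / 8 * gbps_per_pin


def capacity_gb(bus_width_bits, chip_density_gbit, chip_width_bits=32):
    """Total memory in GB, assuming one chip per 32-bit slice of the bus."""
    chips = bus_width_bits // chip_width_bits
    return chips * chip_density_gbit / 8


proposed = bandwidth_gb_s(384, 4.2)  # about 201.6 GB/s
cayman = bandwidth_gb_s(256, 5.5)    # 176 GB/s under the assumption above
print(f"384bit @4.2: {proposed:.1f} GB/s, 256bit @5.5: {cayman:.1f} GB/s")

# 12 chips on a 384bit bus: 1.5 GB with 1 Gbit chips, 3 GB with 2 Gbit chips.
print(capacity_gb(384, 1), capacity_gb(384, 2))
```

So the slower 384bit setup would only be modestly ahead of a Cayman-style configuration, and the capacity steps are where the 3GB figure mentioned earlier comes from.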
 
I hope there will be interesting tidbits about the power management of the upcoming design.

For example, the quad-CU groups with their shared instruction cache and scalar L1 do amortize the cost of a big jump in Icache capacity. A group of four CUs has double the instruction cache of Cayman, so 32 CUs would give 16x the capacity for GCN.

Could these groups be put into deeper sleep states or power gated (assuming TSMC's process has reached the point where it can offer full-core gating; the literature I've seen covers finer-grained gating)?
They are described as being more autonomous, so less of the chip needs to interact with them should they shut down.
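The 16x figure above is just group count times the per-group ratio. A tiny Python sketch, assuming (as stated above) four CUs per group and each group carrying twice Cayman's total instruction cache; the function and parameter names are mine:

```python
def icache_vs_cayman(num_cus, cus_per_group=4, group_icache_vs_cayman=2.0):
    """Total instruction cache capacity as a multiple of Cayman's.

    Each group of CUs shares one instruction cache, so capacity scales
    with the number of groups rather than the number of CUs.
    """
    groups = num_cus // cus_per_group
    return groups * group_icache_vs_cayman


print(icache_vs_cayman(32))  # 8 groups x 2 = 16.0
```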
 
I hope there will be interesting tidbits about the power management of the upcoming design.

For example, the quad-CU groups with their shared instruction cache and scalar L1 do amortize the cost of a big jump in Icache capacity. A group of four CUs has double the instruction cache of Cayman, so 32 CUs would give 16x the capacity for GCN.

Could these groups be put into deeper sleep states or power gated (assuming TSMC's process has reached the point where it can offer full-core gating; the literature I've seen covers finer-grained gating)?
They are described as being more autonomous, so less of the chip needs to interact with them should they shut down.

Individual power gating of CUs seems hard. The thread scheduler is probably hardwired for the number of CUs. The data paths between fixed-function hardware and the ALUs are probably hardwired too. Gating a few CUs will probably lead to load imbalance.

To achieve that, you'll need some kind of sw managed scheduling.
 
With regard to die size, I don't think we're going to go back to RV770 levels of die size. I expect something close to Cypress. With APUs set to capture the entire low-end and lower mid-range market, the performance gap between APUs and mid-range discrete will have to go up. Otherwise there would be no reason to purchase mid-range discrete GPUs. I think they'll have to have a three-chip lineup going forward, similar to the current gen where they have Cayman, Barts and Turks. A chip like Caicos will not be required in 2012 when we have Ivy Bridge and Trinity. So for the three-chip lineup I expect something like a ~350 mm2 GPU, a ~250 mm2 GPU and a ~150 mm2 GPU.
 
That's pretty optimistic, don't you think?
Cayman is almost 400mm² in 40nm with 24 SIMDs. So even a straight Cayman shrink with 32 SIMDs would probably be close to that. Factor in the (moderately) increased transistor count per ALU from the more flexible scheduling, the doubled shared memory, the doubled L1 cache, a whole cache hierarchy supporting some coherency protocol, ECC-protected caches and registers, and half-rate DP (for Tahiti), combined with a beefed-up parallel setup and compute features, and I find that a bit hard to believe. Especially if you take into account that TSMC's 28nm HKMG processes don't reach twice the transistor density of 40G (something between a factor of 1.8 and 1.95, I think) because of more restrictive layout rules.

May be closer to 280 mm2 then. 32 CU's is kinda minimum. But considering the amount of efficiencies they were able to wring from rv670->rv770, I think I'll still keep my fingers crossed. ;)
 
With regard to die size, I don't think we're going to go back to RV770 levels of die size. With APUs set to capture the entire low-end market, the performance gap between APUs and discrete will have to go up. Otherwise there would be no reason to purchase mid-range discrete GPUs. I think they'll have to have a three-chip lineup going forward, similar to the current gen where they have Cayman, Barts and Turks. A chip like Caicos will not be required in 2012 when we have Ivy Bridge and Trinity. So I expect something like a ~350 mm2 GPU, a ~250 mm2 GPU and a ~150 mm2 GPU.

APU's are light years away from rv770 level perf. Let alone a modern ~250mm2 discrete GPU.
 
Individual power gating of CUs seems hard. The thread scheduler is probably hardwired for the number of CUs. The data paths between fixed-function hardware and the ALUs are probably hardwired too. Gating a few CUs will probably lead to load imbalance.

To achieve that, you'll need some kind of sw managed scheduling.
Can't see what the big deal is here. It has been possible to disable SIMDs (both in hardware and by software) for ages, and I don't see that changing with the new-style CUs. Not quite sure how dynamic that switching can currently be, but clearly those paths can't be that hardwired. So adding power gating should be quite easy from that point of view (whether it's that helpful is another matter).
Or are you referring to truly individual CUs? Then yes, I don't see that either; you should always disable groups of 4.
 
Sorry, that's a bit vague. It does not mention clocks, for example. Nor does it take into account the observed ROP performance in salvage products like HD 5830 and HD 6790, which is worse than one would expect.

The way I see it, the 14/32 decision led to a more balanced GPU (i.e. one with fewer performance cliffs). A 16/16 chip would be ROP bound much more often.
 