Oh, finally an easy question, thanks!
APUs are about accelerated processing; they're not about making discrete 3D accelerators obsolete for gaming. So it really is the OpenCL part that matters: robustness, predictable performance and whatever else AMD thought was worth improving in Cayman over Cypress. Don't know - maybe even the RBEs are making more effective use of bandwidth, which would be a good thing for bandwidth-starved APUs.
Cayman is marginal as an "improvement" in these regards, though - a tweak. OpenCL isn't really better served by VLIW-4 in the way that AMD is advertising GCN will serve it. That's a complete change.
Except in some games, where it doesn't matter anyway, asymmetric VLIW5 doesn't seem to compare favourably with the symmetric VLIW4 approach in Cayman. And the vector lanes in GCN have identical functionality too - so having that kind of animal out in the field probably helps the driver team identify some tricks before either GCN or Trinity launches.
Are you expecting each of the SIMD-16s to operate with sets of 4 lanes collaborating on a transcendental result, i.e. 4x transcendentals per clock (though in practice 3 lanes out of 4, like Cayman)? And 4 lanes collaborating on DP?
---
For DP-MAD that's 6 VGPRs that need to be read in 4 cycles (assuming 4-cycle dependent-instruction issue), since each of the three 64-bit operands takes 2 VGPRs. I don't think that's possible. On top of that, for both transcendentals and DP, the 1:1 mapping of 128-bit portions of the VGPRs to ALU lanes would fall apart, making operand collection considerably more costly.
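Spelling the operand counts out (my own arithmetic, just assuming a plain d = a*b + c with all three sources held in VGPRs):

```latex
% single-precision MAD: three 32-bit sources, one VGPR each
\[ 3 \times 1 = 3 \ \text{VGPR reads per lane, per instruction} \]
% double-precision MAD: three 64-bit sources, two VGPRs each
\[ 3 \times 2 = 6 \ \text{VGPR reads per lane, per instruction} \]
```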
One argument you could make is that for DP it's really 6 VGPRs in 16 cycles (or 8 cycles if DP is half-rate). Assuming there's a bypass network and more than 4 cycles of GPR fetch time before ALU execution (e.g. 8 cycles), that timing flexibility could save the day. Not sure...
Slide 18 of the GCN presentation says that 64-bit registers are formed from adjacent VGPRs. This might be nothing more than a technique to minimise fragmentation problems in register allocation, or it might refer to an RF burst operation that solves the "6 VGPRs in 4 cycles" problem (seems unlikely, though).
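To illustrate what I mean by pairing, here's a sketch of how one lane's 64-bit operand might be assembled from two adjacent 32-bit VGPRs. Only the "adjacent VGPRs" wording is from slide 18; the low/high ordering and any alignment rule are my assumptions.

```c
#include <stdint.h>

/* One lane's view of its 32-bit VGPR file (size is arbitrary here). */
static uint32_t vgpr[256];

/* Assemble the 64-bit value held in the adjacent pair v[n], v[n+1].
 * Slide 18 only says "adjacent"; whether n must be even is unknown. */
static uint64_t read_vgpr_pair(unsigned n)
{
    uint32_t lo = vgpr[n];        /* v[n]   : low 32 bits  */
    uint32_t hi = vgpr[n + 1];    /* v[n+1] : high 32 bits */
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    vgpr[10] = 0x89ABCDEFu;   /* low half  */
    vgpr[11] = 0x01234567u;   /* high half */

    /* A DP MAD d = a*b + c would need three such pairs read per lane,
     * i.e. the 6 VGPR reads discussed above. */
    return read_vgpr_pair(10) == 0x0123456789ABCDEFull ? 0 : 1;
}
```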
---
The 4-lane DP (and other operations, such as dual-dependent MUL with ADD) is founded on the dot-product functionality of the VLIW configuration. Dot-product requires cross-lane communication in the VLIW configuration, and it's that communication that makes those other instructions possible.
In GCN I expect dot-product is just a macro for MAD, MAD, MAD and ADD, like in NVidia. I expect transcendentals and DP are macros too, taking multiple cycles, with no lane-ganging.
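Something like this per lane, serial and with no cross-lane traffic (my own sketch of the kind of expansion I mean; whether it ends up MUL/MAD/MAD/MAD or MAD/MAD/MAD/ADD doesn't change the point):

```c
#include <stdio.h>

typedef struct { float x, y, z, w; } float4;

/* dot4 expanded into a per-lane serial MUL/MAD chain -- the kind of
 * macro expansion speculated on above, with no lane-ganging. */
static float dot4(float4 a, float4 b)
{
    float t = a.x * b.x;   /* MUL */
    t = a.y * b.y + t;     /* MAD */
    t = a.z * b.z + t;     /* MAD */
    t = a.w * b.w + t;     /* MAD */
    return t;
}

int main(void)
{
    float4 a = {1, 2, 3, 4}, b = {5, 6, 7, 8};
    printf("%f\n", dot4(a, b));   /* 70.000000 */
    return 0;
}
```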
---
I can't think of any other clues about the actual operation, but overall I'm sceptical about ganged operation.
Edit: The OpenCL driver also has a preferred vector width of 4.
Same for Cypress and Cayman. We'll have to wait and see what it is for GCN. But GCN is going to be register-limited in comparison with VLIW, so it'll probably be 1.
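For reference, this is the query I mean (standard OpenCL host code; error checking and device selection kept minimal, and I'm only showing the float width):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_uint width = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* The "preferred vector width" the driver reports -- 4 for float on
     * Cypress/Cayman, per the discussion above. */
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(width), &width, NULL);

    printf("preferred float vector width: %u\n", width);
    return 0;
}
```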