Begun, the attack of the quotes has... No, not really. I'll focus on the older point quoted, but I'll try to address more of your objections in the course of the post.
I don't think AMD lied to us when they said a 4-way symmetrical design makes life easier for the compiler and their toolchain, and is more efficient for generalized algorithms. Now, if you want to APU-ify every single market segment, you don't want to waste your expensive CPU cycles, with all their OoO overhead (area- and thus cost-wise), just to feed another processor inside the processor.
Having the experience they have with VLIW5 and with the evolution of workloads - both graphics and especially more generalized loads - and having actually listened to some of the AFDS presentations (not only read the PDFs), it seems pretty clear to me that they want to get rid of VLIW5 sooner rather than later. And not only because it means one less architecture variant to cater for. In Eric Demers' keynote (~22:20) he explicitly mentions that the VLIW4 arch is better optimized for compute workloads, and remember the Cayman launch - there they said they'd be getting 10% better performance per area for this kind of VLIW, plus simplified register management (which won't play a big role once they move to GCN, I'll admit). Again - AMD is all about cost-conscious products these days, and obviously they need to be - so it just makes sense to have the best possible solution integrated into their products.
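To put a rough number on that slot-packing argument, here's a toy back-of-envelope calculation in Python. The ~3.4 average occupied slots per VLIW bundle is my assumption (a figure often attributed to AMD for typical game shaders), not something from this thread, and the chip-level gain AMD quoted (~10% per area) is of course lower than this ALU-level figure, since the ALUs aren't the whole SIMD.

# Toy comparison of issue-slot utilization, VLIW5 vs symmetric VLIW4.
# avg_filled_slots is an assumed figure, not a measurement.
avg_filled_slots = 3.4

util_vliw5 = avg_filled_slots / 5   # share of issue slots doing useful work
util_vliw4 = avg_filled_slots / 4

print(f"VLIW5 slot utilization: {util_vliw5:.0%}")   # ~68%
print(f"VLIW4 slot utilization: {util_vliw4:.0%}")   # ~85%
print(f"ALU-level gain from dropping the fifth slot: {util_vliw4 / util_vliw5 - 1:.0%}")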
With regard to the RBEs, I don't think AMD has been introducing EQAA modes just because Nvidia has them (as CSAA), especially not with an independently selectable number of coverage samples. And I don't think the coalesced write ops are there just because they're funky to show in a presentation. AMD knows that they are severely bandwidth-limited in their Fusion APUs and that they will need ways, methods and tricks to save as much of this bandwidth as they can. Because even if they add area here, every byte of bandwidth wasted leaves more of the expensive SIMD arrays (and, in the future, CUs) sitting idle.
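As a rough illustration of that bandwidth angle (my own back-of-envelope, not AMD's numbers): compare the per-pixel color footprint of plain 8x MSAA with an EQAA-style mode that keeps 8 coverage samples but stores only 4 color samples. I'm assuming uncompressed RGBA8 and roughly 4 bytes for the per-pixel coverage/fragment mask; real RBEs add color and Z compression on top of this.

# Per-pixel color footprint: 8x MSAA vs an 8x-coverage / 4x-stored-color EQAA mode.
# RGBA8 and a ~4-byte coverage mask are assumptions for illustration only.
BYTES_PER_COLOR_SAMPLE = 4          # RGBA8
COVERAGE_MASK_BYTES    = 4          # assumed size of the per-pixel sample-to-fragment mask

msaa_8x  = 8 * BYTES_PER_COLOR_SAMPLE                          # 32 bytes/pixel
eqaa_8_4 = 4 * BYTES_PER_COLOR_SAMPLE + COVERAGE_MASK_BYTES    # 20 bytes/pixel

print(f"8x MSAA:            {msaa_8x} bytes per pixel")
print(f"8x EQAA (4 stored): {eqaa_8_4} bytes per pixel")
print(f"saving:             {1 - eqaa_8_4 / msaa_8x:.0%}")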
Another - maybe quite far-fetched - hypothesis of mine is that the actual execution units in Cayman, symmetrical as they are, might not be so very different from what we're going to see integrated into the vector units of GCN's CUs. What I'm talking about are the individual ALU lanes, of course - stripped of the branch unit (which will then be handled by the scalar unit) and grouped into a 4x4 configuration.
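The lane counting at least lines up - this is just my own arithmetic, not anything AMD has confirmed:

# Lane counting behind the "not so different" hypothesis (my arithmetic only).
cayman_lanes_per_simd = 16 * 4      # 16 VLIW4 SPs x 4 symmetric ALU lanes = 64
gcn_lanes_per_cu      = 4 * 16      # 4 vector units x 16 lanes each       = 64
wavefront_size        = 64
cycles_per_wavefront  = wavefront_size // 16   # a 16-wide vector unit walks a wave in 4 cycles

print(cayman_lanes_per_simd, gcn_lanes_per_cu, cycles_per_wavefront)   # 64 64 4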
How exactly DP is handled now, and thus how it will probably be handled in GCN, I have no idea. I only hear on the forums about ganged operation of ALUs, but from all I've read neither AMD nor Nvidia provides specifics about this - they only talk about "operations" (instruction slots?). AFAIK, you need 23 bits of mantissa for single precision and 52 bits for doubles, right? Wouldn't it suffice then to "just" extend your SP ALUs' mantissa to 26 bits (and of course the required data paths) in order to combine just two ALU lanes? Note that Demers only said about the current VLIW4 that "They're all effectively the same" (~22:50, my emphasis) and how it was easier to optimize (the hell out of) just one unit. I don't know if that's really a clue as to whether the four (or soon 16) lanes are identical, or whether two out of four are just a little wider for half-rate DP.
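A quick sanity check of the mantissa arithmetic behind that question - my own illustration, using the IEEE 754 stored-fraction widths and ignoring the extra partial products a full double-width multiply would actually require:

# IEEE 754 stored fraction widths (excluding the implicit leading 1 bit).
SP_MANTISSA = 23
DP_MANTISSA = 52

# The idea from the post: widen each SP lane's mantissa path from 23 to 26 bits,
# then gang two lanes so their halves cover a double's 52-bit mantissa.
WIDENED_LANE = 26
print(2 * WIDENED_LANE >= DP_MANTISSA)   # True: two 26-bit halves cover 52 bits

# Caveat (my note): that only covers the operand width; a full DP multiply also
# needs the cross partial products of the two halves, so "just gang two lanes"
# is a simplification of what the hardware would really have to do.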
To come back to the point quoted - that Cayman is marginal as an "improvement" in these regards, a mere tweak: I disagree, especially in terms of the RBEs, for the reasons above.