AMD: R9xx Speculation

I didn't think of relaxed math options, but SKA doesn't expose any compiler options, anyway.

GPUSA has an option for IEEE strictness for HLSL (DirectCompute). The option is off; regardless of its setting, the optimisation doesn't appear.

GPUSA is Cat 10.3, though, so way out of date (I have the latest version of GPUSA, 1.54).

It's occurred to me that my long-standing qualm with 4-lane transcendental is due to my assumption that there'd be time for the first MUL to be rounded/normalised before feeding into the second lane's MUL (and so on for the MULs and ADDs required for a transcendental). If the _PREV versions of these instructions are un-normalised (as this behaviour suggests), then that timing issue seemingly disappears.

The reason I had this point of view is that MAD on Evergreen is a true MUL then ADD sequence with rounding/normalisation.
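
To make the rounding point concrete, here's a minimal C sketch - it uses fmaf from math.h as a stand-in for an un-rounded intermediate, so it's purely an illustration of the numerics, not of Evergreen's actual datapath:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Chosen so the exact product a*b = 1 - 2^-26 is NOT representable
       in single precision and rounds up to 1.0f. */
    float a = 1.0f + 0x1p-13f;
    float b = 1.0f - 0x1p-13f;
    float c = -1.0f;

    /* Storing the product to a float forces the rounding/normalisation;
       build without -ffast-math so the compiler doesn't fuse it anyway. */
    float prod     = a * b;         /* rounded intermediate              */
    float separate = prod + c;      /* true MUL then ADD, as in MAD      */
    float fused    = fmaf(a, b, c); /* product kept exact internally     */

    printf("separate: %a\n", separate); /* 0x0p+0: the tiny term is lost */
    printf("fused:    %a\n", fused);    /* -0x1p-26: exact result        */
    return 0;
}
```

If the _PREV path really does skip the intermediate rounding, a dependent result would look like the fused case rather than the separate one.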

Anyway, I'm still not convinced about 4-lane transcendental.
 
The compiler could be being conservative if the serial MULs are paired in such a way that the first MUL's result is not rounded and normalised (as would happen in a dot product, I presume). Is that what you were thinking of?
I was speculating that there could be some kind of difference between the value passed to a dependent MUL and one passed to a serially issued MUL. The ISA doc says output modification and clamping aren't applied to the previous output, but on reflection it would seem a big omission for the doc not to mention that rounding was skipped too.

The other thought I had, given the apparent immaturity in the compilers, is that they are just defaulting to a super-conservative policy or nobody has written a version capable of the optimization.

One possibility is that it could be a hardware bug and the compilers will not emit that stream of instructions.
I didn't see any cycle restriction or port conflict that would affect the co-issue case in this test.

I haven't tried a pixel (or other graphics) shader; that's the last resort, I guess.

If it only ever occurs in a pixel shader then it's pretty much useless, since the default is high levels of ILP anyway.
If any part of AMD has an emphasis on optimizations even at the expense of potential accuracy issues, one would think it would be the graphics portion. It does seem like there is a fair amount of low-hanging fruit the compute side is leaving on the table.
 
Cayman XT

[photo of a Cayman XT board]
 
I didn't, but I'm not going to rule out large architectural changes like that.

When you're winning the efficiency and teraflops war, you market those as being primary. When you need to move away from that efficiency, you go for something else - performance, features, price.

The cancellation of the new process node that AMD was apparently aiming for, and the subsequent 'fallback' to 'Plan B', could mean marketing wants a new poster boy to go to town with. Fast As Hell sure sells well.
 
It can be true, but there are a few conditions:

1. Barts should be at least as fast as Cypress per clock. That would indicate 32 ROPs. How big would it be then?

2. On the assumption that per-clock gaming performance is the same, a Barts X2 would have to run at 870MHz to offer 20% more performance than Hemlock (see the quick check below the list).

3. The TDP of Barts has to be significantly lower than that of Cypress to stay under 300W at 870MHz in an X2 config.

I can't imagine two GPUs with 80 TMUs and 32 ROPs each, running at 870MHz, staying within the 300W limit.
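
A quick sanity check of point 2, taking Hemlock's 725MHz core clock as the baseline (minimal sketch; equal per-clock performance is the assumption from the list above):

```c
#include <stdio.h>

int main(void) {
    double hemlock_clock = 725.0;   /* MHz, HD 5970 core clock */
    double target_uplift = 1.20;    /* +20% over Hemlock       */
    /* If Barts matches Cypress per clock, performance scales with clock. */
    printf("required Barts X2 clock: %.0f MHz\n",
           hemlock_clock * target_uplift);   /* prints 870 MHz */
    return 0;
}
```
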
Ad.1 You make a good point about 32 ROPs and the space. Given a few rumours I've seen floating around - AMD wanting to raise what is considered mainstream graphics, the 6770 being clocked around 700MHz, and the 6770 being at or close to 5850 performance - 32 ROPs seem necessary to accomplish that, especially when AA appears in (almost ;)) all new games.

Ad.2 That 700MHz value sounds like either a false rumour or some deliberate crippling. If Juniper can be clocked up to 1GHz and Cypress to, let's say, 900MHz, then I don't see any reason why Barts wouldn't be able to clock similarly.

Ad.3 That's indeed a necessity, and I don't know how far that is possible. We can only hope the experience AMD has with 40nm graphics will allow them to pull off even more efficient designs than we see now.

Add the following to the mix: Sideport for better CrossFire scaling (anybody got an idea how much it could help?) and the possibility of installing GDDR5+ memory for more bandwidth, while single Barts cards will most likely keep regular GDDR5. Maybe overall there won't be such a need for high clocks?

I've been over this subject in a lot of detail, describing a scenario where all lanes work together to compute those functions that used to be performed by T.
Thanks, very interesting read. :) Made me realise a few things I wasn't aware of. Those RightMark shaders are going to be an interesting benchmark for the new architecture, most likely highlighting the situations where we will see a slowdown.

I should add I'm a little sceptical about the feasibility of this (the inner workings of the serial math operations give me pause for thought) - though not as sceptical as some people were back then. Also, there are other possibilities with 4 lanes.

And there's also the question of whether transcendental instructions need to be computed in a single cycle. Related to this is the fact that for the precision required by OpenCL, the conventional single-cycle transcendental unit is of very little use - a much more complex sequence of operations is required.
Taking more cycles to do a transcendental makes sense if it can be done within one medium-sized shader, without the help of supporting lanes - but only if the code is vectorised, so the transcendental is done in all 4 shaders at the same time. If not, what happens? We could use the other lanes, but we'd have to wait for the transcendental to finish if we want to use its result. So a major challenge for the compiler there - totally different to Evergreen, right?
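
For a flavour of what a multi-cycle transcendental "macro" could look like, here's an illustrative C sketch of exp2 as range reduction plus a short Horner chain, where every step is one dependent MAD. The coefficients are a rough polynomial fit chosen for illustration only - nothing from AMD's actual hardware or compiler:

```c
#include <math.h>
#include <stdio.h>

/* exp2(x) via range reduction x = i + f, f in [0,1), followed by a
   low-order polynomial in Horner form. Each Horner step is one MAD,
   so the whole function is a short chain of dependent multiply-adds. */
static float exp2_poly(float x) {
    int   i = (int)floorf(x);     /* integer part                    */
    float f = x - (float)i;       /* fractional part, in [0,1)       */
    float p = 0.0136f;            /* illustrative coefficients       */
    p = p * f + 0.0520f;          /* MAD 1                           */
    p = p * f + 0.2413f;          /* MAD 2                           */
    p = p * f + 0.6931f;          /* MAD 3                           */
    p = p * f + 1.0f;             /* MAD 4                           */
    return ldexpf(p, i);          /* scale by 2^i (exponent adjust)  */
}

int main(void) {
    for (float x = -2.0f; x <= 2.0f; x += 0.5f)
        printf("exp2_poly(%5.2f) = %f  (exp2f: %f)\n",
               x, exp2_poly(x), exp2f(x));
    return 0;
}
```

Four dependent MADs plus an exponent adjustment occupy several issue slots instead of one t-unit cycle, and anything waiting on the result has to stall - which is exactly the compiler challenge above.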

One example would be 32-bit integer multiplication. At the moment it can only be done by the t unit, but at the same time the other four ALUs can do something else.
If the 4 remaining ALUs now need to work together to accomplish a 32-bit multiplication, they can't do anything else in the same clock. So while the peak throughput of 32-bit integer multiplication stays the same, the throughput with a real instruction mix (one with a lot of integer multiplications but also a bunch of other operations) may be quite a bit lower. In the extreme case it may be half the performance.
In the case of 32-bit MUL, I'm expecting (or at least hoping :)) these new medium-sized shaders to deal with it on their own, so the other ALUs will be free to do other things. But things might be different for the harder functions...
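
For reference, here's the textbook decomposition the lanes would have to run if each one only had a narrow multiplier: a 32x32 -> 32-bit multiply built from 16-bit partial products (a generic sketch, not AMD's actual microcode):

```c
#include <stdint.h>
#include <stdio.h>

/* 32x32 -> 32-bit multiply from 16-bit partial products. The a_hi*b_hi
   term shifts entirely out of the low 32 bits, so three multiplies
   plus shifts/adds suffice; unsigned wraparound keeps it exact mod 2^32. */
static uint32_t mul32_from_16(uint32_t a, uint32_t b) {
    uint32_t a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFF, b_hi = b >> 16;

    uint32_t lo  = a_lo * b_lo;                /* low partial product */
    uint32_t mid = a_lo * b_hi + a_hi * b_lo;  /* cross terms         */
    return lo + (mid << 16);
}

int main(void) {
    uint32_t a = 0xDEADBEEF, b = 0x12345678;
    printf("%08X vs %08X\n",
           (unsigned)mul32_from_16(a, b), (unsigned)(a * b)); /* must match */
    return 0;
}
```

Three multiplies plus shifts and adds - several slots' worth of work for what the t unit currently does in one instruction, which is where the instruction-mix throughput concern comes from.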

8 memory chips - bye bye, 384-bit.
I think I'm also seeing a 6-pin and an 8-pin power connector. Or am I misinterpreting those pin soldering points closest to us?

Might be early silicon, but it doesn't look likely that Cayman can be made to run at low enough wattage to put two of them on a sub-300W board. So a Barts X2 after all?
 
GTX275 - 219W TDP
GTX295 - 289W TDP

I think you guys are underestimating AMD's engineering skill for Antilles. I honestly think that if Cayman stays under 220W, they can manage it. They might have to make some concessions with clocks and/or specs, but they will come through.
 
I think I'm also seeing a 6-pin and an 8-pin power connector. Or am I misinterpreting those pin soldering points closest to us?

Might be early silicon, but it doesn't look likely that Cayman can be made to run at low enough wattage to put two of them on a sub-300W board. So a Barts X2 after all?

It's 6+6 pin; the PCB has space for 6+8 pin, but only 6+6 are soldered in.
 
Ad.1 You make a good point about 32 ROPs and the space. Given a few rumours I've seen floating around - AMD wanting to raise what is considered mainstream graphics, the 6770 being clocked around 700MHz, and the 6770 being at or close to 5850 performance - 32 ROPs seem necessary to accomplish that, especially when AA appears in (almost ;)) all new games.
Looking at ComputerBase, real-world results seem to be different. Comparing the HD4890 (16 ROPs) and the HD5850 (32 ROPs) at:
2560*1600: HD5850 is 48% faster
2560*1600 + AA 4x / AF 16x: HD5850 is 32% faster
2560*1600 + AA 8x / AF 16x: HD5850 is 22% faster

1920*1200: HD5850 is 43% faster
1920*1200 + AA 4x / AF 16x: HD5850 is 31% faster
1920*1200 + AA 8x / AF 16x: HD5850 is 27% faster

it doesn't seem that the HD5850 is able to exploit the advantage of twice as many ROPs compared to the HD48xx.
 
GTX275 - 219W TDP
GTX295 - 289W TDP

I think you guys are underestimating AMD's engineering skill for Antilles. I honestly think that if Cayman stays under 220W, they can manage it. They might have to make some concessions with clocks and/or specs, but they will come through.
Granted, but please bear in mind that the GeForce GTX 295 dual-GPU card basically wasn't possible before the G200 (65nm) > G200b (55nm) shrink. G200b was still a huge chip, of course (I didn't find exact numbers, but it should be well into the 400mm2+ range).

So yeah, no one actually denies that AMD COULD use two Caymans for their HD 6970 card (given some hefty clock concessions, of course) - the real question is: Would they actually WANT to do that (or better: would it be financially feasible?) if they could also best HD 5970's performance by a reasonable margin when pairing up two high-clocked, considerably cheaper-to-select Barts?

I don't see a company with an actually rather capable management and an alarmingly high amount of debt fighting their competitor's upcoming GF104x2 card with a brute-force behemoth consisting of two slightly-under-400mm2 Caymans.

You just don't fight 2x330mm2 Fermi offspring with 2x390mm2 (?) Caymans when you can probably use 2x280mm2 (?) Barts to achieve the same thing: take the "fastest card" performance crown ...

The only reasonable explanation for a Cayman-based Antilles card @40nm would be that Barts actually isn't quite as powerful as one could expect based on the current rumours (about as fast as Cypress clock-for-clock).
 
So a major challenge for the compiler there - totally different to Evergreen, right?
Certainly a new challenge, if 4 lanes with no T is what they're doing. I'm not sure a "macro" would make things significantly more difficult - see the macros for integer divide or double-precision divide, or even the macros for the high-accuracy transcendentals required by OpenCL (I have to admit I haven't investigated those).
 
Granted, but please bear in mind that the GeForce GTX 295 dual-GPU card basically wasn't possible before the G200 (65nm) > G200b (55nm) shrink. G200b was still a huge chip, of course (I didn't find exact numbers, but it should be well into the 400mm2+ range).

So yeah, no one actually denies that AMD COULD use two Caymans for their HD 6970 card (given some hefty clock concessions, of course) - the real question is: Would they actually WANT to do that (or better: would it be financially feasible?) if they could also best HD 5970's performance by a reasonable margin when pairing up two high-clocked, considerably cheaper-to-select Barts?

I don't see a company with an actually rather capable management and an alarmingly high amount of debt fighting their competitor's upcoming GF104x2 card with a brute-force behemoth consisting of two slightly-under-400mm2 Caymans.

You just don't fight 2x330mm2 Fermi offspring with 2x390mm2 (?) Caymans when you can probably use 2x280mm2 (?) Barts to achieve the same thing: take the "fastest card" performance crown ...

The only reasonable explanation for a Cayman-based Antilles card @40nm would be that Barts actually isn't quite as powerful as one could expect based on the current rumours (about as fast as Cypress clock-for-clock).

G200b was ~484mm2.
GF104 is ~366mm2.

I don't think they are overly worried about a dual GF104 right now; the 5970s should be more than enough to handle them.

I think they are more worried about what Nvidia might have on the burner...
 