AMD: R9xx Speculation

If this VLIW-4 rumour is true then Barts with 1280 lanes would have the same TMU count as Cypress. Also, games don't seem to be ALU bound, so 1280 versus 1600 is probably not going to be much of an issue.

HD5870 game performance for $200 would be pretty cool.

I had not been lurking here for a while until recently. So what is the general consensus around here regarding which areas today's games are bound by? In Cypress's case, is it setup and texture limited?
 
An HD 6970 Antilles card with 2x Barts @ "full" 850-875 MHz (I guess with Charlie's info concerning Cayman's die size probably ending up a few mm² below 400mm² we can lay Cayman-based Antilles rumours to rest) would trump the current HD 5970 card by about 20% :oops:
That's exactly what I've been thinking. Two properly clocked Barts, with the possible Sideport feature, will easily stay within 300W, be about 20% faster than the 5970 AND it will have a price that will be competitive with the double GTX460 incoming sometime, because it will be cheaper to make.

People are saying Antilles will be double Cayman, because the drivers say so. Well, is that already carved in stone? Or may the next batch of drivers come with different assignments? Also, we are hearing of two HD69XX cards. Imagine this:

HD6950 = 2x Barts at 40nm, coming before Xmas
HD6970 = 2x Cayman at 28nm, coming next year.

Possible or not?

=====

What piqued my interest in the Charlie article is the alleged medium shaders in an [ xt yt zt wt ] configuration. If true, wouldn't that make autovectorization easier, if every shader can do the same things, albeit slower? As long as the speed is higher than 25% of the "big" old T unit, more complicated calculations should be done faster, as long as they are properly vectorized, am I right?
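To sanity-check that 25% figure, here's a one-line toy model (entirely my own simplification, assuming all four lanes gang up on each transcendental and split the work evenly):

```python
# Toy model: in VLIW-4, all four lanes cooperate on one transcendental.
# lane_fraction = each lane's speed as a fraction of the old big T unit.
def vliw4_trans_throughput(lane_fraction):
    # Work split four ways: one transcendental takes 1 / (4 * lane_fraction)
    # "T cycles", so throughput relative to the old T unit is 4 * lane_fraction.
    return 4 * lane_fraction

assert vliw4_trans_throughput(0.25) == 1.0  # break-even with the old T unit
assert vliw4_trans_throughput(0.30) > 1.0   # anything above 25% per lane wins
```

So the break-even point is exactly a quarter of the T unit's speed per lane, as long as the work really parallelises across the lanes.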

He also sounded very defensive when saying that in some special cases it might lead to lower performance than Cypress. Can you name such a special case where it could be true? :?:
 
UniversalTruth said:
You are continuing with your guesstimations... :LOL:
Antilles is two times Cayman. Can you accept it? :LOL:
Ok, I promise to stop with the guesstimations after this post - but a Cayman X2 @ 40nm still sounds like a very bad idea to me: they already had to lower clocks significantly on the stock 5970 cards in order to stay within reasonable power limits - and Cypress was a 330mm²/180W GPU ...

With Cayman arguably being near the 400mm2 die-size mark as well as near to the 200W power mark, well, it's up to AMD to go with a pair of lower-clocked, HUGE, rare and expensive Caymans or a pair of high-clocked, approximately RV770 sized, high-yielding Barts.

If they were Nvidia, they'd certainly go for the brute force option - but I just expect the inventor of the sweet-spot strategy to know better than this :p

UniversalTruth said:
Sorry, but would you be so kind as to explain to all of us how this would happen, if Barts is slower than or equal in performance to Cypress? :oops:

The PCI-e power specifications and dual >300mm² GPUs basically don't seem to get along very well. If Barts actually turns out to be approximately as fast as Cypress clock-for-clock while maintaining a smaller die size and lower power consumption, they could probably clock two of them @ HD 5870 levels (850MHz+) without breaking the 300W barrier. (850-725)*100/725 then results in nearly a 20% performance gain over the HD 5970 ... a card with two Caymans @ 700MHz (?) arguably won't do a whole lot better than that, but will probably cost twice as much to make.
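That clock arithmetic is easy to check (assuming, as the argument above does, that performance scales linearly with core clock, which real games won't quite do):

```python
# Hypothetical dual-Barts card at HD 5870 clocks vs. the HD 5970's 725 MHz.
hd5970_clock = 725    # MHz (stock HD 5970)
barts_x2_clock = 850  # MHz, assumed HD 5870-level clocks

gain = (barts_x2_clock - hd5970_clock) * 100 / hd5970_clock
print(f"clock advantage over HD 5970: {gain:.1f}%")  # 17.2%
```

So strictly it's a bit over 17%, which the post rounds up to "nearly 20%".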

The important thing is that Barts should turn out WAY faster than GF104 when clocked properly ... the only thing they need to do is to BEAT the upcoming GF104 X2 card - by how much doesn't really matter.
 
That's exactly what I've been thinking. Two properly clocked Barts, with the possible Sideport feature, will easily stay within 300W, be about 20% faster than the 5970 AND it will have a price that will be competitive with the double GTX460 incoming sometime, because it will be cheaper to make.
It can be true, but there are a few conditions:

1. Bart should be at least as fast as Cypress per clock. That would indicate 32 ROPs. How big would it be then?

2. On the assumption that per-clock gaming performance is the same, the Bart-X2 should run at 870MHz to offer 20% more performance than Hemlock.

3. The TDP of Bart has to be significantly lower than the TDP of Cypress to stay under 300W at 870MHz in an X2 config.

I can't imagine two GPUs with 80 TMUs and 32 ROPs running at 870MHz staying within the 300W limit.
 
What piqued my interest in the Charlie article is the alleged medium shaders in an [ xt yt zt wt ] configuration. If true, wouldn't that make autovectorization easier, if every shader can do the same things, albeit slower? As long as the speed is higher than 25% of the "big" old T unit, more complicated calculations should be done faster, as long as they are properly vectorized, am I right?

He also sounded very defensive when saying that in some special cases it might lead to lower performance than Cypress. Can you name such a special case where it could be true? :?:
I've been over this subject in a lot of detail, describing a scenario where all lanes work together to compute those functions that used to be performed by T.

My first post was here:

http://forum.beyond3d.com/showthread.php?p=1416950#post1416950

and the relevant stuff ends here:

http://forum.beyond3d.com/showthread.php?p=1422566#post1422566

I should add I'm a little sceptical over the feasibility of this (inner workings of the serial math operations give me pause for thought) - though not as sceptical as some people back then. Also there are other possibilities with 4 lanes.

And there's also the question of whether transcendental instructions need to be computed in a single cycle. Related to this is the fact that for the precision required by OpenCL, the conventional single-cycle transcendental unit is of very little use - a much more complex sequence of operations is required.

This would tend to indicate that low-precision conventional transcendental computation is of low priority in a revised ALU architecture, biasing the design towards something that increases the throughput of multi-operation transcendentals.
 
He also sounded very defensive when saying that in some special cases it might lead to lower performance than Cypress. Can you name such a special case where it could be true? :?:
One example would be 32-bit integer multiplication. At the moment it can only be done by the t unit, but at the same time the other four ALUs can do something else.
If the 4 remaining ALUs now need to work together to accomplish a 32-bit multiplication, they can't do anything else in the same clock. So while the peak throughput of 32-bit integer multiplication stays the same, the throughput with a real instruction mix (with a lot of integer multiplications but also a bunch of other operations) may be quite a bit lower. In the extreme case it may be half the performance.
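The factor of two in the extreme case falls out of a simple issue model (my own simplification; real VLIW packing and dependencies are messier than this):

```python
from math import ceil

def cycles_vliw5(muls, simple):
    # The t unit retires one 32-bit mul per cycle while the other
    # four lanes retire up to four simple ops in parallel.
    return max(muls, ceil(simple / 4))

def cycles_vliw4(muls, simple):
    # A 32-bit mul now occupies all four lanes for a cycle,
    # so muls and simple ops can no longer overlap.
    return muls + ceil(simple / 4)

# Worst case: exactly as many muls as cycles' worth of simple work.
muls, simple = 100, 400
assert cycles_vliw5(muls, simple) == 100
assert cycles_vliw4(muls, simple) == 200  # half the performance
```

With fewer multiplies in the mix the gap shrinks, which is why it only bites in mul-heavy kernels.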
 
I had not been lurking here for a while until recently. So what is the general consensus around here regarding which areas today's games are bound by? In Cypress's case, is it setup and texture limited?
Anything except ALUs ;)

In my view rasterisation is a particular bottleneck in Cypress, due to the way setup feeds the two rasterisers.

Texturing also seems to suffer from the limited bandwidth available twixt L2s and L1s - aggregate bandwidth for 20 cores that is no better, per clock, than in RV770 for its 10 cores. Though some compute centric benchmarking has shown very robust behaviour of the texture cache hierarchy.

I also think Z rate is a bit of a handicap.

In reality very little has been quantified in any meaningful fashion. A basic comparison of Cypress and Juniper in games gives a range of ~30-85% more performance for Cypress.
 
Retaining 98.5% of the ALU performance going from 5D to 4D is easy I would think. Wonder how they achieved the 25% reduction in area though. Certainly the scheduler logic wouldn't decrease by the same amount - is it more AMD transistor packing prowess at work? If I had to bet on Caicos I'd say 16x4Dx2, 8 TMUs, 64-bit DDR3.
 
It's really tedious hand-scheduling to convert from VLIW-5 to VLIW-4 (I got bored), so I'm gonna wait. Once GPU Shader Analyzer/Stream Kernel Analyzer supports this change (if it's coming) it'll be fantastically easy to see how shaders win or lose.

Separately, there's always the possibility that AMD made some density sacrifices in implementing Cypress/Evergreen - doubled vias seemingly take extra area, for example - no idea how much slack there is, though.
 
Mitigated by the increase in SIMDs. The real question is how many extra SIMDs would fit into the same space? 10% more?
That depends a bit on how much they beefed up the capabilities of the remaining ALUs in the VLIW units to compensate for the loss of t. And of course there is always the possibility that they were able to optimize the layout, increasing transistor density without a loss of clock rate.
But without taking such optimizations into consideration, I doubt that the saved space significantly exceeds 10%.
Nevertheless, I would guess one Evergreen SIMD engine with 4 TMUs (but without the scheduler stuff) measures maybe 6.x mm² (RV770 on 55nm was at 10.5mm² or something like that). Increasing the SIMD engine count by 50% on Cayman and assuming a 10% shrink of the SIMD size because of the transition to a 4-slot VLIW design would lead to about a 45mm² increase in size (+ additional scheduling capability + improved front end), so the die size Charlie guessed (380-400mm²) may be in the right range for a 1920 SP part with a 256-bit memory interface. Actually it is the exact same range I suggested in another forum roughly a month ago for the case of a somewhat mild overhaul of the front end ;)
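Plugging in the numbers from that guess (6.5 mm² stands in for the "6.x"; all figures are the post's estimates, not official data):

```python
simd_area = 6.5       # mm², guessed Evergreen SIMD engine incl. 4 TMUs
cypress_simds = 20
cayman_simds = 30     # +50%
shrink = 0.90         # ~10% smaller SIMD with the 4-slot VLIW design

delta = cayman_simds * simd_area * shrink - cypress_simds * simd_area
print(f"extra SIMD area: ~{delta:.1f} mm²")  # ~45.5 mm²

# Sanity check on the shader count: 30 SIMDs x 16 VLIW units x 4 slots
assert cayman_simds * 16 * 4 == 1920
```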

But regarding the beefing up of the 4 remaining ALUs, and to stay with the example of the integer multiplications: as there are 24x24->48 bit multipliers in each Cypress ALU (besides the t ones, but not on Juniper) for FMA, I have always asked myself if there is a clever way to combine the multiplier capabilities of only two ALUs to get at least the low 32-bit part of the full result (one needs four 16x16->32 bit multipliers to get the full 64-bit result). I was only able to come up with a scheme needing at least 3 ALUs for that. It would be nice if they added something to enable that, as this would double the peak 32-bit multiplication throughput (Nvidia's Fermi has improved a lot in that area) and would limit the performance loss even in the most artificially constructed scenarios to something compensated by the growth in shader unit count. Generally, I think it would be nice if the use of former "t only" instructions will not always block all slots of a unit.
 
Texturing also seems to suffer from the limited bandwidth available twixt L2s and L1s - aggregate bandwidth for 20 cores that is no better, per clock, than in RV770 for its 10 cores.
While that is true, a particular bottleneck for trilinear and especially anisotropic filtering is already the L1 cache bandwidth itself. It is designed to deliver a peak bandwidth just sufficient for bilinear filtering. While [strike]ATI[/strike] AMD GPUs show an exceptionally efficient use of the theoretically available bandwidth in texturing tests (up to 99% or so), it is simply not enough when turning to more complicated filtering algorithms. The attempts to limit the performance loss (by using fewer texture samples) are quite infamous, as they can lead to texture shimmering. I guess it would be good to retain the 16 units (but now only 4-slot units) to 4 TMU ratio, as this would slightly increase the texturing power (and L1 bandwidth) available per shader instruction. Maybe this is already enough to improve the filter quality. Nvidia went with more filtering hardware and L1 cache bandwidth per texturing unit. Apropos Nvidia, the L2 cache bandwidth on GF100 is just half that of Cypress.
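A rough way to see why L1 bandwidth sized for bilinear hurts the fancier modes (tap counts are the usual textbook numbers, not measured figures):

```python
L1_TEXELS_PER_CLOCK = 4  # per TMU: exactly one bilinear fetch per clock

def clocks_per_filtered_sample(bilinear_taps):
    # Each bilinear tap needs 4 texels; with L1 sized for one tap per
    # clock, every extra tap translates directly into an extra clock.
    return bilinear_taps * 4 / L1_TEXELS_PER_CLOCK

assert clocks_per_filtered_sample(1) == 1    # bilinear: full rate
assert clocks_per_filtered_sample(2) == 2    # trilinear: half rate
assert clocks_per_filtered_sample(32) == 32  # 16x aniso + trilinear, worst case
```

Which is exactly why the driver is tempted to drop taps, and why that shows up as shimmering.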

Wonder how they achieved the 25% reduction in area though.
At the moment that is just some odd rumour spread by a forum post comparing the changes to four or five toll booths on a highway. Without some really hard work on the layout it is not going to happen, as you need to keep the same functionality without too large performance losses.
Certainly the scheduler logic wouldn't decrease by the same amount
Actually the scheduler logic needs to increase, as there will probably be more units, and even without that, the number of wavefronts in flight will slightly increase. So the hope is that the savings on the SPs and the increase in efficiency will outweigh the increased scheduling logic, so that the performance per mm², averaged over the full spectrum of workloads, will rise.

One nice thing about the 5-slot -> 4-slot change is that the improvements will be most pronounced in situations with low ALU utilization, where they are needed most. The potential drawbacks are mainly compensated by sheer unit count growth in areas with high ALU utilization (the actual solution for the transcendental stuff is still pending), where ATI GPUs have a huge advantage either way.
 
Nevertheless, I would guess one Evergreen SIMD engine with 4 TMUs (but without the scheduler stuff) measures maybe 6.x mm² (RV770 on 55nm was at 10.5mm² or something like that). Increasing the SIMD engine count by 50% on Cayman and assuming a 10% shrink of the SIMD size because of the transition to a 4-slot VLIW design would lead to about a 45mm² increase in size (+ additional scheduling capability + improved front end), so the die size Charlie guessed (380-400mm²) may be in the right range for a 1920 SP part with a 256-bit memory interface. Actually it is the exact same range I suggested in another forum roughly a month ago for the case of a somewhat mild overhaul of the front end ;)
Thing is, I think other changes are going to be similarly expensive.

But regarding the beefing up of the 4 remaining ALUs, and to stay with the example of the integer multiplications: as there are 24x24->48 bit multipliers in each Cypress ALU (besides the t ones, but not on Juniper) for FMA, I have always asked myself if there is a clever way to combine the multiplier capabilities of only two ALUs to get at least the low 32-bit part of the full result (one needs four 16x16->32 bit multipliers to get the full 64-bit result). I was only able to come up with a scheme needing at least 3 ALUs for that. It would be nice if they added something to enable that, as this would double the peak 32-bit multiplication throughput (Nvidia's Fermi has improved a lot in that area) and would limit the performance loss even in the most artificially constructed scenarios to something compensated by the growth in shader unit count.
I don't see the need for enhanced 32-bit lo or hi multiplications. Doing so would have to tie in with DP throughput. And in terms of DP FLOPS/mm² ATI seems sound, even if it's "1/4 or 1/2 rate".

Additionally, at some point AMD is going to have to bite the bullet and make all GPUs have double-precision - as IGP/APU eats away at the bottom of the discrete market, the fact that APUs will have to be double-precision to be taken seriously for compute workloads means that discrete will have to be DP capable too. But that's a few years off, generally.

Generally, I think it would be nice if the use of former "t only" instructions will not always block all slots of a unit.
Some things like float-to-integer won't clog up all the lanes. Indeed, throughput of those will double, and conversions are often a waste of ALU capability in the current architecture (one or more of them being points of serialisation).
 
Actually the scheduler logic needs to increase, as there will probably be more units, and even without that, the number of wavefronts in flight will slightly increase. So the hope is that the savings on the SPs and the increase in efficiency will outweigh the increased scheduling logic, so that the performance per mm², averaged over the full spectrum of workloads, will rise.

Yep, I meant per SIMD. Across the chip, scheduler overhead will obviously increase. Wonder if there's a chance they reduce the wavefront size to 32. Is branch granularity even a factor nowadays?
 
1. Bart should be at least as fast as Cypress per clock. That would indicate 32 ROPs. How big would it be then?
I thought the 256-bit interface was pretty much confirmed? Seems like a waste with only 16 ROPs (granted, that's exactly the ratio Redwood is using), unless the ROPs have been improved otherwise (Z fillrate?).

Apropos Nvidia, the L2 cache bandwidth on GF100 is just half that of Cypress.
Where did you get that number from? I've never seen any official figure for GF100 (or GT200, G92, G80 for that matter...)
 
One example would be 32-bit integer multiplication. At the moment it can only be done by the t unit, but at the same time the other four ALUs can do something else.
If the 4 remaining ALUs now need to work together to accomplish a 32-bit multiplication, they can't do anything else in the same clock. So while the peak throughput of 32-bit integer multiplication stays the same, the throughput with a real instruction mix (with a lot of integer multiplications but also a bunch of other operations) may be quite a bit lower. In the extreme case it may be half the performance.

Or, they could just emulate int32 mul with 3 int24 multiplications in the same lane.
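That decomposition checks out: split each 32-bit operand into 16-bit halves (which fit easily into a 24x24->48 bit multiplier) and the low 32 bits of the product need only three multiplies, since the a_hi*b_hi partial product only affects bits 32 and up. A quick brute-force verification:

```python
import random

MASK32 = (1 << 32) - 1

def mul32_lo_via_3_muls(a, b):
    a_lo, a_hi = a & 0xFFFF, a >> 16
    b_lo, b_hi = b & 0xFFFF, b >> 16
    p0 = a_lo * b_lo   # three multiplies, all with <= 16-bit operands,
    p1 = a_lo * b_hi   # so each fits a 24x24 -> 48 bit multiplier
    p2 = a_hi * b_lo
    return (p0 + ((p1 + p2) << 16)) & MASK32

for _ in range(10000):
    a, b = random.getrandbits(32), random.getrandbits(32)
    assert mul32_lo_via_3_muls(a, b) == (a * b) & MASK32
```

Of course, three serial multiplies in one lane still cost three slots' worth of issue, so it only changes *where* the cost lands, not that there is one.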
 
Isn't Cypress's L2 exclusive to each ROP/MC partition; in other words, isn't the bandwidth what's called "aggregate bandwidth"?
 