You know, for an inherently parallel problem this can often be exchanged to some extent.
Yes, but those are _very_ niche, aren't they?
I don't think the number of display controllers is related to a GPU's performance or VRAM capacity. It might merely reflect AMD's intention that 2GB per GPU (which is "needed" for E6) will be reserved for Cayman.
RV870 has 6 "Eyefinity display controllers"
RV840 has 5
RV830 has 4
RV810 has 4

6 is clearly a premium feature, whereas 4 is apparently the minimum required for laptops.
"If the rumour about E4 for Barts is right, then it could be pad-limited (no other logical explanation comes to mind)..."

Obviously, I can't discount that. But why does Juniper have 5?...
"Juniper would be faster with more bandwidth, just not by a huge amount. As I said earlier, if that was combined with other memory-efficiency improvements, then 128-bit might be enough. But I'm dubious."

As for the ROPs: RV740 had a GDDR5 memory controller, so it had twice the bandwidth of its predecessor, and doubling the ROP count was logical there. Barts, which was (I think) originally a 128-bit part with maybe 20% faster memory modules than RV840, probably wasn't designed with 32 ROPs. Even with 6GHz modules it would have only about 75% of HD4890's bandwidth, and HD4890 did pretty well with 16 ROPs.
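For reference, the arithmetic behind that ~75% figure (my numbers, assuming the usual specs: a 128-bit bus with 6 Gbps effective GDDR5 versus HD4890's 256-bit bus at 3.9 Gbps):

\[ \frac{128\,\mathrm{bit} \times 6\,\mathrm{Gbps}}{8} = 96\ \mathrm{GB/s}, \qquad \frac{256\,\mathrm{bit} \times 3.9\,\mathrm{Gbps}}{8} = 124.8\ \mathrm{GB/s}, \qquad \frac{96}{124.8} \approx 0.77 \]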
"I was suggesting that Cayman would have doubled ROPs per MC too, if Barts did. The configuration I was referring to was 128-bit + 32 ROPs for Barts and some multiple of that for Cayman (2x or 3x, who knows?)"

32 ROPs would also mean that the only difference between Barts and Cayman would be the number of SIMDs.
"They'd be too close in performance..."

ROPs are big: 16 ROPs and their associated L2s are 10% of RV770, about 26mm². Call it 20mm² for 16 Evergreen ROPs on 40nm.
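For what it's worth, that 10% figure lines up with RV770's roughly 256mm² die:

\[ 0.10 \times 256\ \mathrm{mm}^2 \approx 26\ \mathrm{mm}^2 \]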
"I think 16 would save more die area than sacrifice performance. Barts also isn't the part which will be benchmarked with 8x MSAA enabled."

Put a different way, how do you get Juniper performance up to the level of HD5830 or higher?
Uhmm, ever had divergence between vector lanes in a "true" SIMD unit? VLIW can do everything (SIMD) vector units could do and then some more. You can use the VLIW units in a virtual SIMD fashion.
It works quite well, as long as there is no divergence amongst these 4 "vector lanes". Which is why I am not particularly fond of using VLIW as a substitute for vectorization.
"Yes, but those are _very_ niche, aren't they?"

No, actually quite common and a recommended technique (it often even brings some benefit on Nvidia GPUs, since you usually reduce the granularity of memory accesses). The hindrance is often only the effort the developer has to put in, but that is a minor inconvenience in my book if you get twice the performance.
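To make that concrete, here is a minimal plain-C sketch of the technique (the function and array names are made up; real code would be an OpenCL/IL kernel): each work-item handles four elements, which gives the compiler four independent operations to pack into the x/y/z/w slots and turns four narrow loads into one wider access.

#include <stdio.h>

/* One element per work-item: only ~1 of the 4/5 VLIW slots gets useful work. */
static void saxpy_scalar(int i, float alpha,
                         const float *x, const float *y, float *out)
{
    out[i] = alpha * x[i] + y[i];
}

/* Four elements per work-item: four independent MULADDs the compiler can
 * bundle into one instruction, and memory is touched in float4-sized chunks. */
static void saxpy_vec4(int i, float alpha,
                       const float *x, const float *y, float *out)
{
    int base = 4 * i;
    for (int k = 0; k < 4; ++k)
        out[base + k] = alpha * x[base + k] + y[base + k];
}

int main(void)
{
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0}, out[8];

    for (int i = 0; i < 8; ++i) saxpy_scalar(i, 2.0f, x, y, out);  /* 8 "work-items" */
    for (int i = 0; i < 2; ++i) saxpy_vec4(i, 2.0f, x, y, out);    /* 2 "work-items" */

    printf("%f %f\n", out[0], out[7]);
    return 0;
}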
"Uhmm, ever had divergence between vector lanes in a "true" SIMD unit? VLIW can do everything (SIMD) vector units could do and then some more."

I think it's time to blow Rohit's mind:
46 x: LDS_ADD ____, PV45.z, (0x00000001, 1.401298464e-45f).x
y: ADD_INT T1.y, (0x0000000C, 1.681558157e-44f).y, T0.w
z: MULADD_e T2.z, PV45.x, -0.5, 0.5
w: MULADD_e T1.w, PV45.w, 0.5, 0.5
t: MUL_e ____, R3.w, KC0[6].w
There is more to graphics workloads than just the arithmetic. And in some situations it pays off.
No, the execution of two pairs of dependent operations in one instruction (instead of 4 or 5 independent ones) has been possible since Evergreen for some operations. That means the latency for those instructions is cut in half.
"Can't think of any in the past few years, definitely nothing in line with the theoretical numbers."

Which shows you that pure arithmetic performance is not the bottleneck most of the time.
"Thought you were referring to the hardware dynamically filling the VLIW with instructions from multiple threads. The capability you're describing pulls instructions from a single thread at compile time."

Yes, but doing so virtually converts the shader units of Cypress (or NI) into two-slot VLIW units with half the latency, or twice the effective clock speed if you want.
"But why does Juniper have 5?..."

Maybe it doesn't have a purpose related to die size or performance. 6 is the premium feature, 4 is the low-end minimum; 5 can just be a nice mainstream value.
"I was suggesting that Cayman would have doubled ROPs per MC too"

Wouldn't that be overkill? HD5870 wasn't able to utilize its 32 ROPs to the same degree RV790 utilized its 16 ROPs, so I don't think we'll see that change soon.
"Put a different way, how do you get Juniper performance up to the level of HD5830 or higher?"

HD5830 has 16 ROPs, but its problem is the bandwidth thing...
"So the problem isn't a lack of ILP, it's a lack of ALU instructions in general?"

http://forum.beyond3d.com/showthread.php?p=1220350#post1220350
"Not true. Usually, VLIW is asymmetric, meaning that each VLIW lane can be specialized. In general, it's not possible to execute the same instruction over all lanes of a VLIW machine, which is the definition of SIMD. The 'T' lane in ATI's architecture is a good example of this difference. Other VLIW architectures, where you may have one lane for loads/stores, one lane for branches, one lane for integer ALU ops, and another for floating-point ops, make this difference even sharper."

You may have missed that I was talking specifically about the implementation in ATI GPUs, not the VLIW concept in general (besides the statement that it is a method to exploit ILP). ATI GPUs have fairly symmetric slots for most instructions, and that there are a few exceptions with the t slot doesn't matter for my argument. And by the way, since no one would program a GPU as if it had 5-way SIMD shader units (the natural data types typically support 2-, 3- and 4-way), regard the t lane as just an additional scalar unit in this context.
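To illustrate the distinction being argued over, a toy C sketch of my own (not any real ISA): a SIMD step applies one opcode across all lanes, while a VLIW bundle may issue a different opcode in each slot.

#include <stdio.h>

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, simd[4];

    /* SIMD-style step: the SAME opcode applied to every lane. */
    for (int lane = 0; lane < 4; ++lane)
        simd[lane] = a[lane] + b[lane];

    /* VLIW-style bundle: independent and possibly DIFFERENT opcodes
     * issued together, one per slot (x/y/z/w in ATI's case). */
    float x = a[0] + b[0];   /* slot x: ADD */
    float y = a[1] * b[1];   /* slot y: MUL */
    float z = a[2] - b[2];   /* slot z: SUB */
    float w = a[3] / b[3];   /* slot w: DIV, purely for illustration */

    printf("SIMD: %g %g %g %g  VLIW: %g %g %g %g\n",
           simd[0], simd[1], simd[2], simd[3], x, y, z, w);
    return 0;
}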
"Can't think of any in the past few years, definitely nothing in line with the theoretical numbers."

RV770 compared to GT200/GT200b, Cypress compared to Fermi.
"So the problem isn't a lack of ILP, it's a lack of ALU instructions in general? So then why overbuild to such a great extent if occupancy is not a problem?"

What Jawed linked.
"Lost me there. Cypress's units have 5 slots. I'm not sure why you consider filling those slots with precompiled VLIW instructions from a single thread as something particularly relevant to the prior discussion on issue width. As long as all instructions are coming from one thread the issue width is 5, so what does "two slotted" mean?"

Imagine you have four instructions:
t NOP // nothing in here, will be removed with NI either way
w MUL _, g, h // g*h, result not written to reg file
z ADD a, b, c // a = b + c
y MUL_PREV f,i // f = g*h*i, takes result of the w lane
x ADD_PREV d,e // d = a + e, takes result of the z lane
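In plain C the dataflow of that bundle is roughly the following (my paraphrase, with names taken from the listing); because each _PREV op consumes its partner slot's result inside the same instruction, both dependent pairs complete in one bundle instead of two.

/* Two dependent pairs packed into one VLIW bundle (illustrative only). */
void pairs(float b, float c, float e, float g, float h, float i,
           float *d, float *f)
{
    float a  = b + c;   /* z slot: ADD                                  */
    *d       = a + e;   /* x slot: ADD_PREV consumes the z-slot result  */
    float gh = g * h;   /* w slot: MUL, result never hits the reg file  */
    *f       = gh * i;  /* y slot: MUL_PREV consumes the w-slot result  */
}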
"Wouldn't that be overkill? HD5870 wasn't able to utilize its 32 ROPs to the same degree RV790 utilized its 16 ROPs, so I don't think we'll see that change soon."

HD5870 is ~80% faster than HD5770 on average in games - we're back to the old argument about how scalable games are, CPU bottlenecks, and Cypress's rasterisation bottleneck.
"HD5830 has 16 ROPs, but its problem is the bandwidth thing..."

Well, it's a mess. This:
"1280 SPs (+14% compared to HD5830), better front-end/efficiency (+5%?), 50MHz higher clock speed (+6%)... that's around 30% over HD5830. Native 16 ROPs, connected to the memory controller better than the crippled HD5830's are, would also boost performance by a few percent. That's HD5850/GTX470 ball-park. A nice competitor for GTX460."

If the memory system (and some other bits?) is radically better in SI then maybe all of that is possible with 128 bits. It'd be cool if it was.
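(Compounding those admittedly rough percentages does land in the quoted ballpark:

\[ 1.14 \times 1.05 \times 1.06 \approx 1.27 \]

i.e. a bit under 30% over HD5830.)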
"That analysis doesn't tell me why AMD's parts are only on par with or slower than the competition when they have an abundance of flops, higher texturing throughput and similar bandwidth."

The topic at hand is FLOPS. The analysis shows they're not relevant (well, except that Grid was well known for being faster on HD4870 than GTX280):
"If you have more of everything and aren't any faster, what does that mean?"

You clearly aren't paying attention.
"Cypress' gaming performance relative to Fermi is in line with its theoretical shader throughput advantage? I don't understand how you can make a case that lots of VLIW flops are useful when it doesn't result in any tangible benefit, even when coupled with a higher texturing rate too. Where I'm coming from is that if occupancy is high and games are ALU-light, then Cypress could achieve the same performance with fewer shader resources. Unless of course one of those assertions is false."

Which chip spends more area on ALUs? The last direct comparison we have is that GT200 (65nm) spends 210mm² on ALUs against 76mm² on RV770 (55nm).
Oh, and in your list you conveniently forgot fillrate and Z rate. And if you're going to compare anything to Fermi, rasterisation granularity.
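A rough process normalisation (my arithmetic, assuming an ideal optical shrink, which flatters GT200): even scaled from 65nm to 55nm, GT200's ALU area would still be roughly double RV770's.

\[ 210\ \mathrm{mm}^2 \times \left(\tfrac{55}{65}\right)^2 \approx 150\ \mathrm{mm}^2 \quad \text{vs.} \quad 76\ \mathrm{mm}^2 \]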