You know, for an inherently parallel problem this can often be exchanged to some extent.
Yes, but those are _very_ niche, aren't they?
I don't think the number of display controllers is related to a GPU's performance or VRAM capacity. It might merely reflect AMD's intention that 2GB per GPU (which is "needed" for E6) will be reserved for Cayman.
RV870 has 6 "Eyefinity display controllers"
RV840 has 5
RV830 has 4
RV810 has 4

6 is clearly a premium feature, whereas 4 is apparently the minimum required for laptops.
"If the rumour about E4 for Barts is right, then it could be pad-limited (no other logical explanation comes to mind)..."

Obviously, I can't discount that. But why does Juniper have 5?...
"Juniper would be faster with more bandwidth, just not by a huge amount. As I said earlier, if that was combined with other memory-efficiency improvements, then 128-bit might be enough. But I'm dubious."

As for the ROPs: RV740 had a GDDR5 memory controller, so it had twice the bandwidth of its predecessor, and doubling the ROP count was logical there. Barts, which was (I think) originally a 128-bit part with maybe 20% faster memory modules than RV840, probably wasn't designed with 32 ROPs. Even with 6GHz modules it would have only about 75% of HD4890's bandwidth, and HD4890 did pretty well with 16 ROPs.
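For reference, the arithmetic behind that ~75% figure (my numbers, assuming the usual specs: a 128-bit bus with 6 Gbps effective GDDR5 versus HD4890's 256-bit bus at 3.9 Gbps):

\[ \frac{128\,\mathrm{bit} \times 6\,\mathrm{Gbps}}{8} = 96\ \mathrm{GB/s}, \qquad \frac{256\,\mathrm{bit} \times 3.9\,\mathrm{Gbps}}{8} = 124.8\ \mathrm{GB/s}, \qquad \frac{96}{124.8} \approx 0.77 \]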
"I was suggesting that Cayman would have doubled ROPs per MC too, if Barts did. The configuration I was referring to was 128-bit + 32 ROPs for Barts and some multiple of that for Cayman (2x or 3x, who knows?)"

32 ROPs would also mean that the only difference between Barts and Cayman would be the number of SIMDs.
"They'd be too close in performance..."

ROPs are big: 16 ROPs and their associated L2s are 10% of RV770, about 26mm². Call it 20mm² for 16 Evergreen ROPs on 40nm.
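For what it's worth, that 10% figure lines up with RV770's roughly 256mm² die:

\[ 0.10 \times 256\ \mathrm{mm}^2 \approx 26\ \mathrm{mm}^2 \]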
"I think 16 would save more die area than sacrifice performance. Barts also isn't the part which will be benchmarked with 8x MSAA enabled."

Put a different way, how do you get Juniper performance up to the level of HD5830 or higher?
Uhmm, ever had divergence between vector lanes in a "true" SIMD unit? VLIW can do everything (SIMD) vector units could do and then some more. You can use the VLIW units in a virtual SIMD fashion.
It works quite well, as long as there is no divergence amongst these 4 "vector lanes". Which is why I am not particularly fond of using VLIW as a substitute for vectorization.
"Yes, but those are _very_ niche, aren't they?"

No, actually quite common and a recommended technique (it often even brings some benefit on Nvidia GPUs, since you usually reduce the granularity of memory accesses). The hindrance is often only the effort the developer has to put in, but that is a minor inconvenience in my book if you get twice the performance.
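To make that concrete, here is a minimal plain-C sketch of the technique (the function and array names are made up; real code would be an OpenCL/IL kernel): each work-item handles four elements, which gives the compiler four independent operations to pack into the x/y/z/w slots and turns four narrow loads into one wider access.

#include <stdio.h>

/* One element per work-item: only ~1 of the 4/5 VLIW slots gets useful work. */
static void saxpy_scalar(int i, float alpha,
                         const float *x, const float *y, float *out)
{
    out[i] = alpha * x[i] + y[i];
}

/* Four elements per work-item: four independent MULADDs the compiler can
 * bundle into one instruction, and memory is touched in float4-sized chunks. */
static void saxpy_vec4(int i, float alpha,
                       const float *x, const float *y, float *out)
{
    int base = 4 * i;
    for (int k = 0; k < 4; ++k)
        out[base + k] = alpha * x[base + k] + y[base + k];
}

int main(void)
{
    float x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, y[8] = {0}, out[8];

    for (int i = 0; i < 8; ++i) saxpy_scalar(i, 2.0f, x, y, out);  /* 8 "work-items" */
    for (int i = 0; i < 2; ++i) saxpy_vec4(i, 2.0f, x, y, out);    /* 2 "work-items" */

    printf("%f %f\n", out[0], out[7]);
    return 0;
}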
"Uhmm, ever had divergence between vector lanes in a "true" SIMD unit? VLIW can do everything (SIMD) vector units could do and then some more."

I think it's time to blow Rohit's mind:
46 x: LDS_ADD ____, PV45.z, (0x00000001, 1.401298464e-45f).x
y: ADD_INT T1.y, (0x0000000C, 1.681558157e-44f).y, T0.w
z: MULADD_e T2.z, PV45.x, -0.5, 0.5
w: MULADD_e T1.w, PV45.w, 0.5, 0.5
t: MUL_e ____, R3.w, KC0[6].w
There is more to graphics workloads than just the arithmetic. And in some situations it pays off.
No, the execution of two pairs of dependent operations in one instruction (instead of 4 or 5 independent ones) has been possible since Evergreen for some operations. That means the latency for those instructions is cut in half.
"Can't think of any in the past few years, definitely nothing in line with the theoretical numbers."

Which shows you that pure arithmetic performance is not the bottleneck most of the time.
"Thought you were referring to the hardware dynamically filling the VLIW with instructions from multiple threads. The capability you're describing pulls instructions from a single thread at compile time."

Yes, but doing so virtually converts the shader units of Cypress (or NI) into two-slot VLIW units with half the latency, or twice the effective clock speed if you want.
"But why does Juniper have 5?..."

Maybe it doesn't have a purpose related to die size or performance. 6 is the premium feature, 4 is the low-end minimum; 5 can just be a nice mainstream value.
"I was suggesting that Cayman would have doubled ROPs per MC too"

Wouldn't that be overkill? HD5870 wasn't able to utilize its 32 ROPs to the same degree RV790 utilized its 16 ROPs, so I don't think we'll see that change soon.
"Put a different way, how do you get Juniper performance up to the level of HD5830 or higher?"

HD5830 has 16 ROPs, but its problem is the bandwidth thing...
"So the problem isn't a lack of ILP, it's a lack of ALU instructions in general?"

http://forum.beyond3d.com/showthread.php?p=1220350#post1220350
"Not true. Usually, VLIW is asymmetric, meaning that each VLIW lane can be specialized. In general, it's not possible to execute the same instruction over all lanes of a VLIW machine, which is the definition of SIMD. The 'T' lane in ATI's architecture is a good example of this difference. Other VLIW architectures, where you may have one lane for loads/stores, one lane for branches, one lane for integer ALU ops, and another for floating-point ops, make this difference even sharper."

You may have missed that I was talking specifically about the implementation in ATI GPUs, not the VLIW concept in general (besides the statement that it is a method to exploit ILP). ATI GPUs have fairly symmetric slots for most instructions, and that there are a few exceptions with the t slot doesn't matter for my argument. And by the way, since no one would program a GPU as if it had 5-way SIMD shader units (the natural data types typically support 2-, 3- and 4-way), regard the t lane as just an additional scalar unit in this context.
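To illustrate the distinction being argued over, a toy C sketch of my own (not any real ISA): a SIMD step applies one opcode across all lanes, while a VLIW bundle may issue a different opcode in each slot.

#include <stdio.h>

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, simd[4];

    /* SIMD-style step: the SAME opcode applied to every lane. */
    for (int lane = 0; lane < 4; ++lane)
        simd[lane] = a[lane] + b[lane];

    /* VLIW-style bundle: independent and possibly DIFFERENT opcodes
     * issued together, one per slot (x/y/z/w in ATI's case). */
    float x = a[0] + b[0];   /* slot x: ADD */
    float y = a[1] * b[1];   /* slot y: MUL */
    float z = a[2] - b[2];   /* slot z: SUB */
    float w = a[3] / b[3];   /* slot w: DIV, purely for illustration */

    printf("SIMD: %g %g %g %g  VLIW: %g %g %g %g\n",
           simd[0], simd[1], simd[2], simd[3], x, y, z, w);
    return 0;
}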
"Can't think of any in the past few years, definitely nothing in line with the theoretical numbers."

RV770 compared to GT200/GT200b, Cypress compared to Fermi.
"So the problem isn't a lack of ILP, it's a lack of ALU instructions in general? So then why overbuild to such a great extent if occupancy is not a problem?"

What Jawed linked.
"Lost me there. Cypress's units have 5 slots. I'm not sure why you consider filling those slots with precompiled VLIW instructions from a single thread as something particularly relevant to the prior discussion on issue width. As long as all instructions are coming from one thread the issue width is 5, so what does "two slotted" mean?"

Imagine you have four instructions:
t NOP // nothing in here, will be removed with NI either way
w MUL _, g, h // g*h, result not written to reg file
z ADD a, b, c // a = b + c
y MUL_PREV f,i // f = g*h*i, takes result of the w lane
x ADD_PREV d,e // d = a + e, takes result of the z lane
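In plain C the dataflow of that bundle is roughly the following (my paraphrase, with names taken from the listing); because each _PREV op consumes its partner slot's result inside the same instruction, both dependent pairs complete in one bundle instead of two.

/* Two dependent pairs packed into one VLIW bundle (illustrative only). */
void pairs(float b, float c, float e, float g, float h, float i,
           float *d, float *f)
{
    float a  = b + c;   /* z slot: ADD                                  */
    *d       = a + e;   /* x slot: ADD_PREV consumes the z-slot result  */
    float gh = g * h;   /* w slot: MUL, result never hits the reg file  */
    *f       = gh * i;  /* y slot: MUL_PREV consumes the w-slot result  */
}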
"Wouldn't that be overkill? HD5870 wasn't able to utilize its 32 ROPs to the same degree RV790 utilized its 16 ROPs, so I don't think we'll see that change soon."

HD5870 is ~80% faster than HD5770 on average in games - we're back to the old argument about how scalable games are, CPU bottlenecks, and Cypress's rasterisation bottleneck.
"HD5830 has 16 ROPs, but its problem is the bandwidth thing..."

Well, it's a mess. This:
"1280 SPs (+14% compared to HD5830), better front-end/efficiency (+5%?), 50MHz higher clock speed (+6%)... that's around 30% over HD5830. Native 16 ROPs, connected to the memory controller better than the crippled HD5830's are, would also boost performance by a few percent. That's HD5850/GTX470 ball-park. A nice competitor for GTX460."

If the memory system (and some other bits?) is radically better in SI then maybe all of that is possible with 128 bits. It'd be cool if it was.
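(Compounding those admittedly rough percentages does land in the quoted ballpark:

\[ 1.14 \times 1.05 \times 1.06 \approx 1.27 \]

i.e. a bit under 30% over HD5830.)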
"That analysis doesn't tell me why AMD's parts are only on par with or slower than the competition when they have an abundance of flops, higher texturing throughput and similar bandwidth."

The topic at hand is FLOPS. The analysis shows they're not relevant (well, except that Grid was well known for being faster on HD4870 than GTX280):
"If you have more of everything and aren't any faster, what does that mean?"

You clearly aren't paying attention.
"Cypress' gaming performance relative to Fermi is in line with its theoretical shader throughput advantage? I don't understand how you can make a case that lots of VLIW flops are useful when it doesn't result in any tangible benefit, even when coupled with a higher texturing rate too. Where I'm coming from is that if occupancy is high and games are ALU-light, then Cypress could achieve the same performance with fewer shader resources. Unless of course one of those assertions is false."

Which chip spends more area on ALUs? The last direct comparison we have is that GT200 (65nm) spends 210mm² on ALUs against 76mm² on RV770 (55nm).
Oh, and in your list you conveniently forgot fillrate and Z rate. And if you're going to compare anything to Fermi, rasterisation granularity.
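A rough process normalisation (my arithmetic, assuming an ideal optical shrink, which flatters GT200): even scaled from 65nm to 55nm, GT200's ALU area would still be roughly double RV770's.

\[ 210\ \mathrm{mm}^2 \times \left(\tfrac{55}{65}\right)^2 \approx 150\ \mathrm{mm}^2 \quad \text{vs.} \quad 76\ \mathrm{mm}^2 \]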