AMD: R9xx Speculation

I think there's little motivation for AMD to massively invest in the L1/L2 structure because of their (comparatively) massive register files. For conventional workloads this approach seems to work just fine, as long as you have enough math to hide memory subsystem accesses. Fermi, by contrast, needs its elaborate memory subsystem partly to compensate for its smaller register file.
Bulking up on register files is a poor way of reducing average texturing latency, especially when a few KB of L1 (~10 KB) would do the job, whereas the register file needs a large increase (~50 KB) to make any noticeable impact on latency.

Also, their register file isn't that big considering the ALUs they have.
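The cache-vs-registers claim above can be put into rough numbers. This is a minimal sketch with purely illustrative latency and hit-rate assumptions (none of these figures are measured hardware values):

```python
# Illustrative comparison: a small texture L1 vs. a bigger register file
# as ways of dealing with texturing latency. All numbers are assumptions
# for the sake of the argument, not measured hardware figures.

L1_HIT_LATENCY = 20   # cycles for an L1 hit, assumed
MEM_LATENCY    = 400  # cycles for an off-chip fetch, assumed

def avg_latency(hit_rate):
    """Average texture fetch latency for a given L1 hit rate."""
    return hit_rate * L1_HIT_LATENCY + (1 - hit_rate) * MEM_LATENCY

print(avg_latency(0.0))         # no cache at all: 400.0 cycles
print(round(avg_latency(0.8)))  # with 80% hits: ~96 cycles

# A bigger register file changes per-fetch latency not at all; it only
# lets more wavefronts be in flight to *hide* the unchanged 400 cycles.
```

The point being that a small cache attacks the latency itself, while extra registers only buy more latency *tolerance*.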
 
I think there's little motivation for AMD to massively invest in the L1/L2 structure because of their (comparatively) massive register files. For conventional workloads this approach seems to work just fine, as long as you have enough math to hide memory subsystem accesses. Fermi, by contrast, needs its elaborate memory subsystem partly to compensate for its smaller register file.
There's a degree of truth in all that: VLIW chomps through addresses in register files at a relatively high rate, after all. (Though NVidia's register file architecture is barely behind once you take the banking into account.) And VLIW tends to want more temp registers, though the pipeline registers and static temp registers (there's up to 8 of them per work item in Evergreen) in ATI both negate a substantial portion of that.

But I think an enhanced L1/L2 architecture is inevitable because:
  1. tessellation - always writing data off die, uncached (as seems likely in Cayman), seems lunatic, particularly as tessellation is partly intended as a way of saving bandwidth.
  2. compute - global buffer read/write is a fact of life now and arbitrary producer-consumer depends upon it. Compute is part of graphics. No excuses.
  3. register spill - can't be avoided with the hairiest kernels. Though I think the current compiler throws its hands up in horror and resorts to spill far too easily.
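Point 1 above can be sketched with some rough arithmetic. Everything here is an illustrative assumption (the vertex rate, attribute size, and the write-then-read-back pattern), not a Cayman specification:

```python
# Rough arithmetic for streaming tessellation output off die, uncached:
# each post-tessellation vertex costs a write and a read-back. All the
# numbers below are assumptions for illustration, not Cayman specifics.

VERTS_PER_SEC  = 1.0e9  # assumed post-tessellation vertex rate
BYTES_PER_VERT = 32     # assumed size of an output vertex's attributes

offdie_traffic = VERTS_PER_SEC * BYTES_PER_VERT * 2  # write + read back
print(offdie_traffic / 1e9, "GB/s")  # traffic an on-die cache could
                                     # largely absorb instead
```

Even with these modest assumptions the result is a sizeable fraction of a typical GDDR5 card's total bandwidth, which is the "lunatic" part of doing it uncached.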
WRT the 70%: do you mean HD 6970 vs. HD 5830, or fully-blown (and fully-clocked) versions respectively?

HD5830? What's that? :p
 
Jawed, if your prediction is correct, then Cayman will outperform GTX 580 by 15-20%.

Another leaked slide: http://forums.overclockers.co.uk/showpost.php?p=17962524&postcount=627

You mean another fake?

[attached images: 69700.jpg, fe47b9e4-0baa-456c-acee-39a80487bc19.jpg]
 
I can't understand how these people think. How could a GPU that's only ~50% bigger than the super-efficient Barts be ~100% faster?
 
I can't understand how these people think. How could a GPU that's only ~50% bigger than the super-efficient Barts be ~100% faster?
How could RV770 be only 33% bigger than RV670 but almost 100% faster?

The re-architecting between RV670 => RV770 and Cypress => Cayman will be on a roughly comparable scale, or even more pronounced.

I'm not saying it will be that way, but flatly excluding a more significant performance increase would be stupid.
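The RV670 => RV770 counterexample can be framed as perf-per-area arithmetic. The die-size and speedup figures below are just the rough numbers quoted in this thread, used for illustration:

```python
# Die area vs. performance scaling, using the thread's rough numbers.
# These are illustrative figures, not exact die measurements.

def perf_per_area_gain(area_growth, perf_growth):
    """How much perf/mm^2 improves for given area and perf increases."""
    return (1 + perf_growth) / (1 + area_growth)

# RV670 -> RV770: ~33% bigger, ~100% faster => perf/mm^2 up ~1.5x
print(round(perf_per_area_gain(0.33, 1.00), 2))

# So Barts -> Cayman at +50% area would need perf/mm^2 up ~1.33x
# to deliver +100% performance -- large, but it has precedent.
print(round(perf_per_area_gain(0.50, 1.00), 2))
```

Which is exactly the argument: a big per-area efficiency jump is rare, but RV770 shows it isn't impossible.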
 
Gipsel: They made MSAA resolve work (that was the greatest bottleneck of R6xx performance), removed full-speed FP16 filtering (which went unused by the games of that time) and replaced the ring bus with a more area-efficient solution. (In fact, if you clock R600 and RV770 at the same speed and switch off MSAA, the performance difference in many games is almost proportional to the amount of additional transistors in RV770.)

Barts isn't a broken GPU that could be boosted in performance by a simple fix. Barts also doesn't have anything that goes unused; even the DP support was removed. So any added feature will decrease its efficiency, not the contrary. If you add DP support, it will increase die size but won't increase gaming performance. If you add a second geometry engine, it will increase die size but won't affect performance in >90% of games, etc. ATI/AMD can hardly make a GPU that would be more efficient and better feature-equipped than Barts...
 
Bulking up on register files is a poor way of reducing average texturing latency, especially when a few KB of L1 (~10 KB) would do the job, whereas the register file needs a large increase (~50 KB) to make any noticeable impact on latency.

Also, their register file isn't that big considering the ALUs they have.
It's design choices, as always. Some people lean one way, other people another. It seems to have worked pretty well, at least up until now.

WRT their register files: I think you don't have to look at the number of ALUs but at the average lifetime of a given thread. IIRC, AMD's architecture has shorter latencies than Nvidia's, doesn't it?
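The latency/register-lifetime link can be sketched directly: the shorter the latency a thread must sit out, the fewer threads (and registers) must be in flight to keep the ALUs busy. All figures below are assumptions for illustration only:

```python
# Sketch of the latency-hiding trade-off: how many threads (and hence
# how many registers) it takes to cover a memory latency with ALU work.
# All figures are assumed purely for illustration.

def threads_to_hide(latency_cycles, alu_cycles_per_fetch):
    """Threads needed so the ALUs stay busy while one thread waits."""
    # While one thread waits latency_cycles, the others must together
    # supply that many cycles of math, alu_cycles_per_fetch each.
    return -(-latency_cycles // alu_cycles_per_fetch)  # ceiling division

LATENCY         = 400  # cycles of memory latency, assumed
MATH_PER_FETCH  = 10   # ALU cycles each thread runs per texture fetch
REGS_PER_THREAD = 16   # assumed register footprint per thread

threads = threads_to_hide(LATENCY, MATH_PER_FETCH)
print(threads, "threads,", threads * REGS_PER_THREAD, "regs in flight")
```

Halve the average latency (via a cache, say) and the required in-flight register budget halves too, which is why latency and register-file size trade off directly.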

There's a degree of truth in all that: VLIW chomps through addresses in register files at a relatively high rate, after all. (Though NVidia's register file architecture is barely behind once you take the banking into account.) And VLIW tends to want more temp registers, though the pipeline registers and static temp registers (there's up to 8 of them per work item in Evergreen) in ATI both negate a substantial portion of that.

But I think an enhanced L1/L2 architecture is inevitable because:
  1. tessellation - always writing data off die, uncached (as seems likely in Cayman), seems lunatic, particularly as tessellation is partly intended as a way of saving bandwidth.
  2. compute - global buffer read/write is a fact of life now and arbitrary producer-consumer depends upon it. Compute is part of graphics. No excuses.
  3. register spill - can't be avoided with the hairiest kernels. Though I think the current compiler throws its hands up in horror and resorts to spill far too easily.

HD5830? What's that? :p
Hehe...

But there already is a cache structure in Evergreen and earlier AMD parts, which seems to handle the most terrible performance pitfalls adequately - except for high tessellation levels. And that supposedly is changing with Cayman.
 