AMD: R9xx Speculation

One way of arranging it could be this.

It would be better if that illustration was rotated by 90°, though.
I don't know, it looks to me like someone stared too long at Cypress & Fermi diagrams until it all blended together :).
Seriously, I haven't seen anything credible about a reorganization (this diagram also has decoupled TMUs, for instance). So if you assume there's no major reorganization, 30 SIMDs would only leave 2x15, 3x10, 5x6, or 6x5 as useful choices. I doubt the first one is a good choice, though, and all others have a non-power-of-two number of groups, which Jawed didn't like (due to the uneven screen space partitioning; can't say I disagree there). So what gives?
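Just to make the arithmetic explicit, here's a quick sketch (my own enumeration, nothing from AMD) listing the equal-group splits of 30 SIMDs and flagging which group counts are powers of two:

Code:
# Enumerate the ways 30 SIMDs could be split into equal-sized groups.
simds = 30
for groups in range(2, simds):
    if simds % groups == 0:
        per_group = simds // groups
        pow2 = (groups & (groups - 1)) == 0  # is the group count a power of two?
        print(f"{groups} x {per_group} SIMDs: power-of-two group count: {pow2}")

Only the 2-group split comes out power-of-two; 3, 5, 6, 10, and 15 groups don't.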
 
Wasn't there a slide where an allegedly improved geometry/tessellation number was given for Cayman? I mean, you could basically base your speculation about the SIMD organization on that, couldn't you?
 
I disagree. Rather than flatly stating that I'm wrong, perhaps you could try persuading me to see things differently.

I have never said that multi-GPU scaling should be linear. I'm just pointing out that physically, multi-GPU setups are very demanding. It's an indisputable fact that HD5970 consumes 256 GB/s of bandwidth.
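For reference, the back-of-the-envelope sum (assuming the stock HD 5970 configuration: two GPUs, each with a 256-bit bus and 1000 MHz / 4.0 Gbps effective GDDR5):

Code:
per GPU: 256 bit / 8 x 4.0 Gbps = 128 GB/s
card:    2 x 128 GB/s           = 256 GB/s aggregate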

Perhaps you can demonstrate how this bandwidth is achieved? Is it the same when considering, say, two GTX 460's in SLI? Or two HD 5870's in Crossfire?

'Consumes' is an odd word for this context, to my mind; are you a non-native English speaker? I ask to be sure I am not overlooking context or assuming knowledge of terminology that might be applied incorrectly.
 
all others have a non-power-of-two number of groups, which Jawed didn't like (due to the uneven screen space partitioning; can't say I disagree there).
Given that the screen space tiles are fairly small (8x8?), I wouldn't consider that a problem, as it will be pretty well balanced either way.
Doesn't the GTX 465 have only 3 GPCs enabled? It works there too.
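A rough illustration of why small tiles balance well (toy numbers of my own, 8x8 tiles at 1920x1200, nothing architecture-specific):

Code:
# Distribute 8x8 screen tiles round-robin over a non-power-of-two group count.
width, height, tile = 1920, 1200, 8
tiles = (width // tile) * (height // tile)  # 240 * 150 = 36000 tiles
for groups in (2, 3, 5, 6):
    base, extra = divmod(tiles, groups)
    # The remainder is at most groups-1 tiles out of tens of thousands.
    print(f"{groups} groups: {base} tiles each, {extra} left over")

Even in the worst case the remainder is only a handful of tiles out of tens of thousands, so non-power-of-two group counts balance essentially perfectly.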
 
shhh, don't tell that to trinibwoy ;)

Heh, you guys can choose to revel in your ignorance all you like but the processing capabilities of multi-gpu cards are double those of a single card. The only thing that isn't doubled is memory capacity. Care to explain how two GPUs working on two frames in parallel does not constitute double the processing power? I await your informed explanation..... :)
 
Heh, you guys can choose to revel in your ignorance all you like but the processing capabilities of multi-gpu cards are double those of a single card. The only thing that isn't doubled is memory capacity. Care to explain how two GPUs working on two frames in parallel does not constitute double the processing power? I await your informed explanation..... :)

two times trinibwoy doesn't represent double the sexiness because any girl would only want one of them at the same time. So that would mean whilst the capability to be laid has doubled, the inputs would need to double also. So that means they would have to find twice as many girls in order to keep up an efficiency rate of 60 girls per trinibwoy per night.
 
two times trinibwoy doesn't represent double the sexiness because any girl would only want one of them at the same time. So that would mean whilst the capability to be laid has doubled, the inputs would need to double also. So that means they would have to find twice as many girls in order to keep up an efficiency rate of 60 girls per trinibwoy per night.

I have no idea what you're talking about but it sounds good to me :LOL: 120 "hurtz" please!
 
No, just available horsepower. Single-GPU scaling is also subject to external bottlenecks, so those aren't unique to dual-GPU cards.

I think various CUDA tests have shown that GPGPU in particular can be sensitive to CPU bandwidth.
 
two times trinibwoy doesn't represent double the sexiness because any girl would only want one of them at the same time. So that would mean whilst the capability to be laid has doubled, the inputs would need to double also. So that means they would have to find twice as many girls in order to keep up an efficiency rate of 60 girls per trinibwoy per night.
:LOL:
 
Given that the screen space tiles are fairly small (8x8?), I wouldn't consider that a problem, as it will be pretty well balanced either way.
Doesn't the GTX 465 have only 3 GPCs enabled? It works there too.
Nominally each shader engine in Cypress and Barts has a dedicated set of ROPs, which works nicely with screen space tiling and the per-SE rasteriser.

But in truth the ROPs must be accessible globally, because of atomics.

So, ultimately, it seems unavoidable that ATI will have to adopt a layout like NVidia's, as the count of SIMDs climbs ever higher - along with having actual scalable tessellation/setup/rasterisation instead of the current crap. So my earlier objection doesn't carry much weight.

Each new ATI layout does have me scratching my head for a while (I still remember the consternation over Xenos, which led to confusion in R580), so I'm sort of expecting to be surprised.

My thoughts are simple: if it's 24 or fewer SIMDs, then I expect the overall layout to be much like Cypress - and another year of waiting for a real overhaul ensues... If it's more than that, then all bets are off.

Call me wildly optimistic, but in light of the revelation about the crappy performance of 32 ROPs in Cypress/Barts and the serious lack of balance in Cypress, I expect Cayman to be re-balanced for ~70% more performance than Cypress. At least 60%. Cypress sets a fairly low bar, frankly.
 
Nominally each shader engine in Cypress and Barts has a dedicated set of ROPs, which works nicely with screen space tiling and the per-SE rasteriser.
Are you sure about this?
It's true that the rasterizers and the ROPs get assigned a set of screen tiles. But AFAIK in both Cypress and Fermi the assignments are "interleaved", i.e. the screen tiles of one rasterizer are distributed over all ROPs and each ROP gets screen tiles from all rasterizers. It's fairly easy to do, and anything else simply makes no sense from a load-balancing point of view.
And as you write here:
But in truth the ROPs must be accessible globally, because of atomics.
there's a crossbar in front of the ROPs either way, so each ROP is accessible from both shader engines. Why would you dedicate half of the ROPs to one shader engine in Cypress? It makes no sense; you'd only increase the probability of hitting a performance pitfall.
 
True, using the smaller/simpler MC saves PHY area, but I would also assume they are more power efficient, though that is also due to running at slower speeds.

Also, according to AnandTech, the 6850s have no problem hitting 4.6 GHz effective, so the 4 GHz reference speed must be for another reason.

Yield, or put another way, return rate.
Thus quoth Dave Baumann:
I've said it before, I'll say it again - ASICs have variable levels of leakage and you cannot take an absolute power differential when there is just a sample of one (of each).

What Dave is saying is that individual chips vary. Return rate is a killer of margins and causes problems along the entire chain, from partners all the way down to consumers. Their cards need to work as specified even with the lower parts of the bin, in poorly ventilated cases, on iffy power, during summer, et cetera. So they engineer in margins - the very margins overclockers exploit.

By and large, reviewers and most people with lots of time to spend on forums value cards by a single figure of merit: price/performance. So AMD and nVidia have reason to push clocks as high as possible. However, they don't want to get into trouble with returns (and they are somewhat limited by the cost of the cooling apparatus, which can't be too high for any given segment, as that would also hurt p/p). It's a balancing act. Generally the margins are thinner today than, say, 10 years ago, and there are reasons for that. But there have to be margins, and they have to be sufficient to allow for variance in parts and environments.

In this case, though, differentiation from the 6870 may or may not have been a contributing factor. If 4200 MHz was deemed the limit for the 6870, only AMD can know whether the bin they use for 6850s allows for weaker performance from the memory or whether it's part of market positioning.
 
Gipsel: Maybe I'm mistaken, but I thought it's already decided at the rasterizer level which RBE block will get a given pixel (speaking about Cypress).
 
Gipsel: Maybe I'm mistaken, but I thought it's already decided at the rasterizer level which RBE block will get a given pixel (speaking about Cypress).
Indeed, it is. Simply because the rasterizer is exactly where one knows which screen tile a pixel belongs to ;)

But that still doesn't mean that half of the ROPs are bound to one rasterizer (or the SIMD block it belongs to). AFAIR there is some presentation detailing how this screen tiling works.

Let A and B designate that a tile belongs to rasterizer A or B respectively, and let the suffixes 1, 2, 3, and 4 denote the ROP partition. The screen could be tiled roughly in this way:
Code:
A1 B2 A3 B4 A1 B2 A3 B4 ...
B3 A4 B1 A2 B3 A4 B1 A2 ...
.. ..
That may not be the optimal tiling, but you get the idea.
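To make that concrete, here's a toy version of such an interleaved mapping (my own construction; the real tile hash in Cypress/Fermi isn't public) - rasterizer by checkerboard parity, ROP partition by a small stride, so each rasterizer's tiles fan out over all four ROP partitions:

Code:
# Toy interleaved mapping: 2 rasterizers (A/B), 4 ROP partitions (1-4).
# Purely illustrative; the actual hardware hash is not public.
def assign(tx, ty):
    rasterizer = "AB"[(tx + ty) % 2]  # checkerboard between rasterizers
    rop = (tx + 2 * ty) % 4 + 1       # stride spreads tiles over all ROPs
    return f"{rasterizer}{rop}"

for ty in range(2):
    print(" ".join(assign(tx, ty) for tx in range(8)))

That prints exactly the two rows above: every ROP partition sees tiles from both rasterizers and vice versa.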
 
To be honest I'm not sure. The diagrams don't support my position - they show shader export with a bus that links to all ROPs/MCs - but of course that's needed for atomics plus CS/GS(SO) writes (and pixel shader writes to linear buffers).

Since the rasterisers work in tiles, they provide a framework for further sub-division into tiles organised per ROP-quad.

Rasteriser tiles provide load-balancing for fragments per triangle, and ROP-quad-sized tiling within them provides latency hiding for render-target<->memory operations.

There's no need to make all SIMDs write to all ROPs to load-balance, in the general case.
 
There's no need to make all SIMDs write to all ROPs to load-balance, in the general case.
Of course there is no real need. It just makes the balancing better, as you distribute the load over a wider set of units. I see no reason why ROPs should be tied to a rasterizer. They aren't in Cypress, and they aren't in Fermi either.
 