Performance evolution between GCN versions - Tahiti vs. Tonga vs. Polaris 10 at same clocks and CUs

Even the quadrupled L2 size with double the throughput still barely hints at any tangible performance benefit. And yes, I know GCN still relies on its proprietary global data share for syncing, on top of the dedicated color/depth caches. The faster primitive occlusion and tessellation also don't show much of a contribution, despite the roomier L2 holding the spillovers to global memory.
 
Did the Tahiti vs. Tonga vs. Polaris comparison from computerbase.de go unnoticed in this thread? It shows impressive performance boosts between architectures, especially in GameWorks titles where tessellation is heavily (ab)used:

[image: hE5eVw.png]


Between Tahiti and Polaris we see a full 35% performance boost out of exactly the same theoretical compute throughput / fillrate, and with lower bandwidth.
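The "same theoretical throughput" claim is easy to check on paper: GCN's peak FP32 rate is just CUs × 64 lanes × 2 FMA ops × clock. A quick sketch; the 1.0 GHz clock here is an assumed round number for illustration, not necessarily the clock the reviewers actually used:

```python
# Theoretical single-precision throughput for a GCN GPU:
# FLOPS = CUs * 64 lanes/CU * 2 ops per FMA * clock
def gcn_tflops(cus, clock_ghz):
    return cus * 64 * 2 * clock_ghz / 1000.0

# All three cards in the test have 32 CUs; at an assumed common 1.0 GHz
# each architecture lands on the exact same peak figure.
print(gcn_tflops(32, 1.0))  # 4.096 TFLOPS
```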



How often have we said this when AMD launched a new GPU? Maybe too often already.

And how many times have those predictions failed to come true?
Recent comparisons of Tahiti vs. GK104, Hawaii vs. GK110 and Hawaii vs. GM204 are pretty much self-explanatory right now.
 
I figured it's time to branch off from the other mess of a thread, and this is an interesting subject.
Computerbase.de has made a comparison between Tahiti, Tonga and Polaris 10 GPUs, all with the same number of CUs enabled and the same clocks.

https://www.computerbase.de/2016-08...ormance/#abschnitt_gcn_1_3_und_4_im_vergleich


This is basically GCN1 vs. GCN3 vs. GCN4, in the form of an R9 280X, an R9 380X and an RX 470.

In some games the difference is pretty negligible (e.g. Dirt Rally, The Talos Principle), but in others the performance boost is pretty huge:

[image: hE5eVw.png]



In general, the games that get the largest boosts from the architectural improvements are the GameWorks titles, which tend to push geometry as far as Maxwell cards can handle.
But there are games like Ashes of the Singularity which show sizable performance boosts too. Maybe the support for a larger number of compute queues in GCN3 and the HWS units in GCN4 makes a difference when async compute is being used.
 
Did these all arrive at the same price points when entering the market?
Also, in today's market, are they competitive in price points -- err I mean if you were to buy them new?
 
Even the quadrupled L2 size with double the throughput still barely hints at any tangible performance benefit. And yes, I know GCN still relies on its proprietary global data share for syncing, on top of the dedicated color/depth caches. The faster primitive occlusion and tessellation also don't show much of a contribution, despite the roomier L2 holding the spillovers to global memory.
I wonder what their source is for Tonga having a 512kB L2 cache. AMD never published this information at the time of Tonga's release.
Fiji had 2MB, and was in most respects a doubled Tonga (apart from shader engines: it kept the same 4 but doubled the CUs within each).
 
It's a good guess, I think. Hawaii packs 1MB (8 memory controllers, 128kB per partition) and Tahiti came with 768kB (6 controllers), so it's logical that Tonga keeps the same amount of L2 SRAM per partition.
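That per-partition reasoning is simple arithmetic. Assuming 128 kB of L2 per 32-bit memory-controller partition (the figure known for Hawaii and Tahiti), the sizes fall out directly:

```python
# Assumed: each GCN L2 partition holds 128 kB, one partition per
# 32-bit memory controller (known for Hawaii and Tahiti).
KB_PER_PARTITION = 128

def l2_size_kb(num_controllers):
    return num_controllers * KB_PER_PARTITION

print(l2_size_kb(8))  # Hawaii, 512-bit bus -> 1024 kB (1 MB)
print(l2_size_kb(6))  # Tahiti, 384-bit bus -> 768 kB
print(l2_size_kb(4))  # Tonga at 256-bit  -> 512 kB, matching the guess
```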
 
Did these all arrive at the same price points when entering the market?
Also, in today's market, are they competitive in price points -- err I mean if you were to buy them new?

Both Tahiti and Tonga went through different PCBs, core clocks, memory clocks, etc. during their lifetimes.
Moreover, I think none of the cards in the article are running the clocks they were shipped with, so such a comparison wouldn't make much sense.
The reviewers clocked all cards the same in order to get the same compute and fillrate throughputs, in an effort to compare the architectures and not their position in the market.



That said, I was hoping for this thread to be about discussing which architectural improvements between GCN1-4 have resulted in substantial differences such as the ones we're seeing in Witcher 3 and Ashes of the Singularity.
 
Lol no worries. I do appreciate the OT. I was just wondering how they were priced relative to each other. This is a good showcase of architecture.


 
Well, ISA-wise nothing changed between Tonga and Polaris, just the front end. So in games where there's a large difference between the two, a reasonable guess might be "Tahiti/Tonga/etc. were severely bottlenecked by triangle processing in this part of the game, whereas Polaris was not". GCN in general was (is?) weak in this area compared to the competition (both Nvidia and Intel).

But graphics is a complicated business, hard to say with any level of certainty without some profiling. This would be my offhand guess.
 
Here's another comparison that popped up today (credit to @Alessio1989 )

http://www.bitsandchips.it/9-hardware/7334-tonga-vs-polaris-sfida-clock-to-clock


Well, ISA-wise nothing changed between Tonga and Polaris, just the front end.

So you're suggesting the average 7% better performance-per-clock between Tonga and Polaris comes from the increased L2 cache alone?
Witcher 3 and Metal Gear Solid V are showing a whopping 15% difference, though those are both GameWorks titles.
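For what it's worth, an "average X% faster" figure over per-game ratios is usually best taken as a geometric mean rather than an arithmetic one. A sketch with made-up speedup numbers (placeholders for illustration, not the article's data):

```python
import math

# Hypothetical per-game Polaris/Tonga per-clock FPS ratios,
# chosen only to illustrate the averaging, not measured values.
speedups = [1.15, 1.02, 1.07, 1.15, 1.00]

# Geometric mean: average of the logs, then exponentiate.
geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
print(f"{(geomean - 1) * 100:.1f}% average per-clock gain")
```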


Regardless, it really does seem that most of the architectural improvements happened during the 28nm generations, and Polaris' changes may just be all about the new node and specific power improvements.
The OpenCL driver does call Tahiti "SI" (below Bonaire+Hawaii at GFX7), Tonga GFX8 and Polaris GFX8.1.



Interesting to see how each architecture step brought its own geometry performance improvement, up until Polaris 10, which seems to be evened out with GP106.
Though I don't know if that's good or depressing (or both). On one hand, the newer cards aren't hurting as much in GameWorks titles. On the other, AMD is spending R&D resources to counter GameWorks in their own hardware. This has to be frustrating, as all multiplatform games are console ports coming from a more compute-centric GCN1/2 architecture in the first place.
 
It's a good guess, I think. Hawaii packs 1MB (8 memory controllers, 128kB partition), Tahiti came with 768kB (6 controllers), so it's logical that Tonga keeps the same amount of L2 SRAM per partition.
The "good guess" isn't good enough. 1MB would be just as good a guess (why not say it's a doubled Bonaire instead of a Hawaii derivative?). 512kB, 1MB, 768kB and 1.5MB (though supposedly for the latter two options part of it would be deactivated) are all options I've seen mentioned somewhere for Tonga; all are good guesses, but only one is right...
I don't think the performance difference would be all that big in any case; surely if there were a 3% performance difference, AMD would have increased it earlier (the die area the L2 uses should still be tiny).
If you really wanted to know, I suppose some directed compute tests could reveal it.
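A directed test of that kind usually pointer-chases through working sets of growing size and looks for the latency knee at each cache boundary. Here's a CPU-side Python sketch of the technique only; an actual probe of a GPU's L2 would run the chase inside an OpenCL/compute kernel:

```python
import random
import time

def make_chase(n):
    """Build a random cyclic pointer chain covering all n slots, so
    each load depends on the previous one (no prefetch-friendly stride)."""
    order = list(range(n))
    random.shuffle(order)
    nxt = [0] * n
    for i in range(n):
        nxt[order[i]] = order[(i + 1) % n]
    return nxt

def time_chase(nxt, steps):
    """Average time per dependent load; latency jumps once the working
    set spills out of a cache level."""
    i = 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = nxt[i]
    return (time.perf_counter() - t0) / steps

# Sweep working sets around the suspected cache size; a knee in the
# reported latencies marks the boundary.
for slots in (2**12, 2**14, 2**16, 2**18):
    print(slots, time_chase(make_chase(slots), 100_000))
```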
 
I'm suggesting the ISA between Tonga and Polaris is the same, nothing more! I believe this can be verified with CodeXL (although it seems like you figured this out on your own). It's nearly impossible to connect performance differences (especially when they're in the single-digit % range) to a single block of hardware. I wasn't kidding when I said graphics is complicated! You're right though, in general not a whole lot changed with Polaris. Perhaps there are some interesting chips on the horizon...

Off-topic, but I think you should consider the possibility that developers (even those working on GameWorks titles) don't create rendering pipelines that purposely cater to one IHV to the detriment of another. I promise there's no hidden agenda among developers (at least none that I've encountered). The notion that AMD is "wasting" R&D money on geometry processing is kind of crazy. There's nothing wrong with processing triangles faster! I don't think it's unreasonable to say AMD overshot compute a bit with GCN, just like Nvidia overshot compute a tad with Fermi. I view GCN's "rebalancing" as a reflection of reality and not a reaction to GameWorks (let's be real, it's doubtful AMD's engineers even knew of GameWorks' existence when designing these revisions).
 
Though I don't know if that's good or depressing (or both). On one hand, the newer cards aren't hurting as much in GameWorks titles. On the other, AMD is spending R&D resources to counter GameWorks in their own hardware. This has to be frustrating, as all multiplatform games are console ports coming from a more compute-centric GCN1/2 architecture in the first place.

Pretty sure GCN-experienced developers like sebbbi have said that AMD's geometry engines needed improving. It seems very unlikely that they've done this to counter GameWorks; more likely it's a side of the pipeline that was bottlenecking them, and so they improved it.

Up next are expected to be the ROPs, where Nvidia is now ahead of them. As well as becoming more BW- and power-efficient, they would also appear to need more of them. They have been limited to a maximum of 16 per shader engine.
 
Outside the things AMD once said nobody needed (more geometry and tessellation power), I think much of the improvement comes from more cache and better colour compression, so more effective bandwidth. It is quite interesting that the biggest gain comes from improved geometry power in the CB tests. I am more and more inclined to think that GCN was not a good architecture for DX11.
 
Up next are expected to be the ROPs, where Nvidia is now ahead of them. As well as becoming more BW- and power-efficient, they would also appear to need more of them. They have been limited to a maximum of 16 per shader engine.
AMD's performance per ROP (and per unit of fillrate) is far ahead of NVidia, in games.

AMD's real problem is not doing tile-binned rasterisation. Doing that would have the side effect of "making the ROPs more efficient", but in truth wouldn't make any difference. Not doing work on a triangle you know will be overwritten is a win.

The irony is that with clustered geometry/occlusion algorithms, the need to do tile-binned rasterisation disappears. AMD could re-architect for this just in time for all the advanced engines to do a better job themselves.
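The overdraw argument is easy to see with a toy fragment count: shade-as-you-go pays for every surface that later loses the depth test, while a binned tile can resolve visibility before shading anything. A deliberately simplified sketch (full-tile "triangles" with flat depths; nothing like real hardware scheduling):

```python
# Fragments per tile in this toy model (a 16x16 tile).
TILE = 16 * 16

def immediate_shaded(depths_in_submit_order):
    """Immediate mode: every fragment that passes the depth test at
    submission time is shaded, even if a later triangle covers it."""
    shaded, zbuf = 0, float("inf")
    for z in depths_in_submit_order:  # smaller z = nearer
        if z < zbuf:
            shaded += TILE
            zbuf = z
    return shaded

def binned_shaded(depths_in_submit_order):
    """Tile binning: visibility is resolved for the whole bin first,
    so only the nearest surface gets shaded."""
    return TILE if depths_in_submit_order else 0

# Back-to-front submission is the worst case for immediate mode:
layers = [5.0, 4.0, 3.0, 2.0, 1.0]
print(immediate_shaded(layers))  # 1280 fragments shaded, 1024 wasted
print(binned_shaded(layers))     # 256 fragments, one per tile pixel
```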
 
AMD's performance per ROP (and per unit of fillrate) is far ahead of NVidia, in games.

Perhaps, but their absolute performance is frequently behind and having more performance per unit of fillrate is scant consolation if it's bottlenecking other areas of your system. Which, it appears, might be happening with Polaris and certainly was something that happened with Fiji.

AMD's real problem is not doing tile-binned rasterisation. Doing that would have the side effect of "making the ROPs more efficient", but in truth wouldn't make any difference. Not doing work on a triangle you know will be overwritten is a win.

The irony is that with clustered geometry/occlusion algorithms, the need to do tile-binned rasterisation disappears. AMD could re-architect for this just in time for all the advanced engines to do a better job themselves.

One thing I've realised about the PC gaming market is that it's never too late to add an optimisation that you need.

The exciting new way of doing things always gains mass adoption three years later than you hoped it would ...
 
Outside the things AMD once said nobody needed (more geometry and tessellation power), I think much of the improvement comes from more cache and better colour compression, so more effective bandwidth. It is quite interesting that the biggest gain comes from improved geometry power in the CB tests. I am more and more inclined to think that GCN was not a good architecture for DX11.

It's been an absolute belter in consoles though, where games really lean on compute / async compute, and "Vulkan Doom" style results will be far more common.

Probably better for DX11 than Kepler too, tbh.
 