Performance evolution between GCN versions - Tahiti vs. Tonga vs. Polaris 10 at same clocks and CUs

Off-topic, but I think you should consider the possibility that developers (even those working on Gameworks titles) don't create rendering pipelines that purposely cater to one IHV to the detriment of another IHV.

I didn't suggest developers create rendering pipelines that purposely cater to one IHV (though I wouldn't put too much trust in Epic in that regard).
Gameworks itself purposely caters to one IHV (duh) and some developers use it. I think we can both agree on that.


I promise there's no hidden agenda among developers (at least none that I've encountered).
Of course not. Not even the ones using Gameworks, because it brings them tools that accelerate development and ultimately save them time and money.
I don't doubt they know that Gameworks brings the caveat of favoring one IHV over another, but when you have a budget and a deadline you must balance all things together and make such decisions. It's perfectly understandable.



I view GCN's "rebalancing" as a reflection of reality and not a reaction to Gameworks (let's be real, it's doubtful AMD's engineers even knew of Gameworks' existence when designing these revisions).
Pretty sure GCN-experienced developers like sebbbi have said that AMD's geometry engines needed improving. It seems very unlikely that they've done this to counter Gameworks; more likely it was a side of the pipeline that was bottlenecking them, so they improved it.

It's most probably true that GCN chips are more balanced now with their current geometry performance than they were in 2012. Though "balance" is a moving goalpost that depends on each engine, each game and even each scene. As an "extreme" example, how much geometry performance does The Tomorrow Children need?

Maybe the geometry performance progression in GCN is purely for balance reasons, or maybe it's also because some of Nvidia's "flagship effects" in their tools (e.g. HairWorks and god rays) are geometry-intensive.
BTW, Gameworks is just the name they came up with for a conglomerate of tools, some of which had been pushed through TWIMTBP for years.
I'm pretty sure AMD engineers were aware that Nvidia would try to push their geometry performance through their own tools ever since Kepler came out in 2012, just like AMD themselves did with their compute advantage using TressFX, for example.

Though maybe this is a subject for another thread?
 
Perhaps, but their absolute performance is frequently behind and having more performance per unit of fillrate is scant consolation if it's bottlenecking other areas of your system. Which, it appears, might be happening with Polaris and certainly was something that happened with Fiji.
Maxwell and Pascal send less work to the ROPs, so in games it's generally impossible to say that ROP throughput is the reason for performance differences. Delta colour compression is another factor. The three things go together beautifully in NVidia and hide ROP-specific aspects of game performance. Synthetic tests definitely show many pure fillrate advantages for NVidia (but certainly not a clean sweep):

http://www.hardware.fr/articles/952-9/performances-theoriques-pixels.html

ROP performance has disappeared into a black box for games.

One thing I've realised about the PC gaming market is that it's never too late to add an optimisation that you need.

The exciting new way of doing things always takes three years longer to gain mass adoption than you hoped it would ...
I won't argue that AMD needs to catch up, I just don't think it's the ROPs (quantity or architecture) that's the problem. AMD is doing too much work that's discarded which wastes compute, ROPs, bandwidth and power.
 
I won't argue that AMD needs to catch up, I just don't think it's the ROPs (quantity or architecture) that's the problem. AMD is doing too much work that's discarded which wastes compute, ROPs, bandwidth and power.

This sounds intriguing - I don't understand a great deal about what's going on inside these things beyond the large block diagrams you get with reviews. Could you point me in the direction of some more info about this?
 
This sounds intriguing - I don't understand a great deal about what's going on inside these things beyond the large block diagrams you get with reviews. Could you point me in the direction of some more info about this?


I think Jawed is talking about the type of rasterization nV cards have been doing since Maxwell (tile-based).
 
Perhaps I've misunderstood and used terms incorrectly.

I'd filed tile-based rasterisation under improvements to the ROPs, but it looks like the magic happens in a separate block of the hardware, when working out which groups of fragments to work on concurrently based on post-transform geometry held in a cache.

I need to sit down and go through this. Again.
 
I won't argue that AMD needs to catch up, I just don't think it's the ROPs (quantity or architecture) that's the problem. AMD is doing too much work that's discarded which wastes compute, ROPs, bandwidth and power.
I think the biggest problem has been (before Maxwell/Pascal and async compute) that AMD's compute units are not well utilized in most rendering workloads. There are too many different bottlenecks. Graham's GDC presentation (http://www.frostbite.com/2016/03/optimizing-the-graphics-pipeline-with-compute/) is a good example of this (includes occupancy charts). Async compute during rendering tasks gives AMD easy 20%-30% performance gains. Without bottlenecks in rendering hardware this 20%-30% could be used for the rendering task, helping all games, not just console games and PC games specially optimized for async compute using DX12 and Vulkan.

I don't think AMD did that much excess work prior to Maxwell's tiled rasterizer. Nvidia had delta color compression earlier, and it was more efficient, so that cost AMD a bit of bandwidth, but nothing dramatic. Nvidia's warp size of 32 (vs 64) also slightly reduces work in code with non-coherent branches (but finer granularity adds scheduling hardware cost). Maxwell's tiled rasterizer however is a big problem, unless AMD adopts similar technology. Both depth compression and delta color compression are based on tiles. Partial tile updates cost bandwidth. If a single pixel of a tile is touched, the whole tile must be loaded back to the ROP cache, uncompressed, then compressed again and written back as a whole (*) when it gets kicked out of the ROP cache. Nvidia's tiled rasterizer collects a big batch of triangles and coarse rasterizes them to tiles. Then each tile is rasterized as a whole. This is perfect for tile based compression methods. Bigger rasterizer tiles are aligned to smaller compression tiles, meaning that most compression tiles only get written back once they are fully filled. This is a big improvement.
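As a very rough software analogy of that coarse binning step (purely conceptual HLSL; the hardware doesn't work like this in detail, and the tile size, buffer names and screen dimensions below are all made up for illustration):

Code:
cbuffer BinConstants : register(b0)
{
    uint triangleCount;
};

StructuredBuffer<float2> screenPos : register(t0);    // post-transform screen xy, 3 per triangle
RWStructuredBuffer<uint> tileTriCount : register(u0); // one counter per screen tile

static const uint TILE_SIZE = 32;  // arbitrary tile size for this sketch
static const uint TILES_X = 60;    // 1920 / 32
static const uint TILES_Y = 34;    // ceil(1080 / 32)

[numthreads(64, 1, 1)]
void CoarseBinCS(uint tri : SV_DispatchThreadID)
{
    if (tri >= triangleCount)
        return;

    float2 a = screenPos[tri * 3 + 0];
    float2 b = screenPos[tri * 3 + 1];
    float2 c = screenPos[tri * 3 + 2];

    // Conservative screen-space bounding box of the triangle.
    float2 lo = max(min(a, min(b, c)), 0.0f);
    float2 hi = max(max(a, max(b, c)), 0.0f);

    uint2 tileLo = uint2(lo) / TILE_SIZE;
    uint2 tileHi = min(uint2(hi) / TILE_SIZE, uint2(TILES_X - 1, TILES_Y - 1));

    // Count how many triangles of the batch touch each tile; a real binner
    // would also record the triangle indices so that each tile can later be
    // rasterized as one batch, filling its compression tiles before writeback.
    for (uint y = tileLo.y; y <= tileHi.y; ++y)
        for (uint x = tileLo.x; x <= tileHi.x; ++x)
            InterlockedAdd(tileTriCount[y * TILES_X + x], 1);
}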

We don't know whether Nvidia is also doing any pixel occlusion culling in the tiled rasterizer (render depth of the whole tile before running any pixel shaders). This would be another big gain, but would require lots of on-chip storage to store (post VS) vertex attributes (**). Seems that someone needs to write a test case. Super heavy pixel shader, N full screen triangles (N = a few hundred) with increasing depth, compare front-to-back vs back-to-front render times with z-buffer enabled.
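A minimal sketch of the pixel-shader half of that test (hypothetical HLSL; the iteration count and the math inside the loop are arbitrary, they just make the shader expensive). Draw a few hundred full-screen triangles with this bound, once front-to-back and once back-to-front with the depth test enabled, and compare GPU times:

Code:
cbuffer TestConstants : register(b0)
{
    uint iterations; // e.g. 512, just enough to make the pass ALU-bound
};

float4 HeavyPS(float4 pos : SV_Position) : SV_Target
{
    float acc = 0.0f;
    // Long dependent ALU chain: every pixel the hardware can reject before
    // shading (early Z, or tile-level occlusion culling if it exists) saves
    // a measurable amount of time.
    [loop]
    for (uint i = 0; i < iterations; ++i)
    {
        acc = sin(acc + pos.x * 0.001f) * cos(acc + pos.y * 0.001f) + 0.5f;
    }
    return float4(acc, acc, acc, 1.0f);
}

If the back-to-front order ended up nearly as fast as front-to-back, that would suggest shading is being deferred until the tile's visibility is resolved.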

(*) A single pixel change likely changes most of the tile data, as most deltas will change. Also lossless compression tends to "randomize" all data even from a minor change.

(**) Also, they could split the vertex shader in two. The first, depth-only shader would include only the instructions that affect SV_Position. The second vertex shader would be the full shader (including the SV_Position math again). This would somewhat increase the vertex shader cost (worst case 2x), but wouldn't require extra memory.
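To make the footnote concrete, a hedged HLSL sketch of the split (the constant buffer and attribute layout are illustrative only):

Code:
cbuffer Transforms : register(b0)
{
    float4x4 worldViewProj;
    float4x4 world;
};

// Hypothetical position-only vertex shader: just the instructions that feed
// SV_Position, usable for binning / depth-only work with no attribute storage.
float4 PositionOnlyVS(float3 posOS : POSITION) : SV_Position
{
    return mul(float4(posOS, 1.0f), worldViewProj);
}

// Full vertex shader: repeats the SV_Position math and adds the attribute
// work, so nothing has to be kept on chip between the two phases
// (worst case roughly 2x vertex ALU cost, as noted above).
struct VSOut
{
    float4 pos      : SV_Position;
    float3 normalWS : NORMAL;
    float2 uv       : TEXCOORD0;
};

VSOut FullVS(float3 posOS : POSITION, float3 normalOS : NORMAL, float2 uv : TEXCOORD0)
{
    VSOut o;
    o.pos      = mul(float4(posOS, 1.0f), worldViewProj);
    o.normalWS = mul(normalOS, (float3x3)world);
    o.uv       = uv;
    return o;
}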
 
Personally I like AMD GPUs. During the past year I have been writing almost purely compute shaders. I have been consistently amazed at how well the old Radeon 7970 fares against the GTX 980 in compute shaders. I gladly leave all the rasterization and geometry processing woes to others :)
 
I think the biggest problem has been (before Maxwell/Pascal and async compute) that AMD's compute units are not well utilized in most rendering workloads. There are too many different bottlenecks. Graham's GDC presentation (http://www.frostbite.com/2016/03/optimizing-the-graphics-pipeline-with-compute/) is a good example of this (includes occupancy charts).
NVidia GPUs would have behaved much the same, except with warps of 32, before Maxwell.

A lot of discussion once centred upon tessellation killing AMD. Partly the problem there is that AMD can't load-balance geometry (HS-TS-DS are stuck with the CU/SIMD for their lifetime). Similarly NVidia has been load balancing rasterisation across the GPU, something that AMD doesn't do. So before Maxwell there were other things that hurt GCN, but the lack of tile-binned geometry is a kind of permanent disadvantage as opposed to the prior "extreme tessellation" disadvantage. AMD's problem is more important now.

Async compute during rendering tasks gives AMD easy 20%-30% performance gains. Without bottlenecks in rendering hardware this 20%-30% could be used for the rendering task, helping all games, not just console games and PC games specially optimized for async compute using DX12 and Vulkan.
We see similar gains from clustered geometry/occlusion. Both of these types of gains benefit NVidia too - though the degrees of improvement can be argued. NVidia has less compute available for smarter algorithms.

I don't think AMD did that much excess work prior to Maxwell's tiled rasterizer. Nvidia had delta color compression earlier, and it was more efficient, so that cost AMD a bit of bandwidth, but nothing dramatic. Nvidia's warp size of 32 (vs 64) also slightly reduces work in code with non-coherent branches (but finer granularity adds scheduling hardware cost).
I wonder how many pixel shaders with incoherent branching are in production? I bet they're rare.

Maxwell's tiled rasterizer however is a big problem, unless AMD adopts similar technology. Both depth compression and delta color compression are based on tiles. Partial tile updates cost bandwidth. If a single pixel of a tile is touched, the whole tile must be loaded back to the ROP cache, uncompressed, then compressed again and written back as a whole (*) when it gets kicked out of the ROP cache. Nvidia's tiled rasterizer collects a big batch of triangles and coarse rasterizes them to tiles. Then each tile is rasterized as a whole. This is perfect for tile based compression methods. Bigger rasterizer tiles are aligned to smaller compression tiles, meaning that most compression tiles only get written back once they are fully filled. This is a big improvement.

We don't know whether Nvidia is also doing any pixel occlusion culling in the tiled rasterizer (render depth of the whole tile before running any pixel shaders). This would be another big gain, but would require lots of on-chip storage to store (post VS) vertex attributes (**). Seems that someone needs to write a test case. Super heavy pixel shader, N full screen triangles (N = a few hundred) with increasing depth, compare front-to-back vs back-to-front render times with z-buffer enabled.
I was under the impression that the tests already discussed indicated that vertex attribute count affected tile size.

(*) A single pixel change likely changes most of the tile data, as most deltas will change. Also lossless compression tends to "randomize" all data even from a minor change.
NVidia's tiling enables a tightly-bound ROP-MC architecture which then means that render target cache lines only ever visit a single ROP unit. In AMD the ROPs are almost continuously sharing cache lines amongst themselves (via L2/off-chip memory). On its own, just the work of shifting a single cache line to all of the ROPs over the lifetime of a frame (over and over again in a random walk) is going to cost power if nothing else.

When AMD frees itself of the shader engine shackles (maximum of 4, maximum of 16 CUs per) perhaps it'll tie ROPs to MCs and do NVidia style GPU-wide load balancing of tessellation and rasterisation and tile-binned geometry. It's a hell of a lot of catching up to do. It's arguably the most that either AMD or NVidia has been behind, since the GPU that shall not be named.

(**) Also, they could split the vertex shader in two. The first, depth-only shader would include only the instructions that affect SV_Position. The second vertex shader would be the full shader (including the SV_Position math again). This would somewhat increase the vertex shader cost (worst case 2x), but wouldn't require extra memory.
If we take the view that it's impossible to get bottlenecked on VS then 2x cost in the worst case is prolly going to go unnoticed...
 
Perhaps I've misunderstood and used terms incorrectly.

I'd filed tile-based rasterisation under improvements to the ROPs, but it looks like the magic happens in a separate block of the hardware, when working out which groups of fragments to work on concurrently based on post-transform geometry held in a cache.

I need to sit down and go through this. Again.


It has more implications than that, as Jawed and Sebbbi explained.
 
A lot of discussion once centred upon tessellation killing AMD. Partly the problem there is that AMD can't load-balance geometry (HS-TS-DS are stuck with the CU/SIMD for their lifetime).
This is true for GCN 1.0 and 1.1. There are clever hacks to improve the situation, but it's not pretty. I thought GCN 1.2 fixed the VS/DS/HS load balancing issue. Haven't programmed GCN 1.2 myself, so I can't validate this.
I wonder how many pixel shaders with incoherent branching are in production? I bet they're rare.
Many engines still use full-screen pixel shader passes (instead of compute shaders) for screen space AO, post-process AA (FXAA, morphological), etc. We used branching a lot in Xbox 360 pixel shaders (mostly full-screen passes, including tiled deferred lighting). The PS3 was a lot worse at dynamic branching.
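For what it's worth, that kind of full-screen-pass branching looks roughly like this (a made-up AO-style example, not from any particular engine):

Code:
Texture2D<float> depthTex : register(t0);

float4 FullScreenPassPS(float4 pos : SV_Position) : SV_Target
{
    float depth = depthTex.Load(int3(pos.xy, 0));

    // Dynamic branch: sky pixels (at the far plane) skip the expensive part.
    // Cheap as long as whole 64-wide wavefronts / 32-wide warps take the same
    // path; the wider wavefront just makes divergence slightly more likely.
    [branch]
    if (depth >= 1.0f)
        return float4(1.0f, 1.0f, 1.0f, 1.0f);

    float occlusion = 0.0f;
    // ...expensive AO / post-process sampling loop would go here...
    return float4(occlusion.xxx, 1.0f);
}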
Similarly NVidia has been load balancing rasterisation across the GPU, something that AMD doesn't do.
Vertex shader outputs are exported (not stored to LDS). One CU could transform a triangle and then all CUs could be running pixel shader instances originating from that single triangle. This is a common case with full-screen triangles, and it utilizes the GPU perfectly. I am not familiar with pixel shader load balancing issues similar to the ones we have with VS/DS/HS. If you could point me to documentation or discussion about these issues I would be happy.
In AMD the ROPs are almost continuously sharing cache lines amongst themselves (via L2/off-chip memory). On its own, just the work of shifting a single cache line to all of the ROPs over the lifetime of a frame (over and over again in a random walk) is going to cost power if nothing else.
AMD's ROP caches are directly connected to the memory controller. They are not backed by the L2 cache. No cache coherency either (you always need to flush the ROP caches before sampling a render target as a texture). ROPs are also completely separate from the CUs. As far as I know, ROP cache lines (spatial tiles) bounce back to memory when evicted (just like any cache). AMD's ROP caches are super small, so cache thrashing occurs all the time. They're not designed to handle overdraw.

My own experiments clearly show that GCN ROP caches work very well when repeatedly rendering geometry to the same area of the screen. As long as the area is small enough, all data stays nicely in the ROP caches. Test case: render tens of thousands of 64x64 pixel alpha blended quads on top of each other. Use RGBA16f (full rate export with alpha blending). Result: you get the maximum ROP rate. RGBA16f at the full ROP rate would require almost double the memory bandwidth that is available.
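A hedged sketch of that test: the pixel shader itself is trivial on purpose (blending is enabled in the pipeline state on the CPU side), and the bandwidth figures in the comment assume a reference HD 7970, so treat them as back-of-envelope numbers only.

Code:
// Draw tens of thousands of overlapping 64x64 quads into an RGBA16F render
// target with alpha blending enabled; the interesting work all happens in
// the ROPs and their caches, not in this shader.
float4 BlendQuadPS(float4 pos : SV_Position) : SV_Target
{
    return float4(0.25f, 0.5f, 0.75f, 0.5f);
}

// Back-of-envelope (assuming HD 7970: 32 ROPs @ ~925 MHz, 264 GB/s):
// RGBA16F is 8 bytes/pixel and blending reads + writes it, so sustaining
// the full export rate would need about 32 * 0.925e9 * 16 B = ~474 GB/s,
// nearly double the available bandwidth. Hitting the maximum ROP rate
// therefore implies the quad's tiles are staying resident in the ROP caches.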
 
NVidia's tiling enables a tightly-bound ROP-MC architecture which then means that render target cache lines only ever visit a single ROP unit. In AMD the ROPs are almost continuously sharing cache lines amongst themselves (via L2/off-chip memory).
The ROPs do not share cache lines amongst themselves via L2, since the ROPs aren't included in the "normal" cache hierarchy. Therefore, any spills/reloads have to be to off-chip memory (that said, the ROP cache is probably large enough for several ROP tiles per ROP).
 
Sorry guys, you're right, render targets aren't cached through L2 in GCN. The point still stands: render target tiles will generally appear in all shader engines.
 
I think they are for nV cards though right? L2 caching of RT's I mean.


@sebbbi,
the PS3 used a GeForce 7-series chip, so yeah, dynamic branching would be an issue on that class of GPU.
 
I think they are for nV cards though right? L2 caching of RT's I mean.
Yes, since Fermi (though I don't know if there's any kind of reservation mechanism, so that ROPs wouldn't use up all of L2 or things like that).
I thought maybe AMD would "fix" this with Tonga - but nope, it didn't happen. Then I thought surely it was going to be changed with Polaris - wrong again (although in retrospect, Polaris only has very minimal architectural changes, so there was no hope of this happening to begin with).
So I predict Vega is going to change this :).
 
LOL well I think we might have to wait a bit longer. Just can't see any major changes in Vega :/, really doesn't make much sense either if they want to consolidate their console development with pc development, but ya never know!
 
A lot of discussion once centred upon tessellation killing AMD. Partly the problem there is that AMD can't load-balance geometry (HS-TS-DS are stuck with the CU/SIMD for their lifetime). Similarly NVidia has been load balancing rasterisation across the GPU, something that AMD doesn't do. So before Maxwell there were other things that hurt GCN, but the lack of tile-binned geometry is a kind of permanent disadvantage as opposed to the prior "extreme tessellation" disadvantage. AMD's problem is more important now.
There are some misunderstandings here. No GCN part has required the DS to execute on the same CU as the HS, though it sometimes does. Also, AMD does load balance rasterization across the GPU. Probably in a similar fashion to Nvidia at a high level.

I suspect people give far too much credit to Nvidia's tiled rendering, though without a benchmark that can disable this feature there's no way to prove anything. Low voltage and high clock speeds are the primary weapons of Maxwell and Pascal.
 
I suspect people give far too much credit to Nvidia's tiled rendering, though without a benchmark that can disable this feature there's no way to prove anything. Low voltage and high clock speeds are the primary weapons of Maxwell and Pascal.
Not saying this isn't the case, but you can't deny that Paxwell is more bandwidth-efficient by quite a bit (well, for rendering tasks generally - I'm sure there are things where you'd really need more raw bandwidth). An RX 480 with only the bandwidth of a GTX 1060 would tank quite a bit (and the GTX 1060 already has generous bandwidth compared to the GTX 1070/1080). My favorite example for this is always the ludicrous no-bandwidth GM108 with 64-bit DDR3 (albeit there are now GDDR5 versions available a bit more widely), which is doing sort of OK, whereas AMD's discrete solutions using 64-bit DDR3 (and there are about 200 of them by name, using 3 different chips) are simply no match. Kind of a pity, since bandwidth efficiency is super important for APUs too, where you can't just as easily tack on a wider memory interface or use more esoteric memory.
 
I suspect people give far too much credit to Nvidia's tiled rendering, though without a benchmark that can disable this feature there's no way to prove anything. Low voltage and high clock speeds are the primary weapons of Maxwell and Pascal.
I'd have to agree with this. The tiled rasterization likely helps, but there should be ways to overcome it and it would be situational. We'd be seeing far larger performance gaps if it made that large of a difference. If AMD had the cash to fine-tune critical paths and were clocked significantly higher, would there really be that much of a difference?

Doom might present an interesting test case for this with all their render targets. Has anyone else tried presenting the scene as a bunch of tiles with a low level API or bundles? ESRAM would have already mimicked this to some degree. Present the scene as one giant tile versus 64(?) smaller tiles in separate draws. That should mimic a minimal amount of cache eviction. Memory clocks could also be adjusted to even out the bandwidth a bit. I suspect that core clock, and therefore cache bandwidth, likely plays a significant role there. That might provide some insight into the exchange of cache bandwidth versus binning costs.
 