Performance evolution between GCN versions - Tahiti vs. Tonga vs. Polaris 10 at same clocks and CUs

Discussion in 'Architecture and Products' started by Alessio1989, Sep 18, 2016.

  1. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,780
    Likes Received:
    4,431
    I didn't suggest developers create rendering pipelines that purposely cater to one IHV (though I wouldn't put too much trust in Epic in that regard).
    Gameworks itself purposely caters to one IHV (duh) and some developers use it. I think we can both agree on that.


    Of course not. Not even the ones using Gameworks, because it brings them tools that accelerate development and ultimately save them time and money.
    I don't doubt they know that Gameworks carries the caveat of favoring one IHV over another, but when you have a budget and a deadline you have to weigh everything together and make those decisions. It's perfectly understandable.



    It's most probably true that GCN chips, with their current geometry performance, are better balanced now than they were in 2012. Though "balance" is a moving target that depends on each engine, each game and even each scene. As an "extreme" example, how much geometry performance does The Tomorrow Children need?

    Maybe the geometry performance progression in GCN is purely for balance reasons, or maybe it's also because some of nvidia's "flagship effects" in their tools (e.g. HairWorks and god rays) are geometry intensive.
    BTW, Gameworks is just the name they came up with for a collection of tools, some of which had been pushed through TWIMTBP for years.
    I'm pretty sure AMD engineers were aware that nvidia would try to push their geometry performance through their own tools ever since Kepler came out in 2012, just like AMD themselves did with their compute advantage using TressFX, for example.

    Though maybe this is a subject for another thread?
     
    #21 ToTTenTranz, Sep 20, 2016
    Last edited: Sep 20, 2016
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Maxwell and Pascal send less work to the ROPs, so in games it's generally impossible to say that ROP throughput is the reason for performance differences. Delta colour compression is another factor. The three things go together beautifully in NVidia's hardware and hide ROP-specific aspects of game performance. Synthetic tests definitely show many pure fillrate advantages for NVidia (but certainly not a clean sweep):

    http://www.hardware.fr/articles/952-9/performances-theoriques-pixels.html

    ROP performance has disappeared into a black box for games.

    I won't dispute that AMD needs to catch up; I just don't think it's the ROPs (quantity or architecture) that are the problem. AMD is doing too much work that gets discarded, which wastes compute, ROPs, bandwidth and power.
     
    Razor1 and function like this.
  3. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,135
    Likes Received:
    2,248
    Location:
    Wrong thread
    This sounds intriguing - I don't understand a great deal about what's going on inside these things beyond the large block diagrams you get with reviews. Could you point me in the direction of some more info about this?
     
  4. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    I think Jawed is talking about the type of rasterization nV cards have been doing since Maxwell (tile-based).
     
  5. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,135
    Likes Received:
    2,248
    Location:
    Wrong thread
    He mentioned things that weren't ROP related - seems that there's some interesting stuff happening elsewhere too!
     
    Razor1 likes this.
  6. function

    function None functional
    Legend Veteran

    Joined:
    Mar 27, 2003
    Messages:
    5,135
    Likes Received:
    2,248
    Location:
    Wrong thread
    Perhaps I've misunderstood and used terms incorrectly.

    I'd filed tile-based rasterisation under improvements to the ROPs, but it looks like the magic happens in a separate block of the hardware, when working out which groups of fragments to work on concurrently based on post-transform geometry held in a cache.

    I need to sit down and go through this. Again.
     
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    I think the biggest problem has been (before Maxwell/Pascal and async compute) that AMD's compute units are not well utilized in most rendering workloads. There are too many different bottlenecks. Graham's GDC presentation (http://www.frostbite.com/2016/03/optimizing-the-graphics-pipeline-with-compute/) is a good example of this (it includes occupancy charts). Async compute during rendering tasks gives AMD an easy 20%-30% performance gain. Without bottlenecks in the rendering hardware, that 20%-30% could be used for the rendering task itself, helping all games, not just console games and PC games specifically optimized for async compute using DX12 and Vulkan.

    I don't think AMD did that much excess work prior to Maxwell's tiled rasterizer. Nvidia had delta color compression earlier, and it was more efficient, so that cost AMD a bit of bandwidth, but nothing dramatic. Nvidia's warp size of 32 (vs 64) also slightly reduces work in code with non-coherent branches (but the finer granularity adds scheduling hardware cost). Maxwell's tiled rasterizer, however, is a big problem unless AMD adopts similar technology. Both depth compression and delta color compression are based on tiles. Partial tile updates cost bandwidth. If a single pixel of a tile is touched, the whole tile must be loaded back into the ROP cache, uncompressed, then compressed again and written back as a whole (*) when it gets kicked out of the ROP cache. Nvidia's tiled rasterizer collects a big batch of triangles and coarse rasterizes them to tiles. Then each tile is rasterized as a whole. This is perfect for tile-based compression methods. The bigger rasterizer tiles are aligned to the smaller compression tiles, meaning that most compression tiles are written back just once, after the tile has been completely filled. This is a big improvement.
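    Roughly, the coarse binning step could look something like the following CPU-side sketch (made-up screen/tile sizes and a tiny triangle batch, purely for illustration; not NVidia's actual implementation):

    Code:
    // Coarse binning of a batch of triangles into screen tiles. Fine
    // rasterization then proceeds tile by tile, so each compression tile
    // inside a raster tile is ideally written back to memory only once,
    // after every triangle touching it has been resolved.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Tri { float x[3], y[3]; };

    constexpr int SCREEN_W = 1920, SCREEN_H = 1080;
    constexpr int TILE    = 64;   // assumed raster tile size, for illustration
    constexpr int TILES_X = (SCREEN_W + TILE - 1) / TILE;
    constexpr int TILES_Y = (SCREEN_H + TILE - 1) / TILE;

    int main() {
        // A tiny batch; a real binner buffers a large batch of post-VS triangles.
        std::vector<Tri> batch = {
            {{10, 300, 40}, {20, 60, 500}},
            {{900, 1800, 1200}, {100, 300, 900}},
        };

        // Bin each triangle into every tile its bounding box overlaps
        // (a conservative stand-in for real coarse rasterization).
        std::vector<std::vector<int>> bins(TILES_X * TILES_Y);
        for (int i = 0; i < (int)batch.size(); ++i) {
            const Tri& t = batch[i];
            float minX = std::min({t.x[0], t.x[1], t.x[2]});
            float maxX = std::max({t.x[0], t.x[1], t.x[2]});
            float minY = std::min({t.y[0], t.y[1], t.y[2]});
            float maxY = std::max({t.y[0], t.y[1], t.y[2]});
            int tx0 = std::clamp((int)minX / TILE, 0, TILES_X - 1);
            int tx1 = std::clamp((int)maxX / TILE, 0, TILES_X - 1);
            int ty0 = std::clamp((int)minY / TILE, 0, TILES_Y - 1);
            int ty1 = std::clamp((int)maxY / TILE, 0, TILES_Y - 1);
            for (int ty = ty0; ty <= ty1; ++ty)
                for (int tx = tx0; tx <= tx1; ++tx)
                    bins[ty * TILES_X + tx].push_back(i);
        }

        // Each non-empty bin would now be rasterized and shaded as a whole.
        for (int ty = 0; ty < TILES_Y; ++ty)
            for (int tx = 0; tx < TILES_X; ++tx)
                if (!bins[ty * TILES_X + tx].empty())
                    std::printf("tile (%d,%d): %zu triangle(s)\n",
                                tx, ty, bins[ty * TILES_X + tx].size());
    }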

    We don't know whether Nvidia is also doing any pixel occlusion culling in the tiled rasterizer (rendering the depth of the whole tile before running any pixel shaders). This would be another big gain, but would require lots of on-chip storage for the (post-VS) vertex attributes (**). Seems that someone needs to write a test case: a super heavy pixel shader, N full-screen triangles (N = a few hundred) with increasing depth, comparing front-to-back vs back-to-front render times with the z-buffer enabled.

    (*) A single pixel change likely changes most of the tile data, as most deltas will change. Also lossless compression tends to "randomize" all data even from a minor change.

    (**) Alternatively, they could split the vertex shader in two: the first, depth-only shader would include only the instructions that affect SV_Position; the second would be the full vertex shader (including the SV_Position math again). This would somewhat increase the vertex shader cost (worst case 2x), but wouldn't require extra memory.
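    A rough CPU-side model of that test case, assuming a conventional early depth test (LESS) and taking the heavy pixel shader's cost as proportional to the fragments that survive it. If the hardware also culled occluded pixels per binned batch, the back-to-front count would shrink toward the front-to-back one:

    Code:
    // N full-screen layers with strictly increasing depth, submitted
    // front-to-back vs back-to-front against a z-buffer. Counts fragments
    // that pass the depth test, i.e. the work a heavy pixel shader would do.
    #include <cstdio>
    #include <vector>

    constexpr int W = 1920, H = 1080, LAYERS = 256;

    // Returns the number of fragments that pass the LESS depth test.
    static long long shadeCount(bool frontToBack) {
        std::vector<float> zbuf(W * H, 1.0f);   // cleared to the far plane
        long long shaded = 0;
        for (int i = 0; i < LAYERS; ++i) {
            // Layer 0 is nearest, layer LAYERS-1 is farthest.
            int layer = frontToBack ? i : (LAYERS - 1 - i);
            float z = (layer + 0.5f) / LAYERS;
            for (int p = 0; p < W * H; ++p) {
                if (z < zbuf[p]) {              // early depth test
                    zbuf[p] = z;
                    ++shaded;                   // pixel shader would run here
                }
            }
        }
        return shaded;
    }

    int main() {
        long long ftb = shadeCount(true);
        long long btf = shadeCount(false);
        std::printf("front-to-back: %lld shaded fragments\n", ftb);
        std::printf("back-to-front: %lld shaded fragments (%.0fx more)\n",
                    btf, (double)btf / (double)ftb);
    }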
     
    #28 sebbbi, Sep 20, 2016
    Last edited: Sep 20, 2016
    Alexko, homerdog, fellix and 7 others like this.
  8. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Personally I like AMD GPUs. During the past year I have been writing almost purely compute shaders. I have been consistently amazed how well the old Radeon 7970 fares against GTX 980 in compute shaders. I gladly leave all the rasterization and geometry processing woes to others :)
     
    Alexko, entity279, fellix and 2 others like this.
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    NVidia GPUs would have behaved much the same, except with warps of 32, before Maxwell.

    A lot of discussion once centred upon tessellation killing AMD. Part of the problem there is that AMD can't load-balance geometry (HS-TS-DS are stuck with their CU/SIMD for their lifetime). Similarly, NVidia has been load-balancing rasterisation across the GPU, something that AMD doesn't do. So before Maxwell there were other things that hurt GCN, but the lack of tile-binned geometry is a kind of permanent disadvantage, as opposed to the prior "extreme tessellation" disadvantage. AMD's problem is more pressing now.

    We see similar gains from clustered geometry/occlusion. Both of these types of gains benefit NVidia too - though the degrees of improvement can be argued. NVidia has less compute available for smarter algorithms.

    I wonder how many pixel shaders with incoherent branching are in production? I bet they're rare.

    I was under the impression that the tests already discussed indicated that vertex attribute count affected tile size.

    NVidia's tiling enables a tightly-bound ROP-MC architecture which then means that render target cache lines only ever visit a single ROP unit. In AMD the ROPs are almost continuously sharing cache lines amongst themselves (via L2/off-chip memory). On its own, just the work of shifting a single cache line to all of the ROPs over the lifetime of a frame (over and over again in a random walk) is going to cost power if nothing else.

    When AMD frees itself of the shader engine shackles (a maximum of 4 shader engines, with a maximum of 16 CUs per engine), perhaps it'll tie ROPs to MCs and do NVidia-style GPU-wide load balancing of tessellation, rasterisation and tile-binned geometry. It's a hell of a lot of catching up to do. It's arguably the furthest either AMD or NVidia has been behind since the GPU that shall not be named.

    If we take the view that it's impossible to get bottlenecked on VS, then a 2x cost in the worst case is probably going to go unnoticed...
     
    pharma and Razor1 like this.
  10. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    It has more implications than that, as Jawed and Sebbbi explained.
     
  11. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    This is true for GCN 1.0 and 1.1. There are clever hacks to improve the situation, but it's not pretty. I thought GCN 1.2 fixed the VS/DS/HS load balancing issue. Haven't programmed GCN 1.2 myself, so I can't validate this.
    Many engines still use full-screen pixel shader passes (instead of compute shaders) for screen-space AO, post-process AA (FXAA, morphological), etc. We used branching a lot in Xbox 360 pixel shaders (mostly full-screen passes, including tiled deferred lighting). The PS3 was a lot worse at dynamic branching.
    Vertex shader outputs are exported (not stored to LDS). One CU could transform a triangle and then all CUs could be running pixel shader instances originating from that single triangle. This is the common case with full-screen triangles, and it utilizes the GPU perfectly. I am not aware of pixel shader load-balancing issues similar to the ones we have with VS/DS/HS. If you could point me to documentation or discussion about these issues I would be grateful.
    AMD's ROP caches are directly connected to the memory controller. They are not backed by the L2 cache. There's no cache coherency either (you always need to flush the ROP caches before sampling a render target as a texture). The ROPs are also completely separate from the CUs. As far as I know, ROP cache lines (spatial tiles) bounce back to memory when evicted (just like in any cache). AMD's ROP caches are super small, so cache thrashing occurs all the time. They're not designed to handle overdraw.

    My own experiments clearly show that GCN ROP caches work very well when repeatedly rendering geometry to the same area of the screen. As long as the area is small enough, all data stays nicely in the ROP caches. Test case: render tens of thousands of 64x64 pixel alpha-blended quads on top of each other, using RGBA16F (full-rate export with alpha blending). Result: you get the maximum ROP rate, even though RGBA16F at full ROP rate would require almost double the memory bandwidth that is available.
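    Back-of-the-envelope numbers for that last point, assuming Tahiti-like figures (32 ROPs at 925 MHz, 264 GB/s board bandwidth; these values are assumptions for illustration):

    Code:
    // Required DRAM bandwidth if full-rate RGBA16F alpha blending actually
    // went through memory, vs. what the board provides. The ROP count,
    // clock and bandwidth below are assumed Tahiti-like figures.
    #include <cstdio>

    int main() {
        const double rops      = 32;
        const double clock_hz  = 925e6;
        const double bytes_pp  = 8.0;     // RGBA16F: 4 x 16-bit channels
        const double rw_factor = 2.0;     // blending reads and writes the destination
        const double dram_bps  = 264e9;   // board memory bandwidth, bytes/s

        const double required_bps = rops * clock_hz * bytes_pp * rw_factor;
        std::printf("required %.0f GB/s vs available %.0f GB/s (%.1fx)\n",
                    required_bps / 1e9, dram_bps / 1e9, required_bps / dram_bps);
        // Sustaining full ROP rate therefore implies the working set is
        // staying in the ROP caches instead of round-tripping through DRAM.
    }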
     
    Alessio1989 likes this.
  12. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    The ROPs do not share cache lines amongst themselves via L2, since the ROPs aren't included in the "normal" cache hierarchy. Therefore, any spills/reloads have to be to off-chip memory (that said, the ROP cache is probably large enough for several ROP tiles per ROP).
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Sorry guys, you're right, render targets aren't cached through L2 in GCN. The point still stands: render target tiles will generally appear in all shader engines.
     
    Razor1 likes this.
  14. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    I think they are for nV cards though, right? L2 caching of RTs, I mean.


    @sebbbi,
    the PS3 used a GeForce 7-series chip, so yeah, dynamic branching would be an issue on that class of GPU.
     
  15. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    Yes, since Fermi (though I don't know if there's any kind of reservation mechanism, so that ROPs wouldn't use up all of L2 or things like that).
    I thought maybe AMD would "fix" this with Tonga - but nope, it didn't happen. Then I thought surely it was going to change with Polaris - wrong again (though in retrospect, Polaris has only very minimal architectural changes, so there was no hope of this happening to begin with).
    So I predict Vega is going to change this :).
     
    Razor1 likes this.
  16. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    LOL, well, I think we might have to wait a bit longer. I just can't see any major changes in Vega :/. It really doesn't make much sense either if they want to consolidate their console development with PC development, but ya never know!
     
  17. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,432
    Likes Received:
    261
    There are some misunderstandings here. No GCN part has required the DS to execute on the same CU as the HS, though it sometimes does. Also, AMD does load-balance rasterization across the GPU, probably in a similar fashion to Nvidia at a high level.

    I suspect people give far too much credit to Nvidia's tiled rendering, though without a benchmark that can disable this feature there's no way to prove anything. Low voltage and high clock speeds are the primary weapons of Maxwell and Pascal.
     
  18. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,012
    Likes Received:
    112
    Not saying this isn't the case, but you can't deny that Paxwell is more bandwidth efficient by quite a bit - well, for rendering tasks generally; I'm sure there are things where you'd really need more raw bandwidth. An RX 480 with only the bandwidth of a GTX 1060 would tank quite a bit (and the GTX 1060 already has generous bandwidth compared to the GTX 1070/1080). My favorite example for this is always the ludicrously bandwidth-starved GM108 with 64-bit DDR3 (although GDDR5 versions are now actually available a bit more widely), which does sort of OK, whereas AMD's discrete solutions using 64-bit DDR3 (and there are about 200 of them by name, using 3 different chips) are simply no match. That's kind of a pity, since bandwidth efficiency is super important for APUs too, where you can't just as easily tack on a wider memory interface or use more esoteric memory.
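    For reference, the raw numbers behind that comparison, assuming reference board memory specs (my figures, for illustration):

    Code:
    // Peak memory bandwidth = bus width in bytes x effective data rate per pin.
    // Board specs below are assumed reference configurations.
    #include <cstdio>

    struct Board { const char* name; double bus_bits; double gbps_per_pin; };

    int main() {
        const Board boards[] = {
            {"RX 480 (8 GB)", 256, 8.0},
            {"GTX 1060",      192, 8.0},
            {"GTX 1070",      256, 8.0},
            {"GTX 1080",      256, 10.0},  // GDDR5X
        };
        for (const Board& b : boards)
            std::printf("%-14s %5.0f GB/s\n", b.name,
                        (b.bus_bits / 8.0) * b.gbps_per_pin);
        // The RX 480 has about a third more raw bandwidth than the GTX 1060
        // it competes with, which is the gap being pointed at above.
    }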
     
  19. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    I'd have to agree with this. The tiled rasterization likely helps, but there should be ways to overcome it, and the benefit would be situational. We'd be seeing far larger performance gaps if it made that large of a difference. If AMD had the cash to fine-tune critical paths and were clocked significantly higher, would there really be that much of a difference?

    Doom might present an interesting test case for this, with all of its render targets. Has anyone tried presenting the scene as a bunch of tiles with a low-level API or bundles? ESRAM would already have mimicked this to some degree. Render the scene as one giant tile versus 64(?) smaller tiles in separate draws; that should keep cache eviction to a minimum. Memory clocks could also be adjusted to even out the bandwidth a bit. I suspect that core clock, and therefore cache bandwidth, plays a significant role there. That might provide some insight into the trade-off between cache bandwidth and binning costs.
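    Something like the following rough sketch; setScissor() and drawScene() are hypothetical placeholders for whatever the real renderer/API provides (not real API calls), stubbed here so the skeleton compiles:

    Code:
    // Compare rendering the scene once over the whole screen against
    // replaying it per cell of an 8x8 grid, restricted by scissor rects.
    #include <cstdio>

    struct Rect { int x, y, w, h; };

    // Hypothetical stand-ins for the real renderer's calls.
    void setScissor(const Rect& r) { std::printf("scissor %d,%d %dx%d\n", r.x, r.y, r.w, r.h); }
    void drawScene() { /* the scene's draw calls would go here */ }

    constexpr int SCREEN_W = 1920, SCREEN_H = 1080, GRID = 8;

    // Baseline: the whole screen as one giant "tile".
    void drawWholeScreen() {
        setScissor({0, 0, SCREEN_W, SCREEN_H});
        drawScene();
    }

    // Variant: the scene replayed per grid cell, so each cell's render-target
    // tiles should stay resident in cache while it is drawn. Geometry work is
    // repeated GRID*GRID times, which is the "binning cost" side of the trade-off.
    void drawAsTiles() {
        const int tw = SCREEN_W / GRID, th = SCREEN_H / GRID;
        for (int ty = 0; ty < GRID; ++ty)
            for (int tx = 0; tx < GRID; ++tx) {
                setScissor({tx * tw, ty * th, tw, th});
                drawScene();
            }
    }

    int main() {
        drawWholeScreen();   // time this GPU pass...
        drawAsTiles();       // ...against this one
    }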
     