AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,484
    Likes Received:
    4,445
    Location:
    Well within 3d
    The table seems to imply bringing the RBE arrangement back in line with the shader engine count, versus shader arrays in RDNA1. Other elements, like the geometry units and L1 caches, were also arranged along those lines, so how that would be handled could be an area of change as well.

    I'm not sure I've seen the context around that term. I have seen driver references to RB+ modes, but those would be a much older concept than RDNA.

    Over the generations, the RBEs have moved from direct links to the memory controllers to being clients of the L2, then the L1, in addition to gaining compression. The relatively modest debut of the DSBR included some indications of bandwidth savings as well.
    There are optimizations for particle effects tuned to the tiny ROP caches that can lead to significant bandwidth amplification, and moving the RBEs inside larger cache hierarchies can give at least some additional bandwidth.

    For clarity, is the first statement about not needing SerDes carrying over to the description of the IFOP link?
    I thought IFOP still used SerDes, at 4x transfers per fabric clock.
    I'm hazy on whether this figure includes the controllers along with the PHY blocks, which may change the per-link area. Whether a GPU subsystem with more thrash-prone caches would prefer symmetric read/write bandwidth could also add area.

    The 16 seems to be inherent to the way the L2 slices serve as the coherence agents for the GPU's memory subsystem. A little less clear for RDNA/GCN is how the write path's complexity has changed. The RDNA L1 is listed as a read-only target, so how CU write clients are handled may add additional paths. One area I'm curious about is the RBEs and how their write traffic works with the L1, since AMD stated the RBEs were clients of the L1.

    (late edit: One thing I forgot to add is that in the 5/16 arrangement, each L1 can make 4 requests per clock, so the L2's slice activity isn't limited by L1 count.)

    One thing to evaluate at some point is what it has meant in the past that AMD's subsystem has maxed out at 16 texture channel caches, which are another term for the L2 slices. At least internal to the L2, per-clock bandwidth would seem to be constant between RDNA1 and RDNA2, barring a change in the L2 design. If the RBEs are L1 clients like they were in RDNA, what that means for L1 distribution in big Navi and the internal bandwidth situation could be interesting angles to investigate. A straightforward carry-over from RDNA1 would leave the metric of per-clock internal bandwidth the same across RDNA1 and RDNA2 implementations with 256-bit buses or wider.
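    The per-clock figures above can be sanity-checked with a little arithmetic. A minimal sketch, assuming the numbers floated in this thread (each L1 making 4 requests per clock of 64B each; neither the L1 count nor the request size is an official spec here):

    ```python
    # Back-of-the-envelope sketch (assumptions from this thread, not official
    # specs): aggregate per-clock L1->L2 transfer capacity, given some number
    # of L1/GL1 clients each able to issue 4 requests per clock of 64B each.
    REQS_PER_L1 = 4      # requests each L1 can make per clock (per the post above)
    BYTES_PER_REQ = 64   # assumed cache-line transfer size

    def l1_l2_bytes_per_clock(num_l1):
        """Aggregate L1->L2 bytes per clock for num_l1 L1 clients."""
        return num_l1 * REQS_PER_L1 * BYTES_PER_REQ

    print(l1_l2_bytes_per_clock(4))  # 1024 B/clock with 4 L1 clients
    print(l1_l2_bytes_per_clock(5))  # 1280 B/clock with the 5/16 arrangement
    ```

    Under these assumptions the L1 side can keep all 16 L2 slices busy every clock, which is the point of the late edit above.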
     
    #2521 3dilettante, Aug 10, 2020
    Last edited: Aug 10, 2020
  2. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    341
    Likes Received:
    283
    Yes.

    The floorplan in AMD's Zen 2 ISSCC deck does not have separate designations for the IFOP PHY and controller. So presumably it means PHY and controller combined.

    It is true that a GPU can have a vastly different access pattern that may require a different balance and/or provision for read-write traffic.

    I am not sure if it is inherent, since the number of L2 slices has always scaled alongside the number of memory channels. But in hindsight, it could be an overprovisioned, independent design parameter for parallelism in either the interconnect or the L2 cache itself, judging by the fact that Fiji has 32 HBM memory channels but still only 16 L2 slices.

    This perhaps also indicates that the L2 cache is unlikely to go off-chip, since it seems to have a role in enabling MLP: not only in, say, the effective # of MSHRs, but perhaps also in lowering the probability of hotspot routes by overprovisioning the interconnect.

    But it doesn't rule out another level of memory-side cache. :razz:

    The L0 writing through to the L2 did not change. The L1 is no-write-allocate.
     
    #2522 pTmdfx, Aug 10, 2020
    Last edited: Aug 10, 2020
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,484
    Likes Received:
    4,445
    Location:
    Well within 3d
    That's been an area where the higher end hasn't shown clear scaling.
    From the _rogame table, the number of texture channel caches tops out at 16 for multiple GPUs.

    AMD's RDNA whitepaper said 4 L2 slices per 64-bit memory controller, and the 4-stack HBM GPUs would have even more unsustainable crossbar dimensions if that constraint held.
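    To put a number on that crossbar concern, here is a quick hedged sketch. The 4-slices-per-64-bit ratio is from the whitepaper as quoted above; the bus widths are just illustrative of a Navi 10-class 256-bit part versus a Fiji-class 4096-bit HBM part:

    ```python
    # Illustrative arithmetic: how many L2 slices a GPU would need if the
    # "4 L2 slices per 64-bit memory controller" ratio from the RDNA
    # whitepaper held at every bus width. Not a claim about real hardware.
    def slices_if_ratio_held(bus_width_bits, slices_per_64b=4):
        """L2 slice count implied by a fixed slices-per-64-bit-controller ratio."""
        return (bus_width_bits // 64) * slices_per_64b

    print(slices_if_ratio_held(256))   # 16 slices: matches Navi 10
    print(slices_if_ratio_held(4096))  # 256 slices: a Fiji-class HBM bus
    ```

    A crossbar with 256 slice endpoints is the "unsustainable dimensions" problem: the 16-slice cap observed on the big HBM parts suggests the slice count is decoupled from the channel count at the high end.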

    For example, Fiji is indicated to have 16 in the following patch:
    https://people.freedesktop.org/~agd5f/0001-drm-amdgpu-update-Fiji-s-tiling-mode-table.patch
    There were some attempts at analyzing why a 4-stack HBM GPU showed areas of limited scaling over Hawaii, and one avenue was tests that may have isolated L2 bandwidth from memory controller bandwidth.

    Is the claim that reads went to 16/5 from 16/64, but writes did not consolidate or did not need to consolidate?
     
    Lightman likes this.
  4. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    341
    Likes Received:
    283
    Sigh, I figured I read the diagrams drastically wrong. The GL1 has 4 banks, and apparently the 4x 64B/clk L1-L2 figure applies to each GL1. It now makes perfect sense in how it simplifies the L1-L2 fabric design — it consolidates from one plane of 16/64 to four planes of 4/4 (presumably interleaved by the same lower channel select bits used by the GL1 bank select).

    Seems even more unlikely that any part of L1-L2 hierarchy goes off-chip then.
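    The consolidation described above can be sketched as a toy address-interleave model. The bit positions and interleave granularity below are illustrative assumptions, not disclosed values; the point is that if the GL1 bank select reuses the low bits of the L2 channel select, each bank only ever needs wiring to 4 of the 16 slices:

    ```python
    # Toy model of the banked L1-L2 fabric idea: the L2 slice is chosen by
    # low address bits, and the GL1 bank select reuses the lowest two of
    # those same bits, so each bank maps to a fixed 4-slice subset and one
    # 16-wide crossbar splits into four independent 4-wide planes.
    # Interleave granularity and bit positions are illustrative assumptions.
    LINE_BITS = 8        # assume 256B interleave granularity
    NUM_SLICES = 16
    NUM_GL1_BANKS = 4

    def l2_slice(addr):
        return (addr >> LINE_BITS) % NUM_SLICES

    def gl1_bank(addr):
        # lower channel-select bits reused as the GL1 bank select
        return (addr >> LINE_BITS) % NUM_GL1_BANKS

    # Every address lands on a (bank, slice) pair with slice % 4 == bank...
    for addr in range(0, NUM_SLICES * (1 << LINE_BITS), 1 << LINE_BITS):
        assert l2_slice(addr) % NUM_GL1_BANKS == gl1_bank(addr)

    # ...so bank b only needs wiring to slices {b, b+4, b+8, b+12}.
    planes = {b: [s for s in range(NUM_SLICES) if s % NUM_GL1_BANKS == b]
              for b in range(NUM_GL1_BANKS)}
    print(planes)
    ```

    Partitioning the fabric this way trades full slice reachability per bank for four smaller, independent crossbar planes, which is the simplification the post describes.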
     
    #2524 pTmdfx, Aug 10, 2020
    Last edited: Aug 10, 2020
    Lightman likes this.
  5. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,042
    Likes Received:
    441
    #2525 Bondrewd, Aug 17, 2020
    Last edited by a moderator: Aug 17, 2020
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,356
    Likes Received:
    3,349
    Location:
    Finland
    BRiT likes this.
  7. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,042
    Likes Received:
    441
    AMD would disclose shit themselves like they did the previous year if they wanted to.
     
  8. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    17,360
    Likes Received:
    17,855
    Or maybe, just maybe, this is part of their contract with AMD and it's entirely set for Microsoft to detail and release.
     
  9. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    812
    Likes Received:
    595
    [image]

    Looking at this, it looks pretty much the same as Navi 10, except there is more cache and more CUs. The shader engines still have the same structure. We can maybe assume RDNA2 cards will be very similar.
     
    disco_ and iroboto like this.
  10. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,042
    Likes Received:
    441
    They're strictly 5 WGPs per SA vs 7 for XSX.
     
  11. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    11,751
    Likes Received:
    2,728
  12. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,042
    Likes Received:
    441
  13. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    812
    Likes Received:
    595
    Maybe Microsoft will add ML resolution scaling to DirectX. Would be good to have something for it that is hardware agnostic at the API level. Since they are likely just using the shaders to do the inference without the use of tensor cores, it could probably work on any GPU, just depends on the performance they can get out of the hardware for inference while still running the game on the GPU.
     
  14. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    11,751
    Likes Received:
    2,728
    I agree with both of you, especially if we see it make it into games and it's hardware agnostic. FidelityFX is not a bad option, but DLSS is a much better one.
     
  15. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,372
    Likes Received:
    3,754
    The RT solution in RDNA 2 is shared with the texture units: you can either do texturing or ray tracing, not both at the same time, and this will negatively affect RT performance.
     
    xpea and disco_ like this.
  16. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,042
    Likes Received:
    441
    I don't think the tex units are busy during an egregiously long RTRT pass.
     
    disco_ and Krteq like this.
  17. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,717
    Likes Received:
    1,082
    Location:
    France
    Why can't we ? (True question, I don't get why on the diagram)
     
  18. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,356
    Likes Received:
    3,349
    Location:
    Finland
    Yes, if they wanted to. MS isn't doing this without AMD having known about it a long, long time ago (in an AMD or MS building far, far away)
     
    Lightman, disco_ and BRiT like this.
  19. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,042
    Likes Received:
    441
    Yeah, also it's not a full disclosure anyway.
    I expected less, though.
     
  20. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,059
    Likes Received:
    13,442
    Location:
    The North
    Here's the compare for those of you wanting it:

    [image]
     
    Pete and BRiT like this.