AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    Quoting from the RDNA whitepaper:
    Are you certain?
A store-and-forward pattern, with a proxy acting as a client to both L2 shards, is also applicable when partitioning the LLC; a monolithic crossbar above all L2 slices is not strictly required. That approach has even been proposed in that specific patent.

I have to disagree on that bandwidth figure. With only a simple phy, there is still a practical limit of 2 Gbit/s per pin. Targeting just the current performance level of 2 TB/s (full duplex!), that would require an >8000-pin interconnect per direction: more than twice as many pins as a GPU die with 4 stacks of HBM2 would need. Blow up the phy and you can save on the number of pins, but not for free. Either way, a combined 4 TB/s of bandwidth over an interconnect with strict latency requirements is, while not impossible, still far from a realistic option when minimizing die size is a goal. The only hurdle I'm not worried about is power consumption; if my math is correct, that should come in as low as ~20 W for such an interconnect.
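The pin-count arithmetic can be sketched as a back-of-the-envelope check; the 2 Gbit/s per-pin rate and the 1024 data pins per HBM2 stack are the assumptions here:

```python
import math

# Back-of-the-envelope pin count for a 2 TB/s (per direction) link at
# 2 Gbit/s per pin; numbers are illustrative assumptions from the post.
def pins_needed(bandwidth_bytes_per_s: float, pin_rate_bits_per_s: float) -> int:
    return math.ceil(bandwidth_bytes_per_s * 8 / pin_rate_bits_per_s)

per_direction = pins_needed(2e12, 2e9)   # pins needed each way
hbm2_data_pins = 4 * 1024                # four HBM2 stacks, 1024 data pins each
print(per_direction, per_direction / hbm2_data_pins)
```

At these assumed rates the link needs roughly twice the data pins of a 4-stack HBM2 interface, per direction.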
     
    #3241 Ext3h, Sep 21, 2020
    Last edited: Sep 21, 2020
    Lightman and BRiT like this.
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    There are four L1s in RX 5700 XT. One per shader array.

GPUs don't really care about latency, especially since the primary victim of latency in a chiplet design would be TEX (and MEM), and hiding the latency of TEX is easy. The ultimate side-effect is on the count of work items in flight (because they are waiting for data), which affects register file use and cache coherency.

    As for interconnect bandwidth, remember the ROPs will be using none! That's what tiled rasterisation (as already seen in RDNA) gives you. It should be years before 4TB/s is required.
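Little's law gives a quick way to see why latency hiding mostly costs work items in flight rather than performance; the cycle counts below are illustrative assumptions, not RDNA measurements:

```python
import math

# Little's law: requests in flight = latency x issue rate. A wavefront that
# issues one independent TEX request every `issue_interval_cycles` needs this
# many resident wavefronts to cover `latency_cycles` of memory latency.
def waves_to_hide(latency_cycles: int, issue_interval_cycles: int) -> int:
    return math.ceil(latency_cycles / issue_interval_cycles)

print(waves_to_hide(400, 40))   # hypothetical 400-cycle TEX latency
```

Double the latency and the required wavefront count (and hence register file pressure) doubles with it, which is the side-effect described above.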
     
    Lightman, Ext3h and trinibwoy like this.
  3. Nebuchadnezzar

    Legend

    Joined:
    Feb 10, 2002
    Messages:
    1,061
    Likes Received:
    328
    Location:
    Luxembourg
That sentence means the cache is shared across several dual-CUs, five of them in Navi 10, to be precise. The diagram is in the same paper.
     
    Ext3h likes this.
  4. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
GL1 in RDNA 1 is a quad-banked 128 KB cache. It absorbs all traffic (ingress from L2/egress to L2) from the 10 inner L0 V$, 5 I$, 5 K$ and other non-CU clients (e.g., RBEs and the Prim Unit). The bullet points for GL1 also tick some of the boxes the mechanism in this patent was set to solve, e.g., reducing L2 traffic and increasing effective cache capacity.

The patent doesn't seem to fit the RDNA paradigm, unless another level of CU caches is to be introduced, or GL1 is ditched in a way that compensates non-CU clients for the loss of cache capacity. :-?
     
    #3244 pTmdfx, Sep 21, 2020
    Last edited: Sep 21, 2020
  5. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    13,878
    Likes Received:
    4,724
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
I'm pretty sure they were, a couple of days ago; they're part of the same set from which JayzTwoCents released one photo.
     
    eastmen and CarstenS like this.
  7. Krteq

    Newcomer

    Joined:
    May 5, 2020
    Messages:
    149
    Likes Received:
    263
    Some info taken from ROCm 3.8, AMDGPU kernel driver and firmware


Navy Flounder seems to have 40 CUs and a 192-bit bus
     
  8. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
There are enough exceptions in L2 configurations throughout GCN's history that, in conjunction with RDNA 2 allegedly being clocked higher, these combinations are logical guesses IMO:

    1. 40 CUs with 12 L2 slices and 256-bit GDDR6 bus
    2. 80 CUs with 16 L2 slices and 384-bit GDDR6 bus

    :mrgreen:

The L2-L1 bandwidth loss due to fewer slices can be offset by higher clocks, and also by doubling the L2-L1 fabric bus width if desired (given that RDNA 1 introduced 128B cache lines).
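A rough sketch of that offset; the per-slice width and clock figures are hypothetical, and only the slices-times-width-times-clock relation is the point:

```python
# Aggregate L2->L1 bandwidth = slices x bytes per slice per clock x clock (GB/s).
def l2_bandwidth_gbps(slices: int, bytes_per_clk: int, clock_ghz: float) -> float:
    return slices * bytes_per_clk * clock_ghz

baseline = l2_bandwidth_gbps(16, 64, 1.9)          # 16 slices at an assumed 64B/clk
fewer_but_faster = l2_bandwidth_gbps(12, 64, 2.5)  # fewer slices, higher clock
wider_fabric = l2_bandwidth_gbps(12, 128, 2.5)     # doubled bus width (one 128B line)
print(baseline, fewer_but_faster, wider_fabric)
```

With these made-up numbers, 12 faster slices already roughly match 16 slower ones, and doubling the fabric width puts the smaller-slice configuration well ahead.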
     
    #3248 pTmdfx, Sep 22, 2020
    Last edited: Sep 22, 2020
    DegustatoR likes this.
  9. Cat Merc

    Newcomer

    Joined:
    May 14, 2017
    Messages:
    161
    Likes Received:
    179
Anyone have any idea what num_packer_per_sc is? It appears to be doubled from RDNA1.
     
  10. SimBy

    Regular

    Joined:
    Jun 21, 2008
    Messages:
    700
    Likes Received:
    391
    192-bit 40CUs
    256-bit 80CUs

    Yeah, no.
     
    Frenetic Pony and Picao84 like this.
  11. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
Packers are the subsystem behind ordered append/consume buffers and rasterizer ordered views.

     
    Frenetic Pony, Jawed, T2098 and 5 others like this.
  12. szatkus

    Newcomer

    Joined:
    Mar 17, 2020
    Messages:
    38
    Likes Received:
    26
I would prefer something in between, like 64 CUs. Unless Big Navi turns out cheap; then no problem, I can take that.
     
  13. Krteq

    Newcomer

    Joined:
    May 5, 2020
    Messages:
    149
    Likes Received:
    263
  14. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    55
    Likes Received:
    115
If history is any indication, there will be one if not two cut-down SKUs from the full 80 CU Navi 21, so you will likely get your wish. 72 CU and 64 CU cut-down versions would align exactly with what AMD did with Navi 10 and the RX 5600 + RX 5700.
     
  15. Esrever

    Regular

    Joined:
    Feb 6, 2013
    Messages:
    846
    Likes Received:
    647
    Seems like RDNA2 GPUs are all code named something fishy.
     
    Lightman and Per Lindstrom like this.
  16. szatkus

    Newcomer

    Joined:
    Mar 17, 2020
    Messages:
    38
    Likes Received:
    26
That's right, but the 5600 was an outlier. I don't think they ever cut down any GCN chip that much. It was probably something related to yields on N7.
     
  17. Esrever

    Regular

    Joined:
    Feb 6, 2013
    Messages:
    846
    Likes Received:
    647
With the 3 Navi 2x GPUs, AMD seems to have a lot of space between their tiers.

Navi 21 with 80 CUs, then Navi 22 with 40 CUs, then Navi 23 with 20(?) CUs.

That is a lot of performance difference between the GPU tiers.
     
  18. Geeforcer

    Geeforcer Harmlessly Evil
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,320
    Likes Received:
    525
GPUs <> card SKUs.

They could have:

N21: 80 CU = 6900 (XT) and <80 CU = 6900 (plain) or 6800.
N22: 40 CU = 6800 (XT) and <40 CU = 6800 (plain) or 6700.

    etc.
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
In the sense that having more of any kind of unit means more redundancy, I suppose it might be possible, depending on the chip counts for both bus types.
If there were multiple HBM stacks and a significant number of GDDR6 channels, maybe.
If the idea is that there are significantly fewer channels overall, perhaps it's worse. The failure granularity for HBM is seemingly at the stack level, while GDDR6 might be at 64-bit or 32-bit granularity.
If it's 2x HBM and 4x 64-bit GDDR6 (32-bit per chip, but AMD seems to like pairing channels) versus a potentially impractical 512-bit GDDR6 bus, it's worse.
It's a wash with a 384-bit bus, unless AMD went with more granular GDDR6 controllers.

Even at a roughly equivalent level of redundancy, I would have some concern that the HBM side makes yields incrementally worse overall, due to the higher failure rate of interposer integration versus PCB mounting, coupled with the fact that there's some multiplicative yield loss from needing both interposer integration and then a small additional yield loss from the GDDR6 bus.

    Perhaps it would help from a product mix standpoint, if there were GDDR6-only products and HBM-only products, then the same production could more readily satisfy multiple categories versus AMD's historically poor ability to juggle multiple product lines/types/channels. However, given the very specific needs of one type over the other, and maybe questions like die-thinning or whatever preparatory work is needed for an interposer, it may not be that flexible. Even for yield-recovery purposes, there could also be a question of when failures can be detected. Some issues may not be fully detectable until after some point of no return, like after interposer mounting.

If the chip were to always have an interposer, even if HBM weren't in use (discounting what is needed to route GDDR6 through an interposer), it would be a cost adder that would probably be more significant than whatever yield gain there was from redundancy in the controller.
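The multiplicative-yield point is simple arithmetic: a die must survive every integration step, so step yields multiply. The step yields below are made-up illustrations, not real figures:

```python
# Step yields multiply: each integration step is another chance to lose the part.
def combined_yield(*step_yields: float) -> float:
    total = 1.0
    for y in step_yields:
        total *= y
    return total

# Hypothetical: 95% interposer-mount yield x 98% GDDR6-bus yield.
print(combined_yield(0.95, 0.98))
```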


    The L2 is globally shared and tends to be significantly longer latency. As the LLC, it is also in high demand with contention making things worse.
    The exact threshold is something that isn't necessarily clear.
For Nvidia, the L2 has been benchmarked at nearly 7x the L1's latency for Volta and Kepler, whereas the Maxwell/Pascal generations were in the nearly-3x range. This is related more to the much lower L1 latency of Volta and Kepler than to a massive shift in L2 latency.
    https://arxiv.org/pdf/1804.06826.pdf
GCN's L2 is only 1.7x the latency of the L1, mainly due to the L1's estimated latency of ~114 cycles per a GDC 2018 presentation. The L2 is apparently a little faster in cycle terms than Nvidia's, although this may be closer in wall-clock terms due to Nvidia's typically higher clocks in the past.

    Ampere's a question mark, though if it's similar to Volta the L1 latency should be on the lower end of the GPU scale.
I haven't seen numbers for RDNA1/2 officially, although if we believe the GitHub leak for the PS5 it might be roughly in the same area as GCN, to a little lower latency in the L1.
    The overall picture wouldn't be that different at maybe 90-100 cycles.
The improvement or degradation in latency may depend on how quickly L1 remote accesses can be resolved. If every leg in the transaction were on the order of 90 cycles, I'd say it wouldn't be a good latency bargain. The significant constraints when it comes to L2 bandwidth and contention might still provide an upside, and there seem to be at least some reasonable ways to skip parts of the L1 pipeline in a remote-hit scenario, given how straightforward it should be to detect that an access won't be locally cached.
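Since the architectures clock differently, the cycle counts above are easier to compare in wall-clock terms; the clocks assumed here are illustrative:

```python
# Convert a cache hit latency in cycles to nanoseconds at a given core clock.
def latency_ns(cycles: int, clock_ghz: float) -> float:
    return cycles / clock_ghz

gcn_l1 = latency_ns(114, 1.5)    # ~114-cycle GCN L1 per the GDC 2018 figure
volta_l1 = latency_ns(28, 1.5)   # ~28-cycle Volta L1; equal clock is an assumption
print(round(gcn_l1, 1), round(volta_l1, 1))
```

The same cycle gap shrinks or grows in nanoseconds as the assumed clocks diverge, which is why cycle-count comparisons alone can mislead.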

    I ran across a link from Anandtech's forum that points to a paper by the individuals involved in the patent concerning this sort of scheme:
    https://adwaitjog.github.io/docs/pdf/sharedl1-pact20.pdf

I haven't yet had the time to really read the paper. One thing I did notice is that the model GPU architecture isn't quite a fit for GCN or RDNA, with a 48KB local scratchpad, a 16KB L1, and a separate texture cache. There's perhaps some irony that an AMD-linked paper/patent has its analysis profiled on an exemplar that reminds me more of Kepler, although I think I may have seen something like that happen before.
    Whether the math necessarily holds at 90-114 cycles latency versus 28 is an unanswered question.

    edit: Actually, I just ran across a mention of the coherence mechanism assumed by the analysis, and it's Nvidia's L1 flush at kernel or synchronization boundaries. That could significantly deviate from the expected behavior of a GCN/RDNA cache.

    Depends on the context of L1 vs L0. The terminology is being handled loosely, and some portions of the description that indicate L1 capacity changes with CU count may mean L0/L1 depending on the GPU.

The CUs have 16KB L0/L1 caches, and one of the main benefits of this scheme is that its area cost should be lower than increasing the capacity of AMD's minuscule caches. I'm not sure how many tenths of a mm² the L0 takes up in a current non-Kepler GPU, however.


    It's an odd arrangement with the ROPs versus the L1. The RDNA whitepaper considers the ROPs clients of the L1 and touts how it reduces memory traffic. However, if their output is considered write-through, what savings would making them an L1 client bring?
Given screen-space tiling, a given tile will be loaded by only one RBE (no sharing), and unless the RBE loads a tile and then proceeds to write nothing to it, the L1 at best holds ROP data that is read once and must be evicted once it leaves the RBE caches (no reuse).
     
  20. Esrever

    Regular

    Joined:
    Feb 6, 2013
    Messages:
    846
    Likes Received:
    647
Yeah, but even then, cut-down chips usually lose at most ~20% performance, which still leaves a massive gap between the tiers. A cut-down 80 CU part would probably still be 72 CUs or close to it, which is still massively above a 40 CU part.
     