AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    217
    Likes Received:
    239
    Because AFAIK 512-bit with GDDR6 is quite complex - I've heard many people say that you already need quite clean signalling (in terms of noise) for a 384-bit bus with GDDR6, and at 512-bit it starts to become prohibitive. I think even the GDDR6X PCBs on GA102 boards must have very strict requirements. Maybe for AMD the reasoning was that the increased die cost <= the increased cost of PCB and RAM (especially high-cost RAM like HBM), and it also opens up the possibility of easier mobile designs.
     
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,150
    Likes Received:
    1,651
    Location:
    New York
    Anyone taking bets on Infinity Cache being gigabytes of 3D stacked memory? It goes without saying that would be pretty amazeballs.

    https://www.freepatentsonline.com/y2020/0183848.html - Cache for Storing Regions of Data

     
    Lightman and pjbliverpool like this.
  3. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,641
    Likes Received:
    6,664
    Lightman likes this.
  4. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
    The lineage of DRAM cache research and patents from AMD has been around for half a decade, including tricks to lower/hide DRAM access latency, coarse-grained caching, and region-based coherence. Most of the literature I've read focuses on evaluating its potential as a CPU LLC, and on its latency & performance impact on cache coherence protocols.

    And TBH, if they are going to use stacked DRAM, I feel like multiple VRAM pools with HBCC page migration is a more likely starting point, especially if the main VRAM pool would not be drastically larger than the DRAM cache.
     
    #4104 pTmdfx, Oct 25, 2020
    Last edited: Oct 25, 2020
    Pete and pjbliverpool like this.
  5. yuri

    Regular Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    263
    Likes Received:
    270
    Is this really surprising at all? AMD has been going for 300±25W high-end parts for years now (Fury X, Vega 64, Radeon VII).

    So all those reports like "top model 250W" were strange.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,536
    Likes Received:
    4,635
    Location:
    Well within 3d
    The code has existing functions for registering which blocks like the RBEs are active or have been disabled by fuses or BIOS settings.
    What seems new here is the disabling of shader arrays. Before this, there were facilities for disabling at the shader engine or individual RBE level, so Sienna Cichlid would appear to be adding an intermediate level of salvage.

    I was able to find a mention of the new function (gfx_v10_3_get_disabled_sa) that showed up:
    https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg53790.html

    What PBB mode does exactly, I'm not sure of. It does seem that there is at least some distinction between a fully-enabled GPU and one with one or more deactivated shader arrays. Perhaps this means load-balancing is handled differently due to a shift in the ROP versus rasterizer capacity, or the algorithm for allocating screen space is altered if the RBEs and rasterizers remained linked at a shader array level.


    For RDNA, the RBEs were clients to the L1, which is per shader array for Navi 10.

    Having the option to disable at a shader array level may be a change due to how redundant many of the resources are. There are many CUs in an array, and the code seems to have a separate pre-existing bitmask for handling disabled RBEs.
    This may indicate that 8 shader arrays are enough to warrant the trouble versus the similar amount of rasterizer and RBE hardware per array in Navi 10, or that additional, less-redundant hardware sits at that level.

    Another possibility is that the function gfx_v10_3_program_pbb_mode from my earlier link actually goes through quite a bit of setup just to check whether a shader engine's shader arrays are active. Perhaps it's meant for future scalability or consistency in the code, but building a bitmask based on system parameters when the traditional configuration is 1 or 2 per SE may mean a larger number could be possible for this family.
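
    As a rough illustration of what building such a bitmask involves, here is a minimal sketch in C; the constants and field names are invented for illustration and are not the actual amdgpu code, which lives in gfx_v10_3_get_disabled_sa:

    /* Hypothetical sketch: derive a bitmask of disabled shader arrays (SAs)
     * from per-SE fuse/BIOS data. All names and limits here are assumptions. */
    #include <stdint.h>

    #define MAX_SE        4   /* assumed number of shader engines */
    #define MAX_SA_PER_SE 2   /* traditionally 1-2 shader arrays per SE */

    /* Bit (se * MAX_SA_PER_SE + sa) is set if that shader array is disabled. */
    static uint32_t get_disabled_sa_mask(const uint32_t fuse_bits[MAX_SE])
    {
        uint32_t mask = 0;

        for (int se = 0; se < MAX_SE; se++)
            for (int sa = 0; sa < MAX_SA_PER_SE; sa++)
                if (fuse_bits[se] & (1u << sa))  /* SA fused off or disabled? */
                    mask |= 1u << (se * MAX_SA_PER_SE + sa);

        return mask;
    }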

    3dmark leaks seem to happen with regularity for AMD, including multiple console chips. It happens enough that I'd have to suspect it's on purpose or policies are such that AMD doesn't stop it from happening. I'd imagine the benchmark is a readily available non-trivial 3D application for early testing and validation, and one that the vendor has put more effort into optimizing for or dedicating functions to in the driver. This might make it more likely that there's programming and debugging resources available, and possibly special frameworks or driver paths explicitly for getting early testing functional on it.
    That it happens to upload results to the world at this point would be well-understood, and might be part of a controlled leak for marketing or maybe giving certain interested parties an idea of how to plan their market segmentation.


    3D on the SOC seems unlikely given the TDP numbers bandied about.
    On-package would be possible, if AMD committed to an interposer solution. That might not scale down as well for the smaller SKUs that allegedly have something like this.
    As a counterargument, I'm not sure this would need as much die area as the extra area the rumors indicate. The concerns for cache coherence and latency aren't traditionally GPU-related.

    The rumors also didn't clearly cite this sort of MCM package or links, and while HBM2 has plenty of bandwidth, the amount it provides versus GDDR6 is similar enough that I'd question whether balancing between the two memory types would be worth the complexity.
     
    w0lfram, Jawed, PSman1700 and 5 others like this.
  7. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    631
    Likes Received:
    323
    I'm not sure the TGP numbers matter for stacked RAM. Memory access as a source of heat is going to be negligible for these GPUs regardless of how they're structured.

    And it could be that the new cache structure is as much about being forward-looking and planning for a chiplet arch as it is about RDNA2. Bandwidth requirements are obviously starting to outrun GDDR, so some solution is needed. What it technically looks like, other than HBM, still baffles me though. SRAM is very die-expensive; once you get to an L3 cache, the latency advantage over just going out to main memory starts getting pretty small, and unlike CPUs, memory access patterns for GPUs are fairly coherent and predictable. So while having enough SRAM is important, keeping a lot of data in cache just in case isn't as much of a thing.
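
    To put "die expensive" in rough numbers, here is a back-of-envelope area estimate; the bitcell density and overhead factor are assumptions, not published figures:

    /* Back-of-envelope SRAM area estimate for a 128 MiB cache.
     * The ~0.03 um^2/bit density (7nm-class high-density SRAM) and the
     * 2x overhead for tags/periphery/routing are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double bytes       = 128.0 * 1024 * 1024;  /* rumored size */
        const double um2_per_bit = 0.03;                 /* assumed bitcell */
        const double overhead    = 2.0;                  /* tags, periphery */

        printf("~%.0f mm^2 for 128 MiB of SRAM\n",
               bytes * 8 * um2_per_bit * overhead / 1e6);  /* ~64 mm^2 */
        return 0;
    }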

    Well, only a few days to go. Hopefully they'll offer at least an initial explanation if it's not HBM.
     
  8. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,536
    Likes Received:
    4,635
    Location:
    Well within 3d
    If it's in the same stack as the SOC, it matters more, since nominal refresh rates are typically maintained only at 85C or below. Higher temperatures force higher refresh rates, which AMD's papers on stacked GPU+DRAM combinations listed as a threshold to avoid. That proposal capped stack power at ~8-17W for the logic, which is an order of magnitude below the chip TDPs mentioned in the rumors.
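
    To illustrate why crossing 85C hurts, a quick sketch of refresh overhead; the tREFI/tRFC values are generic DRAM-class assumptions, not numbers from AMD's papers:

    /* Above ~85C, the DRAM refresh interval (tREFI) is typically halved,
     * doubling the fraction of time spent refreshing. Values assumed. */
    #include <stdio.h>

    int main(void)
    {
        const double t_rfc       = 350e-9;  /* assumed refresh cycle, 350 ns */
        const double t_refi_cool = 7.8e-6;  /* tREFI below 85C */
        const double t_refi_hot  = 3.9e-6;  /* tREFI above 85C (halved) */

        printf("refresh overhead below 85C: %.1f%%\n", 100 * t_rfc / t_refi_cool);
        printf("refresh overhead above 85C: %.1f%%\n", 100 * t_rfc / t_refi_hot);
        return 0;
    }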
     
    Lightman likes this.
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    Yes. I've talked about L2s being distributed across chiplets, with L1s needing to be able to talk to all L2s; the interposer (or other 3D stacking tech) therefore becomes vital for this network at several TB/s.

    GPUs care most about bandwidth (and hence power), so any argument against a large L3 cache based on latency is moot.

    Bandwidth on-die is always going to be far higher, and at far lower power, than GDDR or HBM off-die.

    At 60fps with 512GB/s of memory, the GPU can only access 8.5GB of memory per frame. Ideally the GPU would read or write no byte more than once.
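
    A quick sanity check on that per-frame budget, plus how many times a 4K RGBA8 surface would fit into it:

    /* Per-frame memory traffic budget at a given bandwidth and frame rate. */
    #include <stdio.h>

    int main(void)
    {
        const double gb_per_s = 512.0;
        const double fps      = 60.0;
        const double fb_bytes = 3840.0 * 2160 * 4;    /* 4K RGBA8, ~33 MB */

        double per_frame = gb_per_s / fps;            /* ~8.53 GB */
        printf("%.2f GB per frame, ~%.0f full 4K surface touches\n",
               per_frame, per_frame * 1e9 / fb_bytes);
        return 0;
    }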

    I still don't subscribe to the monster last level cache theory though. I suspect cache mechanics in GPUs have a lot of unexploited depth.
     
    Lightman likes this.
  10. MuteyM

    Joined:
    Jul 30, 2019
    Messages:
    6
    Likes Received:
    32
    I can't post links yet, but there have been several very interesting public commits to the amd-staging-drm-next branch of the Linux kernel:
    • The first one is titled "add GC 10.3 NOALLOC registers" and adds a bunch of register bitfields with "LLC_NOALLOC" in their name.
    • The next one is "add support to configure MALL for sienna_cichlid" where MALL is "Memory Access at Last Level"
    • The last one is "display: add MALL support" and includes this gem:
    + // TODO: remove hard code size
    + if (surface_size < 128 * 1024 * 1024) {

    Putting it all together, I'm guessing that this 128MB "Infinity Cache" last-level-cache rumor has some truth to it, and that at least one use will be to pin framebuffers to it (including displayable color buffers and depth/stencil) for some crazy high fillrate.
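
    A hedged reconstruction of what that size check might be gating; apart from surface_size and the 128 MiB constant, every name here is invented, since the commit only shows the fragment above:

    /* Hypothetical sketch: only pin a displayable surface to the MALL if
     * it fits in the rumored 128 MiB. Names other than surface_size are
     * invented for illustration. */
    #include <stdbool.h>
    #include <stdint.h>

    #define MALL_SIZE_BYTES (128u * 1024 * 1024)  /* per the TODO, likely to
                                                     become a queried parameter */

    static bool surface_fits_in_mall(uint64_t surface_size)
    {
        return surface_size < MALL_SIZE_BYTES;
    }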
     
    PizzaKoma, Pete, Lightman and 14 others like this.
  11. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,446
    Likes Received:
    2,626
    Location:
    Guess...
    So similar to the XB360's eDRAM but without requiring developer intervention?
     
  12. MuteyM

    Joined:
    Jul 30, 2019
    Messages:
    6
    Likes Received:
    32
    There was also this in one of the commit messages:
    "We need to add UAPI so userspace can request MALL per buffer"
    So it seems the GL and/or Vulkan drivers control exactly which buffers go into the MALL region.
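
    Such a UAPI would plausibly take the shape of a per-buffer creation flag; this is purely a guess at the shape, with invented names, not the interface from the commits:

    /* Hypothetical sketch of a per-buffer MALL request. The flag name,
     * bit position, and struct are invented for illustration. */
    #include <stdint.h>

    #define GEM_CREATE_MALL_HINT (1u << 10)  /* invented flag */

    struct gem_create_args {
        uint64_t size;   /* buffer size in bytes */
        uint32_t flags;  /* creation flags */
    };

    /* Userspace (e.g. the Vulkan driver) would set the hint per buffer: */
    static void request_mall(struct gem_create_args *args)
    {
        args->flags |= GEM_CREATE_MALL_HINT;
    }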
     
  13. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    913
    Likes Received:
    347
    I love that surface_size is uninitialized ... :roll:
     
    Lightman likes this.
  14. Lurkmass

    Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    227
    Likes Received:
    226
    Can this be used to store framebuffer state and be accessed by the fragment shaders?
     
  15. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,641
    Likes Received:
    6,664
    That actually makes a good amount of sense. Instead of a general cache you have a cache for specific framebuffers that will have high read/write access.
     
    Lightman likes this.
  16. andermans

    Newcomer

    Joined:
    Sep 11, 2020
    Messages:
    24
    Likes Received:
    38
    While GPUs will always crave bandwidth, I believe GPUs are slowly getting more latency-sensitive. Remember that for a fixed latency, if we want to increase bandwidth we also need to increase the number of operations in flight (see the sketch after this list), which has two limitations:

    1. More operations in flight means we have to keep more shader invocations alive, which results in bigger register usage. Note that as shaders have become more complicated over the years, they're already taking more registers per invocation, while the number of registers per flop hasn't really grown. When you are register-limited (or LDS-limited, etc.), a latency reduction can result in a real speedup.
    2. As we see with some of the wider GPUs these days at 1080p/1440p/4K, the workload needs to be parallel enough to actually keep that many operations in flight, which is hard for real use cases.
    Probably still significantly less sensitive than a CPU, but I think the days of pretty much not caring at all are over. Consider that GCN->RDNA1 mostly improved on this metric (lower cache latency for non-filtered fetches, and allowing more compute work with a low number of shader invocations).
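
    The bandwidth/latency/occupancy relationship above is just Little's law: bytes in flight = bandwidth x latency. A quick sketch with illustrative numbers:

    /* Little's law: outstanding bytes = bandwidth * latency. Doubling
     * bandwidth at fixed latency doubles the outstanding requests (and the
     * registers and wavefronts needed to cover them). Numbers illustrative. */
    #include <stdio.h>

    int main(void)
    {
        const double bandwidth = 512e9;   /* bytes/s */
        const double latency   = 300e-9;  /* assumed memory latency, 300 ns */
        const double line      = 64.0;    /* bytes per cache line */

        double in_flight = bandwidth * latency;   /* 153600 bytes */
        printf("%.0f bytes (~%.0f cache lines) must be in flight\n",
               in_flight, in_flight / line);      /* ~2400 lines */
        return 0;
    }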

    Given that the flags are for NOALLOC instead of ALLOC, I suspect everything will be cached by default, and I assume MALL is just the option of also reading the framebuffer to the display through the last-level cache (instead of first requiring a flush to memory, which I think gets expensive with a 128 MiB cache ...).

    Funnily enough I think the size check is an indication this might be 128 MiB for real.
     
  17. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,536
    Likes Received:
    4,635
    Location:
    Well within 3d
    I'm listing entries that appear to be related:
    https://lists.freedesktop.org/archives/amd-gfx/2020-October/055005.html
    https://lists.freedesktop.org/archives/amd-gfx/2020-October/055006.html
    https://lists.freedesktop.org/archives/amd-gfx/2020-October/055024.html
    https://lists.freedesktop.org/archives/amd-gfx/2020-October/055007.html

    That is a possible size for the storage, although in the context of display surfaces or buffers, 128*1024*1024 shows up in multiple places as a maximum buffer size even for architectures without large amounts of dedicated storage.
    Perhaps there are limitations to what the handling hardware can address, or the TODO indicates that some parameter will be introduced to give a per-implementation limit. That leaves the possibility that Sienna Cichlid's implementation happens to have a resource that matches this upper limit.

    It's apparently visible to some kind of software, although it's only the driver that is mentioned so far, and that can have varying levels of visibility to client software. There's an addition to the page table format for Sienna Cichlid indicating that allocation for this mode can be handled at page granularity. Using page table entries for that purpose has been done before, such as with the ESRAM on the Xbox One. That could go either way as far as developer visibility. Some elements point to the driver making decisions based on the type of target or functionality.
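
    As a rough illustration of page-granularity control, a PTE could carry a cache-allocation bit along these lines; the bit positions and names are invented, not the actual Sienna Cichlid page table format:

    /* Hypothetical PTE sketch: one bit per page steers last-level cache
     * allocation. Bit positions and names are invented. */
    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_VALID       (1ull << 0)
    #define PTE_LLC_NOALLOC (1ull << 58)  /* invented: bypass LLC allocation */

    static bool page_allocates_in_llc(uint64_t pte)
    {
        return (pte & PTE_VALID) && !(pte & PTE_LLC_NOALLOC);
    }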
     
    Pete, Lightman, pharma and 5 others like this.
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,150
    Likes Received:
    1,651
    Location:
    New York
    This is likely a key benefit of the RDNA vs CDNA split. The LLC scratchpad is more amenable to pinned render targets than it would be to general HPC.
     
    w0lfram likes this.
  19. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    Yeah, it does seem like it. The other leaks have supposedly been coming from AIBs, and more on N22. If AMD is being secretive, it's unlikely they'd have given AIBs those slides.
    With Zen3 I'd argue that it's more a rearranged cache than a reworked one. The much bigger changes are in the other parts of the chip.
    Yeah, nothing surprising really; it was to be expected. And even if AMD themselves don't release 300W+ cards, it seems very likely that AIBs will.
     
    Lightman and PSman1700 like this.
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,536
    Likes Received:
    4,635
    Location:
    Well within 3d
    There's already blocking for matrix multiplication to fit within the register files and on-die caches. A large cache might be another level for bandwidth optimization or locality. AMD has already posited more complex memory tiers for HPC to keep hot data sets on-board or on-die, although maybe the counterargument is that it's getting complex enough as it is?
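
    For reference, the blocking in question is the classic tiling trick: work on TILE x TILE sub-blocks so each tile of the operands stays resident in a given cache level, with a larger cache simply permitting a larger tile. A minimal sketch for square matrices:

    /* Cache-blocked matrix multiply: each TILE x TILE tile of a, b, and c
     * is reused from cache while being worked on. TILE is sized so
     * 3*TILE*TILE*sizeof(float) fits the targeted cache level.
     * c must be zero-initialized by the caller. */
    #define TILE 64

    void matmul_blocked(int n, const float *a, const float *b, float *c)
    {
        for (int i0 = 0; i0 < n; i0 += TILE)
            for (int j0 = 0; j0 < n; j0 += TILE)
                for (int k0 = 0; k0 < n; k0 += TILE)
                    /* multiply one tile pair into the C tile */
                    for (int i = i0; i < i0 + TILE && i < n; i++)
                        for (int j = j0; j < j0 + TILE && j < n; j++) {
                            float acc = c[i * n + j];
                            for (int k = k0; k < k0 + TILE && k < n; k++)
                                acc += a[i * n + k] * b[k * n + j];
                            c[i * n + j] = acc;
                        }
    }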

    On the RDNA vs CDNA split, there's also mention of a Navi 10 blockchain SKU.
    Example: https://lists.freedesktop.org/archives/amd-gfx/2020-October/055070.html
     
    w0lfram, Lightman, Krteq and 2 others like this.