AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,213
    Likes Received:
    1,409
    Location:
    London
    RDNA 2, according to the capabilities matrix:

    https://videocardz.com/newz/amd-navi-21-to-feature-80-cus-navi-22-40-cus-and-navi-23-32-cus

    reduces the count of hardware threads per SIMD: 20 down to 16.

    Overall I'd agree that latency is rising in relevance. A large unanswered question is how much of a GPU's latency sensitivity is self-inflicted, due to such things as an outdated architecture or a driver that allocates resources badly.
     
  2. andermans

    Newcomer

    Joined:
    Sep 11, 2020
    Messages:
    19
    Likes Received:
    31
    I don't think that one really matters. You're also limited by the number of VGPRs (512 per SIMD for 64-lane waves). The limitations here are really a problem for shaders that use more than 64 VGPRs (i.e. occupancy < 8, or < 4 for GCN, I think; to be fair, I haven't looked at this that closely on RDNA yet), and 16 vs 20 doesn't really make much of a difference. Shaders for which we can reach an occupancy of 20 (24 VGPRs due to rounding) are typically short and simple, and at that point rasterization and/or wave launch speed becomes an issue.

    (On RDNA1 launch speed was <= 1 wave/clock, so if the shader has fewer than 40 64-lane VALU operations and fewer than 3 texture sampling operations, then you're launch-speed limited.)
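
    A rough sketch of the occupancy arithmetic above, in Python. The 512-VGPR budget per SIMD for 64-lane waves and the 20 vs 16 wave slots come from the posts; the allocation granule of 8 registers is an assumption (the post only says that rounding applies). The second function is one possible reading of where the "40 VALU operations" and "3 texture operations" thresholds come from, assuming a Navi 10-like part with 40 CUs, two SIMD32s per CU, two cycles per 64-lane VALU operation, and 4 texture samples per CU per clock. None of those figures are stated in the post, and both function names are hypothetical.

```python
import math


def waves_per_simd(vgprs_per_lane, wave_slots, vgpr_budget=512, granule=8):
    """Waves that fit on one SIMD, limited by VGPRs and by hardware wave slots.

    vgpr_budget=512 is the per-SIMD figure for 64-lane waves quoted in the post.
    granule=8 is an ASSUMED allocation granularity, not taken from the post.
    """
    # Round the request up to a whole number of allocation granules.
    allocated = math.ceil(vgprs_per_lane / granule) * granule
    return min(wave_slots, vgpr_budget // allocated)


def launch_limited_below(cus=40, simds_per_cu=2, cycles_per_wave64_valu=2,
                         samples_per_cu_per_clk=4, lanes=64):
    """Chip-wide 64-lane operations retired per clock (all inputs ASSUMED).

    If waves launch at 1 per clock, a shader with fewer operations per wave
    than these rates cannot keep the units busy.
    """
    valu_ops_per_clk = cus * simds_per_cu / cycles_per_wave64_valu
    tex_ops_per_clk = cus * samples_per_cu_per_clk / lanes
    return valu_ops_per_clk, tex_ops_per_clk


if __name__ == "__main__":
    for vgprs in (24, 32, 64, 128):
        print(vgprs, waves_per_simd(vgprs, 20), waves_per_simd(vgprs, 16))
    print(launch_limited_below())
```

    Under these assumptions the 20-slot and 16-slot configurations only differ for shaders that allocate 24 VGPRs or fewer; from 32 VGPRs upward the register file is the limit in both cases.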
     
    Krteq and iroboto like this.
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,520
    Likes Received:
    4,578
    Location:
    Well within 3d
    There's been an increase in context expected for some architectural elements, such as the shift to static 128-register allocations for scalar registers. It's possible with all the additional features that VGPR usage is also gradually increasing.
    On top of that, other occupancy or occupancy-related pressures, like instruction buffers and demands on shared caches, are growing.
    For a higher-clocked architecture, shaving a few entries off wavefront evaluation could streamline a pipeline stage as well.

    Other latency measures have increased with recent generations. RDNA separated out vector writes into their own tracking category to better stream out writes, and it has a greater waitcnt capacity for memory operations in general.
     
    Krteq and BRiT like this.
  4. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    801
    Likes Received:
    266
  5. NightAntilli

    Newcomer

    Joined:
    Oct 8, 2015
    Messages:
    44
    Likes Received:
    69
  6. P_EQUALS_NP

    Newcomer

    Joined:
    Jun 17, 2020
    Messages:
    12
    Likes Received:
    3
    I know I am late, but I want to chime in on the latency vs bandwidth debate. I believe that a large LLC would not only provide more bandwidth at lower latency, but could also help AMD relax the amount of latency hiding per CU. For example, Nvidia runs 32 threads per shader to hide the very long latency of GDDR6X; with a large LLC, AMD could play the odds, assume most memory accesses hit the cache, and lower the number of threads per shader to something like 16* or 8*, freeing up registers and thus transistors!

    *Note: I don't know for sure how many threads of latency hiding RDNA 1 has, since I haven't worked with AMD GPUs at a low level in a very long time.
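
    The "play the odds" argument can be sketched as an expected-latency calculation. Everything numeric below is a hypothetical placeholder (cache and DRAM latencies in cycles, hit rate, cycles of independent work per wave); none of it is measured or taken from the thread, and the helper names are made up for illustration.

```python
import math


def average_latency(hit_rate, cache_cycles, dram_cycles):
    """Expected memory latency in cycles for a given LLC hit rate."""
    return hit_rate * cache_cycles + (1.0 - hit_rate) * dram_cycles


def waves_needed(latency_cycles, busy_cycles_per_wave):
    """Waves required so that other waves' work covers one wave's stall.

    busy_cycles_per_wave is how long each wave can issue before it stalls too.
    The +1 accounts for the stalled wave itself.
    """
    return math.ceil(latency_cycles / busy_cycles_per_wave) + 1


if __name__ == "__main__":
    CACHE_CYCLES = 200   # hypothetical LLC hit latency
    DRAM_CYCLES = 500    # hypothetical DRAM latency
    BUSY_CYCLES = 40     # hypothetical work per wave between loads
    for hit_rate in (0.0, 0.5, 1.0):
        latency = average_latency(hit_rate, CACHE_CYCLES, DRAM_CYCLES)
        print(hit_rate, latency, waves_needed(latency, BUSY_CYCLES))
```

    The shape of the result is the point, not the values: a higher hit rate lowers the average latency and therefore the number of resident waves (and registers) needed to cover it, although misses still pay the full DRAM latency.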
     
    Alexko and Frenetic Pony like this.
  7. xEx

    xEx
    Veteran Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    1,015
    Likes Received:
    487
    Wait, are we gonna see AIB cards in day 1 reviews? :runaway:
     
    Lightman likes this.
  8. andermans

    Newcomer

    Joined:
    Sep 11, 2020
    Messages:
    19
    Likes Received:
    31
    The 32-threads-per-shader figure from Nvidia (and 32/64 for AMD) is how wide the SIMD is. On top of that, they have SMT for latency hiding. Not sure how many for Nvidia, but for RDNA1 it is 2-20x depending on how many registers are used (they're likely going to 2-16x in RDNA2, but I don't think that is a material change).

    I think the cache with lower latency is a great counterbalance for RT though (walking the BVH tree involves multiple dependent loads, which makes it pretty latency sensitive).
     
    Lightman likes this.
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,987
    Likes Received:
    1,309
    Location:
    New York
    It’s 16 warps max per scheduler in Ampere.
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,213
    Likes Received:
    1,409
    Location:
    London
    AMD's GPU frequencies have roughly tripled over the course of about 10 years (900MHz 6870 in October 2010), while memory latencies are the same or longer.

    (Well I admit I don't know actual GDDR5 and GDDR6 latencies.)
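
    To make the consequence concrete: a fixed latency in nanoseconds costs proportionally more core cycles as the clock rises. In the sketch below, the 900 MHz figure is from the post, 2700 MHz simply stands in for "roughly tripled", and the 100 ns latency is a placeholder, not a real GDDR5 or GDDR6 figure.

```python
def latency_in_cycles(latency_ns, clock_mhz):
    """Convert a latency in nanoseconds to core clock cycles."""
    return latency_ns * clock_mhz / 1000.0


if __name__ == "__main__":
    PLACEHOLDER_LATENCY_NS = 100  # hypothetical, for illustration only
    for clock_mhz in (900, 2700):
        print(clock_mhz, latency_in_cycles(PLACEHOLDER_LATENCY_NS, clock_mhz))
```

    With the latency in nanoseconds held constant, tripling the clock triples the number of cycles the GPU has to cover per memory access.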
     
    Lightman likes this.
  11. Krteq

    Newcomer

    Joined:
    May 5, 2020
    Messages:
    49
    Likes Received:
    80
    Guys, why are you still mentioning MALL in reference to the cache subsystem?

    It's part of DCN (Display Controller) and it's about timings of framebuffers etc.

    Or am I reading that code incorrectly?
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,520
    Likes Received:
    4,578
    Location:
    Well within 3d
    GCN has 10 wavefronts per SIMD, RDNA1 has 20, at least some of the RDNA2-based consoles have 20, and Sienna Cichlid appears to have 16.

    The different SIMD widths and cadences make comparisons more complex. A single GCN wavefront stalling on an instruction might have up to 9 other wavefronts, each with one non-stalling instruction on a 4-cycle cadence, which covers 36 cycles if each issues one instruction. An RDNA CU would have 19 other wavefronts, but they would be single-cycle. AMD provided a wave64 mode, which among other things emulates a 2-cycle cadence and brings the number of lanes per wave into a more equivalent relationship with a GCN wavefront. That mode is apparently recommended for workloads like pixel shading where latency hiding is a major need, and the 2-cycle cadence gives roughly the same number of cycles as GCN.
    Sienna Cichlid reduces the amount somewhat in this scenario, although the majority of the hiding is still possible.
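
    The comparison above reduces to (other wavefronts) x (issue cadence). A minimal sketch using only the wave counts and cadences mentioned in this post, and assuming, as the post does, that every other wavefront has exactly one non-stalling instruction to issue; the function name is illustrative:

```python
def hiding_cycles(wave_slots, issue_cadence):
    """Cycles covered while one wave stalls, if each of the other waves
    issues a single non-stalling instruction at the given cadence."""
    return (wave_slots - 1) * issue_cadence


# (wavefronts per SIMD, cycles per issued instruction), as described in the post
CONFIGS = {
    "GCN wave64": (10, 4),
    "RDNA1 wave32": (20, 1),
    "RDNA1 wave64": (20, 2),
    "Sienna Cichlid wave64": (16, 2),
}

if __name__ == "__main__":
    for name, (slots, cadence) in CONFIGS.items():
        print(name, hiding_cycles(slots, cadence))
```

    This gives 36 cycles for GCN, 19 for RDNA1 wave32, 38 for RDNA1 wave64 and 30 for a 16-wave part in wave64 mode, which matches the "somewhat reduced, but mostly preserved" description.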

    Not evaluated is how many memory accesses can be issued sequentially per stall. Vega doubled the amount over earlier GCN, and RDNA further split reads and writes to allow for more flexible batching of accesses and to avoid stalls on generally fire-and-forget writes.

    Wave32 would be more latency-sensitive, but it may make more sense for less parallel workloads or ones better contained in the LDS or register file.

    It depends on whether this MALL functionality is interpreted as only applying to the display controller, or that it's a larger functionality change with a subset relating to the display controller.
    Changes relating to cache allocation masks and a change to page table entries seem to imply that things like depth tiling, SDMA, and other functions could be related.
    https://lists.freedesktop.org/archives/amd-gfx/2020-October/055005.html
    https://lists.freedesktop.org/archives/amd-gfx/2020-October/055006.html
    Values introduced in the first show up in the second, which deals with MALL.
     
    Lightman, Jawed, PSman1700 and 4 others like this.
  13. Krteq

    Newcomer

    Joined:
    May 5, 2020
    Messages:
    49
    Likes Received:
    80
    Thx, it starts to make sense to me now.
     
  14. Globalisateur

    Globalisateur Globby
    Veteran Regular Subscriber

    Joined:
    Nov 6, 2013
    Messages:
    3,799
    Likes Received:
    2,674
    Location:
    France
    If rumors about infinity cache are true, could it be using some kind of stacked chips (or chip on chip) tech?
     
    Lightman likes this.
  15. SlmDnk

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    631
    Likes Received:
    306
  16. xEx

    xEx
    Veteran Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    1,015
    Likes Received:
    487
    Only 18 hours left... I was very surprised the reviews for the 3070 came today... Nvidia gave AMD another target and price to beat...
     
  17. Wasmachineman_NL

    Newcomer

    Joined:
    Jun 24, 2019
    Messages:
    32
    Likes Received:
    29
    Location:
    GFX CARD CONN
    So I guess I answered my own question, more than half a year later: RDNA2 will come with AV1 decode. I wonder how Vegas will run on RDNA2? GPU acceleration would be awesome for the stuff I do.
     
    PSman1700, BRiT and Lightman like this.
  18. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    372
    Likes Received:
    319
    Within the patch, there are declarations with the term LLC, which is usually an abbreviation of last-level cache. Combined with terms like “no alloc” and “GCMC”, these patches sound like they are adding support for memory-side LLC bypass* on a per-page basis, while some blocks (e.g. SDMA copy engine) can override the page-level settings.

    * probably like SLC=1 policy for L2: write no-allocate, read miss-evict
     
    Lightman, BRiT and andermans like this.
  19. andermans

    Newcomer

    Joined:
    Sep 11, 2020
    Messages:
    19
    Likes Received:
    31
    Looks like the LLC is enabled by default, which is why we've seen so little driver stuff about it. I suspect MALL may be something like skipping the LLC so that it can be power-gated (though with it being enabled only for size < 128MiB, maybe it is the other way around: use the cache so you can stop or clock down the memory?). It sounds like it is only enabled when it is "idle", i.e. nothing changes about the image, so that very much sounds like an opportunity to shut things off.
     
  20. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    372
    Likes Received:
    319
    I guess it's also worth noting that the earlier link contains a condition with a magic number: surface_size < 128 * 1024 * 1024.

    So, ehm, maybe an interpretation is:
    • It has 128 MB last level cache
    • The hardware feature (& hence the flag) is called Memory Access at Last Level (MALL)
    • It can be turned on & off. (for lower idle power?)
    • The driver probably allows only render targets to allocate in LLC in some phases, in which the display controller can be assured that any <128MB RT to be presented always hits the LLC, and uses way tighter timing. (Eh, or maybe all the time? It is an IMR GPU after all)
    • Edit: ^^ is nonsense if you consider basics like double buffering... So maybe it is like what andermans said, MALL allows the 128MB LLC to be used as a scratchpad (hence "Memory Access"), while the GDDR6 pool is powered off?
    ?_?
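
    For a sense of scale, the quoted condition can be checked against a few common framebuffer sizes. This is only a sketch: it assumes tightly packed surfaces with no padding or alignment, and the helper name and the bytes-per-pixel values are illustrative, not taken from the patch.

```python
# The magic number from the quoted condition: 128 MiB in bytes.
SURFACE_SIZE_LIMIT = 128 * 1024 * 1024


def surface_fits(width, height, bytes_per_pixel=4):
    """Return (size_in_bytes, passes the surface_size < 128 MiB check).

    Assumes a tightly packed surface; real allocations may be padded.
    """
    size = width * height * bytes_per_pixel
    return size, size < SURFACE_SIZE_LIMIT


if __name__ == "__main__":
    cases = [
        ("1080p, 4 B/px", 1920, 1080, 4),
        ("4K, 4 B/px", 3840, 2160, 4),
        ("8K, 4 B/px", 7680, 4320, 4),
        ("8K, 8 B/px", 7680, 4320, 8),
    ]
    for label, width, height, bpp in cases:
        size, fits = surface_fits(width, height, bpp)
        print(f"{label}: {size / (1024 * 1024):.1f} MiB, fits={fits}")
```

    Under these assumptions a single 8K surface at 4 bytes per pixel (about 126.6 MiB) still passes the check, while the same surface at 8 bytes per pixel does not.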
     
    #4140 pTmdfx, Oct 27, 2020
    Last edited: Oct 28, 2020
    Lightman likes this.