AMD Radeon RDNA2 Navi (RX 6700 XT, RX 6800, 6800 XT, 6900 XT) [2020-10-28, 2021-03-03]

Discussion in 'Architecture and Products' started by BRiT, Oct 28, 2020.

  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,819
    Likes Received:
    3,976
    Location:
    Finland
    No, but a product by one company isn't a standard by any stretch of the imagination before some accepted organization makes it such.
    JEDEC is the most prominent one when it comes to memory standards, and even if there are others, none of them has made GDDR6X a standard.
     
  2. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    17,821
    Likes Received:
    7,883
    Like the L1/L2/L3 caches on CPUs? Like register space on all computing devices? I mean, once upon a time CPUs didn't have any cache other than register space, but memory speeds couldn't keep up.

    Obviously if main memory were fast enough then CPUs wouldn't need L1/L2/L3 cache or even register space. Everything is a workaround in computing, because rarely is anything as fast as one wants it to be ... or, more importantly, rarely is anything as cheap (either in transistor cost or energy cost) as one wants it to be.

    There may (and probably eventually will) come a time when NV can't pay for custom memory with the bandwidth they require - then what? I'd imagine something similar to Infinity Cache.

    I mean, thinking about it another way: isn't DLSS just a workaround with IQ pitfalls because NV GPUs can't render fast enough at higher resolutions with RT enabled? Something that AMD desperately needs, IMO.

    Just because it's a workaround for a technology limitation doesn't mean it's bad. After all, everything is limited by technology. Just look at games: EVERYTHING in games is a workaround to try to get something to render fast enough, even though it comes with plenty of "pitfalls".

    Regards,
    SB
     
    Leovinus, Lightman, Scott_Arm and 3 others like this.
  3. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    547
    Likes Received:
    862
    I wasn't talking about whether it's good or bad. It all comes down to engineering trade-offs.

    As of now, at 7 nm, that cache eats up more die area than two additional memory controllers would have taken otherwise.

    That looks like a net perf/mm² loss to me, and the smaller-die 5700 XT having the same IPC as the 6700 XT at the same frequency only confirms that.

    AMD claims better perf/watt, so it was probably worth it in the end, though I have a hard time believing that the same config as the 5700 XT with faster GDDR6 would have scaled any worse.

    It would be interesting to compare a PS5 at similar frequencies against something like a 6700 in different games to figure out whether the Infinity Cache was worth the additional area.

    And that additional area translates into higher-cost products - 12 GB of memory looks like overkill for the mid-range segment, and 33% more die area in comparison with Navi 10 isn't free either.
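    As a quick sanity check on that 33% figure (assuming the commonly reported die sizes of ~251 mm² for Navi 10 and ~335 mm² for Navi 22): 335 / 251 ≈ 1.33, i.e. roughly a third more area for the same 40-CU configuration.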
     
    DavidGraham, DegustatoR and sonen like this.
  4. troyan

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    344
    Likes Received:
    666
    You know that GPUs scale with units, too? Like, for example, doubled FP32 throughput?
     
  5. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,130
    Likes Received:
    510
    Ah yes, "double" FP32.
    Let's do the count regfile operand sources dance.
     
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,214
    Likes Received:
    1,794
    Location:
    New York
    It’s a little annoying that AMD doesn’t know how to show off its own products. I’m sure IC is really good for certain things - maybe spatially and temporally coherent data access patterns, like all the post-processing that’s done today.
     
    Lightman likes this.
  7. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    547
    Likes Received:
    862
    It shines whenever there are lots of read-modify-write operations, such as particle blending - it simply destroys everything in 3DMark Fire Strike, for example.
    The funny part is that most games don't use blending that much, for obvious reasons - they have to look great on consoles with way lower bandwidth shared between both CPU and GPU.
    Post-processing today is mostly done in compute shaders and highly optimized by hand via the on-chip shared memory available in CS (and GA102 has more shared memory per chip than Navi 21), so I doubt the large L3 would benefit it a lot (even the way smaller L2 is almost never a bottleneck here).
    Neither do I see benefits in games where PP is done in pixel shaders (all vanilla UE4 games); in fact, I see the opposite - the 3080 is faster in such games, since it has way more L1 cache and pixel shaders are sensitive to it.
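    For illustration, a minimal CUDA sketch of that shared-memory pattern (blur1d is a made-up kernel, assuming a 256-thread block; HLSL's groupshared plays the same role as __shared__ here):

    Code:
        // 1D blur that stages a tile in on-chip shared memory, the same trick
        // compute-shader post-processing uses: the neighbour taps hit SRAM
        // instead of going back out to L1/L2/DRAM.
        __global__ void blur1d(const float* in, float* out, int n)
        {
            __shared__ float tile[256 + 2];                 // block tile plus 1-texel halo
            int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
            int l = threadIdx.x + 1;                        // local index, shifted past halo

            // Stage the tile, clamping loads at the image borders.
            tile[l] = in[min(g, n - 1)];
            if (threadIdx.x == 0)              tile[0]     = in[max(g - 1, 0)];
            if (threadIdx.x == blockDim.x - 1) tile[l + 1] = in[min(g + 1, n - 1)];
            __syncthreads();

            // The three taps below read shared memory, not the cache hierarchy.
            if (g < n)
                out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
        }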
     
  8. xEx

    xEx
    Veteran Newcomer

    Joined:
    Feb 2, 2012
    Messages:
    1,060
    Likes Received:
    542
    The price is set by the buyer; companies will always try to sell their product at the highest price most of the buyers will pay for it.

    Tbh I'm surprised AMD and Nvidia haven't raised their MSRPs these days - they have to sell a card for $480 while watching everyone in the middle resell that card for $1200.
     
  9. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    68
    Likes Received:
    171
    "Fast" has two attributes - latency and bandwidth. CPUs are extremely latency-sensitive and in fact use caches more for latency reduction than for bandwidth amplification (although maybe AVX has been rebalancing that equation somewhat). GPUs on the other hand are bandwidth machines and can shrug off latency to a large degree (except for some specific latency-sensitive operations as @OlegSH pointed out). And so they use caches primarily for bandwidth enhancement. You cannot buy latency, but you can always buy more bandwidth, which makes the tradeoff for cache on GPUs a different ballgame than for CPUs. Navi 2x certainly made some bold choices in that tradeoff space, but the benefits aren't as slam-dunk as, for example, the hilariously awesome Zen L3$.
     
  10. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    547
    Likes Received:
    862
    Yep, there is all kinds of prefetching and speculation hardware in Zen that helps get the necessary data into the large L3 in advance, decreasing latency rather than making it worse.
    Those smarts are what make this cache relevant and useful, not just the sheer size.
    In Imagination's graphics, there is a whole pipeline built around tiling optimizations and on-chip SRAM; that architecture can get by with way smaller caches and capture the same benefits.
    Besides caches and tiled pipelines, consoles have exploited EDRAM and SRAM to pin certain resources in fast memory.
    There are also both a large L2 and cache-control knobs in the Ampere A100 GPU, so one can control data residency and pin resources in these caches manually.
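    For what it's worth, those A100 knobs are exposed in CUDA 11 as an L2 access-policy window. A minimal host-side sketch (buf, bytes and stream are assumed to already exist):

    Code:
        // Carve out a persisting region of L2, then hint that accesses to
        // [buf, buf + bytes) through this stream should stay resident in it.
        cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, bytes);

        cudaStreamAttrValue attr = {};
        attr.accessPolicyWindow.base_ptr  = buf;       // address range to pin
        attr.accessPolicyWindow.num_bytes = bytes;
        attr.accessPolicyWindow.hitRatio  = 1.0f;      // treat the whole window as persisting
        attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
        attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
        cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);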
     
    BRiT, PSman1700, DavidGraham and 2 others like this.
  11. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    68
    Likes Received:
    171
    Which leads into the following question -- does the RDNA2 L3 include any hooks for programmer-controlled data pinning, or is it purely transparent? FWIW I find cache data pinning a little awkward because it breaks abstraction. Scratchpads are a different beast because they are meant to be explicitly managed, but caches are supposed to be programmer-transparent.
     
  12. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    547
    Likes Received:
    862
    Well, data pinning is likely a requirement for good perf with producer-consumer pipelines in compute or machine learning. The graphics pipeline is itself a producer-consumer pipeline, and all the data goes through L2 at least, but I don't know how data residency is controlled there; it's certainly not exposed in the APIs.
     
  13. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,130
    Likes Received:
    510
    Ugh, should we tell him?
     
  14. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    68
    Likes Received:
    171
    If you have something useful to contribute maybe it’s better to do it directly instead of trying to create a grandiose mystique about yourself first.
     
    trinibwoy, nutball, Qesa and 4 others like this.
  15. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,130
    Likes Received:
    510
    That's the best part.
    Either way, Zen LLCs are all exclusive chunks sized per 4- or 8-core CCX, aka you get GORE in each and every actual LLC-bound workload.
    Everything has its tradeoffs and all.
     
  16. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    382
    Likes Received:
    345
    The Linux driver indicated that each entry in the GPU page table can be individually marked as "LLC No Alloc". The RDNA 2 ISA documentation indicates that image-resource (texture) descriptors use 2 bits to specify the LLC allocation policy, which apparently supports either a resource-wide override value or using the LLC No Alloc setting in the page table entry.
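    For illustration only, a hypothetical decode of that scheme; the names and encodings below are guesses, not from AMD's documentation:

    Code:
        enum llc_policy {
            LLC_FOLLOW_PTE    = 0,  // defer to the page table entry's "LLC No Alloc" bit
            LLC_FORCE_ALLOC   = 1,  // resource-wide override: allocate in the LLC
            LLC_FORCE_NOALLOC = 2,  // resource-wide override: bypass the LLC
        };

        static bool llc_allocates(enum llc_policy pol, bool pte_noalloc)
        {
            if (pol == LLC_FOLLOW_PTE)
                return !pte_noalloc;        // per-page setting wins
            return pol == LLC_FORCE_ALLOC;  // descriptor-wide override wins
        }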
     
    ToTTenTranz and Lightman like this.
  17. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,144
    Likes Received:
    7,108
    In the PCGamer interview, Scott Herkelman said that devs are looking into further optimizations to be made around Infinity Cache, and that they're getting (I think he said) "very interesting results".

    From there I assumed it could be directly accessed by game developers (I guess through proprietary extensions). He specifically said game developers, not driver developers.

    Over time, it'll be interesting to see whether IC implementations gain more performance like AMD's Radeon boss suggests, or whether it'll follow the AMD doom-and-glooming (growing datasets) from the usual and predictable suspects.
     
    Lightman likes this.
  18. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,242
    Likes Received:
    1,675
    Location:
    msk.ru/spb.ru
    You can optimize for cache size by managing your dataset - making it fit into the cache. If that wasn't clear to the usual and predictable suspects.
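    In miniature that's just classic blocking; a sketch under the assumption that BLOCK stands in for whatever footprint fits the LLC (up to 128 MB on Navi 21):

    Code:
        #include <cstddef>

        #define BLOCK (1u << 20)   // illustrative: 1M floats (4 MB) per block

        // Process a large array in cache-sized blocks: after the first pass
        // over a block, the remaining passes hit the cache instead of
        // streaming the whole dataset from DRAM every time.
        void process(float* data, size_t n, int passes)
        {
            for (size_t b = 0; b < n; b += BLOCK) {
                size_t end = (b + BLOCK < n) ? b + BLOCK : n;
                for (int p = 0; p < passes; ++p)
                    for (size_t i = b; i < end; ++i)
                        data[i] = data[i] * 0.5f + 1.0f;
            }
        }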
     
    DavidGraham and PSman1700 like this.
  19. fellix

    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,533
    Likes Received:
    492
    Location:
    Varna, Bulgaria
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,556
    Likes Received:
    4,727
    Location:
    Well within 3d
    I'm curious what is being measured for some of the graphs.
    The low end at ~16 KB would be where you'd expect the CU-level caches to hit.
    The L1 texturing latency for GCN was stated in an optimization presentation to be ~114 cycles, which for a 7950 wouldn't be ~40 ns.
    The ~27 ns for RDNA2 seems like it would be too low for standard clocks, although there was discussion of potential latency savings in RDNA when fetching non-filtered data.
    The L1 scalar cache could potentially have a lower cycle count, although in this instance it would have fewer ns.

    Also, the relative flatness of the latency curve in the 16 KB-32 KB range suggests there may be some other confounding factor, like some kind of prefetching or access-pattern optimization, since I don't think the CU-level caches go beyond 16 KB before a more substantial hit occurs.
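    For context, curves like these typically come from a pointer-chase sweep; a rough sketch of the device side (probe is a hypothetical kernel, launched with a single thread; clock64() granularity and any prefetch cleverness can flatten or shift the steps, which is the confound suggested above):

    Code:
        // Chase a random-permutation ring with footprint S and report average
        // cycles per hop. Sweeping S past each cache level's capacity steps
        // the measured latency up.
        __global__ void probe(const int* next, int steps, long long* cycles, int* sink)
        {
            int i = 0;
            long long t0 = clock64();
            for (int s = 0; s < steps; ++s)
                i = next[i];                     // dependent chain: one hop at a time
            *cycles = (clock64() - t0) / steps;
            *sink = i;                           // defeat dead-code elimination
        }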
     