AMD Radeon RDNA2 Navi (RX 6800, 6800 XT, 6900 XT) [2020-10-28]

Discussion in 'Architecture and Products' started by BRiT, Oct 28, 2020.

  1. marifire

    Newcomer

    Joined:
    May 13, 2007
    Messages:
    46
    Likes Received:
    41
    Wasn't it 2-slot cooling for the 6800 and 2.5-slot for the 6800 XT/6900 XT?
     
  2. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    217
    Likes Received:
    239
    I looked again at the presentation and there is no statement of which boards use 2.5-slot cooling and which use 2-slot cooling.
     
    Lightman likes this.
  3. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,057
    Likes Received:
    1,241
    From the looks of it, this RT test scene has only about 20 BVH nodes, each using custom intersection code for some SDF primitives but no triangles at all.
    So this SDK sample does not tell us anything about RT performance in practice.
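
    For reference, "custom intersection code for SDF primitives" means roughly this kind of thing: instead of triangle tests, the procedural-primitive hit routine sphere-traces the ray against a signed distance function. A minimal sketch in Python (the sphere SDF, step count and epsilon are purely illustrative, not what the SDK sample actually does):

    # Sphere-trace a ray against a signed distance function instead of
    # testing triangles.  Everything here is an illustrative stand-in.
    def sdf_sphere(p, center=(0.0, 0.0, 0.0), radius=1.0):
        return sum((p[i] - center[i]) ** 2 for i in range(3)) ** 0.5 - radius

    def intersect_sdf(origin, direction, t_max=100.0, eps=1e-4, max_steps=64):
        t = 0.0
        for _ in range(max_steps):
            p = tuple(origin[i] + t * direction[i] for i in range(3))
            d = sdf_sphere(p)
            if d < eps:          # close enough to the surface: report a hit
                return t
            t += d               # safe step: the SDF bounds the distance to the surface
            if t > t_max:        # left the ray's valid interval: miss
                break
        return None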
     
    Lightman likes this.
  4. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    4,115
    Likes Received:
    3,239
    Food for thought ...


    https://www.3dcenter.org/news/hardware-und-nachrichten-links-des-28-oktober-2020
     
    #284 pharma, Oct 29, 2020
    Last edited: Oct 29, 2020
  5. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    611
    Likes Received:
    357
    Based on everything they have said about the L3 cache, and all the leaks, I think the Infinity Cache consists of 16 slices of 8 MB, each with a 512-bit interface. The easiest, most sensible way of varying its size is to increase or decrease the number of slices. Since the driver will definitely need to know how much cache is on the card, I think it's a good guess that they would include this as a new property in the drivers. Looking at the macOS driver properties, there are three new ones: unknown2 seems to be the total CU count, unknown0 doesn't fit, but unknown1 is 16 on Navi 21, which is exactly right.

    Based entirely on this, I predict that the 40-CU Navi 22 will have 96 MB of Infinity Cache, that the 32-CU Navi 23 will have 64 MB, and that the upcoming APUs will both have 32 MB.

    The APUs sound about right, assuming they are targeting 1080p gaming (framebuffers scale linearly with resolution, and 1080p is a quarter of 4K). The middle chips are a bit weird. They definitely can't do 4K (not just because of the cache, but also compute power), but if you are targeting 2560x1440, you don't really even need 64 MB. Maybe the number of slices is high for the bandwidth, not for the cache amount?
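
    A quick back-of-the-envelope in Python for that slice scaling (the 8 MB slice size and the per-chip slice counts are my guesses from the leaks, not confirmed figures):

    # Guess: Infinity Cache is built from 8 MB slices; capacity scales with slice count.
    SLICE_MB = 8
    configs = {
        "Navi 21 (80 CU)": 16,   # 16 slices -> 128 MB (announced)
        "Navi 22 (40 CU)": 12,   # 12 slices ->  96 MB (prediction)
        "Navi 23 (32 CU)":  8,   #  8 slices ->  64 MB (prediction)
        "APU":              4,   #  4 slices ->  32 MB (prediction)
    }
    for name, slices in configs.items():
        print(f"{name}: {slices * SLICE_MB} MB Infinity Cache")

    # Resolution intuition: framebuffer data scales with pixel count,
    # and 1080p is exactly a quarter of 4K.
    print(1920 * 1080 / (3840 * 2160))   # 0.25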
     
  6. Leoneazzurro5

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    217
    Likes Received:
    239
    Pete and Tarkin1977 like this.
  7. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,926
    Likes Received:
    901
    Location:
    Somewhere over the ocean
    Because AMD has always contributed to public standards instead of using its market position to force proprietary technologies, and Smart Access Memory just increases FPS a little and nothing more.
     
    Lightman likes this.
  8. SimBy

    Regular Newcomer

    Joined:
    Jun 21, 2008
    Messages:
    700
    Likes Received:
    391
    Lightman likes this.
  9. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    808
    Likes Received:
    276
    Yeah, it's not anything proprietary, just a slight performance increase similar to SmartShift in laptops. And AMD is obviously trying to incentivize people to buy their own CPU+GPU combinations; nothing wrong with that. The competition is free to do something of their own.
     
    RedVi, Pete, PSman1700 and 1 other person like this.
  10. yuri

    Regular Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    263
    Likes Received:
    270
    This is really exciting. Rembrandt is a 2021/2022 product, but Van Gogh should be revealed soon!
     
    Lightman likes this.
  11. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    37
    Location:
    Herwood, Tampere, Finland
    Unknown0 seems to be the same as DCUs per SE.

    They have enough compute power for 4K if you don't play the newest games and don't want more than 60 FPS. Most 4K displays are still 60 Hz. I played some (non-FPS) games at 4K even on a low-end RX 460.

    But they are optimized for 1440p (to also play new games at a solid 60 FPS), not for 4K. So for these it's acceptable/normal that the cache capacity becomes a bottleneck at 4K.

    The pixel count of 1440p is 44% of the pixel count of 4K, so if 128 MB is the sweet spot for pixel-related data at 4K, then 64 MiB should be about the sweet spot for 1440p.

    But there is one major consumer of memory that does not scale with resolution: the geometry data and the BVH tree required for ray tracing, which easily consume a couple of tens of megabytes.

    We really want to fit most of the BVH tree into the cache because traversing the tree is very latency-critical. That's why the "optimal" cache size will not scale linearly with the pixel count; instead we want a cache capacity like (typical geometry + BVH size) + (C * target_pixel_count).
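
    As a worked example of that capacity model (the 32 MB fixed geometry + BVH figure is just a placeholder for the "couple of tens of megabytes" above, and C is backed out from assuming 128 MB is the 4K sweet spot):

    # cache_size ~= (geometry + BVH) + C * pixel_count
    MB = 1024 * 1024
    fixed_mb = 32                        # assumed geometry + BVH working set
    px_4k    = 3840 * 2160
    px_1440p = 2560 * 1440

    # bytes of pixel-related data per pixel implied by a 128 MB sweet spot at 4K
    C = (128 * MB - fixed_mb * MB) / px_4k

    sweet_1440p = (fixed_mb * MB + C * px_1440p) / MB
    print(f"1440p pixel share of 4K: {px_1440p / px_4k:.0%}")     # 44%
    print(f"implied 1440p sweet spot: ~{sweet_1440p:.0f} MB")     # ~75 MB, more than 44% of 128 MB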
     
    Lightman and T2098 like this.
  12. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,866
    Likes Received:
    6,807
    From AMD's bandwidth numbers, we arrived at the conclusion that the GPU LLC is 4096 bits wide.

    4096 bit * 2.25 GHz = 9216 Gbit/s = 1152 GB/s. 1152 GB/s from the LLC + 512 GB/s from GDDR6 = 1664 GB/s, which is how much bandwidth AMD is claiming on their slides.
    Your suggestion of 16 slices at 512 bits each would be 8192 bits wide in total. I guess it could be that too, but then the LLC would need to clock at half the core clock.

    Or perhaps we're looking at 8 slices, one for every 10 CUs / 5 WGPs in Navi 21.
    However, it doesn't look like the cache amount is tied to the CUs, considering even the 6800, with 25% of its CUs deactivated, gets access to the full 128 MB of LLC.


    I wonder if SAM is only effective when the GPU is connected through a PCIe 4.0 x16 bus, in which case an Intel implementation would need a Tiger Lake, Rocket Lake or later CPU.
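
    The arithmetic, as a quick sketch (the 4096-bit width and the 2.25 GHz clock are the assumptions stated above, not confirmed specs):

    llc_bits  = 4096                     # assumed LLC interface width
    llc_clk   = 2.25e9                   # Hz, assumed cache clock
    gddr6_gbs = 512                      # GB/s from 256-bit GDDR6 @ 16 Gbps

    llc_gbs = llc_bits * llc_clk / 8 / 1e9       # bits/s -> GB/s
    print(llc_gbs, llc_gbs + gddr6_gbs)          # 1152.0 1664.0

    # The 16 x 512-bit alternative (8192 bits) only lands on the same
    # 1152 GB/s if the cache runs at roughly half the core clock.
    print(8192 * (llc_clk / 2) / 8 / 1e9)        # 1152.0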
     
    Lightman likes this.
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,272
    Likes Received:
    1,529
    Location:
    London
    Lightman, Pete, PSman1700 and 4 others like this.
  14. Nebuchadnezzar

    Legend Veteran

    Joined:
    Feb 10, 2002
    Messages:
    1,034
    Likes Received:
    272
    Location:
    Luxembourg
    The Zen2 L3 saturates at 667 GB/s on my 3700X at around 4.3 GHz; based on that, it's around 256 bits per slice, at 4 MB per slice. 16 * 8 MB seems reasonable, as that would match the slice size of Zen3 and give you a 32B cache line.
     
    Kej, Lightman, Silent_Buddha and 2 others like this.
  15. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    829
    Likes Received:
    478

    The L3 cache is described in one of the slides' footnotes, as was mentioned here before:
    "Measurement calculated by AMD engineering, on a Radeon RX 6000 series card with 128 MB AMD Infinity Cache and 256-bit GDDR6. Measuring 4k gaming average AMD Infinity Cache hit rates of 58% across top gaming titles, multiplied by theoretical peak bandwidth from the 16 64B AMD Infinity Fabric channels connecting the Cache to the Graphics Engine at boost frequency of up to 1.94 GHz. RX-535"

    So the bus to the cache is 16 x 512 bit at 1.94 GHz, for roughly 2 TB/s of peak bandwidth.
    This L3 cache seems to work completely transparently, caching reused pages with a 58% hit rate. That must be data reused within one frame. They could be pinning the frame buffer to the L3 cache, but I guess that would not even be needed, or would be counterproductive.
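
    Plugging the footnote's numbers into a quick sketch (reading "hit rate multiplied by theoretical peak bandwidth" as the way AMD arrives at its effective figure):

    channels  = 16          # Infinity Fabric channels, per the footnote
    bytes_per = 64          # 64B per channel, per the footnote
    clk_ghz   = 1.94        # boost frequency, per the footnote
    hit_rate  = 0.58        # measured average 4K hit rate, per the footnote
    gddr6_gbs = 512         # 256-bit GDDR6

    peak_gbs      = channels * bytes_per * clk_ghz   # ~1986.6 GB/s peak
    effective_gbs = peak_gbs * hit_rate              # ~1152 GB/s effective
    print(peak_gbs, effective_gbs, effective_gbs + gddr6_gbs)
    # ~1986.6  ~1152.2  ~1664.2 -> lines up with the 1152/1664 GB/s figures discussed above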
     
    Kej, Lightman, Pete and 4 others like this.
  16. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    611
    Likes Received:
    357
    Thank you. I was certain I had read that yesterday before going to sleep, but had no idea where it was and was looking for it to source my previous statement.


    That would somehow be extremely surprising and not surprising at all at the same time. I think there have to be ways to get better utilization than a simple LRU policy like that, but at the same time those would easily get very complex.

    I think at 4K they would almost certainly want to pin the framebuffer, because otherwise it and the texture data would just flush everything every frame, leaving only the benefit of locality in address space without any benefit from locality in time.
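
    A toy illustration of that worry (the cache and working-set sizes are arbitrary, and this is plain LRU, not a claim about what the hardware actually does):

    from collections import OrderedDict

    # If the data streamed per frame exceeds the cache, plain LRU evicts
    # everything before the next frame can reuse it.
    class LRU:
        def __init__(self, capacity):
            self.cap, self.d, self.hits, self.misses = capacity, OrderedDict(), 0, 0
        def access(self, key):
            if key in self.d:
                self.d.move_to_end(key)
                self.hits += 1
            else:
                self.misses += 1
                self.d[key] = None
                if len(self.d) > self.cap:
                    self.d.popitem(last=False)   # evict least recently used

    cache = LRU(capacity=128)            # think: 128 cache-sized chunks
    frame = list(range(200))             # per-frame working set: 200 chunks
    for _ in range(3):                   # the same data is touched every frame
        for chunk in frame:
            cache.access(chunk)
    print(cache.hits, cache.misses)      # 0 hits: no frame-to-frame reuse at all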
     
  17. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    611
    Likes Received:
    357
    Zen cache lines are 64B, like every cache line in any remotely recent x86 CPU. (They have to keep it that way, as too many programmers have come to depend on that size, so it is now essentially part of the living x86 spec.)

    The L3 in Zen serves a single line over two cycles.
     
    Lightman likes this.
  18. dskneo

    Regular

    Joined:
    Jul 25, 2005
    Messages:
    608
    Likes Received:
    149
  19. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
    RDNA has 128B cache lines, btw.

    Infinity Cache sounds very likely to be a proper hardware cache that is transparent to the IPs, not a software-managed memory pool like HBCC.

    So it shouldn't need to fit a whole render target/buffer to be effective. Pixel shader dispatch, for instance, is already locality-aware since the introduction of the DSBR.

    For BVH though, the farther you get from the root, the lower the reuse rate is by nature. So I feel that for a large BVH that doesn't fit in the cache, it's more likely that only several levels of the tree would be cacheable, speeding up the first few iterations of traversal.
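
    A rough illustration of that (the branching factor, node size and the idea of giving the whole 128 MB to the BVH are all assumptions for the sake of the example):

    # Node count grows geometrically per BVH level, so a fixed budget only
    # ever covers the first several levels of a large tree.
    branching  = 4                       # assumed wide-BVH branching factor
    node_bytes = 64                      # assumed node size
    budget     = 128 * 1024 * 1024       # pretend the whole cache holds BVH

    total, level = 0, 0
    while True:
        level_bytes = (branching ** level) * node_bytes
        if total + level_bytes > budget:
            break
        total += level_bytes
        level += 1

    print(f"levels that fit entirely: {level}, using {total / 2**20:.1f} MB")
    # -> 11 levels (~85 MB); a BVH over millions of primitives is far deeper.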
     
    Lightman and Nebuchadnezzar like this.
  20. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    829
    Likes Received:
    478
    I'm not convinced; reading a 4K frame buffer at 100 FPS requires only about 3.2 GB/s, and it could be even less with compression.
    GPUs render the frame buffer in a tiled fashion, frequently reusing tiles and preventing them from being swapped out of the cache by other data.
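
    Roughly where that figure comes from (assuming an uncompressed RGBA8 target):

    width, height   = 3840, 2160
    bytes_per_pixel = 4                  # uncompressed RGBA8 assumption
    fps             = 100

    gb_per_s = width * height * bytes_per_pixel * fps / 1e9
    print(f"{gb_per_s:.2f} GB/s")        # ~3.32 GB/s, and less again with DCC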
     