AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    L2 (global memory) atomics are effectively a superset of ROP functions, but it could be argued that when a rasteriser generates work items, there is no need to invalidate L1 lines. Compute atomics must invalidate simply because any CU anywhere on the GPU could use an atomic on that same address, so all L1 lines that cache that L2 line need to be invalidated.

    128B cache lines are "relatively large" compared with render target pixels (at least simple 32-bit per pixel formats), and of course the alignment of rasterised fragments to cache lines is quite coarse and will generally suffer "quad misalignment". So it would seem best to talk about the bandwidth amplification of L1 in terms of cache lines rather than pixels.
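
    To put rough numbers on that, here is a minimal sketch. It assumes a plain linear (row-major) 32-bit render target, which real hardware does not use (render targets are tiled/swizzled), so treat it purely as an illustration of line granularity. The helper name lines_touched_by_quad and the 1920-pixel pitch are made up for the example.

```python
CACHE_LINE_BYTES = 128
BYTES_PER_PIXEL = 4  # simple 32-bit per pixel format

# How many pixels fit in one cache line.
PIXELS_PER_LINE = CACHE_LINE_BYTES // BYTES_PER_PIXEL


def lines_touched_by_quad(x, y, pitch_pixels):
    """Count distinct 128B lines touched by the 2x2 quad whose top-left
    pixel is (x, y). ASSUMPTION: linear row-major layout, not real tiling."""
    lines = set()
    for dy in (0, 1):
        for dx in (0, 1):
            byte_addr = ((y + dy) * pitch_pixels + (x + dx)) * BYTES_PER_PIXEL
            lines.add(byte_addr // CACHE_LINE_BYTES)
    return len(lines)


aligned = lines_touched_by_quad(0, 0, 1920)      # quad sits inside one line per row
straddling = lines_touched_by_quad(31, 0, 1920)  # quad crosses a line boundary in x
print(PIXELS_PER_LINE, aligned, straddling)
```

    Under those assumptions a quad carries 16 bytes of pixel data but pulls in two to four 128B lines, which is why counting lines rather than pixels seems the more honest unit.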
     
  2. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    13,878
    Likes Received:
    4,724


    A bit tongue in cheek.
     
    Lightman likes this.
  3. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    15,134
    Likes Received:
    7,679
    I guess this guy missed Gamers Nexus's previous video

     
    Cuthalu, Malo, pharma and 2 others like this.
  4. Cyan

    Cyan orange
    Legend

    Joined:
    Apr 24, 2007
    Messages:
    9,734
    Likes Received:
    3,460
  5. Pressure

    Veteran

    Joined:
    Mar 30, 2004
    Messages:
    1,655
    Likes Received:
    593
    Honestly, how many people own an 8K monitor or TV?

    It’s such a useless thing to be arguing over at this point.
     
    Cyan, Cuthalu, Lightman and 2 others like this.
  6. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    They may already be on blowout sale, but you can get 5600 XTs (which are basically 5700-ish class cards) for 250 € already. If it's with 8+ GByte and DXR hardware, then OK, fair point.
     
    Cyan likes this.
  7. Rurouni

    Veteran

    Joined:
    Sep 30, 2008
    Messages:
    1,101
    Likes Received:
    432
    What I want is around 2060 performance at RX 580 price. So far it has been a sidegrade if you want to upgrade but only have RX 580 money (even at launch price!).
     
  8. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    13,878
    Likes Received:
    4,724
    So would 4-8 GB of HBM2 alongside a 256-bit GDDR bus make sense for them?
     
  9. I guess it depends on how hard/easy it is to make 8x 32bit channels of GDDR6 coming out of an interposer.

    If they could be used effectively at the same time, 256bit + 1 HBM2e stack would make for some very interesting combinations, though.
    Cards with that chip could range from 336GB/s (e.g. cut-down 192bit 14Gbps GDDR6) all the way up to 922GB/s (256bit 16Gbps GDDR6 + 3.2Gbps HBM2e), or even more if HBM2e goes out of spec.

    Desktop midrange cards could use just GDDR6, then higher-end offerings could have HBM2e + GDDR6. They could also have premium mobile versions with only HBM2e for maximum power efficiency, and then they could have cheaper high performance mobile cards with 192-bit or 256-bit GDDR6.
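
    For anyone who wants to check the bandwidth figures, a quick sketch of the arithmetic follows. It assumes peak bandwidth is simply bus width times per-pin data rate, and that one HBM2e stack has a 1024-bit interface; the function name bandwidth_gb_s is just for illustration.

```python
def bandwidth_gb_s(bus_width_bits, gbps_per_pin):
    """Peak bandwidth in GB/s: pins * Gbit/s per pin / 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8


gddr6_cut_down = bandwidth_gb_s(192, 14)   # cut-down 192bit 14Gbps GDDR6
gddr6_full = bandwidth_gb_s(256, 16)       # 256bit 16Gbps GDDR6
hbm2e_stack = bandwidth_gb_s(1024, 3.2)    # ASSUMPTION: 1024-bit stack at 3.2Gbps
combined = gddr6_full + hbm2e_stack

print(gddr6_cut_down, gddr6_full, hbm2e_stack, round(combined))
```

    The 336GB/s low end and the roughly 922GB/s high end both fall out of this, with the HBM2e stack contributing a bit under half of the combined figure.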
     
    eastmen likes this.
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    Some recent LLVM changes have started fleshing out details on the BVH instructions for GFX1030.
    Among other things, some stubs for instruction errors dating back to June 2019 now have more context as to what they were referring to.
    https://github.com/llvm/llvm-project/commit/91f503c3af190e19974f8832871e363d232cd64c

    Code:
    image_bvh_intersect_ray v[4:7], v[9:24], s[4:7]
    // GFX10: encoding: [0x01,0x9f,0x98,0xf1,0x09,0x04,0x01,0x00]
    
    image_bvh_intersect_ray v[4:7], v[9:16], s[4:7] a16
    // GFX10: encoding: [0x01,0x9f,0x98,0xf1,0x09,0x04,0x01,0x40]
    
    image_bvh64_intersect_ray v[4:7], v[9:24], s[4:7]
    // GFX10: encoding: [0x01,0x9f,0x9c,0xf1,0x09,0x04,0x01,0x00]
    
    image_bvh64_intersect_ray v[4:7], v[9:24], s[4:7] a16
    // GFX10: encoding: [0x01,0x9f,0x9c,0xf1,0x09,0x04,0x01,0x40]
    
    image_bvh_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20, v19, v37, v40], s[12:15]
    // GFX10: encoding: [0x07,0x9f,0x98,0xf1,0x32,0x27,0x03,0x00,0x2e,0x17,0x11,0x10,0x0f,0x15,0x14,0x13,0x25,0x28,0x00,0x00]
    
    image_bvh_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20], s[12:15] a16
    // GFX10: encoding: [0x05,0x9f,0x98,0xf1,0x32,0x27,0x03,0x40,0x2e,0x17,0x11,0x10,0x0f,0x15,0x14,0x00]
    
    image_bvh64_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20, v19, v37, v40, v42], s[12:15]
    // GFX10: encoding: [0x07,0x9f,0x9c,0xf1,0x32,0x27,0x03,0x00,0x2e,0x17,0x11,0x10,0x0f,0x15,0x14,0x13,0x25,0x28,0x2a,0x00]
    
    image_bvh64_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20, v19], s[12:15] a16
    // GFX10: encoding: [0x05,0x9f,0x9c,0xf1,0x32,0x27,0x03,0x40,0x2e,0x17,0x11,0x10,0x0f,0x15,0x14,0x13]
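
    A small sketch for picking those byte strings apart. The field positions are inferred from the encodings quoted above rather than taken from ISA documentation, so treat them as assumptions: bits [2:1] of byte 0 as the count of extra address dwords (NSA), byte 4 as the first address VGPR, byte 5 as the first destination VGPR, byte 6 as the resource SGPR base in units of four, and 0x40 in byte 7 as the a16 flag. The function name decode_bvh_encoding is hypothetical.

```python
def decode_bvh_encoding(enc):
    """Extract a few fields from one encoding listed above.
    ASSUMPTION: field positions are guessed from the sample bytes."""
    nsa_dwords = (enc[0] >> 1) & 0x3       # extra address dwords after the first 8 bytes
    expected_len = 8 + 4 * nsa_dwords
    return {
        "nsa_dwords": nsa_dwords,
        "length_ok": len(enc) == expected_len,
        "a16": bool(enc[7] & 0x40),
        "vdata_first": enc[5],
        "srsrc_first": (enc[6] & 0x1F) * 4,
        # First address VGPR plus any extra address bytes; trailing zeros may be padding.
        "vaddr_bytes": [enc[4]] + list(enc[8:expected_len]),
    }


# image_bvh_intersect_ray v[4:7], v[9:24], s[4:7]
plain = [0x01, 0x9f, 0x98, 0xf1, 0x09, 0x04, 0x01, 0x00]
# image_bvh_intersect_ray v[39:42], [v50, v46, v23, v17, v16, v15, v21, v20], s[12:15] a16
nsa_a16 = [0x05, 0x9f, 0x98, 0xf1, 0x32, 0x27, 0x03, 0x40,
           0x2e, 0x17, 0x11, 0x10, 0x0f, 0x15, 0x14, 0x00]

print(decode_bvh_encoding(plain))
print(decode_bvh_encoding(nsa_a16))
```

    If the guesses hold, the second form lists its address registers one byte each (v50, v46, v23, ...), padded out to a dword boundary, while the first form only needs the base register v9.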
     
    Krteq, Lightman, Jawed and 2 others like this.
  11. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Wow I forgot the 580 was $220 at launch. I think there's a very good chance of seeing 2060 performance under $250.
     
  12. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
    You know, the easy answer to the "small" bus width is that RDNA2 just doubles the L2$ size per slice. That'd leave the CU count with exactly the same bus width and the same L2$ per CU. Makes more sense to me than any of the other proposals so far.

    Regardless, we can assume that somehow the CU to performance ratio has either stayed the same or improved, and with the PS5 clockspeeds, and the fact that 18Gbps GDDR6 is demonstrated but not known to be shipping anywhere, we can hazard a guess at the performance metrics. Note: All performance is highly dependent on title.

    SC (big): low expectations; performs anywhere from a 3080 to 15% faster than a 3090. high expectations; better than a 3080 on all tests; up to 30% faster than a 3090 (though the 3090 should be faster on id Tech titles, they really like throwing low bandwidth shaders at the screen it seems).

    flounder (mid-high): low expectations; performs between a 2080 Ti and a 3080. high expectations; performs anywhere from just below a 3080 up to a 3090.

    cavefish (mid-low): low expectations; performs the same as a 5700 XT, but you know, with raytracing and whatnot. high expectations; performs between a 2080 Super and a 2080 Ti.

    From there we can take a guess that big will cost $750+, mid-high will cost $600-800, and mid $300-450.
     
    Krteq, Lightman and NightAntilli like this.
  13. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
    Perhaps an even easier answer is that L2 cache slices simply need not be 1:1 to 16-bit GDDR6 channels?

    (or 4:1 if you prefer to express in terms of 64-bit memory controller blocks)

    GCN already broke the strict power-of-two arrangement with Tahiti and Tonga (HD 7970, R9 285), albeit only for render backends. This was followed by Xbox One X (Scorpio) having a 384-bit GDDR5 bus behind its 8 L2 cache slices. Many other products also do not operate at the 1:1 ratio (the HBM ones, and Renoir), albeit still under a power-of-two arrangement.
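
    A rough sketch of the ratios involved, in case it helps. It assumes 32-bit channels for GDDR5 and 16-bit channels for GDDR6, and the 16-slice figure for a 256-bit GDDR6 bus is just the 1:1 case described above; the function name slice_to_channel_ratio is made up for illustration.

```python
from math import gcd


def slice_to_channel_ratio(l2_slices, bus_width_bits, channel_width_bits):
    """Return (channel count, reduced slices:channels ratio)."""
    channels = bus_width_bits // channel_width_bits
    g = gcd(l2_slices, channels)
    return channels, (l2_slices // g, channels // g)


# 1:1 case: 256-bit GDDR6 bus, 16-bit channels, one L2 slice per channel.
one_to_one = slice_to_channel_ratio(16, 256, 16)
# Xbox One X (Scorpio): 8 L2 slices behind a 384-bit GDDR5 bus (32-bit channels assumed).
scorpio = slice_to_channel_ratio(8, 384, 32)

print(one_to_one, scorpio)
```

    With those assumptions Scorpio works out to 12 channels and a 2:3 slice-to-channel ratio, so the mapping cannot be a straight one-slice-per-channel wiring.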
     
    #3293 pTmdfx, Sep 26, 2020
    Last edited: Sep 26, 2020
    Frenetic Pony likes this.
  14. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    No, but being on a 1:1 saves you one very complex crossbar in design.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    So this is an old and perhaps relevant thread, from 2015:

    https://forum.beyond3d.com/threads/gpu-cache-sizes-and-architectures.56731/

    And relevant posting by sebbbi in 2013:

    Tahiti saw benefits from this "fully-tiled" benchmark and we now have GPUs with ~6x more fillrate, but only ~3x more bandwidth (RTX 3090, of course, has tiled rasterisation - but that wouldn't be relevant to sebbbi's HDR particles test).

    Still, we don't know how RDNA works and whether it uses the ROP depth/colour cache that we saw in GCN.

    You know, B3D forum is epic:

    128MB L4 :)
     
    Kej, LordEC911, Pete and 4 others like this.
  16. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
    Complexity is relative. In absolute terms, that is four 2x3 crossbars for Xbox One X (Scorpio), and presumably the same for Tahiti/Tonga’s RBEs. It is simpler in switching complexity than an Infinity Fabric router node, IIRC routing maximally 5x5 (mesh), and we have these IF nodes everywhere in AMD’s SoCs.

    Bespoke crossbars are also not the only option. One could make use of the existing Infinity Fabric mesh-like setup (introduced in Vega), which likely already has the per-node switching capacity (L2/GMC, DCT, two neighbours), if not a wider data bus.
     
    #3296 pTmdfx, Sep 26, 2020
    Last edited: Sep 26, 2020
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Efficient L2 Cache Management to Boost GPGPU Performance

    This paper, from 2019, is a direct study of GCN, which improves an existing simulator (to achieve substantially closer simulated performance versus actual chip performance) and then goes on to propose a new cache architecture:

     
    Kej, LordEC911, Love_In_Rio and 7 others like this.
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    I think there's a reason why IdF is in addition to IcF and dedicated memory buses in AMD's chips. What's the current transfer rate of IF in the server-grade IOD?
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    https://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3

    The focus here, for graphics, appears to be purely textures.

    On 22nm:

    https://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/4

    20mm² on TSMC 7nm for 128MB?
     
    Lightman likes this.
  20. itsmydamnation

    Veteran

    Joined:
    Apr 29, 2007
    Messages:
    1,349
    Likes Received:
    470
    Location:
    Australia
    Looks like the rumors from the banned man were right.

    40cu, 2.5ghz clock, lower power than the 5700 XT.... That's a nice boulder.....

    80cu at 2.2 is impressive as well

    32cu on 128bit bus, will be interesting to see
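
    For scale, a back-of-envelope sketch of what those rumoured configs would mean in peak FP32 terms. It assumes RDNA2 keeps 64 FP32 lanes per CU like RDNA1 and counts an FMA as 2 FLOPs; the clocks are the rumoured ones, not confirmed, and the function name peak_fp32_tflops is just for illustration.

```python
LANES_PER_CU = 64   # ASSUMPTION: same as RDNA1
FLOPS_PER_LANE = 2  # one fused multiply-add per clock


def peak_fp32_tflops(cus, clock_ghz):
    """Theoretical peak FP32 throughput in TFLOPS."""
    return cus * LANES_PER_CU * FLOPS_PER_LANE * clock_ghz / 1000


cu40 = peak_fp32_tflops(40, 2.5)
cu80 = peak_fp32_tflops(80, 2.2)

print(round(cu40, 2), round(cu80, 2))
```

    That is about 12.8 TFLOPS for the 40cu part and about 22.5 TFLOPS for the 80cu part, peak numbers only and no indication of game performance.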

     
    Cat Merc, Cyan, Pete and 5 others like this.