AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. Digidi

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    417
    Likes Received:
    237
    trinibwoy likes this.
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,495
    Likes Received:
    1,853
    Location:
    London
    So the diagrams in this document are interesting because the outputs of the Hull Shader and Tessellator become synchronisation/data-transfer workloads between two APDs (accelerated processing devices). This is the "gnarly" part of a multi-chiplet architecture, because downstream function blocks determine which chiplet will consume which chunks of the output produced by the HS or TS. The routing of the work is "late": screen space is used to determine workload apportionment.

    The synchronisation/data-transfer tasks require queues ("FIFO" in the document), which is where an L3 (Infinity Cache) would come in. The document locates the FIFOs within each APD, but if there's a monster chunk of L3, let's say 512MB, shared by two chiplets, that would seem to be a preferable place. Dedicated memory blocks within each chiplet would waste die space whenever tessellation is not being used, whereas FIFOs allocated out of a shared L3 would not.

    AMD has always struggled with the stream-out functionality of the geometry shader compared with NVidia. NVidia handled SO better with on-chip buffers (cache), whereas AMD always chose to use off-chip memory (over time, AMD's drivers messed about with GS parameters to try to avoid the worst problems associated with the volume of data produced by GS). Similarly, tessellation has always caused AMD problems because on-die buffering and work distribution were very limited. In the end, SO buffering is effectively no different from the FIFOs that are required to support HS and TS work distribution. So whether we're talking about a single die or chiplets, fast, close memory is a key part of the solution.

    So if AMD is to solve the FIFO problem properly, it will need to use a decent chunk of on-package memory.
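    To make that routing problem concrete, here's a minimal Python sketch of "late", screen-space-based routing of tessellator output into per-chiplet FIFOs. It is purely illustrative: the two-chiplet split down the middle of the screen, the deque-backed FIFOs and all of the names are my own assumptions, not anything taken from the patent.

    from collections import deque
    from dataclasses import dataclass

    NUM_CHIPLETS = 2
    SCREEN_WIDTH = 3840

    @dataclass
    class Triangle:
        # post-tessellation triangle; only the projected screen-space x coordinates matter here
        screen_x: tuple

    # one FIFO per chiplet; the patent locates these inside each APD, the speculation
    # above is that a big shared L3 could back them instead
    fifos = [deque() for _ in range(NUM_CHIPLETS)]

    def route_tessellator_output(tri: Triangle) -> None:
        # "late" routing: the consuming chiplet is only known once the triangle's
        # screen-space footprint is known, i.e. after HS/TS have already run
        centroid_x = sum(tri.screen_x) / 3
        chiplet = int(centroid_x // (SCREEN_WIDTH / NUM_CHIPLETS))
        fifos[min(chiplet, NUM_CHIPLETS - 1)].append(tri)

    # toy usage: two triangles landing on opposite halves of the screen
    route_tessellator_output(Triangle(screen_x=(100, 200, 150)))
    route_tessellator_output(Triangle(screen_x=(3000, 3100, 3050)))
    print([len(f) for f in fifos])  # -> [1, 1]

    The point is simply that which FIFO a triangle lands in isn't known until after tessellation, which is why the buffering has to live somewhere both chiplets can reach cheaply.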

    Similarly the "tile-binned" rasterisation in:

    https://www.freepatentsonline.com/10957094.html

    would seem to depend upon an L3. We've seen from NVidia's tile-binned rasterisation that the count of triangles/vertices that can be cached on die varies according to the count of attributes associated with each vertex (and the per-pixel count of bytes defined by the format of the render target). I don't think we've ever really seen a performance degradation analysis for NVidia in games according to the per-vertex/-pixel data load, but as time has gone by it appears NVidia has substantially increased the size of on-die buffers to support tile-binned rasterisation.
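    As a back-of-the-envelope illustration of that trade-off, here's a tiny Python calculation of how many binned vertices fit in an on-die buffer as the per-vertex attribute load grows. The 2MB buffer size and the attribute counts are made-up example numbers, not anything measured on NVidia or AMD hardware.

    def binned_vertex_capacity(buffer_bytes: int, attributes_per_vertex: int,
                               bytes_per_attribute: int = 16) -> int:
        # position (one vec4) plus N general-purpose vec4 attributes per vertex
        bytes_per_vertex = (1 + attributes_per_vertex) * bytes_per_attribute
        return buffer_bytes // bytes_per_vertex

    BUFFER = 2 * 1024 * 1024  # hypothetical 2MB binning buffer
    for attrs in (1, 4, 8, 16):
        print(attrs, "attributes ->", binned_vertex_capacity(BUFFER, attrs), "vertices per bin pass")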

    It seems to me that a monster Infinity Cache lies at the heart of these algorithms for AMD. Well, I imagine that comes across as "stating the obvious", but NVidia has been using a reasonably large on-die cache for quite a long time so it's time AMD caught up.

    In theory, with RDNA 2, Infinity Cache is already making tessellation work better. But I don't remember seeing any analysis.

    Stupid question time: I can't find a speculation thread for NVidia GPUs after Ampere. Is there one?
     
    Lightman, Rootax and Digidi like this.
  3. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    153
    Likes Received:
    380
    A couple of new patents presumably for plumbing operations across multi dies


    https://www.freepatentsonline.com/y2021/0192672.html

    20210192672: CROSS GPU SCHEDULING OF DEPENDENT PROCESSES
    [patent figure]

    https://www.freepatentsonline.com/y2021/0191890.html

    20210191890: SYSTEM DIRECT MEMORY ACCESS ENGINE OFFLOAD
    [patent figure]
     
    Krteq, Newguy, Lightman and 3 others like this.
  4. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    153
    Likes Received:
    380
    Another patent application for concurrent traversal of the BVH tree
    https://www.freepatentsonline.com/y2021/0209832.html



    [patent figures]
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,495
    Likes Received:
    1,853
    Location:
    London
    It seems to me that this model of concurrent traversal is possible on RDNA2.

    The limitation appears to be the size of the "collection" of nodes to be queried as the BVH is traversed: the collection might be limited to 1024 nodes, for example. The collection is merely a set of IDs - the nodes themselves, as they are retrieved, can be analysed and then discarded once all the decisions for every ray have been derived.

    The document is quite explicit in saying that a massive count of parallel ray queries is preferable, since they will, in aggregate, hide the latency of BVH fetch.

    For a developer with the opportunity to write a custom traversal kernel, it appears possible to combine this concurrency with multiple queries per work item.
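    To illustrate the "collection" idea, here's a minimal Python sketch of a shared traversal loop in which only node IDs are retained and each fetched node is discarded once every ray has made its decision for it. The 1024-entry cap, the node layout and the intersects() callback are placeholder assumptions for illustration, not the patent's actual scheme.

    MAX_COLLECTION = 1024  # assumed cap on the number of node IDs tracked at once

    def traverse(bvh: dict, rays: list, intersects) -> list:
        # bvh: node_id -> {"children": [child ids], "prims": [primitive ids]}
        # intersects(ray, node_id) -> bool stands in for the box/triangle test
        hits = [[] for _ in rays]
        collection = [0]  # start at the root; only IDs are kept, never node data
        while collection:
            node_id = collection.pop()
            node = bvh[node_id]  # fetch the node...
            wanted_by_any_ray = False
            for i, ray in enumerate(rays):
                if intersects(ray, node_id):
                    wanted_by_any_ray = True
                    if not node["children"]:
                        hits[i].extend(node["prims"])
            if wanted_by_any_ray:
                for child in node["children"]:
                    if len(collection) < MAX_COLLECTION:
                        collection.append(child)
            # ...the node itself is discarded here; only the ID collection persists
        return hits

    # toy usage: two-level BVH, every ray "hits" everything
    toy_bvh = {0: {"children": [1, 2], "prims": []},
               1: {"children": [], "prims": ["tri_a"]},
               2: {"children": [], "prims": ["tri_b"]}}
    print(traverse(toy_bvh, rays=["r0", "r1"], intersects=lambda ray, nid: True))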
     
  6. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,796
    Likes Received:
    7,808
    Some stuff from known leakers has been appearing about RDNA3, most of it confirming what @Bondrewd has been saying.
    Wccftech made a compilation of the tweets, but I'll leave them here.

    [embedded tweets]

    Navi 31 really does look like a monster on the larger SKU.
    Are we looking at chiplets with 90CUs or more? 256MB of Infinity Cache per chiplet?

    Or maybe it's the same 128MB per chiplet, but with an optional 128MB V-cache underneath. The 120CU SKU has no V-cache, but the fully enabled 180CU version does.

    And in the middle of it all, 256-bit GDDR6 sounds almost inadequate... except of course for the massive cache amounts.

    Regardless, these are exciting times ahead!
     
    Lightman, Jawed and Man from Atlantis like this.
  7. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,577
    Likes Received:
    764
    120 per GCD, but they're 30WGP and you should count them as such.
    No, those are discrete blobs attached to the MCDs.
    Well, technically yes, but magic abounds here.
     
    Lightman and Jawed like this.
  8. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,796
    Likes Received:
    7,808
    Each WGP has 4 CUs in RDNA3?
    So the 2*GCD part has 180 CUs but could actually go up to 240?


    So there's no LLC in the graphics core dies?
     
  9. PSman1700

    Legend Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    5,671
    Likes Received:
    2,482
    Sounds like the RDNA3 parts will be true monsters indeed.
     
  10. Leoneazzurro5

    Regular Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    305
    Likes Received:
    326
    If RDNA3 follows the patent we saw some time ago, the IC is on the package and also acts as a high-bandwidth interconnect between the modules (which are seen as a single GPU; load-balancing signals should be passed in the same way). Of course there would be quite some cache on the dies too, but it would not be "LLC".
     
    pjbliverpool and Lightman like this.
  11. Lurkmass

    Regular Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    452
    Likes Received:
    518
    A small bus width with a large LLC can make for a reasonable HW design. Mobile GPUs proved that we can optimize deferred renderers by storing a small slice of the g-buffer in tile memory, but there's a penalty to be paid for doing full-screen passes since they flush this memory. With a large LLC we can store our entire g-buffer in that memory and we don't have to worry about this penalty, which is incidentally compatible with the way IMR GPUs operate ...

    We are possibly so close to bringing back EQAA/MSAA for deferred renderers or we could afford to store more parameters in the g-buffer to enable more complex materials/shaders if these rumors hold ... (register pressure could very well be a thing of the past depending on the layout of the g-buffer)
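    To put a rough number on "our entire g-buffer in this memory", here's a quick Python estimate of g-buffer footprint against a large LLC. The five-target layout (24 bytes per pixel) and the 512MB cache figure are assumptions for illustration only.

    def gbuffer_bytes(width: int, height: int, bytes_per_pixel_per_rt) -> int:
        # total footprint of all render targets at the given resolution
        return width * height * sum(bytes_per_pixel_per_rt)

    # hypothetical layout: albedo (4B), normals (8B), material params (4B),
    # motion vectors (4B), depth/stencil (4B) = 24 bytes per pixel
    layout = [4, 8, 4, 4, 4]

    for w, h in ((1920, 1080), (2560, 1440), (3840, 2160)):
        mb = gbuffer_bytes(w, h, layout) / 2**20
        print(f"{w}x{h}: ~{mb:.0f} MB of g-buffer vs a rumoured 512 MB LLC")

    Even at 4K that layout is around 190MB, so a cache of that rumoured size would hold the whole thing with room to spare.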
     
    Lightman and BRiT like this.
  12. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,577
    Likes Received:
    764
    There's no "CU" anymore.
    Just WGP.
    240 the old ways?
    I think.
    None of, yes.
     
    pjbliverpool, Lightman and Jawed like this.
  13. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,911
    Likes Received:
    17,289
    Location:
    The North
    What does the GCD acronym stand for?
     
  14. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,796
    Likes Received:
    7,808
    Graphics Compute Die?
    Graphics Core Die?
     
    Lightman and iroboto like this.
  15. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,911
    Likes Received:
    17,289
    Location:
    The North
    so the official term for a gpu chiplet then
     
    Lightman likes this.
  16. Leoneazzurro5

    Regular Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    305
    Likes Received:
    326
    Kej, w0lfram, Lightman and 7 others like this.
  17. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    13,911
    Likes Received:
    17,289
    Location:
    The North
    Exciting times, really. I recall the day chiplets came to CPUs; shortly after, leadership positions reversed on price/performance against Intel.
    I have really high expectations here for something similar given their past history; combined with 3D-stacked cache, it's going to have significant price/performance.
    Curious to see if it can come to APU form factors in the future.
     
  18. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,577
    Likes Received:
    764
    nnnnnnnnnope.
    Not in a thousand years.
    MCP GPUs are a win more setup.
    You pay $2500 and you get more!
    Yeah, later (think late'23 timeline and all).
     
  19. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    153
    Likes Received:
    380
    Greymon55 is basically Broly_X1.
    Broly_X1 had to delete his posts for certain reasons, but that guy nailed everything.
    While it's exciting for outsiders to get a sneak peek, as someone who also works with a lot of NDA tech it is also a bit concerning.
     
    pjbliverpool, Lightman, BRiT and 2 others like this.
  20. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    153
    Likes Received:
    380
    It should be on the MCD.
    I imagine AMD would take the best of both worlds: an N5P GCD for absolute logic density and performance, and an N6 MCD with HD/SRAM-optimized libraries for a lower cost per MB of IC.
    The N5P SRAM density gain over N7/N6 is very mediocre.
    512MB of SRAM on N6 with optimized libraries would only be 280-300mm² (figures estimated from WikiChip data, behind a paywall). On N5 it's hardly any better, around 250+mm², but much costlier.
    But all those logic blocks can scale very well, almost 1.48x with N5P (assuming AMD goes with N5P for GPUs, else 1.85x on plain N5).
    I suppose 2x GCD + 1x MCD would be closing in on around 1000mm², or maybe even more. It will cost a pretty penny.
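    As a quick sanity check on those area figures, a tiny Python calculation; the effective SRAM densities used here are simply back-calculated from the 280-300mm² and ~250mm² numbers above, not independent data.

    def sram_area_mm2(capacity_mb: int, density_mb_per_mm2: float) -> float:
        # effective macro density, periphery/overhead already folded in
        return capacity_mb / density_mb_per_mm2

    CAPACITY = 512  # MB of Infinity Cache
    for node, density in (("N6 + HD libraries", 1.75), ("N5", 2.05)):
        print(f"{node}: ~{sram_area_mm2(CAPACITY, density):.0f} mm2 for {CAPACITY} MB")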

    I don't know if @Bondrewd can give a hint whether it's N5 or N5P.
     
    T2098, Lightman and BRiT like this.