AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,065
    Likes Received:
    7,026

    Each chiplet seems to be able to function in a fully autonomous way. They all have their own memory PHY, set of fixed function blocks (probably video codecs?), and they communicate with each other through the L3.

    To me this looks a bit like first-gen Zen, but in a GPU. It also means a lot of space could be wasted, since a bunch of fixed-function blocks would be replicated and useless on all chiplets but one. Though if they could get the video codecs to work in parallel it would be awesome, especially if more chiplets = higher resolution or higher framerate encoding.
     
    Lightman likes this.
  2. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    2,047
    Likes Received:
    1,477
    Location:
    France
    Won't you waste space with MCM solutions? I mean, you still need the transistors for compute functions etc., but now you also need more external buses so the multiple parts of the chip can communicate with each other? I'm sure I'm missing something here. Or is it "just" about yield?
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,286
    Likes Received:
    1,551
    Location:
    London
    In the end, with Ryzen, it was about going way past the conventional reticle limit in terms of transistor count. It also helped yield.
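
For anyone wondering how much die size matters for yield, here is a minimal sketch using the simple Poisson yield model, Y = exp(-A * D0). The defect density and the die areas are made-up illustrative numbers, not figures for any real process or product, and real foundry yield models are more involved.

```python
import math

def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Expected fraction of defect-free dies under a simple Poisson model."""
    return math.exp(-area_mm2 * defects_per_mm2)

# Hypothetical defect density: 0.1 defects per cm^2 = 0.001 per mm^2.
D0 = 0.001

# Hypothetical areas: one 600 mm^2 monolithic die vs. four 150 mm^2 chiplets.
mono_yield = poisson_yield(600.0, D0)
chiplet_yield = poisson_yield(150.0, D0)

# With chiplets, a defect only scraps one small die rather than the whole
# 600 mm^2, so the fraction of usable silicon is the per-chiplet yield.
print(f"monolithic: {mono_yield:.3f}, per chiplet: {chiplet_yield:.3f}")
```

The model ignores packaging yield, the area spent on inter-chiplet links, and salvage parts with disabled units, so treat it as a rough intuition only.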

    HBX will cost power and power is going to be a pain point for huge transistor counts. Look at Threadripper 3990X versus any Ryzen.

    One thing that's not clear is the effect of Infinity Cache on power usage. We don't have a good "performance equivalent" estimator and probably never will. Some of that huge transistor count that chiplets allow for might simply be to increase the size of Infinity Cache.
     
  4. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,065
    Likes Received:
    7,026
    Not just yield but also the versatility of using the same chip for a multitude of SKUs and performance ranges, the economy of scale that comes with it, and the money saved by developing only one chip.
     
  5. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,360
    Likes Received:
    3,096
    Location:
    Germany
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,286
    Likes Received:
    1,551
    Location:
    London
    COMPUTE UNIT SORTING FOR REDUCED DIVERGENCE - Advanced Micro Devices, Inc. (freepatentsonline.com)

    This has been floating around for a while. I can't work out from this whether there's any meaningful gain possible in a physical device.

    It feels like I've been talking about this for 10 years:

    AMD: R7xx Speculation

    Conditional Routing was floating around back in 2004. Honestly, I don't feel motivated to compare this document and the Conditional Routing paper. I'll pay more attention when I see it in hardware.
     
    xpea and Lightman like this.
  7. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,360
    Likes Received:
    3,096
    Location:
    Germany
    Maybe the more recent turn of events in rendering has made such an investment, i.e. this dynamic/analytical re-sorting, more feasible to explore and invest die space in?
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,286
    Likes Received:
    1,551
    Location:
    London
    Ray tracing you mean? :lol:

    Yes I would tend to agree.
     
    Lightman and CarstenS like this.
  9. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,206
    Likes Received:
    1,775
    Location:
    New York
    Ironically, DXR 1.0 was set up to give the hardware/driver an opportunity to sort threads before execution. Inline raytracing with DXR 1.1 doesn't offer that luxury.

    DirectX Raytracing (DXR) Functional Spec

     
    Lightman likes this.
  10. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,135
    Likes Received:
    1,290
    Not sure how ironic this is. If this sorting to limit divergence can be applied to any task (not just RT), it would be really interesting in general.
    It might also be a way to increase RT perf without fixed-function traversal units, and still with no HW dependency on the classical RT algorithm, so it stays flexible.

    But just dreaming... my method of reading patents became 'just look at the images and their description, but ignore the text' :)
     
  11. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,206
    Likes Received:
    1,775
    Location:
    New York
    The patent isn’t ironic. General purpose sorting for coherence is sure to benefit both software and “hardware assisted” RT.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,286
    Likes Received:
    1,551
    Location:
    London
    What kind of coherency-sorting would that be? What kind of scale, memory usage and latency characteristics would it have? Would the sorting occur entirely on-die? Would it be a driver-originated CPU process?

    Are we talking about categories of ray which would steer the sorting process? Or is it related to the originating triangles? Screen-space derived using a space-filling curve (e.g. Morton curve)?
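
To make the Morton curve option concrete, here is a minimal sketch of screen-space ordering by interleaving the bits of pixel x and y. The function names are made up for illustration, and nothing here is taken from the patent or from any driver.

```python
from typing import List, Tuple

def part1by1(v: int) -> int:
    """Spread the low 16 bits of v so a zero bit sits between each original bit."""
    v &= 0x0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton2d(x: int, y: int) -> int:
    """Interleave x (even bits) and y (odd bits) into one Z-order key."""
    return part1by1(x) | (part1by1(y) << 1)

def sort_pixels_morton(pixels: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Order pixel coordinates along the Morton curve."""
    return sorted(pixels, key=lambda p: morton2d(p[0], p[1]))

print(sort_pixels_morton([(1, 1), (0, 1), (1, 0), (0, 0)]))
```

Pixels that are close on screen tend to end up close in the sorted order, which is the property you would be hoping carries over to the rays they spawn. Whether that holds after a bounce or two is a separate question.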

    If the driver or hardware is doing the sorting, is it always better than sorting coded by the developer?

    Sorting of work items, which the patent document refers to, is a more general kind of sorting, for any kind of shader (e.g. vertex shader). For ray tracing on AMD it would be applicable to every loop iteration as a workgroup traverses sub-trees of the BVH, since each step requires a decision: follow BVH children or report result of triangle hit? Or abandon the sub-tree. The decisions cause work group divergence.
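
As a toy illustration of why sorting work items by their branch decision helps, here is a minimal sketch that counts serialised passes per wave. It assumes a wave runs once for every distinct path taken by its work items, and it ignores the cost of the sort itself and of moving register or LDS state around. The decision labels and the wave size of 4 are made up.

```python
from typing import Sequence

def divergent_passes(decisions: Sequence[str], wave_size: int) -> int:
    """Total passes if each wave executes once per distinct branch path it contains."""
    total = 0
    for start in range(0, len(decisions), wave_size):
        wave = decisions[start:start + wave_size]
        total += len(set(wave))
    return total

# 16 work items with interleaved traversal decisions (hypothetical pattern).
decisions = ["descend", "hit", "abandon", "descend"] * 4

unsorted_cost = divergent_passes(decisions, wave_size=4)
sorted_cost = divergent_passes(sorted(decisions), wave_size=4)
print(unsorted_cost, sorted_cost)
```

In this contrived case every unsorted wave contains all three paths, while after sorting each wave follows a single path. Real traversal would need this re-sort at every loop iteration, which is where the question of whether the gain survives in hardware comes in.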

    Coherence of work item execution (keeping work group divergence to a minimum) is a different topic than finding coherent sets of BVH queries to issue in parallel. In the latter case you might have 1000 rays that share 10,000 BVH nodes or you might have 1000 rays that share 10,000,000 BVH nodes. Preferably you would want to issue queries for the second set with some sorting and binning, e.g. issuing sets of 10,000 BVH-localities. Of course the problem is to work out those BVH-localities in less work than just running without sorting/binning.
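
One cheap proxy for "BVH-locality" would be to bin rays by a quantised origin cell plus direction octant before issuing them. This is only a sketch of that idea under my own assumptions (uniform grid, arbitrary cell size, made-up names). It is not what the patent describes and not how any driver is known to work.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]
Ray = Tuple[Vec3, Vec3]  # (origin, direction)
BinKey = Tuple[int, int, int, int]

def bin_key(ray: Ray, cell_size: float) -> BinKey:
    """Key = grid cell of the origin plus the sign octant of the direction."""
    (ox, oy, oz), (dx, dy, dz) = ray
    octant = int(dx < 0) | (int(dy < 0) << 1) | (int(dz < 0) << 2)
    return (
        math.floor(ox / cell_size),
        math.floor(oy / cell_size),
        math.floor(oz / cell_size),
        octant,
    )

def bin_rays(rays: List[Ray], cell_size: float) -> Dict[BinKey, List[Ray]]:
    """Group rays that start in the same cell and head into the same octant."""
    bins: Dict[BinKey, List[Ray]] = defaultdict(list)
    for ray in rays:
        bins[bin_key(ray, cell_size)].append(ray)
    return dict(bins)

# Hypothetical rays: two neighbours, one heading the opposite way, one far away.
rays = [
    ((0.1, 0.2, 0.3), (1.0, 0.0, 0.0)),
    ((0.4, 0.1, 0.2), (0.9, 0.1, 0.0)),
    ((0.2, 0.3, 0.1), (-1.0, 0.0, 0.0)),
    ((5.5, 0.2, 0.3), (1.0, 0.0, 0.0)),
]
bins = bin_rays(rays, cell_size=1.0)
print({key: len(group) for key, group in bins.items()})
```

The key is cheap to compute, which matters given the point above about the binning needing to cost less than just running unsorted. How well it predicts shared BVH nodes depends entirely on the scene.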
     
  13. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,135
    Likes Received:
    1,290
    I could imagine the compiler simply looks for branches in code, adds 'sorting barriers', workgroups go idle on that and SIMDs continue with another one. Another unit performs the sorting / reordering on the idle tasks concurrently?
    If so, ray tracing may have been listed just as an example in the patent.

    Edit: I probably have this wrong, as reordering threads this way would destroy their context to LDS memory. But 'single threaded' shader stages like pixel or ray shaders might work.
     
  14. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,206
    Likes Received:
    1,775
    Location:
    New York
    Those would be good questions for the IHV driver teams. It's anybody's guess whether there's any sorting at all taking place on DXR 1.0 implementations.

    All else equal? Probably, since the driver knows a lot more about the hardware than developers do, especially at runtime.

    Yup, that's why it's the stuff of dreams.
     
  15. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    921
    Likes Received:
    356
    All else equal? Probably not since the developer knows a lot more about the data coherency and utilized resources than driver programmers do, especially at compile time. ;°P
     
  16. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    921
    Likes Received:
    356
    If you're interested in the context of raytracing in general: https://www.andyselle.com/papers/20/sorting-shading.pdf
     
    Lightman and Jawed like this.
  17. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,717
    Likes Received:
    238
    Some very juicy rumors about RDNA 3 and Navi 31 performance.



    MCM / 2 Chiplets. 160 CU.

    AMD aiming for 2.5x the performance of 6900 XT.
     
  18. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    2,047
    Likes Received:
    1,477
    Location:
    France
    Is it really a chiplet design if it's only 2 big ones? Looks like Fury MAXX...
     
  19. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,783
    Likes Received:
    3,953
    Location:
    Finland
    Tbh it sounds like someone just adding Navi31 specs, which are practically the same as Navi21, to the recent multi-chiplet patent
     
  20. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,717
    Likes Received:
    238
    There will be key architectural improvements over RDNA 2

    They include:
    Zen 3 like cache
    Scalability
    Improved front end.
    Much better ray tracing performance
    Greatly upgraded geometry capabilities, based on their work for PS5.
    Some actual answer to DLSS. Hardware? Software?
    TSMC 5nm, likely late Q2 2022 release.
     
    Lightman likes this.
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.