AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,046
    Likes Received:
    13,421
    Location:
    The North
    [​IMG]
    Not a diagram ;)

    I also missed the 2nd last bullet point.
     
    Lightman, Rootax, yuri and 5 others like this.
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,372
    Likes Received:
    3,754
    Yep, either texturing or RT, but not both at the same time.
    You would be surprised.
     
  3. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,350
    Likes Received:
    3,340
    Location:
    Finland
    What else did you expect from a high-level block diagram?
     
  4. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    812
    Likes Received:
    595
    Maybe more shader engines or more workgroups per SE. Or they change something big, like moving some of the functions into or out of the WG. There were rumors that RDNA2 was a big departure from previous architectures, but this pretty much confirms that there's nothing big that changed at the high level.

    It makes it interesting to guess what AMD even changed to get better IPC from RDNA1 to RDNA2.
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,881
    Likes Received:
    1,101
    Location:
    New York
    Were you honestly expecting RDNA2 to look just like RDNA even at a high level? After all the buzz I was expecting something to change. So far it looks like RDNA2 = RDNA + RT.
     
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,046
    Likes Received:
    13,421
    Location:
    The North
    [​IMG]
    I've been comparing this slide against the RDNA 1 whitepaper, and if you're loose with terminology I think everything is exactly the same.

    But if you want to be nitpicky, the following could be different.

    Launch 7 instructions/clk per CU vs 4 instructions/clk per CU

    And the biggest difference is that the slide above seems to separate out 2 Scalar and 2 Control as well as the Vector Data instruction. It is not really clear whether the 2 Control and the 1 Vector Data are also part of RDNA 1 and just not indicated. So I'm not sure whether RDNA 1's 4 instructions per CU and this slide's 7 instructions per CU are the same thing broken down differently.

    Second difference I might be seeing here is the:
    32 Scalar FP32 FMAD per SIMD, 128 per Dual CU with Data sharing.
    It's not really clear from the RDNA 1 whitepaper whether the scalar units can perform an FMAD.
    Lastly the machine learning inference bit. Possibly also available for RDNA 1.
    And I think MS took the option to do this. This is slightly different than just packing in Int8 and Int4 into vector registers listed here:
    So there is support to hold them in the registers (RDNA 1), but it would appear you need a variant of the CUs that can perform some of these other 8-bit and 4-bit operations listed above. I'm going to assume it can just do rapid packed math normally.

    Some of the bullet points here:
    • Unified Geometry Engine
    • Distributed Primitives and Rasterization is in RDNA 1
    • Mesh Shading Geometry Engine is RDNA 2
    • Multi-Core Command Processor > No indication on RDNA 1 about multi-core
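
For what it's worth, the "32 per SIMD, 128 per Dual CU" figure from the slide can be sanity-checked with a bit of arithmetic. This is only a sketch: the two counts marked as assumptions come from the RDNA 1 whitepaper layout (2 SIMD32 per CU, 2 CUs per dual CU), and the function names are made up for illustration.

```python
# Figure quoted from the slide: 32 FP32 FMAD lanes per SIMD.
LANES_PER_SIMD = 32

# Assumptions taken from the RDNA 1 layout, not from the RDNA 2 slide itself.
SIMDS_PER_CU = 2
CUS_PER_DUAL_CU = 2

# A fused multiply-add counts as two floating-point operations.
FLOPS_PER_FMAD = 2


def fmad_per_dual_cu() -> int:
    """FP32 FMADs a dual CU can start per clock under the assumptions above."""
    return LANES_PER_SIMD * SIMDS_PER_CU * CUS_PER_DUAL_CU


def flops_per_clock_per_dual_cu() -> int:
    """FP32 FLOPs per clock per dual CU, counting each FMAD as 2 ops."""
    return fmad_per_dual_cu() * FLOPS_PER_FMAD


if __name__ == "__main__":
    print("FMAD per dual CU per clock:", fmad_per_dual_cu())
    print("FLOPs per dual CU per clock:", flops_per_clock_per_dual_cu())
```

If the layout assumptions hold, the product matches the slide's 128, which is consistent with the FMAD bullet describing the same vector lanes RDNA 1 already had.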
     
    #2546 iroboto, Aug 18, 2020
    Last edited: Aug 18, 2020
    Lightman, pharma, Esrever and 2 others like this.
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,881
    Likes Received:
    1,101
    Location:
    New York
    Is Microsoft even in the algorithm business? Modern DirectX defines a high level workflow, interfaces and data structures but most logic is left up to the game programmer. I would be surprised if DirectX ships with any sort of upscaling implementation. That would be like MS providing their own depth of field shader.
     
    Lightman and xpea like this.
  8. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    17,341
    Likes Received:
    17,821
    Maybe ... They are for adding HDR to SDR titles like some Original Xbox and X360 Games.
     
    Krteq likes this.
  9. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    341
    Likes Received:
    282
    AMD's explanation of ACE/HWS in the open for OSS Linux drivers partly answered this question.

    Ever since the first eight-"ACE" GPU, all GPUs have come with two "MEC" microcontroller cores, each of which has four "pipes" (i.e., quad-threaded, probably temporal). The initial iterations have all 8 pipes configured as "ACE"s, while later GPUs reappropriated some pipes for GPU multi-process scheduling with support for user-mode queue oversubscription (aka "HWS").

    Not much information on the graphics CP that I know of though, which has its own core(s?).
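
The pipe arrangement described above can be modelled in a few lines. This is a rough sketch only: the class names, the `hws_pipes` parameter and the choice of which pipes get reassigned are all hypothetical, since the post does not say how many pipes later GPUs hand over to HWS.

```python
from dataclasses import dataclass, field
from typing import Dict, List

PIPES_PER_MEC = 4  # "four pipes" per microcontroller core, per the post
NUM_MEC = 2        # "two" microcontroller cores, per the post


@dataclass
class Pipe:
    role: str = "ACE"  # every pipe starts out as an ACE


@dataclass
class Mec:
    pipes: List[Pipe] = field(
        default_factory=lambda: [Pipe() for _ in range(PIPES_PER_MEC)]
    )


def build_gpu(hws_pipes: int = 0) -> List[Mec]:
    """Build the two cores and reassign `hws_pipes` pipes to HWS.

    `hws_pipes` is a hypothetical knob; taking the first N pipes is an
    arbitrary choice made for this illustration.
    """
    mecs = [Mec() for _ in range(NUM_MEC)]
    all_pipes = [pipe for mec in mecs for pipe in mec.pipes]
    if not 0 <= hws_pipes <= len(all_pipes):
        raise ValueError("hws_pipes must be between 0 and the total pipe count")
    for pipe in all_pipes[:hws_pipes]:
        pipe.role = "HWS"
    return mecs


def count_roles(mecs: List[Mec]) -> Dict[str, int]:
    """Count how many pipes are acting as ACE and how many as HWS."""
    counts = {"ACE": 0, "HWS": 0}
    for mec in mecs:
        for pipe in mec.pipes:
            counts[pipe.role] += 1
    return counts


if __name__ == "__main__":
    print(count_roles(build_gpu()))             # early parts: all pipes as ACE
    print(count_roles(build_gpu(hws_pipes=2)))  # hypothetical later split
```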
     
    Lightman likes this.
  10. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,046
    Likes Received:
    13,421
    Location:
    The North
    Yea that wasn't clear to me either.
    RDNA continues to separate ACE for Compute Pipeline and GCP for 3D Graphics pipeline, that's about all I can gather from the RDNA whitepaper unfortunately.
    I'm not sure where the Mesh Shading Engine would fit in there, I suppose on the GCP side of things.

    I also don't know if the multi-core GCP thing is a MS thing. I suspect if it was, perhaps they would have mentioned it. They have been really focused on customizing their GCP for the last 2 iterations of Xbox, so this might be an evolution of what they learned.
     
    #2550 iroboto, Aug 18, 2020
    Last edited: Aug 18, 2020
  11. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    341
    Likes Received:
    282
    It is an optional cog in the graphics pipeline machine after all like the tessellation DLC, so it can hardly escape Graphics CP.

    Doesn't seem like they are explicitly claiming "Graphics Command Processor" being multi-core, unless I have missed something.
     
    iroboto likes this.
  12. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,046
    Likes Received:
    13,421
    Location:
    The North
    Is there a generic command processor that is separate from the GCP?
    I was just assuming they were the same thing.
     
  13. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    341
    Likes Received:
    282
    Given the subtle divergence in terminology, and the absence of ACE/HWS as freestanding colorful blocks, I wouldn't be surprised that "Multi-core Command Processor" is meant to refer to all blocks that eat PM4/AQL packets.
     
    iroboto likes this.
  14. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    12,046
    Likes Received:
    13,421
    Location:
    The North
    @Rys can you shed any light on this aspect here? I read some of your article on context rolls here, but I don't get it all, and I'm not sure if CP and GCP are being used interchangeably.
    https://gpuopen.com/learn/understanding-gpu-context-rolls/
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,348
    Likes Received:
    184
    Location:
    San Francisco
    DirectML is just an API for implementing certain classes of DNNs. It won't magically put a superres/supersampling solution in the hands of developers.
     
  16. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,372
    Likes Received:
    3,754
    There are now two distinctions with RDNA 2 RT acceleration:

    1- It can't accelerate BVH traversal, only ray intersections; traversal is performed by the shader cores.
    2- Ray intersection is shared with the texture units.

    In comparison, Turing RT cores:
    1- Accelerate BVH traversal on their own
    2- Ray Intersection is independent and is not shared with anything else

    So in a sense RDNA2's solution is hybrid, as it is shared between both textures and shaders compared to Turing's solution.
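
To make the split concrete, here is a minimal sketch of that division of labour. `ray_box_hit` stands in for the fixed-function intersection test, while `traverse` is the loop that would run as shader code on RDNA 2 (and inside the RT core on Turing, per the comparison above). All names are hypothetical, and leaf boxes stand in for real triangles to keep it short.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    lo: Vec3                          # minimum corner of the bounding box
    hi: Vec3                          # maximum corner of the bounding box
    children: Tuple["Node", ...] = ()
    prim_id: Optional[int] = None     # set only on leaves


def ray_box_hit(origin: Vec3, direction: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test for a ray against an axis-aligned box.

    Stand-in for the hardware intersection unit.
    """
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this slab: it must already be inside it.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def traverse(root: Node, origin: Vec3, direction: Vec3) -> List[int]:
    """Stack-based BVH walk; stand-in for the shader-side traversal loop."""
    hits: List[int] = []
    stack = [root]
    while stack:
        node = stack.pop()
        # Each iteration hands one box test to the "intersection unit".
        if not ray_box_hit(origin, direction, node.lo, node.hi):
            continue
        if node.prim_id is not None:
            hits.append(node.prim_id)
        else:
            stack.extend(node.children)
    return hits


if __name__ == "__main__":
    leaf_a = Node((0, 0, 0), (1, 1, 1), prim_id=0)
    leaf_b = Node((5, 5, 5), (6, 6, 6), prim_id=1)
    root = Node((0, 0, 0), (6, 6, 6), children=(leaf_a, leaf_b))
    print(traverse(root, (-1.0, 0.5, 0.5), (1.0, 0.0, 0.0)))
```

The point of the sketch is only that the loop, the stack and the branching live in ordinary code, while the box (or triangle) test is the piece that gets handed off.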
     
    #2556 DavidGraham, Aug 18, 2020
    Last edited: Aug 18, 2020
    pharma, Cuthalu, Lightman and 3 others like this.
  17. vjPiedPiper

    Newcomer

    Joined:
    Nov 23, 2005
    Messages:
    99
    Likes Received:
    52
    Location:
    Melbourne Aus.
    And as a result, can I infer that RT performance might scale very differently on RDNA2 vs Turing?
    RT perf on RDNA2 should be closely matched to the number of texture units and shader cores, whereas on Turing RT perf is more closely determined by the number/amount of dedicated RT resources?

    So if you're not gonna use RT, all that RT hardware on Turing is a waste; if you choose not to do RT on RDNA2, you get more texture units to use..
    (assuming an RDNA2 core has more texture units than a Turing core, to compensate for the loss due to RT usage.)
     
  18. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,717
    Likes Received:
    1,080
    Location:
    France
    Ah thx !
     
  19. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,472
    Likes Received:
    4,410
    Location:
    Well within 3d
    One extra box that is shown is the Shader Input block that is per-SE. This could be the SPI block, and while it seems likely that Navi has a similar arrangement, it is something not diagrammed for RDNA. This could point to one reason why the Shader Engine exists as an entity when it seemed like almost all hardware in it had been made per shader array, where wavefront launch is a resource shared between shader arrays in an SE.
    Another element is the capacity and number of arrows into the L2. That could point to 20 L2 slices, although another diagram only had 10 fabric links going to the L2.
    I'd ask whether there are more than 16 slices, and if the L1s can request more than 4 accesses per cycle. More than 16 could give more bandwidth internally, but not if the L1s cannot make more requests than they already do.
    What stands out to me is that if there are 20 slices, the so-called "Big Navi" leak would indicate an L2 with fewer slices, despite having a wider GDDR6 bus.


    A number of the RDNA instruction throughput claims are per-SIMD and there is a diagram with 4 instruction types being considered for issue. That would be 8, although one of the types is vector memory that contends for the same MEM/TEX block, so that's not necessarily out of line since there are two SIMDs per CU.
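
One possible way to reconcile 8 with the slide's 7, spelled out as arithmetic. This is only a reading of the paragraph above, not something the slide or the whitepaper states: it assumes each SIMD can nominate one vector memory instruction per clock and that a single shared MEM/TEX block per CU can accept just one of them.

```python
ISSUE_TYPES_PER_SIMD = 4   # instruction types considered for issue, per the diagram
SIMDS_PER_CU = 2           # two SIMDs per CU
MEM_TEX_BLOCKS_PER_CU = 1  # assumption: both SIMDs feed one MEM/TEX block


def naive_issue_per_cu() -> int:
    """Per-SIMD issue count simply multiplied by the number of SIMDs."""
    return ISSUE_TYPES_PER_SIMD * SIMDS_PER_CU


def issue_per_cu_with_shared_mem_tex() -> int:
    """Same count, minus the vector memory issues that lose the contention.

    Each SIMD may nominate one vector memory instruction, but only as many
    can proceed per clock as there are MEM/TEX blocks to receive them.
    """
    vector_memory_candidates = SIMDS_PER_CU
    blocked = vector_memory_candidates - MEM_TEX_BLOCKS_PER_CU
    return naive_issue_per_cu() - blocked


if __name__ == "__main__":
    print("naive:", naive_issue_per_cu())
    print("with shared MEM/TEX:", issue_per_cu_with_shared_mem_tex())
```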


    I think scalar in this case is the regular math op in the SIMD, rather than packed instructions or some kind of matrix/tensor operation.

    A Turing SM has 4 texture units. A CU has a texture block with 4 texture filtering units.
    I don't interpret the slides as indicating the ray-tracing hardware can do additional texturing, and so it seems both architectures have an RT block and 4 texture units per SM/CU. What each block can do or how their functions interact with other work would need to be evaluated.
     
  20. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    235
    Likes Received:
    426
    Both implementations scale the same: more compute units, more performance. The difference is that nVidia's RT Cores are doing more work and are fully independent from the other units.
     
    pharma and DavidGraham like this.