AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,499
    Likes Received:
    919
Yes, that's why I said the board was 180W. That's the relevant number for PCIe compliance, if I'm not mistaken.
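As a rough sanity check (the 8-pin + 6-pin connector layout here is my assumption, not something stated for this board), the spec ceilings add up comfortably above 180W:

```cpp
// Rough PCIe power-compliance sanity check.  All figures are the spec
// ceilings; the 8-pin + 6-pin auxiliary connector layout is assumed.
#include <iostream>

int main() {
    const int slot_w  = 75;   // PCIe x16 slot limit (W)
    const int pin6_w  = 75;   // 6-pin auxiliary connector limit (W)
    const int pin8_w  = 150;  // 8-pin auxiliary connector limit (W)
    const int board_w = 180;  // quoted total board power (W)

    const int ceiling = slot_w + pin6_w + pin8_w;
    std::cout << "Compliance ceiling: " << ceiling << " W\n";
    std::cout << "Headroom at " << board_w << " W board power: "
              << (ceiling - board_w) << " W\n";
}
```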
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I think there's merit in this idea but it also appears to be somewhat limited, at least in terms of how much performance might be left on the table.

The 64-wide threads are described as hiding inter-instruction latency (which I interpret as referring to complex register file banking (e.g. indexed reads), lane interchange and LDS operations) better than 32-wide. These operations are more typical of compute than of graphics. Additionally, scalar instructions will often be shared by all 64 work items in this case, halving the number of scalar instruction issues required - though the new architecture appears to make scalar instructions less of a bottleneck than before. Scalars are heavily used in compute to generate addresses and to control branching and related queries (any, all).
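To put rough numbers on the scalar point (the per-wave scalar count below is purely hypothetical, just to show the ratio): a wave issues one copy of each scalar instruction regardless of its width, so covering the same work items in wave64 needs half as many scalar issues as in wave32.

```cpp
// Toy illustration: scalar instructions issue once per wave, so the total
// scalar-issue count scales with the number of waves, not work items.
// The per-wave scalar count (address setup, branch tests) is hypothetical.
#include <iostream>

int main() {
    const long work_items       = 1 << 20;  // work items in the dispatch (hypothetical)
    const long scalars_per_wave = 12;       // scalar ops per wave (hypothetical)

    for (long wave_width : {32L, 64L}) {
        long waves         = (work_items + wave_width - 1) / wave_width;
        long scalar_issues = waves * scalars_per_wave;
        std::cout << "wave" << wave_width << ": " << waves
                  << " waves, " << scalar_issues << " scalar issues\n";
    }
}
```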

    There's a slide in the deck which shows how 64-wide threads issue at a rate of 2 instructions from the same thread to the same ALU, consecutively:

[slide: wave64 instruction issue example]

Interestingly, you can see from this example that Navi cannot issue two instructions consecutively when there's a store-to-load dependency: v0 is written in cycle 2 but cannot be consumed until cycle 7 - four cycles of latency. GCN doesn't have to switch to another hardware thread in this case, while Navi does.
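To put numbers on that gap (the 2-cycle issue occupancy and ~4-cycle dependent latency are my reading of the slide, not official figures), here's the back-of-the-envelope for how many other waves the scheduler needs to cover the stall:

```cpp
// Back-of-the-envelope: how many other waves are needed to cover the
// dependent-result gap?  The 2-cycle issue time and ~4-cycle latency are
// taken from my reading of the slide, not from a spec.
#include <cmath>
#include <iostream>

int main() {
    const double issue_cycles  = 2.0;  // cycles a wave64 instruction occupies the SIMD
    const double dependent_gap = 4.0;  // cycles before the written register can be read

    // Each other resident wave can contribute one instruction (2 cycles)
    // towards filling the gap before the stalled wave resumes.
    const int extra_waves = static_cast<int>(std::ceil(dependent_gap / issue_cycles));
    std::cout << "Waves needed to hide the gap (besides the stalled one): "
              << extra_waves << "\n";
}
```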

In this case 64-wide is preferable - except that two 32-wide threads could have run in the same time. So it becomes a question of coherency of work-item execution versus cache thrashing and other side effects of switching amongst lots of threads.

I don't understand why there are "dual compute units". They appear to share instruction and scalar/constant caches, which seems like a weak gain. I don't fully understand the slide that refers to a "Workgroup Processor" either; it seems to be saying that because LDS and cache are "shared", large workgroups can see big performance gains. So perhaps if you have a workgroup of 128 or 256 running in Workgroup Processor mode, you get twice as much LDS capacity and 4x the cache bandwidth compared with a single compute unit.
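A minimal sketch of that reading (the 64KB per-CU LDS figure and the relative cache-bandwidth units are assumptions for illustration, not confirmed numbers):

```cpp
// Sketch of the "Workgroup Processor mode" reading above: a workgroup that
// spans both CUs of a WGP would see the combined LDS and the combined cache
// request paths.  The 64 KB per-CU LDS figure and the bandwidth units are
// assumptions for illustration.
#include <iostream>

int main() {
    const int lds_per_cu_kb = 64;  // assumed LDS capacity per CU (KB)
    const int cache_bw_cu   = 1;   // assumed cache bandwidth per CU (relative units)

    std::cout << "CU mode : " << lds_per_cu_kb     << " KB LDS, "
              << cache_bw_cu     << "x cache bandwidth\n";
    std::cout << "WGP mode: " << 2 * lds_per_cu_kb << " KB LDS, "
              << 4 * cache_bw_cu << "x cache bandwidth\n";
}
```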
     
  3. Love_In_Rio

    Veteran

    Joined:
    Apr 21, 2004
    Messages:
    1,455
    Likes Received:
    115
Isn't that precisely what was described in the super-SIMD patent?
     
  4. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    565
    Likes Received:
    652
    Twice the LDS capacity without the usual occupancy penalty?

    I guess it will take some time until we understand such changes. Where do those slides come from?
     
  5. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    How is it decided whether to use wave32 or wave64 mode?
I mean, the compiler could account for the additional stalls in wave32 mode, but it seems like there are many more factors that decide what is optimal (and for things like potentially divergent branches, those might be factors the compiler cannot know).
     
  6. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    1,551
    Likes Received:
    736
    Yes, but you shouldn’t scale the RAM power by the CU count then.
     
  7. TheAlSpark

    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    20,814
    Likes Received:
    5,916
    Location:
    ಠ_ಠ
Raven Ridge uses IF. It's just their de facto interconnect for everything since it arrived (c. 2017); chiplets are just another use case.
     
    Globalisateur likes this.
  8. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    184
    Likes Received:
    152
    Lightman, TheAlSpark and JoeJ like this.
  9. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    347
    Likes Received:
    93
Unfortunately that's probably not going to work, in large part because of RAM limitations. Another reason to be disappointed in this launch is the 8GB of RAM. AMD knows they'll be doing 16GB cards next year to keep up with the PS5/Xbox launch, but they limit the RAM anyway to match the competition, instead of aiming for what would be more future-proof.

Of course Nvidia knows this nigh as well as AMD, and still does it too. So no points there, but it does mean anyone buying an RTX 2080, or even a 5700... blah blah blah (what's the highest-end one with maximum wordage again?)... anyway, anyone buying those this year and expecting them to last at high settings is going to have a bad time.

    Though what bothers me most is that "Next Gen" on the roadmap, for all of next year.
With the lack of graphics-related instructions/features in this launch, it's not too hard to guess that Next Gen is what the PS5/Scarlett are actually being built on.
Variable rate shading, programmable primitive shaders (a la Nvidia's mesh shaders), raytracing support in some fashion... does this launch even have conservative rasterization, etc.?

All these things are either announced or expected/heavily pushed for by devs, including devs that work directly for MS and Sony studios. So I'd easily expect most if not all of them, and more, to show up in these consoles - which doesn't bode well for anyone buying these cards this year, at the very least in terms of features, if not efficiency as well.

I tried doing a per-mm comparison with Turing, but after tearing my hair out over the inane charts of node comparisons and node variants I gave up. So here's a rougher estimate: a 2070 die has roughly the same transistor count as a 5700-series die. And while a 5700 might be a bit faster, it's also clocked faster. A 2070 meanwhile has raytracing support (even if it is narrow), tensor cores (even if they aren't relevant to gaming), and poor performance per mm as it is. So right now Navi's efficiency per mm, while technically better than Vega's, still isn't that impressive.
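For what it's worth, here's the rough version of that per-area comparison (die sizes and transistor counts are approximate public figures, and the 12nm-vs-7nm node gap makes a straight density comparison unfair anyway):

```cpp
// Rough transistors-per-mm^2 comparison.  Die sizes and transistor counts
// are approximate public figures; the node difference (12 nm vs 7 nm)
// makes a straight density comparison unfair anyway.
#include <iostream>

int main() {
    struct Die { const char* name; double transistors_b; double area_mm2; };
    const Die dies[] = {
        {"TU106 (RTX 2070)",  10.8, 445.0},
        {"Navi 10 (RX 5700)", 10.3, 251.0},
    };

    for (const Die& d : dies) {
        std::cout << d.name << ": "
                  << (d.transistors_b * 1000.0 / d.area_mm2)
                  << " Mtransistors/mm^2\n";
    }
}
```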
     
    #869 Frenetic Pony, Jun 11, 2019
    Last edited: Jun 11, 2019
  10. Pressure

    Veteran Regular

    Joined:
    Mar 30, 2004
    Messages:
    1,355
    Likes Received:
    283
A high-end part could have 5120 stream processors and be on par with Vega.
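A quick scaling sketch (the 5120-SP configuration is hypothetical; the 64 SPs per CU and 2 CUs per WGP are the usual figures):

```cpp
// Scaling sketch: a hypothetical 5120-SP part next to Navi 10, assuming the
// same 64 SPs per CU and 2 CUs per WGP.
#include <iostream>

int main() {
    const int sp_per_cu  = 64;
    const int cu_per_wgp = 2;

    for (int sps : {2560, 5120}) {  // Navi 10 vs. hypothetical high-end part
        int cus  = sps / sp_per_cu;
        int wgps = cus / cu_per_wgp;
        std::cout << sps << " SPs -> " << cus << " CUs (" << wgps << " WGPs)\n";
    }
}
```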

[attached screenshot]
     
  11. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,499
    Likes Received:
    919
    Well it's just a rough estimate, of course, especially since such a board would probably use some flavor of HBM anyway.
     
  12. Digidi

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    227
    Likes Received:
    99
Is there any news about polygon output per cycle or per second?
     
  13. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,301
    Likes Received:
    397
    Location:
    Australia
Only the bit about 8 primitives into the geometry processor and 4 out per cycle, nothing else really.
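Converting those per-cycle figures to per-second rates only needs a clock; the ~1.8GHz below is a placeholder assumption, not a stated spec:

```cpp
// Convert the quoted per-clock primitive rates to per-second rates.
// The clock frequency is a placeholder assumption.
#include <iostream>

int main() {
    const double clock_ghz = 1.8;  // assumed game clock (GHz), placeholder
    const double prims_in  = 8.0;  // primitives into the geometry processor per clock
    const double prims_out = 4.0;  // primitives out per clock

    std::cout << "In : " << prims_in  * clock_ghz << " Gprims/s\n";
    std::cout << "Out: " << prims_out * clock_ghz << " Gprims/s\n";
}
```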
     
  14. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    347
    Likes Received:
    93
    It also seems like the block diagram is for the most basic type of Navi chip, which might be a 20 "Compute Unit" part? So a 5700 could be two of these put together.
     
  15. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    240
No, that block diagram is exactly for Navi 10: 20 double-CUs packed into 4 macro-blocks, which themselves are packed into 2 SEs.
     
    Cat Merc likes this.
  16. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    347
    Likes Received:
    93
I've been trying to figure that out, but haven't gotten a definitive answer as to what a "double compute unit" is from someone who knows for sure. I'd first assumed it was 20 CUs per SE, but saying "double compute unit" is weird, especially with their Super SIMD-like 32/64 wavefront, which could itself be a "double compute unit".
     
    #876 Frenetic Pony, Jun 12, 2019
    Last edited: Jun 12, 2019
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I don't know. I didn't give that a very close read when it appeared and haven't looked since...
     
  18. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    240
It sounds like nV's TPC, where two SMs share some blocks.
     
  19. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    Yes, every AMD architecture has been able to vary these numbers. That's why the 5700 works, not just the 5700 XT.
     
  20. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    611
    Likes Received:
    1,052
    Location:
    PCIe x16_1
    Each Shader Engine contains 10 Workgroup Processors, which in turn each contain 2 CUs. The CUs inside of a WGP can be grouped up to cooperate on workloads, if the compiler deems it beneficial.

[slide: Workgroup Processor diagram]
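Putting that hierarchy into numbers, together with the 2 Shader Engines mentioned above (the 64 SPs per CU is the usual figure, stated here as an assumption):

```cpp
// Navi 10 shader hierarchy as described above: 2 Shader Engines, each with
// 10 Workgroup Processors, each WGP holding 2 CUs.  64 SPs per CU is the
// usual figure, assumed here.
#include <iostream>

int main() {
    const int shader_engines = 2;
    const int wgp_per_se     = 10;
    const int cu_per_wgp     = 2;
    const int sp_per_cu      = 64;

    const int wgps = shader_engines * wgp_per_se;
    const int cus  = wgps * cu_per_wgp;
    const int sps  = cus * sp_per_cu;
    std::cout << wgps << " WGPs, " << cus << " CUs, " << sps << " stream processors\n";
}
```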
     
    Cat Merc, Lightman, w0lfram and 4 others like this.