AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    475
    Likes Received:
    196
    Huh, an interesting idea. Take JPEG XL, the JPEG foundation's finally produced next spec, as an example compression scheme, and hey, you get a 20:1 or better compression ratio with few artifacts at the cost of... whatever the decompression shader costs. Could easily be worth it for the right titles: tons of ultra-high-res textures with zero pop-in for the cost of X performance.
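
    To put rough numbers on that, here is a minimal sketch. It assumes the 20:1 figure is measured against uncompressed RGBA8 (the post doesn't say) and compares it with BC7's fixed 8 bits per pixel. The helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kRgba8BitsPerPixel = 32;  // uncompressed RGBA8
constexpr std::uint64_t kBc7BitsPerPixel = 8;     // BC7: 16 bytes per 4x4 block

// Bytes for one mip level of a width x height texture at a fixed bit rate.
inline std::uint64_t texture_bytes(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t bits_per_pixel) {
    return width * height * bits_per_pixel / 8;
}

// Size after applying an N:1 compression ratio (ASSUMED to be relative to
// the uncompressed size), rounded down to whole bytes.
inline std::uint64_t compressed_bytes(std::uint64_t uncompressed,
                                      std::uint64_t ratio) {
    assert(ratio > 0);
    return uncompressed / ratio;
}
```

    For a single 4096x4096 texture that works out to 64 MiB uncompressed, 16 MiB as BC7 and about 3.2 MiB at 20:1, so roughly five times smaller than BC7 under that assumption.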

    I wonder if this will actually make it into PS5/XSX, or if this is just some random patent they felt like applying for.
     
    PSman1700 and pjbliverpool like this.
  2. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Pretty sure this is already in. There isn't much missing. The first stage lookup is effectively provided by the normal texturing hardware, the second stage is textbook tiled resources but with spare / transient tiles. From there on, what's missing is essentially a "cache line lock" intrinsic, to provide a concurrent-access-safe approach to filling in the tiled resource tile-wise.

    I assume even the latter one had almost been in place already. It wouldn't actually be tied to the L1/L2 cache, but rather to an arbitrary memory region with initialization protected under a critical section. The required feature is to block until the generating shader is done, and only to invoke the generating shader if not hitting the cache.
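
    A CPU-side sketch of that second-stage lookup, under stated assumptions: a virtual tile space backed by a small pool of transient tiles, where the generator only runs on a miss. The class and member names are hypothetical, eviction is plain round-robin, and the critical section that would protect concurrent fills is left out on purpose.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Marker for "no physical tile mapped" / "slot has no owner".
constexpr std::size_t kUnmapped = static_cast<std::size_t>(-1);

// Hypothetical model: many virtual tiles, few resident (transient) tiles.
class TransientTilePool {
public:
    using Tile = std::vector<std::uint8_t>;
    using Generator = std::function<Tile(std::size_t)>;

    TransientTilePool(std::size_t virtual_tiles, std::size_t physical_tiles,
                      Generator generate)
        : page_table_(virtual_tiles, kUnmapped),
          owner_(physical_tiles, kUnmapped),
          pool_(physical_tiles),
          generate_(std::move(generate)) {
        assert(physical_tiles > 0);
    }

    // Resolve a virtual tile to a resident tile; run the generator
    // (the "decompression shader" stand-in) only when it is not resident.
    const Tile& fetch(std::size_t virtual_tile) {
        assert(virtual_tile < page_table_.size());
        std::size_t slot = page_table_[virtual_tile];
        if (slot == kUnmapped) {
            slot = next_victim_;  // round-robin eviction (an assumption)
            next_victim_ = (next_victim_ + 1) % pool_.size();
            if (owner_[slot] != kUnmapped) {
                page_table_[owner_[slot]] = kUnmapped;  // drop old mapping
            }
            pool_[slot] = generate_(virtual_tile);
            owner_[slot] = virtual_tile;
            page_table_[virtual_tile] = slot;
            ++misses_;
        }
        return pool_[slot];
    }

    std::size_t misses() const { return misses_; }

private:
    std::vector<std::size_t> page_table_;  // virtual tile -> pool slot
    std::vector<std::size_t> owner_;       // pool slot -> virtual tile
    std::vector<Tile> pool_;               // resident tile payloads
    Generator generate_;
    std::size_t next_victim_ = 0;
    std::size_t misses_ = 0;
};
```

    Fetching the same tile twice runs the generator once. Once the pool is full, a new tile evicts an older one, and touching the evicted tile again regenerates it.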

    And what makes me suspect that AMD is adding these capabilities? Because it's a building block for another extension AMD can't provide in GCN / RDNA yet: VK_EXT_fragment_shader_interlock (specifically see beginInvocationInterlockARB() and endInvocationInterlockARB() in
    https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_fragment_shader_interlock.txt ). In other words, device-wide critical sections on "arbitrary" tags.

    The patent just describes a clever application of that missing feature, or rather of its generalized form, which applies to tags other than fragments.

    Actually, that application might have been devised in the process of implementing ROP-independent, device-wide critical sections. Sounds like a typical AMD move: head straight for a generic hardware implementation (not patentable on its own), and then figure out what else it could be good for later on.
     
    #2182 Ext3h, May 29, 2020
    Last edited: May 31, 2020
    Lightman, Frenetic Pony and BRiT like this.
  3. Lurkmass

    Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    106
    Likes Received:
    97
    Actually, shader interlocks are supported on recent AMD HW. The reason they don't expose them in either their GL or VK drivers is that it's a bad idea to use them, since executing critical sections is a high-latency operation. Their recommendation is that you're better off using linked lists for doing arbitrary blending or OIT.
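
    For reference, the linked-list approach looks roughly like this. It is a CPU-side C++ sketch with std::atomic standing in for the shader atomics, and the type and function names are made up for illustration. Each fragment grabs a node from a shared pool with an atomic counter and swaps itself in as the new head of its pixel's list, and a later resolve pass sorts and blends.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kEndOfList = 0xFFFFFFFFu;

struct FragmentNode {
    float depth;
    std::uint32_t color;  // packed RGBA
    std::uint32_t next;   // index of the next node, or kEndOfList
};

// Hypothetical per-pixel linked lists: one head index per pixel plus a
// shared node pool, as in the usual OIT build pass.
class FragmentLists {
public:
    FragmentLists(std::size_t pixel_count, std::size_t node_capacity)
        : heads_(pixel_count), nodes_(node_capacity) {
        for (auto& head : heads_) {
            head.store(kEndOfList, std::memory_order_relaxed);
        }
    }

    // Build pass: no critical section, only two atomics per fragment.
    // Returns false when the node pool is exhausted.
    bool insert(std::size_t pixel, float depth, std::uint32_t color) {
        assert(pixel < heads_.size());
        const std::uint32_t index =
            next_free_.fetch_add(1, std::memory_order_relaxed);
        if (index >= nodes_.size()) {
            return false;
        }
        nodes_[index].depth = depth;
        nodes_[index].color = color;
        // Become the new head; the previous head becomes our successor.
        nodes_[index].next =
            heads_[pixel].exchange(index, std::memory_order_acq_rel);
        return true;
    }

    // Resolve pass (runs after all inserts): collect one pixel's fragments,
    // farthest first, ready for back-to-front blending.
    std::vector<FragmentNode> sorted_back_to_front(std::size_t pixel) const {
        std::vector<FragmentNode> out;
        for (std::uint32_t i = heads_[pixel].load(std::memory_order_acquire);
             i != kEndOfList; i = nodes_[i].next) {
            out.push_back(nodes_[i]);
        }
        std::sort(out.begin(), out.end(),
                  [](const FragmentNode& a, const FragmentNode& b) {
                      return a.depth > b.depth;
                  });
        return out;
    }

private:
    std::vector<std::atomic<std::uint32_t>> heads_;
    std::vector<FragmentNode> nodes_;
    std::atomic<std::uint32_t> next_free_{0};
};
```

    The trade is memory for parallelism: nothing ever waits in the build pass, but every fragment is stored until the resolve pass.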
     
    Ext3h likes this.
  4. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Ordered shader interlock is implemented, you mean? And only on Vega / RDNA family. (SOPP, S_SENDMSG, MSG_ORDERED_PS_DONE).

    There is no native unordered shader interlock support, and the ordered one appears to be hard-wired, with severe implications for the efficiency of rasterization (not just the denoted CS is blocked off, but the whole work generation is stalled).
    With the instructions supported so far, you could only construct unordered CS support by using a mix of atomics and sleeps. That is the worst-case scenario, as you get serialized execution with random latency in between the serialized parts on top.

    Special CS support with first-to-arrive logic is actually simpler (as you may cache locally once shared mode has been reached), but still inefficient to implement in software:
    Code:
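    // init_guard states: 0 = empty, 1 = being initialized, 2 = ready
    if(*init_guard == 2) {
       // NOP, lucky cache hit
    } else {
       // Try to claim the initialization: 0 -> 1
       int state = atomicCompSwap(init_guard, 0, 1);
       if(state == 0) {
           // First to arrive: fill the tile, then publish it as ready
           init();
           atomicExchange(init_guard, 2);
       } else while(state != 2)  {
           // Someone else is initializing: back off, then re-read the guard
           sleep();
           state = atomicCompSwap(init_guard, 2, 2);
       }
    }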
    With the sleep instruction (SOPP S_SLEEP) not exposed by any intrinsic, the atomicCompSwap loop is still a bad choice. So there has got to be some hardware arbitration, or at least an intrinsic, to handle that properly without an (unthrottled) spin-lock.
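
    The same three-state guard can be tried out on the CPU. This is a minimal C++ sketch, not shader code: std::atomic stands in for the shader atomics, std::this_thread::yield() stands in for the unexposed sleep, and the names GuardState and ensure_initialized are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

enum GuardState : int { kEmpty = 0, kFilling = 1, kReady = 2 };

// Runs init() at most once per guard. Returns true only for the caller
// that actually ran init(); everyone else returns false once the guard
// has reached kReady.
template <typename Init>
bool ensure_initialized(std::atomic<int>& guard, Init&& init) {
    // Fast path: the lucky cache hit.
    if (guard.load(std::memory_order_acquire) == kReady) {
        return false;
    }
    // Try to claim the initialization: kEmpty -> kFilling.
    int expected = kEmpty;
    if (guard.compare_exchange_strong(expected, kFilling,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire)) {
        init();
        guard.store(kReady, std::memory_order_release);  // publish
        return true;
    }
    // Lost the race: the guard was already being filled or is ready.
    assert(expected == kFilling || expected == kReady);
    while (guard.load(std::memory_order_acquire) != kReady) {
        std::this_thread::yield();  // throttled wait instead of a hot spin
    }
    return false;
}
```

    Calling it from several threads on the same guard runs init() exactly once. The waiters still burn cycles polling, which is the inefficiency being complained about here.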

    The whole thing is then probably interleaved with memory management. There is no visible page fault handler in RDNA, but in order to provide the benefits described in the patent, that logic at least has to operate on a virtual memory segment which is subject to being dropped on L2 cache eviction. LDS or GDS don't fit the size requirements, and spilling to main memory defeats the point of using texture compression.
    At least for RDNA 1.0, I don't see such a capability documented yet, but it doesn't sound too far off either.
    For the purpose of texture space shading, sub-allocations linked from an instanced lookup table should suffice. Effectively good old tiled / partially resident textures, but with a device-managed allocation strategy.
     
    #2184 Ext3h, May 29, 2020
    Last edited: May 29, 2020
    pharma and BRiT like this.
  5. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    388
    Likes Received:
    331
    It's still used these days, just for other tasks. For example, nvJPEG is used to make NN training faster.
    There is even an nvJPEG HW decompression block in A100 for these purposes :-D
     
    pharma likes this.
  6. Lurkmass

    Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    106
    Likes Received:
    97
    Yes ...

    'Unordered' shader interlock doesn't require any special HW support, since it was the "default case" prior to ordered interlocks. Any HW/driver combination can implement unordered interlocks with UAVs or images by doing atomic R/W ops on those resources and then observing the race conditions afterwards as a result.

    The reason interlocks are a bad idea on AMD HW is that you're trading decreased memory bandwidth consumption for decreased parallelism, so it has a crappy payoff for them in the end. It cannot be a good idea performance-wise to stall fragment shader execution, unless you're Intel or one of those tiler GPUs that you'd see on mobile devices.

    I think we might've painted ourselves into a corner with shader interlocks, since they have massive performance implications for future discrete GPU HW designs ...
     
  7. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    205
    Likes Received:
    179
    Lightman and pharma like this.
  8. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,614
    Likes Received:
    3,677
    Location:
    Pennsylvania
    Cichlid? As in freshwater fish?
     
  9. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,070
    Likes Received:
    2,942
    Location:
    Finland
    Plot twist: "AMD Ryzen Mobile" is real despite misspelling Gauguin, and its 4CU Navi2x is Sienna Cichlid
     
  10. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    39
    Likes Received:
    73


    The slide is not from AMD; it's just an analysis of the patches, made by Locuza.
    I did find multiple clock domains for DCN 3.0 and VCN 3.0 in the source code.
    The SDMA engine is indeed updated, to v5.2.
    The SMU is also updated.
    But it's hard to say anything about the important parts.
     
    Lightman likes this.
  11. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    475
    Likes Received:
    196
    Better performance per watt was one of the expected RDNA2 benefits, especially since they need to put it in the next mobile APUs and have licensed it to Samsung for mobile GPUs.

    Tinkering with voltage domains and sleep states and stuff should be pretty par for the course as such. Wonder what "display/video core next 3.0" and whatever will entail. AV1 decode I hope?
     
  12. eastmen

    Legend Subscriber

    Joined:
    Mar 17, 2008
    Messages:
    10,843
    Likes Received:
    2,032
    Plot twist, it was Lockhart all along
     
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,363
    Likes Received:
    3,942
    Location:
    Well within 3d
    There was a second graphics queue referenced in driver commits for Navi as well, though it would frequently be left off or inactivated. Not sure what would distinguish it for the successor.
    The ACE queue count being 4 sounds like a possible reduction if confirmed.
    The ACE count and at least one reference to 128-bit GDDR6 sound more appropriate for a portable or lower-range product.
     
    Lightman and ethernity like this.
  14. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    205
    Likes Received:
    179
    The 128b reference is strange since they have been obfuscating this sensitive pre-launch info by using binary firmware - check all those calls to amdgpu_atomfirmware.
     
    Lightman likes this.
  15. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    39
    Likes Received:
    73
    From the RDNA architecture whitepaper, what I could find is that Navi10 has four ACEs. Each ACE handles one shader array.

    Yeah, I saw this too, but then I saw that it is only for emulation mode, so I'm not sure what to make of it. I think it will disappear soon, because the value is usually read from the firmware.

    Some things I could quickly glean from the patches:
    • The DCN and VCN changes indeed seem to be major. VCN 2.0 was introduced with Navi, and now there are DCN 3.0 and VCN 3.0.
    • According to the patches, there are two clock sources for VCN and two clock sources for DCN.
    • There are 2 additional SDMA engines, for a total of 4 compared to 2 on Navi10.
    • The firmware identifies whether the chip is air-cooled or liquid-cooled.
    • XGMI support!? So far this was seen only for Arcturus. I wonder what is up here. This means that the chip is foreseen to link up with other chips to do workload sharing.
    • PP table clocks conveniently removed, with a TODO.
    • New PCI audio device.
     
    Lightman likes this.
  16. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    There has been only one ACE since early GCN, and there still is only one. What's shown as 2 ACEs is a single core with 2x SMT, and each thread polls from a number of queues.

    AMD's presentation of that implementation detail is more artistic freedom than anything else.
     
    Lightman, trinibwoy, pharma and 3 others like this.
  17. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,498
    Likes Received:
    775
    What could this be?
     
  18. BRiT

    BRiT Verified (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    15,473
    Likes Received:
    13,972
    Location:
    Cleveland
    So a different or new PCI ID for AMD TrueAudio Next Next?
     
  19. ethernity

    Newcomer

    Joined:
    May 1, 2018
    Messages:
    39
    Likes Received:
    73
    Yes it seems like a new PCI device. TrueAudio Next Next.
     
  20. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,025
    Likes Received:
    5,562
    PlayStation 4 Portable confirmed!

    :runaway::runaway::runaway:

    Well, one of my theories for how the PC can catch up with next-gen consoles in I/O speed is that future graphics cards may get a direct connection to a fast SSD, without having to send data through the main system RAM.
     