AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,815
    Likes Received:
    2,637
    It's not going to be just 10% under Vega 20. AMD showed their absolute best result with their weird Strange Brigade choice (which is not even a popular AA game); expect much worse real-world results than this.
     
    A1xLLcqAgt0qc2RyMz0y likes this.
  2. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,299
    Likes Received:
    249
    AMD has some marketing deal with Rebellion. That's a known thing. Your opinion that Strange Brigade showed the absolute best result is just speculation.
     
    Lightman likes this.
  3. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,530
    Likes Received:
    875
    The L0 operand/result cache would lower utilization of the SRAM register file, allowing more ALUs to be served by the same register file, or at the very least save a lot of power. It would be interesting to know if conflicts are handled in hardware or in software. If it is software, scheduling would be isolated to a single shader program; if it is in hardware, you could have independent shader programs running on the same CU. Maybe the super SIMD extension is a means to utilize spare register file bandwidth (because request/stores are served by the L0 operand cache).

    Cheers
     
  4. snc

    snc
    Newcomer

    Joined:
    Mar 6, 2013
    Messages:
    117
    Likes Received:
    62
    What? "1.25x improvement per clock, and 1.5x improvement per watt"
     
  5. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,815
    Likes Received:
    2,637
    It really is not. AMD GPUs are known to punch above their weight in this title; the fact that they showed only this title in comparison to the 2070 speaks volumes about their real-world performance. They did the same with Radeon VII by the way, showing it beating the 2080 in this title by a big margin, and we all know how that turned out in the end.
     
  6. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    239
    2080 currently beats client Vega20 in it by ~5% median so no, it wasn't the best case, not even close.

    Please stop with this bullshit.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Another wild-arsed guess: there will not be a three-operand fetch from the register file for any instruction. A maximum of two operands will come from the register file and the other has to come from the operand cache (e.g. for FMA).

    It's basically pointless in this modern era to make the register file support three-operand fetches when there are so few instructions that can use three operands.

    Additionally, operands that go to cache but never need to go to the register file (having a short lifetime, e.g. one or two cycles) save write power/bandwidth to VGPRs.

    Since VGPRs consume quite a lot of space per CU, power will be relatively high to perform reads/writes (power is a function of distance), so less data making these round trips will save power. Reducing addressing bandwidth and having a loosened banking configuration will also make the VGPRs consume less power.
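
    The traffic argument above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of Navi: the cache here is a small LRU holding recent results only, the function name `simulate_vgpr_traffic`, the cache size, and the four-instruction `PROGRAM` are all made up for illustration, and a result that no later instruction reads is treated as dead (never written back).

```python
from collections import OrderedDict

# Hypothetical instruction stream: (destination, [source operands]).
PROGRAM = [
    ("v2", ["v0", "v1"]),        # v2 = v0 * v1
    ("v3", ["v2", "v0"]),        # v3 = v2 + v0
    ("v4", ["v3", "v2", "v1"]),  # v4 = fma(v3, v2, v1)
    ("v5", ["v4", "v0"]),        # v5 = v4 + v0
]


def simulate_vgpr_traffic(program, cache_entries=2):
    """Return (register_file_reads, register_file_writes) for a toy result cache.

    Assumptions: only results are cached (LRU), a cached source costs no
    register-file read, and a result is written to the register file only
    if it is evicted while a later instruction still needs it.
    """
    # For each defining instruction, record the index of its last reader.
    last_use = {}
    current_def = {}
    for i, (dest, sources) in enumerate(program):
        for src in sources:
            if src in current_def:
                last_use[current_def[src]] = i
        current_def[dest] = i

    cache = OrderedDict()  # register name -> index of defining instruction
    rf_reads = 0
    rf_writes = 0
    for i, (dest, sources) in enumerate(program):
        for src in sources:
            if src in cache:
                cache.move_to_end(src)  # hit: refresh LRU position
            else:
                rf_reads += 1           # miss: fetch from the register file
        cache.pop(dest, None)           # drop any stale copy of dest
        cache[dest] = i
        while len(cache) > cache_entries:
            _, def_index = cache.popitem(last=False)
            if last_use.get(def_index, -1) > i:
                rf_writes += 1          # still live: must be written back
    return rf_reads, rf_writes


if __name__ == "__main__":
    print("no cache:     ", simulate_vgpr_traffic(PROGRAM, cache_entries=0))
    print("2-entry cache:", simulate_vgpr_traffic(PROGRAM, cache_entries=2))
```

    In this toy stream the short-lived intermediates never reach the register file at all, and no instruction needs more than two register-file reads, which is the shape of the saving being guessed at. Real numbers would depend entirely on the actual hardware and shader code.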
     
  8. PSman1700

    Regular Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    507
    Likes Received:
    118
    It is not BS; it's logical for a company to use the best-case scenario when demonstrating a product for the first time. I think we have to wait for real-world game tests to see if it can match an RTX 2070, let alone a 2080 or higher.

    https://www.pcgamer.com/amds-latest-gpu-driver-aims-to-boost-performance-in-strange-brigade/
     
    del42sa and DavidGraham like this.
  9. anexanhume

    Veteran Regular

    Joined:
    Dec 5, 2011
    Messages:
    1,546
    Likes Received:
    732
    When they launched Radeon 7, they claimed 1.25X performance iso power.
     
  10. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    523
    Likes Received:
    239
    It is BS because it's no longer the best case scenario ever since that nV driver update.

    Turing is super-competitive there now, and they did compare the unnamed Navi against the 2070.
     
  11. PSman1700

    Regular Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    507
    Likes Received:
    118
    AMD not using a benchmark that favours them is hard to believe. Anyway, better to wait for real-world game performance tests and reviews. Usually AMD's and Nvidia's own benchmarks are what they are: BS.
     
  12. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,815
    Likes Received:
    2,637
    Only in DX12. AMD used the Vulkan API to do their Vega 7 vs 2080 comparison, in which they still have a very large lead. So they DO insist on using their absolute best-case scenario.

    Please use a more civilized manner of conversation or I will be forced to retaliate harshly.
     
    #592 DavidGraham, May 30, 2019
    Last edited: May 30, 2019
    A1xLLcqAgt0qc2RyMz0y and del42sa like this.
  13. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    2,436
    Likes Received:
    264
    There are sync points, like maintaining raster order, so if one SE runs ahead it won't get too far.

    IPC means better performance per clock given the same configuration.
     
  14. itsmydamnation

    Veteran Regular

    Joined:
    Apr 29, 2007
    Messages:
    1,299
    Likes Received:
    396
    Location:
    Australia
    You know this how?
     
  15. w0lfram

    Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    159
    Likes Received:
    33

    What if... the L0 operand/result cache is connected directly to the HBCC …?
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,135
    Likes Received:
    2,935
    Location:
    Well within 3d
    Smaller GPUs usually have a larger proportion of their area taken up by controllers and hardware outside of the primary graphics area, so even if smaller on absolute terms, a 256-bit GDDR6 setup can have more of an impact than a quad-HBM2 Vega 7, depending on other factors.
    However, one of the other factors is that Vega 7 has an unusually large amount of area around the graphics core, which seems to contribute to the area bloat versus an ideal shrink from 14nm.
    One of the possible residents on the die is additional infinity fabric blocks to connect the other HBM controllers, and possibly more mesh connecting the two sides--taking up non-zero area. GDDR6 likely takes up 3 or possibly 3.x sides, and may require a more sprawling interconnect.


    There are comments about bank conflicts that seem to indicate that there's a best-effort attempt at gathering operands by the CU, and if the conflicts are significant enough there will be stalls. There's no indication of any encoding changes for the instructions to indicate that software has any other means of handling conflicts besides paying attention to the register IDs that belong to the same banks.
    https://github.com/llvm-mirror/llvm...b3561b2#diff-1fe939c9865241da3fd17c066e6e0d94

    (from GCNRegBankReassign.cpp -- note that it's not called RDNARegBankReassign)

    As far as L0, there's more than one context in which that term has been used. If discussing a destination cache in register file patents, AMD described the output flops of the register file as serving as an L0 for repeated accesses to the same ID, rather than describing the register output cache.
    The L0 in the LLVM changes appears to be a CU-local memory pool that plays a role in memory access ordering and can impact data visibility to wavefronts in other CUs, which seems distinct from the question of augmenting the register file and result forwarding within a SIMD.

    There seems to be an implication from the above code comments that there are 4 banks of vector register file, and unlike prior GCN architectures it's not a foregone conclusion that an instruction has guaranteed access to them. Going by the description of the stall behavior, it's possible that an FMA could source 3 in the same cycle with the appropriate register allocation pattern.
    A significant motivation for the super-SIMD patent is to use the lost operand access cycles and this could entail dual-issue or faster issue latency. The odd way the GFX10 changes document latencies may be consistent with something along those lines.
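
    To make the bank-conflict point concrete, here is a minimal sketch. It assumes 4 vector register banks assigned round-robin by register number (v0, v4, v8... in bank 0) and one operand served per bank per cycle; the stall model and the helper names `extra_cycles` and `reassign` are illustrative assumptions, not the actual pass in GCNRegBankReassign.cpp.

```python
from collections import Counter

NUM_BANKS = 4  # assumption: 4 VGPR banks, as the code comments seem to imply


def bank(reg_id):
    """Bank of a vector register under an assumed round-robin mapping."""
    return reg_id % NUM_BANKS


def extra_cycles(sources):
    """Stall cycles if each bank can serve one distinct operand per cycle."""
    distinct = set(sources)  # reading the same register twice needs one fetch
    per_bank = Counter(bank(r) for r in distinct)
    return max(per_bank.values(), default=1) - 1


def reassign(sources, free_regs):
    """Greedy sketch: move conflicting operands to free registers in unused banks."""
    free = list(free_regs)
    used_banks = set()
    result = []
    for reg in sources:
        if reg in result:
            result.append(reg)  # repeated operand, no extra fetch needed
            continue
        if bank(reg) not in used_banks:
            used_banks.add(bank(reg))
            result.append(reg)
            continue
        # Conflict: look for a free register in a bank not yet used.
        replacement = next((f for f in free if bank(f) not in used_banks), None)
        if replacement is None:
            result.append(reg)  # nothing available, the conflict stays
        else:
            free.remove(replacement)
            used_banks.add(bank(replacement))
            result.append(replacement)
    return result


if __name__ == "__main__":
    fma_sources = [0, 4, 8]  # v0, v4, v8 all land in bank 0
    print(extra_cycles(fma_sources))           # conflicting allocation
    fixed = reassign(fma_sources, free_regs=[9, 10, 12])
    print(fixed, extra_cycles(fixed))          # after greedy reassignment
```

    Under these assumptions an FMA reading v0, v4 and v8 would serialize on one bank, while an allocation spread across three banks could source all three operands in the same cycle, which is the case described above.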

    I'm not clear on the full purpose of it, but it's called an L0 and there's mention of an L1 as well. Possibly, there's an L2 or something else beyond. The HBCC in Vega is past all the cache layers, and since its job is paging resources into the local VRAM pool it's not specced to handle something like all the local cache output of the CUs.
     
  17. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    2,815
    Likes Received:
    2,637
    [​IMG]
     
  18. snc

    snc
    Newcomer

    Joined:
    Mar 6, 2013
    Messages:
    117
    Likes Received:
    62
    And now they claim 1.5x
     
  19. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    169
    Likes Received:
    90
    anexanhume likes this.
  20. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,299
    Likes Received:
    249
    So it's not only speculation, but speculation based on some doubtful "facts".
    1. AMD presented RVII performance in three games representing three APIs (DX11, DX12, Vulkan). They didn't cherry-pick only "their absolute best result". Not even the Strange Brigade performance was "their absolute best result" at that time. E.g. in Call of Duty: Black Ops 4, RVII was 20% faster than RTX 2080.
    2. Nvidia released drivers that boosted performance in Strange Brigade, so the situation changed.
    3. Why should we expect that a game which was fine with the Vega architecture will be the best choice for the Navi architecture? There are several games that like Polaris but don't like Fiji.
    4. Strange Brigade was never "their absolute best result" and is not even today. E.g. Radeon VII in World War Z @2560×1440 performs 28% faster than RTX 2080. That would probably be "their absolute best result" to present for Navi, if(!) the architecture prefers the same workloads.
    5. It's not likely that Navi prefers the same type of workloads as Vega, because the compute/fillrate and compute/geometry ratios have completely changed. It will have significantly lower compute performance, but significantly higher geometry performance than Vega.

    So your findings are really just speculation, and not even speculation based on facts. It's speculation based on other speculation.
     
    w0lfram, Lightman and AlphaWolf like this.