AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    572
    Likes Received:
    256
    None of what they said is a GCN problem.
    All either their legacy from days before it or just plain lower quality circuit design.
     
    naenrda likes this.
  2. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,485
    Likes Received:
    5,018
I'd call GCN what AMD calls GCN. In practice, it's all GPUs so far that use Compute Units with 64 ALUs each, using RISC SIMD.
     
    pharma, Heinrich4 and Rootax like this.
  3. Heinrich4

    Regular

    Joined:
    Aug 11, 2005
    Messages:
    596
    Likes Received:
    9
    Location:
    Rio de Janeiro,Brazil
    pharma likes this.
  4. GPUCurious

    Joined:
    May 25, 2019
    Messages:
    2
    Likes Received:
    0
    So I guess everyone has decided that the rumor/leak from KOMACHI that Navi has 8 shader engines is false then? And everyone must have also decided therefore that Navi is using more than 40 compute units to compete with the 2070? Because I'm not sure how a GCN based GPU is supposed to beat a 2070 with 40 CUs at a vaguely sensible TDP without some sort of architectural advancement to GCN?

     
  5. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,494
    Likes Received:
    2,226
    Location:
    Finland
Komachi has in general been a trustworthy leaker for as long as I can remember, but didn't he quickly delete the tweet where he said 8x5CU? Which could indicate it wasn't solid.
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I'm sorry to see my old thread vandalised by being closed for further replies.

    Did Beyond3D ever provide the source code for these tests? I'm reasonably sure they were at least partly debunked.

Also, NVidia's "efficiency" had a big problem with games that use HDR, didn't it, while AMD cards suffered no such penalty? NVidia did eventually solve this problem though, as I understand it.

    I think this rumour could have merit. Though it might be for the wrong GPU?

    Apart from the increased fixed-function throughput this would offer, it could also change the ratio of scalar:vector instruction throughput. If a CU consists of 2 VALUs that are 32-wide (while retaining a 64-wide hardware thread group size) there would be twice as many SALU instruction issues available per VALU instruction issue.
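A back-of-envelope sketch of that ratio, under the assumption (not stated in the post) that each 32-wide VALU would keep its own scalar issue slot; all unit counts here are speculative:

```python
# Hypothetical issue-rate arithmetic; the SIMD and scalar-slot counts
# are assumptions for illustration, not confirmed hardware details.
def salu_per_valu(wave_size, simd_width, num_simds, salu_slots):
    cycles_per_vinst = wave_size // simd_width        # per-wave cadence
    valu_inst_per_cycle = num_simds / cycles_per_vinst
    return salu_slots / valu_inst_per_cycle

gcn  = salu_per_valu(64, 16, num_simds=4, salu_slots=1)  # classic CU
navi = salu_per_valu(64, 32, num_simds=2, salu_slots=2)  # speculated CU
print(gcn, navi)  # 1.0 2.0
```

With those assumptions the speculated CU gets two scalar issues per vector issue, versus one on a classic CU.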

    My overall feeling with GCN has been that the fixed function hardware and work distribution (that has to deal with a mixture of fixed-function and compute work) has failed to scale because it is globally constrained in some way. The mysteries of the use of GDS have made me wonder if GDS itself has been a relevant bottleneck, but regardless I feel there has long been some kind of global bottleneck.

    More CUs on their own won't help with this bottleneck. The only real solution is to pull apart the way that work distribution functions, minimising the effort required of the global controller. Part of this requires better queue handling, both globally and for each distributed component. This requires more internal bandwidth (since the definition of work can be quite complex) and interacts with how the on-chip cache hierarchy is designed.

    I've held this theory about AMD's failure to scale for pretty much the entire time we've had GCN, because the 4 CU limit has been around forever (though it took a while to discover that it was there).

    It might be worthwhile to consider why AMD ever thought it necessary to share resources between CUs. This has always smelt like a false economy to me. Some would argue that this is a side-effect of AMD considering GCN to be a compute architecture first, since AMD has spent about 10 years arguing that graphics is compute and fixed function is just a side-show for compute. (Which, I believe, is why consoles are so amazing these days as console devs have embraced this perspective.)
     
    Kej, DavidGraham, Lightman and 3 others like this.
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,196
    Likes Received:
    3,160
    Location:
    Well within 3d
The Zen vs BD comparison becomes a question of architecture versus microarchitecture. The x86 architecture defines instructions and a range of behaviors to software, and Zen, BD, or Skylake are implementations of said behaviors. The particulars of what they use to carry out their architectural requirements, and how well or poorly they handle them, are things the architecture tries to be somewhat agnostic about. That being said, implementation quirks throughout history can be discerned if one knows the context of things like x87, FMA3, the often winding encoding and prefix handling, etc.
Generally, though, x86 doesn't commit to things like cache sizes, instruction latencies, or how many ALUs, load queues, or other lower-level resources an implementation has. In part, there are too many x86 implementations that provide contrary examples for saying a given resource allocation or pipeline choice is architectural.

I think AMD might share some of the blame. If we go by the GCN architecture whitepaper, GCN is effectively the 7970 with some handwaving about the CU count. If we go by the ISA docs, we lose some of the cruft, but there's still a lot of specific microarchitectural detail that gets rolled into it.
Which elements AMD considers essential to GCN is left embedded in whatever else happens to be in the CU and GPU hardware stack.

    There are other elements that at least so far would hold:
4-cycle cadence--there are examples of instructions whose semantics recognize this, such as lane-crossing ops
    16-wide SIMD--various operations like the lane-crossing ones have row sizes linked to the physical width of the SIMD
    incomplete or limited interlocking within a pipeline--courtesy of the cadence and an explicitly stated single-issue per wavefront requirement
    multiple loosely joined pipelines--explicitly recognized in the ISA with the waitcnt instructions
    very weakly ordered memory model with incoherent L1 with eventual consistency at the L2
    multiple memory spaces and multiple modes of addressing
    integer scalar path and SIMD path with predication
    separate scalar memory path and vector memory path
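The row granularity baked into the lane-crossing ops can be sketched like this (a toy model only, not real ISA syntax or real DPP semantics):

```python
# Toy model of a row-granular lane-crossing op on a 64-lane wave:
# the wave is treated as four 16-lane rows, matching the physical
# SIMD16 that executes it over the 4-cycle cadence.
ROW = 16

def row_bcast(wave, lane_in_row):
    """Broadcast one lane's value within each 16-lane row."""
    out = []
    for start in range(0, len(wave), ROW):
        row = wave[start:start + ROW]
        out.extend([row[lane_in_row]] * len(row))
    return out

wave = list(range(64))
# lanes 0..15 all read lane 0, lanes 16..31 all read lane 16, ...
print(row_bcast(wave, 0)[::16])  # [0, 16, 32, 48]
```

The point is that the 16-lane row size is visible in the instruction semantics, which is how the physical SIMD width leaks into the ISA.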

The ISA docs also tend to commit to rather specific cache sizes and organization, data share capacity and banking, and the sizes of various buses.

These would be elements more present in the ISA docs rather than the overarching GCN doc. One thing the ISA docs do not help with is establishing a consistent mapping between encodings and instructions; in a few GFX9 FMA cases, AMD gave the architectural name to pre-existing instructions and changed their behavior in a way that threw tools like code disassemblers out of whack.

As far as the ISA being RISC goes: other than instructions being either 32-bit or 64-bit, I think GCN is very complex. Multiple memory spaces, multiple ways to address them, multiple ways to address the same memory inconsistently. Many special registers that are implicit arguments to many operations, addressing rules, complex access calculations (such as the bugged flat addressing mode in GFX1010). Vector instructions can source operands from LDS, scalar registers, special registers, or a number of special-purpose values rather than a straightforward register ID.
    To a limited extent, there are some output modifiers for results.

    I think another interpretation was 2x16, so perhaps allowing for multiple issue of instructions whose behavior would be consistent with prior generations. The lane-crossing operations would have the same behavior then, as it might be difficult to broadcast from an ISA-defined row of 16 to the next row if they're executing simultaneously.
    It might also help explain why register banking is now a concern, while a physically 32-wide SIMD and register file would still be statically free of bank conflicts. The latency figures seem to be in-line with a cadence similar to past GPUs, which might not make sense with a 32-wide SIMD.
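One way to see the banking point; the bank count and register-to-bank striping here are pure assumptions for illustration:

```python
# Hedged sketch of why dual-issuing two 16-wide ops could introduce
# register bank conflicts that a single 32-wide issue avoids: two
# instructions reading operands in the same cycle can contest a bank.
NUM_BANKS = 4  # assumed bank count, purely illustrative

def bank(vreg):
    # assume vector registers stripe across banks by index
    return vreg % NUM_BANKS

def bank_conflicts(srcs_a, srcs_b):
    # two instructions issued the same cycle collide on any shared bank
    return {bank(r) for r in srcs_a} & {bank(r) for r in srcs_b}

# single 32-wide issue: only one instruction reads per cycle -> conflict-free
# dual 2x16 issue: v0 and v8 land in the same bank and collide
print(bank_conflicts([0, 4], [8, 13]))  # {0}
```

With one instruction issued per cycle the register file is trivially conflict-free; it's only simultaneous issue that makes banking a scheduling concern.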

Some of the patents for the new geometry pipeline, and those extending it, cited an increasingly complex crossbar for distributing primitives between shader engines as a bottleneck to scaling. One alternative was, in effect, a primitive shader per geometry front end, streaming data out to the memory hierarchy to distribute primitives. If that were the case for the ASCII diagram, though, the fair amount of redundant culling work at each front end might leave the relatively paltry number of CUs per SE with fewer CUs for other work like asynchronous compute. And relying on the memory crossbar that everything else is already using, to save on a geometry crossbar, may just shift from one specialized bottleneck to another, global one.
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
What do you mean by redundant? You're referring to a single primitive being shaded (culled) by each instance (tile, effectively) in which it appears? Well, that's the trouble with hardware implementing an API: brute force is always going to result in wasted effort.
     
  9. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,196
    Likes Received:
    3,160
    Location:
    Well within 3d
    Yes, the primitive stream is broadcast to all front ends, and the same occupancy and throughput loss would be incurred across all shader engines. It's proportionally less of an impact in a GPU with 16 CUs per shader engine versus that ASCII diagram that has less than a third of the compute resources available.
    Also unclear would be how salvage SKUs would be handled. A balanced salvage scheme would be cutting off resources in 20% increments.
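The proportionality argument can be put in rough numbers (the unit culling cost is illustrative, not measured):

```python
# Rough model: if the primitive stream is broadcast, each shader engine
# pays a similar fixed culling cost regardless of its CU count, so the
# fraction of compute capacity lost grows as CUs per SE shrink.
def cull_overhead(cull_cost, cus_per_se, work_per_cu=1.0):
    return cull_cost / (cus_per_se * work_per_cu)

print(cull_overhead(1.0, 16))  # 0.0625 -> 6.25% of a 16-CU SE
print(cull_overhead(1.0, 5))   # 0.2    -> 20% of a 5-CU SE (8x5 rumor)
```

The same broadcast cost that is noise on a 16-CU shader engine becomes a meaningful slice of a 5-CU one.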

    As far as attributing blame to the API, what specifically is inherent to the API that requires this? If there are many physical locations that may be responsible for handling all or part of a primitive, the question as to which ones are relevant needs to be answered by something, and then somehow the whole system needs to be updated with the answer.
     
  10. BRiT

    BRiT (╯°□°)╯
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    13,570
    Likes Received:
    10,510
    Location:
    Cleveland
    For sale in July.

    More details at E3 on June 10th.
     
    Lightman likes this.
  11. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,060
    Likes Received:
    3,063
The Sapphire rep was right on the money: top Navi competes with the RTX 2070 (barely winning, by 10%, in AMD's favorite Strange Brigade). No mention of hardware ray tracing. High-end Vega will continue to serve the high end for the foreseeable future .. all that remains is confirmation of the $500 price.
     
    pharma likes this.
  12. Ike Turner

    Veteran Regular

    Joined:
    Jul 30, 2005
    Messages:
    1,884
    Likes Received:
    1,758
RDNA is supposedly not GCN..but..well...the reality is that it's probably still an evolution of GCN (which isn't a bad thing, contrary to what some folks are crying about..). Anyway, it's clear that this is a streamlined evolution of GCN aimed at gaming, while Vega (and its successor) will be the "compute" version of the GCN arch..BTW the die doesn't have HBM. Navi is the new Polaris, as expected.
     
    #472 Ike Turner, May 27, 2019
    Last edited: May 27, 2019
  13. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,494
    Likes Received:
    2,226
    Location:
    Finland
By the looks of it, GCN will indeed be split into compute-GCN and gaming-RDNA.
What do you mean, "new Polaris as expected" because it's not using HBM? The memory solution has little to nothing to do with the architecture: GCN (or RDNA) isn't tied to a specific memory type, and they can fit any memory controller they choose. Heck, even the Polaris architecture you specifically mentioned has products using both GDDR (desktop GPUs) and HBM (the Intel "Vega", which is really Polaris).
     
  14. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    572
    Likes Received:
    256
    "same old GCN" - t. every redditor ever
     
    Wasmachineman_NL and Lightman like this.
  15. Ike Turner

    Veteran Regular

    Joined:
    Jul 30, 2005
    Messages:
    1,884
    Likes Received:
    1,758
    "New Polaris" as in "new mid-range GPU arch" (wasn't related to my HMB remark sorry)

    https://www.anandtech.com/show/1441...ducts-rx-5700-series-in-july-25-improved-perf
     
    #475 Ike Turner, May 27, 2019
    Last edited: May 27, 2019
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,196
    Likes Received:
    3,160
    Location:
    Well within 3d
At least looking at the code commits thus far, there are hints at what could be considered significant departures.
There was an announced new cache hierarchy. How different or new it is isn't clear, but there are some code comments with new naming conventions, such as indicating there is a per-CU L0 cache rather than an L1.

    There are some indications of better, though not complete, interlocks in the pipeline--although I recall discussing in past architecture threads how I thought a good improvement to GCN proper would be to have those interlocks.
Some things, like how differently the SIMD path is handled, and why certain instructions related to branching, memory counts, or skipping instructions were changed or dropped, could be other areas of notable change.

    Whether that's enough to be called "new", I suppose is up to AMD. The introduction of scalar memory writes and a new set of instructions for that in Volcanic Islands would be on the same level of some of these changes, and that didn't prompt AMD to declare Fiji or Tonga as not being GCN.
    Maybe GFX10 is different enough for AMD, but that's counterbalanced by how AMD has muddied the waters as to what is in GCN as an architectural model versus a collection of product minutia.

I also don't see why a number of the Navi changes wouldn't be desired for the compute line. There are new caches, HSA-focused forward progress guarantees, memory ordering features, and pipeline improvements that would help a Vega successor as well, so how different a Vega successor would be--or why it would be similar enough to old products to still be called GCN--isn't clear.
     
    Lightman, pharma, entity279 and 11 others like this.
  17. rSkip

    Newcomer

    Joined:
    Jan 10, 2012
    Messages:
    11
    Likes Received:
    20
    Location:
    Shanghai
    https://www.amd.com/en/press-releas...ion-leadership-products-computex-2019-keynote
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,196
    Likes Received:
    3,160
    Location:
    Well within 3d
    I didn't see which GCN product this was compared to.
     
  19. yuri

    Newcomer

    Joined:
    Jun 2, 2010
    Messages:
    195
    Likes Received:
    170
    Getting rid of the GCN branding just to calm the haters down. Well played :)

Let's hope those "up to" improvements will be achievable using regular products.
     
  20. del42sa

    Newcomer

    Joined:
    Jun 29, 2017
    Messages:
    173
    Likes Received:
    93
    https://www.anandtech.com/show/1441...ducts-rx-5700-series-in-july-25-improved-perf
     