AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    715
    Likes Received:
    220
    Location:
    india
    Too specific indeed, but they got the codename wrong, calling it Navi12. Perhaps the AdoredTV 'leak' swayed them that way. They also got it right that Navi would be a new uarch, which not many were expecting until AMD put it out as RDNA. They say the internal codename was KUMA.

    But they mention it being Vega 56-level in performance, and that no 7nm Vega would show up for gamers. Amusingly, one of the top comments is about getting 2070 performance from AMD.

    There's this bit about 'Navi10', which imo is Navi12:

    It doesn't make much sense if the other Navi chip was to slot into the Vega 56 position.

    According to Komachi, Navi14 is a low-end chip, so Navi12 could be a bigger chip (if not another small die), assuming it hasn't been scrapped.



    I'm also curious whether AMD can improve the density of their chips while keeping the clocks the same. Perhaps console chips can be denser and run at lower clocks? And of course, whether AMD are able to do 3 SEs or not.
     
    iMacmatician likes this.
  2. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,059
    Likes Received:
    1,021
    I’ve thought a bit about this one, and the answer to the first question is probably that they can’t do a lot about density at the same clocks without either going to another lithographic variation of 7nm (such as using EUV with SDB) or, for instance, changing memory type and thus memory controllers.
    Can AMD increase density if they are prepared to lower frequencies? Well, yes! But how much could they gain in density, and how much would they lose in frequency?

    Apple, for instance, have a transistor density of 82.8 million/mm2 in their A12 SoC, as opposed to 26.4 in their second-gen 16nm FF A10.
    AMD's new Navi die has a transistor density of 41 million/mm2 (where the Vega10 die had 25.7 million/mm2 on GF 14nm). So while we are comparing mittens and gloves here, it does seem to suggest that the mobile variant scales significantly better in density, and that the HP scaling is a lot less. There are a number of reasons why HP designs should have a harder time scaling in density, but the specifics of just how much you would gain in density by sacrificing frequency are beyond this armchair expert.
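    For reference, the density scaling ratios implied by those figures (using only the million/mm2 numbers quoted in this post):

```python
# Back-of-the-envelope scaling factors from the densities quoted above.
# All Mtr/mm2 figures are the ones cited in this post, not official data.
a12, a10 = 82.8, 26.4        # Apple A12 (TSMC 7nm) vs A10 (16nm FF)
navi10, vega10 = 41.0, 25.7  # Navi 10 (TSMC 7nm) vs Vega 10 (GF 14nm)

mobile_scaling = a12 / a10    # ~3.1x density gain in a mobile SoC
hp_scaling = navi10 / vega10  # ~1.6x density gain in a high-perf GPU

print(f"mobile: {mobile_scaling:.2f}x, high-performance: {hp_scaling:.2f}x")
```

    So the mobile design got roughly double the density improvement of the HP GPU from a comparable node transition, which is the asymmetry the post is pointing at.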

    I’ll say this though. The enthusiast desktop GPU market is where it really pays to push your silicon to its bleeding edge. The more performance you squeeze out, the more money you can ask for your product; observe how much more Nvidia can demand for the 30% or so in performance that the 2080Ti has over the 2070Super.
    And you don’t have to have any concerns about power supplies, or even coolers and noise - all of that is pushed onto end users and, to some extent, partner manufacturers. In consoles, by contrast, the costs and consequences of larger power supplies, more expensive cooling, higher demands on ventilated placement, higher noise levels - and consequently disgruntled customers who suffer failures after trying to shut noisy devices away - all fall squarely on the shoulders of the party, Sony or MS, that commissioned the design of the chip. It wouldn’t be surprising if a somewhat different compromise in size/density/frequency/power was struck.
    But again, what that would mean exactly is not clear.
     
    #1222 Entropy, Jul 6, 2019
    Last edited: Jul 6, 2019
    Pixel and gamervivek like this.
  3. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,298
    Likes Received:
    247
    Well, the NDA is lifted, and the Radeon RX 5700 XT offers 94% of the performance of Vega 20 / Radeon VII:
    https://www.computerbase.de/2019-07/radeon-rx-5700-xt-test/2/#abschnitt_benchmarks_in_2560__1440

    Good luck next time. But I'm afraid that the audience for your purely negative "speculations" will be reduced significantly.
     
  4. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    Any confirmation on the amount of LDS per CU/WGP? I can only find AMD slides that don't have numbers for it.
    Some of the compute results seemed far enough off that I was curious whether there's a wavefront occupancy issue with the LDS, or perhaps iffy register allocation choices due to the banked register file needing optimization.
    More optimized games that use intrinsics may see regressions, depending on how they detect whether a given GPU can use them, or if they apply them to a GFX10 GPU whose instruction behaviors may be subtly different.
     
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    AFAIK, it's unchanged per CU, and two LDS' can be combined at the WGP level. I don't have anything in writing on it though, and I did not ask too specifically, i.e. there might be implications as to latency or occupancy between exclusive and inclusive modes.
     
    Lightman likes this.
  6. DmitryKo

    Regular

    Joined:
    Feb 26, 2002
    Messages:
    687
    Likes Received:
    556
    Location:
    55°38′33″ N, 37°28′37″ E
    They did specify 256 KB of VGPRs per full execution unit (64-lane wavefront), which is the same total size as in GCN (4 SIMD16 vector units x 64 KB each). This makes it twice as big per SIMD32 vector unit (i.e. 2 vector units x 128 KB each).

    Slide 11 presents the RDNA-era shader unit as "4 Scalar/SIMD32/SFU8" - this is actually the RDNA "workgroup processor" (WGP), a group of 2 CUs (exactly 4 each of the Scalar, SIMD32 and SFU8 blocks in total).

    Then there is slide 12 in the same presentation, where they compare x-rays of the RDNA WGP/CU with the GCN CU. On this slide, the RDNA LDS is pictured as shared across a WGP of 2 CUs, and its area is about twice as large as the VGPR in each SIMD32 unit - which per the above should be 128 KB. By contrast, the GCN LDS has the same area as a single SIMD16 VGPR block, which were 64 KB each.


    So I would say that the LDS should be 4 times the total size it was in a GCN CU - i.e. 128 KB per CU (with 2 SIMD32 vector units) and 256 KB per WGP (2 CUs and 4 SIMD32 vector units).
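    The arithmetic behind that inference can be laid out explicitly (this is a sketch of the area-based reading in this post, not a confirmed figure; all sizes in KB):

```python
# GCN CU: 4 SIMD16 units, each with a 64 KB VGPR file; 64 KB LDS per CU.
gcn_vgpr_per_simd = 64
gcn_vgpr_per_cu = 4 * gcn_vgpr_per_simd       # 256 KB per CU
gcn_lds_per_cu = 64

# RDNA, as read off the slides: 256 KB of VGPRs per 64-lane execution
# unit, i.e. 128 KB per SIMD32, with two SIMD32 units per CU.
rdna_vgpr_per_simd32 = 128
rdna_vgpr_per_cu = 2 * rdna_vgpr_per_simd32   # same 256 KB total as GCN

# If the WGP-level LDS really is ~2x the area of one SIMD32 VGPR block,
# that reading gives 256 KB per WGP, i.e. 128 KB per CU (4x the GCN LDS).
inferred_lds_per_wgp = 2 * rdna_vgpr_per_simd32
inferred_lds_per_cu = inferred_lds_per_wgp // 2
```

    As the following posts note, reading physical sizes off a conceptual diagram is risky, so this stays an inference until a white paper or ISA doc settles it.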
     
    #1226 DmitryKo, Jul 9, 2019
    Last edited: Jul 9, 2019
    Lightman and BRiT like this.
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    In the context of the poor compute performance, I was curious whether the early support for Navi's banked register file could have failed to spread register references across all banks.
    An equivalent number of register IDs allocated between a GCN and an RDNA shader can experience more stalls in the latter due to bank conflicts, and some additional loss can possibly occur if the compiler didn't handle the longer dependent-instruction latency well.

    This conflicts with the impression given to CarstenS, and there is some risk in using a conceptual diagram to guess at physical dimensions. Part of the reason certain items are shown stretched twice as wide is that those items are shared. However, it's not a given that they're actually that size when shared, or that items like the scheduler and scalar unit blocks shrank as much as the diagram showed, just because the artist didn't need to stretch them across the diagram.

    Perhaps further details will come out if a white paper or ISA doc is released.

    That diagram is also the one that I noted showed the branch/message block going missing in the GCN to RDNA transition without comment.
     
  8. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    520
    Likes Received:
    239
  9. Betonmischer

    Newcomer

    Joined:
    Jun 30, 2019
    Messages:
    16
    Likes Received:
    33
    Good catch. It's the same in SiSoftware Sandra. It probably has something to do with the new CU grouping in RDNA. In GCN, up to four Compute Units used to share the vector instruction and scalar caches. RDNA groups CUs in twos, and the LDS is now also a shared resource. It may well be that, apart from those changes, CUs in RDNA ended up sharing some of the front-end logic involved in instruction fetching and decoding.
     
  10. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,971
    Likes Received:
    4,564
    24 Navi CUs sounds like a Polaris 11 replacement.
    Or rather a Polaris 11 sized chip with Polaris 10 performance.

    Could be a very decent chip for laptops.
     
  11. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    The RDNA diagrams indicate that there is a shared scalar cache and a shared instruction cache. A slide also indicated that, unlike the shared decode and issue units of GCN, there are separate decode and issue units for the vector path and for the scalar path.
    No mention was made of the other instruction types, although the vector memory path could readily have duplicated decode/issue, since there are two texture blocks and L0 caches.

    The LDS is shared, so what that means in terms of decode and issue is uncertain.
    The export, message, scalar memory, and other types are not mentioned, and some may be more suited to shared decode and issue. There's still the one scalar cache, and elements like the export path arbitrate for a common bus that wouldn't allow independent instruction issue anyway.
     
    AlBran likes this.
  12. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    331
    Likes Received:
    85
    I'm surprised by the CU count. Underclocking would have to produce huge power efficiency gains for that many to run on most laptops, unless this is some gaming laptop specific GPU.
     
  13. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    9,971
    Likes Received:
    4,564
    $1500 gaming laptops use the TU106, which the 40 CU Navi 10 competes with in graphics cards at similar performance and power consumption.
    A 24 CU Navi would proportionally be closer to a TU116, which is going into $1000 laptops.

    Though this could end up being exclusive to the rumored 16" macbook pro.
     
  14. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    715
    Likes Received:
    220
    Location:
    india
    Likely Navi14, the mobile chip (NV_NAVI14_M_A0).

     
    Lightman and PizzaKoma like this.
  15. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    331
    Likes Received:
    85
    That's... a lot of codenames.

    If AMD can easily scale the "double CU" count inside each block - which it seems they can, far more easily than they used to with GCN - then we've little idea of what CU count each of these could even have. E.g. Navi21 is maybe a differently scaled Navi 20?

    Either way, I do wonder if they can get more performance per mm2 by scaling up CU count vs cache/bandwidth/primitive shader engines. RDNA does catch up to Nvidia in titles Nvidia used to dominate, seemingly geometry-heavy stuff like GTAV/Total War. But in some compute-heavy stuff like Rainbow Six Siege it falls a bit behind relative to its other performance scaling. Maybe putting six double CUs per block rather than five would be an overall win, or it could even scale to seven per block if you want high-end 4K performance. Samsung already produces 16Gbps GDDR6, about 14% faster than Navi 10's memory, which could in turn accommodate 6 double CUs per block over 5, if bandwidth is a bottleneck.
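    A quick sanity check on those two levers (assuming Navi 10's shipping 14 Gbps GDDR6 and the 5-double-CU-per-engine layout, both as discussed above):

```python
# Rough scaling check on the two levers mentioned above.
gddr6_now, gddr6_fast = 14, 16        # Gbps per pin; Navi 10 ships at 14
bw_gain = gddr6_fast / gddr6_now - 1  # ~14% more memory bandwidth

wgp_now, wgp_more = 5, 6              # "double CUs" per shader engine
alu_gain = wgp_more / wgp_now - 1     # 20% more ALU per engine

print(f"bandwidth: +{bw_gain:.0%}, ALU: +{alu_gain:.0%}")
```

    So the faster memory roughly keeps pace with the extra compute, which is the balance argument being made here.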
     
    #1235 Frenetic Pony, Jul 13, 2019
    Last edited: Jul 13, 2019
  16. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    715
    Likes Received:
    220
    Location:
    india
    I'm not sold on RDNA's supposed improvement over GCN until computerbase do a comparison of the 5700XT against the 390 instead of every other chip in existence.
     
  17. PizzaKoma

    Newcomer

    Joined:
    Apr 29, 2019
    Messages:
    39
    Likes Received:
    64

     
  18. bridgman

    Newcomer Subscriber

    Joined:
    Dec 1, 2007
    Messages:
    58
    Likes Received:
    102
    Location:
    Toronto-ish
    Just curious, why the 390? The RX480/580/590 has a similar number of CUs and generally outperforms the 390, doesn't it?
     
  19. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    331
    Likes Received:
    85
    40 was easy to guess at; AMD had 20 CU single "Shader Engine" GPUs last year, already a change over their earlier 16 CU ones. That this change would extend to Navi was easy to guess - they had to be doing it for some reason.

    In fact, looking at the 6 double CU count per "shader engine" (whatever that means in RDNA) for Navi 14 or whatever it is, I'd easily guess any refresh/new chips will probably hit that. Looking at the 4K and 1440p results for Navi10, that GPU actually drops relative to Nvidia's performance when going up in resolution. Meaning, most likely they're hitting compute/bandwidth bottlenecks rather than geometry or work-distribution bottlenecks. So moving each "SE" to 6 double compute units (from 5), plus faster GDDR6 (if bandwidth is the bottleneck anywhere), would give a good boost to 4K benchmark results compared to the relatively small increase in die size it'd incur.

    It'd also mean adding just 2 more SE's, and a 384bit bus total, would compete with a 2080ti at 4k (at least at current clocks_. Which is very good for retail, as it'd need to curtail the clock speed much less than a theoretical 80 CU Navi card and compete at a smaller die size.
     
    #1239 Frenetic Pony, Jul 15, 2019
    Last edited: Jul 15, 2019
  20. gamervivek

    Regular Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    715
    Likes Received:
    220
    Location:
    india
    Because the 390 had 64 ROPs and Polaris needed higher clocks. So the per-FLOPS comparison looks better for Navi vs. Polaris than vs. Hawaii.
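    To put rough numbers on that per-FLOPS point (clocks here are approximate reference/boost figures, so treat the TFLOPS as ballpark only):

```python
# Ballpark FLOPS/ROP comparison behind the "390 as baseline" argument.
# Clocks are approximate reference/boost values and vary by board.
cards = {
    #              (ALUs, clock MHz, ROPs)
    "R9 390":      (2560, 1000, 64),
    "RX 580":      (2304, 1340, 32),
    "RX 5700 XT":  (2560, 1905, 64),
}
tflops = {name: alus * 2 * mhz / 1e6  # FMA counts as 2 ops per clock
          for name, (alus, mhz, _rops) in cards.items()}
for name, tf in tflops.items():
    print(f"{name}: {tf:.2f} TFLOPS, {cards[name][2]} ROPs")
```

    The 390 matches the 5700 XT's ALU and ROP counts at much lower clocks, which is why it makes a cleaner per-FLOPS baseline than Polaris, whose throughput leans on clock speed with half the ROPs.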

    40 CU being the first Navi chip as well, it certainly looks like they had a decent source. I didn't pay much attention to that single shader engine chip, any pointers?

    The reversal of AMD doing better at higher resolutions is certainly interesting; maybe bandwidth, and the abnormally low pixel fillrate noted in the AT review, are the cause.
     
