AMD: Navi Speculation, Rumours and Discussion [2019]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

  1. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    I'm thinking that if Navi 21 is actually more or less on par with GA102 - and all signs point to this, if only for the XTX config - then there must be something in addition to the rumored 256 bit bus and it's unlikely to be magic.
    It's also possible that reaching higher clocks and improving power consumption could result in some die area increases.
    But suggesting that they somehow crammed 80 WGPs - or even twice the SIMDs per WGP - while staying inside 300W and on the same process and reaching 2.4GHz clocks isn't realistic at all.
    When you say that NV did it, you're forgetting that Turing already had these lanes - they just weren't capable of FP32. So it's not the same situation.
     
    PSman1700 and pharma like this.
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,100
    Likes Received:
    1,186
    Location:
    London
    There's no sign of that in the XSX die though. The PS5 die (based on approximate area) shows no sign of a gross die size increase versus Navi 10 or Navi 14.

    There's 30%+ die area that can't be explained!

    The XT's rumoured base clock is 1500MHz or slower!



    Well, the 52% increase in transistor count for 17% more SMs and ROPs (1 extra GPC) needs to be explained... Sure, transistor counts don't translate directly across foundries, let alone nodes, but 52% is a different ballgame.
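    (For reference, that's working from the published transistor counts: TU102 is 18.6B and GA102 is 28.3B, so 28.3/18.6 ≈ 1.52 - hence the 52%.)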
     
  3. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    1,794
    Likes Received:
    713
    Location:
    msk.ru/spb.ru
    The XSX GPU seems to be a mash-up of RDNA1 and RDNA2 - or to be more precise, RDNA2 h/w configured similarly to RDNA1. I don't know how much we can infer about PC RDNA2 parts from the XSX die.
    And there is a clear explanation if Navi 21 does in fact come with a 256 bit G6 bus. If the cache rumors are true it could be a very interesting design which will scale differently from what we might extrapolate from console and RDNA1 parts.

    Base clock is for "power viruses" like Furmark and is never hit in real world workloads. The game clock of 2-2.1 GHz will actually be higher than that of Navi 10.

    There are many changes in Ampere besides the new SIMD configuration. FP32 ALUs should be more complex than INT32 ones, and they are likely there in addition to the latter, but that 52% is probably distributed across everything new in Ampere - new caches, new register file, new RT cores, new tensor cores, new memory controllers, new ROPs, etc.
     
    PSman1700 likes this.
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,100
    Likes Received:
    1,186
    Location:
    London
    I agree, generally XSX (and PS5) are problematic here because, like with their use of Zen 2 not Zen 3, they should be assumed to be "0.x of RDNA 2". But for clocks specifically, PS5 shows no signs of a substantial change in die area beyond what's seen in Navi 10/14, despite being much faster.

    Yes, I agree, if this is a monster cache, then performance scaling analysis is a new mystery. Even if it's not a monster cache, it will be a new mystery. Similar to the "tiled" rasterisation that Maxwell brought, which was a radical improvement.

    Navi 23's die size discrepancy is about half of Navi 21's. If it's a single shader engine with 32 ROPs (even with doubled zixel-rate) on a 128-bit bus targeted at 1080p gaming, that leaves us contemplating a 64MB last level cache, if Navi 21's is 128MB.

    So GA102 in sustained maximum FP32 compute runs at much higher than 1400MHz (ish)? I can't find any data on this...

    The items in your list, all added together, are a small percentage of the entire die.
     
    BRiT and PSman1700 like this.
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,847
    Likes Received:
    1,044
    Location:
    New York
    Let’s look at it another way. TU102 -> GA102 required 50% more transistors for 17% more SMs, same bus width and a few tweaks to existing functionality. Maybe GDDR6X took a chunk of that. Or the new FP32 data path was actually not that cheap. But let’s assume it is.

    We know RDNA 2 is introducing raytracing hardware and promises much greater power efficiency. I think it’s fair to assume RDNA2 CUs will grow significantly just for those 2 items alone.

    Is it really realistic to quadruple the number of these beefier CUs on the same process while keeping power in check? If I had to guess, your area estimates are just off or AMD spent transistors somewhere obvious like a wider memory bus.
     
    #3805 trinibwoy, Oct 18, 2020 at 11:42 PM
    Last edited: Oct 18, 2020 at 11:51 PM
  6. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,847
    Likes Received:
    1,044
    Location:
    New York
     
    pharma likes this.
  7. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    25
    Likes Received:
    43
    My current hypothesis is that the widening gap between base clocks and boost/game clocks is related to the fact that in most 'legacy' games the RT hardware is completely clock gated and powered down, so the entire TDP/TGP envelope can be spent on keeping the clocks up for the other functional units. Heavy workloads with RT enabled are where I would expect to see lower clocks. At any rate, we shall see soon enough. This is definitely one of the more interesting product cycles I've followed in many years.

    One other thing to consider is that with tight supply at TSMC for all of their advanced nodes, this may well cost more per die than GA102. TSMC is definitely charging more per unit of die area, and likely even more per transistor, than Samsung is on their 8nm process. Since we don't have detailed yield numbers on the Samsung side it's going to be tough to know the exact cost difference, but I wouldn't be surprised if GA102 is slightly cheaper to produce - with the caveat that lower yields on Samsung's process mean a higher ratio of 'harvested' dies to fully enabled and functional ones.
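    To put a rough shape on it, the usual back-of-envelope is gross dies per wafer times a Poisson yield term. A sketch where only the die areas are real (628mm² for GA102, and the rumoured 536mm² for Navi 21) - the wafer prices and defect densities are pure placeholders, since as said we don't have real numbers:

    Code:
    #include <math.h>
    #include <stdio.h>

    /* Gross dies per 300mm wafer, rectangle-on-circle approximation. */
    double dies_per_wafer(double die_mm2)
    {
        const double d = 300.0;  /* wafer diameter in mm */
        return (M_PI * d * d / 4.0) / die_mm2 - (M_PI * d) / sqrt(2.0 * die_mm2);
    }

    /* Poisson yield model: the fraction of dies with zero defects. */
    double cost_per_good_die(double wafer_usd, double die_mm2, double d0_per_mm2)
    {
        double yield = exp(-die_mm2 * d0_per_mm2);
        return wafer_usd / (dies_per_wafer(die_mm2) * yield);
    }

    int main(void)
    {
        /* Placeholder wafer prices and defect densities - for shape only. */
        printf("GA102 (628mm2, 8nm):  $%.0f per good die\n",
               cost_per_good_die(6000.0, 628.0, 0.0009));
        printf("Navi21 (536mm2, N7):  $%.0f per good die\n",
               cost_per_good_die(9000.0, 536.0, 0.0009));
        return 0;
    }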
     
    #3807 T2098, Oct 19, 2020 at 2:54 AM
    Last edited: Oct 19, 2020 at 3:00 AM
    Lightman and BRiT like this.
  8. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    791
    Likes Received:
    410
    If all you want to do is compute fractals in FP32 like AIDA64 does, the 3080 definitely is the best.
    If you have a real interest in fractals, you want at least FP64 precision, and there a 7820X CPU with AVX-512 or a 16-core Zen with AVX2 easily beats a 3080.
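    For the curious, the inner loop in question is just a chain of FP64 FMAs. A toy CUDA sketch (my own code, nothing to do with AIDA64's actual kernel) shows why the precision matters for throughput:

    Code:
    // Toy escape-time Mandelbrot in double precision. Illustrative only.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void mandelbrot_fp64(int* iters, int w, int h, double x0,
                                    double y0, double step, int max_iter)
    {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= w || py >= h) return;

        double cr = x0 + px * step, ci = y0 + py * step;
        double zr = 0.0, zi = 0.0;
        int i = 0;
        // Every op in this loop is FP64; GA102 runs FP64 at 1/64 of its
        // FP32 rate, which is why a wide-vector CPU can win here.
        while (i < max_iter && zr * zr + zi * zi < 4.0) {
            double t = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = t;
            ++i;
        }
        iters[py * w + px] = i;
    }

    int main()
    {
        const int w = 1024, h = 1024;
        int* d_iters;
        cudaMalloc(&d_iters, w * h * sizeof(int));
        dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
        mandelbrot_fp64<<<grid, block>>>(d_iters, w, h, -2.0, -1.5, 3.0 / w, 1000);
        cudaDeviceSynchronize();
        int first;
        cudaMemcpy(&first, d_iters, sizeof(int), cudaMemcpyDeviceToHost);
        printf("iters[0] = %d\n", first);
        cudaFree(d_iters);
        return 0;
    }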
     
    #3808 Voxilla, Oct 19, 2020 at 10:19 AM
    Last edited: Oct 19, 2020 at 10:33 AM
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,100
    Likes Received:
    1,186
    Location:
    London
    2080Ti is 23% faster than rated while 3080 is 5% faster (both compared with boost clock). So a 14% cut-down GA102 is throttling dramatically more than a 6% cut-down TU102.
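    (Working backwards from the reference boost clocks - 1545MHz for the 2080Ti and 1710MHz for the 3080, if I have those right - that's roughly 1545 × 1.23 ≈ 1900MHz observed versus 1710 × 1.05 ≈ 1795MHz.)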

    Is that sustained or over quickly? If it's not sustained, then left for longer, GA102 would get slower and slower.

    In this video:



    2080Ti completes the single precision test in "under 10s" (video is at 10x speed). So that's not sustained.

    Clock is locked at 1980MHz (far above base clock) here in a 3080 stress test:



    Clock is variable on 2080Ti stress test, but falls only as far as boost clock:



    So GA102 is not running at "power virus" clock (base) in Furmark, as claimed by @DegustatoR.

    So the evidence right now appears to indicate that compute throttles GA102 while Furmark doesn't.

    Furmark stress test on 5700XT never falls below "game clock" of 1905MHz:

     
  10. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    697
    Likes Received:
    148
    A couple of other things that might be driving the increased transistor cost/HW, in addition to the rumoured cache (and of course the rumoured die sizes):

    1. VRS in addition to RT, probably a few % for both. Possibly higher IPC per CU
    2. Redesigned ROPs. As per MS, there is some additional memory compression HW as well
    3. ML - int8/int4 support
    4. HW Decode - Likely support for 8K AV1 decode and possibly more display outputs/VR related HW
    5. IO decompression - Possibly solutions similar to XSX/PS5
    6. Transistors spent on optimizations for area and power/dark silicon as suggested. More fine-grained clock-gating is also likely. This can be non-trivial

    All of these can add up. FWIW, I'm not so sure on N23 being only 32 CUs if the N22 being 40 CUs rumour is also true. One of them has to be wrong.
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,100
    Likes Received:
    1,186
    Location:
    London
    XSX CUs show approximately 0 growth in area compared with Navi 10 and 14. Supposedly the AMD ray tracing patent document indicates a minor increase in die size. (Maybe one day I'll bother reading it.)

    Navi 21 is rumoured to support fewer hardware threads per SIMD (16 instead of 20). I can't see that this would make a major difference in CU size though. In theory more complex shaders are likely to result in fewer hardware threads fitting into a SIMD anyway, which would make supporting 20 pointless. Fewer hardware threads in flight per SIMD also helps with L0 cache coherency (reduced thrashing there). But 16 versus 20 isn't some amazing difference... 8 would make me think for longer...
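    (For scale, going by the RDNA whitepaper: each SIMD has 1024 VGPRs, so keeping 20 waves in flight means a shader fitting into 1024/20 = 51 VGPRs, ignoring allocation granularity, while 16 waves allows 64 - and plenty of real shaders blow past 51.)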

    If only XSX CUs were dramatically bigger...

    Rumours are for a 60% performance per watt improvement. That's more than normally occurs with a full node transition! It seems to indicate that Navi 10 was a terrible failure.

    Can we expect Navi 23 (6500XT?) to give the same performance as 5700XT? 32 CUs, 32 ROPs, 128-bit bus, 9.8TFLOPS at 2400MHz (guessed clock). That seems pretty reasonable to me. It supports the monster-cache rumour.
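    (Sanity check on the FLOPS figure: 32 CUs × 64 lanes × 2 FLOPs per FMA × 2.4GHz ≈ 9.83 TFLOPS, so those numbers are at least self-consistent.)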

    How much power would that consume? 130W like 5500XT? That's a 50% bigger die than Navi 14 (235 versus 157), which is rated for 130W.

    My modelled
    • Navi 14 derived from Navi 10 is off by about 10mm² (167 versus 157 actual)
    • Navi 23 (with 32 WGPs) derived from Navi 10 is 13mm² too large (248 versus 235 rumoured).
    Those are both 6% out. I think Navi 23 will actually be smaller than 235mm², because packaging makes the size too large. My Navi 10-based models are a few weeks old now and could be improved by my Navi 14 analysis, but 6% is crushing my enthusiasm.

    (There is an error in what I published a few weeks ago, but it's a self-cancelling error in the "uncore" versus "miscellaneous IO + edges" areas - they're mutually derived, so the error cancels in other modelled GPUs. Uncore is twice as large, so miscellaneous IO + edges is half-sized. I haven't come up with a way to improve the model for these two areas, but it's notable that Navi 14 has a smaller miscellaneous IO + edges area - less HDMI/DP? Seems strange, since there's a "pro" variant... The error in these, jointly, could be 10-15mm², out of about 60-70mm² when modelled for other GPUs, I suppose, but the Navi 10 die shot is too poor in quality to understand these areas.)

    My modelled PS5 die size is about 8mm² too large (316 versus 308) which is only 3%, but 308mm² is from the PS5 teardown video, so that seems murky and packaging makes it "measure" too large.

    Navi 21 die size, 536mm², is too vague in my opinion to make much of a claim (modelled 524mm² with 80 WGPs). Packaging-derived measurement error would be about 20mm², e.g. 516mm².

    On-chip or >256-bit GDDR6?

    An improved on-die memory system, "Infinity Fabric" combined with "Infinity Cache" should use more transistors, definitely, even without a monster cache. XSX die shows a pair of fairly large blocks that are labelled as "SOC fabric, coherency, G6 MCs" (about 25mm² for the two). I dunno how to translate that into Navi 2x... Some of that should be directly associated with the CPU, but some of it is "Infinity ...".
     
    PSman1700, Lightman and BRiT like this.
  12. DDH

    DDH
    Newcomer

    Joined:
    Jun 9, 2016
    Messages:
    33
    Likes Received:
    29
    Do we know with certainty which TSMC node the XBSX SoC is made on? Perhaps "enhanced 7nm" is TSMC's N7+, which would account for the ~0% increase in CU size.
     
  13. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,331
    Likes Received:
    3,310
    Location:
    Finland
    No, we don't know for certain, but "AMD enhanced 7nm" sounds an awful lot like the tweaked N7 used in the Zen2 refresh and Zen3. Navi10 is on N7P, but its transistor density is the same as N7's.
     
  14. DDH

    DDH
    Newcomer

    Joined:
    Jun 9, 2016
    Messages:
    33
    Likes Received:
    29
    I agree, but perhaps they were under an agreement not to disclose the actual process node until AMD does.
     
  15. fehu

    Veteran Regular

    Joined:
    Nov 15, 2006
    Messages:
    1,837
    Likes Received:
    832
    Location:
    Somewhere over the ocean
     
    Lightman, CarstenS, Pete and 2 others like this.
  16. Erinyes

    Regular

    Joined:
    Mar 25, 2010
    Messages:
    697
    Likes Received:
    148
    Yes, N7P only offers power and/or performance benefits. If what Papermaster is alluding to is true, Zen3 may not even be on N7P, though I'd be surprised if it wasn't on the more advanced process by late 2020. And we have to remember that even within the same process node, density can vary widely, as we see with 7nm mobile SoCs and even, to a certain extent, with A100. I haven't been able to ascertain with certainty which node A100 is using, though. I would expect it to be on N7+ (EUV) for density and power reasons, but I can't find any information to confirm this. For Navi 2x as well the density advantage of N7+ would be attractive, but they could have stayed with N7P.
     
  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,331
    Likes Received:
    3,310
    Location:
    Finland
    7nm mobile SoCs use different libraries compared to high performance chips, though, which explains the major difference in transistor densities.
     
    Lightman likes this.
  18. tsa1

    Newcomer

    Joined:
    Oct 8, 2020
    Messages:
    10
    Likes Received:
    14
    Not sure about Navi, but Zen3 may just as well be on plain 7nm, which has improved a lot since the launch of Zen2. Older Ryzen 5 3600s could barely reach a 4-4.1GHz all-core clock with high-ish voltages (circa 1.35V), while newer CPUs (20xx PGT/PGS batches) can easily do 4.4-4.5GHz at low voltages (1.25V or even less) with full stability in both LinX and Prime95. So clocks might have improved a lot even without going to N7P, N7+, N7 EUV or whatever it's called now.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,100
    Likes Received:
    1,186
    Location:
    London
    I've been wondering why Microsoft and Sony have recently been talking about backwards-compatible games "running in GCN mode not RDNA".

    USE OF WORKGROUPS IN PIXEL SHADER

    It seems to suggest that efficiency for pixel shading can be improved by grouping fragments into more than a single-wavefront-workgroup, e.g. a workgroup containing four wavefronts.

    Quite a lot of state is shared by all fragments in a wave, so making a wave larger can be beneficial. Though there's still going to be a problem with register allocation, which is the typical reason why large workgroups have limited use in compute (along with their low latency hiding). If a workgroup is spread across some or all SIMDs in a WGP, then the register allocation problem is ameliorated.

    Along the way the document implies that current hardware stores some pixel shader state (e.g. vertex attributes) in LDS. I don't know if that's actually the case, but it's interesting.

    The document then proceeds to describe how the pixel shader can be broken up into stages. Any stage that performs computations that are shared by all fragments in the workgroup can be computed by a single wave, not all waves. The results are put into LDS and then all waves in the workgroup access those results.

    Workgroup-orientated execution of pixel shaders would also help with cache coherency, which has been an ongoing theme of RDNA.
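    The staged structure is easy to mimic in compute terms. A minimal CUDA sketch of my reading of it - one wave does the setup that the whole workgroup shares, parks it in LDS (__shared__ here), and the other waves consume it after a barrier. All names and the toy "shading" are mine, not the patent's:

    Code:
    #include <cstdio>
    #include <cuda_runtime.h>

    // 4 "waves" of 32 threads = one 128-thread workgroup.
    __global__ void staged_pixel_shader(const float* tri_consts, float* out,
                                        int n_tris, int frags_per_tri)
    {
        __shared__ float setup[64];   // per-triangle setup, the LDS analogue

        // Stage 1: a single wave computes the values shared by every fragment.
        if (threadIdx.x < 32) {
            for (int t = threadIdx.x; t < n_tris; t += 32)
                setup[t] = 2.0f * tri_consts[t];   // stand-in for real setup maths
        }
        __syncthreads();   // workgroup-wide barrier before anyone reads the results

        // Stage 2: all four waves shade their fragments from the shared setup.
        int frag = blockIdx.x * blockDim.x + threadIdx.x;
        int tri  = (frag / frags_per_tri) % n_tris;
        out[frag] = setup[tri] + 0.25f * (frag & 31);   // toy "shading"
    }

    int main()
    {
        const int n_tris = 64, frags_per_tri = 16, n_frags = n_tris * frags_per_tri;
        float *d_consts, *d_out;
        cudaMalloc(&d_consts, n_tris * sizeof(float));
        cudaMalloc(&d_out, n_frags * sizeof(float));
        cudaMemset(d_consts, 0, n_tris * sizeof(float));
        staged_pixel_shader<<<n_frags / 128, 128>>>(d_consts, d_out,
                                                    n_tris, frags_per_tri);
        cudaDeviceSynchronize();
        printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d_consts);
        cudaFree(d_out);
        return 0;
    }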
     
    Krteq, Lightman and PSman1700 like this.
  20. andermans

    Joined:
    Sep 11, 2020
    Messages:
    6
    Likes Received:
    6
    The current hardware stores the vertex attributes used for interpolation in LDS during a pixel shader invocation (for up to 16 triangles per PS wave; you interpolate them with the V_INTERP_P1_F32 and V_INTERP_P2_F32 instructions in the shader, which read directly from LDS).
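    In scalar terms the pair is just two chained FMAs against per-triangle plane factors held in LDS - roughly this (my paraphrase of the ISA docs, not real shader code):

    Code:
    #include <stdio.h>

    /* P0, P10 (= P1-P0) and P20 (= P2-P0) are the per-attribute plane
       factors the hardware keeps in LDS; i and j are the barycentrics. */
    float v_interp(float i, float j, float P0, float P10, float P20)
    {
        float d = i * P10 + P0;   /* V_INTERP_P1_F32: D = P10 * i + P0 */
        return j * P20 + d;       /* V_INTERP_P2_F32: D = P20 * j + D  */
    }

    int main(void)
    {
        /* Attribute with corner values 1, 3, 5 sampled at (i, j) = (0.25, 0.5). */
        printf("%f\n", v_interp(0.25f, 0.5f, 1.0f, 2.0f, 4.0f));
        return 0;
    }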
     
    Jawed likes this.