AMD: Navi Speculation, Rumours and Discussion [2019-2020]

Discussion in 'Architecture and Products' started by Kaotik, Jan 2, 2019.

Thread Status:
Not open for further replies.
  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    The ES board could have been literally any board; people are just focused on two screws, when none of the other screws on it match the actual card photos (or renders, for that matter).
     
  2. msxyz

    Newcomer

    Joined:
    May 5, 2006
    Messages:
    122
    Likes Received:
    54
    Wouldn't this sort of hybrid approach, with both HBM and a wide external bus, be too expensive? AMD also has to compete with NVidia on price. A large eSRAM or eDRAM cache would be more cost-effective (especially the latter).
     
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Ooh! That does look low!

    But the "arms" on the retention bracket are much longer in the "RedGamingTech" picture than in what I suppose is the RX 5700 XT that you're comparing with.

    So the arms are making it look like the GPU is lower than it really is. So, not HBM in my opinion.
     
  4. fehu

    Veteran

    Joined:
    Nov 15, 2006
    Messages:
    2,067
    Likes Received:
    992
    Location:
    Somewhere over the ocean
    Looks shorter too
     
  5. DDH

    DDH
    Newcomer

    Joined:
    Jun 9, 2016
    Messages:
    36
    Likes Received:
    39

    Based on that picture, the GPU for the 6900 XT is mounted 4mm lower than on the 5700 XT.

    The retention bracket is about 20% bigger: the diametrically opposite screw mountings are 90mm apart vs 76mm on the 5700 XT.

    This closely matches the Radeon VII mounting bracket, which was also 90mm IIRC. The Radeon VII had 4 stacks of HBM and an interposer size of approximately 840mm².

    A 500mm² die with 2 stacks of HBM2 would be close to the interposer size of a Radeon VII.
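    A rough sanity check on that comparison (all figures here are my assumptions, not measurements):

```python
# Back-of-envelope interposer area estimate: an assumed ~500 mm^2 die plus
# two HBM2 stacks at roughly 92 mm^2 each, with a 20% allowance for spacing
# and routing. Compare against the ~840 mm^2 Radeon VII interposer.
DIE_MM2 = 500          # assumed big-die size
HBM2_STACK_MM2 = 92    # typical HBM2 stack footprint (assumption)
MARGIN = 1.2           # 20% spacing/routing allowance (assumption)

estimate = (DIE_MM2 + 2 * HBM2_STACK_MM2) * MARGIN
print(f"estimated interposer area: {estimate:.0f} mm^2")  # ~821 mm^2
```

    Close enough to 840mm² that a reused 90mm bracket would not be surprising.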

    Rogame indicates there are 4 variants of Big Navi, 3 of which should be for consumer cards: XTX, XT and XL. Perhaps XTX is 80CU HBM, XT 72CU HBM, XL 72CU GDDR?

    By having both HBM and GDDR phy/io, could AMD potentially increase their wafer yields?
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    https://www.techpowerup.com/gpu-specs/radeon-rx-vega-64.c2871

    I hypothesised something similar recently (though with 40CU dies as a pair).
     
  7. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
    I mean, yes, but that would be a bit silly; the yields wouldn't be that much higher.

    Though I suppose you could support more bins. Some super performance bin could get hbm and they could charge a mint and a half, while the more common bin gets gddr and is meant for the mass market.

    Still, such a strategy would make much more sense with chiplets. Then you wouldn't waste die space on doubling up the memory interface, you just pair the right logic chiplets with the right memory controllers, like some sort of computer engineering lego.
     
    jacozz likes this.
  8. giannhs

    Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    44
    Likes Received:
    47
  9. Krteq

    Newcomer

    Joined:
    May 5, 2020
    Messages:
    149
    Likes Received:
    263
    w0lfram and Lightman like this.
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Just watched Linus’ Tiger Lake “review” and he spent some time on the near-total unavailability of AMD’s highly regarded 7nm mobile CPUs. Given that backdrop I don’t see how AMD can produce Navi 2x in sufficient quantities unless they’ve been stockpiling for many months.
     
  11. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    568
    Likes Received:
    104
    There have been reports that Huawei bought up a ton of short-term TSMC capacity, so if that is what's behind the shortages, availability should be better for AMD going forward.
     
  12. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
    I mean, they might be. The contracts for these things are planned years in advance; I don't think they expected their mobile CPUs to do nearly as well as they did, and of course TSMC doesn't have any extra capacity whatsoever for short-notice runs (do they? Everything I've seen says they're nigh overbooked).

    So it partially depends on how many RDNA2 cards they bet, years ago, that they'd sell. That, and the GDDR shortage causing all these "sold out instantly!" problems.
     
  13. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
    Console SoCs have likely been taking lots of capacity for a while to build the initial stock. I assume they would have gone down by now as launches approach, so that might give space for discrete GPUs. That and also the freed capacity from some other TSMC clients (e.g. Apple moving to 5nm), which AMD is very likely keen on bidding.
     
  14. DDH

    DDH
    Newcomer

    Joined:
    Jun 9, 2016
    Messages:
    36
    Likes Received:
    39
    This could also be a result of the pandemic and the subsequent explosion in tech sales, which all major tech companies benefited from. AMD's projections for their mobile chips may not have been sufficient to cover the additional demand, and without the ability to increase production, here we are.

    I'm hopeful AMD will have more Navi 2x cards stockpiled than NVIDIA did of Ampere.
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Hope you guys are right. We should see supply open up on the CPU front first as TSMC capacity is freed up. I have no delusions about getting a Zen 3 chip in Q4 but it would be nice.
     
    PSman1700 likes this.
  16. dumbo11

    Regular

    Joined:
    Apr 21, 2010
    Messages:
    440
    Likes Received:
    7
    AFAICT this would provide a pseudo-fast memory pool *if* you can ensure that [dataset size] < [number of CUs on task] x [cache per CU]?

    Which seems interesting, as it means the optimal dataset size would be roughly proportional to the number of CUs on task, which suggests different RDNA2 cards may show very different performance in similar scenarios.
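    That condition can be sketched directly (the function name and the figures are mine, purely illustrative):

```python
def fits_in_aggregate_cache(dataset_bytes, num_cus, cache_per_cu_bytes):
    """The condition above: the pool is only 'pseudo-fast' while
    [dataset size] < [number of CUs on task] x [cache per CU]."""
    return dataset_bytes < num_cus * cache_per_cu_bytes

# Hypothetical figures: 40 CUs with 128 KiB each give a 5 MiB pool,
# so a 4 MiB dataset fits but an 8 MiB one does not.
print(fits_in_aggregate_cache(4 * 2**20, 40, 128 * 2**10))  # True
print(fits_in_aggregate_cache(8 * 2**20, 40, 128 * 2**10))  # False
```

    A cut-down part with fewer CUs on task shrinks the right-hand side, which is why the same dataset could behave very differently across the product stack.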
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    Different GPUs already have different L2 cache sizes, because L2 cache slices are allocated per memory channel. RDNA is also explicitly designed to allow the L2 slice size per memory channel to vary: from 64KB to 512KB per slice.
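    Since slices attach per memory channel, total L2 is just slice size times channel count. A quick sketch (the 16 x 256 KB = 4 MB figure for the RX 5700 XT is the commonly reported Navi 10 configuration; treat it as an assumption):

```python
def total_l2_kb(num_channels, slice_kb):
    # One L2 slice per memory channel, so capacity scales with both
    # the channel count and the per-slice size.
    return num_channels * slice_kb

print(total_l2_kb(16, 256))  # 4096 KB (4 MB), reported RX 5700 XT config
print(total_l2_kb(16, 64))   # 1024 KB, low end of the 64-512 KB slice range
print(total_l2_kb(16, 512))  # 8192 KB, high end of that range
```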
     
    Lightman likes this.
  18. pTmdfx

    Regular

    Joined:
    May 27, 2014
    Messages:
    416
    Likes Received:
    379
    It doesn't read as full replication at all to me: there are repeated descriptions of, and references to, fine-grained address interleaving across the participating caches in a cluster to dynamically deliver larger effective cache capacity. Paragraph 62, for example, specifically says that the number of CU clusters is to be decreased if lines are experiencing a high level of sharing (i.e., a high replication level), and having fewer CU clusters means a larger pool of caches for interleaving within each cluster.
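    A minimal sketch of what fine-grained address interleaving means here (my illustration, not the patent's exact scheme):

```python
LINE_BYTES = 128  # assumed cache-line size

def home_cache(address, num_caches_in_cluster):
    """Stripe cache lines across the participating caches in a cluster:
    each line has exactly one home, so shared lines aren't replicated
    per CU, and effective capacity is the sum of the members'."""
    return (address // LINE_BYTES) % num_caches_in_cluster

# Five consecutive lines striped across a 4-cache cluster:
print([home_cache(a, 4) for a in range(0, 5 * LINE_BYTES, LINE_BYTES)])
# [0, 1, 2, 3, 0]
```

    Fewer, larger clusters then mean more caches to interleave across per cluster, at the cost of more cross-CU traffic.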
     
    Ext3h likes this.
  19. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    Thanks, should have read the full patent text. My bad.

    So trading access latency (when clustered and interleaved) vs small L1 (when not clustered / interleaved).

    Even though I'm having trouble seeing where this is applicable? It appears to be focused primarily on reducing the load on the (shared) L2 cache.
    Is the L2 cache on a GPU actually still that much slower than L1, so that even 3 trips over the crossbar can outweigh an L2 hit?

    Even if you consider that, in the context of the patent, even a store-and-forward approach was suggested for CU-to-CU communication, so this is potentially quite a few cycles wasted?
    By the looks of it, yes. (3.90TB/s L1 bandwidth vs 1.95TB/s L2 bandwidth on the Radeon RX 5700 XT.)

    For what it's worth, the author also often spoke only about a small number of CUs in each cluster, and about locality on the chip. So possibly this isn't even aiming at a global crossbar, but only at 4-8 CUs maximum in a single cluster?
    I suppose 8 CUs with 128kB of L1 each still yield a 1MB memory pool. And you have to keep in mind the L0 cache still sits above that, so at least some additional latency isn't going to ruin performance.

    An interesting question: can this reliably reduce the bandwidth requirements on the L2 far enough that distributing L2 cache slices across dies becomes viable?
    A clear "no" to that. Even when increasing the effective L1 size 8x this way, L1 cache misses are unavoidable in the bad cases. Maybe a 30-50% reduction in L2 hits on average, but not even proportional to the number of CUs participating in each cluster. And nothing changes on write-back bandwidth. Rough estimates, not properly founded, but still not even remotely enough to allow splitting the LLC.

    Should still amount to a reasonable performance uplift in the bad cases which had been suffering from excessive L1 misses but a good L2 hit ratio before. I could see this pushing the viable size limit for lookup tables by a fair amount.
     
    Lightman and BRiT like this.
  20. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    L1 is shared by all CUs in a shader array, 10 in RX 5700 XT for example.

    ROPs are clients of L1 in Navi. ROP write operations are L1-write-through though, i.e. L1 doesn't support writes per se, so L2 is updated directly by ROP writes.

    What proportion of frame-time is taken up with ROP writes? What proportion of VRAM bandwidth is taken up by ROP writes?

    I disagree. L1s need to be able to talk to all L2 slices. That isn't changed by a "chiplet" design where LLC is spread amongst chiplets. 2-4TB/s bandwidth amongst chiplets over an interposer seems pretty easy.

    The rumours are a shitshow right now, but there has been a rumour of a dramatic increase in L1 size. Nothing to do with the "128MB Infinity Cache" rumour, but of course it could be a component of an "Infinity Cache" system.
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.