AMD: RDNA 3 Speculation, Rumours and Discussion

Discussion in 'Architecture and Products' started by Jawed, Oct 28, 2020.

  1. arandomguy

    Newcomer

    Joined:
    Jul 27, 2020
    Messages:
    104
    Likes Received:
    159
    It would be more in the purview of the manufacturing side, where both TSMC and Samsung have alternative developments at varying stages.

    Example - https://semiwiki.com/semiconductor-...ghts-of-the-tsmc-technology-symposium-part-2/

    At least to my knowledge, chiplet designs and inter-chip connectivity have been on the industry's mind for quite some time now, so the entire chain has been pursuing technologies to support that paradigm. Despite the "technical marketing", neither AMD's chiplet design for CPUs nor Intel's EMIB is some out-of-the-blue surprise technology, even though you can leverage them for some mind share by being first to mass market.
     
  2. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    37
    Location:
    Herwood, Tampere, Finland
    No, it's not identical but TOTALLY DIFFERENT architecture-wise. There is MUCH MORE to a cache than its total size.
    It is probably built from similar physical SRAM blocks, but those are organized and connected totally differently.

    Threadripper/EPYC has 8..16 separate 8..16 MiB caches, one for each CCX.

    If every core is reading the same address, it will be cached in 8..16 separate L3 caches, and each core can only use 8 (Zen 1) or 16 (Zen 2) megabytes of cache. That means lots of duplication of the same data, and misses when the data is in the wrong L3 cache. If all the cores are operating on the same data, the L3 hit rate in Zen/Zen 2 is much worse than the hit rate of a single 128 MiB L3 cache would be.

    On RDNA2, there is 128 MiB of cache that is shared by ALL the cores.

    Also, it's still unknown whether the "infinity cache" of RDNA2 is a "memory-side" or "core-side" cache.
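    A toy capacity model (my own illustration with made-up sharing fractions, not AMD data) shows why the split-cache case loses effective capacity to duplication:

    ```python
    # Toy model: effective distinct-data capacity of N per-CCX L3 caches
    # vs one unified cache, when some fraction of each cache holds data
    # that every cluster also caches (and is therefore duplicated).

    def effective_capacity_mib(num_caches, cache_mib, shared_fraction):
        """Distinct data held across all caches, in MiB.
        shared_fraction is the (assumed) fraction of each cache that
        holds data replicated in every other cache."""
        duplicated = cache_mib * shared_fraction      # same data in every cache
        unique = cache_mib * (1.0 - shared_fraction)  # distinct per cache
        # Shared data counts once; unique data adds up across caches.
        return duplicated + unique * num_caches

    # 16 CCXs x 16 MiB (Zen 2 EPYC-style), assuming half of each L3 is shared:
    split = effective_capacity_mib(16, 16, 0.5)    # 136.0 MiB distinct
    unified = effective_capacity_mib(1, 256, 0.5)  # 256.0 MiB distinct
    print(split, unified)
    ```

    With the same 256 MiB of total SRAM, the split organisation holds barely half as much distinct data under this (assumed) sharing pattern, which is the duplication penalty hkultala describes.
    
    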
     
    BRiT likes this.
  3. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
    Not disagreeing, just to put out some numbers:

    (Navi 10) GL1 to L2:
    16 channels * 2 (bidirectional) * 64 bytes/clk * 2.0 GHz (assumed) = 4 TB/s
    16 * 2 * 512 = 16384 wires o_O

    Though for sure you could use SerDes, say at ~12.5 GT/s like Zeppelin's OPIO, which reduces the wire count by ~6x. But that is still approximately 2-3 HBM stacks' worth of data wires. In other words, for this to happen, it would need higher-density on-package I/O than EPYC's substrate provides.

    Edit:
    (Navi 10) L2 to IF/SDF/MC:
    16 channels * 2 (bidirectional) * 32 bytes/clk * 1.75 GHz (presumably mclk @ 14 Gbps) = 1.7 TB/s

    Navi 21 seems to have gone from 32 B/clk to 64 B/clk, according to the footnote. Presumably that is for the Infinity Cache, if we assume the IC is memory-side.

    Putting eDRAM aside, they could continue to manufacture all dies on the same bleeding-edge processes. So at least they still get the cost-saving part of the deal in reuse and dodging big monolithic chips, and get around the reticle limit of a single die.
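    The back-of-the-envelope numbers above can be reproduced in a few lines (same assumptions as in the post: 2.0 GHz fabric clock, 512-bit-wide channels, ~12.5 GT/s SerDes like Zeppelin's OPIO):

    ```python
    # (Navi 10) GL1-to-L2 link: bandwidth and wire count, as estimated above.

    channels = 16
    directions = 2            # bidirectional
    bytes_per_clk = 64
    fclk_ghz = 2.0            # assumed fabric clock

    bandwidth_tbs = channels * directions * bytes_per_clk * fclk_ghz / 1000
    wires_parallel = channels * directions * 512      # 512-bit-wide links

    # Serializing each wire at 12.5 GT/s instead of 2.0 GT/s cuts the
    # wire count by 12.5/2.0 = 6.25x:
    serdes_gts = 12.5
    wires_serdes = wires_parallel * (fclk_ghz / serdes_gts)

    print(bandwidth_tbs)         # ~4.1 TB/s
    print(wires_parallel)        # 16384
    print(round(wires_serdes))   # ~2621, still a few HBM PHYs' worth
    ```
    
    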
     
    #23 pTmdfx, Oct 29, 2020
    Last edited: Oct 29, 2020
    BRiT likes this.
  4. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    913
    Likes Received:
    347
    Sorry, I didn't make an easily interpretable comment. I meant to imply that the size of the cache is seemingly not a show-stopper for migration into a separate I/O die (in Threadripper it goes up to 256 MB), if that was one of Jawed's concerns.
    In the presentation they said they "templated" some things off the Zen architecture's L3, which can mean anything really, but it leaves open the possibility that there are architectural similarities. In the die shot you can see the cache behind the fabric wall. I can only guess this might mean that a unified protocol through the fabric, between the cores of a CPU [core die] and/or GPU [core die] and/or the cache [I/O die], could be a possibility (apart from the obvious banking differences: what CPUs need is not the same as what GPUs need, thinking of an APU here), but more specifically that a die separation between the inside and the outside at the fabric wall is a low-complexity possibility.
     
    Lightman likes this.
  5. Entropy

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,295
    Likes Received:
    1,307
    There is one trend that’s in favour of chiplets, and that is rising fixed costs for chips on leading edge processes.
    AMD going with chiplets for Zen was not necessarily optimal vs. monolithic, as the performance of Renoir and, to some extent, the Ryzen 3 3300X demonstrates. But it allowed AMD to address a wide spectrum of the x86 market, from modest self-builders, via HEDT, to big server solutions, with a single chiplet design. That not only saved them a lot of money on lithographic mask sets, but more importantly design cost and design time. Given their financial situation, I don't believe they could have done anywhere near as well with monolithic chip designs addressing all those market segments.

    That doesn’t mean it’s a universal recipe for success by any means; mobile SoCs are a shining counterexample, for instance. Since we’re talking about GPUs here, they enjoy the advantage of scalable design, so producing, for instance, a half-of-everything design is way cheaper than the original new base design. That one will still cost you a pretty penny though, and the mask sets are quite costly for each distinct chip. So if you are targeting consumer products, you really want high volume on your products to amortise your fixed costs over. Depending on what the market looks like, that could be an argument for trying to make a chiplet-type design work, despite its challenges. Personally I doubt it makes all that much sense for GPUs. Nvidia and AMD seem able to cover the entire current GPU spectrum with at most three chips, which can be designed by scaling the resources of a base chip. Yield for the biggest chips could be an issue, but TSMC seems to do well with yields even on 5nm, designing with redundancy helps of course, and Nvidia and AMD would use frequency binning and cutting down functional units anyway to provide a wide range of SKUs.

    Moore's Law is dead, by the way. Nobody told you? ;-) Packaging is where it's at these days.
     
    nAo and BRiT like this.
  6. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    611
    Likes Received:
    357
    128MB is ~ how much you need to fit the framebuffers for most demanding games at 4k.
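    As a rough sanity check (my own arithmetic with an assumed, illustrative set of render targets, not a claim about any specific engine), a lean 4K target set does land right around that mark:

    ```python
    # Size of a plausible per-frame render-target set at 4K (3840x2160).
    W, H = 3840, 2160

    def target_mib(bytes_per_pixel):
        """Size of one full-screen target in MiB."""
        return W * H * bytes_per_pixel / 2**20

    # Assumed targets and their bytes per pixel:
    targets = {
        "RGBA16F HDR color": 8,
        "D32 depth":         4,
        "RGBA8 motion/aux":  4,
    }
    total = sum(target_mib(bpp) for bpp in targets.values())
    print(round(total, 1))  # ~126.6 MiB, right around the 128 MB mark
    ```

    A fatter G-buffer blows past 128 MB easily, which is where the "edge cases" discussed later come in.
    
    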

    TSMC has been pushing hard on multiple different 3D packaging technologies, one of which is very similar to EMIB. TSMC's own site about them is sadly totally incomprehensible buzzword bingo; I personally prefer the much more readable, reasonably recent AnandTech article about them. It's important to note that these things are now actually being pushed into real, reasonably high-volume products. Nothing as high-volume as GPUs yet, but no longer just really low-volume one-off products either.

    I think it's a good bet that both AMD CPUs and GPUs will end up using something from this set of technologies, it's just a question of timing. I think on the first generation that they actually use it, there will be only one or two products doing it, so they don't have to bet an entire product cycle on relatively unproven tech. Then once it's proven the next gen products are all in.

    If they are using any kind of 3D integration, they are going to use some HBM derivative and stack the memory on it too, essentially removing this issue from the equation.

    Everyone in the industry disagrees with you here. Look at what AMD, Intel, and TSMC are talking about, everyone is moving towards disaggregating things from one chip to many small ones. The driving force behind this is that everyone apparently believes that yields on large chips will be worse in the future than they are now.

    I agree that the hard part of splitting a rasterizing GPU is the ROPs. Every simple solution results in massive traffic, essentially needed to ensure coherency between them. The solution is either to do it late in the pipeline, with massive inter-chiplet bandwidth provided by some kind of 3D packaging system (conceptually easy but technically challenging), or some kind of binning and data movement in the early parts of the pipeline to get the right pixels to the right pipelines before you shade them, which is easier on the hardware but much harder on the software, especially for providing good performance on older games.

    I do want to note that all of this goes away once everything goes RT. The ideal RT accelerator can consist of many small chiplets that basically don't even need to know of each other's existence. It's okay for the uppermost levels of the acceleration structure to be duplicated in every cache, as they are relatively small anyway; good trees widen quickly. Assuming that the scene is partitioned into contiguous chunks for each RT chiplet, the cache usage partitions exactly the right way: any primitive or part of the acceleration structure that is needed to draw a pixel is most likely needed to draw a pixel right next to it on screen, and least likely needed to draw a pixel far away from it.
     
  7. pTmdfx

    Regular Newcomer

    Joined:
    May 27, 2014
    Messages:
    379
    Likes Received:
    338
    Moore's Law isn't dead like a corpse yet. We can have both cakes and eat them all.
    :mrgreen:
     
  8. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    2,547
    Likes Received:
    1,682
    Location:
    Earth
    I don't know about Moore's Law, but the 3080 is on average not even 2x faster than my 3.5-year-old 1080 Ti. That's despite chip sizes and power draws getting bigger and new GPUs being more expensive. If it takes roughly 4 years to double GPU performance, that is just sad. And if it gets even slower, that's sadder still.

    That said, I expect RDNA 3 vs. Nvidia to be very interesting. Both players are on the ball, and that could bring out the best in both sides. Maybe the RDNA 3 upgrade cycle will be the best we have seen in a long time, or perhaps doubling of perf will now happen every 5 years :(
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    It's worth remembering that Tahiti launched with 264GB/s 8 years ago. So we're approaching 3x the performance per unit bandwidth...
     
    Lightman likes this.
  10. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    631
    Likes Received:
    323
    Today, sure. My concern is tomorrow. UE4 already has a "next-gen only" TAA update, and I'd have to check, but based on other TAA advances I bet it already uses at least one extra 8-bit target, if not more.

    Now, for the competition, these same buffers will likely be pulled a lot, choking the 30XX series for bandwidth even as RDNA 2 gets choked by its limited cache. Ampere gaming just doesn't have enough bandwidth to go around. But I still suspect we'll see edge cases where buffers overrun 128 MB without pulling a ton of traffic.

    Anyway, chiplets make a hell of a lot of sense for GPUs. Yields are pretty terrible for both of the new console APUs, and "Big" Navi is even larger. Chiplets have advantages across the board. You only need to tape out one or two relatively small chiplet designs, and with design costs and times growing exponentially with each new node, that alone is a huge cost and time saver. And by saving time you can also update faster: if your compute or I/O die is updated, you don't necessarily need to wait for the other die(s) to be updated to put out a new product; we can see this shipping already with Zen processor generations. Then of course the yields go up mightily with tiny chiplets, as do binning possibilities. You can even scale beyond reticle limits, something Nvidia is no doubt already hotly anticipating, as it's already in R&D at TSMC for their interconnects. And it probably opens up Samsung a lot more as competition to TSMC: if its yields are pretty bad in comparison, but you're getting far more usable silicon thanks to tiny chiplets anyway, it could make Samsung a lot more attractive. Not to mention that with better yields, going wider than ever is even more affordable, which could make it sensible even for mobile designs. Who needs high-clocked chips when it's relatively cheap to throw 50% more silicon at the problem while getting better power efficiency?
     
  11. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    37
    Location:
    Herwood, Tampere, Finland
    Moore's Law is very far from dead. It has only slightly slowed down: density is doubling every ~3 years instead of every 2 years.

    Navi 21 has about 2.1 times more transistors than Vega 10, which was released about 38 months earlier.
    nVidia GA100 has about 3.5 times more transistors than GV100, which was released 3 years earlier, so here Nvidia has actually stayed at the pace of Moore's Law.

    TSMC N7 is 2.5 times denser than GF "14nm", which came about 3 years earlier.

    And Moore's Law has NEVER said anything about clock speed or single-thread performance.
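    The implied doubling period can be computed directly from the transistor ratios and release gaps quoted above (the counts and dates are as given in the post):

    ```python
    # If transistor count grew by `ratio` over `months`, the implied
    # doubling period is months / log2(ratio).
    import math

    def doubling_months(ratio, months):
        return months / math.log2(ratio)

    # Navi 21 vs Vega 10: 2.1x in ~38 months -> ~36-month doubling
    print(round(doubling_months(2.1, 38)))
    # GA100 vs GV100: 3.5x in ~36 months -> ~20-month doubling,
    # i.e. actually faster than the classic 24-month cadence
    print(round(doubling_months(3.5, 36)))
    ```
    
    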


     
    T2098, McHuj and PSman1700 like this.
  12. Esrever

    Regular Newcomer

    Joined:
    Feb 6, 2013
    Messages:
    822
    Likes Received:
    616
    Slowing down = dead, if it was about the exponential growth in the first place. The main reason Moore's Law is dead isn't that we aren't getting any improvement any more; it's that we will never be back where performance gains from just moving to a new node and adding more transistors automatically made much, much better ICs every generation. We don't get to just pack 2x the number of faster transistors in the same space without worrying about power and cost, like in the 70s, 80s and 90s. That whole paradigm enabled general-purpose ICs to dominate the market, because designing specialized hardware didn't make financial sense when Intel was coming in with a new CPU that did pretty much everything twice as fast every 2 years. Nowadays you don't get that. We are going to need specialized hardware to do specialized things, and this has been going on for the past 10 years.

    Another thing to note is that cost per transistor has exploded even when moving to denser nodes. The electrical characteristics of transistors on the newer nodes are just plain not scaling. 5nm is barely an improvement over 7nm in practical terms such as power, performance and cost. Who cares if it's denser if it doesn't gain anything else? The physical size of the chip has never been the big draw of new nodes.

    Here is Moore's original quote:
    We pretty much stopped scaling like this at 28nm. Even moving to 14nm was expensive, and the cost of ICs has been going up over the years. Wafer costs on the newer nodes are going up almost as fast as transistor density. It took 2 extra years before 14nm was able to overtake 22nm for Intel, and because of Intel's insistence on scaling, they are still on 14nm after trying to move beyond it for 6 years. Every new node is just more band-aids put on top to make it work, making things more expensive. New tech is helping, but all of it is temporary. None of the new tech (EUV, GAAFET, TSV, stacking, etc.) is going to solve the problem long term, since none of it enables the exponential scaling that used to come basically for free. They will all hit limits within a few generations of use. The requirements for new tech to keep progressing are getting higher and higher, and this is what really defines post-Moore's Law versus what used to happen during Moore's Law back in the day.

    The law is long dead at this point unless you keep changing the definition like some people like to do.
     
  13. vjPiedPiper

    Newcomer

    Joined:
    Nov 23, 2005
    Messages:
    107
    Likes Received:
    61
    Location:
    Melbourne Aus.

    Thanks for the info. I'm no GPU expert, but my answers...

    1) On each chiplet, next to the shading/RT engine, as they are now.
    2) OK, this is the part I know least about. I'm pretty good with modern CPU tech, but fall down when we get to the nitty-gritty of breaking up GPU workloads and distributing them among units.
    3) Sorry, no L4 in the I/O die, just the memory controller. You've already got a low-latency, high-BW cache close to the units; adding a bit of latency to go to actual memory isn't going to kill you.
    a) Yeah, I'm not sure about this. What about just accepting that some pixels will be duplicated during rendering?
    b) As before, no L4 cache: L3 on each GPU chiplet, then talk to a big fat I/O die which is the memory controller.
    4) Whatever is best for the chosen memory interface, likely similar to what AMD does now, and do it on an older process that is better for high-pin-count I/O.

    I understand Moore's Law, but I also understand modern reticle limits. If you want a GPU that is 10x to 100x the power of a 3090 or 6900, it's not going to happen on a single die, Moore's Law or not.

    :)
     
  14. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    37
    Location:
    Herwood, Tampere, Finland
    Moore's Law has never been about performance. It has always been about the number of transistors per chip. And that 3080 has 2.33 times more transistors than your 1080 Ti, and it came about 3.5 years later. That's about a doubling every 3 years, only a slight slowdown in Moore's Law.

    And about performance: that 3080 is something like 10x faster than your 1080 Ti when performing ray tracing or when calculating neural networks.

    Despite having 2.33 times more transistors, it's not even 2x faster when performing rasterization in old games because
    1) the performance has been increased elsewhere than in rasterization performance,
    2) those old games are also bottlenecked by CPU and memory bandwidth, and
    3) lots of those transistors are used for more cache, to make memory less of a bottleneck.
     
    #34 hkultala, Oct 30, 2020
    Last edited: Oct 30, 2020
  15. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    37
    Location:
    Herwood, Tampere, Finland
    ... so you would have multiple separate clusters of ROPs and long latency between them.

    Guess what happens when you render two overlapping triangles, one on one die and the other on another die, at almost the same time?

    You either have cache coherency between your L3 caches (which always adds a LOT of traffic between your dies, and then you take a considerable performance hit in this situation), or you are not cache-coherent and render this situation incorrectly.

    Or you impose lots of limitations so that they cannot render the same area of the screen (or the same area of a non-screen buffer), and add a lot of overheads and inefficiencies from those limitations.

    A1) In the original post I was replying to, it was claimed that there are cache advantages here. I'm pointing out that in reality it's the opposite: there are cache disadvantages.
    A2) The bigger problem is with writes. The nearest or latest copy of that pixel has to be the one that becomes the final copy of that pixel.

    You cannot just "accept" that you have a wrong color value.

    The point is: there are (manufacturing) technology AND market-situation (WSA) specific REASONS for AMD's Zen 2/Zen 3 MCM ("chiplets"). There are no such reasons for MCM in a GPU generally.

    MCM is not the future. It's the PAST, AND a niche solution for certain individual chips. There is nothing in RDNA 3 that makes it a good candidate for MCM with multiple logic chips, neither for technical nor for marketing reasons.

    No. It seems you do not understand the economics of microchip manufacturing.

    If you want to make a GPU that consumers can afford, you are not limited by the reticle size, and you will be even less limited by the reticle size in the future.

    The reticle size is not decreasing with new mfg processes.

    But the cost per die area is increasing with new mfg processes.

    This means that the cost of a chip that is reticle-limited is increasing.

    Also, power density is increasing. We are getting even more performance out of the same die area while consuming even more power.
    If we want to keep our dies cool so they keep running at high speeds, putting them into the same package is not helping; it's hurting.


    MCMs generally only make sense if you want to make the most powerful enterprise-class system possible inside one package.


    If you want to make a consumer-priced chip that has 10x-100x more performance than a 3090 or 6900, splitting the logic into multiple dies and having lots of communication overhead and more latency will not help you. When you can afford the die area needed for that performance, you can also afford that area inside a single logic die and have none of these problems.
     
    #35 hkultala, Oct 30, 2020
    Last edited: Oct 30, 2020
    sonen likes this.
  16. hkultala

    Regular

    Joined:
    May 22, 2002
    Messages:
    296
    Likes Received:
    37
    Location:
    Herwood, Tampere, Finland
    And all these power/area and cost/area increases just mean MCMs ("chiplets") make even LESS SENSE. Because an MCM with lots of die area is VERY expensive to make and very hard to cool. The amount of chip area that we can effectively cool inside one package, and which consumers can afford, is getting SMALLER. So it's EASIER to make that as a monolithic die.

    This quote is not from Moore's original paper. It was written about 50 years after the original paper.

    Here is the original Moores law paper:

    https://newsroom.intel.com/wp-content/uploads/sites/11/2018/05/moores-law-electronics.pdf


    The definition was changed 45 years ago already; originally Moore was talking about a doubling of transistor density EVERY year.

    So, Moore's Law either was dead already 45 years ago, OR it has slowed down twice, first 45 years ago and now again.

    But it cannot honestly both survive one slowdown and then be declared dead on the second.
     
    #36 hkultala, Oct 30, 2020
    Last edited: Oct 30, 2020
    sonen likes this.
  17. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,326
    Likes Received:
    151
    Location:
    On the path to wisdom
    That's already the case. ROPs don't access arbitrary render-target locations; they're each assigned to RT tiles in a repeating pattern.
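    A minimal sketch of such a repeating tile-to-ROP assignment (an illustrative scheme with assumed tile size, ROP count and hash, not any vendor's actual mapping):

    ```python
    # Each render-target tile is statically owned by one ROP partition,
    # so a given pixel always maps to the same ROP and no two ROPs ever
    # write the same location.

    TILE = 32        # assumed tile size in pixels
    NUM_ROPS = 4     # assumed number of ROP partitions

    def rop_for_pixel(x, y):
        tx, ty = x // TILE, y // TILE
        # Simple interleave over tile coordinates; the *3 offsets
        # successive rows so neighbouring tiles spread across ROPs.
        return (tx + ty * 3) % NUM_ROPS

    # Adjacent tiles land on different ROPs:
    print(rop_for_pixel(0, 0), rop_for_pixel(32, 0), rop_for_pixel(0, 32))
    ```

    The catch tunafish raises below is exactly this mapping: any SIMD lane anywhere in the shader array can produce a pixel for any tile, so some switch has to route its output to the owning ROP.
    
    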
     
    TheAlSpark likes this.
  18. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    611
    Likes Received:
    357
    Yes, and this works because all ROPs are on a single chip and there is essentially a massive crossbar (or some other N-to-M switch) between them and the shader array. Any lane of any SIMD pipe can output a ROP op that can end up at any ROP. Getting this done in a distributed GPU without blowing up the power budget is really hard.

    Unlike hkultala, I strongly feel that a "chiplet GPU" is the future and that it's going to happen. However, this is not something you can handwave away. This is IMHO the only really hard part of making a multi-chip GPU work. Everything else is comparatively easy; this is the part you have to solve to get it to work.

    The solutions I've offered are either waiting for TSMC to get a fancy enough 3D packaging/interconnect tech done, or somehow sorting and binning in the early part of the pipeline.
     
  19. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,280
    Likes Received:
    2,952
    Location:
    Germany
    AMD says that at 4K, across [a variety] of top gaming titles, they are getting a hit rate of about 58%.
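    A simple memory-side-cache model (my own back-of-the-envelope, not AMD's methodology) turns that hit rate into an effective-bandwidth multiplier: only the misses go to DRAM, so DRAM traffic shrinks by a factor of 1/(1-h):

    ```python
    # With hit rate h, a fraction (1 - h) of requests reach DRAM, so the
    # same DRAM bus effectively serves 1 / (1 - h) times the traffic.

    def dram_traffic_multiplier(hit_rate):
        return 1.0 / (1.0 - hit_rate)

    # AMD's quoted ~58% hit rate at 4K:
    print(round(dram_traffic_multiplier(0.58), 2))  # ~2.38x
    ```

    This ignores the Infinity Cache's own bandwidth and latency, so it is only a lower-bound intuition for why a 256-bit bus can feed Navi 21.
    
    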
     
    Alexko and Lightman like this.
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,150
    Likes Received:
    1,651
    Location:
    New York
    Aren't render-target tiles assigned at the rasterizer and shader-array level? Why would a shader array spit out pixels for a tile owned by a ROP in a different array?
     
    CarstenS likes this.