AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Discussion in 'Architecture and Products' started by Jawed, Mar 23, 2016.

Thread Status:
Not open for further replies.
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,348
    Likes Received:
    3,879
    Location:
    Well within 3d
    With the chiplet approach, it may be possible to create a solution that wouldn't present itself as a multi-GPU. An interview with Raja Koduri about the future being more about leveraging explicit multi-GPU with DX12 and Vulkan seems to contradict that. In addition to that, the chiplet method would seemingly be a more extreme break with the usual method, since it's potentially fissioning the GPU rather than just making it plural. Just counting on developers to multi-GPU their way around an interposer with two small dies is the sort of half-measure whose results we've seen from RTG so far.
    Also, the timelines for Navi and the filing of the interposer patent may make it too new to be rolled into that generation.

    There are some elements of the GPU that might make it amenable to a chiplet approach, in that much of a GPU is actually a number of semi-autonomous or mostly independent processors and various bespoke networks. Unlike a tightly integrated CPU core, there are various fault lines in the architecture as seen from the different barriers at an API level, and even from some of the ISA-exposed wait states in GCN where independent or weakly coupled domains are trading information with comparatively bad latency already. A chiplet solution would be something of a return to the days prior to VLSI when a processor pipeline would be made of multiple discrete components incapable of performing a function outside of the substrate's common resources and the other chiplets.

    It's such a major jump in complexity versus the mediocre returns of AMD's 2.5D integration thus far that I would expect there to be more progress ahead of it.
    I'm open to being pleasantly surprised, although not much has shown up that indicates GCN as we know it is really moving that quickly to adapt to it. Some of the possible fault lines in a GPU might be moving the command processor and front ends to one side of a divide from dispatch handling and wavefront creation, given the somewhat modest requirements for the front end and a general lack of benefit in multiplying them with each die. Another might be GDS and ROP export, where the CU arrays already exist as an endpoint on various buses.
    However, one item of concern now that we've seen AMD's multi-die EPYC solution and Infinity Fabric starting to show itself is how little the current scheme differs from having separate sockets from a latency standpoint. The bandwidths also fall seriously short for some of the possible GPU use cases. GCN's ability to mask presumably short intra-GPU latencies doesn't match up with what Infinity Fabric does for intra-die accesses.
    There may be other good reasons for what happened with Vega's wait states for memory access, but the one known area AMD has admitted to adding infinity fabric to is the one place where Vega's ISA quadrupled its latency count, which gives me pause.
    That, and the bandwidths, connection pitch, and power efficiency AMD has shown or speculated about so far really don't seem good enough for taking something formerly intra-GPU and exposing it.

    Nvidia's multi-die scheme includes its own interconnect for on-package communication with much better power and bandwidth than AMD's infinity fabric links, and even then Nvidia seems to have only gone part of the way AMD's chiplet proposal does. Nvidia does seem to be promising a sort of return to the daughter-die days of G80 and Xenos, with the addition of a die on-package containing logic that wouldn't fit on the main silicon. In this case, it would be hardware and IO that would be wasted if mirrored on all the dies.



    Perhaps they are hedging against issues they think they might hit with early ~7nm (whatever label those nodes get) or 7nm without EUV, or they have reason to expect it to be that bad for a while?
    Nvidia seems to be proposing moving off the main dies some of the logic that would otherwise be replicated unnecessarily in a multi-die solution. AMD's chiplet proposal takes it to the point that a lot of logic can be plugged in or removed as desired. The socket analogy might go further, since AMD's chiplet and interposer scheme actually leaves open the possibility of removable or swappable chiplets that have not been permanently adhered to their sites.
    The ASSP-like scheme may also allow more logic blocks to be applied across more markets, if the fear isn't that a large die cannot recover yields but that it cannot cover enough segments to get the necessary volume.
    Whether two 400mm² dies, each able to use its area more fully because a 50mm² daughter die takes on the shared logic, would be sufficient, I wouldn't know. Given the bulking up in cache, interface IO, and other overheads just to make those dies useful on their own, it seems risky.

    Just putting two regular GPUs on an interposer or MCM with the complex integration, raised connectivity stakes, and redundant logic seems like a milquetoast explicit multi-GPU solution--which doesn't seem to go too far from what Koduri promised. It would be nice if AMD demonstrated integration tech and interconnect that would even do that sufficiently.
     
    iMacmatician likes this.
  2. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    124
    Likes Received:
    145
    Yields are not the most important reason for a multi-die approach; it's design cost. 7nm needs roughly 2x the man-hours of 16nm. On 16nm, AMD was able to build two GPUs per year with their R&D. With R&D now increasing, they should be able to do the same on 7nm, but it won't be possible to bring out a full lineup. With a multi-die approach, though, you can cover a full lineup with two chips. I don't think Navi will bring 4-die solutions yet, as the risk would be very high. But a small die, let's call it N11, then 2x N11, and above that N10 and 2x N10, gives you a nice lineup (sketched below).
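    A minimal sketch of how such a two-die lineup could be enumerated; the N10/N11 names are the hypothetical ones above and the CU counts are purely assumed for illustration, not announced configurations:

    ```python
    # Hypothetical two-die lineup; die names come from the post above,
    # CU counts are assumed purely for illustration.
    dies = {"N11": 24, "N10": 44}

    lineup = [(f"{n}x {name}", n * cus)
              for name, cus in dies.items()
              for n in (1, 2)]

    for sku, total_cus in sorted(lineup, key=lambda x: x[1]):
        print(f"{sku:7s} -> {total_cus} CUs total")
    ```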
     
  3. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,348
    Likes Received:
    3,879
    Location:
    Well within 3d
    To quibble a bit on that, AMD as a whole put out Polaris 10, 11, PS4 Slim, PS4 Pro, and Xbox One S in 2016. The two slim GPUs were mostly reimplementations at a finer node, but the PS4 Pro especially represents something of a larger architectural investment that went into a GPU that was still binary compatible with Sea Islands.
    For 2017, we have Polaris 12, and we'll see Vega and Scorpio (possibly low-level compatible or built upon Sea Islands), and possibly Raven Ridge.

    It can certainly be argued that a lot of the resources that went into the custom architectures wouldn't have been available for AMD's mainline IP if it weren't for the customers paying for the NRE, but some of the internal development bandwidth of the company and all those fees paid for using TSMC instead of GF are part of that too.

    But if it's two dies that are made more mediocre for the sake of this "scalability" and Koduri expects developers to work DX12-fu to get explicit multi-GPU to work, does it allow two chips to cover multiple shrunken customer bases?
     
  4. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,842
    Likes Received:
    5,415
    If AMD is opting to manufacture a single die and then scale the number of dies to hit different performance/price targets, they're definitely not going with 400mm² dies.
    Each Zen die with 2x CCX is <200mm². It's what AMD uses all the way up to Epyc, with four of those "glued" together with IF, with very promising results.
    If Navi is following the same strategy, it's probably going with a similar size per die.
    At 7nm, we could see each Navi die with e.g. 32 NCUs + 128 TMUs + 32 ROPs at ~150mm².
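    As a rough sanity check on that ballpark, one can scale a known 14nm die by an assumed 14nm-to-7nm area factor. The Polaris 10 figures are public; the 0.5 scaling factor is purely an assumption, not a foundry number:

    ```python
    # Back-of-envelope area estimate for a hypothetical 32-NCU die at 7nm.
    polaris10_area_mm2 = 232      # RX 480 die: 36 CUs at 14nm (public figure)
    polaris10_cus = 36
    assumed_area_scaling = 0.5    # assumed 14nm -> 7nm area factor

    target_ncus = 32
    est_area = polaris10_area_mm2 * assumed_area_scaling * (target_ncus / polaris10_cus)
    print(f"~{est_area:.0f} mm^2 before any extra fabric links or a wider TMU ratio")
    # ~103 mm^2, which leaves some headroom relative to the ~150 mm^2 guess above.
    ```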
     
    Anarchist4000 likes this.
  5. entity279

    Veteran Regular Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,259
    Likes Received:
    441
    Location:
    Romania
    So aren't the likes of Cadence & Synopsys helping in any way with IP design time? I assumed this was being done with more and more tool support over the years.
     
  6. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    124
    Likes Received:
    145
    This year also had Ryzen, if we stick to release dates. So in both years we have five chips, which is a good amount. But at 7nm it will be hard to maintain that number.
    It won't need explicit multi-GPU. If you take this approach, you want the two dies to behave like one GPU. You need a very fast interconnect for that.

    Unfortunately it's harder to make multi-die GPUs, because the interconnect bandwidth needs to be way higher. Infinity Fabric in Epyc is ~160 GB/s, but that's too low for GPUs. Nvidia's paper is very interesting for that case: for their hypothetical 7nm 4-die GPU with 256 SMs (64 SPs each) they needed 768 GB/s of interconnect bandwidth and a special L1.5 cache, and achieved 90% of the speed of a monolithic GPU. So taking a 200mm² die size, you end up with the speed of a 720mm² monolithic GPU once you account for the 90% scaling. But you also need to add the interconnect and the additional cache. I don't know how big those would be, but high-speed memory interfaces don't seem to be small, so let's just assume 20mm². Then we're already at 880mm² of combined die size for a GPU with 720mm²-class speed, which doesn't sound so good anymore in terms of manufacturing price (the arithmetic is spelled out below).
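    The arithmetic above, spelled out: the 90% scaling figure is the one quoted from Nvidia's paper, while the 20 mm² per-die overhead is the assumption made in this post.

    ```python
    # Rough silicon-vs-performance arithmetic for a hypothetical 4-die GPU.
    dies = 4
    die_area_mm2 = 200        # "useful" GPU logic per die
    scaling = 0.90            # multi-die efficiency vs. a monolithic GPU (Nvidia paper)
    overhead_mm2 = 20         # assumed extra interconnect + L1.5 cache per die

    equiv_monolithic = dies * die_area_mm2 * scaling        # performance of a 720 mm^2 GPU
    total_silicon = dies * (die_area_mm2 + overhead_mm2)    # 880 mm^2 actually manufactured
    print(equiv_monolithic, total_silicon)                  # 720.0 880
    ```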

    Of course you use more tools, otherwise it wouldn't even be possible to do the design. But costs are skyrocketing anyway.
     
  7. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,348
    Likes Received:
    3,879
    Location:
    Well within 3d
    I would tend to agree that it would be better if it didn't.
    However, the individual interviewed by PCPerspective in 2016 seems to say that's what he assumes it will do, and he might know more than most.
    https://www.pcper.com/news/Graphics...past-CrossFire-smaller-GPU-dies-HBM2-and-more

    To clarify, it's 42 GB/s per link with EPYC, and 768 GB/s per link with Nvidia, or 3 TB/s aggregate. Nvidia's interconnect is also 4x as power-efficient per bit, at 0.5 pJ/bit versus 2 pJ/bit for EPYC's on-package links.
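    To put those figures in perspective, here is a quick conversion from bandwidth and energy-per-bit to sustained link power (a sketch only; it ignores idle power, encoding overhead, and so on):

    ```python
    # Convert link bandwidth (GB/s) and energy per bit (pJ/bit) into watts.
    def link_power_watts(gbytes_per_s, pj_per_bit):
        bits_per_s = gbytes_per_s * 1e9 * 8
        return bits_per_s * pj_per_bit * 1e-12

    print(link_power_watts(42, 2.0))    # EPYC on-package link: ~0.67 W
    print(link_power_watts(768, 0.5))   # Nvidia's proposed link: ~3.07 W per link
    ```

    So despite the 4x efficiency advantage per bit, the much higher bandwidth still means more absolute power spent per link.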
     
  8. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    There was a recent Kanter tweet, coming out of a semiconductor symposium, about 7nm EUV being rough, so if AMD wasn't hedging before, they are now.

    Given the time frame of those comments, simply boosting the percentage of async work might be the solution, not necessarily explicit programming for the model. The same way ten waves were potentially used to mask memory latency, two async thread groups of equivalent size may address the issue (rough arithmetic below).
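    Rough latency-hiding arithmetic behind that idea; all cycle counts here are illustrative assumptions, not measured GCN figures:

    ```python
    import math

    # If each wave does `busy` cycles of useful work between memory requests
    # that take `latency` cycles to return, you need roughly latency/busy
    # waves in flight to keep the SIMD occupied.
    def waves_needed(mem_latency_cycles, busy_cycles_per_wave):
        return math.ceil(mem_latency_cycles / busy_cycles_per_wave)

    print(waves_needed(400, 40))   # ~10 waves, the figure mentioned above
    print(waves_needed(800, 40))   # doubled latency -> ~20 waves, i.e. 2x the groups
    ```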

    We're only starting to see titles with significant usage there with current adoption of DX12/Vulkan.

    They are, but you're talking about an increasing number of transistors, and complexity that is starting to occupy three dimensions with FinFET. Forgetting about die size, more models will cost that many times more to design and lay out. Then consider the yields and inventory management of all those chips. Zen, with its mass production of a single chip, binning, and matching to performance tiers, would be a prime example. While slower than monolithic chips, the value AMD is extracting is likely huge.

    Another consideration might be future optical interconnects improving both the bandwidth and power of the interconnect significantly. I'm expecting AMD to track PCIe development there.
     
  9. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,842
    Likes Received:
    5,415
    Epyc uses 4 links because that's what AMD deemed necessary for this specific use case. It doesn't mean a future multi-die GPU launching in 2019 on a different process would be limited to the same number of links.
    It also doesn't mean Navi will have to use the same IF version that is available today. HyperTransport, for example, doubled its bandwidth between version 1.1 in 2002 and version 2.0 in 2004.

    Besides, isn't IF more flexible than just the GMI implementation? Isn't Vega using a mesh-like implementation that reaches about 512GB/s?

    It could very well still be worth it. How much would they save by having all foundries manufacture a single GPU die and getting their whole process optimization team working on the production of that sole die, instead of splitting up to work on 4+ dies? How much would they save on yields thanks to that joint effort?
    Maybe this 880mm² "total" combined GPU costs about as much to manufacture, and performs about as well, as a 600mm² monolithic GPU, when all these factors are combined (a toy yield comparison follows below).
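    A toy illustration of the yield side of that argument, using a simple Poisson defect model; the defect density is an assumed value, not a foundry number:

    ```python
    import math

    # Yield of a die under a Poisson defect model: Y = exp(-defect_density * area).
    def poisson_yield(area_mm2, defects_per_mm2):
        return math.exp(-area_mm2 * defects_per_mm2)

    d0 = 0.001   # assumed 0.1 defects per cm^2
    for area in (600, 220):
        print(f"{area} mm^2 die: {poisson_yield(area, d0):.0%} yield")
    # ~55% for the 600 mm^2 monolithic die vs ~80% for each 220 mm^2 die,
    # which is part of what could offset the extra total silicon.
    ```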
     
  10. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
  11. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,842
    Likes Received:
    5,415
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,915
    Likes Received:
    2,237
    Location:
    Germany
    Fun fact: Intel is touting exactly the same 12x performance advantage over DDR4 with its upcoming Lake Crest AI-PU.
     
  13. Cat Merc

    Newcomer

    Joined:
    May 14, 2017
    Messages:
    124
    Likes Received:
    108
    The controller can; the stacks need to improve for that.
     
    ImSpartacus likes this.
  14. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,348
    Likes Received:
    3,879
    Location:
    Well within 3d
  15. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    Which part of the design (cost) do you have in mind?

    There are many steps for which I don't see any difference (architecture, RTL, verification). Those all happen in the 0/1 realm, process independent.

    So is there something that explodes so much that it makes the whole project twice as expensive?

    Hard to believe, TBH.

    I think that was more a matter of planning than directly related to the move towards the 16nm process.

    I'll believe it when I see it.

    A multi-die solution for a GPU seems reasonable if you've exhausted all other options. IOW: exceeding the reticle size.
     
  16. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    I think that would be music to the ears of competitors who decide to do multiple dies that are tailored to a market segment.

    I consider the CPU results largely irrelevant for today's GPU workloads.

    CPUs have always been in a space where each core has access to memory that's mostly used by that core only. A high-latency, medium-bandwidth interface between the cores is surmountable with a bit of smart memory allocation.

    GPUs have always worked in a mode where all cores can access all memory. If you want to keep that model, you're going to need an interconnect bus with the same bandwidth as the memory bandwidth itself, or you'll suffer an efficiency loss (a simple model of that requirement is sketched below).
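    A simple model of that requirement, assuming accesses are uniformly distributed across the dies; the locality knob stands in for smarter placement or caching:

    ```python
    # With N dies and uniform access, (N-1)/N of memory traffic crosses the fabric,
    # so cross-die bandwidth has to approach total memory bandwidth.
    def cross_die_bw_gbs(total_mem_bw_gbs, n_dies, locality=0.0):
        remote_fraction = (1.0 - locality) * (n_dies - 1) / n_dies
        return total_mem_bw_gbs * remote_fraction

    print(cross_die_bw_gbs(1000, 4))                 # ~750 GB/s of fabric traffic
    print(cross_die_bw_gbs(1000, 4, locality=0.8))   # ~150 GB/s if 80% stays local
    ```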
     
  17. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    124
    Likes Received:
    145
    I'm not deep enough into it to know why, but it seems everything besides architecture is exploding, and verification costs are also much higher:
    [chart: design cost by process node]
    http://www.eetimes.com/document.asp?doc_id=1331185&page_number=3

    http://semiengineering.com/10nm-versus-7nm/

    Gartner is even saying it's more than double the man-years for 7nm vs 14nm. OK, they're talking about SoCs, but the exact numbers don't matter much. What's important is that costs are growing enormously, and I doubt AMD can manage to design so many chips in the future. As costs go up, the same will happen to Nvidia as well, but they can probably stick with monolithic chips a bit longer because of their higher R&D.
     
  18. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    124
    Likes Received:
    145
    I hope we don't really get what he means, because I only see a chance for that approach if it behaves like one GPU. Waiting for dev support would be awful.
    Yes, and it's 4 links per die for Epyc. I'm talking about the interconnect bandwidth per die, and that's not 3 TB/s; at most it's two links at 1.5 TB/s, but that's not totally clear to me.

    Of course you can do it, but it's nothing you get for free, and you have to sacrifice die space for it. You can't compare it to Vega's on-die mesh; the problem always comes when you go off-die. Power in particular is a big concern, because it costs much more energy to transfer data off-die.

    True, maybe that's the case. But I believe the differences won't be so big, and it's more the design-cost problem that will lead to such chips.
     
  19. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,842
    Likes Received:
    5,415
    Intel says hi.


    Not all memory, but regardless, no one suggested otherwise.
    Even Nvidia is studying ways to make multi-die GPUs. Several smaller chips connected through a very high-bandwidth fabric does seem like a strong suggestion for the future.
     
    #79 ToTTenTranz, Jul 29, 2017
    Last edited: Jul 29, 2017
  20. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    When you look at this graph, they're basically matching the number of transistors to the design work.

    That doesn't work for GPUs: I don't think there is any functional difference between GP102 and GP107 other than the number of units. The incremental design effort for an additional variant should be a fraction of a complete design.
     