AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Also, the leaked slide with GPUs and server segments positions Navi 10 and 11 as the direct successors to Vega 10 and 11 respectively, at least where the slide is concerned. I don't see any sign of the speculated multi-die approach with Navi from this slide although I can't rule it out either. (Unless I'm missing something, this slide is the only mention of specific codenames and positioning for Navi so far.)
With the chiplet approach, it may be possible to create a solution that wouldn't present itself as a multi-GPU. An interview with Raja Koduri about the future being more about leveraging explicit multi-GPU with DX12 and Vulkan seems to contradict that. In addition to that, the chiplet method would seemingly be a more extreme break with the usual method, since it's potentially fissioning the GPU rather than just making it plural. Just counting on developers to multi-GPU their way around an interposer with two small dies is the sort of half-measure whose results we've seen from RTG so far.
Also, the timelines for Navi and the filing of the interposer patent may make it too new to be rolled into that generation.

There are some elements of the GPU that might make it amenable to a chiplet approach, in that much of a GPU is actually a number of semi-autonomous or mostly independent processors and various bespoke networks. Unlike a tightly integrated CPU core, there are various fault lines in the architecture as seen from the different barriers at an API level, and even from some of the ISA-exposed wait states in GCN where independent or weakly coupled domains are already trading information with comparatively bad latency. A chiplet solution would be something of a return to the days prior to VLSI, when a processor pipeline was made of multiple discrete components, each incapable of performing its function without the substrate's common resources and the other chiplets.

It's such a major jump in complexity versus the mediocre returns of AMD's 2.5D integration thus far that I would expect there to be more progress ahead of it.
I'm open to being pleasantly surprised, although not much has shown up that indicates GCN as we know it is really moving that quickly to adapt to it. One possible fault line in a GPU might be placing the command processor and front ends on one side of a divide, with dispatch handling and wavefront creation on the other, given the somewhat modest requirements for the front end and a general lack of benefit in multiplying them with each die. Another might be GDS and ROP export, where the CU arrays already exist as an endpoint on various buses.
However, one item of concern now that we've seen AMD's multi-die EPYC solution and infinity fabric starting to show itself is how little the current scheme differs from having separate sockets from a latency standpoint. The bandwidths also fall seriously short for some of the possible GPU use cases. GCN's ability to mask presumably short intra-GPU latencies doesn't match up with what infinity fabric delivers for inter-die accesses.
There may be other good reasons for what happened with Vega's wait states for memory access, but the one known area AMD has admitted to adding infinity fabric to is the one place where Vega's ISA quadrupled its latency count, which gives me pause.
That, and the bandwidths, connection pitch, and power efficiency AMD has shown or speculated about so far really don't seem good enough for taking something formerly intra-GPU and exposing it.

Nvidia's multi-die scheme includes its own interconnect for on-package communication with much better power and bandwidth than AMD's infinity fabric links, and even then Nvidia seems to have only gone part of the way AMD's chiplet proposal does. Nvidia does seem to be promising a sort of return to the daughter-die days of G80 and Xenos, with the addition of a die on-package containing logic that wouldn't fit on the main silicon. In this case, it would be hardware and IO that would be wasted if mirrored on all the dies.



Glad to see that I'm not the only one on this.

But even if that's the way Navi will go, I don't see how, say, two 400mm² dies on an interposer could be cheaper than one 750mm² die with a few shader cores disabled. Yield is so easy to recover with a bit of redundancy, especially when the size of an individual core is getting smaller and smaller.

And that's assuming that this 2 die solution would have the same performance as the one die solution, which I doubt as well.
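To put some rough numbers on the yield argument, here's a quick back-of-the-envelope sketch in Python using a simple negative-binomial yield model. The defect density, clustering factor, and the salvage assumption are all made-up illustration values, not anything AMD or a foundry has published.

```python
def yield_nb(area_mm2, d0_per_mm2=0.001, alpha=2.0):
    """Negative-binomial (clustered-defect) die yield model.
    area_mm2: die area, d0: defect density per mm^2 (guess), alpha: clustering factor (guess)."""
    return (1.0 + area_mm2 * d0_per_mm2 / alpha) ** (-alpha)

small, big = 400.0, 750.0
y_small = yield_nb(small)
y_big = yield_nb(big)

print(f"400mm^2 die yield: {y_small:.1%}")
print(f"750mm^2 die yield (fully working dies only): {y_big:.1%}")

# The redundancy point: with a few shader cores disabled, many defective
# 750mm^2 dies can still be sold. Assume half of them are salvageable.
salvage_fraction = 0.5
y_big_salvaged = y_big + (1.0 - y_big) * salvage_fraction
print(f"750mm^2 effective yield with salvage: {y_big_salvaged:.1%}")

# Wafer area consumed per sellable product, ignoring packaging/interposer cost.
cost_two_small = 2 * small / y_small   # both small dies must be good
cost_one_big = big / y_big_salvaged
print(f"wafer area per product: 2x400 = {cost_two_small:.0f} mm^2, "
      f"1x750 = {cost_one_big:.0f} mm^2")
```

With those made-up numbers the salvageable big die actually comes out ahead in wafer area per sellable product, which is exactly the redundancy argument above.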

Perhaps they are hedging against issues they think they might hit with early ~7nm (whatever label those nodes get) or 7nm without EUV, or they have reason to expect it to be that bad for a while?
Nvidia seems to be proposing moving off some of the logic from the multi-die solution that would be replicated unnecessarily. AMD's chiplet solution takes it to the point that a lot of logic can be plugged in or removed as desired. The socket analogy might go further since AMD's chiplet and interposer scheme actually leaves open the possibility for removable or swappable chiplets that have not been permanently adhered to their sites.
The ASSP-like scheme may also allow more logic blocks to be applied across more markets, if the fear isn't that a large die cannot recover yields but that it cannot cover enough segments to get the necessary volume.
Whether two 400mm² dies that can use their area more fully, because a 50mm² daughter die frees up room on each of them, would be sufficient, I wouldn't know. Given the bulking up in cache, interface IO, and other overheads just to make those dies useful again, it seems risky.

Just putting two regular GPUs on an interposer or MCM with the complex integration, raised connectivity stakes, and redundant logic seems like a milquetoast explicit multi-GPU solution--which doesn't seem to go too far from what Koduri promised. It would be nice if AMD demonstrated integration tech and interconnect that would even do that sufficiently.
 
Yields are not the most important reason for a multi-die approach. It's design cost. 7nm needs 2x the man-hours of 16nm. On 16nm AMD was able to build two GPUs per year with their R&D. With the R&D budget now increasing they should be able to do the same on 7nm, but it won't be possible to bring out a full lineup. But with a multi-die approach you can have a full lineup with two chips. I don't think Navi will bring 4-die solutions yet, as the risk would be very high. But a small die, let's call it N11, then 2x N11, and above that N10 and 2x N10 would give you a nice lineup.
 
Yields are not the most important reason for a multi-die approach. It's design cost. 7nm needs 2x the man-hours of 16nm. On 16nm AMD was able to build two GPUs per year with their R&D.
To quibble a bit on that, AMD as a whole put out Polaris 10, 11, PS4 Slim, PS4 Pro, and Xbox One S in 2016. The two slim GPUs were mostly reimplementations at a finer node, but the PS4 Pro especially represents something of a larger architectural investment that went into a GPU that was still binary compatible with Sea Islands.
For 2017, we have Polaris 12, and we'll see Vega and Scorpio (possibly low-level compatible or built upon Sea Islands), and possibly Raven Ridge.

It can certainly be argued that a lot of the resources that went into the custom architectures wouldn't have been available for AMD's mainline IP if it weren't for the customers paying for the NRE, but some of the internal development bandwidth of the company and all those fees paid for using TSMC instead of GF are part of that too.

With the R&D budget now increasing they should be able to do the same on 7nm, but it won't be possible to bring out a full lineup. But with a multi-die approach you can have a full lineup with two chips. I don't think Navi will bring 4-die solutions yet, as the risk would be very high. But a small die, let's call it N11, then 2x N11, and above that N10 and 2x N10 would give you a nice lineup.
But if it's two dies that are made more mediocre for the sake of this "scalability" and Koduri expects developers to work DX12-fu to get explicit multi-GPU to work, does it allow two chips to cover multiple shrunken customer bases?
 
But even if that's the way Navi will go, I don't see how, say, two 400mm² dies on an interposer could be cheaper than one 750mm² die with a few shader cores disabled. Yield is so easy to recover with a bit of redundancy, especially when the size of an individual core is getting smaller and smaller.
If AMD is opting to manufacture a single die and then scale the number of dies to hit different performance/price targets, they're definitely not going with 400mm^2 dies.
Each Zen die with 2*CCX is <200mm^2. It's what AMD uses all the way up to Epyc with 4 of those "glued" together with IF, with very promising results.
If Navi is following the same strategy, it's probably going with a similar size per die.
At 7nm, we could see each Navi die with e.g. 32 NCUs + 128 TMUs + 32 ROPs at ~150mm^2.
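For what it's worth, here's the crude area arithmetic behind that kind of guess, with my own assumptions baked in: Polaris 10 (~232mm², 36 CUs on 14nm) as the baseline, a hypothetical ~2x logic density gain at 7nm, and a guessed share of the die (PHYs, analog) that barely shrinks. None of this comes from AMD.

```python
# Rough, assumption-laden estimate of a hypothetical 32-NCU Navi die at 7nm.
polaris10_area_mm2 = 232.0   # 14nm, 36 CUs, includes the 256-bit GDDR5 PHY
polaris10_cus = 36

density_gain_7nm = 2.0       # assumed logic density gain 14nm -> 7nm (guess)
io_fraction = 0.2            # assumed share of die (PHYs, analog) that barely shrinks

logic_area = polaris10_area_mm2 * (1 - io_fraction) / density_gain_7nm
io_area = polaris10_area_mm2 * io_fraction          # assume IO doesn't scale
scaled_36cu = logic_area + io_area

# Naive linear scaling of the compute portion down to 32 NCUs.
scaled_32cu = io_area + logic_area * (32 / polaris10_cus)
print(f"~36 CU die at 7nm: ~{scaled_36cu:.0f} mm^2")
print(f"~32 NCU die at 7nm: ~{scaled_32cu:.0f} mm^2")
```

That lands in the same ~130-150mm² ballpark, but the result is only as good as the guessed density gain and IO share.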
 
So aren't the likes of Cadence & Synopsys helping in any way with IP design time? I assumed that this is done with more and more tool support over the years.
 
To quibble a bit on that, AMD as a whole put out Polaris 10, 11, PS4 Slim, PS4 Pro, and Xbox One S in 2016. The two slim GPUs were mostly reimplementations at a finer node, but the PS4 Pro especially represents something of a larger architectural investment that went into a GPU that was still binary compatible with Sea Islands.
For 2017, we have Polaris 12, and we'll see Vega and Scorpio (possibly low-level compatible or built upon Sea Islands), and possibly Raven Ridge.

But if it's two dies that are made more mediocre for the sake of this "scalability" and Koduri expects developers to work DX12-fu to get explicit multi-GPU to work, does it allow two chips to cover multiple shrunken customer bases?

This year also brought Ryzen, if we stick to release dates. So in both years we have five chips, which is a good amount. But at 7nm it will be hard to maintain that number.
It won't need explicit multi-GPU. If you go with this approach you want the two dies to behave like one GPU. You need a very fast interconnect for that.

If AMD is opting to manufacture a single die and then scale the number of dies to hit different performance/price targets, they're definitely not going with 400mm^2 dies.
Each Zen die with 2*CCX is <200mm^2. It's what AMD uses all the way up to Epyc with 4 of those "glued" together with IF, with very promising results.
If Navi is following the same strategy, it's probably going with a similar size per die.
At 7nm, we could see each Navi die with e.g. 32 NCUs + 128 TMUs + 32 ROPs at ~150mm^2.

Unfortunately it's harder to make multi-die GPUs because the interconnect bandwidth needs to be way higher. Infinity fabric in EPYC is ~160 GB/s, but that's too low for GPUs. Nvidia's paper is very interesting for that case. For their hypothetical 7nm 4-die GPU with 256 SMs (64 SPs each) they needed 768 GB/s of bandwidth and a special L1.5 cache, and achieved 90% of the speed of a monolithic GPU. So taking 200mm² per die, four dies end up with the speed of a 720mm² monolithic GPU once you take the 90% scaling into account. But you also need to add the interconnect and additional cache. I don't know how big that would be, but high-speed memory interfaces don't seem to be that small. Let's just assume 20mm². Then we're already at 880mm² of combined die size for a GPU with 720mm²-class speed. That doesn't sound so good anymore in terms of manufacturing price.
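Spelling out that arithmetic in a tiny Python sketch (the 20mm² per-die overhead is just the placeholder guess from above, not a figure from the paper):

```python
# Numbers from the discussion above; the 20 mm^2 overhead per die is a guess.
dies = 4
die_area_mm2 = 200.0
overhead_per_die_mm2 = 20.0      # interconnect PHY + extra cache, assumed
scaling_efficiency = 0.90        # Nvidia's reported MCM-GPU scaling vs monolithic

total_silicon = dies * (die_area_mm2 + overhead_per_die_mm2)
equivalent_monolithic = dies * die_area_mm2 * scaling_efficiency

print(f"total silicon: {total_silicon:.0f} mm^2")                               # 880 mm^2
print(f"performs like a monolithic GPU of: {equivalent_monolithic:.0f} mm^2")   # 720 mm^2
print(f"silicon overhead factor: {total_silicon / equivalent_monolithic:.2f}x")
```

So with these assumptions you pay roughly 1.2x the silicon for the same performance, before packaging costs and yield differences enter the picture.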

So aren't the likes of Cadence & Synopsys helping in any way with IP design time? I assumed that this is done with more and more tool support over the years.

Of course you use more tools, otherwise it wouldn't even be possible to design these chips. But costs are skyrocketing anyway.
 
It won't need explicit multi-GPU. If you go with this approach you want the two dies to behave like one GPU. You need a very fast interconnect for that.
I would tend to agree that it would be better if it didn't.
However, the individual interviewed by PCPerspective in 2016 seems to say that's what he's assuming it will, and he might know more than most.
https://www.pcper.com/news/Graphics...past-CrossFire-smaller-GPU-dies-HBM2-and-more

Unfortunately it's harder to make multi-die GPUs because the interconnect bandwidth needs to be way higher. Infinity fabric in EPYC is ~160 GB/s, but that's too low for GPUs. Nvidia's paper is very interesting for that case. For their hypothetical 7nm 4-die GPU with 256 SMs (64 SPs each) they needed 768 GB/s of bandwidth and a special L1.5 cache, and achieved 90% of the speed of a monolithic GPU.
To clarify, it's 42 GB/s per link with EPYC, and 768 GB/s per link with Nvidia, or 3 TB/s aggregate. Nvidia's interconnect is also 4x as power efficient per bit at .5 pJ/bit versus 2 pJ/bit for EPYC's on-package links.
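A quick sanity check on those figures, treating the 3 TB/s aggregate as four of those 768 GB/s links and converting pJ/bit into watts at full utilization (the per-link counts are my reading of the sources, not something either vendor breaks down this way):

```python
# Figures quoted above; per-link counts are my own assumption.
epyc_link_gbps = 42.0          # GB/s per on-package GMI link (EPYC)
nvidia_link_gbps = 768.0       # GB/s per link in Nvidia's MCM-GPU study
nvidia_links = 4               # 4 x 768 GB/s ~= 3 TB/s aggregate

print(f"Nvidia aggregate: {nvidia_links * nvidia_link_gbps / 1000:.1f} TB/s")
print(f"bandwidth ratio per link: {nvidia_link_gbps / epyc_link_gbps:.0f}x")

def link_power_watts(gbytes_per_s, pj_per_bit):
    """Interconnect power if the link runs flat out at the given energy per bit."""
    bits_per_s = gbytes_per_s * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12

# 0.5 pJ/bit (Nvidia on-package) vs 2 pJ/bit (EPYC on-package), as quoted above.
print(f"768 GB/s at 0.5 pJ/bit: {link_power_watts(768, 0.5):.1f} W")
print(f"768 GB/s at 2.0 pJ/bit: {link_power_watts(768, 2.0):.1f} W")
```

The last two lines are the interesting part: at EPYC's quoted energy per bit, a single GPU-class link would burn roughly 12 W just moving data, versus ~3 W at Nvidia's figure.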
 
Perhaps they are hedging against issues they think they might hit with early ~7nm (whatever label those nodes get) or 7nm without EUV, or they have reason to expect it to be that bad for a while?
There was a recent Kanter tweet about 7nm EUV being rough coming out of a semi symposium, so if AMD wasn't hedging before, they are now.

But if it's two dies that are made more mediocre for the sake of this "scalability" and Koduri expects developers to work DX12-fu to get explicit multi-GPU to work, does it allow two chips to cover multiple shrunken customer bases?
Given the time frame of those comments, simply boosting the percentage of async might be the solution. Not necessarily explicit programming for the model. The same way 10 waves were potentially used to mask memory latency, two async thread groups of equivalent size may address the issue.

We're only starting to see titles with significant usage there with current adoption of DX12/Vulkan.

So aren't the likes of Cadence & Synopsys helping in any way with IP design time? I assumed that this is done with more and more tool support over the years.
They are, but you're talking about an increasing number of transistors and complexity that is starting to occupy three dimensions with FinFETs. Forgetting about die size, every additional model multiplies the design and layout cost. Then consider yields and inventory management of all those chips. Zen, mass producing a single chip, binning, and matching performance tiers, would be a prime example. While the result is slower than a monolithic chip, the value AMD is extracting is likely huge.

Another consideration might be future optical interconnects improving both bandwidth and power of the interconnect significantly. I'm expecting AMD to track the PCIE development there.
 
Unfortunately it's harder to make multi-die GPUs because the interconnect bandwidth needs to be way higher. Infinity fabric in EPYC is ~160 GB/s, but that's too low for GPUs. Nvidia's paper is very interesting for that case. For their hypothetical 7nm 4-die GPU with 256 SMs (64 SPs each) they needed 768 GB/s of bandwidth and a special L1.5 cache, and achieved 90% of the speed of a monolithic GPU. So taking 200mm² per die, four dies end up with the speed of a 720mm² monolithic GPU once you take the 90% scaling into account. But you also need to add the interconnect and additional cache. I don't know how big that would be, but high-speed memory interfaces don't seem to be that small. Let's just assume 20mm².

Epyc uses 4 links because it's what AMD deemed necessary for this specific use case. It doesn't mean a future multi-die GPU launching in 2019 in a different process would be limited to the same number of links.
It also doesn't mean Navi will have to use the same IF version that is available today. Hypertransport for example doubled its bandwidth between the 1.1 version in 2002 and the 2.0 in 2004.

Besides, isn't IF more flexible than just the GMI implementation? Isn't Vega using a mesh-like implementation that reaches about 512GB/s?

Then we're already at 880mm² of combined die size for a GPU with 720mm²-class speed. That doesn't sound so good anymore in terms of manufacturing price.
It could very well still be worth it. How much would they save by having all foundries manufacturing a single GPU die and getting their whole process optimization team working on the production of that sole die, instead of splitting up to work on 4+ dies? How much would they save on yields considering this joint effort?
Maybe this 880mm^2 "total" combined GPU costs about as much to manufacture as, and performs as well as, a 600mm^2 monolithic GPU, when all these factors are combined.
 
Fun fact: Intel is touting exactly the same 12x performance advantage over DDR4 with its upcoming Lake Crest AI-PU.
 
Yields are not the most important reason for a multi-die approach. It's design cost. 7nm needs 2x the man-hours of 16nm.
Which part of the design (cost) do you have in mind?

There are many steps for which I don't see any difference (architecture, RTL, verification). Those all happen in the 0/1 realm, process independent.

So is there something that explodes so much that it makes the whole project twice as expensive?

Hard to believe, TBH.

On 16nm AMD was able to build two GPUs per year with their R&D.
I think that was more a matter of planning than directly related to the move towards the 16nm process.

With the R&D budget now increasing they should be able to do the same on 7nm, but it won't be possible to bring out a full lineup. But with a multi-die approach you can have a full lineup with two chips. I don't think Navi will bring 4-die solutions yet, as the risk would be very high. But a small die, let's call it N11, then 2x N11, and above that N10 and 2x N10 would give you a nice lineup.
I'll believe it when I see it.

A multi-die solution for GPU seems reasonable if you've exhausted all other options. IOW: exceeding reticle size.
 
If AMD is opting to manufacture a single die and then scale the number of dies to hit different performance/price targets, they're definitely not going with 400mm^2 dies.
I think that would be music to the ears of competitors who decide to do multiple dies that are tailored to a market segment.

Each Zen die with 2*CCX is <200mm^2. It's what AMD uses all the way up to Epyc with 4 of those "glued" together with IF, with very promising results.
If Navi is following the same strategy, it's probably going with a similar size per die.
At 7nm, we could see each Navi die with e.g. 32 NCUs + 128 TMUs + 32 ROPs at ~150mm^2.
I consider the CPU results largely irrelevant for today's GPU workloads.

CPUs have always been in a space where each core has access to memory that's mostly used by that core only. A high latency, medium BW interface between the cores is surmountable with a bit of smart memory allocation.

GPUs have always worked in a mode where all cores can access all memory. If you want to keep that model, you're going to need an interconnect bus with the same bandwidth as the memory itself, or you'll suffer an efficiency loss.
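To illustrate, here's a tiny sketch with made-up numbers (the HBM2-class bandwidth figure and the uniform-access assumption are mine, purely for illustration): if a two-die GPU stripes memory across both dies and accesses it uniformly, about half of all traffic has to cross the die-to-die link.

```python
# Illustrative only: a hypothetical 2-die GPU with HBM2-class local bandwidth.
dies = 2
local_mem_bw_gbps = 484.0   # e.g. Vega-class HBM2 bandwidth; just an example number

# With memory striped across dies and uniform access, the fraction of requests
# that land on a remote die is (dies - 1) / dies.
remote_fraction = (dies - 1) / dies

# To keep the link from becoming the bottleneck, each die needs roughly this
# much die-to-die bandwidth in each direction.
needed_link_bw = local_mem_bw_gbps * remote_fraction
print(f"remote traffic fraction: {remote_fraction:.0%}")
print(f"per-die link bandwidth needed: ~{needed_link_bw:.0f} GB/s each way")
print(f"an EPYC-style ~42 GB/s link would cover: {42 / needed_link_bw:.0%} of that")
```

Smarter placement and caching can cut the remote fraction, but that's exactly the efficiency-loss trade-off being pointed out.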
 
Which part of the design (cost) do you have in mind?

There are many steps for which I don't see any difference (architecture, RTL, verification). Those all happen in the 0/1 realm, process independent.

So is there something that explodes so much that it makes the whole project twice as expensive?

I'm not deep enough into it to know why, but it seems everything besides architecture is exploding, and verification costs are much higher too:
[Chart: IBS estimates of chip design costs by process node]

http://www.eetimes.com/document.asp?doc_id=1331185&page_number=3
“It will take chip designers about 500 man-years to bring out a mid-range 7nm SoC to production,” Gartner’s Wang said. Therefore, a team of 50 engineers will need 10 years to complete the chip design to tape-out. In comparison, it could take 300 engineer-years to bring out a 10nm device, 200 for 14nm, and 100 for 28nm, according to Gartner.
http://semiengineering.com/10nm-versus-7nm/
Gartner is even saying more than double the man-years for 7nm vs 14nm. OK, they're talking about SoCs, but the exact numbers don't matter. The important thing is that costs are growing dramatically, and I doubt AMD can manage to design that many chips in the future. As costs go up the same will happen to Nvidia as well, but they can probably stay with monolithic chips a bit longer because of their higher R&D budget.
 
I would tend to agree that it would be better if it didn't.
However, the individual interviewed by PCPerspective in 2016 seems to say that's what he's assuming it will, and he might know more than most.
https://www.pcper.com/news/Graphics...past-CrossFire-smaller-GPU-dies-HBM2-and-more
To clarify, it's 42 GB/s per link with EPYC, and 768 GB/s per link with Nvidia, or 3 TB/s aggregate. Nvidia's interconnect is also 4x as power efficient per bit at .5 pJ/bit versus 2 pJ/bit for EPYC's on-package links.

I hope we don't really get what he means, because I only see a chance for that approach if it behaves like one GPU. Waiting for dev support would be awful.
Yes, and it's 4 links per die for EPYC. I'm talking about the interconnect bandwidth per die, and that's not 3 TB/s. At most it's two links for 1.5 TB/s per die, but that's not totally clear to me.

Epyc uses 4 links because it's what AMD deemed necessary for this specific use case. It doesn't mean a future multi-die GPU launching in 2019 in a different process would be limited to the same number of links.
It also doesn't mean Navi will have to use the same IF version that is available today. Hypertransport for example doubled its bandwidth between the 1.1 version in 2002 and the 2.0 in 2004.

Besides, isn't IF more flexible than just the GMI implementation? Isn't Vega using a mesh-like implementation that reaches about 512GB/s?

Of course you can do it, but it's nothing you get for free, and you have to sacrifice die space for it. You can't compare it to Vega's on-die mesh. The problem always comes when you go off-die. Power concerns in particular are big, because it costs much more energy to transfer data off-die.

It could very well still be worth it. How much would they save by having all foundries manufacturing a single GPU die and getting their whole process optimization team working on the production of that sole die, instead of splitting up to work on 4+ dies? How much would they save on yields considering this joint effort?
Maybe this 880mm^2 "total" combined GPU costs about as much to manufacture as, and performs as well as, a 600mm^2 monolithic GPU, when all these factors are combined.

True, maybe that's the case. But I believe the differences won't be so big, and it's more the design cost problem that will lead to such chips.
 
I think that would be music to the ears of competitors who decide to do multiple dies that are tailored to a market segment.
Intel says hi.


GPUs have always worked in a mode where all cores can access all memory.
Not all memory, but regardless no one suggested otherwise.
Even Nvidia is studying ways to make multi-die GPUs. Several smaller chips connected through a very high-bandwidth fabric do seem like a strong suggestion for the future.
 
I'm not deep enough into it to know why, but it seems everything besides architecture is exploding, and verification costs are much higher too
When you look at this graph, they're basically matching the amount of transistors to the design work.

That doesn't work for GPUs: I don't think there is any functional difference between GP102 and GP107 other than the number of units. The incremental design effort for an additional version should be a fraction of a complete design.
 