AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Considering Nvidia is competing with AMD's current 4k-ALU offerings with its own 2.5k-ALU versions, and would be able to leverage the same ALU-level improvements, I guess the above Navi proposal by el etro would not be good enough for the 2018/19 timeframe.
 
You shrink four of these evolved Polaris 10 GPUs and use Infinity Fabric and MCM to tie it all together, like EPYC. Then you have a total of 9216 SPs. Then you use an interposer to match the GPU with 4096-bit (a couple of stacks) 16GB or 32GB of HBM3 ("NexGen Memory"). Then there it is, with a total area not much bigger than Vega 64 with its HBM2, ready to take on Nvidia's best offering on TSMC 7nm.

What makes you go for four 36CU chiplets as opposed to, say, two 64CU chiplets?

I'm concerned about the power consumption from all of the IF lanes necessary to pull that off. It feels simpler to just have two chiplets and a ton of IF lanes back and forth between those two (assuming you can get enough bandwidth in the first place).

Back when Nvidia wrote that paper on MCM-style graphics cards, I believe the conclusion was that the MCM-style only made sense over a traditional monolithic GPU if you were using that MCM technique to make something so big that it wasn't physically possible to pull it off monolithically.

So that generally means you're getting the most bang for buck if you take two of your biggest die and duct tape them together.
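
For scale, here's a quick back-of-the-envelope on the quoted proposal's numbers. The HBM3 per-pin rate is purely an assumption (no HBM3 spec exists yet), and note that a 4096-bit bus implies four 1024-bit stacks rather than "a couple":

[code]
# Back-of-the-envelope totals for the quoted four-chiplet proposal.
chiplets = 4
sps_per_chiplet = 2304                # Polaris 10 shader count
total_sps = chiplets * sps_per_chiplet

hbm_bus_width_bits = 4096             # implies four 1024-bit stacks, not two
assumed_hbm3_gbps_per_pin = 2.4       # guess: modestly above HBM2's ~2.0 Gbps/pin
bandwidth_gb_s = hbm_bus_width_bits * assumed_hbm3_gbps_per_pin / 8

print(total_sps)        # 9216, matching the proposal
print(bandwidth_gb_s)   # ~1229 GB/s under these assumptions
[/code]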

Then you use an interposer to match the GPU with 4096-bit (a couple of stacks) 16GB or 32GB of HBM3 ("NexGen Memory").

Just as a friendly note, the "NexGen" thing was a typo that got corrected in subsequent presentations.

I remember googling "NexGen Memory" when I first saw it, so I can empathize, lol.

[Image: AMD Vega presentation slide with the "NexGen Memory" label]


Then you choose the SoC version of GF 7nm (Power/Frequency tables for both 7LP flavors here: http://btbmarketing.com/iedm/docs/29...ha_Fig%202.jpg). The right choice would be the SHP flavor for such a monolithic GPU, but that's not "scalability".

Do you have a source for the existence of an SoC variant of GloFo's 7LP?

I thought that it was just going to be the high perf 7LP "Leading Performance" (definitely not "Low Power", lol...) version initially.

https://www.anandtech.com/show/1155...nm-plans-three-generations-700-mm-hvm-in-2018

Also, note that Anandtech reported that GloFo was bragging about its die size limit going up. It'd be weird to do that if you didn't expect your marquee customer to use that extra die size headroom (i.e. no tiny 36CU chips):

"GlobalFoundries also expects to increase the maximum die size of 7LP chips to approximately 700 mm², up from the roughly 650 mm² limit for ICs the company is producing today. In fact, when it comes to the maximum die sizes of chips, there are certain tools-related limitations."
 
What makes you go for four 36CU chiplets as opposed to, say, two 64CU chiplets?

One GDDR6 channel per module. This does unfortunately mean that the fabric speed needs to be built for the largest possible configuration, but everything down to the L2 size can be scaled down to a single module.

Epyc has proven that the fabric is fast enough to provide shared L3 across modules, so it should also be fast enough to provide a distributed L2, ROP and memory controller on Navi. With the added benefit of also getting independent memory channels.

Sent from my ONEPLUS A3003 using Tapatalk
 
Isn't it either MCM or interposer but not both?

I think it can be an MCM mounted on an interposer, with the interposer serving to connect the HBM3 memory to the system.




What makes you go for four 36CU chiplets as opposed to, say, two 64CU chiplets?

Because a 64CU chip will be over 100W of GPU power consumption, making it hard to "tame" with the SoC version of the process. It would be Polaris and Vega all over again: uncompetitive on a performance-per-watt basis. Shrinking the 232mm² 14LPP Polaris 10 makes it easier to take a four-die Epyc approach.

Sub-75W Nvidia GPUs (GP107 and GP108) are fabbed on Samsung 14LPP, so why aren't GP106/104/102 fabbed on it?
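
For rough scale, assuming something like the ~2.8x logic-density improvement GloFo has been quoting for 7LP over 14nm (a best-case marketing number; real GPUs won't shrink that cleanly):

[code]
# Hypothetical shrink of Polaris 10 (232 mm² on 14LPP) to GloFo 7LP.
# The ~2.8x density gain is an assumption taken from GloFo marketing;
# analog/IO blocks scale far worse, so treat these as optimistic lower bounds.
polaris10_area_mm2 = 232
assumed_density_gain = 2.8

chiplet_area_mm2 = polaris10_area_mm2 / assumed_density_gain
print(round(chiplet_area_mm2))       # ~83 mm² per chiplet, optimistically
print(round(chiplet_area_mm2 * 4))   # ~331 mm² of GPU logic across a four-die package
[/code]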

Do you have a source for the existence of an SoC variant of GloFo's 7LP?

I thought that it was just going to be the high perf 7LP "Leading Performance" (definitely not "Low Power", lol...) version initially.

https://www.anandtech.com/show/1155...nm-plans-three-generations-700-mm-hvm-in-2018

The marketing material linked in the post says it: the SoC version for AMD and the SHP/HPC version for IBM.

[Image: IEDM 29-5, Narasimha Fig. 2 (power/frequency curves for 14nm and the 7nm SoC/HPC flavors)]
 
Because a 64CU chip will be over 100W of GPU power consumption, making it hard to "tame" with the SoC version of the process. It would be Polaris and Vega all over again: uncompetitive on a performance-per-watt basis. Shrinking the 232mm² 14LPP Polaris 10 makes it easier to take a four-die Epyc approach.

Sub-75W Nvidia GPUs (GP107 and GP108) are fabbed on Samsung 14LPP, so why aren't GP106/104/102 fabbed on it?

That's an interesting point. I forgot about that.

Did we ever get any substantial analysis confirming that was a functional improvement to GP107 & GP108 as opposed to, perhaps, just a measure to relieve TSMC's supply constraints?

The marketing material linked in the post says it: the SoC version for AMD and the SHP/HPC version for IBM.

[Image: IEDM 29-5, Narasimha Fig. 2 (power/frequency curves for 14nm and the 7nm SoC/HPC flavors)]

Yes, I see the image you originally linked, but I'm not familiar with it. Not knowing more, that image might as well be a random graph with the names "14nm", "7nm SoC" and "7nm HPC" next to some lines.

I'm sorry to be suspicious, but do we have a source that officially ties that graph to GloFo's roadmap?

I don't mean to be confrontational, I'm just not familiar with this image and I'd love the opportunity to add it to my collection of knowledge (we all come here to learn, right? :p).

One GDDR6 channel per module. This does unfortunately mean that the fabric speed needs to be built for the largest possible configuration, but everything down to the L2 size can be scaled down to a single module.

Epyc has proven that the fabric is fast enough to provide shared L3 across modules, so it should also be fast enough to provide a distributed L2, ROP and memory controller on Navi. With the added benefit of also getting independent memory channels.

Sent from my ONEPLUS A3003 using Tapatalk

You think Navi's "next gen memory" meant GDDR6? Whew, that'd be brutal on AMD's public image (but I can't deny that it's a plausible possibility).

By "single channel", I'm assuming you mean a 128-bit config that's "half" as wide as Polaris 10's 256-bit config (I might be misinterpreting you).

At 128-bit, you'd need 16 Gbps GDDR6 to equal Polaris 10's 256 GB/s of bandwidth. 16 Gbps is the long term goal for GDDR6; we won't get it for a long time.

And even assuming all of that works, you're still left with 16 GDDR6 chips on a single graphics card, equivalent to a 512-bit config. No way AMD wants a Hawaii-esque repeat.

So I'm not sure that GDDR6 would be the right choice for a large MCM-style GPU with 400-500+ mm² of total die size. That's HBM's arena.
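
For reference, the arithmetic behind those figures (the 16 Gbps number is the published long-term GDDR6 target, and the chip count assumes standard 32-bit GDDR6 devices):

[code]
# Bandwidth math behind the post above.
bus_width_bits = 128
gddr6_gbps_per_pin = 16               # the long-term GDDR6 target data rate
bandwidth_gb_s = bus_width_bits * gddr6_gbps_per_pin / 8
print(bandwidth_gb_s)                 # 256.0 GB/s, i.e. Polaris 10's 256-bit @ 8 Gbps

# Chip count for a full four-module card, assuming standard 32-bit GDDR6 devices.
total_bus_bits = 4 * bus_width_bits   # 512-bit aggregate
print(total_bus_bits // 32)           # 16 chips on one board, Hawaii-style
[/code]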
 
I was actually thinking along the lines of 64 or 128 bit per module, depending on the memory interface used. At 128 bit, that would be a full 512-bit interface for a full 4-module configuration.

"Channels", as this no longer behaves like a single, monolithic 512-bit interface.

Not sure if a split memory controller is actually beneficial for graphics workloads, or if it would cause issues.

"Problem" is what to do if one channel is stalled, due to serving a small buffer which doesn't span multiple controllers. On the positive side, this would actually had been wasteful on a wider interface, due to command overhead vs burst transfer. But being stalled on parts of a larger buffer, which was more or less a single burst with a wider interface, sounds like it would complicated scheduling. For the better or for the worse.

With a 64-bit channel width and GDDR6 (burst length 16), we would still have a burst size of 1 kb, or 2 kb with 128 bit.
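
A quick check on those burst sizes, treating the whole 64- or 128-bit module interface as one logical channel as described above:

[code]
# Minimum burst per access if the whole module interface is one logical channel.
burst_length = 16                     # GDDR6 burst length
for channel_bits in (64, 128):
    burst_bits = channel_bits * burst_length
    print(channel_bits, burst_bits, burst_bits // 8)
# 64  -> 1024 bits (1 kb) = 128 bytes per burst
# 128 -> 2048 bits (2 kb) = 256 bytes per burst
[/code]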
 
Of course not, using an interposer automatically means it's an MCM (but being an MCM doesn't necessarily mean you're using an interposer)
Are you sure? I remember reading somewhere that an MCM was a specific packaging type. Personally I informally call interposer-based modules MCMs as well, but when trying to be accurate I stick to MCM for "actual MCM" modules. I could be wrong though.
 
I just came up with a crazy prediction for Navi: the use of MCM, Infinity Fabric and an interposer to make Navi perform head-to-head on all metrics, including efficiency, with Nvidia's 7nm offerings.

Take Polaris 10 (2304 SPs) and upgrade it to Vega's level of features (DX12.1, HBCC, Primitive Shaders, NCU, etc.). Now you do some work to lift per-CU performance to a new level. Then you choose the SoC version of GF 7nm (Power/Frequency tables for both 7LP flavors here: http://btbmarketing.com/iedm/docs/29...ha_Fig%202.jpg). The right choice would be the SHP flavor for such a monolithic GPU, but that's not "scalability".

You shrink four of these evolved Polaris 10 GPUs and use Infinity Fabric and MCM to tie it all together, like EPYC. Then you have a total of 9216 SPs. Then you use an interposer to match the GPU with 4096-bit (a couple of stacks) 16GB or 32GB of HBM3 ("NexGen Memory"). Then there it is, with a total area not much bigger than Vega 64 with its HBM2, ready to take on Nvidia's best offering on TSMC 7nm.


Crazy idea, but it looks plausible, no? I just wish Navi were true GCN2 and not only GCN1.6. I've had enough of minor incremental updates.

Wouldn't it be better to use the smallest possible shader configurations? The RX 460 has the best performance per teraflop of the current generation. MCM seems like it would be best used to overcome the lost efficiency introduced by feeding an excessive number of shaders, i.e. maximizing the utilization of shaders, rather than just building up massive GPUs, though I guess you can do both.
 
MCM is a TLA for "multi-chip module"; it doesn't say anything about how the multi-chipping is accomplished... whether it is an interposer, or sticking several chips onto a PCB substrate as has usually been the norm. Not sure if chip stacking using wirebonding is considered to qualify under the MCM moniker, but logically it should. :p
 
Are you sure? I remember reading somewhere that an MCM was a specific packaging type. Personally I informally call interposer-based modules MCMs as well, but when trying to be accurate I stick to MCM for "actual MCM" modules. I could be wrong though.
MCM is just a generic term for "multi-chip module", aka many chips on one substrate.
 
Wouldn't it be better to use the smallest possible shader configurations? The RX 460 has the best performance per teraflop of the current generation. MCM seems like it would be best used to overcome the lost efficiency introduced by feeding an excessive number of shaders, i.e. maximizing the utilization of shaders, rather than just building up massive GPUs, though I guess you can do both.


But it would make interposer yields harder due to the many GPUs to attach, as opposed to doing it with bigger GPUs.

MCM and interposers are the way AMD found to overcome the trend of each new process node being more and more mobile-focused.


Yes, I see the image you originally linked, but I'm not familiar with it. Not knowing more, that image might as well be a random graph with the names "14nm", "7nm SoC" and "7nm HPC" next to some lines.

I'm sorry to be suspicious, but do we have a source that officially ties that graph to GloFo's roadmap?

I don't mean to be confrontational, I'm just not familiar with this image and I'd love the opportunity to add it to my collection of knowledge (we all come here to learn, right? :p).

http://btbmarketing.com/iedm/docs/ (go for IEDM 29-5, Fig. 2)
I have to check whether the power/performance curves given for the HPC and SoC flavors of GF 7LP are the marketing firm's own creation or really GloFo's work.
 
Not sure if a split memory controller is actually beneficial for graphics workloads, or if it would cause issues.
Likely more problematic than beneficial for a workload. From a hardware and production standpoint it would be beneficial: basically Epyc in regard to cost and scaling. The workload would essentially need to be NUMA-aware, which is more or less what HBCC currently solves for memory management. Given, say, four chips/channels, each could start rasterization at a different corner as opposed to one tile; the 4-SE arrangement roughly handles that distribution already. That should reasonably eliminate most of the communication between chips, with the exception of any work heavy on global synchronization or atomics. It would still work, just not ideally.
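
Purely as an illustration of the "start rasterization at different corners" idea, here's a toy screen-space partitioning; the tile size and quadrant scheme are made up for the example, not anything AMD has described:

[code]
# Toy illustration: assign screen-space tiles to four chiplets by quadrant so each
# chip rasterizes from "its" corner and mostly hits its own memory channel.
# Tile size and the quadrant scheme are made-up values for the example.
TILE = 64  # pixels per tile edge

def chiplet_for_tile(tx, ty, screen_w, screen_h):
    left = tx * TILE < screen_w // 2
    top = ty * TILE < screen_h // 2
    return (0 if top else 2) + (0 if left else 1)   # chiplet id 0..3

counts = [0, 0, 0, 0]
for ty in range((1080 + TILE - 1) // TILE):
    for tx in range((1920 + TILE - 1) // TILE):
        counts[chiplet_for_tile(tx, ty, 1920, 1080)] += 1
print(counts)   # roughly equal tile counts per chiplet for a 1920x1080 target
[/code]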

Are you sure? I remember reading somewhere that an MCM was a specific packaging type. Personally I informally call interposer-based modules MCMs as well, but when trying to be accurate I stick to MCM for "actual MCM" modules. I could be wrong though.
As mentioned previously, MCM just means multiple chips. Personally I've been drawing a line between on-package and interposer based on the level of interconnection. Multiple interposers could exist on one package: pair a chip with a stack of HBM on an interposer and link those like Epyc. The interposer is the faster solution, but density may be a concern. An Epyc-sized interposer would be an impressive piece of silicon.
 
Navi is supposedly being fabbed by TSMC, not GloFo.

http://www.digitimes.com/news/a20171023PD201.html

AMD's Vega series GPUs are fabricated by GlobalFoundries on 14nm process, but Taiwan Semiconductor Manufacturing Company (TSMC) has won the order from AMD to fabricate its NAVI GPUs using 7nm process technology. As TSMC is also keen on making deployments in advanced packaging technologies, it will continue to maintain coopetition relationships with local OSATs.

This is a surprise to me. We all knew Navi was officially on 7nm, but not any particular 7nm.

[Image: AMD GPU roadmap slide showing Navi on 7nm]


But GloFo seemed like an obvious choice since they already did Polaris & Vega, and GloFo's first-gen 7nm seems strangely well equipped to support a big GPU (the ~700 mm² die size limit quoted earlier).
To me, that spelled "I am preparing to make some big GPUs," but maybe that was off the mark.

Do we have any other info on Navi's fab?

EDIT: Just re-read that DigiTimes article; it doesn't explicitly say that Navi will use 2.5D packaging, but it mentions that technique extensively, in particular by TSMC (but they could be talking about GV100).

TSMC will apply its CoWoS 2.5D packaging technology mainly to high-end GPUs or FPGAs needed for AI (artificial intelligence)-based supercomputer systems.

Can someone remind me, are advanced 2.5D packaging techniques necessary to support memories like HBM with their fancy TSVs in GPUs?

If so, that might suggest that Navi uses something like HBM, as opposed to GDDR6. It's a stretch, I know.
 
DigiTimes has its sources, but it's not just once or twice that they've been wrong. I'd take it with a pinch of salt till confirmed by other sources.
 
Do we have any other info on Navi's fab?
No.
Besides, AMD going to TSMC for GPUs sounds silly considering GloFo is increasing Fab 8 capacity by 20% next year.
Just feels weird.
Feels like IBM-speak.
Wait, it is IBM-speak!
Can someone remind me, are advanced 2.5D packaging techniques necessary to support memories like HBM with their fancy TSVs in GPUs?
Well, it's either interposers or EMIB. HBM interfaces are too wide for an organic substrate.
 
I was actually thinking along the lines of 64 or 128 bit per module, depending on the memory interface used. At 128 bit, that would be a full 512-bit interface for a full 4-module configuration.
Isn't this presuming some new variation of the fabric's chip-to-chip interconnect, given the bandwidth?
EPYC's package bandwidth is ~170 GB/s, and there are interface and architectural limits to how much further EPYC can push its intra-package links. Per AMD, its MCM approach carries a ~10% overhead due to duplicated logic and the controllers/PHY for the links, and the proposed MCM GPU potentially needs 4-8x more link bandwidth than EPYC.
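
To put rough numbers on that, here's the scale of the problem; the GPU-side figure just borrows Vega 64's memory bandwidth as a stand-in and assumes no locality at all, so treat it as an illustrative upper bound:

[code]
# Rough scale of the problem: EPYC's intra-package links vs a four-module GPU.
# The GPU-side numbers are illustrative assumptions, not AMD data.
epyc_package_link_gb_s = 170          # aggregate figure cited above
gpu_total_mem_bw_gb_s = 484           # Vega 64's HBM2 bandwidth as a stand-in target

# With no locality at all, 3 of every 4 accesses from a module would be remote.
worst_case_cross_fabric = gpu_total_mem_bw_gb_s * 3 / 4
print(worst_case_cross_fabric)                                 # ~363 GB/s crossing the fabric
print(4 * epyc_package_link_gb_s, 8 * epyc_package_link_gb_s)  # the 4-8x range: 680 to 1360 GB/s
[/code]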

Not sure if a split memory controller is actually beneficial for graphics workloads, or if it would cause issues.
GPU memory controllers and graphics resources generally already are split.
There's a lot of explicit resource tiling, or internal striping of addresses or buffer formats.
At the hardware level, the L2 and its crossbar tend to be where there's an attempt to bridge that gap; as the LLC for the chip, that section has to. However, GPU L2s manage this with static partitioning of slices with known physical assignment to a channel, which is among the low-cost methods of achieving this when on-die. That low-overhead choice would need to go away, on top of the 10%-overhead EPYC-style links that need to be 4-8x more capable.
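
For anyone unfamiliar with that "static partitioning of slices" point, a minimal sketch of on-die address striping; the stride and slice count are illustrative, not the real GCN mapping:

[code]
# Minimal sketch of static address striping across L2 slices / memory channels.
# Stride and slice count are made-up values, not the actual GCN hash.
NUM_SLICES = 16        # e.g. one slice per channel
STRIDE = 256           # bytes interleaved per slice

def l2_slice_for_address(addr):
    return (addr // STRIDE) % NUM_SLICES

# Consecutive 256-byte regions spread evenly over the slices:
print([l2_slice_for_address(a) for a in range(0, 8 * STRIDE, STRIDE)])   # [0, 1, 2, ..., 7]
# On a monolithic die a remote slice is just across the crossbar; split over
# chiplets, a miss to a remote slice has to cross the package links instead.
[/code]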

MCM is a TLA for "multi-chip module"; it doesn't say anything about how the multi-chipping is accomplished... whether it is an interposer, or sticking several chips onto a PCB substrate as has usually been the norm. Not sure if chip stacking using wirebonding is considered to qualify under the MCM moniker, but logically it should. :p
In this context, the vendors discussing it treat MCM as the conventional implementation of multiple silicon chips on a plane interfacing with an organic or ceramic substrate that handles signal/power distribution and pinout.
The vendors called interposer-based integration 2.5D to distinguish the extra benefits and complexity.
In some ways, from a package's perspective it might appear as if a 2.5D solution is a single-chip module, since there's still a substrate below and the silicon interposer (technically a chip itself) hides the details of the stack.

The mobile or embedded formats that physically place wire-bonded chips above usually get marketed as something like PoP, to highlight the differences in the Z dimension and in the properties of the chips and their connection to the substrate. It's also useful for allowing meaningful discussion about how it differs from standard methods. MCM may be generic in the dictionary sense of its individual words, but it gets used as shorthand for what was here first and has already built up a body of usage and technique.
 
Navi is supposedly being fabbed by TSMC, not GloFo.

http://www.digitimes.com/news/a20171023PD201.html

For me, this is a goodbye to the "scalability" thing and a return to monolithic GPUs. TSMC may have a better jack-of-all-trades (fabbing everything from phones to HPC chips) with its 7nm process. AMD may have prototyped both versions of Navi, multi-die MCM and monolithic, at GloFo and come to the conclusion that they would go nowhere using any flavor of GloFo's process. And TSMC 7nm+ is ready for production in 2H 2018 according to the latest roadmaps, so that's a decision that makes everyone but GloFo happy. But Ryzen can still be fabbed on GloFo 7SoC; it would still reach huge clock speeds (3.7*1.3 = 4.8 GHz all-core boost clock and 4*1.3 = 5.2 GHz single-core turbo for a theoretical Ryzen 1 8-core fabbed on the process) at lower power consumption, according to the graph...
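
Just checking the arithmetic in that last bit; the 1.3x factor is read off a marketing graph, so treat it as a claim rather than a measurement:

[code]
# The frequency claim from the post, taken at face value: roughly +30% at
# iso-power for 7LP vs 14LPP, per the marketing graph (not a measurement).
claimed_uplift = 1.3
ryzen1_all_core_boost_ghz = 3.7
ryzen1_single_core_boost_ghz = 4.0
print(round(ryzen1_all_core_boost_ghz * claimed_uplift, 2))     # ~4.81 GHz all-core
print(round(ryzen1_single_core_boost_ghz * claimed_uplift, 2))  # 5.2 GHz single-core
[/code]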
 
I think we finally have some ballpark numbers on the cost savings and die size overhead related to a "chiplet" design that could find its way into Navi.

[Image: IEDM 2017 chart comparing EPYC's four-die MCM with a hypothetical monolithic die (die area and relative cost)]


https://fuse.wikichip.org/news/523/iedm-2017-amds-grand-vision-for-the-future-of-hpc/4/

It's for Epyc, but I think it's a reasonable benchmark for GPUs as well.
  • A monolithic design would've saved about 9% in total die size compared to the 4-die chiplet (777mm² vs 852mm²).
    • Presumably, this is from all of the overhead needed to connect the chips.
  • A monolithic design would've cost almost 70% more (1/0.59 - 1 = 0.69); the quick arithmetic is sketched after this list.
    • From Nvidia's old paper, we know that a monolithic design will beat an "equivalent" chiplet design, but with these kinds of savings you can afford to underprice the monolithic design by a wide margin.
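
Plugging those numbers in (the 0.59 relative cost and the die areas are read off the chart; everything else follows from them):

[code]
# Redoing the slide's arithmetic: EPYC's four-die MCM vs a hypothetical monolithic die.
mcm_total_area_mm2 = 852        # four ~213 mm² dies
monolithic_area_mm2 = 777

area_saving = 1 - monolithic_area_mm2 / mcm_total_area_mm2
print(round(area_saving * 100))         # ~9% less total silicon if built monolithically

mcm_relative_cost = 0.59                # MCM cost as a fraction of monolithic, per the chart
monolithic_premium = 1 / mcm_relative_cost - 1
print(round(monolithic_premium * 100))  # ~69% more expensive to go monolithic
[/code]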

Short of achieving Nvidia-tier efficiency, AMD may be able to brute-force its way to parity (or near parity) by going wider and slower without bloating its total die costs.

Then again, according to an old Anandtech article, "the single largest consumer of the additional 3.9B transistors [of Vega 10] was spent on designing the chip to clock much higher than Fiji." Why make all of those architectural changes to increase clocks if you're going to underclock just one generation later?

Also, just for proper credit, I found the above-linked IEDM 2017 article on r/hardware.

EDIT: Thanks to iMacmatician, I noticed that AMD might not really have much of a choice about pursuing a chiplet design. Initial 7nm will increase the cost of a 250mm² die tremendously.

[Image: chart of die cost scaling across process nodes, showing the steep increase at 7nm]


EUV can't come soon enough, eh?
 