AMD: Navi Speculation, Rumours and Discussion [2017-2018]

I think we finally have some ballpark numbers on the cost savings and die size overhead related to a "chiplet" design that could find its way into Navi.
We had it since HotChips2017.
AMD presented there a paper exactly about cost savings of MCM design.
It's for Epyc, but I think it's a reasonable benchmark for GPUs as well.
No, since MCM for GPUs is very, very tricky to execute.
In lieu of achieving Nvidia-tier efficiency, AMD may be able to brute-force their way to parity (or near parity) by going wider and slower without bloating their total die costs (a rough sketch of that trade-off follows below).
Actually they need a proper HP node, working Vega and tighter binning.
It's not that difficult, should AMD really (as in throwing a lot of money at it) bother with GPUs.
Why make all of those architectural changes to increase clocks if you're going to underclock just one generation later?
Because they are not going to underclock anything, Navi is N7HPC/7LP (HPC).
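
For what it's worth, the "wider and slower" argument boils down to the standard dynamic-power relation (P ≈ C·V²·f). A rough sketch below, where the CU counts, clocks and voltages are made-up illustration points, not real parts:

```python
# Illustrative only: "narrow and fast" vs "wide and slow" at roughly equal
# throughput, using P_dynamic ~ units * V^2 * f. All figures are made up.

def dynamic_power(cu_count, clock_ghz, voltage):
    # Relative dynamic power: switching capacitance scales with CU count.
    return cu_count * voltage ** 2 * clock_ghz

def throughput(cu_count, clock_ghz):
    # Relative ALU throughput: units * clock.
    return cu_count * clock_ghz

print(throughput(64, 1.60), throughput(96, 1.07))                    # ~102 vs ~103: about equal
print(dynamic_power(64, 1.60, 1.05), dynamic_power(96, 1.07, 0.85))  # ~113 vs ~74
# The wider, slower, lower-voltage configuration does the same work for roughly
# a third less dynamic power, at the cost of ~50% more silicon.
```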
EUV can't come soon enough, eh?
EUV itself is hands down the LARGEST possible pain in the ass in CMOS history.
The steppers are slow, masks cost a fuck ton, mask inspection tools suck, ASML only recently announced pellicles for 250W sources (and said sources themselves suck). And lenses. Oh holy shit the lenses.
EUV was in eternal limbo for ~10 years for a very, very good reason.
It fucking sucks.
 
No, since MCM for GPUs is very, very tricky to execute.
Other than needing higher interconnect bandwidth than CPUs, what exactly is "very tricky"?

Almost all mobile GPU designs (Mali, PowerVR, Vivante, probably Adreno too) have been modular for how many years?
 
Almost all mobile GPU designs (Mali, PowerVR, Vivante, probably Adreno too) have been modular for how many years?
Modular in what way versus AMD/NV/Intel desktop/laptop chip designs? Mobile GPUs seem to use repeating clusters of hardware just like traditional GPUs have done for a long time now.
 
For the GPU uarch experts: how hard is it to write drivers for a chiplet-based Navi? Especially in the case where the MC is shared across the four dies of the chiplets.
 
That's the tricky part, along with the packaging required.
Sure, Epyc's/Threadripper's 42GB/s would be insufficient for GPU chiplets, but Navi could use a lot more channels and Infinity Fabric isn't staying stagnant.


For the GPU uarch experts: how hard is it to write drivers for a chiplet-based Navi? Especially in the case where the MC is shared across the four dies of the chiplets.
I'd say the implementation is either completely transparent to the driver or it can't be done.
Each chiplet having its own MC would mean each chiplet would preferably use its own stack/GDDR channel.
 
For the GPU uarch experts: how hard is it to write drivers for a chiplet-based Navi? Especially in the case where the MC is shared across the four dies of the chiplets.
I wouldn't think the memory controllers being spread across chiplets would be much of a problem. The problem would be what happens to the currently massive on-die bandwidth, whose usage can vary depending on the data being processed. The driver writers would have to find a way to minimize chiplet-to-chiplet transfers. That would be the basic thinking, but then you could leave processors idling if you try to keep things on one chiplet.

I don't know the details, but for nvidia, according to this page: https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline a triangle is sent to different rasterizers depending on which on-screen tiles it intersects (see the toy sketch below). For nvidia there is one rasterizer per GPC, and you would think there would be at least one 'GPC' per chiplet. If things stay this way, triangle data would have to be sent to different chiplets... now what happens with tessellation... data amplification. Read the page I linked; it will give you a good idea of how things currently work, and you should be able to infer some things about multi-chip GPUs and their drivers.

edit - I don't know why I said driver writers... should've said designers... I mean I guess it's possible they make it programmable so they can experiment, but what I described is usually hardware.
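
To make the chiplet-to-chiplet traffic concern above concrete, here's a toy sketch of screen-tile ownership. The 16-pixel tiles, four chiplets and checkerboard mapping are all assumptions for illustration, not how NVIDIA's or AMD's hardware actually distributes rasterization work:

```python
# Toy model: which chiplet(s) would have to see a triangle, if screen tiles
# were statically assigned to chiplet-local rasterizers. Assumed numbers only.
TILE = 16          # assumed screen-tile size in pixels
NUM_CHIPLETS = 4   # assumed chiplet count

def owning_chiplets(bbox):
    """bbox = (x0, y0, x1, y1) of a triangle in pixels; returns chiplets touched."""
    x0, y0, x1, y1 = bbox
    owners = set()
    for ty in range(y0 // TILE, y1 // TILE + 1):
        for tx in range(x0 // TILE, x1 // TILE + 1):
            owners.add((tx + ty) % NUM_CHIPLETS)  # assumed checkerboard ownership
    return owners

# A small triangle stays on one chiplet; a large or tessellation-amplified one
# fans out to several, which is the cross-chiplet traffic discussed above.
print(owning_chiplets((3, 3, 10, 12)))     # {0}: one tile, one chiplet
print(owning_chiplets((0, 0, 200, 150)))   # {0, 1, 2, 3}: broadcast to everyone
```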
 
So, technically it is hard to extract maximum performance from an Epyc-style GPU.
Firstly, read my edit. Second, I don't know about drivers, but making a non-SLI multi-chip GPU is first and foremost unknown territory. It will be harder and not as straightforward. But only Nvidia (who is also looking at multi-chip GPUs for the future) and AMD know exactly how complicated it is/will be. There was an Nvidia paper floating around that stated how powerful a multi-chip GPU would be compared to an equally spec'd single-chip one that would be impossible to manufacture. I'll see if I can find it.

edit - here http://research.nvidia.com/publication/2017-06_MCM-GPU:-Multi-Chip-Module-GPUs
 
Sure, Epyc's/Threadripper's 42GB/s would be insufficient for GPU chiplets, but Navi could use a lot more channels and Infinity Fabric isn't staying stagnant.
It would be really interesting if the combined AMD GPU/Intel CPU module meant that AMD gets access to Intel's silicon bridge tech for their other products as well.
 
I think we finally have some ballpark numbers on the cost savings and die size overhead related to a "chiplet" design that could find its way into Navi.

It's for Epyc, but I think it's a reasonable benchmark for GPUs as well.
  • A monolithic design would've saved about 9% in total die size compared to a 4-die chiplet (777mm2 vs 852mm2).
    • Presumably, this is from all of the overhead needed to connect the chips.
  • A monolithic design would've cost almost 70% more (1/0.59-1=0.69).
    • From Nvidia's old paper, we know that a monolithic design will handily beat an "equivalent" chiplet design, but with these kinds of savings, you can afford to undercut the monolithic design by a wide margin.
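For reference, the quoted savings work out as follows (using only the numbers in that post, nothing new):

```python
# The quoted EPYC figures, restated as arithmetic.
mono_area_mm2, mcm_area_mm2 = 777, 852
area_saved = 1 - mono_area_mm2 / mcm_area_mm2   # ~0.088 -> the "about 9%" die-size saving
mono_cost_premium = 1 / 0.59 - 1                # AMD's 0.59x MCM cost factor -> ~0.69 (~70% more)
print(f"monolithic: {area_saved:.1%} less silicon, but ~{mono_cost_premium:.0%} higher cost")
```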
A fair amount of this has come up before and been discussed.
The 9% overhead is for an MCM whose bandwidth is 8x-10x lower than the base level of bandwidth expected for a GPU. Barring improvements in various GPU elements that AMD has not promised, that overhead would be multiplied many times over, to the point of negating the savings with a large amount of extra area.
More of the duplicated hardware on the CPUs is still useful in an MCM context than it would be for a GPU, so the percentage would probably be worse. For big HBM GPUs, with the largest fraction of their area devoted to the GPU proper, it's something like 85% for Fiji to 75% for Vega.
Nvidia's paper on a compute solution using an MCM includes moving hardware that isn't useful when duplicated into its own chip. An EPYC-style solution doesn't do this.

Beyond that, elements that get duplicated across the chiplets may be necessary, but might not contribute significantly to any scaling of performance without changes to the overall system architecture that AMD has not promised (or not promised clearly). The shader arrays are the bulk of the performance scaling, whereas the command processor and other control hardware have significantly less unique work to do, since they deal with more globally visible state and system management. One command processor on its own could potentially interpret a significant fraction of the commands from the CPU, so the others would be performing redundant work or deferring to it constantly.

AMD indicates the interconnect takes about 2 pJ/bit for link power cost: 42 GB/s x 8 bits/byte x 2 pJ/bit x 4 links ≈ 2.7 W (~4 W going with 6 links per MCM). This is for an EPYC link setup that is 8-10x too slow (at a minimum), and it is with a cache hierarchy designed to significantly reduce bandwidth demands, which GCN does not have.

Nvidia has given numbers for a next-gen on-package interconnect that is much more power-efficient than AMD.
AMD has so far promised nothing. A 30-40W interconnect TDP for a chiplet GPU might be noticeable.
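
Spelling out that arithmetic, and the naive scaling behind the tens-of-watts worry (the 8-10x bandwidth multiplier is the same factor cited above; everything here is a rough sketch, not an AMD figure):

```python
# Link power from the quoted EPYC figures, then naively scaled to GPU-class
# bandwidth. Illustrative only.
PJ_PER_BIT = 2e-12               # ~2 pJ/bit quoted for EPYC's on-package links

def link_power_w(gbytes_per_s_per_link, links):
    bits_per_s = gbytes_per_s_per_link * 1e9 * 8
    return bits_per_s * PJ_PER_BIT * links

print(link_power_w(42, 4))        # ~2.7 W: the quoted EPYC estimate
print(link_power_w(42, 6))        # ~4 W with 6 links per MCM
# Scale each link by the 8-10x factor mentioned above and the same 2 pJ/bit
# fabric lands in the tens of watts:
print(link_power_w(42 * 8, 6), link_power_w(42 * 10, 6))   # ~32 W to ~40 W
```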

AMD has given aspirational papers about some future methods of integration that might give some of the improvements needed as a bare minimum, but has not given any roadmap or tangible projections for a product using them in the next several generations. Unfortunately, AMD's promised future integration methods are much more speculative and expensive than what EPYC does.

Other than needing higher interconnect bandwidth than CPUs, what exactly is "very tricky"?
Duplication of hardware is worse than it is for EPYC, and the power cost of the infinity fabric is about as high as can be tolerated already.
The GPU cache hierarchy and on-die pipelines make a number of implicit assumptions about being on the same die that an MCM would reveal as being much more problematic.
Nvidia's own proposal for this goes beyond EPYC by taking redundant hardware out of the GPUs entirely, and revamps their cache hierarchy to have another layer of (snooped) cache per die that isn't directly tied to local memory channels.
GCN's concept of coherence does not extend past one chip, and its pipeline relies on a GDS that would be highly serializing in an MCM. There is metadata (DCC, HiZ, MSAA compression, etc.) that remains incoherent even with Vega, and the ROPs are probably coherent in a way that's even more limited than the way GCN's L2 is trivially coherent (and broken by an MCM).
At a higher level, there's still a strong ordering to primitive processing and output that has some support built into it by hardware that trivially assumes ordering because it's one chip.
There's no transparent solution to resolving how the control logic assumes it has full control over a monolithic chip, as command processors have no communication, coherence, or hierarchy established that doesn't involve something more explicitly system/driver managed like Xfire.

GPU graphics context is big within one chip. Synchronization and pipeline events are painful within one chip. Multi-chip promises to make the costs of each multiply.

The person in charge of AMD's graphics, who probably would have known what AMD's plans were to get around this and would have pushed internal development in that direction for years, stated that he expected developers would take care of this massive problem space.
He was removed from that capacity and then gone entirely, not that I can read that as being an indictment of his plans on this matter necessarily.
 
Nvidia has given numbers for a next-gen on-package interconnect that is much more power-efficient than AMD.

You mean nvidia has given numbers for a next-gen on-package interconnect that exists in a paper or perhaps in a lab, and that is much more power-efficient than a solution AMD has had in a final product sitting on shelves for over half a year, which uses a 2-year-old manufacturing process and is meant for CPUs and not GPUs?

Wow, color me impressed!

/s



AMD indicates the interconnect takes about 2 pJ/bit for link power cost: 42 GB/s x 8 bits/byte x 2 pJ/bit x 4 links ≈ 2.7 W (~4 W going with 6 links per MCM). This is for an EPYC link setup that is 8-10x too slow (at a minimum), and it is with a cache hierarchy designed to significantly reduce bandwidth demands, which GCN does not have.

Nvidia has given numbers for a next-gen on-package interconnect that is much more power-efficient than AMD.
AMD has so far promised nothing. A 30-40W interconnect TDP for a chiplet GPU might be noticeable.


So the logic here is AMD will keep Infinity Fabric's MCM interconnects stagnant regarding power-per-transferred bit?
They'll just copy/paste what they used in an MCM CPU solution that was developed somewhere in 2014-2015 for a new arch launching in 2017, and then apply it directly to a 2018 GPU... because they haven't released a paper saying otherwise?


The person in charge of AMD's graphics, who probably would have known what AMD's plans were to get around this and would have pushed internal development in that direction for years, stated that he expected developers would take care of this massive problem space.

I'm pretty sure Raja's comment was referring to DX12 explicit multi-GPU implementations, in the sense that driver-enabled AFR across multiple GPUs (not multiple-chip single GPUs) has its days numbered. And this had been mentioned for several years before that interview. nVidia's latest Titan V GPU doesn't even have SLI support, meaning every GPU IHV is going that way.


He was removed from that capacity and then gone entirely, not that I can read that as being an indictment of his plans on this matter necessarily.

Raja took a huge upgrade in his position, from head of RTG at AMD (market cap $10B, results just starting to go from red to black) to Chief Architect at Intel (market cap >$200B, tens of billions in revenue, breaking their records YoY), but somehow you're trying to make it sound like a demotion?

And gone entirely?
He was hired to start developing a high-end discrete GPU at Intel. How exactly is this gone entirely?
 
You mean nvidia has given numbers for a next-gen on-package interconnect that exists in a paper or perhaps in a lab, and that is much more power-efficient than a solution AMD has had in a final product sitting on shelves for over half a year, which uses a 2-year-old manufacturing process and is meant for CPUs and not GPUs?
Early disclosures and patents go back to 2013, with at least a physical demonstration of the concept at 28nm.
http://research.nvidia.com/publicat...ngle-ended-short-reach-serial-link-28-nm-cmos
I've seen brief references to it a few times, although the most recent is Nvidia's MCM GPU paper.

I am not actually sure what Nvidia has proposed is sufficient for an EPYC-style solution for a seamlessly operating MCM GPU. What Nvidia has proposed is focused on compute and often needs significant adjustments to the software running, despite how much of the hardware was changed to minimize the impact. Graphics would be less consistent and more difficult to distribute, and Nvidia focused on the compute workloads.
I've said earlier in the thread that AMD's aspirations may be more consistent with something post-Navi or maybe even post-Next Gen.


Within the scope of promises, Nvidia's 2017 paper cited more tangible figures and design points, such as an upcoming hardware node and an interconnect with some history of physical demonstration. It was compared with near-term architectures or reasonable extrapolations one or two generations out, and comes in before 2020.
It has features that match EPYC more closely than what AMD has shown plans for, with AMD counting on interposers and more complicated variations of the tech, whereas Nvidia's method works with organic substrates and with narrower widths than EPYC's MCM links.

In other instances, tech demonstrations for chip interconnects with multi-year lead times have given ball-park figures for what was eventually realized.
Inter-socket communications for interconnects like Hypertransport had technology demonstrators five years ago that reached 11 pJ/bit, which EPYC's xGMI managed to get down to 9 pJ/bit by 2016/2017.
The physical and cost constraints for interconnects seem to enforce a more gradual rollout of technologies.

Perhaps it's been a feint by AMD, but the promised integration and next-generation communication methods have been focused on interposer tech or unspecified link technologies in an Exascale context (post-2020). There are some papers on interposer-based signalling that could significantly undercut Nvidia's method, but the companies/laboratories with papers on that don't seem to be aligned with AMD and EPYC doesn't use interposers.

There are some upcoming 2.5D integration schemes from partners like Amkor, although those conflict with AMD's plans since they are working to avoid interposers rather than creating active ones, and may be too late for Navi.

So the logic here is AMD will keep Infinity Fabric's MCM interconnects stagnant regarding power-per-transferred bit?
They'll just copy/paste what they used in an MCM CPU solution that was developed somewhere in 2014-2015 for a new arch launching in 2017, and then apply it directly to a 2018 GPU... because they haven't released a paper saying otherwise?
Presentation slides for EPYC were cited as a path for Navi. AMD's CPUs are probably targeting PCIe 4.0 in 2020, although that gives a 2x improvement. AMD has so far preferred to keep its memory, package, and inter-socket bandwidths consistent, so its next sockets are coming out in that time frame.
Navi, per AMD's revised roadmap, doesn't have until then.

Vega 20 has slides indicating it gets PCIe 4.0, and if accurate the slides show xGMI comes into play. Those aren't close to bridging the gap between today and Navi's delayed launch in the year following Vega 20.

nVidia's latest Titan V GPU doesn't even have SLI support, meaning every GPU IHV is going that way.
If EPYC's overheads are being cited, then it should be noted that EPYC doesn't go that way.


Raja took a huge upgrade in his position, from head of RTG at AMD (market cap $10B, results just starting to go from red to black) to Chief Architect at Intel (market cap >$200B, tens of billions in revenue, breaking their records YoY), but somehow you're trying to make it sound like a demotion?
Before his "sabbatical" that presaged his leaving AMD, Raja's statement was that he would come back in a different role, with a more focused set of responsibilities.
The people most familiar with Koduri's work and Navi did not plan on allowing him to have as much autonomy going forward, and eventually his employment ended. There are multiple ways to interpret this. Perhaps Koduri had lost the confidence of those internal to AMD, or perhaps Koduri felt what AMD had to offer him and his plans was insufficient.

There is evidence for both:
Koduri was effectively being demoted, and there are rumors of significant clashes between him and Su.
There are statements to the effect that AMD de-emphasized graphics and rumblings to the effect that GPU development had been gutted.
The excuses for Vega's significant teething pains play into both, where one or both parties could not achieve a fully-baked result or they were actively trying to abandon one another.
Not the best way to use one side's tech on the other.


And gone entirely?
He was hired to start developing a high-end discrete GPU at Intel. How exactly is this gone entirely?
Unless Intel buys RTG, he's gone from Navi's development pipeline. Intel didn't hire Navi.

My impression is that both AMD and Koduri's positions and visions appear to be more modest than an EPYC-style Navi.
 
It was a weird departure, sounds like a "gtfo Raja" to me...

"GTFO Raja" because.. AMD recovered a substantial chunk in discrete graphics marketshare despite the laughable R&D budget they had for GPUs during the last 3-4 years since the inception of RTG?
Because Vega cards sold out everywhere for several months despite their prices being inflated?

I hate AMD's handling of Vega's release just as much as the next guy. Non-existent communication, the pricing debacle, forcing Vega 56 down reviewers' throats some 48 hours before ending the review embargo, and even not choosing the "power saving mode" as the default power plan seems completely ridiculous to me.
But the cards aren't exactly sitting on the shelves with no one picking them up. And given the budget that RTG was given, together with the downsides of being stuck with GlobalFoundries' lacking 14nm, I'd say Vega and Polaris represented a spectacular handling of resources.


As far as I can see, Occam's Razor says Intel's management got to meet Raja during Kaby G's development, at the same time that nvidia had taken too much high ground in HPC and Intel's Xeon Phi wasn't getting any traction, so they poached him to start their own high-performance GPUs.

Before that, he was poached by AMD after he spent 3 years making the iGPU by far the largest chunk of die area in Intrinsity's/Apple's SoCs, making them unparalleled in graphics performance compared to anyone else.

And before that, he was poached by Apple right after AMD launched the very successful RV770 line.


Yet somehow people tend to believe he was kicked out of AMD because he did such a terrible job, and then Intel made him Chief Architect of the Core and Visual Computing group.
Because... multi-billion dollar companies like Intel really tend to hire failures for their top management teams, out of.. charity?




If EPYC's overheads are being cited, then it should be noted that EPYC doesn't go that way.
You're citing me when I'm talking about Raja suggesting that DX12 explicit multi-adapter will leave mGPU out of the hands of driver developers. And nvidia going that way because they have been progressively launching discrete cards without SLI support (first the mid-range, now the top range).
What does this have to do with EPYC?


The people most familiar with Koduri's work and Navi did not plan on allowing him to have as much autonomy going forward,
(...)Koduri was effectively being demoted and there are rumors of significant clashes in between him and Su.
Where? Who?
The only source I've seen anywhere about that are some comments from @digitalwanderer and while I do value his insights and opinion, they could be simply hearsay from 3rd-hand information, as he himself stated several times.
 
"GTFO Raja" because.. AMD recovered a substantial chunk in discrete graphics marketshare despite the laughable R&D budget they had for GPUs during the last 3-4 years since the inception of RTG?
Because Vega cards sold out everywhere for several months despite their prices being inflated?
Can the "AMD chronically underfunded GPU development to the point of self-sabotage" excuse be used while simultaneously saying Navi is going to use an interconnect and GPU architecture scaled an order of magnitude beyond what AMD has demonstrated or projected for any project this decade?

My point in referencing Nvidia's interconnect and MGPU paper is as a baseline of the work, the magnitude of the changes involved, and the sort of evidence that goes with actual commitment--with the caveat that it still might not be enough to get graphics to work as consistently as the generally autonomous handling of EPYC.

Yet somehow people tend to believe he was kicked out of AMD because he did such a terrible job, and then Intel made him Chief Architect of the Core and Visual Computing group.
Because... multi-billion dollar companies like Intel really tend to hire failures for their top management teams, out of.. charity?

I've already commented that I appreciated Koduri's ambition and desire to invest fully in graphics, but there are also instances where there were serious shortfalls under his watch and his responsibility.
Either way, my argument goes by the combination of Koduri+AMD not working out, rather than relying on either one being insufficient for a very large engineering change.

Data points related to the process that is supposed to lead to an EPYC form of GPU integration:
AMD: Strangled GPU development (you've just copped to this). AMD's GPU plans promise neither EPYC-type integration nor such integration in the time frame. They don't even mention going beyond 2 chiplets.
Koduri: Indicated a more explicit form of multi-GPU handling as what he wanted. And he's gone.

AMD's GPU efforts have to live with the combination of the two. I think their visions in isolation don't provide much corroboration with the idea that Navi is doing this, and that in combination the messaging was incoherent and not conducive to Navi having this structure.

You're citing me when I'm talking about Raja suggesting that DX12 explicit multi-adapter will leave mGPU out of the hands of driver developers. And nvidia going that way because they have been progressively launching discrete cards without SLI support (first the mid-range, now the top range).
What does this have to do with EPYC?
AMD's slides on EPYC have been brought up more than once in this thread to show how cheaply it can be realized for a CPU architecture already heavily predisposed to support it. If you brought it up in the earlier instances, I didn't go back to check. My interpretation of Koduri's statement is that he saw multi-chip scaling happening by doing more to get explicit developer management. If the observers further out do not get an abstraction, the driver can see the gaps too.

I could cite you more recently, however:

https://forum.beyond3d.com/posts/2014164/

Sure, Epyc's/Threadripper's 42GB/s would be insufficient for GPU chiplets, but Navi could use a lot more channels and Infinity Fabric isn't staying stagnant.

I'd say the implementation is either completely transparent to the driver or it can't be done.
Each chiplet having its own MC would mean each chiplet would preferably use its own stack/GDDR channel.
One, when AMD says chiplet it is a very specific use case, and I do not recall Koduri ever touching on the topic. EPYC uses chips that are fully capable of working as individual products, with the ability to manage themselves and interface with the outside world. Chiplets are not, and they involve 3D integration, active interposers, and some future 2020+ exascale architecture. Their EPYC-style MCM signalling methods do not promise an order of magnitude gain, and even their chiplet projections don't promise a 4-die GPU integration even within a single interposer.

Two, neither AMD's projections nor Koduri promise transparent management like EPYC.

I would like to be pleasantly surprised, but all I'm seeing is the conflation of two different visions to produce something I've not seen either of them say they will do or want to do.

Where? Who?
The rumors were about his clashing with Su and trying to undermine AMD's organizational structure.
The source of the plan to reduce his set of responsibilities is Raja Koduri.
https://www.anandtech.com/show/11836/raja-koduri-sabbatical
"As we enter 2018, I will be shifting my focus more toward architecting and realizing this vision and rebalancing my operational responsibilities."
Or you could argue that it's just corporate boilerplate they made him say, but that doesn't indicate a healthy relationship either.

The official line was that he was taking a step back from the full operational demands of architect and VP, and if it was Koduri's desire to work with less administrative responsibility, he made an odd choice in getting hired as the VP of a consolidating architectural group with, among other things, a from-scratch discrete GPU mandate.
 
I think people in this thread are being obtuse for the sake of it.
n=2 and n=* are two very different things.
Let's assume multiple chips for a SKU from Navi onwards.

Now to me the reasonable assumption is that they would start simple, and simple to me is interconnecting two chips. This is simplest because AMD doesn't have to solve scalability/coherency at a large scale on the first go.
To me the simple progression is to extend the Request/Data crossbar/L2 and the GDS between the two chips using a Si interposer. From my understanding, at that point the two major things that hold state are now shared. No fundamental change to cache coherency or cache hierarchy.
To me sharing is simplest from a system level if these data stores just appear as holistic shares.

So that's the "easy" part. What this all comes down to is the cost to read and write, as an average, across those two structures. This is where using EPYC as a baseline is almost useless. Does anyone even know the Vega crossbar topology? Is it a full mesh? Some sort of butterfly or torus? What does locality look like on average, and how much bandwidth is actually needed on the Si cross-connect? Is it even equal cost right now? And finally, how much extra power per bit, on average, does it take to read/write data over the Si interposer compared to locally?
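
One toy way of framing those questions, with placeholder numbers (none of the energy figures or locality fractions below are measured Vega or interposer values):

```python
# Toy model: average energy per bit of L2/GDS traffic in a 2-chip design as a
# function of how much of that traffic stays on the local die. Placeholder numbers.
E_LOCAL_PJ  = 0.5   # assumed on-die cost per bit across the local crossbar
E_REMOTE_PJ = 1.5   # assumed cost per bit when the access crosses the Si interposer

def avg_pj_per_bit(local_fraction):
    return local_fraction * E_LOCAL_PJ + (1 - local_fraction) * E_REMOTE_PJ

def cross_link_gbs(total_traffic_gbs, local_fraction):
    # Bandwidth the interposer cross-connect actually has to carry.
    return total_traffic_gbs * (1 - local_fraction)

for loc in (0.5, 0.75, 0.9):
    print(loc, avg_pj_per_bit(loc), cross_link_gbs(1000, loc))
# Better locality pulls the average cost toward the on-die case and lets the
# cross-connect be narrower, which is why the topology/locality questions matter.
```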

When you look at these problems, almost all of them exist for scaling regardless of whether it's 1 chip or many. It seems to me that if AMD wants to push core counts/fabric agent counts higher, they have to address these issues anyway.
The other question is how much of the front end you could share; if the GDS is shared and the driver is aware, could you not load-balance across both front ends? At that point very little of the chip is dead silicon.

Thoughts?
 
I think people in this thread are being obtuse for the sake of it.
n=2 and n=* are two very different things.
For the purposes of comparing interconnects, it's more a function of aggregate inter-chip bandwidth. The assumption is point-to-point links between chips. EPYC is a fully-connected solution and provides more, whereas Nvidia's proposal only connects a chip to its two neighbors, horizontally and vertically. How many DRAM stacks are there for two chips, and what are their respective areas?
Going to 4 chips helps give more area to compensate for redundant silicon and to provide a more tangible improvement versus the older node's 600-800mm2 chips.

There are now questions to answer concerning how accesses are handled between the local requestors, remote requestors, and each chip's memory.

To me the simple progression is to extend the Request/Data crossbar/L2 and the GDS between the two chips using a Si interposer. From my understanding, at that point the two major things that hold state are now shared. No fundamental change to cache coherency or cache hierarchy.
There are metadata caches for the graphics domain (various compression methods, hierarchical structures) that are currently incoherent, and buffers/caches for the output of front-end stages or command processors whose relationship with the L2 is unclear. Some may spill to the L2, while others may generally avoid it, like the MBs of vertex parameter cache. There are some queues/buffers used by the shader engines that the Vega ISA hints at allowing shaders to address simultaneously within the chip, and MSG instructions that have global impact but might not store into the GDS.
The tradeoff in redundant work versus taking bandwidth or synchronization across the chip would have to be looked at.

The L1-L2 crossbar is diagrammed in some documents as being inside the GPU domain proper, and not exposed to the infinity fabric. If Vega has something similar to Fury's crossbar (or crossbar-like structure), it's somewhere like 2-3x the HBM's bandwidth. Which bandwidth should the cross-chip interconnect cater to? The assumption so far has been for it to match DRAM bandwidth.
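
Putting rough numbers on that sizing question (Vega 64's roughly 484 GB/s of HBM2 bandwidth is the only real figure here; the 2-3x multiplier is the estimate above, and the 2 pJ/bit is carried over from the EPYC discussion purely for comparison):

```python
# Should a cross-chip link match DRAM bandwidth or the much wider internal
# L1-L2 crossbar? Rough sizing with the figures discussed above.
HBM2_GBS = 484                   # ~Vega 64 aggregate HBM2 bandwidth
CROSSBAR_GBS = 2.5 * HBM2_GBS    # the "2-3x the HBM bandwidth" estimate
PJ_PER_BIT = 2e-12               # EPYC-style link energy, for comparison only

def link_watts(gbs):
    return gbs * 1e9 * 8 * PJ_PER_BIT

print(link_watts(HBM2_GBS))      # ~7.7 W if the link only matches DRAM bandwidth
print(link_watts(CROSSBAR_GBS))  # ~19 W if it has to look like the internal crossbar
```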

Does anyone even know the Vega cross bar topology?
Vega's fabric, outside of the internal L2 crossbar, has been described as a mesh as opposed to Zen's fabric being a crossbar.
 
A 30-40W interconnect TDP for a chiplet GPU might be noticeable.
Well, at least there'd be close to zero chiplet-to-"HBM" power if each chiplet is capped by a stack of memory. The troublesome question is what's in the chiplet at the base of a stack of memory. Just ROPs?

Are the ROPs using a high-enough proportion of bandwidth to mitigate interconnect power entirely?
 