AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Why is everyone so convinced Navi is a multi chiplet solution?
Everything else AMD is releasing is heading that way, and research from AMD and Nvidia supports the idea. GPUs scale far more easily than CPUs.

While I agree that binned rasterisation is a task that would be perfect for the base die of a PIM module ...


... vertex data is spread across all memory channels. There's no way to avoid having communication amongst PIMs in this case. And, to be frank, vertex data (pre-tessellation) is not a huge bandwidth monster.
Just treat each stack/PIM as an independent cache and duplicate vertex data with a paging mechanism from system memory. Same idea as HBCC where only ~3% of the frame changes each iteration. Any modifications can be brute forced from there with heavy frustum culling. Something which primitive shaders have been suggested to be good at.
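A minimal sketch of that per-stack duplication idea, assuming an HBCC-style 64 KiB page granularity (the names and numbers here are placeholders of my own, not anything AMD has described):

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>

// Hypothetical per-stack residency tracking: each PIM/stack keeps its own copy of
// the vertex pages it touches, paged in from system memory on demand. Duplication
// across stacks is allowed by design, so no cross-stack lookup is ever needed.
struct ChipletCache {
    std::unordered_set<uint64_t> residentPages;  // pages currently held in this stack
    uint64_t faults = 0;                         // pages fetched from system memory
};

constexpr uint64_t kPageSize = 64 * 1024;        // assumed HBCC-like page granularity

// Touch a vertex-buffer address on one chiplet; page it in if it isn't local yet.
void touchVertexData(ChipletCache& c, uint64_t address) {
    if (c.residentPages.insert(address / kPageSize).second)  // true only on a miss
        ++c.faults;
}

int main() {
    ChipletCache chiplet;
    // Frame N touches a 32 MiB working set; frame N+1 touches almost the same set,
    // so only the small changed portion generates new paging traffic (~3% here).
    for (uint64_t a = 0; a < 32u * 1024 * 1024; a += 256) touchVertexData(chiplet, a);
    const uint64_t frameN = chiplet.faults;
    for (uint64_t a = 1u * 1024 * 1024; a < 33u * 1024 * 1024; a += 256) touchVertexData(chiplet, a);
    std::printf("pages faulted in frame N: %llu, new pages in frame N+1: %llu\n",
                (unsigned long long)frameN, (unsigned long long)(chiplet.faults - frameN));
    return 0;
}
```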

Vega should have all the pieces required for MCM already, assuming the new features were functioning. I'm still of the mindset that Navi is a ~200mm² Vega as opposed to a big one. The current small Vegas are all integrated, excluding the Intel thing.
 
... vertex data is spread across all memory channels.
That's the point... if we assume the application programmer optimized the order of vertices/indices for the post- and pre-transform caches, the majority of triangles can be formed with data local to one chiplet. This will most likely require more complicated vertex buffering, but the point is that most vertex data, the work associated with it, and the triangle data can be kept local.
And, to be frank, vertex data (pre-tessellation) is not a huge bandwidth monster.
It's not really about the vertex data though, although it is a bonus along with providing work for all chiplets. The real point is that doing things this way minimizes transfers that have to do with the output of the rasterizer. If you don't batch visibility all at once, overdraw would consume 'extra' bandwidth.
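As a toy illustration of that locality argument (not how any real driver partitions work): if the vertex buffer is striped into contiguous per-chiplet ranges and the index order is already cache-friendly, only triangles straddling a range boundary need remote vertex data.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kChiplets = 4;  // assumed striping: vertex buffer split into 4 contiguous ranges

int ownerOf(uint32_t vertexIndex, uint32_t vertexCount) {
    return static_cast<int>(uint64_t(vertexIndex) * kChiplets / vertexCount);
}

int main() {
    // Toy mesh with indices ordered for pre/post-transform cache locality, so
    // consecutive triangles mostly reference nearby vertices (strip-like ordering).
    const uint32_t vertexCount = 4000;
    std::vector<uint32_t> indices;
    for (uint32_t v = 0; v + 2 < vertexCount; ++v) {
        indices.push_back(v); indices.push_back(v + 1); indices.push_back(v + 2);
    }

    uint32_t local = 0, crossing = 0;
    for (size_t i = 0; i + 2 < indices.size(); i += 3) {
        const int a = ownerOf(indices[i], vertexCount);
        const int b = ownerOf(indices[i + 1], vertexCount);
        const int c = ownerOf(indices[i + 2], vertexCount);
        if (a == b && b == c) ++local; else ++crossing;  // crossing triangles need remote data
    }
    std::printf("chiplet-local triangles: %u, boundary-crossing: %u\n", local, crossing);
    return 0;
}
```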
 
What kind of additional latency do you guys expect from inter-chip(let) connections anyway? It's not like with 3D Rendering in Cinema 4D or Blender, where you have a nice sorting up front and then much much rendering happening in tiny tiles.

EPYC shows something like ~50ns extra latency over a local access if going to another die in the same package. Over xGMI, it seems to take ~120ns.
https://www.servethehome.com/amd-epyc-infinity-fabric-latency-ddr4-2400-v-2666-a-snapshot/
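A rough back-of-the-envelope on what those extra hops do to average access latency, assuming a ~90 ns local DRAM access as a placeholder (that figure is my assumption, not from the article):

```cpp
#include <cstdio>

int main() {
    const double localNs = 90.0;            // assumed local DRAM latency (placeholder)
    const double extraSamePackageNs = 50.0; // quoted extra for die-to-die in the same package
    const double extraXgmiNs = 120.0;       // quoted extra for going over xGMI

    // Average latency if a given fraction of accesses has to leave the local die.
    for (double remote : {0.1, 0.25, 0.5}) {
        std::printf("remote %2.0f%%: same-package avg %5.1f ns, xGMI avg %5.1f ns\n",
                    remote * 100,
                    localNs + remote * extraSamePackageNs,
                    localNs + remote * extraXgmiNs);
    }
    return 0;
}
```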

Yays:

1 - Only relevant info we have about Navi is "scalability" and "next-gen memory". Next-gen memory can only be HBM3, HBM Low-Cost or GDDR6.
I haven't heard whether HBM low-cost is confirmed, since the last I saw Samsung was still shopping the idea around. HBM3 is apparently late 2019/2020, which I would need to reconcile with AMD's roadmap having Navi apparently earlier and Next Gen in that slot.
4 - Vega already has Infinity Fabric in it with no good given reason, so they could be testing the waters for implementing a high-speed inter-GPU bus.
Vega 20 supposedly has xGMI, which would be an off-package bus running over a PCIe physical interface. There can be use cases for that in compute nodes or servers, although if used in the client space it's potentially more able to accelerate resource moves prompted by the copy queue or AMD's existing transfer over PCIe capabilities.

5 - AMD doesn't have the R&D manpower and execution capability to release 4 distinct graphics chips every 18 months, so this could be their only chance at competing with nvidia on several fronts.
Assuming the Polaris and Carrizo refreshes count as insignificant changes, and that the shrinks for the Xbox One S and PS4 Slim aren't big enough changes despite being different chips. There's Xbox One X, PS4 Pro, Vega 10, Raven Ridge, and the Intel custom chip.
I would say the console shrinks and subsequent distinct SOCs are evidence that AMD can find the means to roll out more than 4, if it wants to.
Given the quality of Vega's rollout, I'll grant that it's apparently not able to roll them out very well.

Nays:
1 - Infinity Fabric in Threadripper's/EPYC's current form doesn't provide enough bandwidth for a multi-chip GPU.
It technically could, but the sort of overhead AMD documented for EPYC would lead to more die area and power efficiency lost to the attempt than if they hadn't bothered.

3 - Multi-chip GPU is probably really hard to make, and some like to think AMD doesn't do hard things. Ever.
Then there's the part when the head of RTG was asked about transparently integrated multiple GPUs, and he said he didn't want that.
I can admit to some skepticism for AMD's chance for implementing this, because I think they've been saying they don't want to do that.
The things they do want to do, however, are actually hard and likely not realizable until after 2020.

4 - nvidia released a paper describing a multi-GPU interconnect that would be faster and consume less power-per-transferred-bit than Infinity Fabric, and some people think this is grounds for nvidia being the first in the market with a multi-chip GPU. Meaning erm.. Navi can't be first.
I think the more useful interpretation is that Nvidia gave a reasonable bare minimum for what has to be done for any such solution to be adequate (I think even that is optimistic for what people expect), and even then it only discussed things in terms of compute.
That includes a significantly better interconnect than what's available presently, hardware specialization beyond EPYC's MCM method, and a significant change in the internal memory hierarchy.

What AMD has offered is?

That's an interesting patent, I wonder if that's for Navi:

System and method for using virtual vector register files:
The filing date is June 2016. There's usually a delay between filing and when a feature shows up in a product, if it does. For example the hybrid rasterizer for Vega had an initial filing in March of 2013.
Vega's development pipeline may have had some unusual stalls in it, so we may need to come back to this to see when Navi or its successor is finalized and whether this method appears in it.

Everything else AMD is releasing is heading that way, and research from AMD and Nvidia supports the idea.
The items that AMD discloses for GPU chiplets talk about them being paired with memory standards 2 generations beyond HBM2. Should I only take every other sentence AMD says as evidence and ignore the ones that contradict my desired outcome?
 
Given the quality of Vega's rollout, I'll grant that it's apparently not able to roll them out very well.
One would expect that AMD prioritized its console contracts (IE: PS4 Pro, Bone X) ahead of its own graphics line-up. Anecdotal evidence seems to back that assumption up... :p

*edit: grammar...
 
That's the point... if we assume the application programmer optimized the order of vertices/indices for the post- and pre-transform caches, the majority of triangles can be formed with data local to one chiplet.
My contention is with your original sentence where you say "all work" ([...] they localize all work up to and including rasterization to that chiplet). Now you're saying "not all work". So, uhuh, we agree.
 
How would a multi-chip design look? Would they build 2-3 complete engines on a die and combine 2-4 of those dies?

Or maybe they build a front-end chip, then a shader chip, and finally a back-end chip?
 
My contention is with your original sentence where you say "all work" ([...] they localize all work up to and including rasterization to that chiplet). Now you're saying "not all work". So, uhuh, we agree.
I guess I wasn't clear in my first post; I should have said "they localize all work that can be generated from the local data up to and including rasterization". But I thought my pointing out the NUMA thing and the striping methodology right before that statement alluded to that meaning. Sorry for the confusion.
 
I haven't heard whether HBM low-cost is confirmed,
Lower cost, assuming Intel's EMIB is analogous to not using an interposer.

The items that AMD discloses for GPU chiplets talk about them being paired with memory standards 2 generations beyond HBM2. Should I only take every other sentence AMD says as evidence and ignore the ones that contradict my desired outcome?
Yes, because not all concepts in a research paper will make the final cut, and technology changes. To conserve energy, source and destination need to be close. Even for scaling it makes more sense to tightly couple them, then add the ability to share data on top of that. Limit coherence to only the data where it really matters, which isn't most textures and untransformed geometry.

How would a multi-chip design look? Would they build 2-3 complete engines on a die and combine 2-4 of those dies?

Or maybe they build a front-end chip, then a shader chip, and finally a back-end chip?
Splitting out by shader engine makes the most sense, with 1, 2, and 4 SE parts. Two binning passes and a leap towards 16 SEs across an Epyc/Ripper backplane may be doable.
 
I haven't heard whether HBM low-cost is confirmed, since the last I saw Samsung was still shopping the idea around. HBM3 is apparently late 2019/2020, which I would need to reconcile with AMD's roadmap having Navi apparently earlier and Next Gen in that slot.

The slide is very clear: Navi in 2018 with "Nextgen Memory", after Vega with HBM2:

[Image: AMD GPU roadmap slide]


Perhaps it's HBM2.5 or HBM Low-cost, perhaps it's HBM2 using Intel's EMIB and given a different name, perhaps it's GDDR6 or perhaps it's HBM3 by SK-Hynix coming before Samsung's.
It's not like this industry is entirely predictable over a span of 2-3 years. How long did it take between GDDR5X being announced and launched in a final product? 3 months or so?
Regardless, after this slide I doubt Navi is coming with HBM2 like Vega.


Assuming the Polaris and Carrizo refreshes count as insignificant changes, and that the shrinks for the Xbox One S and PS4 Slim aren't big enough changes despite being different chips. There's Xbox One X, PS4 Pro, Vega 10, Raven Ridge, and the Intel custom chip.
I would say the console shrinks and subsequent distinct SOCs are evidence that AMD can find the means to roll out more than 4, if it wants to.
I don't think console chips count for 2 reasons:
1 - They're not developed solely by AMD, they're a joint venture between teams belonging to AMD and Sony/Microsoft.
2 - The teams who worked on PS4, PS4 Pro, XBone and Xbone X are probably working on the next gen already.

Like you said, Polaris isn't a long way from a 14nm shrink of GFX8 architectures Tonga/Fiji. Carrizo is actually from 2015, but you probably meant Bristol Ridge, which is practically Carrizo with Excavator v2, and the GPU was untouched.
So in practice, what we got was Polaris 10/11 in 2016 and Vega 10/11 + Polaris 12 in 2017. I remember Raja saying 2 distinct GPUs per year was just about what RTG could do..

And that's just not enough to compete with nvidia who launched GP100 + GP102 + GP104 + GP106 + GP107 + GP108 + GV100 within the same period of time.



Then there's the part when the head of RTG was asked about transparently integrated multiple GPUs, and he said he didn't want that.
We've been through that before. Raja was specifically following up on a conversation about ending "Crossfire" (i.e. driver-ridden AFR that needs work per-game on AMD's side) and leaving multi-GPU in DX12 to game developers. Which is what they're progressively doing already.
What exactly did he state in that interview that makes one think he was talking about multi-chip GPUs?


The things they do want to do, however, are actually hard and likely not realizable until after 2020.
Like I said. It's hard. And AMD can't do hard things.


What AMD has offered is?
More than some ideas on a paper.


Should I only take every other sentence AMD says as evidence and ignore the ones that contradict my desired outcome?
You mean this is not what you're doing? Trying to invalidate all facts that point to "Yes" in order to prove your opinion of a "No"?
 
If we could have kernels with no pre-defined limit on the register allocation, oh boy that would be so sweet.

It adds a hardware management engine that distinguishes the less-active register use cases from a smaller set of hot registers, which allows a small local register file, a larger but less energy-efficient register file, and then spill/fill in the memory hierarchy. The theory is that it would dynamically relax the per-wavefront register occupancy constraints, although the pre-defined 256-register ISA limit would remain.
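For illustration only, a small software model of that register-caching idea; the patent describes a hardware management engine rather than the LRU list used here, and every name below is made up:

```cpp
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>

// Hypothetical two-tier register file: a small, fast file holds the hot registers,
// colder values live in the larger, less efficient file (or spill to memory).
// The ISA still only names up to 256 registers, hence the 8-bit register id.
class VirtualRegisterFile {
public:
    enum class Tier { Hot, Cold };
    explicit VirtualRegisterFile(size_t hotSlots) : hotSlots_(hotSlots) {}

    Tier access(uint8_t reg) {
        auto it = hotMap_.find(reg);
        if (it != hotMap_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // refresh LRU position
            return Tier::Hot;
        }
        if (lru_.size() == hotSlots_) {                   // evict the coldest hot entry
            hotMap_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(reg);                             // bring the register into the hot file
        hotMap_[reg] = lru_.begin();
        return Tier::Cold;
    }

private:
    size_t hotSlots_;
    std::list<uint8_t> lru_;
    std::unordered_map<uint8_t, std::list<uint8_t>::iterator> hotMap_;
};

int main() {
    VirtualRegisterFile vrf(32);  // 32 hot slots backing up to 256 architectural registers
    int hot = 0, cold = 0;
    for (int i = 0; i < 10000; ++i) {
        // 90% of accesses hit a small working set; the rest touch colder registers.
        const uint8_t reg = (i % 10 < 9) ? uint8_t(i % 24) : uint8_t(24 + i % 200);
        if (vrf.access(reg) == VirtualRegisterFile::Tier::Hot) ++hot; else ++cold;
    }
    std::printf("hot-file hits: %d, cold/spill accesses: %d\n", hot, cold);
    return 0;
}
```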

Lower cost, assuming Intel's EMIB is analogous to not using an interposer.
That would be different than what Samsung's variant is attempting. EMIB is structurally a small silicon bridge capable of having the same interconnect density as a silicon interposer. Samsung's reduced-cost memory drops the bus width so that it can avoid using silicon, and a memory standard captive to something only Intel seems to have would be questionable.
The reduced-cost HBM was something Samsung was still gauging customer interest in.

The slide is very clear: Navi in 2018 with "Nextgen Memory", after Vega with HBM2:

Perhaps it's HBM2.5 or HBM Low-cost, perhaps it's HBM2 using Intel's EMIB and given a different name, perhaps it's GDDR6 or perhaps it's HBM3 by SK-Hynix coming before Samsung's.
GDDR6 would be a next-generation memory, at least compared to GDDR5. If compared to HBM2, the possibilities seem limited in terms of an upgrade, like a form of HBM2 that lives up to its original specifications. The low-cost variant is a potential cost reduction, but would be somewhat inferior.
I haven't ruled out some distinction in terms of memory revision, caches, or HBCC version that lets AMD say something counts as Next Gen memory, although I do not recall if that bullet point has persisted into more recent slide decks.

I don't think console chips count for 2 reasons:
1 - They're not developed solely by AMD, they're a joint venture between teams belonging to AMD and Sony/Microsoft.
2 - The teams who worked on PS4, PS4 Pro, XBone and Xbone X are probably working on the next gen already.
That's a balance that exists by AMD's choice in priorities, and doesn't take into account how much IP cross-pollination is going on.
The description of the process by the designers for the PS4 Pro and Xbox One X show a lot of high-level interchange and pulling from a menu of options AMD provides. Adjusting for custom external IP, AMD's internal investment in terms of engineering seems on the order of at least a modestly different GPU variant. The implementation and design for manufacturing on chips the same size or larger than Polaris also put the onus on AMD, since AMD eats the manufacturing costs if they cannot get it to yield sufficiently.

Like you said, Polaris isn't a long way from a 14nm shrink of GFX8 architectures Tonga/Fiji.
Polaris was "refreshed" in the last 18 months, which is what I wasn't counting. I considered the initial Polaris launch something of a borderline case, and had forgotten to add that to the count.
Carrizo is actually from 2015, but you probably meant Bristol Ridge, which is practically Carrizo with Excavator v2, and the GPU was untouched.
That would be a refresh that I did not count, I'm not sure if the steppings changed from the end of one line to the start of the next.

So in practice, what we got was Polaris 10/11 in 2016 and Vega 10/11 + Polaris 12 in 2017. I remember Raja saying 2 distinct GPUs per year was just about what RTG could do..
Actually, I had blanked on the other Polaris chips as well, so add two more.

We've been through that before. Raja was specifically following up on a conversation about ending "Crossfire" (i.e. driver-ridden AFR that needs work per-game on AMD's side) and leaving multi-GPU in DX12 to game developers. Which is what they're progressively doing already.
What exactly did he state in that interview that makes one think he was talking about multi-chip GPUs?
Koduri's words concerning moving away from Crossfire were about its abstracting of multiple GPUs as if they were a single GPU. Going forward with the new APIs, the intention was to involve and invest developers in the explicit management of the individual GPUs.
(Source: https://www.pcper.com/news/Graphics...past-CrossFire-smaller-GPU-dies-HBM2-and-more, after 1:50)

Like I said. It's hard. And AMD can't do hard things.
The technologies at AMD and its chain of manufacturing partners do not show a reasonable path to implementing 3D and 2.5D active interposers and chiplets that scale up and down the stack in this decade, no. Nor does it seem like its competitors are realistically positioned to do any better, although some have on occasion expressed skepticism on steps even earlier in AMD's chain of improvements.
I do not think AMD has promised such a transition as early as within the next 12-18 months, which seems consistent with none of the OSATs or foundries AMD relies on gearing up for this, or talking about it beyond the next set of fan-out packaging methods that come out in 1-2 years, which are not close to that goal.

More than some ideas on a paper.
Is the "more" in this case slides for a CPU division product? I do not recall what the "more" is for AMD's GPUs, which I recall is rather vague and long-term.

You mean this is not what you're doing? Trying to invalidate all facts that point to "Yes" in order to prove your opinion of a "No"?
To quote someone's list: "2 - No official news or leaks about Navi have ever appeared that suggest it's a multi-chip solution."

I think it's a stretch to apply EPYC's MCM method, and a mistake to use AMD's post 2020 HPC plans as a guidelines for 2018/2019.
AMD's stated position was that it intended to have explicit multi-adapter allow developers to manage their GPUs up and down the stack.

Once it's exposed and delegated to the developers, there's less pressure for interconnect scaling like what was done for EPYC's more variable allocation needs. The granularity for a lot of the transfers at an API level can be done at a coarser inter-frame granularity or after some heavier synchronization points. There's less back and forth and more coalescing into pages or finalized buffers, so something like PCIe 3.0 or the PCIe 4.0/xGMI in the Vega 20 slides could provide sufficient grunt.
To me that seems like a reasonable path from what we see today to Navi's supposed time window, which isn't that far away anymore.

xGMI or PCIe 4.0 can also go over a PCB rather than require the GPUs stay on the same substrate, which seems like it would make it easier to scale the product up and down without creating a series of modules of increasing size.
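To put rough numbers on the "sufficient grunt" point above, here's the arithmetic under one assumption I'm making up: the cross-GPU payload is a single finalized 4K RGBA16F buffer handed over once per frame.

```cpp
#include <cstdio>

int main() {
    // Assumed payload: one finalized 4K render target copied between GPUs per frame.
    const double bytesPerPixel = 8.0;                       // RGBA16F (assumption)
    const double payloadGB = 3840.0 * 2160.0 * bytesPerPixel / 1e9;
    const double neededGBs = payloadGB * 60.0;              // at 60 frames per second

    const double pcie3x16 = 15.75;                          // ~usable GB/s, PCIe 3.0 x16
    const double pcie4x16 = 31.5;                           // ~usable GB/s, PCIe 4.0 x16

    std::printf("per-frame payload: %.3f GB, sustained: %.1f GB/s\n", payloadGB, neededGBs);
    std::printf("link utilization: %.0f%% of PCIe 3.0 x16, %.0f%% of PCIe 4.0 x16\n",
                100 * neededGBs / pcie3x16, 100 * neededGBs / pcie4x16);
    return 0;
}
```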
 
That would be different than what Samsung's variant is attempting. EMIB is structurally a small silicon bridge capable of having the same interconnect density as a silicon interposer. Samsung's reduced-cost memory drops the bus width so that it can avoid using silicon, and a memory standard captive to something only Intel seems to have would be questionable.
The reduced-cost HBM was something Samsung was still gauging customer interest in.
Different, but reduce the number of pins from HBM2 and you're looking at GDDR. The only difference may be placing the memory close enough to avoid large drivers in the ICs. It avoids the large silicon interposer still, but has the small bridge to retain a high level of IO. Sort of a middle ground if you will. Your thinking was my original thinking as well, but in hindsight we may have been wrong. At the very least they would seem to be competing technologies unless HBM3 is far lower bandwidth or involves interesting signaling.
 
Different, but reduce the number of pins from HBM2 and you're looking at GDDR. The only difference may be placing the memory close enough to avoid large drivers in the ICs. It avoids the large silicon interposer still, but has the small bridge to retain a high level of IO. Sort of a middle ground if you will. Your thinking was my original thinking as well, but in hindsight we may have been wrong. At the very least they would seem to be competing technologies unless HBM3 is far lower bandwidth or involves interesting signaling.
I'm not thinking this as much as reading what Samsung stated.
It's roughly half as wide as HBM2, and running at 50% higher bit rate. It strips out ECC and the base die, and aims for an organic rather than silicon material for its interposer.
https://www.anandtech.com/show/1058...as-for-future-memory-tech-ddr5-cheap-hbm-more
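Plugging those figures in per stack, with HBM2's 1024-bit, 2 Gb/s-per-pin spec numbers as the baseline (half the width at 50% higher bit rate for the low-cost variant):

```cpp
#include <cstdio>

int main() {
    const double hbm2GBs    = 1024 * 2.0 / 8;   // HBM2 stack: 1024-bit at 2.0 Gb/s per pin
    const double lowCostGBs = 512 * 3.0 / 8;    // low-cost proposal: 512-bit at ~3.0 Gb/s per pin

    std::printf("HBM2 stack: %.0f GB/s, low-cost stack: %.0f GB/s (%.0f%% of HBM2)\n",
                hbm2GBs, lowCostGBs, 100 * lowCostGBs / hbm2GBs);
    return 0;
}
```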
 
Then there's the part when the head of RTG was asked about transparently integrated multiple GPUs, and he said he didn't want that.
I can admit to some skepticism for AMD's chance for implementing this, because I think they've been saying they don't want to do that.
The things they do want to do, however, are actually hard and likely not realizable until after 2020.

The filing date is June 2016. There's usually a delay between filing and when a feature shows up in a product, if it does. For example the hybrid rasterizer for Vega had an initial filing in March of 2013.
Vega's development pipeline may have had some unusual stalls in it, so we may need to come back to this to see when Navi or its successor is finalized and whether this method appears in it.

The items that AMD discloses for GPU chiplets talk about them being paired with memory standards 2 generations beyond HBM2. Should I only take every other sentence AMD says as evidence and ignore the ones that contradict my desired outcome?

Okay, what would be your guess and expectations for the next Xbox, if Microsoft is targeting a late 2021 release (4 years after X1X, 8 years after XB1) in terms of AMD GPU architecture, number of CUs and memory bandwidth, and would HBM3 be feasible by then?
 
If we could have kernels with no pre-defined limit on the register allocation, oh boy that would be so sweet.

Oh boy, imagine having up to 2^60 registers. Wait, isn't that a typical CPU with LD/ST instructions? Quick, let's also add a 5th cache hierarchy for 'registers' (after tex/const/D$/I$) including a tiny L0-cache.
 
Okay, what would be your guess and expectations for the next Xbox, if Microsoft is targeting a late 2021 release (4 years after X1X, 8 years after XB1) in terms of AMD GPU architecture, number of CUs and memory bandwidth, and would HBM3 be feasible by then?
If the pattern from the current generation holds, whatever architecture adopted by the console would be much closer to a new card launched in a similar time frame, possibly with a slight delay like with Bonaire. That's potentially something next gen to AMD's Next Gen. Navi hopefully shouldn't be the design under consideration by that point.

However, if the current gen's pattern of 6x to 8x over the prior generation holds, I think it's possible for a 64 CU Navi to get rather close to that if the basis is Durango or Orbis, and that doesn't necessarily go beyond the high-level chip organization we have now. Versus Sea Islands variants of 12 CUs at 0.853 GHz and 18 at 0.8 GHz, a chip with 64 CUs running at a notch below Vega's overly high 14nm clock speeds could actually improve GPU shader performance by almost an order of magnitude over the Xbox One, and that's without 7nm's power scaling or any architectural improvements since Sea Islands. Durango's something of a lower bar to clear, however.
Getting that level of hardware improvement over the launch PS4 would likely require some of those other improvements being taken into account. This is roughly mapping the foundry marketing of 4 node transitions where 28->~20->~16->~12->~7 into something close to the improvement from the 90->65->40->28 that denoted the prior generation's transition, although I'd be more confident with somewhere between 7nm and 5nm to get enough padding to compensate for marketing's inflating the progress being made these days.
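The arithmetic behind that "almost an order of magnitude" claim, counting 64 lanes per CU and an FMA as two ops, with 1.5 GHz assumed as the "notch below Vega" clock:

```cpp
#include <cstdio>

// Peak FP32 throughput for a GCN-style part: CUs * 64 lanes * 2 ops (FMA) * GHz, in TFLOPS.
double tflops(int cus, double ghz) { return cus * 64 * 2 * ghz / 1000.0; }

int main() {
    const double xboxOne = tflops(12, 0.853);  // Durango: 12 CUs at 0.853 GHz
    const double ps4     = tflops(18, 0.800);  // Orbis: 18 CUs at 0.8 GHz
    const double cu64    = tflops(64, 1.5);    // hypothetical 64 CU part at an assumed 1.5 GHz

    std::printf("Xbox One %.2f TF, PS4 %.2f TF, 64-CU part %.2f TF\n", xboxOne, ps4, cu64);
    std::printf("improvement: %.1fx over Xbox One, %.1fx over PS4\n", cu64 / xboxOne, cu64 / ps4);
    return 0;
}
```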

If the baseline for 2021's jump is the 16nm PS4 Pro or Xbox Scorpio, then there are probably going to be problems getting the same spec ratio, and there's a question of diminishing returns where getting the same perceived improvement may need more performance and physical integration tricks. However, the cost and risk picture at that point for AMD is a question mark.

HBM3 is potentially a case where trends are not going in AMD's favor. It doesn't sound like Samsung wants it to be a value play, and it's not certain if the DRAM market's long term price picture matches what AMD expected when it committed to HBM so long ago. Consolidation and process node difficulties in the DRAM market coupled with enterprise and mobile competition for memory production may put a price premium per stack that scales with chip count if doing something like an MCM or PIM.
HBCC and AMD's HPC proposals do show recognition of this, but it is something that is more awkward to handle if there's an architecturally higher minimum in memory cost.
Four years is a long time, however.

I'd imagine there's some potential course adjustments built in based on whether some of these pan out.
However, and this is more of a grenade to throw in the console forum, it's a loaded question to ask about what's feasible for a console more than 3 years out (roughly the design time of the current gen) just in terms of AMD.
2021 is pretty far, and conditions aren't the same in terms of leadership, some of the new products, or the positioning of competitors. Some of the console makers seem to have demonstrated a willingness to Switch, or have shown infrastructure in their backwards compatibility measures to handle a switch.
 