AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Quick question about Navi:

Navi will be the first GPU "designed" by Raja Koduri since he's been back at AMD, no? I mean, he came back in mid-2013, and I guess work on the Vega architecture had already started by then? Or am I overestimating the time it takes to make a GPU?

According to Anandtech:
The timeframe for Raja’s influence depends on what you’re talking about. Raja’s immediate goal is to ensure that AMD has the best GPU architecture/hardware possible. Unfortunately, it will likely take 2 - 3 years to realize this goal - putting the serious fruits of Raja’s labor somewhere around 2015 - 2016. Interestingly enough, that’s roughly the same time horizon for the fruits of Jim Keller’s CPU work at AMD.

This timeframe sweeps through Fiji, Polaris and Vega. In fact, "waiting for Raja" has become somewhat of a trope: ever since Fury, in the run-up to each new chip unveiling part of the hype includes "This is his first design!"; then after the launch the notion dissipates, only to resurface, verbatim, in the lead-up to the next release.
 
This timeframe sweeps through Fiji, Polaris and Vega. In fact, "waiting for Raja" has become somewhat of a trope: ever since Fury, in the run-up to each new chip unveiling part of the hype includes "This is his first design!"; then after the launch the notion dissipates, only to resurface, verbatim, in the lead-up to the next release.

Well, since it's "roughly the same time you'd see Jim Keller's CPUs (Ryzen)", wouldn't that make Vega the first? Vega, from the slides, seems to be the biggest change to GCN yet, though it's not clear why it's not performing anywhere near where it should be.

Vega is also supposed to support Infinity Fabric, though I'm not sure where it will come into play. It sounds like Navi will make heavy use of IF to work similarly to Ryzen, where you have one smaller GPU die that you can combine to build the bigger parts, without the normal downsides of mGPU designs, which require game-engine support. That would allow them to scale from the low end to the top end with a single GPU design, each additional "core" adding roughly 1x performance. That would simplify their design process and keep R&D costs low, similar to how their CPUs are handled. They are on top of their game when it comes to CPUs vs Intel right now, with only two Intel chips worth their cost.
 
Vega is also supposed to support Infinity Fabric, though I'm not sure where it will come into play.
I don't know if the infinity fabric is being used within Vega, but it's used for CPU-GPU connection in Zen+Vega APUs/SoCs.
 
It's used inside the GPU too, apparently to connect the UVDs, VCEs and such

What's the point of using IF in this configuration? I get the benefit of connecting "cores" or "full chips" to each other, like the multiple Zen configurations are showing, but parts of a chip?
 
I don't know if the infinity fabric is being used within Vega, but it's used for CPU-GPU connection in Zen+Vega APUs/SoCs.
What's the point of using IF in this configuration? I get the benefit of connecting "cores" or "full chips" to each other, like the multiple Zen configurations are showing, but parts of a chip?

Rick Merritt - EETimes said:
Infinity is agnostic on topologies and will be implemented like a mesh on Vega, said Maurice Steinman, an AMD fellow for client SoC architectures and modeling. It can provide the full bandwidth of any attached DRAM.

“That was something that we could not do in the past,” Steinman said.

“We had multiple on-die protocols trying to do the same thing that gave us some inefficiencies,” Steinman said. He added that creating a new interconnect “was a huge investment, but we are seeing the ability to do variants and architectures we could not do that we are now embracing.”

In the past, an on-chip network change early in an SoC design cycle might have taken six months, “but we can do it now in a few hours,” he said. In addition, AMD is able to offer more interconnect variants to its ASIC customers, such as the videogame console makers.

The Infinity link is the conduit for a new suite of uses. They range from test and debug functions to new algorithms to check hundreds of on-chip sensors and dynamically adjust power and frequency as the chip has thermal headroom.
http://www.eetimes.com/document.asp?doc_id=1330981&page_number=2
 
Someone should tell that to Vega :LOL:
Raw vs effective bandwidth. Current results just mean they are scheduling accesses inefficiently as a result of drivers. Plus the overclocked memory is running much faster than Fiji's. Thermal throttling is another possibility for reduced bandwidth.

In the Linux drivers it was mentioned that certain functions don't have to wait to "acquire memory", so some QoS is likely occurring.
 
Quick question about Navi:

Navi will be the first GPU "designed" by Raja Koduri since he's been back at AMD, no? I mean, he came back in mid-2013, and I guess work on the Vega architecture had already started by then? Or am I overestimating the time it takes to make a GPU?

I think it's been indicated that GPU design times have been getting longer than 2-3 years in recent generations, not quite as long as a high-end CPU core like Zen, but enough that Anandtech's estimate at the time may have been off.
The level of involvement over time has been uncertain, but the amount of blame that could be shifted to prior leadership has been decreasing.
There's enough time at this point to think he should have had enough of a chance to significantly influence things. Some complications akin to a bubble in the development pipeline might mean he had a less than ideal start, if some of the projects in question actually did start before him and then stalled before a restart. A period of limited resources and corporate shuffling could cause the time frames to stretch.
There's nothing public that would say either way. The initial patent for a hybrid tiled rasterizer that sounds similar to Vega's rasterizer was filed before Koduri came back to AMD, however.

The rumor mill has Navi as the current "this time it's his project", after moving on from Vega.

However, even if that is so, he had time to influence Vega, and whether or not he had the chance to make every decision on it, the bulk of its development and project finalization happened fully under his watch.
If we were to stipulate that Vega has a basis that was at least partly built before Koduri could change things, one possible scenario is that he did change some things--the disruption or increased risk introduced by a mid-course correction could lead to teething pains.

Someone should tell that to Vega :LOL:
What exactly the fabric does and where it plugs in would matter. The full bandwidth of Vega's HBM2 stacks is inferior to the internal crossbar arrangement between the L2 and CUs in any large GCN GPU, so a fabric that only offered DRAM bandwidth would be a significant regression. Dropping the ROPs into that fabric would make things even more problematic.

On the other hand, if there's still high-bandwidth crossbar between the L2 and CUs, then the data fabric's importance to the GPU is unclear. A mesh would be mostly fine if the bulk of its traffic is a direct line between the memory controller and a statically partitioned L2 slice.
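To put rough numbers on why a DRAM-bandwidth fabric would be a regression, here's a back-of-the-envelope sketch. The slice count, core clock and pin rate are my own illustrative assumptions, not confirmed Vega figures:

```python
# Back-of-the-envelope comparison of external HBM2 bandwidth vs. an
# internal L2<->CU crossbar. All figures are illustrative assumptions,
# not confirmed Vega specifications.

hbm2_stacks    = 2
bits_per_stack = 1024
gbps_per_pin   = 1.9
hbm2_bw_gbs = hbm2_stacks * bits_per_stack * gbps_per_pin / 8     # ~486 GB/s

# Assume each L2 slice can return a 64-byte line per core clock to the CUs
# (a GCN-style arrangement); slice count and clock are guesses.
l2_slices       = 16
bytes_per_clock = 64
core_clock_ghz  = 1.5
xbar_bw_gbs = l2_slices * bytes_per_clock * core_clock_ghz        # ~1536 GB/s

print(f"HBM2 DRAM bandwidth: ~{hbm2_bw_gbs:.0f} GB/s")
print(f"L2<->CU crossbar   : ~{xbar_bw_gbs:.0f} GB/s")
print(f"A fabric capped at DRAM bandwidth delivers only ~{hbm2_bw_gbs / xbar_bw_gbs:.0%} of the internal path")
```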
 
One of the things that gives me concern with the concept of Navi being merely a set of compute dies with memory stacked atop them is the density ratios: compute versus memory. Ratios concerning both logic and power.

Already we have 8GB single-stack HBM2 modules. Yes, a consumer GPU with 4 GPU chiplets consisting of 8GB of memory each, totalling 32GB would be a dream come true. But in the real world, apart from being a bad ratio of compute to memory, there's also the power problem: it's very likely that high end GPUs will always be in the region of 300W. Realistically, one or two stacks of HBMx sitting atop compute chips pumping out a hundred or more watts is not going to happen.
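A quick sketch of those ratios with assumed per-chiplet figures (the 100 W compute number and the 300 W board budget are illustrative, not from any roadmap):

```python
# Ratio check for "one compute die with an 8 GB HBM2 stack on top" chiplets.
# All per-chiplet figures are assumptions for illustration only.

chiplets            = 4
memory_per_stack_gb = 8      # single-stack 8 GB HBM2 module per chiplet
board_power_w       = 300    # typical high-end consumer GPU budget
compute_power_w     = 100    # what a meaningful compute die tends to draw

total_memory_gb = chiplets * memory_per_stack_gb   # 32 GB - a lot for consumer
total_compute_w = chiplets * compute_power_w       # 400 W - already over budget

print(f"Total memory : {total_memory_gb} GB")
print(f"Compute power: {total_compute_w} W vs. ~{board_power_w} W board budget,")
print("and each of those ~100 W dies would be cooking the DRAM stacked on it.")
```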

So the processor in memory concept needs to be seen as a subset of GPU functionality, which I think goes back to my initial idea about PIM: that memory-heavy fixed function hardware would be located in PIM, with the rest of the GPU being somewhere else.

So this patent application is kinda interesting:

Interposer having a Pattern of Sites for Mounting Chiplets

The described embodiments include an interposer with signal routes located therein. The interposer includes a set of sites arranged in a pattern, each site including a set of connection points. Each connection point in each site is coupled to a corresponding one of the signal routes. Integrated circuit chiplets may be mounted on the sites and signal connectors for mounted integrated circuit chiplets may be coupled to some or all of the connection points for corresponding sites, thereby coupling the chiplets to corresponding signal routes. The chiplets may then send and receive signals via the connection points and signal routes. In some embodiments, the set of connection points in each of the sites is the same, i.e., has a same physical layout. In other embodiments, the set of connection points for each site is arranged in one of two or more physical layouts.

Effectively it's describing a "one-size fits all" interposer, upon which varying configurations of chiplets can be deployed. The varying configurations would amount to performance grades, i.e. the complete range of SKUs from mainstream to enthusiast. In this model of a GPU, the chiplets are not necessarily all modules that contain HBM. And power hungry chiplets can be manufactured without memory as part of their module. Thus solving both of the ratio problems I described: logic and power.
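Here's a toy model of how I read it (the site and chiplet names are invented purely for illustration; none of this comes from the application itself): one interposer pattern, different chiplet populations per SKU.

```python
# Toy model of the "pattern of sites" idea: one interposer layout, with
# different chiplet populations defining different SKUs. Site and chiplet
# names are invented for illustration.

INTERPOSER_SITES = ["site0", "site1", "site2", "site3", "io_site"]

SKUS = {
    # mainstream: only half the compute sites populated, IO chiplet shared
    "mainstream": {"site0": "compute+HBM", "site1": "compute+HBM",
                   "io_site": "PCIe/display"},
    # enthusiast: every compute site populated on the same interposer
    "enthusiast": {**{s: "compute+HBM" for s in INTERPOSER_SITES[:4]},
                   "io_site": "PCIe/display"},
}

for sku, placement in SKUS.items():
    empty = [s for s in INTERPOSER_SITES if s not in placement]
    print(f"{sku:10s}: {len(placement)} chiplets mounted, empty sites: {empty}")
```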

This is similar to, though ultimately different from the model seen in the Exascale APU paper:

Design and Analysis of an APU for Exascale Computing


where some of the functionality of the package does not feature memory stacked atop logic. On the other hand, the APU is not described as re-configurable to suit performance grades. It's worth noting that this paper couches GPU chiplets, with memory atop, as being in the low 10s of watts of power consumption, which is incompatible with a consumer GPU consisting of 2 or 4 stacks of HBMx memory...

Another interpretation would be to disregard HBMx stacks of memory. As I've noted before, the HBM standard effectively describes a signalling layout for a stack of memory dies atop a base controller die, such that the base controller die can be doing anything, as well as controlling the memory. In theory there's no need to have multiple memory dies in a stack atop a logic die, when doing PIM. Instead each PIM chiplet could feature logic with a single die of memory on top. Now the chiplet count would be far higher, e.g. 16 for a high end GPU, with 1GB of memory in a single die as part of the dual-die PIM chiplet. Again, this would swing the ratios for logic and power back toward something realistic.
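Putting assumed numbers on the thin-chiplet version (the overhead figure is a guess on my part):

```python
# "Thin chiplet PIM": one logic die with a single DRAM die stacked on it,
# many such chiplets per GPU. All figures are assumptions for illustration.

chiplets          = 16
memory_per_die_gb = 1       # single DRAM die per chiplet
board_power_w     = 300
overhead_w        = 40      # assumed: memory-less IO chiplet, VRM losses, fan, etc.

total_memory_gb   = chiplets * memory_per_die_gb              # 16 GB
power_per_chiplet = (board_power_w - overhead_w) / chiplets   # ~16 W each

print(f"{chiplets} chiplets -> {total_memory_gb} GB total memory,")
print(f"~{power_per_chiplet:.0f} W available per chiplet, which is far closer")
print("to what a DRAM die stacked on top could plausibly tolerate.")
```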

This "thin chiplet PIM" model then complies with the desire to scale across performance grades. Also, not all chiplets would have memory atop them. e.g. PCI Express and output driver circuitry would be a common block found in all SKUs, so might be a memory-less chiplet that sits alongside a farm of PIM chiplets.
 
Couldn't they put the shaders on the HBM2 dies as the bottom of the stack, run them slower, and make up for it with higher shader counts?
 
Couldn't they put the shaders on the HBM2 dies as the bottom of the stack, run them slower, and make up for it with higher shader counts?
Memory processors only work well if all the required data + code + execution units are in the same location. Perfect for highly local calculations, but not good for random memory accesses. Shader code can access arbitrary memory locations (not known at scheduling time). We would need a completely different programming model for computing devices like this.
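A small sketch of that contrast (the arrays, the per-die "slices" and both kernels are made up for illustration): elementwise work stays inside a PIM die's own DRAM, while a shader-style gather can land anywhere.

```python
import numpy as np

# Illustration of why processing-in-memory favours local access patterns.
# The data, the per-die "slices" and both kernels are invented examples.

data = np.arange(16, dtype=np.float32)
slices = np.split(data, 4)          # pretend each quarter lives on one PIM die

# PIM-friendly: each output element only needs data held by its own die.
scaled = [s * 2.0 for s in slices]

# Shader-style gather: indices aren't known until runtime, so most lookups
# made by the compute on die 0 would have to leave its local memory.
indices = np.random.randint(0, data.size, size=8)
owning_die = indices // 4
remote = np.count_nonzero(owning_die != 0)
print(f"{remote}/{indices.size} gathers issued from die 0 hit another die's memory")
```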
 
One of the things that gives me concern with the concept of Navi being merely a set of compute dies with memory stacked atop them is the density ratios: compute versus memory. Ratios concerning both logic and power.
It might not be in the cards for Navi. The AMD proposals are for a succession of computing projects whose timelines are past 2020.
Navi does seem to be keyed to next-gen memory and scalability; however, it doesn't seem like the likely next-gen memory candidates, such as HBM3, GDDR6, or maybe cost-reduced HBM, are necessarily more suitable for the PIM model or well-timed for Navi's 2018-2019 time frame.
AMD wants to apply the same GPU silicon to many markets, and in addition to the memory and power concerns brought up there are other trade-offs in terms of density, communication between chips, and physical optimization that would compromise Navi if it tried to adopt a TOP-PIM model while still being the GPU that can slip into the current product spaces.

The active interposer and chiplet scheme, however, is a significant jump in implementation and cost over passive interposers and MCM packaging. MCMs are well understood, but even AMD's best effort here, the links in EPYC, is orders of magnitude short of the needs here. Passive interposers might get closer, but may be marginal even for Navi in terms of cost and complexity--plus AMD would need to do significantly better in terms of interconnect power, bandwidth, and density.
AMD's active interposer assumption is seemingly further out and may be necessary for what it intends to do.

Nvidia's similar concept seems to be better documented: it references nearer-term products as possible contemporaries, its signalling is better substantiated than AMD's promises, and it relies on fan-out packaging rather than an interposer. It has at least one daughter die with some of the more miscellaneous IO and other hardware that would otherwise be uselessly replicated across the chips.

Realistically, one or two stacks of HBMx sitting atop compute chips pumping out a hundred or more watts is not going to happen.
AMD's stacked compute papers have assumed something like a maximum of 10 W for under-stack logic, with perhaps ~5 W for the memory stack and the rest, in order to keep DRAM temperatures at 85°C or less.
The latest exascale concept has a 160W node power budget (200W minus internode and system infrastructure power), and with 8 GPU chiplets that is 20W assuming the CPU sections draw 0W.
More realistically, AMD's modeling gives 40-70W for off-interposer system memory access across workload profiling. Even assuming the CPU segments and other chiplets wouldn't be drawing power idling or supporting the GPUs, that means probably barely more than 10W can be sustained per GPU chiplet.
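Working that arithmetic through explicitly (using the figures quoted above; the split is my own reading of them):

```python
# Per-chiplet power implied by the exascale APU figures cited above:
# 200 W node minus internode/system infrastructure leaves ~160 W, and AMD's
# modeling puts off-interposer memory access at 40-70 W depending on workload.

node_power_w     = 200
infrastructure_w = 40            # internode links and system overhead
memory_access_w  = (40, 70)      # off-interposer DRAM traffic range
gpu_chiplets     = 8

compute_budget_w = node_power_w - infrastructure_w   # 160 W
for mem_w in memory_access_w:
    per_chiplet = (compute_budget_w - mem_w) / gpu_chiplets
    print(f"memory traffic {mem_w} W -> ~{per_chiplet:.0f} W per GPU chiplet "
          "(generously assuming the CPU sections draw nothing)")
```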

So the processor in memory concept needs to be seen as a subset of GPU functionality, which I think goes back to my initial idea about PIM: that memory-heavy fixed function hardware would be located in PIM, with the rest of the GPU being somewhere else.
Some of the fixed function hardware still has relatively high-bandwidth connectivity with the programmable portions, and the GPU's latency hiding isn't as strong for intra-block communication and synchronization as it is for texturing latency.
Unfortunately, without something like AMD's active interposer concept and a next-gen fabric that somehow allays concerns that its active interposer concept is still insufficient, its efforts so far are unsuitable.


Effectively it's describing a "one-size fits all" interposer, upon which varying configurations of chiplets can be deployed.
My interpretation is that this covers various forms of interposer, one of which seems to generally describe the active interposer concept from AMD's latest exascale concept. The exascale paper briefly mentions the interposer providing miscellaneous functionality and networking as part of its duties for supporting the chiplet.
What is more universal is the standardization of interface site formats, which chiplets can conform to and which different interposer designs can combine. I think one rough interpretation is that it does for an interposer what the various slots, package ball-outs, and sockets do for motherboards.

It fits with AMD's dis-integration goals of using interposers to split functions and silicon processes apart and then combine them in a 2.5D package. It seems as if most are on the same page, although others like Nvidia and Intel are looking at not needing an active interposer and stand a good chance of soon having in practice what AMD has on paper.

What AMD might be trying to do is dis-integration to the point that various areas can be treated as a kind of Application Specific Standard Product. It can sell chiplets with just subsets of the overall GPU and freely include other blocks/outside IP as a custom product. The silicon itself would be more generic due to this, although I still question how generic it can be given what its exascale project is doing to the CU implementation. Among other things, clock rates are currently too high, and Vega is not moving in a promising direction. Unless AMD starts giving more concrete details on how it intends to improve the bandwidth, power, and interface pitch, the drop in perf/mm2 and perf/W could readily eclipse the benefit of splitting up the die.

Perhaps Vega's implementation of the Infinity Fabric is a first step, and we can look at its non-progress in perf/mm2 and perf/W as an indication of what even that step costs, before taking the scaling hits from leaving the die. The CU area in its marketing picture is unambiguously much less dense than Polaris's, despite currently not offering much more in terms of what it delivers per CU.

Couldn't they put the shaders on the HBM2 dies as the bottom of the stack, run them slower, and make up for it with higher shader counts?
One of AMD's concepts had a big GPU die that then connected to TOP-PIM or similar HBM stacks with mini GPUs under it.
Workloads could try to leverage whichever silicon suited them best.
Granted, I think that was 2 or more of AMD's aspirational compute concepts in the past.
 
It might not be in the cards for Navi. The AMD proposals are for a succession of computing projects whose timelines are past 2020.
Navi does seem to be keyed to next-gen memory and scalability; however, it doesn't seem like the likely next-gen memory candidates, such as HBM3, GDDR6, or maybe cost-reduced HBM, are necessarily more suitable for the PIM model or well-timed for Navi's 2018-2019 time frame.
Also, the leaked slide with GPUs and server segments positions Navi 10 and 11 as the direct successors to Vega 10 and 11 respectively, at least where the slide is concerned. I don't see any sign of the speculated multi-die approach with Navi from this slide although I can't rule it out either. (Unless I'm missing something, this slide is the only mention of specific codenames and positioning for Navi so far.)

(EDIT: Videocardz doesn't allow direct image linking.)

It would be funny if GDDR6 is the "Nexgen memory" from the Capsaicin roadmap. The timing fits if Vega 11 doesn't use GDDR6.
 
I don't see any sign of the speculated multi-die approach with Navi from this slide although I can't rule it out either.
Glad to see that I'm not the only one on this.

But even if that's the way Navi will go, I don't see how, say, two 400 mm² dies on an interposer could be cheaper than one 750 mm² die with a few shader cores disabled. Yield is so easy to recover with a bit of redundancy, especially when the size of an individual core is getting smaller and smaller.

And that's assuming that this 2 die solution would have the same performance as the one die solution, which I doubt as well.
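For what it's worth, a crude yield model backs up that intuition once redundancy is factored in (the wafer cost, defect density and salvage rate are assumptions, and the two-die option would still need an interposer and extra packaging on top):

```python
import math

# Crude cost comparison: two 400 mm^2 dies vs. one 750 mm^2 die.
# Wafer cost, defect density and the salvage rate are assumptions; the
# simple Poisson yield model ignores systematic defects and edge effects.

wafer_cost  = 6000.0                               # assumed $ per 300 mm wafer
wafer_area  = math.pi * 150**2 * 0.85              # usable mm^2, edge loss assumed
defects_mm2 = 0.001                                # 0.1 defects per cm^2

def cost_per_good_die(die_mm2, salvage=0.0):
    yield_frac = math.exp(-defects_mm2 * die_mm2)  # Poisson yield
    yield_frac += (1 - yield_frac) * salvage       # dies recovered via redundancy
    return wafer_cost / (wafer_area / die_mm2 * yield_frac)

print(f"2 x 400 mm^2, no salvage : ~${2 * cost_per_good_die(400):.0f} per GPU (plus interposer)")
print(f"1 x 750 mm^2, no salvage : ~${cost_per_good_die(750):.0f} per GPU")
print(f"1 x 750 mm^2, 80% salvage: ~${cost_per_good_die(750, salvage=0.8):.0f} per GPU")
```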
 