AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Status
Not open for further replies.
but the difference in R&D budget is so huge...
More R&D money lets you launch more dies at once, or optimize designs further.
Money can't save dead architectures; Otellini's Intel had all the money in the world yet it didn't save Larrabee. At all.
 
Yays:

1 - The only relevant info we have about Navi is "scalability" and "next-gen memory". Next-gen memory can only be HBM3, low-cost HBM or GDDR6.
2 - AMD has been making a fuss about Infinity Fabric and its scalability.
3 - AMD has been making a fuss about ending the pursuit of monolithic chips in favor of several smaller chips on a single substrate, because it's a whole lot cheaper to make overall.
4 - Vega already has Infinity Fabric in it with no good reason given, so they could be testing the waters for implementing a high-speed inter-GPU bus.
5 - AMD doesn't have the R&D manpower and execution capability to release 4 distinct graphics chips every 18 months, so this could be their only chance at competing with nvidia on several fronts.



Nays:

1 - Infinity Fabric in its current Threadripper/EPYC form doesn't provide enough bandwidth for a multi-chip GPU.
2 - No official news or leaks about Navi have ever suggested it's a multi-chip solution.
3 - A multi-chip GPU is probably really hard to make, and some like to think AMD doesn't do hard things. Ever.
4 - nvidia released a paper describing a multi-GPU interconnect that would be faster and consume less power per transferred bit than Infinity Fabric, and some people take this as grounds for nvidia being first to market with a multi-chip GPU. Meaning, erm... Navi can't be first.





Money can't save dead architectures; Otellini's Intel had all the money in the world yet it didn't save Larrabee. At all.
Main difference being that Intel can afford to take huge risks that turn into failures like Larrabee, and other crazy projects like SoFIA's Atom SoCs with (old, low-end) ARM GPUs... and still make tens of billions every year, surpassing their own records YoY.

But it's not like everything is lost for Knights Landing. There's a benchmark in the Monero benchmark database claiming the Xeon Phi 7210 does 2770 H/s at 215W. And its price is awfully close to a Vega 56's nowadays.
;)

We're talking >10 kH/s on a 700W rig.
 
3 - AMD has been making a fuss about ending the pursuit of monolithic chips in favor of several smaller chips on a single substrate, because it's a whole lot cheaper to make overall.
I sort of remember this, where is this from exactly?

I think if AMD goes multi-chiplet with Navi it will only be for a Titan/x80ti competitor. Hopefully everything below will be single chip.
 
I sort of remember this, where is this from exactly?
Hotchips 2017:

JHX2IHw.jpg


I think if AMD goes multi-chiplet with Navi it will only be for a Titan/x80ti competitor. Hopefully everything below will be single chip.
I'd say 200-250mm^2, which is the range of Ryzen's Summit Ridge (213mm^2) and Polaris 10 (233mm^2).

I think GPUs smaller than 200mm^2 are bound to go inside APUs eventually, at least on AMD's side.
 
Are we seeing history repeat itself? First HBCC as a distant cousin of TurboCache / HyperMemory.

Now maybe a cousin of the Voodoo days of multiplying dies, this time on a substrate instead of on the card itself.

Eerie. Even more so for AMD, since both techs died an ugly death.
 
Ehm, isn't Fiji a double Tonga?
And Vega is going down, it's already inside APUs and smaller dies are incoming.
Tonga had a different MC and a different VP. Not a scale-down of Fiji (or the other way around). And what's inside the APUs doesn't use HBM2 (right now).
 
That's an interesting patent, I wonder if that's for Navi:

System and method for using virtual vector register files:

Described is a system and method for using virtual vector register files that may address all of the bottlenecks presented by current register file architecture by yielding lower die area, lower power and faster SIMT units while balancing low latency and register usage. The virtual vector register file architecture can include a two level, non-homogenous hardware vector register file structure that can yield considerable power benefits by avoiding the access of large structures in favor of small structures whenever possible. Management of vector register allocation between the two levels is provided in order to minimize the number of accesses to a larger vector register file. In particular, the virtual vector register file architecture provides more efficient management of vector register file storage by avoiding having a large percentage of vector registers that are unusable at any given time and reducing vector register file size. For example, for the "super pixel shaders," the virtual vector register file neatly avoids spending costly physical vector register storage on unused (or used once or twice, then dead) vector registers.

vC4LoiK.jpg
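Roughly, the two-level idea from the abstract could look like the sketch below. To be clear, the class name, sizes and LRU spill policy here are my own assumptions for illustration, not anything from the patent:

```python
# Hypothetical sketch of a two-level vector register file: a small, fast
# level-0 structure backed by a larger level-1 structure. The LRU spill
# policy is an assumption, not taken from the patent text.

class TwoLevelVRF:
    def __init__(self, l0_slots=8, l1_slots=64):
        self.l0 = {}            # vreg id -> data (small, low-power structure)
        self.l0_slots = l0_slots
        self.l1 = {}            # vreg id -> data (large backing structure)
        self.l1_slots = l1_slots
        self.order = []         # LRU order of L0-resident registers
        self.l1_accesses = 0    # count of costly big-structure accesses

    def _touch(self, vreg):
        if vreg in self.order:
            self.order.remove(vreg)
        self.order.append(vreg)

    def write(self, vreg, data):
        if vreg not in self.l0 and len(self.l0) >= self.l0_slots:
            victim = self.order.pop(0)           # spill LRU register to L1
            self.l1[victim] = self.l0.pop(victim)
            self.l1_accesses += 1
        self.l0[vreg] = data
        self._touch(vreg)

    def read(self, vreg):
        if vreg in self.l0:                      # fast path: small structure
            self._touch(vreg)
            return self.l0[vreg]
        data = self.l1.pop(vreg)                 # slow path: fetch from L1
        self.l1_accesses += 1
        self.write(vreg, data)                   # promote back into L0
        return data
```

The point of the structure is visible even in this toy: as long as the working set of live registers fits in L0, the large file is never touched, which is where the claimed power saving would come from.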
 
While I agree that binned rasterisation is a task that would be perfect for the base die of a PIM module ...

Well, I was thinking about this some more and have some thoughts. I think the key to AMD's strategy with a multi-chip Navi centers around DSBR and its deferred work feature. If we consider opaque triangles with no writing to UAVs in the pixel shader (and no memory accesses other than initial geometry in the geometry stages), you can guess at an inter-chip bandwidth/performance optimization strategy. If the chips are set up similarly to a NUMA system, with each chip having one or more memory channels, and we assume they are logically striped at a per-chiplet granularity (in case I'm not being clear, I mean all memory channels hooked up to chiplet 1 form the first stripe, chiplet 2 the second stripe, and so on), my guess is that they localize all work up to and including rasterization to that chiplet.
... vertex data is spread across all memory channels. There's no way to avoid having communication amongst PIMs in this case. And, to be frank, vertex data (pre-tessellation) is not a huge bandwidth monster.
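For what it's worth, the per-chiplet striping being discussed can be sketched in a few lines; the chiplet count and stripe granularity below are purely illustrative assumptions:

```python
# Sketch of per-chiplet memory striping: physical addresses interleave
# across chiplets at a fixed granularity, so all memory channels hooked up
# to chiplet 0 form stripe 0, chiplet 1 forms stripe 1, and so on.
# NUM_CHIPLETS and STRIPE_BYTES are illustrative guesses, not known values.

NUM_CHIPLETS = 4
STRIPE_BYTES = 4096

def owning_chiplet(address):
    """Which chiplet's memory channels hold this address."""
    return (address // STRIPE_BYTES) % NUM_CHIPLETS

def is_local(address, chiplet):
    """True if work running on `chiplet` can read `address` without
    crossing the inter-chip fabric."""
    return owning_chiplet(address) == chiplet
```

Under this mapping, localizing rasterization to the owning chiplet means every framebuffer access where `is_local()` holds stays off the inter-chip link; vertex data, spread across all stripes, is exactly the part that can't avoid it.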
 
Tonga had different MC, different VP. Not a scale-down of Fiji (or the other way around). What's inside the APUs uses (right now) not HBM2.
You need to decide whether you're talking about just the gfx portion or the whole chip (which leads to tons of different generations spawning from nowhere due to minuscule differences).
If we go by "GCN generations", which limits itself to the actual GPU part instead of the whole chip, Tonga and Fiji are the same, all Vegas are the same, etc. AMD has the capability to use whichever memory controller they see best fit for each chip*: they could do Polaris+HBM and it would still be Polaris, they could do Vega+GDDR and it would still be Vega, just like Vega + shared controller on the APU is still Vega.

*and apparently infinity fabric made this just a lot easier, even though they had the capability before too
 
What kind of additional latency do you guys expect from inter-chip(let) connections anyway? It's not like with 3D Rendering in Cinema 4D or Blender, where you have a nice sorting up front and then much much rendering happening in tiny tiles.
Huh? That's exactly what monolithic GPUs are, right now. Each stage of the graphics pipeline consists of compute followed by some kind of sorting/filtering for the next stage to work on. So in simple terms: vertices are shaded; they're assembled (sorted) into triangles and some are culled; they're rasterised, filtered for visibility and sorted into quads of fragments; the fragments are shaded; and then they're sorted to preserve triangle ordering (if required) and filtered (for visibility) for blending with the render target.

The extra latency amongst chiplets would require that between-stage buffers (queues) are larger so that they can handle variations in throughput.

So if pipeline stage B can consume 0.25 units of work from stage A per clock, and stage A does 0.5, 0.25 or 0.125 units of work per clock, you might give stage A a 2-unit buffer. This buffer would then handle the situation where A spends 8 clocks producing one unit of work, provided that there were 2 units of work already in the buffer. Obviously if A takes longer than 8 clocks, then B will end up idling. The design will be balanced for "typical" usage, not the extremes. Sometimes A will be forced to pause because B says "I can't take it, you're going too fast for me, my buffer's full!"

So, now put A and B into two separate chiplets with a 2 clock delay for work from A to B. So that's the equivalent of A producing one unit of work when it's fastest. So you could add one unit of work to B's buffer: it now buffers 3 units of work from A. Alternatively you might say the average throughput of A is 0.25, so B needs to buffer 2 more units of work. So as long as there is variation in throughput, sometimes faster than average and sometimes slower, then A will on average have the same throughput in the MCM or monolithic designs.

You would simulate the variations in throughput of A to determine the size of the buffer in the monolithic chip. And for an MCM you would add latency into the simulation. To complicate the simulation there might be variable latency amongst the chiplets (one hop or two?) and it's likely that B isn't constant throughput. But the latter affects the monolithic design too.

The end result is that increased buffering is one of the costs of MCM versus monolithic.
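The A/B example above is easy to play with in a toy discrete-time simulation. The buffer sizes and the 0.25-units-per-clock consumer are taken from the example; the simulation harness and the bursty production pattern are my own, purely to show where the extra buffering cost comes from:

```python
# Toy model of two pipeline stages: A produces a variable amount of work per
# clock, B consumes up to 0.25 units per clock, and a bounded buffer sits
# between them, optionally behind a fixed inter-chiplet link delay.

def simulate(produce_pattern, clocks, buffer_size, latency=0):
    """Return how many clocks stage B idled for lack of work."""
    buffered = 0.0
    in_flight = [0.0] * latency   # work traversing the inter-chiplet link
    idle = 0
    for clock in range(clocks):
        produced = produce_pattern[clock % len(produce_pattern)]
        if latency:
            in_flight.append(produced)     # enters the link this clock...
            arriving = in_flight.pop(0)    # ...work from `latency` clocks ago
        else:
            arriving = produced
        # Cap at buffer_size: excess models A being forced to pause.
        buffered = min(buffer_size, buffered + arriving)
        if buffered >= 0.25:
            buffered -= 0.25               # B consumes 0.25 units per clock
        else:
            idle += 1                      # B starves this clock
    return idle

# A bursty producer that still averages 0.25 units/clock overall:
pattern = [0.5, 0.125, 0.125, 0.25]
```

Running `simulate(pattern, 400, buffer_size=2)` against `simulate(pattern, 400, buffer_size=2, latency=2)` shows the MCM version idling more for the same buffer, which is exactly the gap that the extra buffering has to cover.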
 
Did anyone think about Navi being a family of chips, just scaling from low end to high end? In contrast to Vega and Fiji, which did not scale down.

Yes, AMD has been running two diverging families to fill out the product stack for several years now, primarily as a function of their HBM commitment:
Hawaii (and family) + Fiji
Polaris + Vega

Of course, the rest of the Vega stack could still end up being a true top-to-bottom family.

As for Navi... The "scalability" (provided that is still the key design imperative; things have had ample opportunity to change) could really mean anything. It was first discussed in the context of multiple chips and DX12 explicit multi-adapter; it was not until EPYC's unveiling that the thinking shifted to an MCM design; in reality it could be referring to highly modular, diverse IP libraries allowing for easy composition of designs to address various markets, with CUs, special-purpose units like tensor cores, etc. being almost plug-and-play.
 
You need to decide whether you're talking about just the gfx portion or the whole chip (which leads to tons of different generations spawning from nowhere due to minuscule differences).
If we go by "GCN generations", which limits itself to the actual GPU part instead of the whole chip, Tonga and Fiji are the same, all Vegas are the same, etc. AMD has the capability to use whichever memory controller they see best fit for each chip*: they could do Polaris+HBM and it would still be Polaris, they could do Vega+GDDR and it would still be Vega, just like Vega + shared controller on the APU is still Vega.

*and apparently infinity fabric made this just a lot easier, even though they had the capability before too

I feel like HBM/interposer use is much more of an inflection point, and "choosing whichever controller they see fit" really undersells the amount of effort required for current generations of AMD chips to use one memory type over the other. I guess we will see how the Vega family plays out.
 