AMD: Navi Speculation, Rumours and Discussion [2017-2018]


So I've been bothered by the meanings of "Scalability" and "Nexgen Memory" in this slide associated with Navi:

[AMD roadmap slide: Navi, labelled "Scalability" and "Nexgen Memory"]


And I've just realised that this could be the HBM design with GPU logic, e.g. ROPs, in the base of each stack of HBM memory, as part of the logic die.

This concept came up a while back under the title "Processing in Memory":

https://www.eecis.udel.edu/~lxu/resources/TOP-PIM: Throughput-Oriented Programmable Processing in Memory.pdf

Though it's worth noting that the paper talks about putting compute units in the base logic too, so practically an entire shader engine plus ROPs in the base die.
 
How would you cool the base dies if you put significant computing elements in them?
 
The paper talks about a 10W limit on the graphics functionality within the base die. The estimate is that 12 CUs and associated logic on 16nm would meet that constraint, which puts it between Cape Verde (10 CUs) and Bonaire (14 CUs). The clocks would be much lower, though.
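
As a rough sanity check on that 10W figure (all the per-CU and overhead numbers below are my own guesses, not from the paper), the arithmetic looks something like this:

```python
# Back-of-envelope sketch: how a ~10 W base-die budget might map onto CU count
# and clock. Per-CU power, overhead and clock are illustrative assumptions.
BASE_DIE_BUDGET_W = 10.0   # power/thermal limit quoted for the base die
ASSUMED_W_PER_CU = 0.7     # hypothetical per-CU power at a low clock on 16nm
FIXED_OVERHEAD_W = 1.5     # hypothetical: ROPs, L2 slice, fabric, clocking

max_cus = int((BASE_DIE_BUDGET_W - FIXED_OVERHEAD_W) / ASSUMED_W_PER_CU)
print("CUs that fit the budget under these assumptions:", max_cus)

# Throughput of a 12-CU block at a much-reduced clock, for scale:
cus, clock_ghz = 12, 0.5                          # clock value assumed
tflops = cus * 64 * 2 * clock_ghz * 1e9 / 1e12    # 64 lanes/CU, FMA = 2 FLOPs
print(f"{cus} CUs @ {clock_ghz} GHz ~= {tflops:.2f} TFLOPS")
```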

HBM2 stacks are high enough that they will cause clearance issues with the heatsink mounted on the graphics chip. It might be simpler just to configure the cooling for both the memory stacks and the graphics chip.

I'm not convinced that CUs in the base logic is a good approach. Xenos showed that the separation works very well: raster-operation latencies hold up without undue stress even with the ROPs off-chip.

A key point of the paper is that the logic in the base die has access to substantially higher bandwidth from within the stack (4x). I'm not sure if this is actually possible with HBM. One could argue that a new variant of the HBM stack could be designed such that intra-stack bandwidth would be monstrous but ex-stack bandwidth would be a fraction of that value.

The paper tries to model the effect of the extreme intra-stack bandwidth on a variety of applications and justifies the PIM architecture's efficiency/performance on that basis.

I wonder if it's possible to implement an L2 cache that's within the memory stacks (e.g. 4 stacks each have one-quarter of the L2). The latency across the interface from the stacks to the graphics chip would be higher than if L2 was on the graphics chip, but it would be able to have all the general qualities of L2. This would result in an architecture where all of the traffic between the graphics chip and the memory stacks would be L2 traffic.

I suppose we need some estimates of L1<->L2 traffic in GCN, per CU or per shader engine. These estimates would need to add in ROP traffic, since GCN currently doesn't route ROP data through L2.
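
For what it's worth, here's the shape such an estimate would take; every rate in it (vector memory ops per clock, L1 miss rate, ROP bytes per pixel) is an assumed placeholder rather than a measured GCN number:

```python
# Sketch of the L1<->L2 (plus ROP) traffic estimate asked for above.
def l1_l2_traffic_gbs(cus, clock_ghz, vmem_per_cu_per_clk=0.25,
                      l1_miss_rate=0.4, line_bytes=64):
    """L1->L2 bandwidth in GB/s under assumed issue and miss rates."""
    misses_per_s = cus * clock_ghz * 1e9 * vmem_per_cu_per_clk * l1_miss_rate
    return misses_per_s * line_bytes / 1e9

def rop_traffic_gbs(pixels_per_clk, clock_ghz, bytes_per_pixel=8):
    """ROP colour+Z traffic, which GCN currently keeps out of L2."""
    return pixels_per_clk * clock_ghz * 1e9 * bytes_per_pixel / 1e9

shader = l1_l2_traffic_gbs(cus=36, clock_ghz=1.0)   # assumed mid-size GPU
rops = rop_traffic_gbs(pixels_per_clk=32, clock_ghz=1.0)
print(f"assumed L1<->L2: {shader:.0f} GB/s, ROP: {rops:.0f} GB/s, "
      f"total host<->stack traffic: {shader + rops:.0f} GB/s")
```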
 
One could argue that a new variant of the HBM stack could be designed such that intra-stack bandwidth would be monstrous but ex-stack bandwidth would be a fraction of that value.
What would be the benefit of this, though? Little software these days uses pure rasterization, and the software that does typically doesn't make even today's graphics hardware sweat; instead, textures and framebuffer data tend to get pulled through lots of shader programs, which requires the data to be transferred to the main GPU die for processing anyway...
 
A key point of the paper is that the logic in the base die has access to substantially higher bandwidth from within the stack (4x). I'm not sure if this is actually possible with HBM. One could argue that a new variant of the HBM stack could be designed such that intra-stack bandwidth would be monstrous but ex-stack bandwidth would be a fraction of that value.

The paper tries to model the effect of the extreme intra-stack bandwidth on a variety of applications and justifies the PIM architecture's efficiency/performance on that basis.
The base layer of the HBM is just aggregating the DDR I/O from the upper layers, which is different from the case of Xenos' custom eDRAM. So the benefit could be limited to the power savings from the ROP-DRAM traffic, unless you would also like to cut the external bandwidth to the host SOC.

But I think the most overlooked aspect is the interleaving granularity, or the locality of the stacks. As far as I understand, the current papers from AMD seem to indicate a NUMA-ish PIM architecture with each PIM being an explicit "compute GPU", whereas the discrete AMD GPUs we have today interleave between channels every 4 cache lines.
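
To make the contrast concrete, here's a toy sketch of the two placement policies; the channel/stack counts and the 1 GiB region size are illustrative assumptions, only the 4-cache-line granule comes from the above:

```python
# Fine-grained channel interleaving vs a NUMA-ish per-stack ownership model.
LINE = 64                 # cache line size in bytes
GRANULE = 4 * LINE        # 4 cache lines before hopping to the next channel
CHANNELS = 32             # assumed: 4 stacks x 8 channels
STACKS = 4

def interleaved_channel(addr):
    """Fine-grained interleave: neighbouring 256-byte granules land on different channels."""
    return (addr // GRANULE) % CHANNELS

def numa_stack(addr, region_bytes=1 << 30):
    """NUMA-ish PIM: whole 1 GiB regions owned by a single stack (assumed)."""
    return (addr // region_bytes) % STACKS

for addr in (0x000, 0x100, 0x200, 0x4000_0000):
    print(hex(addr), "-> channel", interleaved_channel(addr), "| stack", numa_stack(addr))
```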

I wonder if it's possible to implement an L2 cache that's within the memory stacks (e.g. 4 stacks each have one-quarter of the L2). The latency across the interface from the stacks to the graphics chip would be higher than if L2 was on the graphics chip, but it would be able to have all the general qualities of L2. This would result in an architecture where all of the traffic between the graphics chip and the memory stacks would be L2 traffic.
It could be doable and beneficial for RMW operations (ROP/L2 atomics). But you would probably have to move to a bi-directional proprietary multiplexed interface (memory requests + pipeline control + shader export bus) for the stacks, while keeping the L1-L2 crossbar on the SOC. There might also be a downside in latency and power for kernels that fit in the L2 cache. So (from an average Joe's POV) it doesn't seem to be a huge benefit.

As a side note, AFAIK the GCN architecture currently has a "virtually addressed" cache hierarchy for its local memory (supposedly VIVT), where the translation happens on L2 misses. So this could be another issue on the list (distributed TLBs with the MMU on the SOC, heh?).
 
What would be the benefit of this, though? Little software these days uses pure rasterization, and the software that does typically doesn't make even today's graphics hardware sweat; instead, textures and framebuffer data tend to get pulled through lots of shader programs, which requires the data to be transferred to the main GPU die for processing anyway...
Most rendering techniques are still substantially sensitive to render-target bandwidth. It's why, for example, developers struggle to pack all their data into the smallest possible formats during G-buffer creation. It's why MSAA is often unusably slow. And so on.
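
A quick back-of-envelope on why that matters (the layouts, overdraw factor and frame rate are assumptions, and this counts G-buffer writes only; the lighting pass reads it all back again on top):

```python
# Render-target write bandwidth for G-buffer creation at 4K/60.
WIDTH, HEIGHT, FPS = 3840, 2160, 60          # assumed target resolution/rate
PIXELS = WIDTH * HEIGHT

def gbuffer_write_gbs(bytes_per_pixel, overdraw=1.5, samples=1):
    """Write bandwidth for G-buffer creation alone, in GB/s (overdraw assumed)."""
    return PIXELS * bytes_per_pixel * overdraw * samples * FPS / 1e9

print("tight 12 B/px G-buffer :", round(gbuffer_write_gbs(12), 1), "GB/s")
print("fat   20 B/px G-buffer :", round(gbuffer_write_gbs(20), 1), "GB/s")
print("fat G-buffer + 4x MSAA :", round(gbuffer_write_gbs(20, samples=4), 1), "GB/s")
```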

Also, it now turns out that programmable blending is coming to a D3D near us, soon. Thank fuck for that, it's only been about 10 years we've been talking about it. Putting something more substantial than atomics close to VRAM is going to become pretty important soon. So I might have to revise my opinion that CUs probably won't appear inside the base logic.

Which then raises the question: is a host die at the centre of a set of "nexgen" memories with graphics logic in their base dies still going to have shader engines? The estimates for power/area in the paper at the 16nm node seem pretty decent. A set of 4 of these nexgen memories, without a host die, will be too slow for 2018. Would 8 of these nexgen memories be fast enough?
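
Very roughly, taking the paper's 12-CU base die and assuming a low clock and a guessed 2018 high-end target, the scaling question looks like this:

```python
# How many 12-CU "nexgen memory" base dies would it take? Clock and the
# 2018 target are assumptions; 12 CUs per stack comes from the paper.
def pim_tflops(stacks, cus_per_stack=12, clock_ghz=0.7):
    """FP32 throughput of N base dies: 64 lanes/CU, 2 FLOPs per FMA."""
    return stacks * cus_per_stack * 64 * 2 * clock_ghz * 1e9 / 1e12

TARGET_2018 = 12.0   # assumed ballpark for a 2018 high-end part, in TFLOPS
for stacks in (4, 8):
    t = pim_tflops(stacks)
    print(f"{stacks} stacks ~= {t:.1f} TFLOPS vs an assumed {TARGET_2018} TFLOPS target")
```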

There would still need to be a display controller, PCI Express switch etc.
 
The base layer of the HBM is just aggregating the DDR I/O from the upper layers, which is different from the case of Xenos' custom eDRAM. So the benefit could be limited to the power savings from the ROP-DRAM traffic, unless you would also like to cut the external bandwidth to the host SOC.
I don't know how peaky the host<->nexgen bandwidth consumption would be and whether it would be possible to get away with a reduction in bandwidth for that link.

But I think the most overlooked aspect is the interleaving granularity, or the locality of the stacks. As far as I understand, the current papers from AMD seem to indicate a NUMA-ish PIM architecture with each PIM being an explicit "compute GPU", whereas the discrete AMD GPUs we have today interleave between channels every 4 cache lines.
That's why I described the architecture in terms of distributed L2. A crossbar within the host is implicit.

It could be doable and beneficial for RMW operations (ROP/L2 atomics). But you would probably have to move to a bi-directional proprietary multiplexed interface (memory requests + pipeline control + shader export bus) for the stacks, while keeping the L1-L2 crossbar on the SOC.
As far as I can tell the HBM family of standards is for the stack of memory dies. What the base die does, apart from connecting to that stack, is up to whoever makes the base die. So the connection host<->nexgen would be entirely proprietary.

I dare say this is the genius of the HBM standard.

There might also be a downside in latency and power for kernels that fit in the L2 cache. So (from an average Joe's POV) it doesn't seem to be a huge benefit.
I don't think latency is the top-most priority for GPU L2. L1<->L2 traffic obviously costs more power if it goes off-die, and we have no way to assess how that trades off against the intra-stack power savings.

But I'm afraid to say, now that programmable blending is looking likely, soon, I have to admit that some kind of generic compute in the nexgen memories is hard to argue against.

As a side note, AFAIK the GCN architecture currently has a "virtually addressed" cache hierarchy for its local memory (supposedly VIVT), where the translation happens on L2 misses. So this could be another issue on the list (distributed TLBs with the MMU on the SOC, heh?).
GCN L2 is already distributed on the far side of a crossbar, so I'm not sure what's different...
 
As far as I can tell the HBM family of standards is for the stack of memory dies. What the base die does, apart from connecting to that stack, is up to whoever makes the base die. So the connection host<->nexgen would be entirely proprietary.

I dare say this is the genius of the HBM standard.
I don't think this is true.

The standard specifies the interface to the stack, but doesn't say whether or not to use a separate non-memory base die. It can be made to work with or without a base die.
 
Sorry, "would" in that sentence is misleading. "Would" in the context of a design that uses custom base dies. Or "could" in the context of HBM, generally.

Obviously custom base dies change the cost equation...
 
Just a random thought, but I got to ask:

How plausible is it that Navi will drop most internal crossbars as well as the central memory controller in favor of AMD's Coherent Fabric?

Next-gen memory might actually connect directly to the fabric as well; think of it as localized memory controllers.
 
Just a random thought, but I got to ask:

How plausible is it that Navi will drop most internal crossbars as well as the central memory controller in favor of AMD's Coherent Fabric?

Next-gen memory might actually connect directly to the fabric as well; think of it as localized memory controllers.
There is a paper on the use of the interposer to provide the routing path for a multicore chip's interconnect.
http://www.eecg.toronto.edu/~enright/micro14-interposer.pdf

Current bump pitch is somewhat worse than they assume, and the internal data paths of GPUs are far wider than the network put forward. A naive implementation already consumed too much area, so the concentrated network with fewer drops to the interposer was needed in order to leave enough area free of interconnect to make things worthwhile.
The current GPU networks have cache slices that move data in 64 byte chunks, with many clients sporting data stops of that width.

One of Intel's objections to the current way of implementing 2.5D is that the pitch is an order of magnitude (possibly more) coarser than it needs to be. Those selling interposers promise they could do better someday.

There are methods other than copper pillars and microbumps, although that is the one that appears to have reached some level of commercial viability for 2.5D. I've seen other tech, like tungsten pillars or close-proximity signalling, that might help with the pitch problem. The latter would need an active interposer of some sort.
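
For a feel of the bump-pitch issue, here's some crude arithmetic on how much area a wide 2.5D link costs at different pitches; the per-pin rate and the power/ground overhead ratio are assumptions:

```python
# Microbump count and area for a wide 2.5D link at various bump pitches.
import math

def bump_area_mm2(total_gbs, gbps_per_pin=2.0, pitch_um=55.0, overhead=1.3):
    """Area taken by the bumps of a link of the given bandwidth.
    'overhead' adds power/ground/clock bumps on top of the data pins (assumed)."""
    pins = math.ceil(total_gbs * 8 / gbps_per_pin) * overhead
    return pins * (pitch_um * 1e-3) ** 2      # one bump per pitch^2 cell

for pitch in (55, 40, 10):                    # today's pitches vs a hypothetical fine one
    area = bump_area_mm2(512, pitch_um=pitch)
    print(f"512 GB/s link at {pitch} um pitch ~= {area:.1f} mm^2 of bumps")
```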
 
I'm not sure it's actually going to be a 2.5D design; besides, there is already GMI for chip interconnects, bridging between on-die Coherent Fabric networks, if I haven't misunderstood that.

I could, for example, imagine that each shader engine is directly connected to the Coherent Fabric (only), entirely replacing the crossbars that currently span multiple engines.

As long as the unified fabric features proper routing and efficient switching, I wouldn't expect too many issues or much overhead with such an approach; the gains from keeping most of the traffic "local" can be preserved.

But let's step this up with an example: what if the geometry processors were able to stream geometry via the interconnect to rasterizers that aren't even physically placed on the same die? An arbitrary number of units addressable directly if needed? No more DMA/XDMA, just a single virtual fabric with a global, unified address space?

And if you want to extend your virtual GPU, you just plug in additional resources, which become addressable by the existing command processors? Yes, I don't expect it to be EFFICIENT if you start streaming geometry like this; that's just an extreme example. But at a higher level, plugging a virtual GPU together should work quite well. The comparisons with NVLink might not be too far off, especially regarding the capability to schedule transparently.
 
I'm not sure it's actually going to be a 2.5D design; besides, there is already GMI for chip interconnects, bridging between on-die Coherent Fabric networks, if I haven't misunderstood that.
I misinterpreted the statement of abandoning internal crossbars as a reversion to something physically external to the die.

As long as the unified fabric features proper routing and efficient switching, I wouldn't expect too many issues or much overhead with such an approach; the gains from keeping most of the traffic "local" can be preserved.
The logical level of the interconnect shouldn't necessarily decide the physical topology. AMD's coherent processors have crossbars internally, for example.

The current method for GCN is to rely on the last-level cache for the CUs and other hardware units, and on the ROP cache hierarchy for export, so the crossbars in question sit between memory clients and the first point of global coherence.
Putting the fabric where those crossbars are has to add overhead, since the traffic crossing them today either needs no coherence handling or is physically incapable of being incoherent.
A much more limited set of values is needed to address the appropriate target, and the coherent fabric would be inserting a higher-level set of considerations (flit routing, global addressing, packetization, coherence broadcasts) into what should be a simple inclusive L1-L2 cache hierarchy, or into specific producer-consumer data flows where the endpoints are fixed and coherence only means unwanted outside interference.
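
To put a number on the wire-level part of that overhead argument (header and CRC sizes below are assumptions, purely for illustration):

```python
# Payload efficiency of a bare cache-line transfer vs a packetized fabric flit.
PAYLOAD_BITS = 64 * 8      # one cache line
HEADER_BITS = 64           # assumed: routing, address, request type
CRC_BITS = 16              # assumed link-level protection

def payload_fraction(payload, overhead):
    return payload / (payload + overhead)

print(f"dedicated crossbar path: {payload_fraction(PAYLOAD_BITS, 0):.1%} payload")
print(f"packetized fabric path : {payload_fraction(PAYLOAD_BITS, HEADER_BITS + CRC_BITS):.1%} payload")
# Coherence probes/broadcasts that the fixed producer-consumer paths never
# generate would come on top of this per-packet overhead.
```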

But let's step this up with an example: what if the geometry processors were able to stream geometry via the interconnect to rasterizers that aren't even physically placed on the same die? An arbitrary number of units addressable directly if needed? No more DMA/XDMA, just a single virtual fabric with a global, unified address space?
That would come after a large data-amplification step, and things on-die are very cheap relative to anything that goes off-die.
The traffic could be "compressed", but the compressor/decompressor would effectively be handing a command stream to a local geometry processor and rasterizer.

And if you want to extend your virtual GPU, you just plug in additional resources, which become addressable by the existing command processors? Yes, I don't expect it to be EFFICIENT if you start streaming geometry like this; that's just an extreme example. But at a higher level, plugging a virtual GPU together should work quite well. The comparisons with NVLink might not be too far off, especially regarding the capability to schedule transparently.
There are already queues and memory paths for the front ends to fetch work from, and various methods of stream out. It's generally helpful to take things as far as you can with the internal networks, and then take the hit of going off-die or worrying about arbitrary access and routing.
 
Just a random thought, but I got to ask:

How plausible is it that Navi will drop most internal crossbars as well as the central memory controller in favor of AMD's Coherent Fabric?

Next-gen memory might actually connect directly to the fabric as well; think of it as localized memory controllers.
The coherent fabric is supposed to sit at the system level, chaining CPU cores, accelerators and I/O blocks together for system-level memory coherence. GPUs certainly have to adapt to it, but probably just for things at the system level, i.e. PCIe functionality, the MMU, coherent memory accesses, platform atomics, etc. I don't see why the GPU needs to drop its internal buses, which exist in the first place to amplify the bandwidth to its own local memory. Even in APUs you have "Garlic", a dedicated path from the GPU IP to the DRAM controller for the GPU's graphics memory aperture (unless they dropped it in Carrizo, hmm?).

That is, unless you expect the future of GPU local memory to be participation in the coherence domain as malloc-able, pageable system memory, which doesn't seem to be happening soon. OpenCL 2 and HSA still support non-coherent device memory, so devices can provide high-bandwidth buffers for specific compute needs.
 
unless they dropped it in Carrizo, hmm?
I think they did just that. It's already attached to the Coherent Fabric via the GMI, rather than having a dedicated bypass. Not sure what guarantees regarding coherence the local L2 cache can give yet.

I do expect AMD to split the GPU vertically in some way. I'm just not sure yet how, and which components would end up with reduced functionality as a result.
What I'm mostly worried about are blend ops and depth/stencil tests. To get around that, you would need to make the pipeline tiled from the rasterizer downwards, to keep the locality where you need it most. The discard accelerator might actually function as a router in this case, and even duplicate triangles if required, sorting the final vertex buffers by tile. Redistribution at the higher pipeline stages might not even occur in practice as long as there is sufficient backpressure.

I don't see any issue at all with compute kernels, except that global semaphores spanning more than one cluster could become even more expensive. I would even expect that, by default, a kernel attempts to keep affinity to one cluster as long as it can be scheduled all at once.

And yes, I do expect the whole GPU memory to become pageable soon enough. I wouldn't even be surprised to see an APU where both HBM2 and DDR4 were addressable by both CPU and GPU in a transparent fashion. The GPU's GDS is also a candidate that I would expect to become attached straight to the fabric, as it possibly needs to be accessible by all local GPU fragments.
 
The rasterizers are tiled in screen space, and so are the ROPs that are part of the shader engine each rasterizer sits at the top of.
The path from the rasterizer, through the shader engine's subset of CUs, and over the export bus to the tiled ROP partition contains some of the dedicated internal crossbars/data paths I touched on earlier.

I'm curious for more details on the discard accelerator, if it should turn out to be a genuinely new architectural feature (with marketing, not everything labelled new actually is). Further speculation would be more appropriate for the Polaris thread, although giving it routing duties would make it a centralized work distribution unit, which AMD would probably label as such.

As a Primitive Discard Accelerator, it might be going in the geometry processor section, which is local to a shader engine.
One possibility is some kind of runahead on a coarse representation or stripped-down shader, which would allow geometry being evaluated by multiple shader engines to be culled in parallel, rather than having a centralized router. It would help avoid worries if those shader engines were not sharing the same chip, if scalability is a concern.
(edit: Polaris also has a NEW label on the command processor, which might be another spot for it. That's even more likely to be runahead in that case, given how far ahead that portion of the chip is in the pipeline. The resources are somewhat limited in that portion, however.)
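
Purely as speculation about what a runahead cull could test per triangle, something as simple as the standard checks below would already remove degenerate, backfacing and trivially off-screen geometry before any shader engine sees it; this is illustrative, not a description of AMD's hardware:

```python
# Standard per-triangle cull checks a coarse runahead pass could apply.
def cull_triangle(v0, v1, v2):
    """Return True if the triangle can be discarded. Vertices are (x, y) in
    normalized device coordinates after a coarse/stripped-down transform."""
    # Zero-area / degenerate: signed area of the 2D triangle is (near) zero.
    area2 = (v1[0] - v0[0]) * (v2[1] - v0[1]) - (v2[0] - v0[0]) * (v1[1] - v0[1])
    if abs(area2) < 1e-7:
        return True
    # Backfacing (assuming counter-clockwise front faces).
    if area2 < 0:
        return True
    # Trivially off-screen: all vertices outside the same clip-space edge.
    for axis in (0, 1):
        if all(v[axis] < -1.0 for v in (v0, v1, v2)):
            return True
        if all(v[axis] > 1.0 for v in (v0, v1, v2)):
            return True
    return False

print(cull_triangle((0, 0), (1, 0), (1, 0)))            # degenerate -> True
print(cull_triangle((-2, -2), (-1.5, -2), (-2, -1.5)))  # off-screen -> True
print(cull_triangle((0, 0), (1, 0), (0, 1)))            # visible, CCW -> False
```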

On the topic of this thread, next-gen memory could be new HBM with some changes for denser/higher stacks and possibly finer bump pitches than are currently available.
Besides the PIM concept already mentioned, it's also in a time frame where some of AMD's HPC projects start to discuss non-volatile memory. Possibly, a hybridized memory system or stack could be in play.
A non-volatile memory section can help with power, since it avoids leakage and refresh consumption. The read path is usually lower power, but writes seem to be more challenging in terms of power/latency.
Some interesting games could be played with a DRAM pool backed by NVM, particularly if items like memory compression can keep actively updated memory in DRAM with mostly read-only or less frequently used accesses handled by NVM.
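
As a toy illustration of that DRAM-in-front-of-NVM policy (pool size and eviction rule are arbitrary), something like this captures the idea of keeping actively used pages in DRAM and demoting cold ones:

```python
# Toy model: a small DRAM pool in front of a large NVM backing store.
from collections import OrderedDict

class HybridMemory:
    """Pages are promoted to DRAM on access and the least-recently-used page is
    demoted to NVM when the pool overflows. A real policy would likely also
    weight writes more heavily, since NVM writes are the expensive part."""
    def __init__(self, dram_pages=4):
        self.dram = OrderedDict()   # page id -> None, kept in LRU order
        self.nvm = set()
        self.dram_pages = dram_pages

    def touch(self, page):
        if page in self.dram:
            self.dram.move_to_end(page)      # refresh recency
            return
        self.nvm.discard(page)               # promote from NVM (or first touch)
        self.dram[page] = None
        if len(self.dram) > self.dram_pages:
            victim, _ = self.dram.popitem(last=False)  # evict coldest page
            self.nvm.add(victim)                       # demote to NVM

mem = HybridMemory()
for page in (1, 2, 3, 4, 5, 1, 6):
    mem.touch(page)
print("DRAM (hot):", list(mem.dram), "| NVM (cold):", sorted(mem.nvm))
```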
 
I think they did just that. It's already attached to the Coherent Fabric via the GMI, rather than having a dedicated bypass. Not sure what guarantees regarding coherence the local L2 cache can give yet.
That's half BS, obviously. At least the part about GMI being the port to the fabric: it was pointed out to me that GMI is for off-die interconnects. But AMD did in fact kill "Garlic" in favor of a fatter "Onion3" in Carrizo. And what is published on "Onion3" is pretty much in line with what is promised for the "Coherent Fabric".

As a Primitive Discard Accelerator, it might be going in the geometry processor section, which is local to a shader engine.
One possibility is some kind of runahead on a coarse representation or stripped-down shader, which would allow geometry being evaluated by multiple shader engines to be culled in parallel, rather than having a centralized router. It would help avoid worries if those shader engines were not sharing the same chip, if scalability is a concern.
Yes, I would expect just that as well. Ultimately this means that the rasterizer of each shader engine receives more than one vertex list (possibly one from every other shader engine), but the lists should be almost minimal, which keeps the bandwidth cost in check. "Routing" didn't imply that there was only a central unit, just that it is capable of routing each vertex to where it belongs.
 
With recent rumors pointing to a high-performance "Vega 20" using 7nm, there's been some discussion about whether Navi has been delayed to 2019.

AMD's roadmap only mentions "Next-gen Memory" and "Scalability". This next-gen memory cannot mean HBM2, because that will already be in Vega.

Regarding the Scalability part, maybe this means a fixed CU<->TMU<->ROP ratio, plus the ability to "decouple" video codecs and output modules (and to decouple ROPs from memory channels, e.g. by bringing back a ring bus). This would make it easier for AMD to scale the Navi architecture up and down to create more variations that could be put into APUs/SoCs, from AMD or even others (like Samsung).
I'm thinking something along the lines of PowerVR and Mali, where SoC makers are free to scale the GPU in MPx quantities, plus mix and match video codecs from other IP providers.
AMD has been very keen on bragging about their capabilities for semi-custom chips. A scalable architecture in the form of "MPx" could make semi-custom chips a bit less custom, saving them time and money on implementation.


As for "next-gen" memory, it seems Hynix's plans for HBM3 include a low-cost HBM alternative, along the lines of Samsung's.
With a flagship Vega 20 using HBM2 in 2018 and a 12 TFLOPs Vega in 2017, there's still room for some Navi GPUs to be released in 2018. Namely, they might want to EOL Polaris by then and go back to addressing the lower-cost/performance market.
AMD's sub-$200 discrete GPU from 2018 would then be something like a ~6 TFLOPs Navi MP32 with 8GB of low-cost HBM (single stack?).
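
The arithmetic behind that ~6 TFLOPs guess, assuming GCN-style CUs and a clock in a plausible range (the clocks and the low-cost HBM pin speed are assumptions):

```python
# FP32 throughput of a hypothetical 32-CU part, plus one HBM stack for scale.
def gcn_tflops(cus, clock_ghz):
    """FP32 throughput: 64 lanes per CU, 2 FLOPs per FMA."""
    return cus * 64 * 2 * clock_ghz * 1e9 / 1e12

for clock in (1.2, 1.5):                       # assumed clock range
    print(f"32 CUs @ {clock} GHz ~= {gcn_tflops(32, clock):.1f} TFLOPS")

# One low-cost HBM stack for scale, assuming a 1024-bit interface at 1.6 Gbps/pin:
print("single stack ~=", 1024 * 1.6 / 8, "GB/s")
```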
 
"Tahiti" chiplets:

http://www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf

Memory stacked upon GPU blocks (as processors, not merely ROPs) is looking like a focus of AMD's research.

This research has been going on at AMD for a long time... I think we've already posted some of the research and patents about it here (it will be hard to find right now, as those posts date from before Fury and HBM1).

Something like their TOP-PIM research, maybe?

https://www.eecis.udel.edu/~lxu/resources/TOP-PIM: Throughput-Oriented Programmable Processing in Memory.pdf

http://www.cs.utah.edu/wondp/eckert.pdf
 