AMD: Navi Speculation, Rumours and Discussion [2017-2018]

Gubbi · Aug 8, 2017

sebbbi said:
I was not talking about GPU accessing unified DDR4 system memory. I was talking about unified graphics memory (GDDR5 or HBM2) between two GPUs. No paging obviously. Direct cache line granularity access by both GPUs to the same memory.

But then you are back to a single structure used by all GPUs.

The motivation for moving to multi-die GPUs (multi-GPU GPU sounds wrong) is twofold:

1. One optimized design for all markets: Low end, mid range and high end.
2. To circumvent the reticle limit for silicon ICs.

Regarding 1.) For low end you'd have a single die with attached memory. For mid range, two dies, each with memory attached. For high end you'd have 4 dies plus memory. Your GPU is now a NUMA multi-processor system. This means you need a high bandwidth interconnect to glue the dies together, something that is very possible with silicon interposers.

Regarding 2.) Nvidia just built a >800mm^2 behemoth. They are at the reticle limit and can't go up. The market for these behemoths in terms of units is small; in terms of dollars, it's big. That implies high risk (if they had any real competition at the high end). If you could build the same system out of four 200mm^2 dies you reduce the risk by a lot. You also get an economies-of-scale cost reduction.
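
A rough way to put numbers on the risk argument is the textbook Poisson yield model, Y = exp(-A * D0). The sketch below is only illustrative: the defect density `D0` is a made-up value, and packaging yield, interposer cost and salvage of partially defective dies are all ignored.

```python
import math


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies with zero defects under a simple Poisson defect model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


# Hypothetical defect density, chosen only for illustration.
D0 = 0.1

big = poisson_yield(800, D0)    # one monolithic 800 mm^2 die
small = poisson_yield(200, D0)  # one 200 mm^2 die

print(f"800 mm^2 die yield: {big:.2f}")
print(f"200 mm^2 die yield: {small:.2f}")
```

Under this model a single defect scraps 200mm^2 of silicon instead of 800mm^2, so a much larger share of each wafer ends up in sellable parts.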

Also, silicon scaling is coming to an end. However, Moore's law (ever lower $/transistor) won't come to a crashing stop. As the time between each process node advance increases, the time to amortize the capital expenditures for equipping a fab also increases, lowering the cost per mm^2 Si. That means there will be a lot of silicon in GPUs in the future.

Cheers

Rootax · Aug 8, 2017

I see Navi as what Zen is for AMD's CPU market. Hope they can pull it off, in time (this time....).

Gubbi · Aug 8, 2017

As for topology, I don't actually think it will be straight NUMA. If you have four dies in your GPU, the odds of stuff being in your locally attached memory on a cache miss are just 25%. That means you'll need a lot of bisection bandwidth.
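
The 25% figure follows from assuming data is spread uniformly across the dies. A minimal sketch of that arithmetic (the uniform-placement assumption is the whole model here; real workloads will have some locality):

```python
def remote_fraction(num_dies: int) -> float:
    """Share of cache misses served by another die's memory,
    assuming data is placed uniformly at random across all dies."""
    return 1.0 - 1.0 / num_dies


for dies in (1, 2, 4, 8):
    print(f"{dies} dies: {remote_fraction(dies):.0%} of misses cross the interconnect")
```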

Alternatively you could maximize local access by making the locally attached RAM a victim cache (AMD's HBCC says hi!). This way you exploit spatial and intra-frame temporal locality automatically. To maximize spatial locality, you'd need to bin graphics work into tiles (say 256x256 pixels). To exploit inter-frame temporal locality, you'd want the same tile to be rendered by the same GPU-die every frame. For computational loads you will probably need to be aware of topology (a queue per GPU-die or something similar).
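
One way to get "same tile, same die, every frame" is a fixed function from tile coordinates to die index. The sketch below assumes a simple diagonal interleave; the mapping and the function names are hypothetical, not anything AMD has described.

```python
TILE_SIZE = 256  # pixels per tile edge, as suggested in the post


def tile_owner(x: int, y: int, num_dies: int) -> int:
    """Map a pixel to the die that owns its tile (hypothetical diagonal interleave)."""
    tile_x, tile_y = x // TILE_SIZE, y // TILE_SIZE
    return (tile_x + tile_y) % num_dies


def bin_by_die(pixels, num_dies):
    """Group pixel coordinates into one work queue per die."""
    queues = [[] for _ in range(num_dies)]
    for x, y in pixels:
        queues[tile_owner(x, y, num_dies)].append((x, y))
    return queues
```

Because the mapping depends only on screen position, a tile lands on the same die in every frame, which is what lets a per-die victim cache keep that tile's data local.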

Cheers

sebbbi · Aug 8, 2017

Gubbi said:
But then you are back to a single structure used by all GPUs.

I was talking about two GPU dies, but one shared memory (GDDR5/HBM2). Just like multi-socket CPUs. Current multi-GPU designs have separate memory for each GPU. This doesn't scale.

yuri · Aug 8, 2017

Are there actually any patents backing this multi-GPU NUMA approach? I'm aware of AMD's research papers (e.g. PIM) which target the relatively distant future (2020+).

I guess there should exist some patent activity already since the supposedly multi-GPUed Navi is a direct successor of the current architecture.

Anarchist4000 · Aug 8, 2017

sebbbi said:
I was talking about two GPU dies, but one shared memory (GDDR5/HBM2). Just like multi-socket CPUs. Current multi-GPU designs have separate memory for each GPU. This doesn't scale.

That would be possible, but as pointed out above, probably not the ideal solution. The ideal use for flat addressing would be a perfectly random data distribution where you couldn't achieve any locality. Not unless memory capacity/external bandwidth was significantly constrained or you could ensure locality(ROPs). It would likely be more efficient to duplicate pages, wasting some excess memory, and avoid the interconnect. The solution would likely be a hybrid approach. Part victim cache and part flat addressed for local splits.

As Gubbi suggested, creating tiles and distributing spatially to GPUs would likely be the best approach. DSBR and HBCC should do most of that automatically. Distributing compute, especially tasks not in screen space, would be problematic.

yuri said:
Are there actually any patents backing this multi-GPU NUMA approach? I'm aware of AMD's research papers (eg. PIM) which target relatively distant future (2020+).

Not that have been found, just some exascale papers that don't really apply to graphics. Those patents however would be Navi and wouldn't be published until after release.

Anarchist4000 · Aug 9, 2017

sebbbi said:
I was not talking about GPU accessing unified DDR4 system memory. I was talking about unified graphics memory (GDDR5 or HBM2) between two GPUs. No paging obviously. Direct cache line granularity access by both GPUs to the same memory.

Thinking about this some more, what you proposed could work, but would require explicit handling by the programmer. Not all that different from current multi-adapter with DX12/Vulkan. Presenting each GPU as an independent device with shared, independent pools and more bandwidth between them. Care would have to be taken to avoid the interconnect and split the work evenly. That would work better, being explicit, for the intermediate stages the driver would not easily be able to understand.

A hybrid approach to memory management would still be best though. Create intermediate resources per device and let all free space page from memory. Even for an expert programmer HBCC's paging may be hard to beat. The driver would still need to split the work efficiently for apps that didn't handle multiple adapters. Even for explicit handling, if a developer codes two devices, and next year a part with 4+ arrives there will be scaling issues. Coding for a dynamic number of devices would be required, but seems doable. Still a problem for intermediate steps that can't be partitioned easily.

sebbbi · Aug 9, 2017

Anarchist4000 said:
Thinking about this some more, what you proposed could work, but would require explicit handling by the programmer. Not all that different from current multi-adapter with DX12/Vulkan. Presenting each GPU as an independent device with shared, independent pools and more bandwidth between them. Care would have to be taken to avoid the interconnect and split the work evenly. That would work better, being explicit, for the intermediate stages the driver would not easily be able to understand.

GCN (before Vega) already has pretty limited cache coherency between the CUs and zero coherency between the rasterizer and the CUs. The ROP output always goes to memory before you can read it. You already need to explicitly flush the ROP caches in DX12/Vulkan (by inserting a barrier that transitions RT -> SRV). The same goes for standard compute shader UAV writes: no guaranteed visibility of writes to other CUs until an explicit UAV->SRV barrier is executed (it flushes the L1 caches). Atomics and UAVs with the globallycoherent attribute are the only exceptions where CUs see the writes of other CUs directly (without needing an explicit barrier to flush caches).

ROPs also must see writes by other CUs, because of blending and depth test, but the shader cores themselves don't need coherency. A tiled rasterizer splits work into area-local independent tiles, meaning that ROPs don't have to check for coherency at fine granularity. You would simply schedule them at tile granularity. Scheduling would ensure that hazards do not happen. A tile location would be locked until the GPU processing that tile location has written the tile to memory. You would likely need a bit more storage for tile geometry to ensure that you have enough independent tiles in flight at once.
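
The tile-locking idea can be sketched as a tiny scheduler that refuses to hand a tile to any die while another die still has it in flight. This is only a software model of the hazard rule described above; the class and method names are made up.

```python
class TileScheduler:
    """Tracks which screen tiles are in flight so no two dies touch the same tile."""

    def __init__(self):
        self.locked = {}  # tile coordinate -> die currently rendering it

    def try_dispatch(self, tile, die) -> bool:
        """Lock the tile for `die`; fail if any die already holds it."""
        if tile in self.locked:
            return False
        self.locked[tile] = die
        return True

    def tile_written(self, tile) -> None:
        """Called once the owning die has written the tile back to memory."""
        self.locked.pop(tile, None)


sched = TileScheduler()
first = sched.try_dispatch((0, 0), die=0)   # tile is free, so die 0 gets it
second = sched.try_dispatch((0, 0), die=1)  # refused while die 0 holds the tile
sched.tile_written((0, 0))
third = sched.try_dispatch((0, 0), die=1)   # allowed after the write-back
```

Keeping enough independent tiles in flight matters here: if most dispatch attempts are refused, the dies sit idle waiting on each other.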

Obviously if the memory controllers are located in the GPU dies, this would be a NUMA system, and would need a fast link between the GPUs. But it wouldn't need to be any wider than the link from GPU to memory.

Anarchist4000 · Aug 9, 2017

sebbbi said:
The ROP output always goes to memory before you can read it. You already need to explicitly flush the ROP caches in DX12/Vulkan (by inserting barrier that transforms RT -> SRV). Same goes to standard compute shader UAV writes. No guaranteed visibility of writes to other CUs, until an explicit UAV->SRV barrier is executed (it flushes the L1 caches).

The question is what memory it is flushing to. Coherency aside, with multiple chips and unified address space you can't readily identify where the memory address resides. ROPs could be reading/writing across the interconnect. Seriously breaking the NUMA model that would be desirable. Or worse get pointed towards system memory or even virtual/disk drive. The problem is simple to fix, but to the best of my knowledge there are no APIs that natively account for it. Just explicit multi-adapter which isn't necessarily that scalable if you ended up with say 16 GPUs in a few years time.

sebbbi said:
ROPs also must see writes by other CUs, because of blending and depth test. But shader cores themselves don't need coherency. But tiled rasterizer splits work to area local independent tiles, meaning that ROPs don't have to check for coherency at fine granularity. You simply would schedule them at tile granularity. Scheduling would ensure that hazards do not happen. That tile location would be locked until the GPU processing that tile location has written the tile to memory. You would likely need a bit more storage for tile geometry to ensure that you have enough independent tiles in flight at once.

That's what I'm suggesting, but you would have an active number of bins equal to the number of chips in the system with further binning based on cache size. Sub-tiles where GPU A takes the left half of screen space and tiles through. GPU B doing the opposite. Ensuring each stays on its half of the fence so resources are primarily cached locally without crossing the interconnect. Another tier of tiled raster which shouldn't be that difficult.

That still leaves the issue of splitting tasks, more common with compute, that aren't readily divisible.

3dilettante · Aug 9, 2017

Anarchist4000 said:
The question is what memory it is flushing to. Coherency aside, with multiple chips and unified address space you can't readily identify where the memory address resides. ROPs could be reading/writing across the interconnect.

ROP traffic crossing the interconnect seems undesirable. I'm not sure why it would be necessary.
The GPU driver and hardware are able to enumerate the number and stride of fixed-function resources. The initialization of the device, resource parameters, and the fixed alignment of the RBEs means the GPU and driver should be able to allocate memory consistent with the physical addresses managed by the local memory controllers.

Anarchist4000 said:
That's what I'm suggesting, but you would have an active number of bins equal to the number of chips in the system with further binning based on cache size.

That doesn't seem like it would be hard. All descriptions of the tile area covered by one bin indicate there are at least 1-2 orders of magnitude more bins than the number of chips contemplated for a multi-chip GPU.

Anarchist4000 · Aug 10, 2017

3dilettante said:
ROP traffic crossing the interconnect seems undesirable. I'm not sure why it would be necessary.
The GPU driver and hardware are able to enumerate the number and stride of fixed-function resources. The initialization of the device, resource parameters, and the fixed alignment of the RBEs means the GPU and driver should be able to allocate memory consistent with the physical addresses managed by the local memory controllers.

I'm saying it would be possible and something to avoid. Looking at the Vega ISA there is a V_Screen_Partition_4SE_B32 instruction for a primitive shader. With dynamic memory management and GPU-driven rendering, it may be up to the programmer to allocate in the future. So there would need to be a per-node allocation ability at the very least beyond the driver. Not difficult, but something new. I'm sure there are some exceptions, but my understanding is that apps currently create flip chains and they remain reasonably static. With tiled raster and variable tile sizes during runtime the alignment could be thrown off with more or less memory required per device. It's not unreasonable to assume graphics being fully replaced with compute in some cases, where the alignment mechanism wouldn't exist.

3dilettante said:
That doesn't seem like it would be hard. All descriptions of the tile area covered by one bin indicates there is at least 1-2 orders of magnitude more bins than the number of multi-GPU chips contemplated.

Not hard, just a level higher than what the current tiled raster would likely consider. Also a concern depending on how the device presents itself, single or multiple adapters, as it may fall to the programmer. More so if a postprocessing compute pass ideally follows the distribution for NUMA.

mrcorbo · Aug 10, 2017

Is using a multi-chip design going to require moving some elements that are currently integrated into the GPU onto their own separate chip to prevent redundancy? Video processing, audio, off-device I/O come to mind.

3dilettante · Aug 10, 2017

mrcorbo said:
Is using a multi-chip design going to require moving some elements that are currently integrated into the GPU onto their own separate chip to prevent redundancy? Video processing, audio, off-device I/O come to mind.

That's a likely part of it. Nvidia's MCM paper proposes moving IO onto its own chip.

AMD's general dis-integration strategy with interposers goes further to start subdividing out blocks based on IP, process, customization, and performance needs. Blocks could stay at processes that provide better properties for what they need, such as a CPU chiplet whose silicon can be tuned to almost purely performance-optimized digital logic, a GPU tuned for density, I/O tuned for analog properties and leakage, networking/optical interconnect blocks, DRAM, NVRAM, etc.

Being able to shift versions of each block without re-implementing a whole SOC would be another advantage. A basic video processing block could be taken out of the GPU, although methods that involve using CUs for part of the processing may need some compensating measures.

jacozz · Aug 11, 2017

I hope this "dis-integration" approach will take off. I imagine a CPU module + a couple of GPU modules with 32 GB of stacked memory shared on a single interposer with a fat Threadripper cooler on top. No RAM needed, no discrete GPU needed.

Speaking of which. What happened to HSA?

sebbbi · Aug 11, 2017

3dilettante said:
ROP traffic crossing the interconnect seems undesirable. I'm not sure why it would be necessary.
The GPU driver and hardware are able to enumerate the number and stride of fixed-function resources. The initialization of the device, resource parameters, and the fixed alignment of the RBEs means the GPU and driver should be able to allocate memory consistent with the physical addresses managed by the local memory controllers.

Couldn't you simply allocate physical storage of each rasterizer tile from the local memory of the GPU that was assigned to render that tile (assuming tile size = multiple of page size)? Virtual addresses of the render target would still be contiguous. This works perfectly as long as each tile is rendered once. But Nvidia and AMD tiled rasterizers bin a smallish amount of triangles at once. So you'd sometimes have to load previous tile contents from the other GPU's local memory (over the interconnect). The scheduler could of course prefer to send that tile again to the same GPU that processed it originally (to reduce interconnect traffic). This should work fine, as long as there's a unified virtual address space between the two GPUs' local memories, and a fast enough interconnect between the GPU dies (to allow both GPUs to transparently read data from the other GPU's memory).
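
A minimal model of that allocation scheme: the render target is contiguous in virtual address space, but each tile's page is backed by the local pool of whichever die owns the tile. The page size, the one-tile-per-page simplification and all names below are assumptions for illustration only.

```python
PAGE_SIZE = 64 * 1024  # hypothetical page size; one tile is assumed to fill one page


class DiePool:
    """Bump allocator standing in for one die's local memory."""

    def __init__(self, die_id: int):
        self.die_id = die_id
        self.next_free = 0

    def alloc_page(self):
        addr = self.next_free
        self.next_free += PAGE_SIZE
        return (self.die_id, addr)


def map_render_target(num_tiles, owner_of_tile, pools):
    """Build a page table: virtual page index -> (die, physical offset)."""
    return [pools[owner_of_tile(t)].alloc_page() for t in range(num_tiles)]


def translate(page_table, vaddr):
    """Resolve a virtual offset within the render target to (die, physical offset)."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    die, base = page_table[page]
    return die, base + offset


pools = [DiePool(0), DiePool(1)]
table = map_render_target(8, lambda tile: tile % 2, pools)
```

Shaders and ROPs only ever see the contiguous virtual range; which die's memory a given access hits is decided entirely by the page table.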

If both GPUs for example had their own 300 GB/s interconnect to memory, a 300 GB/s bidirectional link between the two GPUs would be enough to access the other GPU's memory at max bandwidth. This would be similar to a single 600 GB/s system, albeit with higher latency to access pages resident in the other GPU's memory. The interconnect would also be used for coherency traffic, but as I said in my previous posts, current rendering APIs need coherency only for a small minority of operations (and a small minority of resources). Most of the memory coherency can be handled with cache flushes (barriers in Vulkan/DX12).
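
The claim that the link never needs to be wider than the memory interface can be checked with a simple steady-state model. Assumptions (mine, not from the post): two symmetric dies, each issuing the same load, a fixed fraction of accesses going to the other die's memory, and no latency or coherency overhead.

```python
def sustainable_per_die(mem_gbs: float, link_gbs: float, remote_fraction: float) -> float:
    """Bandwidth each die can sustain in a symmetric two-die system.

    Each memory controller serves its own die's local share plus the other
    die's remote share, which sums to one die's demand. The link in each
    direction carries only the remote share.
    """
    if remote_fraction == 0:
        return mem_gbs
    return min(mem_gbs, link_gbs / remote_fraction)


for r in (0.0, 0.5, 1.0):
    per_die = sustainable_per_die(300, 300, r)
    print(f"remote fraction {r:.1f}: {per_die:.0f} GB/s per die, {2 * per_die:.0f} GB/s total")
```

In this model a 300 GB/s link is never the bottleneck, whatever the remote fraction; a narrower link starts to cap throughput once enough traffic goes remote.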

Gubbi · Aug 11, 2017

sebbbi said:
Couldn't you simply allocate physical storage of each rasterizer tile from the local memory of the GPU that was assigned to render that tile (assuming tile size = multiple of page size)? Virtual addresses of the render target would still be contiguous. This works perfectly as long as each tile is rendered once. But Nvidia and AMD tiled rasterizers bin a smallish amount of triangles at once. So you'd sometimes have to load previous tile contents from the other GPU's local memory (over the interconnect).

It appears AMD thinks using cache semantics to manage locally attached RAM is the solution. If each locally attached chunk of RAM is a victim cache, all ROP writes automatically go to local RAM. The rasterizer needs to bin/hash all incoming triangles to a queue per GPU-subsystem in a deterministic way to maximize temporal and spatial locality.

Say you have 4GB of HBM attached and use a 4K cache line size: you end up with one million 28-bit tags per GPU (assuming 40-bit virtual address capability). With MOESI bits (or whatever protocol they use), that's around 4MB of SRAM on die.
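
The tag-store estimate can be reproduced directly. The sketch assumes full virtual-address tags (no index bits removed, as the 28-bit figure implies) and 4 bits of coherency state per line; the state width is a guess, since the protocol is unknown.

```python
def tag_store(capacity_bytes: int, line_bytes: int, va_bits: int, state_bits: int):
    """Return (number of lines, tag width in bits, tag SRAM size in bytes)."""
    lines = capacity_bytes // line_bytes
    offset_bits = line_bytes.bit_length() - 1  # log2 of a power-of-two line size
    tag_bits = va_bits - offset_bits
    sram_bytes = lines * (tag_bits + state_bits) // 8
    return lines, tag_bits, sram_bytes


lines, tag_bits, sram_bytes = tag_store(
    capacity_bytes=4 * 2**30,  # 4GB of locally attached HBM
    line_bytes=4096,           # 4K cache line
    va_bits=40,                # 40-bit virtual addresses
    state_bits=4,              # assumed width for MOESI-style state
)
print(lines, tag_bits, sram_bytes / 2**20)
```

With these inputs the result is 2^20 lines, 28-bit tags and 4 MiB of tag SRAM, matching the figures in the post.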

Using cache semantics has other advantages: it significantly reduces the complexity of handling preemption and allows you to oversubscribe RAM, because the active working set is almost always significantly smaller than the allocated set.

Cheers

Anarchist4000 · Aug 11, 2017

Gubbi said:
The rasterizer needs to bin/hash all incoming triangles to a queue per GPU-subsystem in a deterministic way to maximize temporal and spatial locality.

Leaving the distribution explicit makes the most sense here for primary bins. In the case of compute there would be no obvious descriptor tying it to bins. At least none that come to mind.

Gubbi said:
It appears AMD thinks using cache semantics to manage locally attached RAM is the solution. If each locally attached chunk of RAM is a victim cache, all ROP writes automatically go to local RAM.

Along those lines you never have to preallocate a framebuffer or any intermediate resources. Just create pointers to pass as required. That should work similarly to the current bindless model, but with the address being virtual. Let the victim cache deal with alignment and distribution as buffers don't need to be contiguous.

Gubbi said:
Say you have 4GB of HBM attached and use a 4K cache line size: you end up with one million 28-bit tags per GPU (assuming 40-bit virtual address capability). With MOESI bits (or whatever protocol they use), that's around 4MB of SRAM on die.

Recent Linux drivers mention a large speedup from 2MB pages. So there is definitely some overhead there.

3dilettante · Aug 11, 2017

sebbbi said:
Couldn't you simply allocate physical storage of each rasterizer tile from the local memory of the GPU that was assigned to render that tile (assuming tile size = multiple of page size). Virtual addresses of the render target would still be contiguous.

I think there's a number of ways it can be implemented. Using AMD's setup as an example, there is a fixed relationship between shader engine and screen space. The graphics domain hardware associated with all that would know by design what it is responsible for.

The relationship is so straightforward that AMD included an instruction with a fixed lookup table to determine responsibility. I'm not sure about embedding in the ISA such a specific higher-level primitive and that might have some ominous implications, but if it's stuck in the ISA it indicates that there is no mystery as to where a specific point in screen space is relative to the hardware and memory handling it.

The hardware would know where it is in the overall system, based on elements derived from hardware, hardware initialization, OS/driver launch, and API/runtime actions before user software even needs to start worrying.
It's a similar problem to how AMD's APUs stripe graphics data over their shared memory with the CPU, whose stride and granularity don't match. Being able to handle mixed residency or linking ROPs to special storage is something the Xbox One had to do with the ESRAM. In both cases, this integrates with virtual memory without the ROPs or shader being adjusted.

Page properties can be one area where the graphics resources can have their attributes tracked. Past that, there are further ways of adding indirection even for physical addresses a page translates to. NUMA and memory controller interleaving add hash functions to the controllers or caches, so that physical addresses can stripe to configurable locations.
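
As a generic illustration of such interleaving, the sketch below compares a plain modulo stripe with an XOR-folded hash. Neither is claimed to be what any AMD memory controller actually does; granule size, target count and function names are placeholders.

```python
def stripe_target(phys_addr: int, num_targets: int, granule_bytes: int) -> int:
    """Plain interleave: consecutive granules rotate across targets."""
    return (phys_addr // granule_bytes) % num_targets


def hashed_target(phys_addr: int, num_targets: int, granule_bytes: int) -> int:
    """XOR-fold the granule index so power-of-two strides still spread out.

    num_targets must be a power of two.
    """
    if num_targets == 1:
        return 0
    index = phys_addr // granule_bytes
    bits = num_targets.bit_length() - 1
    mask = num_targets - 1
    folded = 0
    while index:
        folded ^= index & mask
        index >>= bits
    return folded


# A 1024-byte stride over 4 targets with 256-byte granules.
addrs = [i * 1024 for i in range(4)]
plain = [stripe_target(a, 4, 256) for a in addrs]
hashed = [hashed_target(a, 4, 256) for a in addrs]
```

With the plain stripe every one of these addresses lands on the same target, while the folded hash spreads them over all four, which is the usual reason for hashing in the controller.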

Indirection can be added at many points at the OS, software, resource, sub-resource, page, physical, and hardware levels.

sebbbi said:
This works perfectly as long as each tile is rendered once. But Nvidia and AMD tiled rasterizers bin a smallish amount of triangles at once. So you'd sometimes have to load previous tile contents from the other GPU's local memory (over the interconnect). The scheduler could of course prefer to send that tile again to the same GPU that processed it originally (to reduce interconnect traffic).

At least with the current methods, the idea is that the GPU performs a stretch of operations where the given batch stays on-die, which would try to avoid the question arising too often. AMD's instruction and the idea it should be used by a primitive shader seems to indicate that the scheduler is notifying specific shader engines and their local resources of a primitive (or perhaps running culling on the same geometry in parallel?), and for domain-specific setup and export it's a hard-wired relationship.

Anarchist4000 · Aug 12, 2017

3dilettante said:
At least with the current methods, the idea is that the GPU performs a stretch of operations where the given batch stays on-die, which would try to avoid the question arising too often. AMD's instruction and the idea it should be used by a primitive shader seems to indicate that the scheduler is notifying specific shader engines and their local resources of a primitive (or perhaps running culling on the same geometry in parallel?), and for domain-specific setup and export it's a hard-wired relationship.

Presumably the primitive shader is implementing the bins and could go fully TBDR with that instruction assisting. Binning all draws over several passes with varying tile dimensions until all tile sizes fit into the specified cache size. Overlapping the binning of frame B with rasterization of frame A. Allocating space was a problem in the past, but dynamic allocation and a large cache would make that trivial if there was an L3 victim cache we haven't seen. It would appear as if the bins or L2 were being flushed to memory, but without the performance hit.

Would make sense with the large cache and seeming lack of memory bandwidth. With TBDR across draws it could have nearly zero overdraw and be that efficient. We haven't seen it as it occurs transparently.

3dilettante · Aug 13, 2017

Anarchist4000 said:
Presumably the primitive shader is implementing the bins and could go fully TBDR with that instruction assisting. Binning all draws over several passes with varying tile dimensions until all tile sizes fit into specified cache size.

Fully TBDR would mean deferring the whole frame's shading until all position calculation and culling are done. Primitive shaders still appear to be invoked at the frequency of the original primitive submissions, and the on-die storage appears to be insufficient to hold everything since the rasterization scheme has multiple outs for exceeding storage.

The instruction is a shortcut for calculating what shader engines may be affected by a primitive. It seems to assume some rather constrained parameters at an ISA level, like 32-pixel tiles, and by its very name applies only to GPUs with 4 shader engines. It's insufficiently precise to do more than determine which shader engines will likely be tasked with doing actual binning and evaluation at sub-pixel accuracy. Relying on a bounding box is conservative, but tasking this instruction or a primitive shader with the evaluation of narrow triangles and other corner cases may lose more than it gains.
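
To make the conservativeness concrete, here is a purely hypothetical model of a bounding-box test against 32-pixel tiles and 4 shader engines. The 2x2 repeating tile-to-engine pattern is invented for the example; the real lookup table used by the instruction is not described in this thread.

```python
TILE = 32  # tile edge in pixels, per the constraint mentioned above


def engines_for_bbox(x0: int, y0: int, x1: int, y1: int) -> int:
    """Return a 4-bit mask of shader engines whose tiles overlap the box.

    Bounds are inclusive pixel coordinates. The engine for a tile is picked
    by a made-up 2x2 repeating pattern: bit 0 from tile x, bit 1 from tile y.
    """
    mask = 0
    for tile_y in range(y0 // TILE, y1 // TILE + 1):
        for tile_x in range(x0 // TILE, x1 // TILE + 1):
            mask |= 1 << ((tile_x & 1) | ((tile_y & 1) << 1))
            if mask == 0b1111:
                return mask  # every engine already flagged
    return mask


inside_one_tile = engines_for_bbox(0, 0, 31, 31)
diagonal_sliver = engines_for_bbox(0, 0, 63, 63)
```

A thin triangle running along the diagonal of that second box would only cover tiles (0,0) and (1,1), yet the bounding box flags all four engines, which is the kind of corner case where a cheap test gives up precision.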

Foo Fighter