Software transparent Multi-GPU using an interposer

Cat Merc

Newcomer
Would such a setup be at all possible? My gut feeling says no, but my understanding of GPU rendering is fairly limited.
 
AMD's upcoming Vega GPUs are built using their 'Infinity Fabric' technology, which gives them modularity like the CCXs in Ryzen. So it's possible that in the future they come out with a multi-chip, software-transparent product. The only question would be whether they use classic MCM assembly or a bigger interposer.
 
AMD's upcoming Vega GPUs are built using their 'Infinity Fabric' technology, which gives them modularity like the CCXs in Ryzen. So it's possible that in the future they come out with a multi-chip, software-transparent product. The only question would be whether they use classic MCM assembly or a bigger interposer.
Infinity Fabric by itself doesn't solve everything when the problem at hand is that GPUs need interconnect bandwidth on a far larger scale. "Modularity" doesn't help either once you consider that GPU caches exist first and foremost to amplify bandwidth and exploit spatial locality under a predominantly data-streaming access pattern, plus the increasing reliance on device-scope atomics and coherency.

Imagine building 2x Vega 10 as four smaller chips. For each chip, in addition to its 1024-bit HBM2 interface, you would also need three more links of comparable bandwidth (in SerDes or whatever) just to maintain the full channel-interleaving bandwidth expected of a monolithic GPU. Then add the L2 cache and ROP access needs on top of that. And let's not forget the GPU control flow (wavefront dispatchers, command processors, and especially the fixed-function units).
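Just to put rough numbers on the link-bandwidth point (the 2.0 Gb/s HBM2 pin rate and the even four-way address interleave below are purely illustrative assumptions, not anything AMD has specified):

Code:
#include <cstdio>

// Back-of-the-envelope only: how much die-to-die bandwidth a 4-chip split would
// need if memory stays fully channel-interleaved across all four HBM2 stacks.
int main()
{
    const double pin_rate_gbps  = 2.0;     // assumed HBM2 data rate per pin
    const double bus_width_bits = 1024.0;  // one HBM2 stack per chip
    const int    chips          = 4;

    const double local_bw = pin_rate_gbps * bus_width_bits / 8.0;  // GB/s per stack
    // With addresses interleaved across all stacks, a chip streaming at the full
    // aggregate rate has to fetch the other stacks' share over its die-to-die links:
    const double remote_bw = (chips - 1) * local_bw;

    std::printf("local HBM2 per chip : %4.0f GB/s\n", local_bw);   // ~256 GB/s
    std::printf("die-to-die per chip : %4.0f GB/s\n", remote_bw);  // ~768 GB/s
    return 0;
}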

As always, I am not saying it is impossible. But that "only question" is clearly not the only one.
 
I think it's not realistic to expect the hardware to look like one GPU to the _driver_, but you can certainly just publish two graphics queues to the runtimes (the DX12 and Vulkan runtimes, not the games) and require no other adjustments at all (shared memory address space and so on, maybe e.g. a shared MMU instead of multi-MMU coherency or such).
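Something like the following is what that could look like from the runtime/application side: a minimal sketch, assuming a hypothetical part whose driver fronts both dies behind a single VkPhysicalDevice and reports a graphics family with queueCount >= 2 (the function name, phys and gfxFamily are illustrative, not any real driver's behaviour).

Code:
#include <vulkan/vulkan.h>

// Hypothetical: one logical device, two graphics queues (one per die behind it).
VkDevice create_two_queue_device(VkPhysicalDevice phys, uint32_t gfxFamily)
{
    float priorities[2] = {1.0f, 1.0f};

    VkDeviceQueueCreateInfo queueInfo = {};
    queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    queueInfo.queueFamilyIndex = gfxFamily;  // one graphics-capable family...
    queueInfo.queueCount       = 2;          // ...but one queue per die
    queueInfo.pQueuePriorities = priorities;

    VkDeviceCreateInfo deviceInfo = {};
    deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceInfo.queueCreateInfoCount = 1;
    deviceInfo.pQueueCreateInfos    = &queueInfo;

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(phys, &deviceInfo, nullptr, &device);

    // The application just sees two ordinary graphics queues sharing one memory
    // address space; which die each queue maps to is the driver's business.
    VkQueue q0 = VK_NULL_HANDLE, q1 = VK_NULL_HANDLE;
    vkGetDeviceQueue(device, gfxFamily, 0, &q0);
    vkGetDeviceQueue(device, gfxFamily, 1, &q1);
    (void)q0; (void)q1;

    return device;
}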
 
Infinity Fabric by itself doesn't solve everything when the problem at hand is that GPUs need interconnect bandwidth on a far larger scale. "Modularity" doesn't help either once you consider that GPU caches exist first and foremost to amplify bandwidth and exploit spatial locality under a predominantly data-streaming access pattern, plus the increasing reliance on device-scope atomics and coherency.

Imagine building 2x Vega 10 as four smaller chips. For each chip, in addition to its 1024-bit HBM2 interface, you would also need three more links of comparable bandwidth (in SerDes or whatever) just to maintain the full channel-interleaving bandwidth expected of a monolithic GPU. Then add the L2 cache and ROP access needs on top of that. And let's not forget the GPU control flow (wavefront dispatchers, command processors, and especially the fixed-function units).

As always, I am not saying it is impossible. But that "only question" is clearly not the only one.

I agree with you, but I remember someone from AMD saying something along the lines of multi-chip being the way to go in the future. Without much thought, I assumed that Vega having Infinity Fabric was at least the first step towards such a future. Why else would AMD choose to build Vega with Infinity Fabric...? But there's no guarantee it would be software transparent.
 
I think it's not realistic to expect the hardware to look like one GPU to the _driver_, but you can certainly just publish two graphics queues to the runtimes (the DX12 and Vulkan runtimes, not the games) and require no other adjustments at all (shared memory address space and so on, maybe e.g. a shared MMU instead of multi-MMU coherency or such).
This makes little difference from explicit multi-GPU though. A shared memory address space is already a thing IIRC (via the host).
 
I agree with you, but I remember someone from AMD saying something along the lines of multi-chip being the way to go in the future. Without much thought, I assumed that Vega having Infinity Fabric was at least the first step towards such a future.
Multi-chip is the way, but the tone was set for heterogeneous SoC integration in the first place. Their exascale proposal uses multiple GPUs per package, but that's because of the model they pursue for that particular project (in-memory computing).

Why else would AMD choose to build Vega with Infinity Fabric...? But there's no guarantee it would be software transparent.
Cache coherency between multiple GPUs, Zen hosts and perhaps OpenCAPI appears to be an incentive though.
 
This makes little difference from explicit multi-GPU though. A shared memory address space is already a thing IIRC (via the host).

If the memory pool itself isn't shared, it's currently very problematic. Transfers between pools can only be pushed, not pulled, etc. (see the DX12 docs).

If you compare it to CPUs - a different memory pool per socket vs. a shared memory pool across all sockets vs. more cores on the same SoC in the same socket - I suspect the trend will be the same.
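For reference, this is roughly what today's "push" model looks like with D3D12 cross-adapter heaps: a minimal sketch, assuming devA and devB are devices created on two different adapters, with resource creation, fences and error handling omitted.

Code:
#include <windows.h>
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Sketch: share a cross-adapter heap so that devA can push data to devB.
void share_cross_adapter_heap(ID3D12Device* devA, ID3D12Device* devB)
{
    D3D12_HEAP_DESC desc = {};
    desc.SizeInBytes     = 64ull * 1024 * 1024;
    desc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    desc.Flags           = D3D12_HEAP_FLAG_SHARED | D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER;

    ComPtr<ID3D12Heap> heapA;
    devA->CreateHeap(&desc, IID_PPV_ARGS(&heapA));

    // Export the heap as an NT handle and open it on the other device.
    HANDLE handle = nullptr;
    devA->CreateSharedHandle(heapA.Get(), nullptr, GENERIC_ALL, nullptr, &handle);
    ComPtr<ID3D12Heap> heapB;
    devB->OpenSharedHandle(handle, IID_PPV_ARGS(&heapB));
    CloseHandle(handle);

    // From here devA records CopyResource/CopyBufferRegion on one of its own
    // queues into a placed resource on heapA (created with
    // D3D12_RESOURCE_FLAG_ALLOW_CROSS_ADAPTER); devB then reads the matching
    // placed resource on heapB. devB cannot reach into devA's local video
    // memory and pull the data directly.
}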
 
I think it's relevant to mention here that while Vega does indeed use Infinity Fabric to connect to the CPU in Raven Ridge and other APUs, according to the slides Navi is the architecture that focuses on scalability:

[AMD GPU roadmap slide listing "Scalability" under Navi]

That said, Navi might be an architecture that takes inspiration from PowerVR's and Mali's "MP" models, by designing a base core unit and then grouping larger or smaller numbers of them according to the performance and power segment.

This seems to be what they're doing right now with Zen. Ryzen 5 and 7 have 2*CCX, Threadripper has 4*CCX, and Epyc goes up to 8, and all of them use Infinity Fabric to interconnect the CCXs. Raven Ridge is 1*CCX + Vega through Infinity Fabric.
Navi could be the GPU version of a CCX (let's imagine a GCX). Future AMD chips could then just be a mix of different numbers of CCX and/or GCX modules, according to the product they want to build.
If the interconnect fabric is robust, performant, and future-proof enough, AMD's hardware R&D teams could focus on iterating upon the CCX and GCX.
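Purely to illustrate the mix-and-match idea (the GCX name comes from the speculation above, and every configuration in this list is hypothetical rather than anything AMD has announced):

Code:
#include <cstdio>

// Hypothetical: packages described as counts of CPU (CCX) and GPU (GCX)
// building blocks stitched together with Infinity Fabric.
struct Package {
    const char* name;
    int ccx;  // CPU complexes
    int gcx;  // speculative GPU complexes
};

int main()
{
    const Package lineup[] = {
        {"Ryzen 5/7",             2, 0},  // shipping CCX counts, as noted above
        {"Threadripper",          4, 0},
        {"Epyc",                  8, 0},
        {"Raven Ridge",           1, 1},  // 1 CCX + a Vega GPU over Infinity Fabric
        {"Hypothetical big GPU",  0, 4},  // pure speculation
    };
    for (const Package& p : lineup)
        std::printf("%-22s %d CCX + %d GCX\n", p.name, p.ccx, p.gcx);
    return 0;
}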
 