The future of MGPU design

Freak'n Big Panda · Jul 23, 2008

The industry is currently trending toward addressing the high-end market with MGPU designs (R680, R700), but as many of us are aware, this approach has numerous problems.

For AFR-based schemes, persistent data is the number one killer of performance scaling. The need to duplicate the memory for each chip is also a major cost concern.

Clearly the IHVs will try to tackle these problems, the question is how?

I have an idea on how to address the issues and I'd like feedback from some of the more knowledgeable members here.

Say a GPU had 4 x-bit DRAM channels. Would it be possible to tie two of the channels on GPU0 to two of the channels on GPU1 to give GPU0 access to GPU1's local memory and vice versa? This would kill two birds with one stone, as memory would no longer need to be duplicated and persistent data would no longer be an issue thanks to the UMA.

I was thinking something like this:

GPU0 DRAM CH0 <-----------> GPU1 DRAM CH0
GPU0 DRAM CH1 <-----------> GPU1 DRAM CH1
GPU0 DRAM CH2 <-----------> DRAM
GPU0 DRAM CH3 <-----------> DRAM
GPU1 DRAM CH2 <-----------> DRAM
GPU1 DRAM CH3 <-----------> DRAM

So the bus between GPU0 and GPU1 would use the GDDR5 protocol in order to provide direct access to the other GPU's local memory store.

What do you think about this idea and do you have any other thoughts on how the industry will address the MGPU issue from a hardware standpoint?

ShaidarHaran · Jul 23, 2008

Shared memory or unified memory architecture has been speculated for quite some time now. The actual implementation is in question, but no doubt the arrival at this technology plateau is extremely desirable so it will happen eventually, IMHO.

Anarchist4000 · Jul 23, 2008

It would work but what happens when you increase the number of chips in the system? Ultimately you'd know approximately how much direct memory bandwidth was available to a specific chip. From there some form of high speed link to a switch is likely the most efficient option when dealing with more than two chips.

Blazkowicz · Jul 23, 2008

there'll be optical connections eventually.

Karoshi · Jul 23, 2008

Blazkowicz said:
there'll be optical connections eventually.

Can't wait for the chips to have 10 pins, some for juice and some for 1000+ lambdas at 10GHz.

Freak'n Big Panda · Jul 24, 2008

Anarchist4000 said:
It would work but what happens when you increase the number of chips in the system? Ultimately you'd know approximately how much direct memory bandwidth was available to a specific chip. From there some form of high speed link to a switch is likely the most efficient option when dealing with more than two chips.

Yeah, a switch chip between the GPUs and the DRAM would probably be the best bet for a 3+ chip solution.

rpg.314 · Jul 24, 2008

Freak'n Big Panda said:
So the bus between GPU0 and GPU1 would use the GDDR5 protocol in order to provide direct access to the other GPU's local memory store.

Background: I am not a 3D guy, but more of a GPGPU guy.

From what you say, I understand that GPU0 fetches data straight from mem if the address lies in lower half of memory space, and if it lies in upper half, it instead asks GPU1 to fetch it for itself. This thing happens in reverse also, so both of them can access full memory.
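A minimal sketch of that address split, assuming a hypothetical 512 MiB of local DRAM per GPU and a plain half-and-half partition of the address space. The sizes and the names `owner_of` and `route` are illustrative only; nothing here reflects a real memory controller.

```python
# Hypothetical figure: each GPU owns 512 MiB of the unified address space.
LOCAL_BYTES = 512 * 1024 * 1024
NUM_GPUS = 2


def owner_of(address: int) -> int:
    """Return the index of the GPU whose local DRAM holds this address."""
    if not 0 <= address < LOCAL_BYTES * NUM_GPUS:
        raise ValueError("address outside the unified memory space")
    # Lower half belongs to GPU0, upper half to GPU1.
    return address // LOCAL_BYTES


def route(requesting_gpu: int, address: int) -> str:
    """Describe the path a read takes from the requesting GPU."""
    owner = owner_of(address)
    if owner == requesting_gpu:
        return f"GPU{requesting_gpu} -> local DRAM"
    # Remote access: the request crosses the GPU-to-GPU channels first.
    return f"GPU{requesting_gpu} -> GPU{owner} -> GPU{owner} local DRAM"
```

For example, `route(0, 0)` stays local, while `route(0, LOCAL_BYTES)` takes the extra hop through GPU1, which is where the added latency to remote memory would come from.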

If that's true, then both of us are thinking along similar if not same lines.

I agree that multi-gpu scaling is a problem that needs to be solved and hopefully will be solved soon.

Regarding optical interconnects, I am waiting for them too.

Freak'n Big Panda · Jul 24, 2008

rpg.314 said:
Background: I am not a 3D guy, but more of a GPGPU guy.

From what you say, I understand that GPU0 fetches data straight from mem if the address lies in lower half of memory space, and if it lies in upper half, it instead asks GPU1 to fetch it for itself. This thing happens in reverse also, so both of them can access full memory.

If that's true, then both of us are thinking along similar if not same lines.

I agree that multi-gpu scaling is a problem that needs to be solved and hopefully will be solved soon.

Regarding optical interconnects, I am waiting for them too.

Yeah that is exactly what I was thinking, do you have any insight on what changes would be needed to the memory controller in order to enable such a configuration?

rpg.314 · Jul 25, 2008

Freak'n Big Panda said:
Yeah that is exactly what I was thinking, do you have any insight on what changes would be needed to the memory controller in order to enable such a configuration?

I guess the way to go is the way they currently build multi-socket servers. Though I suspect they would want to improve scaling incrementally, building off their current SLI/CrossFire base to avoid risks and to make adoption easy for developers.

Which is why I guess it will take a couple more gens to achieve true multi-GPU scaling.

Having said that, 4870x2 might end up pleasantly surprising us.

armchair_architect · Jul 26, 2008

With this plan each chip would have 2x the bandwidth to its directly attached memory as it has to the remote memory. (As well as ~1/2 the latency, but that's less of a problem). Each chip would also have 2/3 the bandwidth of a lower-end single-chip board. That has two consequences:

Data placement will still be pretty important. There will still have to be tricks to get more coherence (making sure as much of the data each GPU needs is in its close memory) rather than settling for an average 50/50 split. This is part of why AFR is the de facto standard: inter-frame dependencies hurt, but there are far fewer of them than intra-frame dependencies, so it's easy to get localized access.
Any parts of the frame that were bandwidth-limited on a single-chip config will get less than 2x scaling, so the overall scaling will be somewhat less than perfect.

rpg.314 · Jul 27, 2008

You are right, armchair_architect, on both counts. Then maybe a possible solution is a switch chip between all the GPUs and memory which does some caching too (for the pixels at the edges of contiguous tiles).

rpg.314 · Jul 27, 2008

armchair_architect said:
With this plan each chip would have 2x the bandwidth to its directly attached memory as it has to the remote memory. (As well as ~1/2 the latency, but that's less of a problem).

Though I must ask for an explanation: what would cause this (less bandwidth)? Perhaps this could be solved by giving more channels to cross-GPU communication than to direct memory access.

ShaidarHaran · Jul 27, 2008

Why not utilize a central I/O chip containing a memory controller, with each GPU having dedicated links to it? I realize this would increase latency compared to an MGPU design featuring native IMCs per GPU, but I imagine this effect would be negligible, especially given the sheer amount of threads/batches in flight on modern GPUs. That way there's no worries over coherency, and it would facilitate UMA.

armchair_architect · Jul 28, 2008

rpg.314 said:
Though I must ask for an explanation: what would cause this (less bandwidth)?

Err, this appears to be me writing without enough attention to the proposed architecture. I saw six lines and thought the config was 6 channels per GPU, with 2 for inter-GPU transfer and 4 for local DRAM.

Bandwidth to local and remote would be the same (though there's still the ~2x latency to remote). But this means that compared to a single chip, each chip in the dual-chip config has effectively only half as much bandwidth: total bandwidth to DRAM is the same either way, but there are twice as many clients of that bandwidth in the dual-chip setup.
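To put a rough number on the "less than 2x scaling" point, here is a back-of-the-envelope sketch. It assumes a simple Amdahl-style model that is not from the posts above: the compute-bound part of a frame splits perfectly across two chips, while the bandwidth-bound part does not speed up at all because total DRAM bandwidth is unchanged. The function name `dual_chip_speedup` is made up for illustration.

```python
def dual_chip_speedup(bandwidth_bound_fraction: float) -> float:
    """Estimated speedup of two chips over one under the assumed model.

    bandwidth_bound_fraction is the share of single-chip frame time
    that was limited by DRAM bandwidth (between 0.0 and 1.0).
    """
    f = bandwidth_bound_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    compute_time = (1.0 - f) / 2.0  # assumed: shader work splits across two chips
    bandwidth_time = f              # assumed: same total bandwidth, so no gain
    return 1.0 / (compute_time + bandwidth_time)
```

Under these assumptions a frame that was entirely compute-bound scales 2x, one that was entirely bandwidth-bound does not scale at all, and a 50/50 frame lands at about 1.33x.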

Freak'n Big Panda · Jul 28, 2008

Yeah, of course, but 6 Gbps GDDR5 providing ~192 GB/s on a 4x64-bit bus should be enough for the DX11 offerings from the IHVs, if RV770 performance with its GDDR3 is anything to go by. This would give each chip around 100 GB/s, and it could be extrapolated to a 512-bit bus if more bandwidth were needed. But I'm not sure a 512-bit bus would be feasible given the die sizes we're likely to see on TSMC 40nm.
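The arithmetic behind those figures, as a quick sketch. It only covers peak theoretical bandwidth (bus width times per-pin data rate) and assumes the two chips split the total evenly; the helper name `peak_bandwidth_gb_per_s` is just for illustration.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s: one data pin per bus bit, 8 bits per byte."""
    return bus_width_bits * gbps_per_pin / 8.0


# 4 channels x 64 bits at 6 Gbps per pin.
total = peak_bandwidth_gb_per_s(4 * 64, 6.0)

# Assumed even split between the two chips sharing that memory.
per_chip = total / 2

# The same data rate on a 512-bit bus.
wide = peak_bandwidth_gb_per_s(512, 6.0)

print(total, per_chip, wide)
```

That gives 192 GB/s in total, 96 GB/s per chip (the "around 100" above), and 384 GB/s for the 512-bit case.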

Mat3 · Jul 28, 2008

ShaidarHaran said:
Why not utilize a central I/O chip containing a memory controller, with each GPU having dedicated links to it? I realize this would increase latency compared to an MGPU design featuring native IMCs per GPU, but I imagine this effect would be negligible, especially given the sheer amount of threads/batches in flight on modern GPUs. That way there's no worries over coherency, and it would facilitate UMA.

But wouldn't you want, let's say, at least a 128-bit bus to each GPU (let's say 2 of them), and at least 256-bit connecting to the memory, so could the central chip containing just the memory controller, some data paths, and some cache be big enough for that?

liolio · Jul 28, 2008

ShaidarHaran said:
Why not utilize a central I/O chip containing a memory controller, with each GPU having dedicated links to it? I realize this would increase latency compared to an MGPU design featuring native IMCs per GPU, but I imagine this effect would be negligible, especially given the sheer amount of threads/batches in flight on modern GPUs. That way there's no worries over coherency, and it would facilitate UMA.

[uneducated guess]
I was thinking about this too.
I'm far from competent to discuss the pro/con of the design but I was thinking about something like this.

In Xenos, the shader core and the daughter die are connected through a fast serial link (32 GB/s if I'm right). I could see x GPUs connected to a kind of north bridge through fast serial lanes.

In Xenos, ATI put the ROPs outside of the shader core and close (couldn't be closer, in fact) to the memory they are supposed to read.
It granted them a lot of bandwidth, but do the ROPs really need to be that close to their working memory (whether for latency reasons or something else, explanations welcome)?

If yes, would it make sense to move the ROPs into this "north bridge like" part?
(It would also help make the die big enough to support a 256-bit-wide bus while using a cheap process.)

[/uneducated guess]

MfA · Jul 28, 2008

Serial signalling's main selling point for this kind of bus is finer-grained clock synchronization ... but if even PCIe 3.0 lets go of 8b/10b encoding I just don't think it's a big deal. GDDR5 is already doing 5 GHz parallel and Rambus thinks they can push it up to 16 GHz. Serial signalling won't help push the data rates for this application.
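For a sense of what dropping 8b/10b buys, here is a small sketch of line-code overhead. The per-lane rates used (5 GT/s with 8b/10b for PCIe 2.0, 8 GT/s with 128b/130b for PCIe 3.0) come from the published PCIe specifications rather than from this thread, and the function names are illustrative.

```python
def payload_efficiency(payload_bits: int, encoded_bits: int) -> float:
    """Fraction of transmitted bits that carry payload for a line code."""
    return payload_bits / encoded_bits


def lane_payload_gbps(raw_gt_per_s: float, payload_bits: int, encoded_bits: int) -> float:
    """Payload rate of one lane in Gbit/s after line-code overhead."""
    return raw_gt_per_s * payload_efficiency(payload_bits, encoded_bits)


eff_8b10b = payload_efficiency(8, 10)        # 20% of the raw rate is overhead
eff_128b130b = payload_efficiency(128, 130)  # roughly 1.5% overhead

pcie2_lane = lane_payload_gbps(5.0, 8, 10)     # PCIe 2.0: 5 GT/s, 8b/10b
pcie3_lane = lane_payload_gbps(8.0, 128, 130)  # PCIe 3.0: 8 GT/s, 128b/130b

print(pcie2_lane, round(pcie3_lane, 2))
```

So PCIe 3.0 nearly doubles the per-lane payload rate with only a 1.6x increase in raw signalling rate, most of the rest coming from the lighter encoding.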

PS. I seriously doubt Xenos used serial signalling to the daughter die.

rpg.314 · Jul 29, 2008

I have no idea what Xenos uses, but my gut feeling is that QP or HT (or something like it) is the way to go to build X2 or even X4 gfx cards. This solution has worked for CPUs, so (naively) it feels like the right way to go.

MfA · Jul 29, 2008

QP is licensing hell, and HyperTransport is one technology update behind the times ... also, general coherency protocols just have way too high an overhead IMO. As far as transfer rates are concerned, even PCIe 3.0 will be faster than either, so they are better off sticking to that. They have the know-how, they have the license ... it just makes more sense.
