CUDA 4.0 has a unified virtual address space, so one GPU can access another GPU's memory directly, without issuing an explicit memcpy and without routing the data through the CPU.
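Something like this hypothetical sketch is all it takes with the 4.0 runtime API; device IDs, sizes, and the kernel here are made up for illustration, and error checking is omitted. Once peer access is enabled, a kernel on one GPU can just dereference a pointer that lives on the other:

    // Hypothetical CUDA 4.0 peer-to-peer sketch; everything here is
    // illustrative, error checking omitted for brevity.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] *= 2.0f;                  // touches the other GPU's memory
    }

    int main(void) {
        const int n = 1 << 20;
        int can = 0;
        cudaDeviceCanAccessPeer(&can, 0, 1);      // can GPU 0 reach GPU 1 directly?
        if (!can) { printf("no P2P path between GPU 0 and GPU 1\n"); return 1; }

        float *buf1;
        cudaSetDevice(1);
        cudaMalloc((void**)&buf1, n * sizeof(float)); // buffer lives in GPU 1's memory

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // map GPU 1's memory into the shared address space
        scale<<<(n + 255) / 256, 256>>>(buf1, n); // GPU 0 kernel dereferences it, no memcpy
        cudaDeviceSynchronize();
        return 0;
    }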
Sorry for the n00b question: Is it already available, and/or have you already run some benchmarks to see latencies/BW?
The CUDA 4.0 UVA has to go through the PCIe bus, so it's not very fast; at least not suitable for real NUMA style 3D rendering. Basically, if you want to have two GPUs working like one GPU, you need a very fast link between them, and that's going to be very expensive.
PCI-E 3.0 x16 is 12.8 GB/s. QPI (used to make NUMA CPU configurations) is only at 25.6 GB/s. So I'd say it's comparable to QPI, and can definitely be used for real NUMA style computation. Whether it'd be good to do 3D rendering over this bus is a different story, but then again: CUDA is not for 3D rendering anyway.

Yes, and CPUs get away with 20 GB/s memory bandwidth, so it must be enough for GPUs also. Uhm, no.
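It's easy enough to measure rather than argue about. A micro-benchmark along these lines (transfer size and device IDs are arbitrary, error checking omitted) reports the effective GB/s you actually get over the bus:

    // Hypothetical P2P bandwidth micro-benchmark with the CUDA 4.0 API.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        const size_t bytes = 256u << 20;                    // 256 MB, arbitrary
        float *src, *dst;
        cudaSetDevice(0); cudaMalloc((void**)&src, bytes);  // source on GPU 0
        cudaSetDevice(1); cudaMalloc((void**)&dst, bytes);  // destination on GPU 1

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // use the direct PCIe path if the driver allows it

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0, 0);
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);      // GPU 0 -> GPU 1
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("effective bandwidth: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));
        return 0;
    }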
I disagree; allowing one GDDR5 link out of 4 to be used for chip-to-chip would probably be negligible in production cost.
Question is, will it give enough advantage to earn back the non-recurring costs? Probably yes ... NVIDIA is making enough headway in HPC to start thinking about creating an interconnect fabric.
The biggest problem is that for gaming this is not worth it. AFR style multi-GPU rendering is "good enough" for most people, and for some games it's even possible to do SLI style rendering. So the benefit of doing this is not very significant. For computing purposes, this should be quite useful, but then again you don't really need this much bandwidth between GPUs for computing.
High geometry load and deferred rendering are on the rise, so SLI is losing relevance, and AFR will forever be a nasty hack unless you can consistently maintain a really high framerate.
For a sort-middle architecture with replicated "normal" textures and geometry, the only things which have to cross the link are the transformed geometry (which doesn't have to be tessellated, if the driver can determine bounding surfaces for the tessellated surfaces and it is small enough in relation to tile size) and copies of deferred render targets. I'd say a quarter of the memory bandwidth is about right for that.
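To put some hypothetical numbers on that (every figure below is a guess, not a measurement):

    // Back-of-the-envelope for the sort-middle case; all inputs are assumptions.
    #include <stdio.h>

    int main(void) {
        double verts  = 2e6;                 // post-transform vertices per frame (guess)
        double vbytes = 32;                  // bytes per transformed vertex (guess)
        double gbuf   = 1920.0 * 1080 * 16;  // deferred render target copy, 16 B/pixel (guess)
        double fps    = 60;
        double link_gbps = (verts * vbytes + gbuf) * fps / 1e9;

        double gddr5_gbps = 160;             // total board memory bandwidth (guess)
        printf("needs %.1f GB/s; one GDDR5 link out of 4 gives %.1f GB/s\n",
               link_gbps, gddr5_gbps / 4);
        return 0;
    }

With those guesses the traffic comes out around 6 GB/s against roughly 40 GB/s for a quarter of the board bandwidth, so there's plenty of headroom even if the inputs are off by a fair margin.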
Completely depends on the application. For the applications for which clusters became popular ... yeah, you're right. For all the applications which still needed classical supercomputers with a high-bandwidth interconnect, not so much.