CUDA 4.0 has a unified virtual address space, so one GPU can access another GPU's memory directly, without issuing an explicit memcpy and without routing the data through the CPU.
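Something like this hypothetical sketch is all it takes with the 4.0 runtime API; device IDs, sizes, and the kernel here are made up for illustration, and error checking is omitted. Once peer access is enabled, a kernel on one GPU can just dereference a pointer that lives on the other:

    // Hypothetical CUDA 4.0 peer-to-peer sketch; everything here is
    // illustrative, error checking omitted for brevity.
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] *= 2.0f;                  // touches the other GPU's memory
    }

    int main(void) {
        const int n = 1 << 20;
        int can = 0;
        cudaDeviceCanAccessPeer(&can, 0, 1);      // can GPU 0 reach GPU 1 directly?
        if (!can) { printf("no P2P path between GPU 0 and GPU 1\n"); return 1; }

        float *buf1;
        cudaSetDevice(1);
        cudaMalloc((void**)&buf1, n * sizeof(float)); // buffer lives in GPU 1's memory

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);         // map GPU 1's memory into the shared address space
        scale<<<(n + 255) / 256, 256>>>(buf1, n); // GPU 0 kernel dereferences it, no memcpy
        cudaDeviceSynchronize();
        return 0;
    }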
Sorry for the n00b question: Is it already available, and/or have you already run some benchmarks to see latencies/BW?
The CUDA 4.0 UVA has to go through the PCIe bus, so it's not very fast; at least not suitable for real NUMA style 3D rendering. Basically, if you want to have two GPUs working like one GPU, you need a very fast link between them, and that's going to be very expensive.
PCI-E 3.0 x16 is 12.8 GB/s. QPI (used to make NUMA CPU configurations) is only at 25.6 GB/s. So I'd say it's comparable to QPI, and can definitely be used for real NUMA style computation. Whether it'd be good to do 3D rendering over this bus is a different story, but then again: CUDA is not for 3D rendering anyway.

Yes, and CPUs get away with 20 GB/s memory bandwidth, so it must be enough for GPUs also. Uhm, no.
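It's easy enough to measure rather than argue about. A micro-benchmark along these lines (transfer size and device IDs are arbitrary, error checking omitted) reports the effective GB/s you actually get over the bus:

    // Hypothetical P2P bandwidth micro-benchmark with the CUDA 4.0 API.
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        const size_t bytes = 256u << 20;                    // 256 MB, arbitrary
        float *src, *dst;
        cudaSetDevice(0); cudaMalloc((void**)&src, bytes);  // source on GPU 0
        cudaSetDevice(1); cudaMalloc((void**)&dst, bytes);  // destination on GPU 1

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // use the direct PCIe path if the driver allows it

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0, 0);
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);      // GPU 0 -> GPU 1
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("effective bandwidth: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));
        return 0;
    }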
I disagree; allowing one GDDR5 link out of 4 to be used for chip-to-chip would probably be negligible in production cost.
Question is, will it give enough advantage to earn back the non-recurring costs? Probably yes ... NVIDIA is making enough headway in HPC to start thinking about creating an interconnect fabric.
The biggest problem is that for gaming this is not worth it. AFR style multi-GPU rendering is "good enough" for most people, and for some games it's even possible to do SLI style rendering. So the benefit of doing this is not very significant. For computing purposes, this should be quite useful, but then again you don't really need this much bandwidth between GPUs for computing.
High geometry load and deferred rendering are on the rise, so SLI is losing relevance, and AFR will forever be a nasty hack unless you can consistently maintain a really high framerate.
For a sort-middle architecture with replicated "normal" textures and geometry, the only things which have to cross the link are the transformed geometry (which doesn't have to be tessellated, if the driver can determine bounding surfaces for the tessellated surfaces and it is small enough in relation to tile size) and copies of deferred render targets. I'd say a quarter of the memory bandwidth is about right for that.
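To put some hypothetical numbers on that (every figure below is a guess, not a measurement):

    // Back-of-the-envelope for the sort-middle case; all inputs are assumptions.
    #include <stdio.h>

    int main(void) {
        double verts  = 2e6;                 // post-transform vertices per frame (guess)
        double vbytes = 32;                  // bytes per transformed vertex (guess)
        double gbuf   = 1920.0 * 1080 * 16;  // deferred render target copy, 16 B/pixel (guess)
        double fps    = 60;
        double link_gbps = (verts * vbytes + gbuf) * fps / 1e9;

        double gddr5_gbps = 160;             // total board memory bandwidth (guess)
        printf("needs %.1f GB/s; one GDDR5 link out of 4 gives %.1f GB/s\n",
               link_gbps, gddr5_gbps / 4);
        return 0;
    }

With those guesses the traffic comes out around 6 GB/s against roughly 40 GB/s for a quarter of the board bandwidth, so there's plenty of headroom even if the inputs are off by a fair margin.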
Completely depends on the application. For the applications for which clusters became popular ... yeah, you're right. For all the applications which still needed classical supercomputers with a high-bandwidth interconnect, not so much.