NVidia multi GPU (non-SLI) with PCIe switch

Ext3h

So, maybe one of you has an idea. I'm facing a stability issue using multiple NVidia GPUs under Windows as soon as they share PCIe lanes via a PCIe switch.

An example configuration would be 4x 1050 (for the given application that is actually the ideal configuration), where each pair of GPUs shares 16 PCIe 3.0 lanes via a PCIe switch (as they need to shuffle around a lot of texture data).

Problem is, as soon as PCIe bus load rises and contention occurs, frequent TDR watchdog violations are pretty much guaranteed. Via CUDA it shows up as reproducible "launch errors". The simplest way to reproduce the issue is a small CUDA application that does nothing except copy from host (shared) memory to device memory and back, concurrently on 2 GPUs sharing the same lanes.
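Roughly along these lines (quick untested sketch from memory, not the exact test application):

```cpp
// Quick sketch of that kind of reproducer (from memory, not the exact test
// application): one host thread per GPU, each hammering pinned-host <-> device
// copies over the shared switch and reporting whatever the driver API returns.
// Build with any C++ compiler plus the CUDA headers, link -lcuda (cuda.lib on Windows).
#include <cuda.h>
#include <cstdio>
#include <thread>
#include <vector>

#define CHECK(call)                                                  \
    do {                                                             \
        CUresult r_ = (call);                                        \
        if (r_ != CUDA_SUCCESS) {                                    \
            const char* name = "unknown";                            \
            cuGetErrorName(r_, &name);                               \
            std::printf("%s failed: %s\n", #call, name);             \
        }                                                            \
    } while (0)

static void hammer(int ordinal, size_t bytes, int iterations)
{
    CUdevice dev;
    CUcontext ctx;
    CUstream stream;
    CUdeviceptr dptr;
    void* hptr = nullptr;

    CHECK(cuDeviceGet(&dev, ordinal));
    CHECK(cuCtxCreate(&ctx, 0, dev));
    CHECK(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));
    CHECK(cuMemAlloc(&dptr, bytes));
    CHECK(cuMemAllocHost(&hptr, bytes));        // pinned host memory

    for (int i = 0; i < iterations; ++i) {
        CHECK(cuMemcpyHtoDAsync(dptr, hptr, bytes, stream));
        CHECK(cuMemcpyDtoHAsync(hptr, dptr, bytes, stream));
        CHECK(cuStreamSynchronize(stream));     // failures tend to surface here
    }
}

int main()
{
    CHECK(cuInit(0));
    // Two GPUs behind the same switch, both copy loops running concurrently.
    std::vector<std::thread> workers;
    for (int gpu = 0; gpu < 2; ++gpu)
        workers.emplace_back(hammer, gpu, size_t(256) << 20, 1000);
    for (auto& w : workers) w.join();
    return 0;
}
```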

Tried with different GPU series (Quadro, GeForce), different architectures (Kepler to Turing), the full range of performance classes (GT to top model), and different (workstation) mainboards with different switches (all Intel CPUs with quad-channel memory, though). Always the same issue.

Keeping the GPUs isolated, on non-shared lanes, avoids the issue. Dropping from PCIe 3.0 to any lower standard doesn't. Neither does disabling just about every power saving option we could think of.

As soon as contention occurs on the switch, blue screens with TDR Watchdog violations are pretty much certain.

Even some extremely basic load, such as having monitors attached to two GPUs and then just playing a hardware-accelerated video in VLC, is enough to trigger the issues. The blue screens only happen if the GPU has a monitor attached / a DX API is used on that GPU. If a GPU is only used via CUDA-related APIs, it "only" shows up as the mentioned launch errors, but it still triggers every few seconds when put under full load.

Not a new issue either. It dates back to Windows 7 times, but has become increasingly frequent and easier to reproduce as the driver continues to be optimized.
 
Because I'm limited by hardware decoder throughput, and below a Quadro RTX 4000 that's identical across Turing / Pascal / Volta. In terms of perf/price, the RTX 4000 is still significantly worse. Going multi-CPU just to get more lanes is also much worse, and NVidia GPUs on Zen 1 Threadripper don't achieve full PCIe throughput due to jitter in memory access latency. Zen 2 Threadripper is once again way too expensive.
 
Curious, why do you avoid SLI?
Edit: Disregard ... forgot there's no SLI for these GPUs.
 
Really?
EPYC 7232P: 128 lanes, $450 according to wiki.
And another $700 for the only mainboard with four electrically x16 PCIe slots and an SP3 socket currently on the market. TR4-socket boards are much cheaper.

Thanks, but in terms of hardware, a switched PCIe setup is currently the only cost-effective solution. The next best option is shrinking the nodes down to just two GPUs, but that's a waste of rack space. Going for 2x RTX 4000 achieves the performance goal per node, but the prices have gone up like crazy recently.

The question is primarily why this is so unstable. I have neither the equipment nor the know-how to debug PCIe-related issues. I can post the source of the sample application for reproducing the problem tomorrow.

EDIT:
Or to see if someone here by chance knows a PCIe switch model which is known not to trigger issues with NVidia's GPUs.
 
Does Linux/BSD etc. have the problem, and will your software port?
The demo application should be portable, it's just plain CUDA driver API. The real one, nope; it's primarily used as a UI application and is tied to D3D11 for now.

Didn't get around to testing on Linux yet, will try that next. Though I'm not sure how to distinguish a bad PCIe driver from bad hardware if it does work on Linux. Only if it doesn't work could I assume bad hardware / firmware.
 
Test application known to trigger the issues:
https://gist.github.com/Ext3h/6eb2df21873f5524bfd70a0368872d4e

Requires the CUDA driver API, and is actually meant to benchmark PCIe performance.

Reference results are on the order of ~12 GB/s in a single direction, and ~20 GB/s combined bidirectional, for an exclusive PCIe 3.0 x16 link. If the single-GPU results are significantly lower, then multi-GPU runs won't run into contention.
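For what it's worth, the throughput figure is just bytes over event-timed elapsed time, along these lines (sketch only, not the gist code; assumes a current context plus a stream and buffers set up as in any driver API copy loop):

```cpp
// Sketch of the bandwidth math only (not the gist code): time a batch of
// async copies with CUDA events and convert to GB/s. Assumes a current
// context; dptr/hptr/stream come from the usual driver API setup.
#include <cuda.h>

static double measureHtoDGBps(CUdeviceptr dptr, const void* hptr, size_t bytes,
                              CUstream stream, int repetitions)
{
    CUevent start, stop;
    cuEventCreate(&start, CU_EVENT_DEFAULT);
    cuEventCreate(&stop, CU_EVENT_DEFAULT);

    cuEventRecord(start, stream);
    for (int i = 0; i < repetitions; ++i)
        cuMemcpyHtoDAsync(dptr, hptr, bytes, stream);
    cuEventRecord(stop, stream);
    cuEventSynchronize(stop);

    float ms = 0.0f;
    cuEventElapsedTime(&ms, start, stop);

    // bytes * repetitions transferred in ms milliseconds -> GB/s
    return (double(bytes) * repetitions) / (ms * 1e6);
}
```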
 
Couple of new insights:
  • Doesn't happen on Linux
  • Does also happen on a multi-CPU GPU server, despite each CPU driving only a single GPU with dedicated PCIe lanes
And it's not just launching kernels or performing copies that yields the "random" launch errors, but also memory management functions. When putting the GPUs under too much stress, there is always a good chance of a CUDA_ERROR_OUT_OF_MEMORY on memory allocations. Or, far worse, a silent memory leak on the matching free function.
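One way to make that silent leak visible at all (rough, untested sketch; assumes a context is already current on the calling thread) is to compare cuMemGetInfo() around the alloc/free pair:

```cpp
// Untested sketch: detect the "free() returned success but nothing came back"
// case by comparing the driver's free-memory counter around an alloc/free pair.
// Assumes a CUDA driver API context is already current on the calling thread.
#include <cuda.h>
#include <cstdio>

static bool allocFreePairLeaks(size_t bytes)
{
    size_t freeBefore = 0, freeAfter = 0, total = 0;
    CUdeviceptr dptr = 0;

    cuMemGetInfo(&freeBefore, &total);

    CUresult r = cuMemAlloc(&dptr, bytes);
    if (r != CUDA_SUCCESS) {
        // Under heavy contention this is where the spurious
        // CUDA_ERROR_OUT_OF_MEMORY shows up, despite plenty of free VRAM.
        std::printf("cuMemAlloc failed: %d\n", (int)r);
        return false;
    }
    std::printf("cuMemFree returned %d\n", (int)cuMemFree(dptr));

    cuMemGetInfo(&freeAfter, &total);

    // If the free "succeeded" but the reported free memory did not recover,
    // the allocation has effectively leaked. (Crude heuristic; the counter
    // also moves for unrelated reasons, so only large deltas are meaningful.)
    return freeAfter + bytes / 2 < freeBefore;
}
```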

So, what's happening?

I'm suspecting a backfiring deadlock-concealment mechanism in nvcuda.dll (the driver API). A smart hack to conceal soft locks in the driver by just returning soft faults, essentially saying "your fault, please restart your application". Except it's guaranteed to blow up whenever the CPU side stalls. Or one of the GPUs isn't reachable in a timely manner due to contention on PCIe / QPI / whatever. Or even just if a Windows API stalls for a moment.

Looks like the bug not happening without a PCIe switch on a single-socket system was just a red herring.
 
Happened to see a post regarding measuring performance on a single GPU, over PCIe, and over NVLink, which reminded me of this thread and the PCIe issues you were having.

While it won't solve your problem, I found it interesting that the scaling differences between single GPU, PCIe and NVLink can be quite surprising. He is using two RTX 2080s for his training.
What I did not expect is that using the NVLink bridge makes a VERY significant impact. Here are results using nvcaffe:
single GPU 450 images/second
dual GPU via single PCIe switch 535 images/second
dual GPU via NVLinks (enabling P2P) 830 images/second

The model I train has a massive last layer on the order of 200K-300K outputs, so I believe this dictates that lots of data needs to be copied between GPUs, hence a fast link makes such an impact.
https://devtalk.nvidia.com/default/topic/1066863/b/t/post/5403333/
 
Reported, yes. But I don't have much hope they will ever process it. An NVidia representative brushed it off with a politely phrased "you don't cause enough revenue to bother us with multi-GPU issues".
 
Based on this report, some drivers work with dual NVLink and others don't. Have you tried the drivers tested as working, either using the Windows versions or their Linux counterparts?
 
Had all of those driver versions in testing at one point, but it looks like the NVLink problems are yet another, unrelated issue. Good to know though; now I'm confident never to even touch NVLink.

In the meantime I realized that even HP's (and other big vendors') pre-built GPU servers suffer from the very same issue in certain configurations. PCIe bandwidth exceeding DRAM bandwidth is yet another constellation which triggers this precise issue; you get that one e.g. if you equip your GPU server with only 1 DRAM channel populated.

I suppose it all boils down to a primitive scheduling issue. If it is possible for one GPU to starve the other of PCIe bandwidth, it's game over. So if you use NVidia GPUs, you must ensure that each GPU is always bottlenecked by its own unshared PCIe link, never by any shared resource.
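A possible CPU-side band-aid would be to keep the bursts from ever overlapping in the first place, e.g. something along these lines (untested sketch, and it obviously costs peak throughput):

```cpp
// Untested sketch of a CPU-side workaround: gate the large host<->device
// transfers with one process-wide mutex so that GPUs behind the same switch
// never issue their burst traffic at the same time. Costs peak throughput.
#include <cuda.h>
#include <mutex>

static std::mutex g_sharedLinkGate;   // one gate per shared PCIe uplink

// Issue the copy and wait for it while holding the gate, so at most one GPU
// is saturating the shared uplink at any point in time.
static CUresult gatedUpload(CUdeviceptr dst, const void* src, size_t bytes,
                            CUstream stream)
{
    std::lock_guard<std::mutex> lock(g_sharedLinkGate);
    CUresult r = cuMemcpyHtoDAsync(dst, src, bytes, stream);
    if (r != CUDA_SUCCESS)
        return r;
    return cuStreamSynchronize(stream);
}
```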
 