NVidia multi GPU (non-SLI) with PCIe switch

Discussion in 'PC Hardware, Software and Displays' started by Ext3h, Nov 4, 2019.

  1. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
So, maybe one of you has an idea. I'm facing a stability issue using multiple NVidia GPUs under Windows as soon as they share PCIe lanes via a PCIe switch.

An example configuration would be 4x 1050 (for the given application that is actually the ideal configuration), where each pair of GPUs shares 16 PCIe 3.0 lanes via a PCIe switch (as they need to shuffle around a lot of texture data).

The problem is, as soon as the PCIe bus load rises and contention occurs, frequent TDR watchdog violations are pretty much guaranteed. Via CUDA, the same shows up as reproducible "launch errors". The simplest way to reproduce the issue is a trivial CUDA application that does nothing except copy from shared to device memory and back, concurrently on 2 GPUs sharing the same lanes.
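(For illustration only, not the actual reproducer: a minimal sketch of the copy-loop pattern described above, written against the CUDA driver API. Two threads each hammer one GPU with pinned-host-to-device copies so that GPUs behind the same switch contend for the shared lanes; error handling is reduced to one check macro.)

```cuda
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>
#include <thread>
#include <vector>

#define CHECK(call) do { CUresult r = (call); if (r != CUDA_SUCCESS) { \
    fprintf(stderr, "%s failed: %d\n", #call, r); exit(1); } } while (0)

// Copy back and forth on one GPU; run two of these concurrently.
static void hammer(int ordinal, size_t bytes, int iterations) {
    CUdevice dev; CUcontext ctx;
    CHECK(cuDeviceGet(&dev, ordinal));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    void* host = nullptr; CUdeviceptr device = 0;
    CHECK(cuMemAllocHost(&host, bytes));   // pinned ("shared") host memory
    CHECK(cuMemAlloc(&device, bytes));

    for (int i = 0; i < iterations; ++i) {
        CHECK(cuMemcpyHtoD(device, host, bytes));  // upload
        CHECK(cuMemcpyDtoH(host, device, bytes));  // download
    }

    CHECK(cuMemFree(device));
    CHECK(cuMemFreeHost(host));
    CHECK(cuCtxDestroy(ctx));
}

int main() {
    CHECK(cuInit(0));
    int count = 0;
    CHECK(cuDeviceGetCount(&count));
    // On the problematic systems, one of the CHECKed copies eventually
    // fails with a launch/copy error once the switch saturates.
    std::vector<std::thread> workers;
    for (int d = 0; d < count && d < 2; ++d)
        workers.emplace_back(hammer, d, size_t(256) << 20, 1000);
    for (auto& w : workers) w.join();
    return 0;
}
```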

Tried with different GPU series (Quadro, GeForce), different architectures (Kepler through Turing), the full range of performance classes (GT to top model), and different (workstation) mainboards with different switches (all Intel CPUs with quad-channel memory, though); always the same issue.

Keeping the GPUs isolated on non-shared lanes avoids the issue. Dropping from PCIe 3.0 to any lower standard doesn't. Neither does disabling just about every power-saving option we could think of.

    As soon as contention occurs on the switch, blue screens with TDR Watchdog violations are pretty much certain.

Even some extremely basic load, such as having monitors attached to two GPUs and then just playing a hardware-accelerated video in VLC, is enough to trigger the issues. Well, the bluescreens only happen if the GPU has a monitor attached / a DX API is in use on that GPU. If a GPU is only used via CUDA-related APIs, it "only" shows in the form of the mentioned launch errors, but it still triggers every few seconds when put under full load.

Not a new issue either. It dates back to Windows 7 times, but has become increasingly frequent and easier to reproduce as the driver continues to be optimized.
     
  2. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    15,050
    Likes Received:
    2,386
Why not swap the 4x 1050s for a top-of-the-range card?
     
  3. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
Because I'm limited by hardware decoder throughput, and below a Quadro RTX 4000, that's identical across all of Turing / Pascal / Volta. And in perf/price, the RTX 4000 is still significantly worse. Going multi-CPU just to have more lanes is also much worse, and NVidia GPUs on Zen 1 Threadripper don't achieve full PCIe throughput due to jitter in memory access latency. Zen 2 Threadripper is once again way too expensive.
     
    #3 Ext3h, Nov 5, 2019
    Last edited: Nov 5, 2019
  4. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    15,050
    Likes Received:
    2,386
Really?
EPYC 7232P: 128 lanes, $450 according to Wikipedia.
     
  5. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,003
    Likes Received:
    1,687
Curious, why do you avoid SLI?
Edit: Disregard ... forgot there's no SLI for these GPUs.
     
  6. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
And another $700 for the only mainboard with 4 electrical PCIe x16 slots and an SP3 socket currently on the market. TR4 socket boards are much cheaper.

Thanks, but in terms of hardware, a switched PCIe topology is currently the only cost-effective solution. The next best option is shrinking the nodes down to just two GPUs, but that's a waste of rack space. Going for 2x RTX 4000 achieves the perf goal per node, but the prices went up like crazy recently.

The question is primarily why this is so unstable. I have neither the equipment nor the know-how to debug PCIe-related issues. I can post the source for the sample application that reproduces the problem tomorrow.

    EDIT:
Or to see if someone here by chance knows a PCIe switch model which is known not to trigger issues with NVidia's GPUs.
     
    #6 Ext3h, Nov 5, 2019
    Last edited: Nov 5, 2019
  7. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    15,050
    Likes Received:
    2,386
Does Linux/BSD etc. have the problem? Will your software port?
     
  8. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
The demo application should be portable, it's just the plain CUDA driver API. The real one, no; it's primarily a UI application and is tied to D3D11 for now.

Didn't get around to testing on Linux yet; will try that next. Even then, I'm not sure how to distinguish a bad PCIe driver from bad hardware if it does work on Linux. Only if it doesn't work could I assume bad hardware/firmware.
     
  9. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
    Test application known to trigger the issues:
    https://gist.github.com/Ext3h/6eb2df21873f5524bfd70a0368872d4e

Requires the CUDA driver API, and is actually meant to benchmark PCIe performance.

Reference results are on the order of ~12 GB/s single direction, ~20 GB/s combined bidirectional, for an exclusive PCIe 3.0 x16 link. If the single-GPU results are significantly lower, then multi-GPU runs won't run into contention.
     
  10. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
A couple of new insights:
• Doesn't happen on Linux.
• Does also happen on a multi-CPU GPU server, despite each CPU only driving a single GPU with dedicated PCIe lanes.
And it's not just kernel launches or copies that yield "random" "CUDA_LAUNCH_ERROR" errors; memory management functions fail too. When putting the GPUs under too much stress, there is always a good chance of a "CUDA_ERROR_OUTOFMEMORY" on memory allocations. Or, far worse, a silent memory leak in the matching free function.

    So, what's happening?

I suspect a backfiring deadlock-concealment mechanism in nvcuda.dll (the driver API): a smart hack to conceal soft locks in the driver by just returning soft faults, essentially saying "your fault, please restart your application". Except it's guaranteed to blow up whenever the CPU side is stalling. Or whenever one of the GPUs isn't reachable in a timely manner due to contention on PCIe / QPI / whatever. Or even just when a Windows API stalls for a moment.
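(A hypothetical sketch, assuming that soft-fault theory is right, of the defensive pattern it would force onto callers: treat a spurious CUDA_ERROR_OUT_OF_MEMORY under load as possibly transient and retry with backoff. The helper name and backoff schedule are made up for illustration; this is not code from the thread.)

```cuda
#include <cuda.h>
#include <chrono>
#include <thread>

// If the driver converts internal stalls into soft faults, an allocation
// that fails with CUDA_ERROR_OUT_OF_MEMORY during heavy bus contention
// may succeed on a later attempt once the traffic drains.
static CUresult alloc_with_retry(CUdeviceptr* ptr, size_t bytes, int attempts) {
    CUresult r = CUDA_ERROR_UNKNOWN;
    for (int i = 0; i < attempts; ++i) {
        r = cuMemAlloc(ptr, bytes);
        if (r != CUDA_ERROR_OUT_OF_MEMORY)
            return r;  // success, or a real error worth surfacing
        // Exponential backoff before retrying the "spurious" OOM.
        std::this_thread::sleep_for(std::chrono::milliseconds(10 << i));
    }
    return r;
}
```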

Looks like the bug not appearing without a PCIe switch on a single-socket system was just a red herring.
     
  11. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,003
    Likes Received:
    1,687
I happened to see a post regarding measuring performance with single-GPU, PCIe and NVLink setups, which reminded me of this thread and the PCIe issues you were having.

While it won't solve your problem, I found it interesting that the scaling differences between single GPU, PCIe and NVLink can be quite surprising. He is using two RTX 2080s for his training.
    https://devtalk.nvidia.com/default/topic/1066863/b/t/post/5403333/
     
  12. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    15,050
    Likes Received:
    2,386
PS: have you reported your issue to NVidia?
     
    A1xLLcqAgt0qc2RyMz0y likes this.
  13. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
Reported, yes. But I don't have much hope they will ever process it. The NVidia representative shook it off with a politely phrased "you don't cause enough revenue to bother us with multi-GPU issues".
     
    #13 Ext3h, Nov 26, 2019
    Last edited: Nov 26, 2019