Ext3h
Regular
So, maybe one of you has an idea. I'm facing a stability issue with multiple NVidia GPUs under Windows as soon as they share PCIe lanes via a PCIe switch.
E.g. an example configuration would be 4x1050 (for the given application that is actually the ideal configuration), where each pair of GPUs shares 16 PCIe 3.0 lanes via a PCIe switch (as they need to shuffle around a lot of texture data).
The problem is, as soon as the PCIe bus load rises and contention occurs, frequent TDR watchdog violations are pretty much guaranteed. Via CUDA, the same issue shows up as reproducible "launch errors". The simplest way to reproduce it is a trivial CUDA application that does nothing except copy from shared to device memory and back, concurrently on 2 GPUs sharing the same lanes.
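For reference, a minimal sketch of that kind of repro, assuming the two GPUs behind the same switch are devices 0 and 1 (the transfer size, iteration count, and use of pinned host memory are just placeholder choices here):

```cpp
// Minimal repro sketch: keep the shared PCIe link busy from two GPUs at once.
// Assumes devices 0 and 1 sit behind the same PCIe switch; adjust as needed.
#include <cuda_runtime.h>
#include <cstdio>
#include <thread>

static void hammer(int dev, size_t bytes, int iterations)
{
    cudaSetDevice(dev);  // per-thread device selection

    void *host = nullptr, *device = nullptr;
    cudaHostAlloc(&host, bytes, cudaHostAllocDefault);  // pinned ("shared") host memory
    cudaMalloc(&device, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < iterations; ++i) {
        // Copy host -> device and back, keeping the link loaded in both directions.
        cudaMemcpyAsync(device, host, bytes, cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(host, device, bytes, cudaMemcpyDeviceToHost, stream);
    }

    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess)
        printf("GPU %d: %s\n", dev, cudaGetErrorString(err));

    cudaStreamDestroy(stream);
    cudaFree(device);
    cudaFreeHost(host);
}

int main()
{
    const size_t bytes = 256ull << 20;  // 256 MiB per transfer (arbitrary)
    const int iterations = 1000;

    // One thread per GPU so both transfer streams contend on the switch simultaneously.
    std::thread t0(hammer, 0, bytes, iterations);
    std::thread t1(hammer, 1, bytes, iterations);
    t0.join();
    t1.join();
    return 0;
}
```

With only one of the two threads running (no contention on the switch), the same loop completes cleanly; with both, the launch errors / TDR violations described above show up within seconds.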
Tried with different GPU series (Quadro, GeForce), different architectures (Kepler to Turing), the full range of performance classes (GT to top model), and different (workstation) mainboards with different switches (all Intel CPUs with quad-channel memory though), always the same issue.
Keeping the GPUs isolated on non-shared lanes avoids the issue. Dropping from PCIe 3.0 to any lower standard doesn't. Neither does disabling about every power saving option we could think of.
As soon as contention occurs on the switch, blue screens with TDR Watchdog violations are pretty much certain.
Even some extremely basic load, such as having monitors attached to two GPUs and then just playing a hardware accelerated video in VLC, is enough to trigger the issue. The blue screens only happen if the GPU has a monitor attached / the DX API is used on that GPU. If the GPU is only used via CUDA-related APIs, it "only" shows up in the form of the mentioned launch errors, but it still triggers every few seconds when put under full load.
Not a new issue either. It dates back to Windows 7 times, but has become increasingly more frequent and easier to reproduce as the driver continues to be optimized.