GPU Stability testing for Nvidia cards - PowerSweep

T2098

Background:
-------------------------------

I'd been working on dialing in the undervolt for my two Ampere cards and was getting irritated by the usual methodology people use for stability testing: manually lock the card to a specific point on the V/F curve, run a stability test, manually change to a different point on the curve, test again, and so forth.

I'd also had to do this manually once in the past, trying to diagnose random instability on a Turing-based card that happened at completely stock settings, but only at intermediate points along the V/F curve that you'd be unlikely to hit during gaming but might hit during regular Windows use of the PC. Figuring out exactly which point was unstable was super irritating and took me forever.

There are 3rd party tools with an 'OC' or 'Undervolt' scanner, like MSI Afterburner, but I've never had much faith in the methods they use to determine what is and isn't 'stable', since stability can be workload dependent. MSI Afterburner's 'OC Scanner' also appears to only test 4 different V/F points and then interpolate between them, which means it can miss bad spots on the curve.

On modern Nvidia cards a game that uses heavy RT / path tracing can be quite unstable even though a heavy rasterization-only game is just fine at the same settings. So I wanted to be able to pick an assortment of workloads (heavy RT loads, heavy GPGPU compute loads, etc.) and have all the operating points on the curve tested automatically under each of them.

So I cooked up a quick PowerShell script that simply sweeps the power limit up and down forever, slowly, throughout the entire usable range, which eventually ends up hitting (for the most part) every point on the V/F curve.
It's still pretty rough, but it works as is, and if anyone thinks this might be helpful/useful I'll keep filling in the gaps.
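
To make the idea concrete, here is a minimal Python sketch of a power-limit sweep. It is not the PowerShell script linked below, just an illustration of the approach. The function names (`sweep_limits`, `set_power_limit`, `run_sweep`) and their parameters are made up for this example. The only real external piece is `nvidia-smi -i <index> -pl <watts>`, which sets the power limit on one GPU. The sketch assumes you start your stress workload yourself and leave it running while the sweep goes on.

```python
import itertools
import subprocess
import time


def sweep_limits(low_w, high_w, step_w):
    """Return an endless iterator of power limits (watts) that ramps from
    low_w up to high_w and back down again, over and over."""
    if step_w <= 0 or low_w >= high_w:
        raise ValueError("need step_w > 0 and low_w < high_w")
    # Ascending leg; high_w is appended so the top of the range is always hit,
    # even when the step does not divide the range evenly.
    up = list(range(low_w, high_w, step_w)) + [high_w]
    # Descending leg: interior points only, so the endpoints are not repeated.
    down = up[-2:0:-1]
    return itertools.cycle(up + down)


def set_power_limit(watts, gpu_index=0):
    """Apply a power limit with nvidia-smi (needs administrator rights)."""
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
        check=True,
    )


def run_sweep(low_w, high_w, step_w, dwell_s,
              apply_limit=set_power_limit, sleep=time.sleep, max_steps=None):
    """Hold each power limit for dwell_s seconds. Runs forever unless
    max_steps is given (handy for a dry run)."""
    for n, watts in enumerate(sweep_limits(low_w, high_w, step_w)):
        if max_steps is not None and n >= max_steps:
            break
        apply_limit(watts)
        sleep(dwell_s)
```

For a dry run that touches no hardware, pass `apply_limit=print` and a small `max_steps`, e.g. `run_sweep(200, 400, 50, 0, apply_limit=print, max_steps=10)`.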



Download Link:
------------------------------
The script contents are here; just copy/paste them into Notepad and save as a file with a .PS1 extension: https://pastebin.com/rsiRjSTH



Requirements:
-------------------------------

1) Windows PC with a modern Nvidia card.
2) A directory containing nvidia-smi.exe must be included in the PATH environment variable (essentially, you should be able to open a command prompt, run 'nvidia-smi', and get status output back).
3) The script must be run as administrator, as nvidia-smi requires admin rights to make any changes to things like power caps.
4) On most systems you'll have to unblock the script or set PowerShell to be allowed to run unsigned scripts: https://learn.microsoft.com/en-us/p...urity/set-executionpolicy?view=powershell-7.4
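
If you want to check requirements 2 and 3 up front, something like the Python sketch below would do it. This is an illustration, not part of the script; `is_admin` and `preflight` are hypothetical names. It relies on `shutil.which` to look up nvidia-smi on PATH and on the Windows `IsUserAnAdmin` shell call for the elevation check. It does not cover the execution policy setting from requirement 4.

```python
import ctypes
import os
import shutil


def is_admin():
    """Return True if the current process has administrator (or root) rights."""
    if os.name == "nt":
        # Windows: ask the shell whether this process is elevated.
        return bool(ctypes.windll.shell32.IsUserAnAdmin())
    # Non-Windows fallback, included only so the sketch runs anywhere.
    return os.geteuid() == 0


def preflight(which=shutil.which, admin_check=is_admin):
    """Return a list of problems found; an empty list means good to go."""
    problems = []
    if which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH")
    if not admin_check():
        problems.append("not running as administrator")
    return problems
```

Calling `preflight()` with no arguments checks the real system; the two parameters exist only so the checks can be swapped out when trying the logic in isolation.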


Notes:
--------------------------------
If you open the script in Notepad, there are some options near the top that you can tweak, for example:

- Changing RunMode to 2 lets you manually input upper and lower bounds for the power levels. Some cards have a really wide range, like my 3090 (100 W to 500 W), and a good chunk of the sweep ends up redundant because the card just sits at FMin or FMax, so sweeping only between 200 W and 400 W makes a bit more sense.

- Changing PowerLimitStep changes the magnitude of each step change (in watts).
- Changing TestDuration changes how long each power level is tested for.
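
As a rough illustration of how the sweep bounds could be worked out from these options, here is a small Python sketch. It is not taken from the script. It assumes that a RunMode of 1 means "use the full range the card reports", which is a guess based on the description above, and all function names are hypothetical. The nvidia-smi query fields `power.min_limit` and `power.max_limit` are real; on cards that report `[N/A]` for them the parsing below would fail.

```python
import subprocess


def parse_limits(csv_line):
    """Parse 'min, max' in watts from one line of nvidia-smi CSV output,
    e.g. '100.00, 500.00'."""
    low, high = (float(value) for value in csv_line.strip().split(","))
    return low, high


def query_card_limits(gpu_index=0):
    """Ask nvidia-smi for the card's allowed power-limit range."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=power.min_limit,power.max_limit",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_limits(out)


def resolve_range(run_mode, card_min_w, card_max_w,
                  manual_min_w=None, manual_max_w=None):
    """Pick the sweep bounds. Mode 1 (assumed): the full card range.
    Mode 2: manual bounds, clamped to what the card allows."""
    if run_mode == 1:
        return card_min_w, card_max_w
    if run_mode == 2:
        if manual_min_w is None or manual_max_w is None:
            raise ValueError("mode 2 needs both manual bounds")
        low = max(card_min_w, manual_min_w)
        high = min(card_max_w, manual_max_w)
        if low >= high:
            raise ValueError("manual bounds leave nothing to sweep")
        return low, high
    raise ValueError("unknown run mode")
```

With the 3090 numbers from above, `resolve_range(2, 100, 500, 200, 400)` gives a 200 W to 400 W sweep, and out-of-range manual values are pulled back to the card's own limits.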


Features still to come:
--------------------------------

- Automated logging to a CSV file, so if the PC crashes, you'll know which V/F point and power level was in use at the time of the crash (among other things like temperature, load, fan speed, etc.)
- A secondary operating mode that works the same way as MSI Afterburner's V/F point 'lock' mode, which locks the GPU at a specific frequency/voltage directly, versus letting the power throttling pull the frequency back automatically.
This has been problematic so far, because it ends up 'locking' the card at low loads to frequencies higher than you'd ever hit organically when using the device, causing it to crash. I'll probably just make this mode interactive and ask the user to specify a maximum frequency to test.
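
For the CSV logging idea, a minimal Python sketch of one way it could work is below. This is not the planned implementation, and `read_sample` and `append_row` are hypothetical names. The query fields are real nvidia-smi `--query-gpu` fields; the graphics clock is logged as a stand-in for the V/F point. Each row is flushed to disk right away so the last line written before a crash should survive.

```python
import csv
import os
import subprocess

# Real nvidia-smi --query-gpu field names; the selection is just an example.
FIELDS = ["timestamp", "power.limit", "power.draw", "clocks.gr",
          "temperature.gpu", "utilization.gpu", "fan.speed"]


def read_sample(gpu_index=0):
    """Read one telemetry sample from nvidia-smi as a list of strings."""
    out = subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [value.strip() for value in out.strip().split(",")]


def append_row(path, row):
    """Append one row to the CSV log, writing the header first if the file
    is new or empty, and force the data to disk before returning."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(FIELDS)
        writer.writerow(row)
        f.flush()
        os.fsync(f.fileno())  # make sure the row survives a hard crash
```

In a sweep loop you would call `append_row("sweep_log.csv", read_sample())` every second or so; after a crash, the last row in the file shows the power limit and clock that were in effect.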

Example output:
--------------------------------

[Screenshot of example output: 1724365672081.png]