We are using a large number of GPU cards for intensive calculations and are increasingly concerned about the reliability of these cards.
Last year, we deployed many hundreds of systems based on the NVIDIA GT200 GPU and are now seeing a much higher failure rate than expected. Many of the cards that failed had been running for less than 6 to 12 months. The typical problems we see are:
- Computer will not boot (BIOS won't even POST)
- Computer hangs with corrupted video when 3D rendering is started
- Sporadic driver crashes every few hours under intense usage
We are now looking at integrating Fermi GPU cards and are very concerned about reliability. What can we do to help reduce the failure rate? What are the most important causes of GPU card failures?
- Operating temperature: How much would a 10-degree temperature reduction improve the overall MTBF?
- Thermal cycling: Is it better to leave the systems running overnight instead of shutting them down, to avoid thermal cycling?
- GPU / memory clocks: Should we consider reducing clocks to increase MTBF at the cost of overall performance?
- Card screening: Are there valid tests that we can run in production to identify the weaker cards and reject them?
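To frame the temperature question, here is a minimal sketch of the Arrhenius acceleration model, which is commonly used as a rule of thumb for temperature-activated failure mechanisms. The activation energy (0.7 eV) and the 80 / 70 degrees Celsius operating points are assumptions picked for illustration, not values measured on our cards, and the function name is hypothetical.

```python
import math

# Boltzmann constant in eV/K
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration(t_hot_c, t_cool_c, activation_energy_ev=0.7):
    """Ratio of expected lifetime at t_cool_c to lifetime at t_hot_c.

    Temperatures are in degrees Celsius. The default activation energy
    (0.7 eV) is an ASSUMED illustrative value; the real figure depends on
    the dominant failure mechanism and is not known for these cards.
    """
    t_hot_k = t_hot_c + 273.15    # convert to Kelvin
    t_cool_k = t_cool_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_cool_k - 1.0 / t_hot_k
    )
    return math.exp(exponent)


if __name__ == "__main__":
    # Assumed operating points: 80 C today, 70 C after improved cooling.
    factor = arrhenius_acceleration(80.0, 70.0)
    print(f"Estimated lifetime improvement factor: {factor:.2f}")
```

Under these assumptions the model suggests roughly a doubling of lifetime for a 10-degree drop, but it only covers temperature-activated wear-out. It says nothing about thermal cycling fatigue or clock-related failures, which is part of why we are asking.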
Thanks,
Sebeng.