GPU ASICs / PCB preventing failures

sebeng · Jun 19, 2010

We are using a large number of GPU cards for intensive calculations and are increasingly concerned about the reliability of these cards.

Last year, we deployed many hundreds of systems based on the GT200 GPU (nVidia) and are now facing a much higher failure rate than expected. Many of the cards that fail have run for less than 6 to 12 months. The typical problems we see are:

Computer will not boot (BIOS won't even POST)
Computer hangs with corrupted video when 3D rendering is started
Or, more sporadically, a driver crash every few hours of intense usage

We see a lot more failures from the GPU cards than from the motherboard and CPU.

We are now looking at integrating Fermi GPU cards and are very concerned about reliability. What can we do to help reduce the failure rate? What are the most important causes of GPU card failures?

Operating temperature: How much would a 10-degree temperature reduction improve the overall MTBF?
Thermal cycling: Is it better to let the system run all night instead of shutting it down, to avoid thermal cycling?
GPU / memory clocks: Should we consider reducing clocks to increase MTBF at the cost of overall performance?
Card screening: Are there valid tests that we can run in production to identify the weaker cards and reject them?

It would be interesting to hear from engineers working in ASIC and PCB design and understand how these high-end GPU cards are designed with respect to reliability compared to CPUs and motherboards.

Thanks,

Sebeng.

Jawed · Jun 19, 2010

http://www.behardware.com/articles/773-5/components-returns-rates.html

versus:

http://www.behardware.com/articles/773-2/components-returns-rates.html

indicates that what you are experiencing is "normal", at least for consumer products, with a healthy mix of "bleeding edge" enthusiast components.

Tesla cards have de-rated clock speeds, which in theory improves reliability. They should also be more thoroughly tested during manufacture.

Which GT200 card is currently deployed? Is it something like the Tesla 1060? Honestly I think you should take up the whole question with NVidia.

There has been research on some related topics:

http://forum.beyond3d.com/showthread.php?t=54676

and you'll find there's software called MemtestG80:

https://simtk.org/home/memtest/

Jawed

Grall · Jun 28, 2010

Jawed said:
Honestly I think you should take up the whole question with NVidia.

Honestly, I don't expect anyone to get any straight, non-spun answers out of them.

After all, this is the company that was boldly lying through their teeth all throughout the entire bumpgate incident just to mention one example. Nvidia has a history of staggering dishonesty on a multiplicity of levels.

If the choice exists for them to either admit that their Fermi Tesla cards aren't exactly the most reliable PC components ever made (true) and lose a sale to another hardware provider, or do a little song and dance to win the contract... well, what do YOU think they'll do???

sebeng · Jun 30, 2010

Can some ASIC designer shed some light on the impact of temperature, clocks and thermal cycling on an ASIC failure rate?

Are there known metrics that describe the impact of each of these variables?
For example, would lower clocks help in reducing ASIC failures, or is it only the ASIC operating temperature that really matters?

Ozo.

mczak · Jun 30, 2010

sebeng said:
Can some ASIC designer shed some light on the impact of temperature, clocks and thermal cycling on an ASIC failure rate?

Are there known metrics that describe the impact of each of these variables?
For example, would lower clocks help in reducing ASIC failures, or is it only the ASIC operating temperature that really matters?

I could be wrong (and I'm not an ASIC designer), but I believe clocks per se have no impact on ASIC failure rate. Of course, higher clocks (cooling etc. being the same) also imply higher temperature, and might need higher voltage. Temperature certainly has an impact, but I guess it's highly non-linear. Voltage (and current) also has an impact (apart from temperature, high voltage might cause electromigration).
Not sure if thermal cycling has any impact on the ASIC itself, but it certainly causes stress on the packaging.
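
For a rough feel of how non-linear the temperature effect can be, reliability engineers often use the Arrhenius model, in which the failure rate scales with exp(-Ea / kT). The snippet below is only a sketch under stated assumptions: the 0.7 eV activation energy and the 85 and 75 degree C die temperatures are illustrative values chosen for the example, not measurements for GT200, Fermi, or any other card, and real cards fail through several mechanisms that do not all follow this model.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(t_hot_c, t_cool_c, activation_energy_ev=0.7):
    """Ratio of the modelled failure rate at t_hot_c to that at t_cool_c.

    Temperatures are in degrees Celsius. The default activation energy
    is an assumed, illustrative value, not vendor data.
    """
    t_hot_k = t_hot_c + 273.15
    t_cool_k = t_cool_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / t_cool_k - 1.0 / t_hot_k
    )
    return math.exp(exponent)


# Hypothetical example: cooling the die from 85 C to 75 C.
factor = arrhenius_acceleration(85.0, 75.0)
print(f"Modelled failure-rate ratio (85 C vs 75 C): {factor:.2f}")
```

With these assumed numbers the model predicts a failure rate roughly 1.9 times higher at 85 C than at 75 C, which matches the common rule of thumb that a 10-degree drop roughly doubles the expected life. A different activation energy changes the answer substantially, so treat the result as an order-of-magnitude guide only.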

MrBelmontvedere · Jun 30, 2010

I don't have any hard numbers to back me up, but most people with much familiarity with GPUs (or basically any electronic component) will tell you there is a correlation between temperature and failure rate. Changing the temperature = changing the laws of physics.
