NVIDIA shows signs ... [2008 - 2017]

Changing the packaging one month into the life of G92b really does look like a smoking gun.

Jawed


Or it's meant to reassure suddenly skittish OEMs. What do you suppose those OEMs said to them once they explained what the cause of the problem was?

If it was me as head of OEM buying, I'd have said something like "you understand I'm not buying any future models from you manufactured in such a way, right?" Once burnt, twice shy, etc.
 
Is it cheaper to use the same underfill and packaging tech for everything instead of running two packaging processes in parallel?
 
Or it's meant to reassure suddenly skittish OEMs. What do you suppose those OEMs said to them once they explained what the cause of the problem was?

If it was me as head of OEM buying, I'd have said something like "you understand I'm not buying any future models from you manufactured in such a way, right?" Once burnt, twice shy, etc.
Yeah I can imagine that. Further, NVidia may have been upfront about it, offering this rather than waiting for customers to demand it.

Still it seems bloody strange to make the change after G92b appeared. G92b was late, too, and still managed to appear in both packagings. WTF.

Jawed
 
More from Charles. Quite interesting actually, as it's a simplified technical explanation of exactly what is going wrong and why.

The defective parts appear to make up the entire line-up of Nvidia parts on 65nm and 55nm processes, no exceptions. The question is not whether or not these parts are defective, it is simply the failure rates of each line, with field reports on specific parts hitting up to 40 per cent early life failures. This is obviously not acceptable.
And here's the second part.

So, the failure chain happens like this. NV for some unfathomable reason decides to design their chips for high lead bumps, something that was likely decided at the layout phase or before because the bump placement is closely tied to the floorplan. At this point, they are basically stuck with the bump type they chose for the life of the chip.

The next choice was the underfill materials, and again, they chose the known low Tg part that had far less tolerances than the newer to the market high Tg materials. It was a risk vs risk proposition, likely with a lot of cost differences as well. They chose wrong, very wrong. The stiffness of the Namics material might be perfect below the Tg, but once you hit it, it is almost like it isn't there, and the stress transfers to the bumps while they are hot and weak.

Let's go down the checklist for Nvidia. High thermal load? Check. Unforgiving high lead bumps? Check. Eutectic pads? Check. Low Tg underfill? Check. Hot spots that exceed the underfill Tg? Check. If you are thinking this looks bad, you are right, expensive too.
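To get a feel for the kind of stress that lands on the bumps once the underfill stops doing its job, here's a rough first-order sketch. All the numbers (CTEs, die size, bump height, temperature swing, reduction factor) are my own generic assumptions, not figures from the article:

```python
# Rough first-order estimate of the shear strain on a corner bump when the
# underfill no longer carries load (i.e. above its Tg). All values below
# are generic assumptions for illustration, not Nvidia's actual package data.

alpha_die = 2.6e-6        # CTE of silicon, 1/K
alpha_substrate = 17e-6   # CTE of an organic substrate, 1/K (assumed)
delta_T = 70.0            # idle-to-load temperature swing, K (assumed)
dnp = 9e-3                # distance from die centre to a corner bump, m (assumed)
bump_height = 80e-6       # bump standoff height, m (assumed)

# Differential expansion at the corner bump if the bumps alone must absorb
# the die/substrate mismatch:
mismatch = (alpha_substrate - alpha_die) * delta_T * dnp
shear_strain = mismatch / bump_height
print(f"Corner-bump shear strain with no underfill support: {shear_strain:.1%}")

# A stiff underfill (below its Tg) couples die and substrate so the joints
# see far less of this strain; the reduction factor here is just assumed.
print(f"With an effective underfill (assumed 10x reduction): {shear_strain / 10:.1%}")
```

With those made-up but plausible numbers the unsupported corner bump sees strain in the ten-percent range, which is why underfill is mandatory on organic substrates in the first place.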
 
So NV have changed both bump & substrate materials.
Question is how common is that sort of thing normally?
 
Assuming Charlie has got the full story right, when you look at it as a whole, you can just see mistake after mistake with the benefit of hindsight. Each part of the chip is built upon the shaky foundation of some other part that can't handle the temps, including the parts that are designed to guard against exactly these kinds of thermal stresses. High voltages and heat going on and off in all different parts of the chip create massive thermal stresses, and they are just not handled properly in the design and materials decisions that were taken.

You can see how lots of design decisions that made sense individually sort of all piled up into a big fat road crash. They needed the voltage so they used ultra high lead bumps. The eutectic pads were probably cheaper and made better sense, but were a material mismatch to the lead bumps due to the voltages and heat Nvidia decided to run. The low-Tg underfill was understood and widely available, but not adequate for the heat generated by the clocks Nvidia decided to run to compete with AMD and keep margins high, and so it failed to support the chip and its connections when heated.

All laid out like this, it's a total face-palm.
 
I'm puzzled with this:

Charlie said:
To be fair to Nvidia, about the time when the G84 and G86s were hitting the market, high Tg underfills were pretty rare and new to the market. Low Tg underfills, such as the Namics material that NV used, had been available for a while, and were 'known'.


Now, if they were pretty rare at the time, shouldn't he have checked to see if others like, say..., ATI, were using those kinds of defective materials?
Isn't this attack pretty much aimed at Nvidia, without even verifying industry-wide use of those materials, and comparing them against those manufacturers' RMA/failure rates? Think about it, it's completely one-sided hate.

I also question the sudden "high end technicality" of Charlie's writings as of late (more specifically, ever since the defect stories started ventilating out of the Inquirer).
He's a man that never wrote a single article with any shred of insight into the actual facts, and now, suddenly, he's an electronics expert with advanced knowledge to judge what is and isn't "shoddy engineering" on Nvidia's part (I guess all other chip designers just build bug-free, defective-material-free semiconductors).
Even more specifically, how many articles has he written in the last 6 months for the Inquirer, and how many of those criticize Nvidia in one way or the other?
Frankly, judging from his rants, it's a miracle that that company still holds 60% and 16% of the AMD and Intel chipset market shares respectively, let alone having the majority of the discrete GPU market, instead of ATI...
 
Charlie hating Nvidia is nothing new, in fact it's the complete norm for him. It's to be expected. Why people even care about what he says is beyond me; when he gets something right the other channels will pick it up anyway, so you won't miss out on it.
 
I'm just impressed at how well laid out it is, I mean I'm getting it! :oops:

Contrary to some popular belief, Charlie is actually a good journalist, knows tech, knows how to explain it, and reports things only with multiple reasonable sources. He may be wrong at times, or not have the whole picture, but what he reports does come from the facts available.

That's not to say that he can't be fed bad facts, but in general he has a pretty good track record.

If everything he reported is true, the current Nvidia situation is a massive blunder of epic proportions. All this stuff is spec'd out fairly clearly by vendors and then condensed down to design rules and requirements. But even then, everywhere that I've ever worked has added additional margin on top of everything to be on the safe side.

In essence, you do not want your package to be your weakest link in things like thermal stress, EM, etc. There are enough dragons on the die itself that it isn't worth adding them on the package.

Part of this comes down to the current trends in the graphics space where the vendors are pushing the performance side of things so hard that they've lost sight of the efficiency and thermal aspects of the designs. We've got graphics cards right now pushing 2x the power of any PC CPU; the CPU guys would love to have power budgets in the 200+ watt range instead of the 45-130 watt range.

It wouldn't surprise me to see a slowdown in the performance increases in the coming years, as the designs have basically already pushed power budgets, and generally die sizes, about as far as they will go.
 
Now, if they were pretty rare at the time, shouldn't he have checked to see if others like, say..., ATI, were using those kinds of defective materials?
Isn't this attack pretty much aimed at Nvidia, without even verifying industry-wide use of those materials, and comparing them against those manufacturers' RMA/failure rates? Think about it, it's completely one-sided hate.

The materials are not defective. They do exactly what they say they do.

Think of it this way: are your 100 MPH rated tires defective because they fail when you go 150 MPH? Or is your graphics card or CPU defective because it doesn't work overclocked 20%?


I also question the sudden "high end technicality" of Charlie's writings as of late (more specifically, ever since the defect stories started ventilating out of the Inquirer).
He's a man that never wrote a single article with any shred of insight into the actual facts, and now, suddenly, he's an electronics expert with advanced knowledge to judge what is and isn't "shoddy engineering" on Nvidia's part (I guess all other chip designers just build bug-free, defective-material-free semiconductors).
Even more specifically, how many articles has he written in the last 6 months for the Inquirer, and how many of those criticize Nvidia in one way or the other?
Frankly, judging from his rants, it's a miracle that that company still holds 60% and 16% of the AMD and Intel chipset market shares respectively, let alone having the majority of the discrete GPU market, instead of ATI...

you mean a reporter gets a story of fairly significant magnitude and actually reports on it? OMG the shame, the crying shame.
 
The materials are not defective. They do exactly what they say they do.

Think of it this way: are your 100 MPH rated tires defective because they fail when you go 150 MPH? Or is your graphics card or CPU defective because it doesn't work overclocked 20%?

I think the problem is Nvidia have been pushing thermal limits for quite a while now. They've been upclocking, allowing "special editions", doing whatever it takes to make sure their cards always come out on top in the benchmarks. Turns out they've gone beyond the limits of the materials, as their chip designs use those materials out of spec. They've ignored thermal and power limits in the pursuit of ever faster clockspeeds in order to compete.

I wonder if there's a disconnect at Nvidia between the design and choice of materials, and a year or two later when someone in marketing decides what clockspeeds they need to beat the competition. Or worse yet, they don't care and think they can get away with failures after a couple of years, and people being forced to buy a new card every two years due to failure is fine.
 
Part three of Charlie's article:

As far as we are able to tell, contrary to Nvidia's vague statements blaming suppliers, there are no materials defects at work here. Every material they used lived up to the claimed specs, and every material they used would have done the job while kept within the advertised parameters. Nvidia's engineering failures put undue stress on the parts, and several failures compounded to make two generations of defective parts. The suppliers and subcontractors did exactly what they were told, Nvidia just told them to do the wrong thing.
 
The materials are not defective. They do exactly what they say they do.

Think of it this way: are your 100 MPH rated tires defective because they fail when you go 150 MPH? Or is your graphics card or CPU defective because it doesn't work overclocked 20%?

Do you overclock your mobile GPU a lot?
Remember, G86 is a low-end GPU (in fact, the lowest end of the entire G8x/G9x era, save for the IGPs), and even it can and will lower its core clockspeed from 450MHz to about 100MHz when not under load. And there's a step in between too.

you mean a reporter gets a story of fairly significant magnitude and actually reports on it? OMG the shame, the crying shame.

Yeah, and "who" exactly got him that story? And has he verified that those failure rates are higher than average for discrete-GPU-equipped laptops?
If he doesn't check both sides of the story, then that just makes him somebody's ***ch, not a "journalist".
And who knows, maybe that "somebody" just happens to be a current and/or future competitor...
 
Do you overclock your mobile GPU a lot?
Remember, G86 is a low-end GPU (in fact, the lowest end of the entire G8x/G9x era, save for the IGPs), and even it can and will lower its core clockspeed from 450MHz to about 100MHz when not under load. And there's a step in between too.

It's not the user that's overclocking these chips, it's Nvidia. Nvidia have them operating outside the spec of the materials used in their construction. In fact it's this very extreme up-and-down clocking in mobile parts that is exacerbating the problem with materials failure because the materials used are not rated for this kind of environment and behaviour.
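For a feel of why the size and frequency of those hot/cold swings matter so much, here's a minimal sketch using a generic Coffin-Manson-style fatigue law for solder joints. The exponent is a textbook-ish ballpark I'm assuming, not measured data for G84/G86 packages:

```python
# Minimal sketch of a Coffin-Manson-style fatigue law for solder joints:
# cycles-to-failure scale roughly as (temperature swing)^-n. The exponent
# here is an assumed, typical ballpark, not measured data for these parts.

def relative_life(delta_T, delta_T_ref=40.0, exponent=2.0):
    """Cycle life relative to a reference swing, N ~ (delta_T)^-exponent."""
    return (delta_T_ref / delta_T) ** exponent

for swing in (20, 40, 60, 80):
    print(f"swing = {swing:>2} K  ->  relative joint life = {relative_life(swing):.2f}x")
```

Under that assumed law, doubling the temperature swing cuts joint life to roughly a quarter, which is why aggressive clock/power cycling on a hot chip is so punishing.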
 
It's not the user that's overclocking these chips, it's Nvidia. Nvidia have them operating outside the spec of the materials used in their construction. In fact it's this very extreme up-and-down clocking in mobile parts that is exacerbating the problem with materials failure because the materials used are not rated for this kind of environment and behaviour.

Then what are they rated for, exactly?
 
Then what are they rated for, exactly?

For instance, the underfill is rated for lower temperatures than Nvidia chips get to. When it gets too hot, the underfill stops supporting the structure of the balls and pads. Because the balls and pads are made of different materials, they suffer more from thermal stressing and migration. This is also due to the design of the chip and the way it operates, ie the spread of power/signal pins, the distribution of the most active, ie hotter, parts of the chip, etc.

The underfill designed to help stop that thermal stressing isn't rated for those high temps, so there's effectively no support layer between the chip and substrate when the chip is hot. When those physical connections fail, the power and signal connections that the balls and pads carry fail with them.

Read the article, it's all there.

It might have been more precise if Aaron Spink's analogy was more along the lines of "What happens when your car manufacturer puts tyres rated for 50 mph on your car, and then sells you the car as being capable of a top speed of 100 mph?" Here, not only is the car being rated and sold to you by the manufacturer at a speed which one of its major components cannot support, but it is being sold to you for use at a speed that everyone will drive at during normal usage anyway.

It's not the fault of the tyre company that makes 50 mph tyres. That product is rated correctly at what it can do. It's the fault of the car maker that puts those tyres on your car, and tells you they are good for well above their actual limits. There's not much surprise that components stressed beyond what they are designed for have a much shorter lifespan and fail in use.

The fault is Nvidia's for putting together a design that can't be supported by its materials, and using materials that can't support the design.
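To put some very rough, made-up numbers on the "no support layer" point: think of the bumps and the underfill as springs in parallel sharing the die/substrate shear load roughly in proportion to stiffness times area. Every modulus and area fraction below is my own generic assumption, not a measured value for the Namics underfill or NV's high-lead bumps:

```python
# Crude parallel-spring picture of load sharing between bumps and underfill.
# All moduli and the area fraction are assumptions for illustration, not
# measured values for the actual materials in these packages.

def bump_load_share(g_bump, g_underfill, bump_area_fraction=0.15):
    """Fraction of the die/substrate shear load carried by the solder bumps,
    assuming load splits in proportion to (shear modulus x area)."""
    k_bumps = g_bump * bump_area_fraction
    k_underfill = g_underfill * (1.0 - bump_area_fraction)
    return k_bumps / (k_bumps + k_underfill)

G_BUMP = 9.0      # GPa, assumed shear modulus of high-lead solder
G_UF_COLD = 3.0   # GPa, assumed underfill shear modulus below its Tg
G_UF_HOT = 0.03   # GPa, assumed underfill shear modulus above its Tg

print(f"Bump load share below Tg: {bump_load_share(G_BUMP, G_UF_COLD):.0%}")
print(f"Bump load share above Tg: {bump_load_share(G_BUMP, G_UF_HOT):.0%}")
```

With those assumed numbers the bumps go from carrying around a third of the load below Tg to essentially all of it above Tg, which is exactly the "it is almost like it isn't there" behaviour Charlie describes.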
 
For instance, the underfill is rated for lower temperatures than Nvidia chips get to. When it gets too hot, the underfill stops supporting the structure of the balls and pads. Because the balls and pads are made of different materials, they suffer more from thermal stressing and migration. This is also due to the design of the chip and the way it operates, ie the spread of power/signal pins, the distribution of the most active, ie hotter, parts of the chip, etc.

The underfill designed to help stop that thermal stressing isn't rated for those high temps, so there's effectively no support layer between the chip and substrate when the chip is hot. When those physical connections fail, the power and signal connections that the balls and pads carry fail with them.

Read the article, it's all there.

It might have been more precise if Aaron Spink's analogy was more along the lines of "What happens when your car manufacturer puts tyres rated for 50 mph on your car, and then sells you the car as being capable of a top speed of 100 mph?" Here, not only is the car being rated and sold to you by the manufacturer at a speed which one of its major components cannot support, but it is being sold to you for use at a speed that everyone will drive at during normal usage anyway.

It's not the fault of the tyre company that makes 50 mph tyres. That product is rated correctly at what it can do. It's the fault of the car maker that puts those tyres on your car, and tells you they are good for well above their actual limits. There's not much surprise that components stressed beyond what they are designed for have a much shorter lifespan and fail in use.

The fault is Nvidia's for putting together a design that can't be supported by its materials, and using materials that can't support the design.

They aren't all falling apart.
For instance, HP has no mention of the dv2500 14.1'' series laptops on their Nvidia GPU RMA support page, even though it has two different variants of the G86 chip (8400M GS -64bit- and 8400M GT -128bit-, both coming with 1.2GHz GDDR3), and it has existed for much longer than the "G86-failure scandal".
How do you explain that?

I'd have thought that a critically hot, mass-produced defective GPU inserted into an old, smaller-than-standard chassis (14.1'' instead of 15.4'') would have been among the first to show any symptoms, and yet...

edit
In fact, none of the dv2500 series or later, the dv6500 series or later, or the dv9500 series or later are listed on that page.
Strange, because all of them come with a choice of Intel integrated graphics, or Nvidia's G86, G84 and later chips.
 