The Intel Execution in [2024]

Shifty Geezer · Jul 30, 2024

DegustatoR said:
Sure. Any CPU may fail earlier than you would expect, even those which don't have the issue. It is impossible to say that it definitely would though. And this applies to the affected CPUs too.

But if you know there has been more wear and tear on one part than others, you'd deduce the MTBF to be lower for those parts with more wear. You can't say every CPU with the issue will fail earlier than those without, but you can say in all fairness that there is an increased risk. Not everyone who smokes will get lung cancer, but smoking absolutely increases the risk of getting lung cancer. Not every CPU that's been pushed past ideal operating parameters will fail early, but if the impact was physical degradation, it absolutely increases the risk of any of those processors failing earlier.
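To put rough numbers on the "increased risk, not certain failure" point, here is a minimal sketch. It assumes a constant-hazard (exponential) lifetime model, which is a simplification of real silicon wear-out, and both MTBF figures are hypothetical values picked only for illustration, not measurements of any CPU.

```python
import math


def prob_failed_by(years: float, mtbf_years: float) -> float:
    """Probability that a part has failed by `years`, assuming an
    exponential (constant-hazard) lifetime with the given MTBF."""
    return 1.0 - math.exp(-years / mtbf_years)


# Hypothetical MTBF values, chosen only for illustration (not real data).
BASELINE_MTBF = 50.0  # part run within its rated limits
WORN_MTBF = 25.0      # part that has seen extra wear

for years in (1, 5, 10):
    base = prob_failed_by(years, BASELINE_MTBF)
    worn = prob_failed_by(years, WORN_MTBF)
    print(f"{years:>2} y: baseline {base:.1%}, worn {worn:.1%}")
```

Under this model the worn part is more likely to have failed at every horizon, yet most individual worn parts still outlive the short horizons, which is the same shape as the smoking analogy.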

DegustatoR · Jul 30, 2024

Shifty Geezer said:
you can say in all fairness that there is an increased risk

Sure. But this isn't the same as saying that all these CPUs are "damaged". And it is important to understand that this "risk" isn't quantifiable - it may work 5 years instead of 20, or 20 instead of 50, or 1 instead of 10. All of these are completely different risks to the end users.

Subtlesnake · Jul 30, 2024

DegustatoR said:
Sure. But this isn't the same as saying that all these CPUs are "damaged". And it is important to understand that this "risk" isn't quantifiable - it may work 5 years instead of 20, or 20 instead of 50, or 1 instead of 10. All of these are completely different risks to the end users.

I don't think anyone in this thread tried to claim that all Intel CPUs vulnerable to the bug are damaged. Just that there are forms of damage that are not visible/detectable to the user. If the crashes are a result of a gradual process of degradation, then it doesn't make sense to think that everything is perfectly fine the day before the first crash starts. It would be like claiming that your brain is perfectly fine up to the moment that you start experiencing Alzheimer’s symptoms. To say that Alzheimer’s is a consequence of accumulated damage to the brain, which is made more likely by certain risk factors, isn't to say that brains in general are damaged, or people with those risk factors automatically have damaged brains.

And the risk factor as a whole can be assessed by looking at the failure rate.

DegustatoR · Jul 30, 2024

The point is - we don't even know if the affected CPUs have a higher risk of failing sooner than the non-affected ones. We don't have the data to back up any claim here just yet. It is entirely possible that the CPUs which are fine now will work just fine for years after the m/c update. It is also inevitable that some of them will fail within a year from that point. Without knowing the scale of such failures we can't say anything about how "damaged" they are.

Shifty Geezer · Jul 30, 2024

DegustatoR said:
The point is - we don't even know if the affected CPUs have a higher risk of failing sooner than the non-affected ones.

Cannot know for certain per CPU, sure. Logically though, they do have a higher risk of failure, but as you say it can't be quantified. It might be a 5% increased risk of failure, or 500%, or 0.000005%. The fundamentals of silicon chip engineering tell us that juicing them with more power will increase wear and tear and increase the chances of failure.

DegustatoR said:
We don't have the data to back up any claim here just yet.

I don't think any claim has been made. No-one has said these CPUs will all be dead in 5 years, or the risk of failure is 20x higher than other Intel CPUs, or made any claim to the impact. All that's been said is that there's a physics-induced elevation in risk of failure across the entire range and now the consideration is how much impact that will have and if enough to be of concern.

DegustatoR said:
It is entirely possible that the CPUs which are fine now will work just fine for years after the m/c update. It is also inevitable that some of them will fail within a year from that point. Without knowing the scale of such failures we can't say anything about how "damaged" they are.

You said you weren't going to talk about semantics any more! If I take an angle grinder to my bicycle wheels, they are now damaged. It might be enough to impact the bicycle, or not, but they are damaged. There's nothing wrong with calling the over-voltaged CPUs damaged. Their physical integrity has been compromised against what was intended and what should have happened. The question is whether that damage is enough to matter or not. And if you hate the word 'damage' because of connotations, then just say "increased wear and tear". It doesn't matter what word or phrase is used; the meaning is what matters.

Going back to homerdog's query,

1) If I apply the microcode update can I expect it to be fine for the rest of its life? 2) Has it already been damaged?

Point 2) Has the increased voltage irreversibly affected the physical integrity of the chip? Yes
Point 1) Will the CPU function just fine for the rest of Homerdog's using it? Nobody knows.

DegustatoR · Jul 30, 2024

Shifty Geezer said:
Point 2) Has the increased voltage irreversibly affected the physical integrity of the chip? Yes

You don't know this either. So no, and nobody knows too.

DavidGraham · Jul 30, 2024

All 13th and 14th gen users should reset their CPUs to default specs immediately and then stress test them with a powerful stress benchmark for several hours. Mildly damaged CPUs might not crash on the desktop, during browsing, or while gaming, but they might crash under continuous 100% CPU usage or in a future heavy gaming workload.

We have to consider that the issue didn't originally manifest itself during normal usage. Initially it only manifested during the intensive shader compilation processes that pushed these CPUs to the brink, or near it, for dozens of minutes.
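For anyone who wants to see what "continuous 100% CPU usage with a correctness check" looks like in practice, here is a minimal sketch. It is far lighter than a dedicated stress benchmark and is not a substitute for one. The function names (`hash_chain`, `worker`, `stress`) and the default round count are my own choices for illustration. It also assumes the reference value computed at startup is itself correct, which a faulty CPU does not guarantee.

```python
import hashlib
import multiprocessing as mp
import time


def hash_chain(rounds: int) -> str:
    """Deterministic CPU-bound work: repeatedly SHA-256 a fixed seed."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def worker(args):
    """Redo the same computation until the deadline and count wrong answers."""
    seconds, rounds, expected = args
    deadline = time.monotonic() + seconds
    iterations = 0
    mismatches = 0
    while time.monotonic() < deadline:
        if hash_chain(rounds) != expected:
            mismatches += 1  # same input, different output: computation error
        iterations += 1
    return iterations, mismatches


def stress(seconds: float, rounds: int = 20000) -> int:
    """Load every logical core for `seconds`; return the total mismatch count."""
    expected = hash_chain(rounds)
    cores = mp.cpu_count()
    with mp.Pool(cores) as pool:
        results = pool.map(worker, [(seconds, rounds, expected)] * cores)
    return sum(mismatches for _, mismatches in results)
```

To try it, call something like `stress(seconds=3600)` from inside an `if __name__ == "__main__":` block, so the worker processes can start cleanly on every platform. A non-zero result, or a crash partway through, is the kind of symptom the post above is talking about.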

digitalwanderer · Jul 30, 2024

DegustatoR said:
You don't know this either. So no, and nobody knows too.

Nobody knows yet for certain, but all evidence points to even the chips that are working fine being at high risk of a much more rapid aging process and a shorter life. The microcode update will hopefully slow or prevent further damage, but if the problem is either oxidation or over-volting, the CPUs have already been damaged even if they're not showing symptoms. I'm the dumb one and even I get that; you're too bright not to see it.

You may be right but that would have to mean the root cause is something different entirely and I haven't heard of any alternative explanations that fit the problem so well. Not trying to fight I just don't understand why you're arguing your position as it seems pretty indefensible.

Albuquerque · Jul 30, 2024

DegustatoR said:
No, it can't. If a CPU works without issues then it's not "damaged", period.

Fully incorrect, and it betrays your inability to understand how modern solid state electronics physically function. This is literally a physics certainty and not some random handwaving "everything derates to entropy" opinion. You can absolutely damage a "thing" and it can still work and yet still be damaged; damage is not a binary state. If you want to use a word which reflects a binary state which might be applicable, one word you could use instead is destroyed.

LEDs are an easy to observe example of damaged electronics that still function: you can immediately burn out an LED by significant overvoltage, however it is quite easy to damage an LED by moderate overvoltage. The resultant damage is significantly lower life expectancy, but not complete blackout. The damage is done, whether the light "works" right now or not. And this is not the same as standard degradation, as the overvoltage (in both the CPU case and the example LED case) are beyond the rated limits of the conductors and dielectrics of the solid state electronics.

Higher voltages will absolutely, unequivocally, damage dielectrics and nanometer-scale circuit pathways far beyond what standard use would cause. This isn't wear and tear. It's also been studied for literally decades by EE students and major semiconductor fabs worldwide for exactly the reasons Intel's chips are failing. People with double-doctorates and fabs with literally billions of dollars on the line are the ones who work these problems, and theirs are the professional opinions I will trust on the matter.

Shifty Geezer · Jul 30, 2024

DegustatoR said:
You don't know this either. So no, and nobody knows too.

How is the higher voltage that Intel is changing in the m/c update affecting the processors? Does higher voltage increase the rate of physical changes in silicon transistors? If not, why won't m/c update fix existing crashing CPUs?

The fact the problems cannot be reversed means the changes are physical, the silicon degraded too much. Which means the high voltage degraded the silicon. Which is what you'd expect as higher energy increases 'wear and tear'.

The only alternative I can think of is the existence of manufacturing defects where some chips are susceptible to degradation at higher voltages and others are not. Thus two chips subjected to the same higher voltages respond differently.

entity279 · Jul 30, 2024

Albuquerque said:
to understand how modern solid state electronics physically function. This is literally a physics certainty and not some random handwaving "everything derates to entropy" opinion

As Shifty mentioned, I do think the issue here is mainly one of terminology, though.

The literature, as far as I recall, distinguishes between "defects" on one side and "failures" & "errors" (iirc an error is a more severe brand of failure) on the other side. "Defects" are the underlying cause of failures, but they are not directly observable (while failures can be).

By definition, defects (I guess we've been using "damage" here instead of "defect") exist before failures occur.

homerdog · Jul 30, 2024

Lol my question was

I'm wondering to what extent it's a problem on the lower end CPUs. If I apply the microcode update can I expect it to be fine for the rest of its life? Has it already been damaged?

I got a response about the nature of damage that belongs in a vsauce video so I stopped engaging with that. Everybody understands what damage is.

digitalwanderer · Jul 30, 2024

entity279 said:
As Shifty mentioned, I do think the issue here is mainly one of terminology, though.

The literature, as far as I recall, distinguishes between "defects" on one side and "failures" & "errors" (iirc an error is a more severe brand of failure) on the other side. "Defects" are the underlying cause of failures, but they are not directly observable (while failures can be).

By definition, defects (I guess we've been using "damage" here instead of "defect") exist before failures occur.

"defect"<>"damage"

The defect is what is damaging the chips to the point of failure. Just because the chip hasn't failed and hasn't shown any signs doesn't mean the defect in it hasn't already damaged the chip; it just means it hasn't damaged it enough to reach the point of failure yet.

As mentioned above, this really isn't a semantics issue so much as a physics one, and I'm using the word "damage" for "damage". The defect is responsible for the damage, but the defect is not the damage.

digitalwanderer · Jul 30, 2024

Even Intel admits they're damaged:

Intel Acknowledges Irreversible Damage from Instability in 13th and 14th Gen CPUs, No Recall Planned

Intel has acknowledged significant concerns related to the 13th and 14th Gen CPUs, admitting that users may face irreversible damage due to voltage stability issues, while confirming that no recall will take place.

thinkcomputers.org

Albuquerque · Jul 30, 2024

Yeah, I subscribe to the Digi thought model in that a defect is one of origin / source / manufacturing in this case, and damage is something which occurs after the device has finished manufacturing. I agree with how he described the defect as the algorithm itself which then enabled the later damage to occur.

I also like Entity's callout of errors vs failures, and I think it perfectly demonstrates how indeterminate damage can and will eventually cause problems (in the form of errors or failures) but not all damage results in said errors or failures.

Sounds like we've found the final bottom of that rabbit hole, even with Intel themselves calling it damage. So what's next? I feel like Intel owes a whole lot of people a whole lot of replacement chips, but obviously that's not gonna happen without some sort of class action lawsuit -- in which the lawyers will become rich and the victims will get a $10 gift card

(if you don't get the uber gift card joke, see also: Crowdstrike)

DegustatoR · Jul 31, 2024

digitalwanderer said:
Even Intel admits they're damaged:

Again, Intel "admits" this only for CPUs which are showing issues when doing regular workloads. Not ALL 13/14th series (or 65W+, or whatever).

homerdog said:
I got a response about the nature of damage that belongs in a vsauce video so I stopped engaging with that. Everybody understands what damage is.

You got a bunch of conjecture from those who don't understand what "damage" means. Reality is - no one can answer that question for you, only time.

DavidGraham · Jul 31, 2024

Bad news! Intel to Cut Thousands of Jobs to Reduce Costs, Fund Rebound.

https://twitter.com/x/status/1818399384330989719

arandomguy · Jul 31, 2024

digitalwanderer said:
Even Intel admits they're damaged:

Intel Acknowledges Irreversible Damage from Instability in 13th and 14th Gen CPUs, No Recall Planned

Intel has acknowledged significant concerns related to the 13th and 14th Gen CPUs, admitting that users may face irreversible damage due to voltage stability issues, while confirming that no recall will take place.

thinkcomputers.org

Intel is not acknowledging "damage" at all? If you look at the actual statements from Intel and search for the term "damage", it is nowhere to be found. The word "damage" used in the article is the writer's interpretation.

This shouldn't be surprising though, as officially acknowledging "damage" like that is likely not wording their legal and communications departments would suggest, since there are real liability implications going forward from that type of terminology.

Intel does use the term "degradation", which circles back a bit to the issue at hand and the semantics debate. I know we don't think of it this way, but all chips degrade as power runs through them, and they do have finite lifespans. In practice, as I mentioned earlier, even though the warranty for these products might be 3 years (or less in some cases), I think we've just been conditioned to accept that they will run within spec for much longer than that, and beyond their functionally useful lifetime, despite said degradation from use. Where we have the problem, and I guess the contention, is how much this issue has accelerated said degradation in terms of the lifespan of those CPUs. Given that, is the expectation that a CPU should only last the 3 year explicit warranty timeline, or the much longer implicit timeline from past experience and norms? If the latter, what should that be, and should Intel make accommodations for that?

I'm going to use a hypothetical here: say you were informed that there is a batch of CPUs that were accidentally exposed to excess voltage during factory testing, and we know the exact batch numbers. Would you really be fine buying one with no incentives and the standard 3 year warranty, even if it worked fine out of the box and wasn't DOA? I doubt it; if you were in the no-questions-asked return period you'd return it for sure. Likewise, if you were in the second-hand market for CPUs now, I doubt the relative value of the affected 13th/14th gen CPUs has not gone down, regardless of whether they are currently working within spec or not.

orangpelupa · Jul 31, 2024

DavidGraham said:
Bad news! Intel to Cut Thousands of Jobs to Reduce Costs, Fund Rebound.

https://twitter.com/x/status/1818399384330989719

Goddammit. I'm investing in INTC for the long term. This is not good long term.

Their fabs are already struggling to beat TSMC. They are already using TSMC to make their chips.

How can they get better if their workers are afraid for their job security?

pcchen · Jul 31, 2024

orangpelupa said:
Goddammit. I'm investing in INTC for the long term. This is not good long term.

Their fabs are already struggling to beat TSMC. They are already using TSMC to make their chips.

How can they get better if their workers are afraid for their job security?

Long term, Intel might have to get out of the fab business. It's not what Intel would like, but it might be inevitable.

The problem with Intel's fabs is not whether they are cutting edge or not. They are (or at least were). They are very good at what they do; that's not in question. The problem is cost.
Intel didn't have to worry about cost when their process was way better than anyone else's. However, even in its heyday Intel didn't fab chips for others (they did some small-scale projects for some), and the main reason was cost.
You can easily hide high costs when you are a vertically integrated company. Your margin is the combination of CPU + process + packaging. As long as your CPU makes a lot of money, you can be much more expensive than your competitors in process and packaging. On the other hand, AMD can only make margin on the CPU, as they have to pay TSMC and others for the process and packaging.
This works well even if your process is hugely more expensive than your competitors', as long as your process is competitive performance-wise. However, for Intel there are two emerging problems: first, Intel's process is no longer more advanced than that of competitors like TSMC, and second, Intel's process is much more expensive than competitors'. Combined with the fact that Intel's CPUs no longer dominate the market, suddenly you have a lot of fab capacity that needs filling but not enough CPUs to fill it, and that's a big problem.
Intel is trying to go into the foundry market to help fill the extra capacity, but with uncompetitive cost and performance it's hard to attract customers. Performance they might be able to solve, but cost is something they can't, especially if they insist on building most of their fabs in the US.

So with these in mind, it's quite clear why Intel went into GPUs and AI chips: that's one way to fill up their fabs.
