The 8th SPE.

There is no reason why a CPU should fail like that; look at the PC world, where servers run 24/7 without their CPUs mystically 'breaking' due to a solar flare on Sirius A. The Cell is just another CPU, the same as every other CPU on the planet - they don't generally break once they've been tested at the plant unless you do something stupid to them (putting the wrong voltage through, seating them incorrectly, ...).

They're hardly immortal. There is a failure curve associated with CPUs, just as there is with any other manufactured good.

CPUs do fail, though they rarely fail within the expected lifetime of the chip (I think desktop CPUs are rated at less than a decade for their mean time to failure).
It helps that most consumers don't buy the thousands of CPUs needed to get a statistical grasp of the failure rate, nor does anyone actually run them for the decades needed for a good fix on the actual life span.
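A quick back-of-the-envelope on that scale effect (the annualized failure rate here is a made-up, illustrative number, not a measured one): one owner basically never sees a failure, while a big fleet sees plenty.

```python
# Back-of-the-envelope: why one owner never gets a feel for the failure rate
# but a large fleet does. The annualized failure rate is an assumed,
# illustrative number, not a measured one.
annual_failure_rate = 0.002   # assumption: 0.2% of CPUs fail per year

def expected_failures(n_cpus, years):
    """Expected failures across a fleet, assuming independent failures."""
    return n_cpus * annual_failure_rate * years

print(expected_failures(1, 5))        # one desktop over 5 years  -> 0.01
print(expected_failures(10_000, 5))   # a 10,000-CPU fleet        -> 100.0
```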

Big purchasers probably do have a better idea of the number of failed CPUs, and the manufacturers have warranty hotlines for more than overclockers and bad home builders.

It's a worsening problem as process geometries shrink, clock speeds climb, and power budgets swell.
Tiny, thin wires and transistors do not stand up forever to the pull of current flow and increasing temperature.
Small defects tend to magnify over time, even if the chip is well taken care of.
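For the electromigration side of that, the usual rule-of-thumb model is Black's equation; the sketch below uses assumed, illustrative constants and only shows the direction of the trend (higher current density and temperature mean shorter lifetime), not real numbers.

```python
import math

# Black's equation for electromigration lifetime:
#   MTTF = A * J**(-n) * exp(Ea / (k * T))
# The constants below are assumed, illustrative values; only the ratio
# between the two cases means anything.
K_BOLTZMANN_EV = 8.617e-5    # Boltzmann constant, eV/K
ACTIVATION_ENERGY_EV = 0.7   # assumed activation energy for the interconnect
CURRENT_EXPONENT = 2.0       # assumed 'n' in Black's equation

def relative_mttf(current_density, temp_kelvin):
    """MTTF up to the unknown prefactor A."""
    return current_density ** -CURRENT_EXPONENT * math.exp(
        ACTIVATION_ENERGY_EV / (K_BOLTZMANN_EV * temp_kelvin))

baseline = relative_mttf(current_density=1.0, temp_kelvin=330)  # older, cooler part
shrunk   = relative_mttf(current_density=2.0, temp_kelvin=350)  # thinner wires, hotter
print(f"lifetime ratio (shrunk / baseline): {shrunk / baseline:.2f}")
```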

All chips will fail after enough thermal cycling and electromigration; we just don't usually keep them that long. However, they don't always fail spectacularly. A chip can go bad long before it stops working: it may just silently corrupt data or produce a bad result once in a trillion cycles on one in ten trillion inputs. A consumer would probably never notice.
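If you read those two odds as compounding (the bad path only gets hit on one in ten trillion inputs, and even then only errs once in a trillion cycles), the arithmetic backs that up; the clock speed here is just an assumed figure.

```python
# Reading the two odds above as compounding. The clock speed is an assumed figure.
clock_hz = 3.2e9        # assumed clock rate
p_bad_cycle = 1e-12     # errs on one in a trillion cycles
p_bad_input = 1e-13     # and only on one in ten trillion inputs
errors_per_second = clock_hz * p_bad_cycle * p_bad_input
years_between_errors = 1 / errors_per_second / (3600 * 24 * 365)
print(f"~{years_between_errors:.1e} years between visible errors")
```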

There are big-tin servers that are a lot more picky, and they do see a much higher number of reject CPUs than a desktop would.
 
Yes, thank you.

To exemplify what you say: our ops guy here at XXX company (no, not porn) says that servers which have run 24/7 under constant load for a long time, once powered down and allowed to get cold, won't come back up. The chips have been deformed by wear and tear and high temperatures, and can no longer function at normal (room) temperature.
 
Naah. 8 SPEs will never happen on a PS3. If an SPE doesn't need to be redundant for yield reasons, Sony will just integrate something else like chipset logic or hardware required for BC in the place the SPE used to be.
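On the yield angle, the value of the spare SPE is easy to see with a toy defect model; the per-SPE defect probability below is a made-up illustrative number and ignores defects elsewhere on the die.

```python
from math import comb

# Toy yield model: each SPE independently has some chance of a killer defect.
# The probability used here is purely illustrative, and defects outside the
# SPEs are ignored.
p_spe_defect = 0.10   # assumed chance that a given SPE on a given die is dead

def p_at_least_good(total_spes, needed_good):
    """Probability that at least `needed_good` of `total_spes` SPEs are defect-free."""
    p_good = 1 - p_spe_defect
    return sum(comb(total_spes, k) * p_good ** k * p_spe_defect ** (total_spes - k)
               for k in range(needed_good, total_spes + 1))

print(f"need all 8 of 8 good: {p_at_least_good(8, 8):.1%}")   # ~43%
print(f"need 7 of 8 good:     {p_at_least_good(8, 7):.1%}")   # ~81%
```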
 
All chips will fail after enough thermal cycling and electromigration; we just don't usually keep them that long. However, they don't always fail spectacularly. A chip can go bad long before it stops working: it may just silently corrupt data or produce a bad result once in a trillion cycles on one in ten trillion inputs. A consumer would probably never notice.

That's a pretty grim scenario. I thought these days most consumer-grade ICs of the magnitude of a CPU had internal test blocks which, while of course not impervious, would still test (some of) the different functional units (not necessarily at a CPU 'unit' level) for the most common signs of breakdown. Of course, that's all at the die level; if a packaging pin bends, god forbid, you're on your own.
 
Most test logic, to my knowledge, is pretty much always relegated to factory testing; it's not really meant to catch problems outside of a lab with the proper equipment to analyze the results.

The CPU has no real way of knowing what the right answer is supposed to be in the case of data corruption. In the manufacturing lab, the testing hardware knows what signals are supposed to result from a huge run of test inputs. Without that outside reference, one combination of signals is as good as another.
It's like asking a camera to take a picture of itself without a mirror.

If there's a major short, the chip's not even functioning enough to test, making test logic irrelevant anyway.

It would kill performance if they tried, since the chip would have to stop every time it calculated something and then compare it to some unbelievably huge and slow table of results to see if it was right.
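To make the 'outside reference' point concrete: factory test is basically the sketch below, with the expected responses precomputed on the tester, which is exactly what a chip can't carry around for every computation it does at runtime. Everything here is hypothetical and simplified.

```python
# Hypothetical sketch of golden-reference factory test: the tester, not the
# chip, knows the expected response to each stimulus. Vectors and names are
# made up for illustration.
GOLDEN_VECTORS = [
    # (stimulus, expected_response) pairs precomputed from a known-good model
    (0x0003_0005, 0x0000_000F),   # 3 * 5 -> 15
    (0x0007_0006, 0x0000_002A),   # 7 * 6 -> 42
]

def device_under_test(stimulus):
    """Stand-in for the silicon's response to a test pattern (here, a multiplier)."""
    a, b = stimulus >> 16, stimulus & 0xFFFF
    return a * b

def run_factory_test():
    for stimulus, expected in GOLDEN_VECTORS:
        if device_under_test(stimulus) != expected:
            return False   # part gets binned as a reject
    return True

print("PASS" if run_factory_test() else "FAIL")
```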

Some chips do have sensors that detect current and voltage levels, or will. Intel's Montecito had Foxton planned, and Power6 will likely have something similar. Those are intended to control clock speeds based on power draw, but they could in principle pick up electrical problems. In practice that's not very helpful, because the sensors aren't measuring down at the level they'd need to, and at most they'd know something was wrong somewhere in the system's wiring, not necessarily in the chip itself.
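Just to show what that kind of sensor loop is actually for (this is not Foxton's or Power6's real algorithm; every number and name here is an assumption), it manages frequency against a power budget, which tells you nothing about whether a particular transistor has gone bad:

```python
# Rough sketch of a power-driven clock control loop, in the spirit of what
# such on-die sensors are used for. Not Intel's or IBM's actual algorithm;
# every number and name here is an assumption for illustration.
POWER_BUDGET_W = 100.0
FREQ_MIN_MHZ, FREQ_MAX_MHZ = 1200, 2000
STEP_MHZ = 100

def next_frequency(current_mhz, measured_power_w):
    """Step the clock down when over budget, back up when there's headroom."""
    if measured_power_w > POWER_BUDGET_W:
        return max(FREQ_MIN_MHZ, current_mhz - STEP_MHZ)
    if measured_power_w < 0.9 * POWER_BUDGET_W:
        return min(FREQ_MAX_MHZ, current_mhz + STEP_MHZ)
    return current_mhz

freq = 2000
for power in (110.0, 105.0, 95.0, 80.0):   # pretend sensor readings
    freq = next_frequency(freq, power)
    print(f"power={power:5.1f} W -> clock {freq} MHz")
```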

The closest you'd get is what IBM does with its Z series processors, which have two processor cores that work in tandem on the same instruction. If they disagree, they retry to see if it was just a transient error. If they fail again, the chip takes itself offline and tells the mainframe to assign the software to some other portion of the system.
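A software analogue of that lockstep scheme, just to illustrate the compare / retry / take-offline flow (the real mechanism lives in hardware; the names here are made up):

```python
# Software analogue of lockstep execution with retry and fail-over. The real
# z-series mechanism is in hardware; this just mirrors the
# compare -> retry -> take-offline flow described above.
class SparedOutError(Exception):
    """Raised when the 'processor' gives up and asks to be taken offline."""

def lockstep_execute(op, a, b, unit_a, unit_b, max_retries=1):
    for _attempt in range(max_retries + 1):
        result_a = unit_a(op, a, b)
        result_b = unit_b(op, a, b)
        if result_a == result_b:
            return result_a              # both copies agree -> commit
        # disagreement: assume it was transient and retry
    raise SparedOutError("persistent mismatch; reassign work to another CPU")

# Toy 'execution units' -- the second one is deliberately broken for 'mul'.
def good_unit(op, a, b):
    return a + b if op == "add" else a * b

def bad_unit(op, a, b):
    # simulate a stuck bit on the multiplier path
    return good_unit(op, a, b) ^ (1 if op == "mul" else 0)

print(lockstep_execute("add", 2, 3, good_unit, good_unit))   # 5
try:
    lockstep_execute("mul", 2, 3, good_unit, bad_unit)
except SparedOutError as err:
    print("fail-over:", err)
```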
 
Naah. 8 SPEs will never happen on a PS3. If an SPE doesn't need to be redundant for yield reasons, Sony will just integrate something else like chipset logic or hardware required for BC in the place the SPE used to be.

I don't think 8 SPEs is going to happen either, but I also doubt whether the lines will in the future be fabbing 7-SPE Cells, or Cells with replaced logic. I guess it just depends on what the Cell landscape looks like five years out from now, but if Cell does gain strong traction in the industry - and it hasn't been doing too badly for itself so far - then I think there could remain advantages in using derivatives of the 'mainstream' Cell offerings and simply masking an SPE off. This would allow Sony's, IBM's, and whoever else's fabs to dynamically adjust production loads towards different delivery targets, rather than having one, two, or however many lines dedicated to a specialized PS3 niche.
 