HDD Puzzle: best way to determine which one is on its last legs?

orangpelupa

Elite Bug Hunter
Legend
Situation
  • I have 8 HDDs spread into 4 blocks/groups.
  • 1 group (containing 2 HDDs) makes clicking and spin-up/spin-down sounds about 4 times before it stabilizes and operates normally.
So, what's the best way to determine which one is on its last legs?

What I have in mind
  1. Power off the HDD group that's making the problematic sounds.
  2. Unplug 1 HDD and power on. If it's no longer clicking, then the one I unplugged is the problematic one.
  3. Plug the problematic HDD in again and do a full backup.
But that means the problematic HDD will get at least 2 power cycles. If possible, I want to avoid power cycling it, as I'd be risking the HDD dying completely.

EDIT:

Uh.. I'll just back up both HDDs, to be extra safe.

Seagate has been contacted with the SMART info of the HDD, and they say I can RMA it to them (like.. they replied less than 15 minutes after I emailed them lol. so fast!). WD still hasn't responded.
 

BRiT

(>• •)>⌐■-■ (⌐■-■)
Moderator
Legend
Alpha
In general the procedure to figure this out on your own is:

1. Run SMART short tests on all drives and look at the SMART results afterwards.
2. Run SMART Extended tests on all drives and look at the SMART results afterwards (can take 8 to 12 hours or more depending on drive size)

If a drive fails the SMART short tests, it's dead or will be shortly.
Running SMART Extended tests can sometimes detect read errors or other indicators of soon to fail issues.
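For anyone who wants to script those two steps, here is a minimal sketch. It assumes smartmontools is installed and that the drives show up under names like /dev/sda (a placeholder, adjust for your system). The helper names are made up for illustration; the only real pieces are the standard `smartctl -t short`, `smartctl -t long` and `smartctl -l selftest` commands.

```python
import subprocess


def selftest_command(device, kind="short"):
    """Build the smartctl command that starts a self-test on one drive."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    # 'long' is smartctl's name for the extended self-test
    return ["smartctl", "-t", kind, device]


def selftest_log_command(device):
    """Build the smartctl command that prints a drive's self-test log."""
    return ["smartctl", "-l", "selftest", device]


def run(cmd):
    """Run a command and return its text output (smartctl usually needs root)."""
    # smartctl sets non-zero exit bits for warnings, so do not raise on them
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

Typical use would be `run(selftest_command("/dev/sda"))` to start the short test, then `run(selftest_log_command("/dev/sda"))` once it has finished to read the result. The self-test runs inside the drive, so the first command returns right away.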

There are also certain SMART fields that are indicators of early failures. (Will post more details later, but pretty sure I've posted about them before)
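As a rough illustration of checking such fields, the sketch below scans the attribute table printed by `smartctl -A` and reports watched attributes whose raw value is above zero. The watch list is an assumption on my part, not the list promised above: it uses the five attributes (IDs 5, 187, 188, 197, 198) that Backblaze has publicly said it monitors. The sample table is made up and is not from the drives in this thread.

```python
# Assumed watch list: attribute names as smartctl prints them (IDs 5, 187, 188, 197, 198)
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Command_Timeout",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def flag_attributes(smartctl_a_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smartctl_a_output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        name, raw = parts[1], parts[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Made-up sample in the column layout that `smartctl -A` uses
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       12345
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
"""

print(flag_attributes(SAMPLE))
# prints {'Reallocated_Sector_Ct': 8, 'Offline_Uncorrectable': 2}
```

Raw values are vendor-specific (some drives pack several counters into one number), so a non-zero hit is a reason to look closer, not proof of failure.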
 

orangpelupa

None of them fails the short self-test, but the statistics are concerning. I've sent screenshots of the statistics to WD and Seagate. So far, only Seagate has replied (and told me to RMA it to them), even though the Seagate seems to be the healthier one (just 1 concerning stat, while the WD has multiple).

EDIT
Forgot to mention that the WD is still running the extended self-test, at 10% after hours. I haven't tried the extended self-test on the Seagate yet, as I'm still in the process of copying it to another HDD.
 

Attachments: seagate Screenshot 2022-07-20 102513.jpg (46.7 KB), wdpurz.png (18.2 KB)

orangpelupa

WD has responded, and they also say to RMA the HDD to them and they will replace it.

I wonder why my HDDs have had a higher failure rate than my SSDs. Simply unlucky with bad batches?
 
There are only 2 types of HDDs -- those that have already failed and those that have not yet failed. All HDDs will fail. It's in their nature with mechanical parts.
Same applies to ssd's as well though, in their case it is just a matter of write cycles...

FWIW last week I reached a milestone, I have had to replace more SSDs due to failure than HDDs now in my life. My Frankenstein's server (the end point of most outdated components) started developing IO errors on boot partition which was a 64GB Kingston SSDNow. None of the HDDs in it have failed so far in the last 15 years, even the infamous 60GB IBM DeathStar from 2001 (which is in fact an RMA replacement I got under warranty period) is still ticking (in the good sense of the word).

BTW yesterday I finally decided to pile up the most obsolete stuff to free up physical storage space on shelves. Funny how a 545.5MB ST3660 used to seem spacious at the time...
 

BRiT

I want to print that out, and put it in a frame.

It's very profound

You left out the last 2 and most important words -- "yet failed". It's a vastly different meaning when you chop it short. With those final two words it implies that all hard drives fail, which is the original intent, i.e. hard drives that failed and hard drives that will fail.
 

orangpelupa

Wheeee! Just my luck as usual ROFL.
I ordered 2 new WD Purple HDDs: 1 working fine, and 1 dead on arrival.

Now I've ordered 1 more at a ~50% higher price (as the "discount days" are over), while hoping the return for the dead HDD goes fine.

Same applies to ssd's as well though, in their case it is just a matter of write cycles...

FWIW last week I reached a milestone, I have had to replace more SSDs due to failure than HDDs now in my life. My Frankenstein's server (the end point of most outdated components) started developing IO errors on boot partition which was a 64GB Kingston SSDNow. None of the HDDs in it have failed so far in the last 15 years, even the infamous 60GB IBM DeathStar from 2001 (which is in fact an RMA replacement I got under warranty period) is still ticking (in the good sense of the word).

BTW yesterday I finally decided to pile up the most obsolete stuff to free up physical storage space on shelves. Funny how a 545.5MB ST3660 used to seem spacious at the time...

Whoa whoa, that 60GB HDD is probably immortal or something hahaha. Truly an astonishing statistical anomaly.

Hmm. With HDDs, failures are usually easy to observe: weird sounds, very slow access times, etc. But with SSDs, is there any "intro" before one develops IO errors and goes completely kaput? My SanDisk microSD card simply went into read-only mode, but its performance didn't (noticeably) degrade.

Btw I remember I used to have a 100 or 200MB HDD, and it was huge and made things very fast (before that I was using diskettes).

You left out the last 2 most important words -- "yet failed". It's vastly different meaning when you chop it short. With those final two words it implies that all hard drives fail, which is the original intent. ie: Hard drives that failed and hard drives that will fail.

Interesting that they mean different things in English.

In Indonesian they are the same; the difference is that the one with "yet failed" results in more "official-sounding" Indonesian, like what you'd use in official business.

This reminds me of the headaches of a translator's job. Or the amusing "TL notes" in unofficial manga/anime English translations haha
 

BRiT

Maybe a better English phrase that translates easier would be:

"There are two types of hard drives, those that failed, and those that will fail."

The Translator Notes are something that official anime could use, as sometimes it provides more context around the scene or intent.
 
I didn't look at your screen shots until now, the Seagate operational hour counts are obviously nonsense - the tool is not reading the data right.

For comparison, the old 60GB and 150GB disks in the Frankenstein's server have seen a little under 10 and 9 years of active time, respectively.
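A quick way to sanity-check a reported power-on-hours figure is to convert it to years and compare it with how long the drive can possibly have existed. This is only a sketch of that arithmetic; the function names and the example numbers are made up for illustration and are not taken from the screenshots.

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours in an average year


def power_on_years(hours):
    """Convert a SMART power-on-hours count to years of continuous running."""
    return hours / HOURS_PER_YEAR


def looks_plausible(hours, drive_age_years):
    """A drive cannot have been powered on for longer than it has existed."""
    return 0 <= power_on_years(hours) <= drive_age_years


print(power_on_years(87660))       # prints 10.0
print(looks_plausible(500000, 5))  # prints False (about 57 years on a 5-year-old drive)
```

If the figure fails this check, the more likely explanation is that the tool is decoding the raw value wrongly, not that the drive is actually that worn.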
 

orangpelupa

Yeah, the SMART data is ridiculously high. Super high hours and high priority unload.

Not sure whether the HDD itself is fine or not. It was finally able to finish the extended SMART test.

It's the WD that fails the extended SMART test.
 

arandomguy

Regular
WD mixes SMR and CMR as well. Regardless of the manufacturer or model line, I'd check to make sure what you are getting. Funny enough, even though SMR functionally provides (and is marketed as providing) higher capacity, we know it's actually for company margins, which is why basically all the lower-capacity drives are the ones using SMR.

Something I found recently with WD is that their 6TB and higher drives apparently all now employ an "aggressive" preventative/preemptive wear leveling function that causes a regular thumping noise. Now I'm afraid to buy one due to the noise.

I'm curious, are you buying individually shipped drives? If so, how are they packaged and delivered? I try to avoid buying shipped drives, and if they come in anything other than HDD-specific packaging (not just generic padding/bubble wrap, or worse), they go back. This is overly paranoid, but even when buying drives from a store I bring bubble wrap to put them in and lay them flat to bring home.

Also, how are they installed? I have only ever had drives installed flat, on spaced trays with damping. Not sure if that helps the overall failure rate as well, given the vibration factor.

I also pretest every drive: at least 1 complete read->write->read full pass, spaced out over 2-3 days (so far no failures).

So far only my oldest drives have suffered real failures; we're talking 10+ years old by now. I actually have a Hitachi drive that caught fire (the power connector is partially melted off; beware Molex->SATA adapters) and has dead sectors (it hasn't developed new ones in 2? 3 years?), and it's still in use. The maximum recorded temperature reading on it is an impressive 55C (typical is 30C).
 

orangpelupa

I bought them individually from the official stores.

2 WDs were packed unsafely in plain cardboard boxes: 1 DOA, 1 works fine. I bought 1 more WD, this time packed with lots and lots of bubble wrap. DOA.

When I RMA'd my Seagate to Seagate, I packed it in a wooden box as the outer layer, with bubble wrap inside it, and a double anti-static bag on the HDD itself. Dunno how Seagate will ship my replacement HDD.

I installed them vertically.

I also have 3 HDDs installed horizontally. 1 is a Toshiba/Hitachi that got bad sectors and a high head failure count but still works (slowly), and it's out of warranty :(
 

orangpelupa

The replacement HDD from Seagate has arrived with zero issues. The replacement is a recertified HDD from Seagate with the same model. Now copying data to it, and I'll be running a SMART extended test on it.

EDIT:
It is noisier tho. Like.. I can hear it when it starts and when it stops/idles.

EDIT2:
It's slow. It only goes around 100MB/s max.

I wonder if they replaced my CMR HDD with an SMR HDD even though the model number stays the same.
 

BRiT

If you're gauging speed from data copy to it from your troubled drives, the limiting factor is likely the troubled drives.

As for SMR drives, you will see normal performance until you hit the write threshold at which point you will know it as it will tank far below the speeds you're seeing now, like 10 MB/s area.
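If you log write speed over a long sequential copy (for example, one MB/s reading per fixed-size chunk), that kind of cliff can be spotted automatically. The sketch below is a hypothetical helper, not a standard tool: it takes the median of the first few samples as the baseline and reports the first later sample that falls below a chosen fraction of it. The parameter defaults and the sample numbers are assumptions for illustration.

```python
from statistics import median


def find_throughput_cliff(samples_mb_s, baseline_count=5, drop_ratio=0.25):
    """Return the index of the first sample below drop_ratio * baseline, or None.

    The baseline is the median of the first baseline_count samples.
    """
    if len(samples_mb_s) <= baseline_count:
        return None  # not enough data to compare against a baseline
    baseline = median(samples_mb_s[:baseline_count])
    for i in range(baseline_count, len(samples_mb_s)):
        if samples_mb_s[i] < baseline * drop_ratio:
            return i
    return None


# Made-up readings: steady ~100 MB/s, then a drop to ~10 MB/s
readings = [100, 98, 102, 99, 101, 100, 97, 12, 10, 11]
print(find_throughput_cliff(readings))  # prints 7
```

A steady 100 MB/s with no cliff would return None, which fits the point above that the source drive is the more likely bottleneck in that case.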
 