3-drive RAID: 1 drive failed, forced the RAID back online, and a 2nd drive failed.

All within the space of about a day. I have NEVER seen drives fail that close to one another. The client had recently moved to a web-based practice management application, so the server was no longer necessary and was only used for reference every now and again. Dell T410, 3-drive RAID 5.
 
Was the server powered down at some point? I've seen multiple drives fail within a very short period of time when a server that had been running for a long time gets shut down, has time to cool down, and then gets powered back up and heats up again. That whole process can cause parts that were already getting close to failing to finally jump off the cliff.

Has happened to me when clients move to all-new offices. It's not related to what level of RAID it was at all... a physical drive failure is the drive itself, regardless of RAID level.
I remember my biggest heart attack... many years ago, when I took on a new healthcare client (Server 2000, or it may have still been NT4), they had an existing big Dell PowerEdge server for their main LOB app, and we moved their office to a new building. Plans to replace this server were in the works for the end of the year, but not yet. Went to power up this Progress database server... and sure enough, 2 drives tanked on the second volume, which was R5. This was many years ago, back when R5 was popular, and good D/R/business-continuity products like Datto were not around (this client was using SymantSuck Bloatup Exec back then, before I switched them over to... I think Paragon... before Datto). Anyways... yeah, working with Dell support into the dark hours of the night, we finally managed to get the drives forced back online long enough to keep her running till new drives could be cycled in.
 
What do you guys use as an "early warning system" to preclude this type of thing?

RAID 6 + two backups. There really is no early warning you can trust. A second drive, even a third drive, can easily fail during any rebuild operations. Just the other day, for the first time ever, I had a RAID 5 come in for recovery where 4 out of 8 drives had failed. Fortunately, none had catastrophically failed, and it was a 99.999% recovery.

@knc Let me know if your client needs data recovered from the RAID. I can probably give you a decent price on it.
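
To put a rough number on that rebuild risk, here's a quick back-of-envelope sketch. The drive size and URE rate below are assumptions (typical spec-sheet figures), not details from this thread:

```python
# Back-of-envelope: chance of hitting at least one unrecoverable read error (URE)
# while rebuilding a degraded RAID 5. The numbers below are assumptions for
# illustration (typical consumer spec-sheet values), not figures from this thread.

drive_size_tb = 2            # assumed size of each member disk
surviving_drives = 2         # a 3-drive RAID 5 with one drive already failed
ure_rate = 1e-14             # common consumer spec: 1 URE per 10^14 bits read

bits_to_read = surviving_drives * drive_size_tb * 1e12 * 8
p_clean_rebuild = (1 - ure_rate) ** bits_to_read
print(f"Chance the rebuild hits at least one URE: {1 - p_clean_rebuild:.1%}")
# Roughly 27% with these assumptions -- and that's before a second drive from
# the same batch decides to die outright mid-rebuild.
```

Even with generous assumptions, a plain single-parity rebuild is nowhere near a sure thing, which is why the answer stays "RAID 6 + two backups."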
 

Thank you, we have good backups, but since the data isn't important any longer, they don't want to recover it.
 
Was the server powered down at some point? I've seen multiple drives fail within a very short period of time when a server that had been running for a long time gets shut down, has time to cool down, and then gets powered back up and heats up again. That whole process can cause parts that were already getting close to failing to finally jump off the cliff.

I don't think this phenomenon really has anything to do with heat and cooling. In fact, it's been pretty well established that heat/cold has very little effect on the life of drives. It relates more to bad sectors developing in the drive's service area. When a drive is powered on, it goes through a bootup cycle where it reads the firmware code from the platters. If the drive is left powered on for years on end, it may develop bad sectors in that firmware code, but it will continue to operate so long as it's never powered off, because the code is already in the drive's RAM. Then, as soon as it's powered off, it fails to initialize the next time it's powered on.
 
All striping RAID carries this risk with it, which is why you have backups. More to the point, if you buy a server and get all the drives at the same time, the drives are often all from the same manufacturing lot, so they are all going to have similar lifespans and issues as a result. Cascade failure of a drive array is the normal result, not the exception.

For this reason I only deal with RAID 10 these days, and after 2-3 years I replace one drive in each mirror with new media. Then I don't have to worry about it. RAID 6 is nice, but honestly doesn't solve the problem.

Finally, the drives may actually not be bad. Striping RAID has many issues that can come up where the stripes become misaligned; if you zero the drives and build a new array, I'll bet you'll find they spin right up. There's a reason large storage systems are moving away from hardware-level RAID: it's just better to have the OS do it via the filesystem. You can then span multiple platforms and different disk sizes, with no controller faults to leave you hanging, all sorts of upsides. RAID is just on the way out, unless it's a mirror; those are still useful.
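
For what it's worth, here's a minimal sketch of what "let the OS/filesystem do it" can look like, using ZFS mirrored pairs as one example. The pool name and device paths are placeholders, and none of this is from the thread's actual server:

```python
import subprocess

# One example of OS/filesystem-level redundancy instead of a hardware RAID card:
# a ZFS pool built from two mirrored pairs (roughly RAID 10 shaped). The pool
# name and device paths below are placeholders for illustration only.

POOL = "tank"
PAIRS = [("/dev/sdb", "/dev/sdc"), ("/dev/sdd", "/dev/sde")]
DRY_RUN = True   # flip to False only on a disposable test box

cmd = ["zpool", "create", POOL]
for disk_a, disk_b in PAIRS:
    cmd += ["mirror", disk_a, disk_b]

if DRY_RUN:
    print("Would run:", " ".join(cmd))
else:
    subprocess.run(cmd, check=True)                      # build the pool
    subprocess.run(["zpool", "status", POOL], check=True)  # health / resilver state
```

The point is just that the redundancy lives in the filesystem, so there's no proprietary controller to fail and leave you hanging.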
 
For this reason I only deal with RAID 10 these days, and after 2-3 years I replace one drive in each mirror with new media. Then I don't have to worry about it. RAID 6 is nice, but honestly doesn't solve the problem.

Actually, RAID 6 is safer than RAID 10 because you can lose ANY 2 drives and still rebuild. With RAID 10, if you lose two from the same mirror set, you're back in the same predicament as RAID 5 with two failed drives.

The only disadvantage to RAID 6 is performance, but a good RAID card can take care of that.
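
A quick way to see both sides of this is to count the two-drive failure combinations each layout survives. A minimal sketch, assuming an 8-drive array purely for illustration (the thread's box was a 3-drive RAID 5):

```python
from itertools import combinations

# Count how often a random 2-drive failure kills each layout on an assumed
# 8-drive array: RAID 6 (any 2 survivable) vs RAID 10 (4 mirrored pairs).

drives = list(range(8))
mirror_pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]   # RAID 10 as 4 mirrored pairs

all_failures = list(combinations(drives, 2))
fatal_for_raid10 = sum(1 for pair in all_failures if pair in mirror_pairs)

print(f"RAID 6 : survives {len(all_failures)}/{len(all_failures)} two-drive failures")
print(f"RAID 10: fatal in {fatal_for_raid10}/{len(all_failures)} cases "
      f"(~{fatal_for_raid10 / len(all_failures):.0%})")   # 4/28, about 14%
```

RAID 6 survives all 28 combinations but has to rebuild from parity; RAID 10 loses roughly 1 in 7 of them but rebuilds with a straight mirror copy, which is exactly the trade-off being argued here.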
 
You're correct with regard to the fault tolerance, but not the general risk. Striping-based RAID topologies have massive issues on rebuild. RAID 6 doesn't solve these issues; it just papers over them with a second parity stripe. It is, however, a bucket more efficient in terms of storage, but you simply cannot beat the reliability of rebuilds when working with mirrors.

Also, stripes eat SSDs, which is where the future of storage lives anyway.
My use cases are almost all hypervisors, which all but dictates use of RAID 10 anyway, unless you're forking over huge money for a proper SAN.
 
In fact, it's been pretty well established that heat/cold has very little effect on the life of drives.

I've actually seen the contrary in my ~25 years of IT... heat/cold does have an effect. Plenty of times I've seen servers in areas with poor cooling have higher-than-average HDD failure rates. I've seen servers in server rooms where the HVAC went out for a weekend or more... and the little room got hot... and HDDs blew up. And for my core clients where I get an open checkbook, I demand server rooms with good cooling... I keep it at 65°F in there, and I experience far lower than typical HDD failure rates.

A lot of server guys will tell you: take a server that has been running for something like 5 years straight, power it down for a few hours, move it, power it back up, and that's the highest-risk period of that server's lifetime for having something tank on ya. Along with other things related to "thermal creep"... something we learned about way, way back in A+ courses.

Sky Knight works in hell... in the desert... he can testify for hours... days... weeks... non-stop... about dying server HDDs at crazy high rates due to his Death Valley-like region.
 
I suspect 99% of what you've observed is what I'd classify as "indirect causation". In general, heat can be very bad for electronics. Hard drives themselves don't tend to be highly prone to heat-related issues unless it's excessively hot. Other components such as motherboards, CPUs, and power supplies don't fare as well. So if a power supply craps out, it may well take drives out along with it; that's true.

I'm just saying that the failure shortly after power-on isn't about heat. It's about the drives going through an initialization cycle which they haven't done in a long time. This could happen in the Antarctic or in Death Valley, and you'd still see it. There's a reason S.M.A.R.T. keeps track of power-on cycles. That process puts a lot of strain on drives, and each drive will only succeed in doing it so many times.

When we work on failed drives, especially ones where we replace read/write heads, we do everything possible to avoid additional power up cycles. It's very common to only get a drive to boot up once or twice after it has been worked on internally.
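
For anyone who wants to actually keep an eye on those counters, here's a rough sketch that shells out to smartctl (smartmontools). The device path is a placeholder, and the parsing assumes the usual ATA attribute table layout, so treat it as a starting point rather than a portable tool:

```python
import subprocess

# Rough sketch: pull a few S.M.A.R.T. counters with smartctl (smartmontools).
# The device path is a placeholder, and the output parsing assumes the common
# ATA attribute table layout, which varies by drive and firmware.

DEVICE = "/dev/sda"   # placeholder -- point this at the disk you care about
WATCH = {"Power_Cycle_Count", "Reallocated_Sector_Ct", "Current_Pending_Sector"}

out = subprocess.run(
    ["smartctl", "-A", DEVICE], capture_output=True, text=True, check=False
).stdout

for line in out.splitlines():
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    parts = line.split(None, 9)
    if len(parts) == 10 and parts[1] in WATCH:
        print(f"{parts[1]:<25} raw value: {parts[9]}")
```

It won't show service-area defects directly, but rising reallocated/pending sector counts and a high power-cycle count are about as much early warning as you get.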
 
Sky Knight works in hell... in the desert... he can testify for hours... days... weeks... non-stop... about dying server HDDs at crazy high rates due to his Death Valley-like region.

Yes, heat kills electronics; it's the enemy of all things digital. But honestly I'm not sure whether it's the heat that fries things out here, the dirty power, or a combination of both. But I can tell you that for any given drive, if it's in a room that's never allowed to get over 82°F, I can frequently get a good solid decade without issues. But as soon as that room clears 90... forget it, drives are faulting within weeks on the long end.

But if you're paying attention here you'll note my definition of hot is WELL ABOVE the normal data center standard.
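
If you want to turn that 82°F rule of thumb into an automated check, something like this works as a rough sketch; again the device path is a placeholder, and the temperature attribute name and raw format vary by drive:

```python
import subprocess

# Rough sketch: warn when a drive reports a temperature above ~82F (about 28C).
# Device path is a placeholder; the attribute name and raw-value format vary
# by drive, so adjust for your hardware.

DEVICE = "/dev/sda"          # placeholder
LIMIT_F = 82                 # the rule-of-thumb ceiling discussed above

out = subprocess.run(["smartctl", "-A", DEVICE],
                     capture_output=True, text=True, check=False).stdout

for line in out.splitlines():
    parts = line.split(None, 9)
    if len(parts) == 10 and parts[1] in ("Temperature_Celsius", "Airflow_Temperature_Cel"):
        temp_c = int(parts[9].split()[0])      # raw value starts with degrees C
        temp_f = temp_c * 9 / 5 + 32
        status = "WARN" if temp_f > LIMIT_F else "ok"
        print(f"{DEVICE} {parts[1]}: {temp_c}C ({temp_f:.0f}F) [{status}]")
```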
 
FWIW, I had an HDD set up in a dock, trying to recover the data from it. Most of the data was already backed up, but the client asked if I could recover the rest anyway; the drive kept failing and dropping out.
So for giggles I removed the paper sticker, grabbed a heatsink and fan from an AMD CPU, and stuck it to the HDD with some thermal paste and zip ties. I used a 12-volt adapter to power the fan.
It never failed again, and I was able to recover the data without any further issues.
 
But I can tell you that for any given drive, if it's in a room that's never allowed to get over 82°F, I can frequently get a good solid decade without issues. But as soon as that room clears 90... forget it, drives are faulting within weeks on the long end.

Yup, I agree. Many, many times... seen it. Server room, AC goes out, a cabinet or two of servers puking tons of heat into a tiny server room... gets hot in there quickly! By the time it's discovered, it's often too late: the lights in front of the hard drives are red, some drives have tanked, and the RAID is doing its job waiting for drives to be replaced. Or worse... too many drives tanked at once. I'll eliminate dirty power as a contributor... they're usually on banks of APC 2200 or 3000 units.

I like 65°F for my server rooms. Yes... when I'm scheduled to work at those clients for a day, I keep a sweatshirt hoodie in there to keep myself warm. And I get good long lives out of those drives in the servers/SANs.
 