This is not how I like to start my morning

Markverhyden
Noticed my email server (R730XD) was offline this AM when I got up. Checked via the web and it still wasn't working, so I went to the network room and found a bunch of drives offline, 6 out of 12. Of course this is more than a little panic-inducing. After some more investigation, all the offline drives turned out to be marked foreign. More digging and, while unusual, one Dell rep said that one drive can fail and in turn cause other drives to go offline due to bad writes. In this case, after reseating, all the drives came back but were still foreign. So I bit the bullet and rebooted, telling it to import all the foreign drives. Luckily ESXi came back up, but the email server was corrupted so I had to restore it from a backup. Fortunately no other VM was running, so they should be OK. Has anybody heard of such a thing? This happened 2-3 years ago on a different server, too.

[Attached screenshot: Screen Shot 2022-01-04 at 10.13.59 AM.png]
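
Since this is the second time this has bitten me, I'm putting together a quick check to run from cron so I at least get warned before walking into a half-foreign array again. Treat it as a rough sketch: it assumes perccli64 is installed in the usual Dell path, that the controller is /c0, and the string matching is just a guess against what my card prints, so check it against your own output first. The import itself is a one-liner afterwards (perccli64 /c0/fall import, and there's an "import preview" to see what it would do), but I'd rather eyeball things before forcing anything.

#!/usr/bin/env python3
# Morning sanity check for a Dell PERC: warn if the controller is holding a
# foreign configuration or a degraded virtual drive, rather than finding out
# when the VMs are already down. Path and controller index are assumptions.
import subprocess
import sys

PERCCLI = "/opt/MegaRAID/perccli/perccli64"  # adjust to wherever yours lives

def perc(*args):
    """Run a perccli command and return its text output."""
    return subprocess.run([PERCCLI, *args], capture_output=True, text=True).stdout

# "/c0/fall show" dumps any foreign configuration the controller has flagged.
foreign = perc("/c0/fall", "show")
# "/c0/vall show" lists the virtual drives so a degraded array stands out too.
vdisks = perc("/c0/vall", "show")

# Exact wording varies by firmware; a clean controller on mine says it couldn't
# find any foreign configuration, so match loosely and read the output yourself.
if "foreign" in foreign.lower() and "couldn't find" not in foreign.lower():
    print("WARNING: foreign configuration present on /c0 -- do NOT blindly import:")
    print(foreign)
    sys.exit(1)

if "Dgrd" in vdisks or "Degraded" in vdisks:
    print("WARNING: a virtual drive on /c0 is degraded:")
    print(vdisks)
    sys.exit(2)

print("Controller 0 looks clean")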
 
The only time I have ever seen multiple or all drives fail at once was due to a faulty controller. Very odd that they would all fail at the same time without ever telling you they were at least in predictive failure first. Maybe a power issue or power surge corrupted the controller memory or something? Definitely strange. Install all your Dell firmware updates just to be sure.
 
I had one other similar failure, maybe 5 years ago, on a box with 5 drives in RAID 5. Not a customer of mine, just a boots-on-the-ground T&M call. The catch is another tech had come in ahead of me and had messed around with the drives, so I didn't see the original condition, but it was similar: all the drives were out of sync/foreign when I arrived. Since it was running M$ Server and the other tech had let it boot and start chkdsk, it was a disaster.

We did have a power outage Monday night, but I don't have the exact time. I know because I have a Linux box with FDE, and it had rebooted and was sitting at the LUKS passphrase screen. The server was plugged into UPSs, so it didn't shut down. What's interesting is, looking at the front, every other drive had the blinking amber light. Most cards like that have two sets of cables, so every other slot dropping out makes me think it was a transient card/port issue. They're not cheap, so I'm reluctant to buy one. Just need to be more diligent on the backups thing.
 
Just after you make sure your backup is complete and that the restore tests out all right, go up to that UPS, pull the power from it, and see what happens. I suspect the transient on the card port may have something to do with a transient on the UPS, the latter being not uninterruptible enough.
 
When is the last time you changed the battery on the RAID card?
Yup - this is one of those things, like CMOS batteries, that should be on the maintenance list for sure. Some cards have an optional capacitor-based backup module (CacheVault on the Dell/LSI cards), which seems like an obvious thing that should exist for all of them but doesn't...
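
If it helps, the battery/CacheVault state is easy to fold into the same kind of cron check as the script in the first post. Same caveats apply: the perccli64 path and controller index are assumptions, and the "Optimal" match may need tweaking for your firmware.

#!/usr/bin/env python3
# Quick check of the RAID cache battery / CacheVault module on a Dell PERC.
# Assumes perccli64 in the usual Dell path and controller 0; the "Optimal"
# string match is a guess against my card's output, so verify it on yours.
import subprocess
import sys

PERCCLI = "/opt/MegaRAID/perccli/perccli64"

def show(target):
    # "/c0/bbu" for a battery-backed card, "/c0/cv" for a CacheVault module;
    # whichever one your card doesn't have will just report an error we ignore.
    out = subprocess.run([PERCCLI, target, "show"], capture_output=True, text=True)
    return out.stdout

status = show("/c0/bbu") + show("/c0/cv")
if "Optimal" not in status:
    print("RAID cache battery/CacheVault is not reporting Optimal -- go look at it")
    print(status)
    sys.exit(1)

print("Cache battery/CacheVault looks OK")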
 
I have seen Dell RAID controllers "forget" things more often than other brands; I have seen them label previous RAID members as "foreign". Yeah, it's nerve-racking forcing an import.
 