iDRAC to the rescue - sort of.

HCHTech

I got an email from one of my iDRAC setups yesterday warning of a failed hard drive - boy, that felt good. It's just what you imagine when you set up a warning system: you get notice of a problem before there is any downtime.

So I check to see if I've got an 8TB SAS drive in inventory to take with me - oops, guess I must have used the one I had at some point and not ordered a replacement - that was a mistake. Well, it's a single disk in a RAID10 array, so I guess I'll just get the warranty replacement going with Dell, then go onsite.

Once on the phone with Dell, it turns out that they made a mistake when registering this service tag last year, so we can't proceed until they fix that - ugh, thanks, Dell. 3 hours later, I get a call back that they've fixed their problem and we can proceed with the warranty claim. 45 minutes of nonsense follows; I think it must have been the guy's first day on the job or something. Multiple holds while he checked with his superior. Finally we get to the end and, instead of ordering the drive, they send an email for me to fill out with most of the same information they took over the phone - ugh, bureaucracy at its finest. The email requires the Dell part number, which of course isn't reported anywhere in iDRAC, so I give up and head out to the client's.

I arrive at the client's, get into iDRAC again and confirm which drive is the problem, remove it and take a picture of the label so I have the Dell part number. Then I look through their cold spares to see if there is an 8TB drive there. OS SSD, check. Redundant power supply, check. No 8TB drive. Hmm, I wonder why that is - another mistake. Anyway, I remount the drive so it doesn't get lost while I wait for its replacement to arrive. To my surprise, iDRAC now reports the drive is healthy and the array is rebuilding. I wait for a few minutes, but the rebuild is continuing without error. Weird.

Once back in the office, I decide to wait a bit to send back the warranty claim email - if I end up trying to claim a healthy drive, they'll probably charge me the full Dell extortion rate for it. The rebuild was about 60% done when I finished up yesterday, and this morning I see the array is reporting as healthy again, and so is the problem drive.

So now I don't know how to proceed other than to just wait and see if there are additional errors reported on that drive. Plus, what even was the original problem that caused iDRAC to mark it failed? If it doesn't turn out to be failing, I just wasted a few non-billable hours chasing around, and I'm not quite as fond of the iDRAC warning system as I was a couple of days ago. I should probably run some diagnostics on that drive to sleep better, but that will either take a few hours of downtime to run the Dell diags during a maintenance window, or I'll have to purposefully degrade the array again to remove the drive and test it on the bench. Frustrating.
 
That, or HP's iLO, are good when they're set up and working. Nice to have "the system" automatically phone home to the mothership and submit the ticket..to have a replacement warranty part shipped overnight direct to the client. Like a failed drive to hot swap out, or a failed chassis fan, whatever.

Ahh..but yes, there are times when it's a pain, like getting a server a support renewal. Just this morning Dell called me to ask about the expired warranty on a client's server. I had spent days..and days..and days..last fall...trying to get it renewed with Dell. It was like I was e-mailing with a parrot that was just repeating the same thing in email each day. This morning...I picked up the phone, and he asked about renewing the warranty, and I screamed "I tried...for about 96 hours' worth of my time last fall...you guys were the most PAINFUL experience I had gone through in months!" And I hung up.
 
Was the drive completely failed or in predictive failure? If you pulled the drive and reinstalled it, the controller may just see it as a new drive and automatically start rebuilding. I would imagine you would start seeing bad counters again and eventually it will report the same status. If it's under warranty and the logs show the drive failed or was in predictive failure, I would still work with support to get parts dispatched and swap it out. All that is covered under warranty. Just go on their site, dell.com/support, and start a chat. Tell them that your drive reported a failure. They will ask for the controller log; once they verify the controller reported a drive failure, they will dispatch parts. It takes like 15 mins. They don't have you run tests on the drives.
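
If you'd rather check the failed-versus-predictive question without clicking through the GUI, here is a minimal sketch. It assumes you have already pulled a drive's JSON from iDRAC's Redfish interface. The field names `Status.Health` and `FailurePredicted` come from the standard Redfish Drive schema, but the `classify_drive` helper, the sample payloads, and the mapping of "Critical" to "failed" are my own assumptions for illustration; check what your iDRAC firmware actually populates before trusting it.

```python
def classify_drive(drive: dict) -> str:
    """Roughly classify a drive from a Redfish-style Drive payload.

    Assumption: Health == "Critical" is treated as a hard failure and
    FailurePredicted == True as predictive failure. Verify against
    your own controller's output.
    """
    status = drive.get("Status") or {}
    health = status.get("Health")

    if health == "Critical":
        return "failed"
    if drive.get("FailurePredicted") is True:
        return "predictive failure"
    if health == "Warning":
        return "warning"
    if health == "OK":
        return "healthy"
    return "unknown"


# Hypothetical payloads, trimmed to the fields used above.
samples = {
    "failed drive": {"Status": {"Health": "Critical"}, "FailurePredicted": False},
    "predictive": {"Status": {"Health": "OK"}, "FailurePredicted": True},
    "after rebuild": {"Status": {"Health": "OK"}, "FailurePredicted": False},
}

for name, payload in samples.items():
    print(f"{name}: {classify_drive(payload)}")
```

A drive that flips from "failed" back to "healthy" after a reseat would show up here as a change between two polls, which is worth saving alongside the controller log for the support chat.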
 
Was the drive completely failed or in predictive failure?

No, not predictive failure. Marked as failed. The array was degraded, of course, and the drive's status was failed. I also saw in the details that its power status was "spun up", so not a motor failure. The logs clearly show the failure, but they also now show that same drive as currently healthy. ¯\_(ツ)_/¯

Original screen shots:

1672769767749.png
1672769809423.png

Current screen shots:
1672770068626.png
1672770168932.png
 
@HCHTech - regardless of whether it shows healthy now, still chat with Dell and have them review the controller log. They may dispatch parts regardless. If you're seeing a lot of 'Unexpected SCSI Sense' entries for the same disk in the iDRAC logs too, it's usually a sign the drive is going to fail soon.
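
To see quickly whether one disk is racking up those entries, something like this rough sketch could work on a log you've exported to text. The log line format below is made up for the example, and the `Disk.Bay.N` pattern is an assumption about how your export names disks, so adjust both the phrase and the regex to match what your iDRAC actually writes.

```python
import re
from collections import Counter


def count_sense_errors(lines, phrase="Unexpected SCSI Sense",
                       disk_pattern=r"Disk\.Bay\.\d+"):
    """Count log lines containing `phrase`, grouped by disk identifier.

    Lines that contain the phrase but no recognizable disk identifier
    are counted under "unknown".
    """
    disk_re = re.compile(disk_pattern)
    counts = Counter()
    for line in lines:
        if phrase.lower() not in line.lower():
            continue
        match = disk_re.search(line)
        counts[match.group(0) if match else "unknown"] += 1
    return counts


# Hypothetical log lines; a real export will look different.
SAMPLE_LOG = [
    "2023-01-02 03:14:07 Unexpected SCSI Sense Disk.Bay.3",
    "2023-01-02 03:15:41 Unexpected SCSI Sense Disk.Bay.3",
    "2023-01-02 04:02:10 Unexpected SCSI Sense Disk.Bay.1",
    "2023-01-02 04:05:00 Rebuild started Disk.Bay.3",
]

for disk, hits in count_sense_errors(SAMPLE_LOG).most_common():
    print(f"{disk}: {hits}")
```

If one disk stands well above the rest, that is the count to mention when you ask support to review the controller log.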
 
I've had failed drives start "working" after being reinserted into the array. Of course, all I did was say an extra prayer that the overnighted drive would arrive before it failed again. And that's why I've been doing a hot spare as well for the last few years.
 