Server maintenance fun (not!)

HCHTech

I have a client with 2 physical servers (with a smattering of server VMs split between the two). I got an alert last week about a CMOS battery that needed replacing in one of them, so I scheduled a maintenance window for today to take care of that, plus some other tasks on the to-do list (upgrading the BIOS to get the new TPM certificate, installing the latest version of iDRAC, etc.).

I decide to tackle the battery first. I export the settings from iDRAC, power down the VMs, power down the server, remove the power cords from the supplies, then take it apart and swap the battery. Naturally, Dell put the battery underneath one of the riser cards, so that's a little obnoxious, but whatever, I guess. I get it back together, reconnect it to power, slide it back into the little rolling rack they have and power it up. I swear, every time I have to do system-level stuff to a server, I hold my breath until it comes back up again. Been doing server stuff for 15 years or so now, and it never changes.
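
(If anyone wants to script that "export the settings first" step: here's a rough sketch of how I'd do it, assuming the remote racadm utility is installed and the iDRAC firmware supports pulling the Server Configuration Profile down to a local file - older firmware may insist on a CIFS/NFS share instead. The IP and credentials are placeholders.)

```python
#!/usr/bin/env python3
"""Pre-maintenance snapshot: export the iDRAC Server Configuration Profile
(BIOS, iDRAC, NIC, and RAID settings) before powering the box down."""
import datetime
import subprocess

IDRAC_IP = "192.168.1.120"   # placeholder iDRAC address
USER = "root"                # placeholder credentials
PASSWORD = "calvin"

def export_scp(path: str) -> None:
    # 'racadm get -t xml -f <file>' exports the full Server Configuration
    # Profile; run remotely with -r/-u/-p so the file lands on this machine.
    subprocess.run(
        ["racadm", "-r", IDRAC_IP, "-u", USER, "-p", PASSWORD,
         "get", "-t", "xml", "-f", path],
        check=True,
    )

if __name__ == "__main__":
    stamp = datetime.date.today().isoformat()
    export_scp(f"scp-backup-{stamp}.xml")
    print("Settings exported - now it's safe(r) to start pulling power.")
```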

Anyway, fears realized - it didn't come up. The console is trying to do a PXE boot and iDRAC says "no physical drives detected" in the storage pane. I didn't even TOUCH the RAID card or drives! The whole rest of my day mucking around with this flashes before my eyes, my heart rate doubles and I start mentally making the list of what to do, hoping the backups are available and good, rehearsing the phone call to the owner - all of it.

Luckily, it occurs to me after the initial panic subsides that I should check whether I accidentally unseated something, or put the riser cards back in the wrong order, or something equally stupid. I couldn't find anything loose or unhooked, but I took the cards out of the riser and removed the riser itself to examine it. I put the riser back in, reinserted the 3 cards connected to it, and buttoned everything back up again. This time, it started normally. I think it took a good 30 minutes for my heart rate to come back down - the rest of the work went fine, thank goodness. 1:00 isn't too early for a beer or two, is it?
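
(Postscript: if I'd been thinking straight, I could have confirmed the "no physical drives detected" state from a script instead of refreshing the storage pane. A minimal sketch against the iDRAC's Redfish API - Redfish is the DMTF standard Dell implements, "System.Embedded.1" is the usual Dell system ID, and the IP/credentials below are placeholders.)

```python
#!/usr/bin/env python3
"""Ask the iDRAC's Redfish API which physical drives each storage
controller can actually see, and what health it reports for them."""
import requests
import urllib3

urllib3.disable_warnings()          # iDRACs ship self-signed certs

IDRAC = "https://192.168.1.120"     # placeholder iDRAC address
AUTH = ("root", "calvin")           # placeholder credentials

def get(path: str) -> dict:
    return requests.get(IDRAC + path, auth=AUTH, verify=False).json()

def list_drives() -> None:
    storage = get("/redfish/v1/Systems/System.Embedded.1/Storage")
    for member in storage.get("Members", []):
        ctrl = get(member["@odata.id"])
        drives = ctrl.get("Drives", [])
        print(f"{ctrl.get('Id')}: {len(drives)} drive(s) visible")
        for ref in drives:
            drive = get(ref["@odata.id"])
            health = drive.get("Status", {}).get("Health")
            print(f"  {drive.get('Name')}: {health}")

if __name__ == "__main__":
    list_drives()
```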
 
Yeah... I wouldn't have pulled the power cords from the supplies. The standby power to the board keeps the CMOS settings intact while I swap the battery. Working on a box that still has power is riskier, but it's better than watching the BIOS settings go poof and running the risk of scrambling the RAID configuration. Though in your case the battery was harder to access.
 
I swear, every time I have to do system-level stuff to a server, I hold my breath until it comes back up again. Been doing server stuff for 15 years or so now, and it never changes.
I hear you. When a bit of kit has been running for years and you have to power it down, there's a lot of very shallow breathing and praying to the computer gods that the drives haven't seized from being stopped and that the sodding thing comes back to life. And if it does, I think it's time to buy a lottery ticket.
 
Have had that happen to servers that have been running 24x7 for years on end (with only monthly updates and reboots...but no power off)...you go to power them down (say the business physically relocates, so you have to cold power down the servers to transport them)...yup...fail to boot up (Dells too)...RAID config somehow whacky. HPs did a better job of reading the RIS (Reserved Information Sectors) data on each drive of an existing RAID volume to put the pieces back together again if you had to cold power down or swap hardware like a new RAID card.
 
...fail to boot up (Dells too)...RAID config somehow whacky.

Yeah, I had the "Everything" configuration backup from iDRAC, but I've never actually had to use one of those, so I don't know whether it would have worked. Also, as far as I know there is no way to back up the RAID configuration itself other than taking pictures of the various screens, and that's not really a solution. The last time I had a RAID card warning, I called Dell support to come and do the swap. Given the enshittification of all support these days, I'm not sure I would trust that method today...
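
The closest thing to a RAID-config backup I've come up with since is dumping the virtual-disk layout over the iDRAC's Redfish API into a file you can keep with the server's docs. A rough sketch (placeholder IP/credentials; and to be clear, this just documents the layout so you could rebuild it by hand - I wouldn't count on any tool restoring from it):

```python
#!/usr/bin/env python3
"""Dump every controller's virtual-disk layout (name, RAID level, size,
member drives) from the iDRAC Redfish API to a JSON file."""
import json
import requests
import urllib3

urllib3.disable_warnings()          # iDRACs ship self-signed certs

IDRAC = "https://192.168.1.120"     # placeholder iDRAC address
AUTH = ("root", "calvin")           # placeholder credentials

def get(path: str) -> dict:
    return requests.get(IDRAC + path, auth=AUTH, verify=False).json()

def dump_raid_layout(outfile: str) -> None:
    record = []
    storage = get("/redfish/v1/Systems/System.Embedded.1/Storage")
    for member in storage.get("Members", []):
        vols = get(member["@odata.id"] + "/Volumes")
        for ref in vols.get("Members", []):
            vol = get(ref["@odata.id"])
            record.append({
                "controller": member["@odata.id"].split("/")[-1],
                "volume": vol.get("Name"),
                "raid_type": vol.get("RAIDType"),
                "capacity_bytes": vol.get("CapacityBytes"),
                "member_drives": [d["@odata.id"].split("/")[-1]
                                  for d in vol.get("Links", {}).get("Drives", [])],
            })
    with open(outfile, "w") as fh:
        json.dump(record, fh, indent=2)

if __name__ == "__main__":
    dump_raid_layout("raid-layout.json")
```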
 
Yeah... I wouldn't have pulled the power cords from the supplies. The standby power to the board keeps the CMOS settings intact while I swap the battery. Working on a box that still has power is riskier, but it's better than watching the BIOS settings go poof and running the risk of scrambling the RAID configuration. Though in your case the battery was harder to access.

I've hot-swapped a few things in my time, but I couldn't even see the darned battery until I removed the riser card. That process just felt too dangerous to do with power applied. All's well that ends well, I guess, yikes.
 
"I swear, every time I have to do system-level stuff to a server, I hold my breath until it comes back up again."

I did maintenance on IBM System p (POWER) servers for several years, around 2010 or so. Big iron like that takes forever, as in several minutes, to complete POST. Inevitably I had to quell the automatic reaction that something was wrong, since boot-up time was so much longer than on Wintel boxes.
 
I've hot-swapped a few things in my time, but I couldn't even see the darned battery until I removed the riser card. That process just felt too dangerous to do with power applied. All's well that ends well, I guess, yikes.
Yeah, you did the best you could, and what you did wasn't incorrect. It's just... ugh... hardware!
 