Thinking like a sysadmin
August 24, 2010
I’ll admit it right away: sitting under our staircase, hunched over my laptop, is not really my idea of a great weekend.
I spent a significant part of last weekend in deep frustration. My home server had crashed, and it was absolutely my own fault. When this server is down, it means no internet connection in the house. Plus, a mailing list, my wiki, redirection to my blog and a few other services are down. So obviously this needed to be fixed, and fast.
The trouble began when I tried to create a RAID array last week. After some hassle with a cheap SATA controller “with Raid”, I decided that I would just go for a software RAID. Now, in an attempt to be responsible and actually prepare this operation, I googled a bit and found various posts about how to go about doing this in Ubuntu, which pitfalls to avoid and so on. Among other interesting facts, I discovered that the kernel I was running was known to cause some trouble with RAIDs.
So time to upgrade to the newest release, a leap of 3 or 4 Ubuntu releases. Unfortunately, my patience was running low at this point, so I ignored the HUGE warning and started the upgrade over my SSH connection. Because I couldn’t be bothered to pull out the server from under the staircase and attach a keyboard and monitor to it.
As you will have guessed by now, this was an enormous mistake. One of the stupidest things I have yet done in my career as an amateur sysadmin. Because of course, somewhere in the course of the upgrade, something went wrong and broke the network connection. Along with DNS, DHCP, internet connectivity and everything else we rely on this machine for.
After a reboot, I was still unable to contact the machine. So I had to pull it out and attach an actual monitor to it. To my horror, it turned out that the boot process died instantly, because it was unable to mount the root filesystem. WTF?
Let me just cut to the chase here, because too many hours were spent trying out solutions that just didn’t work.
The solution was slightly interesting, though. Not having an internet connection (except on my HTC Hero) made it pretty hard to google for possible solutions, look up GRUB docs etc., so I ended up taking a different approach. I had an Ubuntu 10.04 installer CD lying around, so I did a fresh install using some free space on the hard drive. Then, I compared the GRUB configurations of the broken and the fresh installations. It turned out that the upgrade had been interrupted before it had created an initrd for the new kernel! I was a bit surprised that it had nevertheless updated the GRUB config to a half-baked state: the entry in menu.lst contained a kernel line but no initrd.
So I stole a fresh kernel+initrd from the new installation and voilà, my system was back to normal (except for lots of unconfigured packages etc.).
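In hindsight, the half-baked entry could have been spotted with a few lines of script instead of by eyeballing two config files. Here is a rough sketch in Python of how that check might look. It assumes the simple GRUB legacy menu.lst layout, where each entry starts with a `title` line followed by `kernel` and `initrd` lines. The function name and the sample entries are made up for illustration; they are not copied from my actual config.

```python
def entries_missing_initrd(menu_lst_text):
    """Return the titles of entries that have a kernel line but no initrd line.

    Assumes GRUB legacy syntax: one keyword per line, entries begin at "title".
    """
    entries = []
    current = None
    for raw in menu_lst_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 1)
        keyword = fields[0].lower()
        if keyword == "title":
            # A new boot entry starts here.
            current = {
                "title": fields[1] if len(fields) > 1 else "",
                "kernel": False,
                "initrd": False,
            }
            entries.append(current)
        elif current is not None and keyword in ("kernel", "initrd"):
            current[keyword] = True
    return [e["title"] for e in entries if e["kernel"] and not e["initrd"]]


# Made-up entries for illustration only (hypothetical paths and titles).
SAMPLE_MENU_LST = """\
title  Old kernel (complete entry)
root   (hd0,0)
kernel /boot/vmlinuz-old root=/dev/sda1 ro
initrd /boot/initrd.img-old

title  New kernel (half-written entry)
root   (hd0,0)
kernel /boot/vmlinuz-new root=/dev/sda1 ro
"""

print(entries_missing_initrd(SAMPLE_MENU_LST))
# prints ['New kernel (half-written entry)']
```

A hit is only a hint, not proof of a broken entry: some entries, such as a memory tester, boot perfectly well without an initrd.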
Lesson learned: As a software developer, I’m enthusiastic, optimistic and sometimes downright reckless. As a sysadmin, pretty much the same. A good sysadmin needs to be conservative and cautious, and I understand why they sometimes get annoyed with us developers.
Keep this in mind the next time your local sysadmin doesn’t “just update that shit” right away.