I was recently computing happily along, not suspecting anything, when suddenly things began to go terribly wrong. Web pages stopped loading. My email connections timed out. I could not remotely access my clients’ networks. The situation was serious. Then my wife told me that she could not get to any web sites. The situation had become critical!
My first thought was that my Internet service had gone away, but when I tried to log in to my cable modem my browser told me it could not resolve the address. When I tried to access the cable modem using its IP address, it worked fine, and I found that the modem was merrily humming away, completely unaware of my internal problems.
My next move was to log in to the server and make sure my internal DNS was still working. I was not able to log in using the server’s name, so I tried its IP address. No joy. Before heading into the basement, I made one last stab at it: I used ping to see if the server was alive. It wasn’t.
So, down two flights of steps and into the basement. I connected an LCD monitor to the server to see if the console screen would provide any clues and found that something had caused the server to try to reboot only to find that the boot drive was unreadable. Not good. So I booted the server with an Ubuntu “live CD” to investigate.
Boot Drive Failure
As it turns out, the hard drive that contained the system files required to boot the machine was dead – to the system, it didn’t even appear to exist. Oh well, drives are cheap and boot drives are easy to build; all I needed was a Debian Linux install CD and a few minutes. And, of course, I needed copies of the very customized configuration files that make my server my server (no worries there – I had recent backups).
Data Drive Failure
After installing the latest version of Debian Linux on a new boot drive, recovering the customized configuration files, and installing Oracle’s VirtualBox virtual machine software, I was finally ready to restart the RAID-1 array that contains the virtual machines that do the server’s real work. When I attempted the restart, the system could find only one of the two data drives. The other data drive, like the boot drive, was dead.
I very quickly replaced the failed data drive, and the RAID software automatically began to synchronize the data – basically copying everything from the remaining drive to the new drive – so that I would once again be protected. In a RAID-1 array the hard drives are “mirrored”: each drive holds a complete copy of all of the data. That way, if one of the drives fails, the data can be recovered and, even better, the system stays operational while the recovery is in progress. Note that this only works if just one of the drives fails; if both fail at the same time, the data is gone for good unless some other form of backup is in use.
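On Linux, the software RAID driver reports rebuild progress in /proc/mdstat, so it is easy to keep an eye on a resync while it runs (a simple “watch cat /proc/mdstat” from the shell does the trick). The little Python sketch below is only an illustration of the same idea; it assumes a standard md-managed array, and the one-minute polling interval is arbitrary:

    # Minimal sketch: watch a Linux md (software RAID) rebuild by polling /proc/mdstat.
    # Assumes the array is managed by the md driver; the polling interval is arbitrary.
    import re
    import time

    def resync_progress(mdstat_text):
        """Return the resync/recovery percentage if one is in progress, else None."""
        match = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%", mdstat_text)
        return float(match.group(1)) if match else None

    if __name__ == "__main__":
        while True:
            with open("/proc/mdstat") as f:
                progress = resync_progress(f.read())
            if progress is None:
                print("No rebuild in progress.")
                break
            print(f"Rebuild {progress:.1f}% complete")
            time.sleep(60)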
Because they are so large – many gigabytes each – I don’t back up my virtual machine files on a regular basis. I do back up the data that is on them, so I could recover if necessary, but that would be a major effort. While I was lucky this time that one of the data drives survived, I wanted better protection going forward. I’m now using four 500 GB drives in a RAID-6 array, which yields 1 TB of usable space and can survive the simultaneous loss of any two drives. Losing two of four drives at the same time is, I think (and hope), very unlikely outside of a local disaster such as the house burning down or being totally flooded, in which case I’d have far more important things to worry about.
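The capacity arithmetic is simple: RAID-6 spends two drives’ worth of space on parity, so the usable capacity is (number of drives − 2) × drive size. A quick sketch of that calculation (the function name is just for illustration):

    # RAID-6 reserves two drives' worth of space for parity,
    # so usable capacity is (drive_count - 2) * drive_size.
    def raid6_usable_gb(drive_count, drive_size_gb):
        if drive_count < 4:
            raise ValueError("RAID-6 needs at least four drives")
        return (drive_count - 2) * drive_size_gb

    print(raid6_usable_gb(4, 500))  # four 500 GB drives -> 1000 GB, i.e. about 1 TB usable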
Heat Stroke
During the rebuild of the server and the change from RAID-1 to RAID-6 I noticed that one of the four fans on my “SATA RAID hot-swap enclosure” was not working and one of the others was making a very unpleasant sound. The particular enclosure that I was using – emphasis on was using – occupied three full-height 5.25″ drive bays but held four 3.5″ SATA hard drives, which meant that the drives were very close together. I typically use either Western Digital or Seagate hard drives, and have had pretty good luck with both brands even though almost all of the drives I’ve used recently get really hot after they’ve been running for a while (such as in a server that’s powered on 24/7).
After getting the server rebuilt and back in operation, I turned my attention to the data drive that had not failed. I plugged it into my testing computer, booted up the latest version of Debian Linux, and used a utility program called palimpsest to look at the SMART (Self-Monitoring, Analysis, and Reporting Technology) data on the drive. Holy heat stroke, Batman! The surviving drive registered an “Air Flow Temperature Failed in the Past.” I don’t know how hot the air actually got, but the threshold value is 45°C (that’s °C, not °F). I can only imagine what temperature the other drive experienced before it eventually failed.
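If you would rather check this from the command line than from a GUI tool like palimpsest, the smartctl program from the smartmontools package reports the same attributes. Here is a rough Python sketch of the idea; the device path /dev/sda is only an example, it needs root privileges, and the exact attribute names vary from vendor to vendor:

    # Rough sketch: dump SMART attributes via smartctl and pick out the
    # temperature-related rows. Requires smartmontools and root privileges;
    # /dev/sda is only an example device path.
    import subprocess

    def smart_attributes(device):
        """Return the raw 'smartctl -A' report for a device."""
        result = subprocess.run(
            ["smartctl", "-A", device], capture_output=True, text=True, check=True
        )
        return result.stdout

    def temperature_lines(report):
        """Keep only the attribute rows that mention temperature or airflow."""
        return [
            line for line in report.splitlines()
            if "Temperature" in line or "Airflow" in line
        ]

    if __name__ == "__main__":
        for line in temperature_lines(smart_attributes("/dev/sda")):
            print(line)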
So the RAID-1 array did its job – it saved me the time, effort, expense, and lost productivity it would have cost to rebuild the virtual machines that serve my consulting business. Without the surviving drive from the original RAID-1 array it would have taken several days to rebuild the virtual machines, recover their data, and test everything to make sure it was correct. Not having to do that meant I had those days to serve clients. In contrast, it took only a few hours to get the server back online, and most of that was spent copying data from the RAID-1 array to the new RAID-6 array. This experience forced me to really think about my approach to data protection and server uptime, and what I learned translates directly into the approach I use with my clients’ servers.