It’s a common theme you read on many sysadmin forums – ‘RAID is not backup!’ I have always agreed with that statement, but it didn’t hit home until recently.
A little over a month ago, I was on site in Kentucky to switch some T1 lines around. When I got there, I noticed one of the drives on their server had failed. I requested a replacement drive from the corporate office. Since I was stuck on hold with the telco during the data line switchover, I ran a backup of the server. The next morning, the replacement drive had not arrived. I left instructions to just swap the drive out when it did show up, and started on my drive back up to Michigan, with the backup tape in my laptop case.
A few hours later, my phone rang. “Jim, I switched out the drive, and now everyone says all their files are missing.” I walked through a couple of checks and came to the conclusion that this was pretty much the worst-case scenario – one of the other RAID drives had failed during the array rebuild and taken the entire array down. Worse yet, I had the only full backup tape, and I was on the road almost halfway between Michigan and Kentucky. A long weekend was in store for me.
Fortunately, the server that went down was ‘only’ their file & print server, and not the Exchange server or only DC for that domain. Another plus was that the server was down over a (relatively slow) weekend, as opposed to the middle of the week. To work around some of the issues, DHCP services were moved to the primary router at the Kentucky site, and DNS was repointed to Michigan. Users could still access email and the terminal system. Corporate IT began building a new server in Michigan, so I could start restoring data as soon as I got back.
After six hours of restoring from tape, the replacement server was mostly back up and running; users lost less than 12 hours of saved work and no email. Printing was an issue on the new server, which was loaded with newer drivers that caused problems for some of the older PCs.
Lesson learned: RAID is not backup. As drive capacities grow, the likelihood of an additional drive failing during a rebuild increases, since rebuilds take longer and put sustained stress on the remaining drives. To reduce that risk, build your RAID arrays with at least one, preferably two, hot-spare drives for automatic failover, and configure your server to send email or text alerts when it detects hardware issues.
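On Linux software RAID, for example, `mdadm` has a built-in monitor mode that can email alerts, and the array health is also visible in `/proc/mdstat`, where a missing member shows up as an underscore in the status brackets (e.g. `[UU_]` instead of `[UUU]`). As a rough sketch of the idea (the parsing here is a simplified assumption about the `/proc/mdstat` format, and a real deployment would send mail rather than print):

```python
import re

def degraded_arrays(mdstat_text):
    """Return (array, status) pairs for arrays with a missing member.

    /proc/mdstat shows member health like [UU] (all up) or [UU_]
    (one member missing) on the line after each array's name.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        status = re.search(r"\[[U_]+\]", line)
        if current and status and "_" in status.group(0):
            degraded.append((current, status.group(0)))
    return degraded

# Sample data standing in for /proc/mdstat: md0 is healthy,
# md1 has a failed member (F) and a hole in its status brackets.
sample = """\
md0 : active raid1 sda1[0] sdb1[1]
      976630336 blocks [2/2] [UU]
md1 : active raid5 sdc1[0] sdd1[1] sde1[2](F)
      1953260544 blocks level 5 [3/2] [UU_]
"""

for name, status in degraded_arrays(sample):
    # A real monitor would email or text here instead of printing.
    print(f"ALERT: {name} is degraded {status}")
# → ALERT: md1 is degraded [UU_]
```

Run from cron, something like this catches a degraded array hours before anyone notices missing files – which is exactly the head start I didn't have.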