Saturday, November 6, 2010

Catastrophic data loss - Learn from our unfortunate incident

I'm on the IT committee at my children's school.  I got an email on Friday that the server had a problem and they lost all the data, even the backups.  How could this happen? They had (almost) all the pieces in place to avoid data loss.

Here is everything that was done to avoid this problem.

  • Used enterprise-grade hardware - Dell PowerEdge 2900 with redundant power supplies
  • Used RAID 5 - 3 disks
  • Used a battery-backed RAID controller - PERC 5/i
  • Used a battery backup (UPS)
  • Used an external drive for backups

Let me explain. They had been having power issues for a few weeks. On Tuesday we found out that the UPS was no longer providing battery power. We started looking for a new one but had not purchased it yet. Here is how each of the safeguards failed.

  • Battery backup - Old; it no longer provided adequate battery power or brown-out protection.
  • Battery-backed RAID controller - The battery had gone bad and needed to be replaced, and we didn't have alerting in place to warn us about it. When the server went down hard, pending writes never completed (no battery on the controller and no battery on the drives), and this corrupted the MFT.
  • External drive for backups - Its power was plugged into the same UPS, and the power problem fried the drive.
  • RAID 5 - Worked; the RAID set was still intact and no drive failed.
  • Server - Both redundant power supplies were plugged into the same UPS.
We tried several recovery utilities to see if we could pull data off the drives. No luck.
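
The common thread in the list above is one UPS feeding both server power supplies and the backup drive. A quick audit of what is plugged into what can catch that kind of shared dependency. The following is a minimal sketch, not a tool we actually ran: the inventory and the names in it (`ups_a`, `server_psu_1`, and so on) are hypothetical and only mirror the setup described here.

```python
from collections import defaultdict

# Hypothetical inventory: each device maps to the power sources feeding it.
# The names are illustrative and only mirror the setup described in the post.
POWER_MAP = {
    "server_psu_1": ["ups_a"],
    "server_psu_2": ["ups_a"],
    "backup_drive": ["ups_a"],
}


def shared_power_sources(power_map):
    """Return {source: sorted devices} for every source feeding 2+ devices."""
    by_source = defaultdict(list)
    for device, sources in power_map.items():
        for source in sources:
            by_source[source].append(device)
    return {
        source: sorted(devices)
        for source, devices in by_source.items()
        if len(devices) > 1
    }


if __name__ == "__main__":
    # Print one warning per power source that several devices depend on.
    for source, devices in shared_power_sources(POWER_MAP).items():
        print(f"WARNING: {source} feeds {', '.join(devices)}")
```

With the inventory above, the audit flags `ups_a` as feeding all three devices. Moving the second power supply and the backup drive to separate sources makes the warning go away.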

On Friday we started a reinstall of SME on the server. However, a decision was made to move back to Windows, and now seemed like the perfect opportunity to do it.

Here are the steps we are taking to avoid total data loss in the future.
  • Order a new RAID controller battery
  • New UPS - a used one has already replaced the defective one
  • Local backup drive plugged into a different UPS
  • Off-site backup, something like Carbonite Pro
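
Since missing alerting was part of what bit us, it may also be worth checking regularly that backups are still being written. Here is a minimal sketch of such a check, assuming backups land as plain files in a single directory. The function names and the 24-hour threshold are assumptions for illustration, not part of the actual plan.

```python
import os
import time


def newest_backup_age_hours(backup_dir, now=None):
    """Return the age in hours of the newest file, or None if there are none."""
    now = time.time() if now is None else now
    with os.scandir(backup_dir) as entries:
        mtimes = [e.stat().st_mtime for e in entries if e.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_stale(backup_dir, max_age_hours=24, now=None):
    """True if there is no backup file at all or the newest one is too old."""
    age = newest_backup_age_hours(backup_dir, now=now)
    return age is None or age > max_age_hours
```

A scheduled job could call `backup_is_stale` on the local backup drive and send an email when it returns `True`. How the alert is delivered is left open here.
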
In my 12 years working in the enterprise I have never had so many things go wrong at once, or permanently lost data (knock, knock). Hopefully the new plan we put in place will keep this from happening again.