DalVZ4 Emergency RAID Maintenance
At about 3am EST on 4/8/2011, DalVZ4 suffered a RAID degradation that caused the system to become unstable and crash. We are currently investigating the cause of the RAID issue, working very closely with Softlayer's support staff.
So far, it looks like an issue with the RAID controller itself and/or its cabling, as the RAID card has stopped detecting the hard drives in the server chassis.
We’ll be updating this blog post as more information becomes available.
As a general reminder, please do not submit support tickets for issues that are already posted here. Answering unnecessary tickets distracts technical staff from remedying the issue.
Update @ 6:47am 4/8/2011 EST: We're working diligently to get another server online and ready to take over the load from DalVZ4. Any client with backups on our Central Backup system will be restored sooner rather than later. The full extent of the damage to the RAID array has yet to be assessed, as we are only accessing it when absolutely necessary. Once the RAID array is mounted, we'll attempt to rsync the data to a storage server as an emergency backup. From that backup, we'll restore accounts one by one. Our goal is to minimize downtime and get all clients back online as soon as possible.
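For anyone curious about the mechanics, the copy-off step looks roughly like the hypothetical sketch below: a per-account rsync from the mounted array to a storage server. The paths, hostname, and flags shown are placeholders for illustration, not our exact tooling.

    # Hypothetical sketch: copy each VPS's private area off the damaged array
    # one account at a time, so a single read error doesn't abort the whole
    # emergency backup. Paths and the storage hostname are assumptions.
    import subprocess
    from pathlib import Path

    SOURCE = Path("/mnt/dalvz4/vz/private")            # damaged array, mounted read-only
    DEST = "backup@storage1:/srv/emergency/dalvz4/"    # assumed storage server

    for ctid in sorted(p.name for p in SOURCE.iterdir() if p.is_dir()):
        result = subprocess.run(
            ["rsync", "-aH", "--numeric-ids", f"{SOURCE}/{ctid}/", f"{DEST}{ctid}/"]
        )
        status = "ok" if result.returncode == 0 else f"FAILED (exit {result.returncode})"
        print(f"container {ctid}: {status}")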
Unfortunately, we must admit that there is a high probability of severe data corruption and/or loss. The RAID array appears to be heavily damaged: two of the four drives are known to be dead, and the other two are only intermittently visible to the RAID card. If you have your own backup, we recommend locating it and preparing for a possible restoration. If you have backups on our Central Backup servers, those will be perfectly fine, as they are kept separately from our VPS nodes.
Update @ 7:09am 4/8/2011 EST: The RAID array in DalVZ4 is fully corrupt, as best we can tell. The superblock on /dev/sda2 (the partition containing VPS data) has been corrupted, causing a total loss of inode counts. We're running fsck on the hard disks to see if we can recover anything at all by attempting to rebuild the superblock, but there's not much we can do at this point to restore the RAID array.
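For context, filesystems in the ext family keep backup copies of the superblock at fixed intervals on disk, which is what a rebuild attempt like this leans on. The sketch below shows the general idea of walking those backups read-only; the block offsets and the loop itself are illustrative assumptions, not a record of the exact commands we're running.

    # Illustrative sketch: try e2fsck against common backup-superblock
    # locations, read-only (-n), so a bad guess can't make things worse.
    import subprocess

    DEVICE = "/dev/sda2"
    # Typical backup superblock offsets for 1 KiB / 4 KiB block sizes (assumed).
    BACKUP_SUPERBLOCKS = [8193, 32768, 98304, 163840, 229376]

    for sb in BACKUP_SUPERBLOCKS:
        check = subprocess.run(["e2fsck", "-n", "-b", str(sb), DEVICE])
        # e2fsck exit code 8 is an operational error (e.g. this alternate
        # superblock is unusable too); anything lower means the filesystem
        # could at least be opened, even if errors remain.
        if check.returncode < 8:
            print(f"alternate superblock {sb} was readable on {DEVICE}")
            break
        print(f"alternate superblock {sb} did not work (exit {check.returncode})")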
Once the new server is online, we'll start re-creating VPS servers within it. If the old server's hard disks are recoverable, we'll offer all clients tarballs of the data that we can retrieve. Alternatively, we can rsync the data into their VPS. However, this will work if, and only if, the fsck completes without further damaging the hard disk array in its current state.
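As an illustration of the tarball option, a per-container packaging pass would look something like the sketch below; the recovery and output paths are placeholders, not our actual layout.

    # Hypothetical sketch: package whatever data is recovered into one gzipped
    # tarball per container, ready to hand back to clients. Paths are assumed.
    import tarfile
    from pathlib import Path

    RECOVERED = Path("/mnt/recovery/vz/private")
    OUT = Path("/srv/tarballs/dalvz4")
    OUT.mkdir(parents=True, exist_ok=True)

    for container in sorted(p for p in RECOVERED.iterdir() if p.is_dir()):
        archive = OUT / f"{container.name}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(container, arcname=container.name)
        print(f"wrote {archive}")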
Brief synopsis of what happened: Our RAID 10 arrays are built as a stripe of mirrors. There are four hard drives, with two drives on each side of the "stripe" mirroring each other (RAID 1), and the two mirrors striped together (RAID 0). This improves both performance and redundancy across the board. RAID 10 allows up to two drives to fail at a time, with one caveat: the two failed drives must be on opposite sides of the stripe, one in each mirror. If the two failed drives happen to be on the same side of the stripe, all data in the array is lost. What happened this morning is that the two drives that failed were on the same side of the stripe.
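To make the failure math concrete, the short sketch below enumerates every two-drive failure combination in a four-drive RAID 10 layout and marks which ones destroy the array; the drive and mirror labels are purely illustrative.

    # Illustrative sketch: which two-drive failures does a 4-drive RAID 10
    # survive? The array is lost only when both failed drives belong to the
    # same mirrored pair ("the same side of the stripe").
    from itertools import combinations

    # Two mirrored pairs striped together; labels are hypothetical.
    mirrors = {"disk0": "mirror-A", "disk1": "mirror-A",
               "disk2": "mirror-B", "disk3": "mirror-B"}

    for a, b in combinations(mirrors, 2):
        fatal = mirrors[a] == mirrors[b]
        print(f"{a} + {b} fail: {'ARRAY LOST' if fatal else 'array survives'}")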
We're working with Softlayer to find out why and how this happened. Their automated systems detected no RAID errors at any point. In addition, the RAID array was rebuilt last week with one hard drive replaced after Drive 0 reported a failure. After the replacement, all drives and array points reported as error-free, and no anomalies were detected until a loss of ping was discovered this morning, shortly before this post was created. In discussing the circumstances with Softlayer, we believe there may be an underlying issue in either A) the RAID card hardware in this server or B) the firmware version on the RAID card. We had a similar event on DalVZ3 last month, although in that case the data was recoverable and the array mountable, unlike in this scenario.
Once the fsck is complete, we will update this post once again with the outcome. If we're able to restore data, all clients will be notified ASAP, as the array may not be stable enough to last very long before needing another fsck, which could leave the RAID even more damaged.