Below is a detailed summary of the reasons for, and our response to, the outage on March 18th and 19th, along with the actions we plan to take to ensure that such an outage doesn’t occur again. In total there were about 8 hours of downtime for our clients’ websites and 2 hours of downtime for the CRM portion of EasyBroker. Our last unplanned outage was on February 14, 2012, over a year ago. We’ve maintained over 99% uptime for the year; however, we understand that EasyBroker is fundamental to your business, and we plan to take steps to ensure that we can respond more quickly should a similar problem arise.
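To put the downtime figures in perspective, here is a rough calculation of what they mean as a yearly uptime percentage. It is only a sketch: it assumes a 365-day window and that the hours quoted above were the only downtime in that window, and the function name is ours for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def uptime_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Return uptime as a percentage of the given period."""
    return 100.0 * (1 - downtime_hours / period_hours)


# Downtime figures quoted in this post.
websites_uptime = uptime_percent(8)  # clients' websites
crm_uptime = uptime_percent(2)       # CRM portion of EasyBroker

print(f"Websites: {websites_uptime:.2f}%")
print(f"CRM:      {crm_uptime:.2f}%")
```

Under those assumptions both figures sit comfortably above 99%, but eight hours in a single stretch is still a full business day, which is why we are making the changes described below.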
The problem and the various attempts to solve it
On March 18th our main production server started behaving strangely. After rebooting it and making a few changes, which took about an hour, we were able to get the server back online, but without discovering the root cause. During the day we investigated with our host, Rackspace, but we were still unable to find the root of the problem.
On the morning of the 19th the problem returned, and this time we weren’t able to get the server to come back online. We tried to swap in a newly built virtual server, but unfortunately, as we were migrating the server, Rackspace had an outage with their Cloud Files service, which caused our migration to freeze. While trying to resolve the migration issue with Rackspace, we launched a new server on a different IP and were able to quickly get the CRM portion of EasyBroker back online. Unfortunately, all of the domain names of our clients’ websites were pointing to the old server’s IP, so moving them to the new IP would probably have taken four to eight hours. Rackspace told us that it would take about 30 minutes to get the Cloud Files service back online, so we assumed it would be best to wait.
Unfortunately, after the Cloud Files service came back online, Rackspace still couldn’t unfreeze the server. After about 5 hours of working with us, they realized there was a hardware issue, most likely the hard drive, which had also been the root cause of the server’s problems since the 18th. Rackspace replaced the hardware, and we were then able to get the server fully migrated.
Plans to alleviate future outages
The biggest problem we had with this outage was that we couldn’t move to a new server because our clients’ domains were tied to the IP of the current one. This would not have been a factor if we had used CNAME records instead of A records for each of our clients’ domains: we could have transitioned everything to a new server with a single DNS change. We’ve already updated all of the client domains that we manage to use CNAME records so that we’re prepared for this in the future.
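The difference between the two record types can be illustrated with a toy model of DNS lookups. This is only a sketch to show the idea, not our actual DNS configuration: every hostname and IP below is made up, and real DNS also involves caching (TTLs), so a change is not visible everywhere instantly.

```python
def resolve(records, name, max_hops=8):
    """Follow CNAME records until an A record is found; return its IP."""
    for _ in range(max_hops):
        if (name, "A") in records:
            return records[(name, "A")]
        if (name, "CNAME") in records:
            name = records[(name, "CNAME")]  # follow the alias
        else:
            raise KeyError(f"no record for {name}")
    raise RuntimeError("CNAME chain too long")


def a_records_pointing_at(records, ip):
    """List the names whose A record would need editing to leave `ip`."""
    return [name for (name, rtype), value in records.items()
            if rtype == "A" and value == ip]


# Hypothetical client hostnames and server IPs, for illustration only.
clients = ["www.client-one.example", "www.client-two.example",
           "www.client-three.example"]
OLD_IP, NEW_IP = "192.0.2.10", "192.0.2.20"

# A-record setup: every client domain stores the server IP itself.
a_setup = {(c, "A"): OLD_IP for c in clients}

# CNAME setup: client domains alias one hostname, which holds the only A record.
cname_setup = {(c, "CNAME"): "sites.host.example" for c in clients}
cname_setup[("sites.host.example", "A")] = OLD_IP

a_changes = a_records_pointing_at(a_setup, OLD_IP)          # one per client
cname_changes = a_records_pointing_at(cname_setup, OLD_IP)  # a single record

# Moving servers in the CNAME setup is one edit to the shared record.
cname_setup[("sites.host.example", "A")] = NEW_IP
moved = [resolve(cname_setup, c) for c in clients]
```

In the A-record setup the number of records to edit grows with the number of clients, and many of those records live in DNS accounts we don’t control. In the CNAME setup only the one shared record changes. One caveat: a CNAME record can’t be placed on a bare domain (the zone apex), so this approach applies to hostnames such as `www`.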
Another way we could have resolved the problem is by using a load balancer, which would let us swap in new servers without having to take down EasyBroker. A load balancer enables us to use a single IP for multiple servers. We are currently in the process of adding one so that in the future we can resolve this kind of problem with no downtime, or at least a very limited amount.
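As a rough illustration of the idea, the sketch below models a load balancer that keeps one public IP while the servers behind it come and go. It is a simplified round-robin model written for this post, not the product we are deploying; the class, its methods, and all IPs are hypothetical.

```python
class LoadBalancer:
    """Toy round-robin balancer: one public IP, many backend servers."""

    def __init__(self, public_ip, backends):
        self.public_ip = public_ip      # the only IP that DNS points to
        self.backends = list(backends)  # private addresses of the servers
        self.healthy = set(self.backends)
        self._next = 0

    def add_backend(self, backend):
        """Bring a new server into rotation."""
        if backend not in self.backends:
            self.backends.append(backend)
        self.healthy.add(backend)

    def mark_down(self, backend):
        """Take a failing server out of rotation."""
        self.healthy.discard(backend)

    def pick(self):
        """Choose the next healthy backend, in round-robin order."""
        candidates = [b for b in self.backends if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backends available")
        backend = candidates[self._next % len(candidates)]
        self._next += 1
        return backend


lb = LoadBalancer("203.0.113.5", ["10.0.0.1", "10.0.0.2"])
before = [lb.pick() for _ in range(4)]

# The first server fails: remove it and swap in a freshly built one.
lb.mark_down("10.0.0.1")
lb.add_backend("10.0.0.3")
after = [lb.pick() for _ in range(4)]
```

Throughout the swap the public IP never changes, so no DNS records need to be touched and visitors keep being served by whichever servers are healthy.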
Conclusion
We are very sorry for the downtime and realize that it affects each of our clients’ businesses. We have maintained over 99% uptime for the past few years; however, we will still take additional measures to reduce the possibility of such outages in the future.