Building a Bulletproof Cloud: How Mule iON Stays Up Even When EC2 is Down

October 18 2011


High availability. Fault-tolerance. Redundancy. Region failover. These are all major features that users look for when determining which cloud platform to use. They are not, however, easy problems to solve when building a cloud platform. Previously, we discussed the technology surrounding Mule iON’s architecture. Now, we will take a deeper dive into these components and how we carefully built Mule iON to resist outages or failures on Amazon EC2.

Redundancy Within AWS Availability Zones

Mule iON elasticsearch set-up

Having availability across different AWS availability zones is a critical piece of the iON puzzle. As an example, take our strategy for elasticsearch, our logging framework. We currently run 4 elasticsearch nodes in a clustered setup, with each node in a different availability zone in the East region. All 4 nodes write log data through the S3 gateway, which in turn writes the data to an S3 bucket, as depicted in the picture to the left.

When an application is deployed onto Mule iON, it writes data to the logging aggregation tier, which sits in front of the cluster. One of the nodes indexes this data, and the tier ensures that the data is replicated across the nodes. Each node holds 5 shards (each a portion of the data) and also stores a replica of another node's shards. If an elasticsearch node were terminated (either manually or by an outage), the replicas of its shards would be promoted to primaries. With this setup, Mule iON can lose any node in the elasticsearch cluster without losing any log data or taking any downtime in the logging framework.
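To make the shard arithmetic concrete, here is a minimal sketch of that guarantee. The node names and the ring-style replica placement are our illustration (not necessarily iON's exact layout), but it shows why a 4-node cluster where every shard has one replica on a different node can lose any single node without losing data:

```python
# Model a 4-node cluster: each node holds 5 primary shards, and each
# shard's replica lives on the next node in the ring, so a primary and
# its replica never share a node.
NODES = ["es-zone-a", "es-zone-b", "es-zone-c", "es-zone-d"]
SHARDS_PER_NODE = 5

placement = {}  # shard id -> set of nodes holding a copy of that shard
for i, node in enumerate(NODES):
    replica_node = NODES[(i + 1) % len(NODES)]
    for s in range(SHARDS_PER_NODE):
        placement[f"{node}-shard{s}"] = {node, replica_node}

def survives(lost_node):
    """True if every shard still has at least one live copy after
    lost_node is terminated."""
    return all(copies - {lost_node} for copies in placement.values())

# Losing any one node leaves every shard with a surviving copy.
assert all(survives(n) for n in NODES)
```

Note that the guarantee is for a single node failure: with only one replica per shard, losing two adjacent nodes in the ring could lose the shards they share.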

Building for Region Failover

What would happen if the entire East region of EC2 had an outage? We needed to be sure that we could bring up the iON platform in the West region with ease. For elasticsearch (as described above), we plan to periodically sync the data to an S3 bucket in the West region. This allows all of the infrastructure to be replicated, but what about customer application data?

To get around this problem, we use a standard MongoDB replica set (shown to the right). We have 2 Mongo servers in the East region (one of them being the primary database) and a 3rd server as a hot standby in the West region. Each Mongo instance has its own volume where it stores data locally. Replication between the 3 instances is handled by Mongo automatically as part of the standard setup. Moreover, the West instance can neither become the primary nor be read from as long as the East instances are up. If the entire East region were to fail, we could manually promote the Mongo West server to primary and bring up 2 new instances as back-ups, mirroring the current setup.
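The shape of that replica set can be sketched as a standard `rs.initiate()` call from the mongo shell. The host names are placeholders, and `priority: 0` is one standard way to keep the West member from ever being elected primary while the East members are up (reads also stay off a secondary unless explicitly allowed):

```javascript
// Hypothetical three-member replica set: two East members plus a
// West hot standby that can never be elected primary (priority: 0).
rs.initiate({
  _id: "ion",
  members: [
    { _id: 0, host: "mongo-east-1:27017" },             // primary candidate
    { _id: 1, host: "mongo-east-2:27017" },             // primary candidate
    { _id: 2, host: "mongo-west-1:27017", priority: 0 } // standby
  ]
});
```

During a region failover, reconfiguring the West member with a non-zero priority is what "manually promoting" it amounts to.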

Back-Up Strategy

In order to build a cloud platform with redundancy between different AWS regions, we were forced to rethink our back-up strategy. Initially, we were storing all configuration information on AMIs (Amazon Machine Images) and all data on EBS volumes. However, we found some major disadvantages to using EBS.

  • It is much harder to scale with EBS volumes, since they are attached to a particular instance.
  • EBS volumes can only be accessed from an EC2 instance within the same availability zone.
  • There is a higher risk of I/O failures when writing data to EBS than to S3.

As a result, we have moved away from EBS and now use S3 to move data (since S3 can be accessed from any availability zone within a region), periodically syncing it to other regions.
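The periodic cross-region sync amounts to copying any object that is new or changed in the source bucket into the destination. A minimal sketch of that logic, with buckets modeled as plain dicts (in production the same loop would run against the S3 API, and the key names here are made up):

```python
def sync(source, dest):
    """Copy every key whose value is missing or stale in dest.
    Returns the list of keys that were copied."""
    copied = []
    for key, value in source.items():
        if dest.get(key) != value:
            dest[key] = value
            copied.append(key)
    return copied

# East has a new log object and an updated config; West is behind.
east = {"logs/2011-10-18.log": b"log data", "conf/app.cfg": b"v2"}
west = {"conf/app.cfg": b"v1"}

assert sync(east, west) == ["logs/2011-10-18.log", "conf/app.cfg"]
assert west == east  # after the sync, the regions agree
```

Because the copy is conditional on the value differing, re-running the sync is cheap and idempotent, which is what makes a periodic schedule practical.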

Upgrade Strategy

We have also found that using S3 with Chef makes upgrading Mule iON much easier. Instead of storing all of our configuration on a role (or AMI), we install only a few basic packages on the role, along with the Chef package. We then store the rest of our components in an S3 bucket as key-value pairs, as shown in the example below.

Key                  Value
ion-console          console.war
chef-configuration   chef-recipe.config
elasticsearch        elasticsearch.jar

As seen in the diagram to the right, if we need to upgrade elasticsearch, we place the new elasticsearch distribution in the S3 bucket and trigger a script with the name of the role to update. This script fetches any updated Chef recipes, pulls the necessary configurations, and applies them to the given role. If we ever need to make configuration changes, we simply change the chef-configuration value above. If the East region were to fail, we could also re-run the same Chef recipes on EC2 instances in the West region and bring all of the instances back up. Using Scalr, we can also tie these roles and recipes to events such as instance start-up or shutdown. This allows seamless, zero-downtime upgrades, since we do not need to restart any roles.
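The core of that upgrade trigger can be sketched as follows. The role-to-artifact mapping and the function name are illustrative (the real script reads the key-value pairs from S3 and finishes with a Chef run), but the control flow matches the description above:

```python
# Mirrors the key/value table above: component name -> artifact.
ARTIFACT_STORE = {
    "ion-console": "console.war",
    "chef-configuration": "chef-recipe.config",
    "elasticsearch": "elasticsearch.jar",
}

# Which components each role needs (an illustrative mapping).
ROLE_ARTIFACTS = {
    "logging": ["elasticsearch", "chef-configuration"],
    "console": ["ion-console", "chef-configuration"],
}

def stale_artifacts(role, installed):
    """Return the store keys whose artifact differs from what the role
    currently has installed, i.e. what the upgrade script must pull."""
    return [key for key in ROLE_ARTIFACTS[role]
            if installed.get(key) != ARTIFACT_STORE[key]]

# A logging node running an old elasticsearch build pulls just that jar.
assert stale_artifacts("logging", {
    "elasticsearch": "elasticsearch-old.jar",
    "chef-configuration": "chef-recipe.config",
}) == ["elasticsearch"]
```

Because the script only pulls what has changed, the same trigger works for a one-component upgrade in East and for rebuilding every role from scratch in West, where every artifact starts out stale.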

Final Thoughts

Amazon provides us all the functionality and tools to build a fault-tolerant, highly available cloud platform, but we still need to figure out how to assemble them. There were a couple of EC2 outages while we were building out iON, and they taught us an invaluable lesson about architecting in the cloud: assume everything will fail. When architecting Mule iON, we made sure to have back-ups, redundancy, and emergency procedures in place to deal with EC2 outages.

Despite all of this, building a bulletproof cloud holds more challenges: How does iON handle security? How does iON's load balancer work? How do we resist outages in other systems we depend on? Stay tuned for more in the coming weeks!


2 Responses to “Building a Bulletproof Cloud: How Mule iON Stays Up Even When EC2 is Down”

  1. What is the delay in synchronization you have noticed when you run a replica set across us-west and us-east? Is the node in us-west mostly up to speed?

    If you do failover from us-east to us-west, how does the service restoration work? How long does setting up a new replica set in us-west take?
    I assume the steps you guys take are:
    1.) promote the us-west node to primary
    2.) add a node to that replica set (is this node revived from backup, and hence not up to speed with the other node?)

    Good article!

  2. […] How is log data protected? How is the surrounding infrastructure secured? We previously talked about how iON stays up and running even through EC2 outages. Today, we will talk about iON security to show how we protect customer information and the […]