High availability. Fault-tolerance. Redundancy. Region failover. These are all major features that users look for when determining which cloud platform to use. They are not, however, easy problems to solve when building a cloud platform. Previously, we discussed the technology surrounding Mule iON’s architecture. Now, we will take a deeper dive into these components and how we carefully built Mule iON to resist outages or failures on Amazon EC2.
Redundancy Within AWS Availability Zones
Redundancy across different AWS availability zones is a critical piece of the iON puzzle. Take our elasticsearch (our logging framework) strategy as an example. We currently run 4 elasticsearch nodes in a clustered setup, with each node in a different availability zone of the East region. All 4 nodes write log data to the S3 gateway, which in turn persists the data to an S3 bucket, as depicted in the picture to the left.
When an application is deployed onto Mule iON, it writes log data to the aggregation tier that sits in front of the cluster. One of the nodes indexes this data, and the tier ensures the data is replicated across the nodes: each node holds 5 shards (portions of the data) and also carries 1 replica of another node's shards. If an elasticsearch node were terminated (either manually or by an outage), the replicas of the downed node's shards would be promoted to primaries on the surviving nodes. With this setup, Mule iON can lose any node in the elasticsearch cluster without losing log data or taking the logging framework down.
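The single-node fault tolerance described above can be sketched with a toy model. The node names, shard ids, and replica layout below are illustrative, not the actual iON topology; real elasticsearch distributes shards itself.

```python
# Toy model of the failover above: 4 nodes, 5 primary shards each, plus one
# replica of another node's primaries. All names and counts are illustrative.

def surviving_shards(cluster, failed_node):
    """Return the set of shard ids still served after failed_node dies.

    `cluster` maps node name -> {"primaries": set, "replicas": set}.
    Replicas of the failed node's primaries get promoted on the survivors.
    """
    alive = {name: s for name, s in cluster.items() if name != failed_node}
    served = set()
    for shards in alive.values():
        served |= shards["primaries"] | shards["replicas"]
    return served

# Each node's replica set mirrors another node's primaries.
cluster = {
    "node-a": {"primaries": {0, 1, 2, 3, 4},      "replicas": {5, 6, 7, 8, 9}},
    "node-b": {"primaries": {5, 6, 7, 8, 9},      "replicas": {10, 11, 12, 13, 14}},
    "node-c": {"primaries": {10, 11, 12, 13, 14}, "replicas": {15, 16, 17, 18, 19}},
    "node-d": {"primaries": {15, 16, 17, 18, 19}, "replicas": {0, 1, 2, 3, 4}},
}

# Losing any single node still leaves every shard available on a survivor.
assert surviving_shards(cluster, "node-b") == set(range(20))
```

Note that this layout tolerates any single node failure, but losing two adjacent nodes at once could drop shards that only they held, which is why the nodes sit in different availability zones.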
Building for Region Failover
What would happen if the entire East region of EC2 had an outage? We needed to be sure that we could bring up the iON platform in the West region with ease. For elasticsearch (as described above), we are planning to periodically sync the data to an S3 bucket in the West region. That allows all of the infrastructure to be replicated, but what about customer application data?
To solve this problem, we are using a standard Mongo replica set setup (shown to the right). We have 2 Mongo servers in the East region (one of them the primary database) and a 3rd server as a hot standby in the West region. Each Mongo instance stores its data locally on its own EBS volume, and replication between the 3 instances is handled automatically by Mongo as part of the standard setup. Moreover, the West instance can neither become primary nor be read from while the East instances are up. If the entire East region were to fail, we could manually promote the West Mongo server to primary and bring up 2 new instances as backups, mirroring the current setup.
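In Mongo's replica set configuration, a member with priority 0 can never be elected primary, and marking it hidden keeps clients from reading from it. A hypothetical initiation document for the layout above might look like this (host names are placeholders, not the actual iON hosts):

```python
# Hypothetical replica-set config for the 2-East + 1-West layout described
# above. This dict would be passed to rs.initiate() in the mongo shell (or
# the replSetInitiate command). Host names are illustrative placeholders.
rs_config = {
    "_id": "ion",
    "members": [
        {"_id": 0, "host": "mongo-east-1:27017", "priority": 1},
        {"_id": 1, "host": "mongo-east-2:27017", "priority": 1},
        # priority 0: never eligible to become primary;
        # hidden: invisible to clients, so it cannot be read from.
        {"_id": 2, "host": "mongo-west-1:27017", "priority": 0, "hidden": True},
    ],
}

# Only the two East members are eligible for election.
eligible = [m["host"] for m in rs_config["members"] if m.get("priority", 1) > 0]
assert eligible == ["mongo-east-1:27017", "mongo-east-2:27017"]
```

The manual failover step described above would amount to reconfiguring the set with the West member's priority raised once the East members are confirmed down.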
In order to build a cloud platform with redundancy across different AWS regions, we were forced to rethink our backup strategy. Initially, we stored all configuration information on AMIs (Amazon Machine Images) and all data on EBS volumes. However, we found some major disadvantages to using EBS.
- It is much harder to scale with EBS volumes, since they are attached to a particular instance.
- EBS volumes can only be accessed from an EC2 instance within the same availability zone.
- There is a higher risk of I/O failure when writing data to EBS than to S3.
As a result, we have moved away from EBS and now use S3 to move data (since S3 can be accessed from any availability zone within a region), periodically syncing it to other regions.
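A periodic cross-region sync job boils down to diffing the two buckets' object listings and copying what is new or changed. The sketch below isolates that diff logic; the function name and the key-to-ETag listing shape are our own, and the actual copy step would go through an S3 client.

```python
# Core of a periodic cross-region sync job: compare listings of the source
# and destination buckets (key -> ETag) and decide what to copy. Bucket
# contents here are illustrative.

def keys_to_sync(source, destination):
    """Return keys that are new or changed in source relative to destination."""
    return sorted(key for key, etag in source.items()
                  if destination.get(key) != etag)

east = {"logs/2011-07-01": "etag-aaa", "logs/2011-07-02": "etag-bbb"}
west = {"logs/2011-07-01": "etag-aaa"}

# Only the object missing from the West bucket needs to be copied.
assert keys_to_sync(east, west) == ["logs/2011-07-02"]
```

Because S3 ETags change when an object's content changes, this also picks up modified objects, not just missing ones.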
We have also found that using S3 with Chef makes upgrading Mule iON much easier. Instead of storing all of our configuration on a role (or AMI), we install only a few basic packages on the role, along with the Chef package. We then store the rest of our components in an S3 bucket as key-value pairs, as shown in the example below.
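As an illustrative stand-in for that example, the per-role configuration might be laid out as key-value pairs like these. The keys, package versions, and bucket layout below are hypothetical, not the actual iON values.

```python
# Hypothetical per-role configuration stored in S3 as key-value pairs.
# Chef recipes would read these values when converging a role; every
# name and version here is illustrative.
role_config = {
    "roles/elasticsearch/package":      "elasticsearch-0.17.tar.gz",
    "roles/elasticsearch/cluster_name": "ion-logging",
    "roles/mongodb/package":            "mongodb-1.8.tgz",
    "roles/mongodb/replica_set":        "ion",
}

# A recipe looks up the values for its own role by key prefix.
es_keys = {k: v for k, v in role_config.items()
           if k.startswith("roles/elasticsearch/")}
assert es_keys["roles/elasticsearch/cluster_name"] == "ion-logging"
```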
As seen in the diagram to the right, if we need to upgrade elasticsearch, we place the new elasticsearch distribution in the S3 bucket and trigger a script with the name of the role to update. This script fetches any updated Chef recipes, pulls the necessary configurations, and applies them to the given role. If we ever need to make configuration changes, we update the corresponding Chef recipe value. If the East region were to fail, we can also re-run the same Chef recipes against EC2 instances in the West region and bring all of the instances up again. Using Scalr, we can additionally tie these roles and recipes to events such as instance start-up or shutdown. This allows us to do seamless updates and zero-downtime upgrades, since we do not need to restart any roles.
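That trigger script might look something like the following sketch. The bucket path, directory layout, and the use of the aws CLI and a local chef-client run are all illustrative assumptions, not the actual iON tooling.

```shell
#!/bin/sh
# Illustrative upgrade trigger for one role -- all paths are hypothetical.
ROLE="$1"                                   # e.g. "elasticsearch"

# 1. Pull any updated Chef recipes and configuration from the bucket.
aws s3 sync "s3://ion-config/chef-repo" /var/chef/repo

# 2. Converge this role's run list locally; the recipe fetches the new
#    distribution referenced in the bucket and applies only what changed.
chef-client --local-mode -o "role[$ROLE]" -c /var/chef/client.rb
```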
Amazon provides all the functionality and tools needed to build a fault-tolerant, highly available cloud platform, but we still had to figure out how to assemble them. While we were building out iON, there were a couple of EC2 outages, and they taught us an invaluable lesson for architecting in the cloud: assume everything will fail. When architecting Mule iON, we had to make sure we had backups, redundancy, and emergency procedures in place to deal with EC2 outages.
Even with all of this in place, building a bulletproof cloud poses more challenges: How does iON handle security? How does iON's load balancer work? How do we resist outages in other dependent systems? Stay tuned for more in the coming weeks!