Building a highly available, fault-tolerant cloud platform comes with its share of challenges. What happens when components fail? What happens when the cloud itself experiences downtime? How is it possible to ensure customer apps are always available and their log data is never lost?
These are some of the very questions we ask ourselves when working through the iON architecture. With so many choices, both open-source and commercial, it can be difficult to know where to start, and it is not unusual to experiment with several possible solutions before settling on the right technology stack.
When building the iON platform, we needed to balance time to market with the need to control our infrastructure, and we are pleased with where that has led.
This is the first in a series of blog posts that describe how we meet the challenge of creating a fully redundant, secure, scalable and fault-tolerant iPaaS. Below are some of the architectural components, with descriptions of how each contributes to the high availability of the iON platform.
Amazon EC2 – The iON platform runs entirely within the EC2 cloud. We believe AWS represents the best balance of control and features for building a highly scalable and resilient platform. All of our services, data storage systems and customer applications run as Amazon EC2 instances and are governed via AWS. We take advantage of the auto-provisioning and scaling features AWS provides to ensure full redundancy across the iON platform. Some of our servers use EBS storage for quick, reliable access to instance data, and all of our permanent storage and backups are on S3.
Scalr – We use Scalr to manage and provision our infrastructure. Scalr gives us an abstraction layer over cloud infrastructure APIs, plus added features on top of the AWS API for auto-scaling and relaunching instances on the fly. Scalr scripting allows our platform to customize instances as they are launched, and its auto-snapshot functionality allows us to schedule periodic S3 snapshots of all our customer, platform and log data.
MongoDB – Our choice of MongoDB was based on the need for reliable, scalable and easy-to-use data storage for customer applications and statistics. None of this information is particularly relational, and we needed to get up and running as fast as possible. Document storage works well for this type of data, with GridFS being useful for customer application binaries. A MongoDB replica set with nodes in different EC2 availability zones and regions ensures we can survive the loss of replica members with automatic failover. Our west coast replica is configured as a hot backup, and we take frequent snapshots of it to S3.
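To make the replica set layout concrete, here is a minimal sketch in Python of a replica set configuration document with members spread across zones. The host names, zone labels, set name and helper functions are all hypothetical, and modeling the "hot backup" as a priority-0 member is our assumption for illustration, not a description of the exact iON configuration.

```python
def build_replica_set_config(name, members):
    """Build a replica set configuration document.

    `members` is a list of (host, zone, is_backup) tuples. All values
    passed in here are illustrative, not real iON hosts.
    """
    config = {"_id": name, "members": []}
    for index, (host, zone, is_backup) in enumerate(members):
        member = {"_id": index, "host": host, "tags": {"zone": zone}}
        if is_backup:
            # Assumption: the hot backup replicates data and still votes,
            # but priority 0 means it is never elected primary.
            member["priority"] = 0
        config["members"].append(member)
    return config


def survives_zone_loss(config, lost_zone):
    """Return True if a primary can still be elected after losing a zone.

    An election needs a strict majority of members plus at least one
    remaining member that is eligible to become primary.
    """
    members = config["members"]
    remaining = [m for m in members if m["tags"]["zone"] != lost_zone]
    has_majority = len(remaining) > len(members) // 2
    has_candidate = any(m.get("priority", 1) > 0 for m in remaining)
    return has_majority and has_candidate


if __name__ == "__main__":
    config = build_replica_set_config(
        "ion-rs",  # hypothetical replica set name
        [
            ("db-east-a.example.internal:27017", "us-east-1a", False),
            ("db-east-b.example.internal:27017", "us-east-1b", False),
            ("db-west-a.example.internal:27017", "us-west-1a", True),
        ],
    )
    for zone in ("us-east-1a", "us-east-1b", "us-west-1a"):
        print(zone, survives_zone_loss(config, zone))
```

With three members in three zones, losing any single zone still leaves two voting members, which is why spreading members across zones matters more than simply adding members to one zone.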
ElasticSearch – ElasticSearch is an open source, distributed, RESTful search engine built on top of Lucene. We needed our customer application log data to be searchable in near real-time and to be highly available, and ElasticSearch has worked out well. We currently run a four-node search cluster on EC2 with 5 shards and 1 replica per shard, using the S3 gateway for recovery. During the development of the iON logging infrastructure, we frequently ran “kill tests” of ES nodes to verify our ability to bring up replacement servers and ensure customer log integrity. When we need to scale, we can add additional nodes or shards.
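The shard arithmetic behind that layout can be sketched in a few lines of Python. The settings keys `number_of_shards` and `number_of_replicas` are the standard index settings; the helper functions are hypothetical, and the even spread across nodes is an idealization, since the cluster's own allocator decides actual placement.

```python
def index_settings(shards=5, replicas=1):
    """Return an index-creation settings body for the given layout."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(settings):
    """Count every shard copy: each primary plus its replicas."""
    index = settings["settings"]["index"]
    return index["number_of_shards"] * (1 + index["number_of_replicas"])


def ideal_copies_per_node(settings, nodes):
    """Spread shard copies as evenly as possible across `nodes` nodes.

    This is an idealized estimate for capacity planning only.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    base, extra = divmod(total_shard_copies(settings), nodes)
    return [base + 1 if i < extra else base for i in range(nodes)]


if __name__ == "__main__":
    settings = index_settings(shards=5, replicas=1)
    print(total_shard_copies(settings))        # 10
    print(ideal_copies_per_node(settings, 4))  # [3, 3, 2, 2]
```

Because a replica is never allocated to the same node as its primary, one replica per shard means every shard has a second copy on a different node, which is what the kill tests exercise.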
With the potential for iON applications to generate significant amounts of log data, and for the number of connections to grow over time, we needed a logging aggregation tier. After prototyping and researching other options, we chose to develop our own components for this layer. We built security and index management directly into the iON logging framework, allowing us to secure customer log data and handle massive amounts of content at scale. We can add nodes to this distributed aggregator service to scale further.
Nginx – The use of a load balancer is essential to the scalability of the iON platform. Nginx is an easily configurable and performant load balancer that gives us a way to scale customer applications as well as internal iON services. iON dynamically reconfigures nginx on the fly to adapt to shifting resources within the cloud, and nginx also provides SSL security to platform components and customer applications. We passed on Amazon ELB due to its lack of SSL support to customer applications inside the network, but may revisit that in the future. Load balancer redundancy is achieved via Scalr DNS services.
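As a rough illustration of dynamic reconfiguration, the Python sketch below renders an nginx `upstream` block from the current list of backend instances. The function name, upstream name and addresses are hypothetical, and this is not the actual iON implementation; it only shows the general shape of regenerating configuration when instances come and go.

```python
def render_upstream(name, backends):
    """Render an nginx upstream block for a list of (host, port) pairs.

    nginx rejects an upstream block with no servers, so an empty
    backend list is treated as an error here.
    """
    if not backends:
        raise ValueError("an upstream needs at least one backend server")
    lines = ["upstream %s {" % name]
    for host, port in backends:
        lines.append("    server %s:%d;" % (host, port))
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical private addresses for two customer application instances.
    print(render_upstream("customer_app", [("10.0.0.11", 8080), ("10.0.0.12", 8080)]))
```

A process that watches for launched or terminated instances could write this output to an included configuration file and then run `nginx -s reload`, which applies the new configuration without dropping existing connections.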
Never before have there been so many tools available for building highly available and scalable cloud platforms. However, that doesn’t make it easy. Check back for our next post, which covers how we have built iON for high availability with EC2 features such as availability zones, region failover and data persistence.