William Liu

Architecting Highly Available Apps on AWS


##Table of Contents

##Summary

These are notes from attending the NYC AWS Training and Certification for Highly Available Applications using Amazon Web Services.

##Highly Available vs Fault Tolerant

####Highly Available

Highly Available means removing single points of failure (because “Everything fails all the time”)

####Number of Nines in Levels of Availability

Availability is often expressed as a "number of nines": 99.9% (three nines) allows roughly 8.8 hours of downtime per year, while 99.99% (four nines) allows only about 53 minutes per year. Each additional nine is harder and more expensive to achieve.

####Fault Tolerance

Fault Tolerance means built-in redundancy so apps can continue functioning when components fail

##AWS Architecture

AWS encourages treating everything in the cloud as both 'off-site' and 'multi-site': your infrastructure runs outside your own facility and can span multiple locations.

####Chaos Monkey

What is Chaos Monkey? Netflix created Chaos Monkey to terminate random instances in a live environment to see if an application is fault tolerant. It is not recommended to run this against a live production server.

####AWS S3

S3 is perfect for hosting your static digital assets (e.g. css, company logo, javascript files). S3 is very durable and serves this content without any web servers for you to manage.

You can also use Amazon S3’s website feature when only client-side processing is required. There’s no infrastructure to configure/launch.

With S3, there are different file systems with the following URI schemes (example paths in the sketch below):

- S3 Native FileSystem (URI scheme: s3n): a native filesystem for reading and writing regular files on S3; 5GB limit on file size
- S3A (URI scheme: s3a): the successor to S3 Native; uses Amazon's libraries to interact with S3 and supports larger files
- S3 Block FileSystem (URI scheme: s3): a block-based filesystem backed by S3; files are stored as blocks, just like in HDFS
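For illustration, the same object would be addressed differently under each scheme. A minimal sketch with a hypothetical bucket and key (which schemes are usable depends on your Hadoop version and configuration):

```
hadoop fs -ls s3n://my-bucket/logs/part-0000.gz   # S3 Native FileSystem
hadoop fs -ls s3a://my-bucket/logs/part-0000.gz   # S3A (successor to s3n)
hadoop fs -ls s3://my-bucket/logs/part-0000.gz    # S3 Block FileSystem
```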

####AWS CloudFront

CloudFront is a world-wide content distribution network (CDN). It serves content from edge locations close to the end user, giving low latency and high data transfer speeds. CloudFront also comes with its own availability SLA.

Without CloudFront, your content is being loaded from EC2 webservers directly. With no CDN, the response time is longer and server load is higher.

You can use CloudFront even for dynamic resources (not just static ones). This is perfect for intelligently pushing small packets of data back to the origin server, as long as it does NOT require a real-time response (e.g. a voting application).

####AWS Route 53

AWS Route 53 is a highly available and scalable DNS (Fun note: 53 is the common port for DNS). Route 53 manages DNS Failover to route around region and AZ level issues. Route 53 also has domain name registration and renewals. Route 53 should point to a Load Balancer.

DNS Failover means that if you have DNS pointing to a primary server and the health check fails, you can automatically reroute to a different server (even if it's just a static site that says there are issues). For example, that setup would look like this:

- Route 53 record sets with a failover routing policy pointing at the primary web server
- An Amazon S3 website serving a static backup copy of the site as the secondary

##Lab 1 Exercise

We will transform a fragile two-tier web application into a resilient, scalable application using AWS services. Currently it is a web server that relies on a database server. If either instance fails, the site is down. If there is a surge in traffic, performance would degrade. If we scale up, we would need to stop the instances (affecting availability). It would also be difficult to scale horizontally by adding more web servers.

##How to manually fix a bad instance

In EC2, you can see that there is an Elastic IP address (Elastic IPs) attached to the web server (e.g. 54.175.6.128). This enables the instance (e.g. i-038815d0 (cloudwiki-lab1-www)) to be reached via a static address. A DNS A record is typically mapped to the Elastic IP so that you can access the server using a friendly host name like www.jobwaffle.com.

In our EC2, we can create Images under Instances > Actions > Image > Create Image. This is done so we can launch a replacement instance in the same exact state as the current running instance (in case something goes wrong). You can see these images under the Images > AMI.

We will simulate a faulty instance by terminating the web server under Instances. We then replace the web server by going to AMIs, selecting the AMI we just created, and clicking 'Launch'. We then specify the size and security groups (select the www-server security group), and the key pair.
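The same manual recovery can be scripted with the AWS CLI. A rough sketch, reusing the instance ID, Elastic IP, and www-server security group from the lab; the AMI ID, instance type, key pair, and new instance ID are placeholders:

```
# Create an image of the running web server
aws ec2 create-image --instance-id i-038815d0 --name "cloudwiki-lab1-www-backup"

# Launch a replacement instance from that image (placeholder AMI ID and key pair)
aws ec2 run-instances --image-id ami-12345678 --instance-type t2.micro \
    --security-groups www-server --key-name my-keypair

# Re-associate the Elastic IP with the new instance (placeholder instance ID)
aws ec2 associate-address --public-ip 54.175.6.128 --instance-id i-0abcd1234
```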

##How to automatically fix bad instances

AWS lets you failover to a backup website if your primary website is unavailable. We will do this by configuring a simple backup website on Amazon S3 and use Amazon Route 53 DNS Failover to automatically route traffic when the primary site is unavailable.

Set up the Hosted Zone and Health Check, and set the Failover Routing Policy

In Route 53, we will look at Hosted Zones, which contain the DNS records for your domain name (modifying these settings can take hours to take effect). Under 'Hosted Zones' > 'Go to Record Sets' you will see 3 records: the NS and SOA records created with the hosted zone, plus an A record pointing at the web server.

We want to create a DNS Failover, which first involves setting up Health Checks. We add a Health Check name, IP Address (from our 'Hosted Zones' > Record Set - A record), host name, port, path, request interval (10 seconds), and failure threshold. When you go to the Record Set, you'll see a few settings.

This automatically checks the health of the homepage every 10 seconds and verifies that it returns a successful response. If this check fails, we want to route our traffic to a backup site on AWS S3.
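The same health check can be created from the CLI. A minimal sketch, reusing the Elastic IP from earlier; the caller reference and failure threshold of 3 are assumptions:

```
# HTTP health check against the web server's Elastic IP, every 10 seconds
aws route53 create-health-check --caller-reference lab1-www-check-001 \
    --health-check-config IPAddress=54.175.6.128,Port=80,Type=HTTP,ResourcePath=/,RequestInterval=10,FailureThreshold=3
```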

Set up an S3 bucket to contain a static version of the site

Go to S3 and let’s assume there’s a bucket that has a static version of your website. There’s an option under ‘Properties’ that allows for Static Website Hosting. Get the link for this endpoint (e.g. i-038815d0.highlyavailable.org.s3-website-us-east-1.amazonaws.com).
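Static website hosting can also be enabled from the CLI. A sketch, assuming the bucket name from the endpoint above and index/error documents named index.html and error.html:

```
# Turn on static website hosting for the backup bucket
aws s3 website s3://i-038815d0.highlyavailable.org \
    --index-document index.html --error-document error.html
```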

Go to Route 53 and under Hosted Zones, we want to ‘Create Record Set’. Under ‘Alias’ we select our S3 endpoint that we created right above. We specify that the Routing Policy is set to Failover and the Policy is ‘Secondary’.
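Behind the console, this is a failover record set with an alias to the S3 website endpoint. A rough sketch of the equivalent API call; the domain name and your hosted zone ID are placeholders, and the AliasTarget hosted zone ID shown is the standard one for s3-website-us-east-1:

```
# Create the SECONDARY failover record aliased to the S3 website endpoint
aws route53 change-resource-record-sets --hosted-zone-id Z1EXAMPLE \
    --change-batch '{
      "Changes": [{
        "Action": "CREATE",
        "ResourceRecordSet": {
          "Name": "www.highlyavailable.org",
          "Type": "A",
          "SetIdentifier": "www-secondary",
          "Failover": "SECONDARY",
          "AliasTarget": {
            "HostedZoneId": "Z3AQBSTGFYJSTF",
            "DNSName": "s3-website-us-east-1.amazonaws.com",
            "EvaluateTargetHealth": false
          }
        }
      }]
    }'
```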

You can now terminate the webserver and you’ll see that there is a static failover (not all the links will work). What happens is that Route 53 has detected the failure of the primary and is now sending traffic to the backup site. Remember that the failover does not have to be a static site; the failover can also be directed to another active site.

##High Availability Cost

Hosting a high availability website can be complex. You want to balance cost (not paying for more than you need) against giving users a good, responsive experience. We will go over the following Web Tier Core Concepts:

####AWS CloudWatch

CloudWatch allows you to monitor AWS cloud resources. Alarms can be set up on metric thresholds that trigger an action (e.g. auto-scale a new server, alert you of high latency). It can tell you CPU utilization, but not things like memory use, because those metrics are only visible from inside the guest OS (you would need to publish them as custom metrics).
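For example, a high-latency alert could be wired to an SNS topic. A sketch, assuming a placeholder SNS topic ARN and the ELB1 load balancer name used later in these notes:

```
# Alarm when average ELB latency exceeds 1 second for 5 consecutive minutes
aws cloudwatch put-metric-alarm --alarm-name elb-high-latency \
    --namespace AWS/ELB --metric-name Latency --statistic Average \
    --dimensions Name=LoadBalancerName,Value=ELB1 \
    --period 60 --evaluation-periods 5 --threshold 1 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```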

####AWS Elastic Load Balancer

Elastic Load Balancer (ELB) is a highly available load balancer that distributes traffic across instances, performs health checks, and works with Auto Scaling. ELB spans multiple AZs and scales smoothly based on traffic; scaling the ELB itself can take 1-7 minutes depending on the traffic profile.

The ELB's IP addresses will change over time, so always reference the ELB by its DNS name (don't use the IP address).

For spiky/flash traffic, pre-warm the ELB by submitting a request to AWS Support. You can also DIY by slowly ramping up simulated users.

There are two types of ELBs: public ELBs (only the ELB is public facing; web/app instances can use private IP addresses in private subnets) and internal ELBs (ideal for balancing requests between multiple internal tiers).
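A (classic) ELB such as the ELB1 referenced in the Auto Scaling example below can be created and given a health check from the CLI. A sketch, assuming an HTTP listener on port 80 and the two AZs used elsewhere in these notes:

```
# Create a public classic ELB spanning two AZs
aws elb create-load-balancer --load-balancer-name ELB1 \
    --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
    --availability-zones us-east-1a us-east-1c

# Health-check the instances behind it
aws elb configure-health-check --load-balancer-name ELB1 \
    --health-check Target=HTTP:80/,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2
```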

####AWS Auto Scaling

With auto scaling, we want to scale up and down or rebalance servers across AZs. We have the following types of scaling: manual, by schedule, by policy in response to real-time alerts, and automatic rebalancing across AZs.

For example, scale up by 10% if CPU utilization is greater than 60% for 5 minutes, or scale down by 10% if CPU utilization is less than 30% for 20 minutes. Scale up aggressively and scale down conservatively, since launching new instances takes longer than removing them.

####Auto Scaling Components

We have the following components: a launch configuration (what to launch), an Auto Scaling group (where to launch, plus min/max/desired size), and scaling policies (when to launch and terminate).

We can use the AWS CLI for Auto Scaling by calling commands like an API (or you can manage this under EC2 > Auto Scaling in the console). For example:

```
aws autoscaling create-launch-configuration --launch-configuration-name LC1 \
    --image-id ami-570f603ee --instance-type m3.medium

aws autoscaling create-auto-scaling-group --auto-scaling-group-name ASG1 \
    --launch-configuration-name LC1 --min-size 2 --max-size 8 \
    --desired-capacity 4 --availability-zones us-east-1a us-east-1c \
    --load-balancer-names ELB1
```
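Scaling policies like the "scale up by 10% if CPU > 60% for 5 minutes" example above can be attached the same way. A sketch; the alarm action must be set to the policy ARN that the first command returns:

```
# Scale the group up by 10% when triggered
aws autoscaling put-scaling-policy --auto-scaling-group-name ASG1 \
    --policy-name scale-up-10pct --adjustment-type PercentChangeInCapacity \
    --scaling-adjustment 10

# Trigger the policy when average CPU > 60% for 5 minutes (placeholder policy ARN)
aws cloudwatch put-metric-alarm --alarm-name asg1-cpu-high \
    --namespace AWS/EC2 --metric-name CPUUtilization --statistic Average \
    --dimensions Name=AutoScalingGroupName,Value=ASG1 \
    --period 300 --evaluation-periods 1 --threshold 60 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions <scaling-policy-arn>
```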

####Bootstrapping

Bootstrapping is configuring a server automatically when it first starts (e.g. with a user data script or a configuration management tool like Chef).
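A common way to bootstrap on AWS is an EC2 user data script that runs at first boot. A minimal sketch for an Amazon Linux web server; the package list and repository URL are hypothetical:

```
#!/bin/bash
# EC2 user data: runs once when the instance first boots
yum install -y httpd git
# Pull the application code (hypothetical repository)
git clone https://github.com/example/cloudwiki-app.git /var/www/html
service httpd start
chkconfig httpd on
```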

##Lab 2 Exercise - Create High Availability on Web Tier

####Load Balancers

We will continue our previous web application and remove the single point of failure at the web tier by adding a load balancer and implementing auto scaling. We will start with 'Deployment & Management' > CloudFormation > Output; notice that this URL resolves to an Elastic Load Balancer. Under EC2, look at the Load Balancers section. Under the 'Instances' tab, you can see that there are two Availability Zones (AZs); e.g. we have us-east-1c and us-east-1d.

####Auto Scaling Part 1

So what happens if an instance in an AZ fails? The other instances have to carry the load. We can implement auto scaling to help correct for this. We simulate an instance stopping by stopping a web server instance. The instance will fail an ELB health check and ELB will remove this instance from rotation.

After a few minutes, Auto Scaling will find the unavailable instance, terminate it, and launch a replacement instance, then register it with ELB. You can see this in action in EC2 under Auto Scaling Groups and look at the ‘Activity History’ to see that an instance was terminated, then a new EC2 instance is automatically launching.

####Auto Scaling Part 2

We can generate load on our web servers to trigger Auto Scaling to grow the number of servers and handle the load. To see how a group scales, in EC2 look under 'Details' to see the 'desired', 'min', and 'max' servers. To see the policies, look under 'Scaling Policies' to see the rules (e.g. remove 1 instance when CPU utilization < 40 for 180 consecutive periods of 60 seconds).

We run a custom program (bees with machine guns) that spins up EC2 servers to generate load on our web servers. You can see the load in real time in EC2 > Instances > select a web server > Monitoring. To see a list of events, look under 'Auto Scaling Groups' > 'Activity History' to watch web server instances being added and removed as the load rises and falls.

####Summary: High Availability on Web Tier

Summary: The current architecture improves availability and solves the single point of failure at the web tier (via Elastic Load Balancer), using auto scaling to provide fault tolerance and scalability within the web server fleet.

##AWS Storage Options

AWS has a few storage options, including scalable storage, inexpensive archive storage, persistent direct-attached storage, and turn-key gateway solutions. You want to pick the right one for the job. Each has a unique combination of performance, durability, cost, and interface.

####AWS Elastic Block Store (EBS)

EBS is a high performance block storage device that you can mount as a drive on an instance. However, a volume cannot be attached to multiple instances at once. It is essentially a network-attached hard drive. Volumes are 1 GB to 16 TB in size, are private to your instances, and are replicated within an Availability Zone. Volumes can be snapshotted for point-in-time restore. Detailed metrics can be captured with CloudWatch.

EBS Availability

A volume is replicated, but only within a single Availability Zone. Snapshots are stored in S3. You can increase availability by replicating your volumes to another AZ or by taking snapshots regularly.
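Both approaches can be scripted. A sketch with placeholder volume and snapshot IDs:

```
# Take a point-in-time snapshot (stored in S3 behind the scenes)
aws ec2 create-snapshot --volume-id vol-12345678 --description "nightly backup"

# Recreate the volume in a different AZ from that snapshot
aws ec2 create-volume --snapshot-id snap-12345678 --availability-zone us-east-1c
```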

EBS Performance

There are two types of EBS volumes, Magnetic and SSD. Magnetic is not recommended.

Use Provisioned IOPS (PIOPS) volumes for consistent I/O performance.

####AWS EC2 Instance Storage

'Instance Storage' is storage local to your AWS EC2 instance. These are basically hard drives that you can't take with you (i.e. you lose the data when your server stops) with the following properties: the data persists only for the life of the instance, it is included in the instance price, and it is not replicated or backed up automatically.

####AWS S3 Storage

Very high durability of objects (S3 is designed for eleven nines of durability). Unlimited storage of objects of any type.

####AWS Storage Options Summary

##Database Options

AWS supports a variety of database deployment options.

####DIY RDBMS options

Each option solves different DB problems so choose based on experience, features, and cost.

- Self-managed on EC2
- AWS managed (RDS)

####Database Storage Considerations

####Caching

You can always cache to reduce the number of reads to your database.

####Relational Database Replication

You can have a typical master-slave setup. You might also have database mirroring. This allows reporting queries to hit a slave database instead of the master.

####Database Sharding

You can shard your databases by splitting large partitionable tables across multiple, smaller database servers. You need to set up the application so it is shard-aware, and shards may require periodic rebalancing. This also brings additional challenges like multi-server querying.

####NoSQL Databases

If you don't need features like transaction support, ACID compliance, joins, and SQL, you can switch to a key-value store using NoSQL (very fast, and no need to worry about the same sharding issues as a relational database).

####AWS RDS

AWS has a relational database service called RDS that has a one-click high availability (Multi-AZ) option. This creates a replica of the database in another Availability Zone.
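The "one click" corresponds to the Multi-AZ flag. A sketch of creating a Multi-AZ MySQL instance; the identifier, instance class, storage size, and credentials are placeholders:

```
# Multi-AZ deployment: RDS keeps a synchronous standby in another AZ
aws rds create-db-instance --db-instance-identifier cloudwiki-db \
    --db-instance-class db.m3.medium --engine mysql \
    --allocated-storage 20 --multi-az \
    --master-username admin --master-user-password 'ChangeMe123!'
```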

####AWS DynamoDB

AWS DynamoDB is a fully managed NoSQL database service that provides extremely fast and predictable performance with seamless scalability. There is minimal administration, low latency from SSDs, and unlimited potential storage and throughput. There is no need for tuning, it is highly durable, and it is one of the few services that is fault tolerant (the only other is Route 53).

####AWS Database Summary

##Lab 3 Exercise - Create High Availability on Database Tier

####AWS RDS to create highly available relational database

Our goal is to create a highly available database tier. We will use AWS RDS, where Amazon will run the database instance in multi-AZs. When looking at the database, if Multi-AZ option is ‘Yes’, you will see the ‘Availability Zone’ and the ‘Secondary Availability Zone’.

Failover will automatically occur when various events (like rebooting the server) happen. Normally a reboot would cause downtime, but with a Multi-AZ instance, RDS can fail over to the standby instance while the primary is rebooting (when you reboot, there will be an option to select 'Reboot with Failover', which you should check). You can verify what happens by checking the 'Events' log. There are also options to create Read Replicas and Automatic Database Backups. It is as simple as clicking on 'Instance Actions' > 'Create Read Replica'.
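Both of those console actions have CLI equivalents. A sketch, reusing the placeholder instance identifier from the earlier sketch:

```
# Force a failover to the standby while rebooting the primary
aws rds reboot-db-instance --db-instance-identifier cloudwiki-db --force-failover

# Create a read replica for reporting/read traffic
aws rds create-db-instance-read-replica \
    --db-instance-identifier cloudwiki-db-replica \
    --source-db-instance-identifier cloudwiki-db
```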

####AWS DynamoDB (NoSQL) to store session state

We lose session state in case of web server failure. This is also an issue when using Auto Scaling, because auto scaled instances should be stateless. There are many possible solutions for storing session state, including putting it outside the web servers in a database, in an in-memory cache (like memcached or AWS ElastiCache), or in high performance durable storage.

For this example, we will store our session state in AWS DynamoDB; since it is inherently fault tolerant, we do not need to worry about replication, failover, or other high availability issues. We enter some data into our application (e.g. create a login, password, etc.). We go back to 'DynamoDB' and look at the table to see the id and data. By default, there is a provisioned throughput of 10 reads and writes per second, which you can change depending on your needs.
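A session table like the one in the lab could be created from the CLI. A sketch, assuming a string hash key named id (the table name is a placeholder) and the default throughput of 10 reads and 10 writes per second:

```
# Simple session-state table keyed on the session id
aws dynamodb create-table --table-name sessions \
    --attribute-definitions AttributeName=id,AttributeType=S \
    --key-schema AttributeName=id,KeyType=HASH \
    --provisioned-throughput ReadCapacityUnits=10,WriteCapacityUnits=10
```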

####Lab Summary: Use AWS RDS or DynamoDB

In summary, it is easy to make databases highly available if you use AWS RDS or DynamoDB. You have the option to build your own custom databases and manually set up master-slave replication, but that is a lot more difficult.

##High Availability Design Patterns

When it comes to High Availability, we have the following patterns:

####Common Design Patterns

- Multi-Server Pattern
- Multi-Datacenter Pattern
- High Availability Database Pattern
- Floating IP Pattern
- Floating Interface Pattern
- State Sharing
- Web Storage Pattern
- Scheduled Scale Out
- Job Observer Pattern
- Bootstrap Instance
- High Availability (HA) NAT
- HA NAT - Squid Proxy

####VPN and AWS Direct Connect

VPN connectivity allows you to create dual, redundant tunnels between your on-premises equipment and AWS.

AWS Direct Connect establishes a private network connection between your network and one of the AWS Regions. AWS Direct Connect is an alternative to using the Internet to access AWS cloud services. It reduces bandwidth costs, gives consistent network performance, and provides a private connection to your Amazon VPC.

##Lab 4 Exercise - Making outbound traffic highly available using NAT instances

In previous labs we created redundant services across Availability Zones (AZs) within a region and distributed inbound traffic across those services at various application tiers (web, database). Now we want to look at how to make outbound traffic originating from application tiers in the VPC highly available, using NAT instances that span multiple AZs.
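One classic approach is a pair of NAT instances (one per AZ) that monitor each other and take over the private subnet's default route when the other fails. A sketch of the route-takeover step, with placeholder route table and instance IDs:

```
# Point the private subnet's default route at the healthy NAT instance
aws ec2 replace-route --route-table-id rtb-12345678 \
    --destination-cidr-block 0.0.0.0/0 --instance-id i-0abcd1234
```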

This looks tough, time to call an Amazon representative if you run into this issue.