Public cloud outages – 20:20 hindsight

Rocky start for public cloud providers

Our favorite public cloud providers have had a rocky start to the year. First, on February 28th Amazon Web Services (AWS), a cloud infrastructure provider for the likes of Netflix, Comcast, Adobe, Dow Jones, GE, and even NASA, suffered at the hands of “Murphy’s Law”. While debugging an issue within the S3 billing system, an incorrectly entered command removed a larger than expected amount of server capacity. This unexpected loss of available capacity forced the remaining servers supporting the subsystem to restart, creating a degradation of service that caused a ripple effect across other services supported by S3.

Then Microsoft’s popular email service Outlook (previously Hotmail) suffered an outage that prevented users from logging into their mail accounts. Most recently, on March 15th, Microsoft Azure’s storage cluster suffered a power-related “brownout” that resulted in an 8-hour degradation of service.

This is neither the first nor the last time that our large public cloud providers will have hiccups that ripple across the internet. With 20:20 hindsight, we should ask ourselves what we can all do to mitigate the impact on our businesses going forward.

For this article, I share my thoughts regarding AWS, my preferred cloud platform.

Why were so many companies affected?
The first question you might ask is, “Why did this affect SO many companies?” AWS is a shared cloud infrastructure provider, which means its clients share certain AWS-maintained resources for automation in custom and even commercial applications. AWS provides its own security and logical separation between customers. Nevertheless, there was a domino effect. S3, the origin of the outage, is a core service for AWS, which means many of the secondary features that we know and rely on, like Kinesis, DynamoDB, and even the virtual disks on customer servers (EBS), all use S3 at their core. Because of this, S3, a normally VERY reliable service (AWS touts an SLA of 99.9%), affected many other services, impacting users’ ability to launch new servers, change configurations, deploy code changes, or even access their consoles to view status.
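To put the 99.9% figure and the domino effect in perspective, here is a rough back-of-the-envelope sketch. It assumes a 365-day year, and it assumes that a service fails whenever any one of its dependencies fails and that those failures are independent. Both are simplifications, and the function names are purely illustrative.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def downtime_budget_hours(availability: float) -> float:
    """Hours per year a service may be down while still meeting its availability target."""
    return (1.0 - availability) * HOURS_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a service that needs ALL of its dependencies to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(components)


if __name__ == "__main__":
    print(f"99.9% allows {downtime_budget_hours(0.999):.2f} hours of downtime per year")
    # A hypothetical service that is itself 99.9% available and also depends on S3.
    print(f"Stacked on S3: {serial_availability(0.999, 0.999):.4%} available")
```

Under these assumptions, 99.9% leaves room for roughly 8.76 hours of downtime per year, and every additional hard dependency lowers the availability of the service built on top of it.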

The thing is, even though the S3 outage was lengthy, it was limited to a single region. AWS by default allows any customer to build redundant infrastructure on, for example, the U.S. west coast, or even in a different country if export regulations allow. So, why are these billion-dollar companies not doing this?

Here is a list of what we think are the main reasons:

1) A Perception of Cost

When you think redundancy, you might think back to that line in “Contact” where S.R. Hadden says, “Why build one when you can have two at twice the price?” Traditional thinking tells us that redundant infrastructure means we must buy two of everything. But in the cloud, this is not always the case. In AWS, for example, an entire “scaffold” of critical infrastructure and configurations can be implemented in a second region for a fraction of the cost of your live production environment. It can sit there waiting, even taking a small amount of traffic to ensure it’s always working. In the case of a regional outage, the standby infrastructure can take advantage of the wonders of the cloud and scale up, adding compute, servers, storage, and throughput: everything you need to then shift your traffic over to the new region at a DNS level. Sure, you might see a small performance hit if the data must travel further, but at least you’re back up and running, right?
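The failover decision described above can be sketched in a few lines. This is a simplified illustration, not an AWS API: the `Region` type, its fields, and both functions are hypothetical, and a real setup would rely on the provider's health checks and auto-scaling rather than hand-rolled logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical view of one region: its name, health, and running servers."""
    name: str
    healthy: bool
    servers: int


def choose_dns_target(primary: Region, standby: Region) -> str:
    """Point DNS at the primary while it is healthy, otherwise at the standby."""
    return primary.name if primary.healthy else standby.name


def servers_to_add(standby: Region, required: int) -> int:
    """How far the small standby 'scaffold' must scale up to carry production load."""
    return max(0, required - standby.servers)


if __name__ == "__main__":
    primary = Region("us-east-1", healthy=False, servers=40)
    standby = Region("us-west-1", healthy=True, servers=4)
    print("Route traffic to:", choose_dns_target(primary, standby))
    print("Servers to add in standby:", servers_to_add(standby, required=40))
```

The point of the sketch is the cost argument: the standby only runs a handful of servers until the moment it is needed.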

2) Complexity

Multi-region is possible, but that doesn’t mean it’s easy. Adding new environments to the mix can come with double the operational overhead, meaning your teams must jump through more hoops to deploy, audit infrastructure, and calculate costs. Most companies don’t maintain the IT expertise to pull this off, so they balance risk and decide to go single region, thinking that multiple servers or even multiple availability zones are enough. Complicating this, many companies view resiliency as an afterthought because it doesn’t drive revenue. Because of this, leadership prefers to focus development time and resources on building business logic, and rightly so. As we saw from this outage, however, treating resiliency as an afterthought is not good enough. What’s interesting, though, is that a little automation can go a long way in simplifying multi-regional architecture. But it’s always going to require a little extra muscle to get there.

3) Obscurity

Autonomy is a modern and highly effective motivational tactic. That’s why some companies allow development teams to be primarily responsible for uptime and allow them the freedom to do it their way. It may look beautiful from the outside, but it comes with its own realities. In an environment with a lack of standards and oversight, technical leadership might not have the visibility (or the time or knowledge) to audit the resiliency of systems implemented by individual teams. This can lead to differences in disaster recovery between product teams, which creates confusion and too much reliance on undocumented tribal knowledge. Over time, this configuration drift can create uneven recovery periods which, in a highly interdependent environment, can cause the weak link to break the chain.

So, what do we do about this? How do we take our applications to the next level of reliability?

In the cloud, there is a saying: “WTF”. Withstand The Failure.

1) Think multi-regional.

Because AWS logically separates its regions to isolate failures, both software and human, building multi-regional architecture is not always a push-button operation. The answer to this is to leverage infrastructure automation to easily replicate environments. The advantage is not only during the process of creating the infrastructure, but also during updates, and even regular audits to ensure consistency. Products like Terraform allow you to define an ideal configuration, and then compare that ideal configuration against your infrastructure in its current state. After the comparison, Terraform can then apply the changes necessary. This type of automation can save a tremendous amount of time in the build and audit process, improving lead time on an infrastructure build-out from days to minutes. This is critical for reducing overhead and potential human error in multi-regional architecture.
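The compare-then-apply idea can be illustrated with a toy example. This is not Terraform's implementation or API; it is a minimal sketch that models infrastructure as a plain dictionary of resource names to settings, and the resource names and functions are made up for illustration.

```python
from typing import Any, Dict

Config = Dict[str, Any]


def plan(desired: Config, current: Config) -> Dict[str, Any]:
    """Compare the ideal configuration with the current state and list the differences."""
    return {
        "create": {k: v for k, v in desired.items() if k not in current},
        "update": {k: v for k, v in desired.items() if k in current and current[k] != v},
        "delete": sorted(k for k in current if k not in desired),
    }


def apply_changes(current: Config, changes: Dict[str, Any]) -> Config:
    """Return a new state with the planned changes applied."""
    result = {k: v for k, v in current.items() if k not in changes["delete"]}
    result.update(changes["create"])
    result.update(changes["update"])
    return result


if __name__ == "__main__":
    # Hypothetical resources: the same desired config can be audited per region.
    desired = {"web_servers": 4, "load_balancer": "enabled", "bucket": "assets"}
    current = {"web_servers": 2, "bucket": "assets", "old_queue": "enabled"}
    changes = plan(desired, current)
    print(changes)
    print(apply_changes(current, changes) == desired)
```

Running the same comparison against each region is what turns a multi-day audit into a routine, repeatable check.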

2) Think provider agnostic.

The new wave of cloud architecture is one that encourages leveraging cloud-native services as accelerators to production. This can be using something as simple and straightforward as Google Apps for email, or as focused as Kinesis for streaming data to serverless processing. These services provide pre-built solutions to problems that software teams used to have to solve in their own way, often cutting corners in the realms of security, repeatability, and operational excellence for these dependent solutions so they could focus on core logic. Though it can be tempting to utilize, for example, AWS-managed Kinesis instead of your own Kafka cluster and its maintenance overhead, it’s important to remember that utilizing something someone else built is not ALWAYS better. Understand the failure and choke points of any solution you implement. After all, the cloud is just someone else’s computer. Thinking provider agnostic allows us to view our applications as independent from their underlying hosting provider. That gives us the ability to maintain infrastructure in separate clouds and thus survive a major outage.

3) Think distributed.

Rather than building a second region as a hot standby, organizations can distribute workloads evenly between two regions and/or two providers. This can help to avoid extra overhead in some cases, because both regions or providers pull their weight by load balancing production traffic.
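As a rough illustration of the distributed approach, the sketch below splits traffic by weight and re-normalizes the weights when a region drops out. The function and its inputs are hypothetical; in practice a DNS or load-balancing service would make this decision based on its own health checks.

```python
from typing import Dict, Set


def rebalance(weights: Dict[str, float], healthy: Set[str]) -> Dict[str, float]:
    """Share traffic among healthy regions in proportion to their original weights."""
    live = {region: w for region, w in weights.items() if region in healthy}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy region left to receive traffic")
    return {region: w / total for region, w in live.items()}


if __name__ == "__main__":
    weights = {"us-east-1": 0.5, "us-west-1": 0.5}
    # Normal operation: both regions carry production traffic.
    print(rebalance(weights, {"us-east-1", "us-west-1"}))
    # us-east-1 fails: the surviving region takes all of the traffic.
    print(rebalance(weights, {"us-west-1"}))
```

Because both regions already serve live traffic, a failover here is a change in proportions rather than a cold start.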

On March 15th, about two weeks after the AWS outage, Microsoft’s hosted cloud also suffered a service disruption caused by issues with the management systems around their Storage product. This was an almost identical disruption with two main differences: Azure’s root cause was a power failure, and the Azure disruption expanded to 26 out of 28 of their regions, as opposed to AWS’s outage being isolated to a single region. It goes to show that things WILL break. The question is: Do you want to do something about it now before it happens again?

Join us for a webinar on Wed, Apr 12, 2017 2:00 PM – 3:00 PM EDT where we will:

1. Introduce architecting multi-regional availability for web applications.
2. Demonstrate the stability of this architecture through a simulated failure of us-east-1 followed by a failure in us-west-1.
3. Compare, based upon our sample architecture, the costs of a traditional implementation (single region / 2 availability zones) with those of a multi-regional approach. You will be surprised by what you see!

About Michael Lucas

Michael Lucas is an accomplished DevOps engineer, software developer, and mentor with a passion for designing self-healing and scalable architecture in AWS. As the AWS Practice Leader for Softchoice, Michael is responsible for the evolution of our offerings, delivery excellence and thought leadership in all things cloud. Prior to joining Softchoice, Michael was the VP of IT and Cloud Architect for the #1 rated Ad Agency in Southern California, where he designed, implemented and managed a highly-automated architecture for the organization’s SaaS portfolio. In his spare time, Michael is a remote-control helicopter hobbyist and enjoys spending time with his 3 kids, wife, and dog.