by Brandon Knitter | Technical Consultant at Taos
Amazon’s most recent outage has garnered a lot of buzz in the industry. Google for “amazon outage 2017” and you’ll find results that range from acknowledgement to outrage. My opinion will fall somewhere in between.
Let’s be honest, shit happens and infrastructure fails (both hardware and software). Shared services such as data stores and network services also fail, with a similar impact on the overall outcome. While this is a known quantity, many choose not to heed the warning and instead forge on with a lack of resiliency. As a developer, operator, or designer, it is up to you either to prevent that outcome or to knowingly accept it.
The outage made the AWS Simple Storage Service (S3) inaccessible within one US region. Many other AWS services failed as a result because of their reliance on this core storage service; even the AWS status dashboard was affected and presumably had a dependency on the failed service. There are four regions within the US, and there have been no confirmed impacts within the remaining three, although there are anecdotal reports of impact beyond the one failed region. It is not clear whether those reported impacts were due to customer-implemented dependencies or to internal AWS dependencies across regions.
The AWS S3 outage resulted in an SLA violation. S3 is designed for 99.99% uptime, while the SLA promises 99.9% per month. That works out to an outage tolerance of roughly 43 minutes per month, far less than the 4+ hour outage many experienced. The good news is that durability is designed for 11 nines (99.999999999%) and no user data was lost. Impacted users can apply for a service disruption credit by following the instructions on the SLA page.
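The arithmetic behind those numbers is simple enough to sketch. This assumes a 30-day month; the exact allowance shifts slightly with month length.

```python
def allowed_downtime_minutes(uptime_pct: float, days: int = 30) -> float:
    """Downtime allowance implied by an uptime percentage over a month."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - uptime_pct / 100)

# S3's SLA threshold vs. its design target:
sla_allowance = allowed_downtime_minutes(99.9)      # ~43.2 minutes/month
design_allowance = allowed_downtime_minutes(99.99)  # ~4.3 minutes/month
```

A 4+ hour outage blows through the SLA allowance several times over, which is why the disruption credits apply.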
A day after the outage, AWS released more details about the impact, what happened, and how they intend to improve both their processes and their designs. The industry has much to learn from this outage, and the transparency around both the failure details and the design implications is helpful for all cloud consumers. The industry has moved in recent years toward near-full disclosure of major events, and I’m happy to see AWS continue to lead in this level of disclosure.
The question remains, though, how can you protect yourself from such an outage?
Amazon provides multiple ways to contain faults, the two primary means being AWS regions and AWS availability zones (AZs). Each provides geographical isolation at a different level, with AZs being the most commonly used way to contain faults within a region. S3, however, is a regionally deployed service, so containing a fault like this one requires multi-region distribution; this is achieved through cross-region bucket replication.
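As a minimal sketch of what cross-region bucket replication looks like: below is the kind of replication configuration you would hand to boto3’s `put_bucket_replication` call. The bucket names and IAM role ARN here are hypothetical, and versioning must already be enabled on both the source and destination buckets before AWS will accept the configuration.

```python
# Hypothetical example: replicate everything from a bucket in us-east-1
# to a replica bucket in a second region (e.g. us-west-2).
replication_config = {
    # IAM role S3 assumes to copy objects (hypothetical ARN):
    "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
    "Rules": [
        {
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Prefix": "",  # empty prefix matches every object in the bucket
            "Destination": {
                # Replica bucket in the second region (hypothetical name):
                "Bucket": "arn:aws:s3:::my-bucket-replica-us-west-2",
            },
        }
    ],
}

# Applying it would look roughly like:
#   s3 = boto3.client("s3", region_name="us-east-1")
#   s3.put_bucket_replication(Bucket="my-bucket-us-east-1",
#                             ReplicationConfiguration=replication_config)
```

Note that replication alone only protects the data; your application still has to know how to read from (or fail over to) the replica region.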
Deploying across regions, though, is no simple task. Given the choice of active/passive or active/active, most solutions implement active/passive. But even an active/passive deployment brings implications for design, cost, and operational overhead. These considerations are often difficult to justify and many times result in a single-region deployment. Choosing the right solution is a choice every product or company has to make for itself, and a single-region deployment is acceptable in many cases. Whatever the solution, making this decision transparent to your organization, and possibly your customers and end users, is highly recommended.
I reached out to a few organizations and asked how they were affected, how they are deployed now, and how they plan to deploy in the future. Each of them was able to justify its design choice, single-region or multi-region. Again, this is a choice. The outrage in the media and industry is confusing when this context is applied. Those surprised by the impact on their solution will want to educate themselves on the options available and the implications of each design consideration. The companies I spoke with that deployed within one region and were affected by the outage responded with an acknowledgement, a shrug, and patience as they waited for AWS to restore S3.
Fundamentally, failure will occur. We must accept failure, we must plan for failure, and we must practice failure. All clouds will fail at some point, AWS is just the most recent and most visible. Weigh the options and implications of each design and remember there are many more reasons to choose the cloud than simply reliability itself…but that’s another blog post.
How do you feel about the outage? I’d like to hear from you. Please post your position and feelings in the comment section below.
AWS Status Page
A birds-eye view of the status of every service that AWS provides in every region.
AWS S3 Service Disruption in the US-EAST-1 Region
Amazon’s description of the outage, what was impacted, and how they intend to improve in the future.
AWS S3 Service Level Agreement
In clear writing what is expected of the AWS S3 service.
Regions and Availability Zones (AWS)
Published documentation from AWS on how to think about resiliency within regions and availability zones. This is a good starting point for the AWS proposed resiliency strategy.
Active-Active for Multi-Regional Resiliency
Insight into how Netflix plans for failure by deploying across multiple regions. This was published in 2013 and is still relevant today.
Should the latest AWS outage scare you away from the public cloud?
A counterpoint on how to think about this failure and how not to be discouraged from advancing to the public cloud. The article title is misleading and smells like click-bait, but read it anyhow as there is good rationale for sticking to a go-to-cloud strategy.
With Its Recent Outage, Amazon Web Services Is Helping To Sell Hybrid IT
Some amount of FUD is being felt across the industry after the AWS outage. This article doesn’t necessarily help progress the march to public cloud. Hybrid cloud solutions are typically used for a migration, security or cost optimization. It is unlikely that most IT shops can achieve the level of uptime and services AWS offers when measured year-over-year.
Amazon Web Services outage reveals critical lack of redundancy across the internet
A good amount of fear mongering and irresponsible journalism on the part of GeekWire. The meat of the article is at the end, where suggestions are made to deploy multi-region and multi-provider; the latter is unattainable for most and certainly doesn’t strengthen your relationship with your cloud provider.