By Brandon Knitter – Technical Consultant
Amazon’s most recent outage has garnered a lot of buzz in the industry. Google for “amazon outage 2017” and you’ll find results that range from acknowledgment to outrage. My opinion will fall somewhere in between.
Let’s be honest, shit happens and infrastructure fails (both hardware and software). Shared services such as data stores and network services also fail, with a similar impact on the overall outcome. While this is a known quantity, many choose not to heed the warning and instead forge on without resiliency. As a developer, operator, or designer, it is up to you to either prevent this outcome or accept it.
The outage resulted in the inaccessibility of the AWS Simple Storage Service (S3) within one US region. Because of this, many subsequent AWS services failed due to their reliance on this core storage service. Even the AWS status dashboard was affected and presumably had a dependency on the failed service. There are four regions within the US, and there have been no confirmed impacts within the remaining three regions, although there are anecdotal reports of impact beyond the one failed region. It is not clear whether these reported impacts were due to customer-implemented dependencies or to an internal AWS dependency across regions.
The AWS S3 outage resulted in an SLA violation. S3 is designed for 99.99% availability, while the SLA is promised at 99.9% per month. This calculates to an outage tolerance of approximately 43 minutes per month, far less than the 4+ hour outage many experienced. The good news is that durability is provided at 11 nines (99.999999999%) and no user data was lost. Impacted users can apply for a service disruption credit by following the instructions on the SLA page.
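The arithmetic behind that outage tolerance is worth spelling out. A quick sketch, assuming a 30-day month:

```python
# Downtime budget implied by an availability SLA, assuming a 30-day month.
def monthly_downtime_budget_minutes(sla: float, days_in_month: int = 30) -> float:
    """Minutes of downtime allowed per month at a given availability level."""
    minutes_in_month = days_in_month * 24 * 60  # 43,200 minutes
    return (1 - sla) * minutes_in_month

print(monthly_downtime_budget_minutes(0.999))   # 99.9% SLA  -> 43.2 minutes
print(monthly_downtime_budget_minutes(0.9999))  # 99.99% design target -> ~4.3 minutes
```

A 4+ hour outage blows through either budget by a wide margin, which is why the credit process exists.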
A day after the outage AWS released more details about the impact, what happened, and how they intend to improve both their processes and their designs. The industry has much to learn from this outage, and the transparency about both the failure details and the design implications is helpful for all cloud consumers. The industry has changed in recent years to near-full disclosure of all major events, and I’m happy to see AWS continue to lead at this level of disclosure.
The question remains, though, how can you protect yourself from such an outage?
Amazon provides multiple ways to contain fault, the two primary means being AWS regions and AWS Availability Zones (AZs). Both serve a purpose at different levels of geographical isolation, with AZs being the most commonly utilized solution for containing a fault within a region. S3 is one service that is regionally deployed and therefore requires multi-region distribution in order to contain fault; this is achieved through cross-region bucket replication.
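To make that concrete, here is a minimal sketch of the replication configuration you would hand to boto3's `put_bucket_replication` call. The bucket names and IAM role ARN are hypothetical placeholders, and note that S3 requires versioning to be enabled on both buckets before replication will work:

```python
# Sketch of an S3 cross-region replication configuration. The destination
# bucket ARN and IAM role ARN below are hypothetical placeholders.
def replication_config(destination_bucket_arn: str, role_arn: str) -> dict:
    """Build a replication configuration that mirrors all new objects."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects
        "Rules": [
            {
                "Status": "Enabled",
                "Prefix": "",  # empty prefix: replicate every object
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }

config = replication_config(
    "arn:aws:s3:::my-backup-bucket-us-west-2",        # hypothetical
    "arn:aws:iam::123456789012:role/s3-replication",  # hypothetical
)
# Applied with boto3 (not run here):
# boto3.client("s3").put_bucket_replication(
#     Bucket="my-primary-bucket", ReplicationConfiguration=config)
```

Replication only copies objects written after the rule is enabled, so it is a forward-looking protection, not a backfill.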
Deploying across regions, though, is no simple task. Given the choice of active/passive or active/active, the most common solutions implement an active/passive deployment. But even an active/passive deployment brings implications for design, cost, and operational overhead. These design considerations are always difficult to justify and many times result in a single-region deployment. Choosing the right solution is a choice every product or company has to make for themselves, and the single-region solution is an acceptable one in many cases. Whatever the solution, making certain this decision is transparent to your organization, and possibly your customers and end-users, is highly recommended.
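The application-side half of an active/passive design can be as simple as a failover wrapper: read from the primary region, and fall back to the passive replica only when the primary fails. This is a sketch of the pattern, with hypothetical stand-in fetch functions rather than real per-region S3 clients:

```python
# Minimal active/passive read pattern: prefer the primary region, fall back
# to the passive replica on failure. The fetch functions are hypothetical
# stand-ins for per-region S3 clients.
def fetch_with_failover(key, fetch_primary, fetch_secondary):
    """Read from the primary region, failing over to the secondary."""
    try:
        return fetch_primary(key)
    except Exception:
        # Primary region is unavailable; serve from the replicated copy.
        return fetch_secondary(key)

def broken_primary(key):
    raise ConnectionError("primary region unreachable")

def healthy_secondary(key):
    return f"object:{key}"

print(fetch_with_failover("report.csv", broken_primary, healthy_secondary))
```

The hard parts this sketch hides, such as replication lag, write failover, and failback, are exactly the design, cost, and operational overhead mentioned above.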
I reached out to a few organizations and asked how they were affected, how they are deployed today, and how they plan to deploy in the future. Each of them was able to justify the design consideration for single- or multi-region. Again, this is a choice. The outrage in the media and industry is confusing when this context is applied. Those surprised by the impact on their solution will want to educate themselves on the options available and the implications of each design consideration. For those companies I spoke with that did deploy within one region and were affected by the outage, their justification led to an acknowledgment, a shrug, and patience as they waited for AWS to restore S3.
Fundamentally, failure will occur. We must accept failure, we must plan for failure, and we must practice failure. All clouds will fail at some point, AWS is just the most recent and most visible. Weigh the options and implications of each design and remember there are many more reasons to choose the cloud than simply reliability itself…but that’s another blog post.