By Dan Roncadin, Sr. Technical Consultant
Take a deep breath and begin to accept that every brilliantly architected cloud solution can still fail.
Every time a cloud provider such as Amazon or Microsoft experiences an outage, it’s headline news for the technology world. Questions and commentary abound: How can these systems fail when so much work has been put into their design and implementation? Can cloud-based solutions be trusted? The answer is that cloud solutions will fail, just like any other technology, but the possibility of failure is no reason to avoid them.
Experienced infrastructure engineers know that all components fail, and various design patterns have arisen to mitigate and survive component failures. Load balancers can seamlessly move traffic away from offline web and app servers, and databases can fail over to a secondary node. The key to these architectures is to reduce shared components and single points of failure, because each shared component is a possible failure that can take down the entire solution. Where things do have to be shared, a lot of effort is put into resiliency and redundancy. Storage and power infrastructure are examples: both are often engineered for maximum uptime. But even that doesn’t prevent all failures.
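A rough calculation shows why shared components matter so much. The sketch below assumes components fail independently, which real systems often violate, and every availability figure in it is an illustrative assumption rather than a measurement.

```python
def redundant(availability, copies):
    """Availability of a group that works if at least one copy is up.

    Assumes the copies fail independently of one another.
    """
    return 1 - (1 - availability) ** copies

# Hypothetical figures: each server is up 99% of the time.
web_tier = redundant(0.99, copies=3)   # three web servers behind a load balancer
db_tier = redundant(0.99, copies=2)    # primary database plus one failover node
shared_storage = 0.999                 # one storage array that everything depends on

# A request needs the web tier, the database tier, and the shared storage.
total = web_tier * db_tier * shared_storage

print(f"web tier:       {web_tier:.6f}")
print(f"database tier:  {db_tier:.6f}")
print(f"shared storage: {shared_storage:.6f}")
print(f"whole solution: {total:.6f}")
```

Under these assumptions the redundant tiers are each more available than the shared storage array, so the single shared component caps the availability of the whole solution no matter how many servers are added.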
Cloud solutions are often seen as an answer to the recursive problem of designing more and more redundancy into infrastructure. The dream is to simply decide how much capacity is needed and in which configuration, and let the service provider deal with preventing and recovering from outages. In many cases this works: cloud providers operate best-in-class data centers at a scale that lets them afford great facilities and operational practices. Their core business is keeping infrastructure up and running. Yet in the past two years a number of cloud provider failures have occurred, rendering entire cloud solutions unavailable (as with Azure) or large sections of data and servers unreachable or unusable (as with Amazon).
What is delivered in a cloud server solution is much more complicated than a simple server. Provisioning a cloud server requires the interplay of a host of services. The user must request the server through an API or management interface. That user session must be authenticated and authorized via an access management solution. A configuration management system must determine whether there is capacity to create the virtual machine and select an appropriate host location. Storage volumes may need to be created for the machine. A provisioning system must instruct the physical host to instantiate the new virtual machine. Network reconfiguration may take place to assign an IP address and security group to the virtual machine. The tight integration of all these services creates many possible failure modes. Breakdowns in these systems can leave users unable to create new machines or, in some cases, unable to access and use existing machines. Add these to the existing set of components at all levels of infrastructure (power, network, disks, storage arrays, servers) and there are countless ways the system can break down.
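The effect of chaining services together can be estimated with simple arithmetic. The sketch below assumes that a provisioning request needs every service in the chain and that the services fail independently. The service names and the 99.9% figure are illustrative assumptions, not numbers from any provider.

```python
from math import prod

def chain_availability(availabilities):
    """Availability of a request that needs every service in the chain.

    Assumes failures are independent, which real systems often violate.
    """
    return prod(availabilities)

# Hypothetical per-service availabilities for the provisioning steps above.
provisioning_chain = {
    "api": 0.999,
    "access_management": 0.999,
    "capacity_and_placement": 0.999,
    "storage": 0.999,
    "provisioning": 0.999,
    "network": 0.999,
}

combined = chain_availability(provisioning_chain.values())
hours_down_per_year = (1 - combined) * 24 * 365

print(f"combined availability: {combined:.4%}")
print(f"expected downtime: {hours_down_per_year:.1f} hours/year")
```

With six services at 99.9% each, the chain as a whole comes out at roughly 99.4%, or about six times the expected downtime of any single service.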
With this complex ecosystem of systems and services involved, the new point of failure becomes complexity itself. Some argue that in complex systems, failure is inevitable. For an excellent treatment of this idea, see Charles Perrow’s book “Normal Accidents,” which examines some of history’s biggest systems failures (Bhopal, Chernobyl, and the Challenger shuttle). Instead of assuming that providers will eventually make all of these interconnected systems and service offerings infallible, it is more prudent to assume that there will at some point be a catastrophic outage. Take a deep breath and begin to accept that every brilliantly architected cloud solution can still fail.
Every environment is different and designing a complete solution requires a lot of thought, but here are some simple ideas to help anyone prepare to survive cloud failures.
Keep a Copy of Critical Data Elsewhere
Use a cloud provider that has multiple hosting locations, and ensure you have a copy of your data in a location outside your primary environment. Consider spreading your data across multiple cloud providers too. If you run your infrastructure on Amazon, perhaps keep a copy of your critical data on Rackspace or vice versa.
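A copy is only useful if it is known to be intact, so it helps to verify each file after copying it. The following is a minimal sketch that uses two local directories as stand-ins for the primary and secondary locations. In practice the secondary would be another region or provider, reached through that provider's own tools. The function names here are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicate(primary: Path, secondary: Path) -> list:
    """Copy every file under primary to secondary and verify each copy.

    Returns the relative paths that were copied, in sorted order.
    Raises OSError if a copy's checksum does not match its source.
    """
    copied = []
    for src in sorted(p for p in primary.rglob("*") if p.is_file()):
        rel = src.relative_to(primary)
        dst = secondary / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        if sha256_of(src) != sha256_of(dst):
            raise OSError(f"checksum mismatch for {rel}")
        copied.append(rel.as_posix())
    return copied
```

Calling `replicate(Path("/data/critical"), Path("/mnt/offsite"))` (both paths are placeholders) would copy the tree and fail loudly on the first corrupted copy.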
Automate Recovery
Recovery time is a key metric for your architecture. How quickly can you provision, configure, and install your environment on a new zone, or even a new provider? Use automation as much as possible, and be ready at the push of a button to spin up your critical environments in another location of your primary provider, or even on a secondary provider. Measure the number of manual steps required to provision your key environments, then drive towards reducing them.
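One simple way to measure manual steps is to keep the recovery runbook as data and mark each step as automated or manual. The steps, names, and durations below are hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str
    automated: bool
    minutes: int  # estimated duration, an illustrative guess

# Hypothetical recovery runbook for one environment.
RUNBOOK = [
    Step("provision servers in secondary zone", automated=True, minutes=10),
    Step("restore latest database backup", automated=True, minutes=25),
    Step("deploy application build", automated=True, minutes=5),
    Step("update load balancer configuration", automated=False, minutes=15),
    Step("switch DNS records", automated=False, minutes=5),
]

def manual_steps(runbook):
    """Return the steps that still need a person."""
    return [step for step in runbook if not step.automated]

def estimated_recovery_minutes(runbook):
    """Sum step durations, assuming steps run one after another."""
    return sum(step.minutes for step in runbook)

remaining = manual_steps(RUNBOOK)
print(f"{len(remaining)} of {len(RUNBOOK)} steps are manual")
for step in remaining:
    print(f"  to automate: {step.name}")
print(f"estimated recovery time: {estimated_recovery_minutes(RUNBOOK)} minutes")
```

Tracking the manual count over time gives a concrete number to drive down, and the remaining manual steps form the automation backlog.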
Keep Your DNS Separate
DNS is key to keeping your site agile and able to be relocated. If your DNS servers are hosted alongside your primary infrastructure, you will be much less able to relocate if necessary (especially if your provider has gone completely offline). Use low TTL values so that DNS servers downstream don’t cache your location for too long.
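The effect of the TTL on relocation time is easy to estimate. The sketch below assumes resolvers honor the record's TTL, and the detection and change times are illustrative guesses.

```python
def worst_case_cutover_seconds(ttl_seconds, detection_seconds, change_seconds):
    """Worst-case time before clients stop using the old address.

    Assumes resolvers honor the TTL. Some cache longer, so real-world
    cutover can exceed this figure.
    """
    return detection_seconds + change_seconds + ttl_seconds

# Hypothetical figures: 2 minutes to notice the outage, 1 minute to change the record.
DETECTION = 120
CHANGE = 60

for ttl in (86400, 3600, 300):
    total = worst_case_cutover_seconds(ttl, DETECTION, CHANGE)
    print(f"TTL {ttl:>6} s -> up to {total / 60:.0f} minutes before all clients move")
```

Under these assumptions, dropping the TTL from one day to five minutes cuts the worst case from about a day to eight minutes. The trade-off is more frequent lookups against your DNS servers.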
Create a Static Fallback Site
Part of surviving outages is never going completely offline. An easy way to do this is to create a static version of your site and host it on another provider, or even on a CDN (Content Distribution Network) such as Akamai or Amazon CloudFront. If your primary provider does experience an outage, you can quickly switch DNS settings and send users to the simple static site. You can also go further and use an automated DNS solution that detects failures in your primary location and automatically routes users to the static site when necessary. This is useful whether or not you have another location or provider in which to spin up a full hosting environment. Have a mechanism to update the static site so that you can post updates and let users know what is happening. Many outages are short-lived, and depending on the uptime demands of your solution, this technique lets you ride out short outages without moving your whole site to another location or provider.
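The decision logic behind automated failover can be sketched in a few lines. This is an illustration, not any DNS provider's actual mechanism: the class name, thresholds, and target labels are hypothetical, and the step that actually updates DNS records is left out because it depends on your provider. Requiring several consecutive failures before switching avoids flapping on a single dropped request.

```python
import http.client
import urllib.request

PRIMARY = "primary"
FALLBACK = "static-fallback"

def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # Covers connection errors, timeouts, HTTP error statuses, bad URLs.
        return False

class FailoverMonitor:
    """Decide where DNS should point, based on consecutive probe results."""

    def __init__(self, probe, failures_to_trip=3, successes_to_recover=2):
        self.probe = probe
        self.failures_to_trip = failures_to_trip
        self.successes_to_recover = successes_to_recover
        self.target = PRIMARY
        self._streak = 0  # consecutive results that argue for switching

    def check(self):
        """Run one probe and return the target DNS should point at."""
        healthy = self.probe()
        on_fallback = self.target == FALLBACK
        argues_for_switch = healthy if on_fallback else not healthy
        self._streak = self._streak + 1 if argues_for_switch else 0
        needed = self.successes_to_recover if on_fallback else self.failures_to_trip
        if self._streak >= needed:
            self.target = PRIMARY if on_fallback else FALLBACK
            self._streak = 0
        return self.target

# Scripted probe results stand in for real HTTP checks in this demo.
results = iter([True, False, False, False, True, True])
monitor = FailoverMonitor(probe=lambda: next(results))
history = [monitor.check() for _ in range(6)]
print(history)
```

In real use the probe would be something like `lambda: http_probe("https://www.example.com/health")` (a placeholder URL), run on a schedule from a location outside the primary provider.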
If you take these measures, you will be well prepared to weather a cloud outage when it occurs.