Written by James Leone, Principal Consultant, Taos, an IBM Company

It has never been more cost efficient for organizations to manage digital workloads with extreme levels of resiliency. The Microsoft Azure cloud provides a global infrastructure network and a portfolio of services to realize business continuity across an organization’s entire digital estate. Yet forethought and planning are still required. In this article, we’ll take a top-down look at relevant topics for planning and then drill into the tactics to successfully execute on that plan.

    • This is an opinionated article on building resilient Azure workloads
    • Business continuity planning is an essential activity for all organizations – do it!
    • Business impact analysis (BIA) is a practice to assess workload criticality
    • BIA facilitates the risk/reward decision making that balances resiliency and cost
    • IaaS and PaaS put a different level of responsibility on you, the cloud user
    • PaaS-based services simplify the effort required to build resilient systems
    • Azure regional pairs are key to handling large-scale outages beyond your control
    • Use the appropriate disaster recovery strategy based on workload RPO/RTO: Hot, Warm, Lukewarm and Cold
    • Azure availability zones provide powerful capabilities to mitigate localized outages
    • A sound resource tagging scheme, coupled with robust Azure policies, dramatically improves your overall resiliency stance

Business Continuity Planning (BCP)

This write-up is NOT intended to be a comprehensive guide on BCP. However, it is an important backdrop that drives the level of resiliency you need to succeed in Azure.

BCP is a high-level process organizations undertake to evaluate and mitigate various risks which can negatively impact—or completely inhibit—ongoing operations (natural disasters, cyber-attacks, pandemics, data corruption, power outages, etc.). The larger the organization, the more complex it is to address and the more ongoing care is required. BCP examines the impact of disasters on manufacturing of physical goods, supply chain, digital technology and a variety of other functions across different industries.

There are many approaches to break down this large problem into focus areas that are subsequently delegated to respective units within an organization. Examples include:

    • By division
    • By organizational function
    • By branch location or geography

Business Impact Analysis (BIA)

Organizations are constantly changing and adapting to market conditions. This evolution is crucial for success, but it also means things like compliance and BCP are not point-in-time activities; rather, they are an ongoing part of an organization’s lifeblood. BIA is a critical undercurrent to evaluating (and re-evaluating) the consequences of outages.

Regardless of how your organization divides responsibilities for managing digital technologies, BIA provides a mechanism to build an inventory of the key digital systems, digital services and applications used throughout the organization’s functional areas. In addition, BIA helps quantify the criticality of the workloads and data used across an organization. It often starts as a simple spreadsheet and matures into a system of its own. Whether you are using the advanced automation and rating tables that Taos, an IBM company, provides to large enterprises or just a basic questionnaire, the BIA captures the criticality of a workload and its data.

Ultimately, the criticality drives the recovery point objective (RPO) and recovery time objective (RTO) for each workload under inventory. The RPO is the maximum amount of data loss, measured in time, that a workload can tolerate; the RTO is the maximum time the workload can remain unavailable. In turn, the RPO and RTO set expectations with internal and external users, justifying the investment of money and time. BIA is the mechanism to clarify what is high, higher and highest priority.
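
As a rough illustration of how a BIA inventory can drive recovery objectives, the sketch below maps a criticality tier to RPO/RTO targets. The tier names, the workloads and every duration are hypothetical placeholders, not values from this article; your own BIA supplies the real ones.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable downtime


# Hypothetical tiers and targets; real values come out of your own BIA.
OBJECTIVES_BY_TIER = {
    "mission-critical": RecoveryObjectives(timedelta(minutes=5), timedelta(minutes=15)),
    "business-critical": RecoveryObjectives(timedelta(hours=1), timedelta(hours=4)),
    "important": RecoveryObjectives(timedelta(hours=4), timedelta(hours=24)),
    "deferrable": RecoveryObjectives(timedelta(hours=24), timedelta(days=3)),
}


def objectives_for(tier: str) -> RecoveryObjectives:
    """Look up the recovery objectives for a BIA criticality tier."""
    try:
        return OBJECTIVES_BY_TIER[tier]
    except KeyError:
        raise ValueError(f"unknown criticality tier: {tier!r}") from None


# Hypothetical inventory rows: workload name -> criticality tier.
inventory = {"order-api": "mission-critical", "reporting": "important"}
for workload, tier in inventory.items():
    targets = objectives_for(tier)
    print(f"{workload}: RPO {targets.rpo}, RTO {targets.rto}")
```

Keeping the mapping in one table means a re-evaluated workload only needs its tier changed; the objectives follow from the tier.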

Shared Responsibilities and IaaS, PaaS, SaaS

Let’s dive deeper into the specifics of workloads running in the Azure cloud.

The Infrastructure as a Service (IaaS) paradigm maps closest to traditional IT systems management. These are the building blocks like virtual machines (VMs), local storage (Disks) and software defined networks (VNETs). As organizations mature in IaaS adoption, they will produce custom images that have specific software/middleware components pre-installed with hooks to manage configuration. Hopefully, these activities are realized through automation (but that’s a separate topic).

Platform as a Service (PaaS) heightens the layer of abstraction. With PaaS, you still need to account for networking considerations, but VM-level access is rarely provided. Many of the management responsibilities are effectively “outsourced” to Microsoft. The “economies of scale” principle often results in lower cloud fees and less time spent by people within your organization when you use PaaS. The trade-off is that your teams need a more nuanced understanding of how to configure these PaaS services, the associated service level agreements (SLAs) that Microsoft provides (such as RPOs and RTOs), and Microsoft attestations regarding security.

In the Software as a Service (SaaS) model, you have even less visibility into and control of the underlying virtual infrastructure and the application layer. In practice, the line between SaaS and PaaS is blurry. For this article, let’s assume that anything you can manage through the cloud management plane (Azure Portal or Azure Resource Manager API) is either IaaS or PaaS.

Inter-Regional Resiliency

The recent UK heatwave brought record-breaking temperatures (104.5 F/40.3 C), resulting in large-scale outages for both Google Cloud and Oracle Cloud. While this example isn’t Azure-related, no cloud provider is immune from widespread outages that can last multiple days. Therefore, it is vital that your BCP playbook account for recovery from disasters of this magnitude.

Regional Pairs

Microsoft has over 40 Azure regions across the globe and an additional 20 under construction. In theory, an organization can select any arbitrary Azure region for disaster recovery, but it would be doing a disservice to itself. Each Azure region is associated with a geography, which helps you comply with various data residency requirements and regulations. This is a particular concern when the data involves personally identifiable information (PII), personal health information (PHI) or is otherwise considered sensitive by governing authorities.

Azure has the concept of Regional Pairs. With few exceptions, the regional pairs are located within the same geography. By design, the paired regions are physically separated by at least 300 miles (480 km) where possible and have strong point-to-point connectivity that speeds up data replication, which facilitates better RPOs. Microsoft itself has BCPs that prioritize keeping one region in each pair up and running in the event of a wide-scale outage. This means Microsoft can effectively help you realize your RTOs by focusing its internal efforts on restoring at least one region of the pair when disaster strikes.

Regional Failover Strategies

Unfortunately, selecting the regional pair is just the starting point to recovering from a regional outage disaster. The workload RPO/RTO will influence the appropriate disaster recovery strategy.

    • Hot: This strategy can accommodate the most demanding RTO and RPO requirements (near-zero downtime and data loss) but is the most expensive. With a Hot DR strategy, the resources hosting the mission-critical workload are deployed and actively running in both regions. Workload updates involving any kind of outage must be planned carefully so failover isn’t inadvertently triggered. It is important that the two regions effectively mirror each other; this makes automation (configuration management and infrastructure as code) a key practice. Typically, resources in the failover region will have less capacity allocated.
    • Warm: This can be the most cost-effective way to achieve strong RTOs/RPOs and other SLAs for workloads considered critical. The Warm DR strategy requires that key resources for hosting the workload be deployed to the failover region, but they do not need to be continuously running. Automated failover is typically not enabled if using Traffic Manager or Front Door. In fact, DNS records with a short time-to-live can be used to effectively reroute traffic to the failover region. It’s common to keep VMs in the failover region in a suspended state, which adds some process complexity to rolling out workload updates and ensuring the necessary patches are applied.
    • Lukewarm: This strategy is the most cost effective for accommodating a demanding RPO together with a more lenient RTO. Data storage resources exist in the failover region, supporting active data replication, while compute resources (VMs, App Service Plans, etc.) are provisioned in the failover site during the disaster event. By provisioning the non-data resources with a just-in-time (JIT) approach, you avoid the cloud fees and the time cost associated with maintaining those resources (keeping configurations in sync with the primary region, monitoring, patching, etc.). Many of the PaaS resources for data management support cross-regional failover (simplifying the effort required to establish the resources in the failover region), but it’s still important to examine the replication frequency to ensure RPOs can be achieved.
    • Cold: This strategy should be reserved for applications that have lenient RPO and RTO. If you don’t have good automation (infrastructure as code, configuration management and application deployments), this is a very risky strategy, because the likelihood of incorrectly configuring your cloud resources during recovery is high. An Azure Recovery Services vault is a vital PaaS resource if your workload architecture is IaaS-based. Any PaaS services should be examined closely to ensure data replication is enabled.
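
The selection logic in the four strategies above can be sketched as a small decision function. This is only one way to read the list: the cut-off durations are hypothetical assumptions chosen for illustration, and the function names are invented for this example.

```python
from datetime import timedelta

# Hypothetical cut-offs; tune them to your own BIA results.
TIGHT_RTO = timedelta(minutes=15)   # compute must already be running
MODERATE_RTO = timedelta(hours=4)   # compute deployed but may be stopped
TIGHT_RPO = timedelta(hours=1)      # data must be actively replicated


def choose_dr_strategy(rpo: timedelta, rto: timedelta) -> str:
    """Pick the least expensive strategy that can plausibly meet both targets."""
    if rto <= TIGHT_RTO:
        return "hot"        # active in both regions
    if rto <= MODERATE_RTO:
        return "warm"       # deployed in the failover region, started on demand
    if rpo <= TIGHT_RPO:
        return "lukewarm"   # data replicated, compute provisioned just in time
    return "cold"           # everything rebuilt from automation and backups


def replication_meets_rpo(replication_interval: timedelta, rpo: timedelta) -> bool:
    """Data written since the last replication is lost, so the interval must fit the RPO."""
    return replication_interval <= rpo


print(choose_dr_strategy(rpo=timedelta(minutes=30), rto=timedelta(hours=12)))
print(replication_meets_rpo(timedelta(minutes=15), rpo=timedelta(minutes=30)))
```

The RTO is checked first because it determines how much compute must already exist in the failover region, which is the main cost driver; the RPO then separates Lukewarm from Cold.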

Regardless of the strategy, these general considerations should be factored into any sort of architectural planning for viable disaster recovery:

    • Ensure sufficient connectivity in both regions. Workloads rarely operate in full isolation: they commonly need to communicate with other workloads, and end users need to be able to reach them. Any communication within a private network requires comparable connectivity in both the primary and failover regions.
    • Leverage DNS and don’t hardcode IP addresses. This is an extension of the prior point, but is such a common problem that it’s worth calling out. Communication between components of the workload (VMs and Azure PaaS) should be done using hostnames (preferably private DNS). Similarly, upstream- and downstream-dependent systems should be referenced using DNS and not hardcoded IPs.
    • Practice, practice, practice. Don’t wait until disaster strikes to go through the motions the first time. It’s important to exercise the failover playbook for your critical workloads. If you do not want to risk unintentional disruptions, you can perform the exercise using a non-production environment.
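
One way to catch hardcoded addresses before a failover exercise does is to scan configuration text for IPv4 literals. The sketch below uses only the Python standard library; the sample configuration, hostnames and addresses are made up for illustration, and the pattern is a simple heuristic (a four-part version string such as 1.2.3.4 would also be flagged).

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer token.
CANDIDATE = re.compile(r"(?<![\w.])(\d{1,3}(?:\.\d{1,3}){3})(?![\w.])")


def find_hardcoded_ips(text: str) -> list[tuple[int, str]]:
    """Return (line number, address) for every valid IPv4 literal in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in CANDIDATE.finditer(line):
            try:
                ipaddress.IPv4Address(match.group(1))
            except ValueError:
                continue  # looked like an address but is not one (e.g. 999.1.1.1)
            hits.append((lineno, match.group(1)))
    return hits


# Hypothetical application settings.
sample = """\
db_host = orders-db.internal.example
cache_host = 10.20.0.7
version = 1.2.3
api_url = https://203.0.113.10:8443/v1
"""

for lineno, address in find_hardcoded_ips(sample):
    print(f"line {lineno}: replace {address} with a DNS name")
```

Running a check like this in a deployment pipeline is cheaper than discovering an unreachable address in the failover region.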

Intra-Regional Resiliency

Full-blown regional outages are a major disruption but are much less likely than localized outages.

Availability Zones

Many of the Azure regions are made up of three physically separate locations. These datacenters are located in close proximity and interconnected with mesh fiber networking that provides low-latency communication (less than 2 ms). This enables the concept of Availability Zones. If one of the Availability Zones within a region fails, the other two Availability Zones have enough spare capacity to assume the load of that offline zone. Unless there is a compelling reason to use a region without Availability Zones, you should place all critical workloads in an Azure region with Availability Zones.

When deployed to a region supporting Availability Zones, most PaaS resources can leverage them for improved reliability. However, just as cross-regional capabilities required extra configuration, zone redundancy needs to be configured for PaaS services and/or you need to use the SKU that supports such capability. When you are working in the IaaS paradigm, you need to specify the zone of those VMs you want distributed across Availability Zones. Of course, if you have a situation that requires more dynamic horizontal scaling of VMs, Virtual Machine Scale Sets (VMSS) is the desirable approach.
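
For the IaaS case, the idea of spreading individually created VMs across zones can be sketched as a round-robin assignment plus a check that enough VMs survive the loss of any single zone. This is planning logic only, not an Azure SDK call; the VM names and the minimum-capacity figure are hypothetical.

```python
from collections import Counter
from itertools import cycle

# Azure labels the zones within a region as 1, 2 and 3.
ZONES = ("1", "2", "3")


def assign_zones(vm_names, zones=ZONES):
    """Spread VMs across zones round-robin so no zone holds more than its share."""
    return dict(zip(vm_names, cycle(zones)))


def survives_zone_loss(assignment, min_running):
    """True if at least min_running VMs remain after the most populated zone fails."""
    per_zone = Counter(assignment.values())
    worst_case_loss = max(per_zone.values(), default=0)
    return len(assignment) - worst_case_loss >= min_running


# Hypothetical web tier of five VMs that needs three running to carry the load.
assignment = assign_zones([f"web-{i}" for i in range(5)])
print(assignment)
print(survives_zone_loss(assignment, min_running=3))
```

The second function makes the capacity question explicit: distributing VMs across zones helps only if the remaining zones can still carry the load.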

Availability Sets and Fault Domains

If you are working in a region that does not support Availability Zones, or you have a situation that cannot tolerate the communication latency between Availability Zones, you can improve resiliency by using Availability Sets. When VMs are created in an Availability Set, they are distributed across Fault Domains within the same datacenter. Fault Domains have unique power sources, hardware and networking devices within the same datacenter. When doing this, carefully evaluate the RPOs, RTOs and other uptime SLAs for your workload.

How Can This Be Simplified?

While it has never been more cost effective to achieve high levels of reliability, there’s no easy button. Workload criticality must be initially dispositioned and intermittently re-evaluated. Components of a workload must be architected and configured to meet RTO, RPO and other SLA requirements. Workload teams may not have the skillset to do this, and corporate cloud teams may not have the capacity to manually review each workload across the enterprise portfolio.

At Taos, an IBM company, we’ve found that organizations can fill this gap with a good resource tagging strategy and a robust set of Azure policies. The Azure Policy tool evaluates configuration of different resource types relative to their “criticality” tag and the operating environment of the resource. This is a small piece of the many topics that Taos covers when helping companies establish a cloud platform foundation that aligns with their business needs.
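
To make the tagging idea concrete, the sketch below evaluates a resource's settings against rules keyed by its “criticality” tag and environment. It is a plain Python analogue of the concept, not the Azure Policy definition format and not Taos tooling; the tag values, setting names and rules are all hypothetical.

```python
# Hypothetical rules: which resiliency settings each criticality level requires.
REQUIRED_BY_CRITICALITY = {
    "high": {"zone_redundant": True, "geo_replicated": True},
    "medium": {"zone_redundant": True},
    "low": {},
}


def evaluate(resource: dict) -> list[str]:
    """Return a list of findings; an empty list means the resource is compliant."""
    tags = resource.get("tags", {})
    criticality = tags.get("criticality")
    if criticality not in REQUIRED_BY_CRITICALITY:
        return ["missing or unknown 'criticality' tag"]
    if tags.get("environment") != "production":
        return []  # in this sketch, resiliency rules apply to production only
    settings = resource.get("settings", {})
    return [
        f"{key} should be {expected}"
        for key, expected in REQUIRED_BY_CRITICALITY[criticality].items()
        if settings.get(key) != expected
    ]


# Hypothetical resource description.
orders_db = {
    "name": "orders-sql",
    "tags": {"criticality": "high", "environment": "production"},
    "settings": {"zone_redundant": True, "geo_replicated": False},
}
print(orders_db["name"], evaluate(orders_db))
```

Treating a missing tag as a finding in its own right matters: an untagged resource cannot be evaluated, so the tagging scheme has to be enforced before the resiliency rules mean anything.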