The Critical Steps of Implementing a Successful Disaster Recovery Plan
Featured Article by Yatish Mishra of RagingWire Enterprise Solutions
Have you considered what a catastrophic interruption to your mission-critical IT services would do to your business? Today, any amount of downtime can mean lost productivity, lost revenue, lost customers, and lost opportunities. Given these risks, IT executives are being challenged to support the company’s disaster recovery plans with proven disaster recovery solutions, services, and technology strategies that reduce exposure and vulnerability, protect mission-critical operations against diverse downtime threats, and ease recovery if an unforeseeable disaster strikes.
A disaster recovery capability spans both the hardware and software required to run critical business applications and the associated processes to transition smoothly in the event of a natural or human-caused disaster. A disaster recovery plan encompasses the detailed structure, processes, and procedures that govern how the overall organization will respond to a disaster that disrupts business operations.
The following outlines the steps you need to take to implement a successful disaster recovery plan. We'll look at critical steps for best practice disaster recovery, including Management Awareness, Disaster Recovery Planning, Resiliency & Backup Services, and Vendor Support Services.
Performance Indicators for Disaster Recovery
Performance indicators provide the mechanism by which you can measure the success of your disaster recovery process and plan. Performance indicators for disaster recovery are somewhat different from those used to measure network performance because they are a combination of project status and test runs of infrastructure. Indicators of success include:
- Periodic reports from the planning group to senior management.
- Representation of senior (Level 3) engineers from across the IT organization, including network engineering, data management, systems management, and application development.
- Periodic tests to verify implementation of the disaster recovery plan and reports about gaps and risks.
- A review process that includes the deployment of new solutions.
- Post-mortem analysis of the disaster recovery handling, effectiveness, and impact on the business (after a disaster occurs).
High-Level Process Flow for Disaster Recovery
The following outlines the traditional process for disaster recovery planning. An essential principle in planning is active communication. Once an organization successfully implements a robust disaster recovery capability, it must take a continuous process improvement approach to ensure the disaster recovery plan remains in alignment with the business over time.
Management Awareness
Management Awareness is the first and most important step in creating a successful disaster recovery plan. To obtain the necessary resources and time from each area of your organization, senior management has to understand the business impacts and risks and support the effort. Several key tasks are required to achieve management awareness:
Identify Possible Disaster Scenarios
Identify the top ten disasters and analyze their impact on your business. Your analysis should cover effects on communications with suppliers and customers, the impact on operations, and disruption to key business processes. You should complete this pre-study in advance of the disaster recovery planning process, knowing that it will require additional verification during the planning process.
The following are examples of possible disasters: fire, storm, water, earthquake, chemical accidents, nuclear accidents, war, terrorist attacks and other internal and external crimes, cold winter weather, extreme heat, airplane crash (loss of key staff), and avalanche. The likelihood of each scenario depends on factors such as geographical location and political stability. In general, severe disruptions of business operations (disasters) are caused by fire more frequently than by other events, so we recommend you start with fire as your first case study.
Assess the impact of a disaster scenario on your business from both a financial and physical (infrastructure) perspective by asking the following questions:
- To what extent would the disaster impact your resources such as staff, mission-critical business processes/application services, logistics, etc.?
- What are the costs to the business due to a disaster in terms of revenue loss, slowed production/manufacturing schedules, productivity, etc.?
- What is required to rebuild?
- How long will it take to recover?
- How long can the business be interrupted, and what are the associated costs per hour/days/weeks?
- What is the impact to the overall organization?
- How are customers affected, and what is the impact on them?
- How much will it affect the share price and market confidence?
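The cost questions above can be turned into a rough, comparable figure per scenario. The following is a minimal sketch, not a prescribed method: the class name, the field names, and every dollar and hour figure are hypothetical placeholders to be replaced with numbers from your own business impact analysis.

```python
from dataclasses import dataclass


@dataclass
class ScenarioEstimate:
    """Rough financial estimate for one disaster scenario (all inputs are assumptions)."""
    name: str
    revenue_loss_per_hour: float       # lost sales while services are down
    productivity_loss_per_hour: float  # idle staff, slowed production
    rebuild_cost: float                # one-time cost to replace what was destroyed
    recovery_hours: float              # how long the interruption is expected to last

    def total_cost(self) -> float:
        # Hourly losses accumulate over the outage; rebuild cost is added once.
        hourly = self.revenue_loss_per_hour + self.productivity_loss_per_hour
        return hourly * self.recovery_hours + self.rebuild_cost


def rank_scenarios(scenarios):
    """Return scenarios ordered from most to least costly."""
    return sorted(scenarios, key=lambda s: s.total_cost(), reverse=True)


# Hypothetical figures for illustration only.
fire = ScenarioEstimate("fire", 10_000.0, 2_500.0, 400_000.0, 72.0)
storm = ScenarioEstimate("storm", 10_000.0, 2_500.0, 50_000.0, 24.0)
ranked = rank_scenarios([storm, fire])
for scenario in ranked:
    print(f"{scenario.name}: {scenario.total_cost():,.0f}")
```

A ranking like this only covers the quantifiable questions. Effects on customers, share price, and market confidence still need a qualitative assessment alongside it.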
Build Management Awareness
A critical success factor in disaster recovery is the involvement of senior management in the planning process, with active executive sponsorship. Active communication across the organization makes management teams at all levels aware of the risks, threats, and impact that potential disaster scenarios pose to the organization. The first study on disaster recovery should include an estimate of possible costs and time to implement a disaster recovery strategy. Once management understands the financial, physical, and business costs associated with a disaster, it can build a tangible strategy and ensure that the strategy is implemented across the organization.
Management Approval and Funding
The senior management has to agree on the disaster recovery project, as well as provide financial and human resources for the project. The first step is the announcement of the disaster recovery project and kickoff of a planning group or steering committee, which should be led by a senior manager with named executive sponsorship.
Once approval has been granted, a Disaster Recovery Team must be formed from across the organization with the necessary levels of participation. This team will be led by the senior manager and have active oversight from the executive sponsor. Subprojects will be managed and driven from within the responsible organization. For example, the IT organization will be responsible for managing the definition, design, implementation, deployment, and ongoing management of the organization’s IT disaster recovery capabilities for mission-critical application services.
Disaster Recovery Planning Process
In the planning stage, the DR Team will identify the mission-critical, important, and less-important processes, systems, and services across the business, and put in place plans to ensure these are protected against the effects of a disaster appropriately. Key elements of disaster recovery planning include:
- Establish a planning group.
- Perform risk assessments and audits.
- Establish priorities for your network and applications.
- Develop recovery strategies.
- Prepare an up-to-date inventory and documentation of the plan.
- Develop verification criteria and procedures.
- Implement the plan.
Establish a Planning Group
As we alluded to in the previous section, establish a planning group to manage the development and implementation of the disaster recovery strategy and plan. Key people from each business unit or operational area should be members of the team, responsible for all disaster recovery activities, planning, and providing regular monthly reports to senior management.
Perform Risk Assessments and Audits
The planning team needs to thoroughly understand the business and its processes, technology, and services. The disaster recovery planning team should prepare a risk analysis and a business impact analysis that cover at least the top five potential disasters. The risk analysis should include the worst-case scenario of completely damaged facilities and destroyed resources, especially those facilities where mission-critical application services are located. Having a hardened data center facility is essential. However, planning teams must also recognize that a disaster can leave the data center operational but isolated from the rest of the business. Likewise, data center services may be available while no business continuity plan is in place that provides a location where employees can gather to resume business operations.
Analysis should address geographic situations, current design, lead-times of services, and existing service contracts. Each analysis should also include an estimate of the financial impacts of replacing damaged equipment, drafting additional resources, and setting up extra service contracts.
Establish Priorities for Application & Data Services
When you've analyzed the risks posed to your business processes from each disaster scenario, assign a priority level to each business process. At this point it is essential to think in terms of application service as opposed to specific infrastructure (servers, databases, etc.). Application services should be categorized by their level of importance to business operations, specifically:
- Mission Critical: a service outage that would cause an extreme disruption to the business, cause major legal or financial ramifications, or have a measurable impact on revenue generation. The application/data service requires significant effort to restore, or the restoration process is significantly disruptive to the business or other mission-critical application services.
- Business Essential: a service outage or destruction that would cause a moderate disruption to the business, cause minor legal or financial ramifications, or cause problems with access to other systems. The impacted application/data service requires a moderate effort to restore, or the restoration process is disruptive to the service itself.
- Minor: a service outage or destruction that would cause a minor disruption to the business but have no impact on mission-critical activities. The service can be restored with little or no disruption to mission- and/or business-critical application services.
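The three categories above can be recorded consistently if each service is rated on the same two questions: how disruptive is an outage, and how hard is the service to restore? The following is a minimal sketch under that assumption. The rating vocabulary ("extreme", "moderate", "minor", "significant", "little") and the function name are hypothetical simplifications of the definitions above, and legal, financial, and revenue effects are folded into the disruption rating.

```python
from enum import IntEnum


class Priority(IntEnum):
    """Lower number means higher priority."""
    MISSION_CRITICAL = 1
    BUSINESS_ESSENTIAL = 2
    MINOR = 3


DISRUPTION_LEVELS = {"extreme", "moderate", "minor"}
EFFORT_LEVELS = {"significant", "moderate", "little"}


def classify(disruption: str, restore_effort: str) -> Priority:
    """Map two hypothetical ratings onto the article's three categories.

    The category definitions join their criteria with "or", so the more
    severe of the two ratings decides the result.
    """
    if disruption not in DISRUPTION_LEVELS:
        raise ValueError(f"unknown disruption rating: {disruption!r}")
    if restore_effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown restore effort rating: {restore_effort!r}")
    if disruption == "extreme" or restore_effort == "significant":
        return Priority.MISSION_CRITICAL
    if disruption == "moderate" or restore_effort == "moderate":
        return Priority.BUSINESS_ESSENTIAL
    return Priority.MINOR


# Hypothetical services for illustration only.
order_entry = classify("extreme", "moderate")
payroll = classify("moderate", "little")
wiki = classify("minor", "little")
```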
Resiliency Design and Recovery Strategy
Just as the analysis of the business processes determines the priorities of application services, the prioritized application services then drive the relative priorities of the supporting server, storage, and network infrastructure. Because employees access application services over data networks, you must include the network in your assessment. Users of each application service must be identified by site/location. This, in turn, will need to be correlated against the application service priorities and the location of key services.
These factors and priorities contribute to a fault-tolerant design, with resilience as a critical design requirement for the network, application, and storage infrastructure.
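One way to let application service priorities drive infrastructure priorities is to give each server, storage system, or network link the priority of the most important service that depends on it. The following is a minimal sketch of that idea. The service names, component names, and the numeric scale (1 = mission critical, 2 = business essential, 3 = minor) are hypothetical.

```python
def infrastructure_priorities(service_priority, depends_on):
    """Give each component the highest priority (lowest number) of its dependents.

    service_priority: service name -> priority level (1 is most important)
    depends_on:       service name -> list of infrastructure components it uses
    """
    derived = {}
    for service, components in depends_on.items():
        level = service_priority[service]
        for component in components:
            # Keep the most important level seen so far for this component.
            derived[component] = min(derived.get(component, level), level)
    return derived


# Hypothetical inventory for illustration only.
service_priority = {"order-entry": 1, "payroll": 2, "intranet-wiki": 3}
depends_on = {
    "order-entry": ["db-cluster-a", "wan-link-hq"],
    "payroll": ["db-cluster-a", "file-server-1"],
    "intranet-wiki": ["file-server-1"],
}
priorities = infrastructure_priorities(service_priority, depends_on)
for component, level in sorted(priorities.items()):
    print(component, level)
```

In this example the shared database cluster inherits mission-critical status from order entry even though payroll alone would only make it business essential.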
Develop a recovery strategy to cover the practicalities of dealing with a disaster. Such a strategy may be applicable to several scenarios. However, the plan should be assessed against each scenario to identify any actions specific to different disaster types. Your plan should address people, facilities, network services, communication equipment, applications, clients and servers, support and maintenance contracts, additional vendor services, lead-time of Telco services, and environmental situations.
A recovery strategy should include the expected downtime of services, action plans, and escalation procedures. The plan should also determine thresholds, such as the minimum level at which the business can operate, the systems that must be fully available (all staff must have access), and the systems that can be minimized (not directly related to continued mission-critical/business essential activities).
Your disaster recovery plan should include a backup services strategy, which needs to be consistent throughout the entire organization. Backup scenarios are important to provide higher availability and access to main sites and/or access to existing parallel disaster recovery sites during a disaster.
Resiliency and backup services form a key part of disaster recovery, and you should review these services to make sure they meet the criteria for your disaster recovery plan. Disaster handling also requires communication services, which can greatly reduce the impact a disaster has on your business.
Network Resiliency
Network Resiliency is defined as the ability to recover from any network failure or issue whether it is related to a disaster, link, hardware, design, or network service. A high availability network design is often the foundation for disaster recovery, and can be sufficient to handle some minor or local disasters. Key tasks for resiliency planning and backup services include:
- Assess the resiliency of your network, identify gaps and risks.
- Review your current backup services.
- Implement network resiliency and backup services.
Fault Tolerance
Assess the overall ability of your network, and of your wide area network (WAN) in particular, to withstand a catastrophic event. Assess the following three levels of availability and identify the associated gaps: fault tolerance, the network’s ability to perform adequately under increased traffic demands, and dynamic routing capabilities. Doing so helps prioritize risks, set requirements for higher levels of availability, and identify the critical elements of your network by evaluating:
- WAN
  - Network link capacity and trending
  - Baseline traffic analysis based on application services by location
  - Carrier diversity regarding Internet gateways
  - Local loop diversity
  - Facility resiliency regarding points of entry and demarc
  - Building wiring resiliency
- Hardware resiliency, including
  - Single points of failure for border and internal networks
  - Critical redundancy for core network elements
  - Power
  - Redundant hardware
  - Mean time to repair (MTTR) by critical network element
- Network path availability
  - Dynamic routing capabilities
- Network services resiliency
  - DNS architecture and security
  - DHCP resiliency for local area networks
  - Other services resiliency
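Single points of failure in the list above can be found mechanically once the topology is written down as a list of links. The following is a minimal sketch: it removes each node in turn and checks whether the remaining nodes can still reach one another. The node names are hypothetical, the topology is assumed to be connected to begin with, and link capacity, power, and carrier diversity are not modeled.

```python
from collections import deque


def reachable(adjacency, start, removed):
    """Return the set of nodes reachable from start when one node is removed."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(links):
    """List nodes whose loss would disconnect part of the network."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    failures = []
    for candidate in sorted(adjacency):
        others = [node for node in adjacency if node != candidate]
        if not others:
            continue
        # If some surviving node cannot be reached, the candidate was critical.
        if len(reachable(adjacency, others[0], candidate)) < len(others):
            failures.append(candidate)
    return failures


# Hypothetical topology: headquarters is dual-homed to the data center,
# but the branch office hangs off the data center router alone.
links = [
    ("hq", "core1"), ("hq", "core2"),
    ("core1", "dc"), ("core2", "dc"),
    ("dc", "branch"),
]
spofs = single_points_of_failure(links)
print(spofs)
```

Here the redundant core routers do not show up as single points of failure, while the data center router does because the branch has no alternate path.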
Vendor Support Services
Having support services in place from your major vendors adds significant value to disaster recovery planning. For example, specific managed hot standby sites or on-site services with rapid response times can significantly ease disaster recovery. Key questions regarding vendor support include:
- Are support contracts in place?
- Has the disaster recovery plan been reviewed by the vendors, and are the vendors included in the escalation processes?
- Does the vendor have sufficient resources to support the disaster recovery?
Most vendors have experience handling disaster situations and can offer additional support.
Prepare Up-to-Date Inventory and Documentation of the Plan
It is important to keep your inventory up-to-date and have a complete list of all locations, devices, vendors, used services, and contact names. The inventory and documentation should be part of the design and implementation process of all solutions.
Your disaster recovery documentation should include:
- Complete inventory, including a prioritization of resources.
- Review process structure assessments, audits, and reports.
- Gap and risk analysis based on the outcome of the assessments and audits.
- Implementation plan to eliminate the risks and gaps.
- Disaster recovery plan containing action and escalation procedures.
- Training material.
Develop Verification Criteria and Procedures
Once you've created a draft of the plan, you should create a verification process to prove the disaster recovery strategy and, if your strategy is already implemented, review and test the implementation.
It's important that you test and review the plan frequently. We recommend documenting the verification process and procedures, and designing a proof-of-concept process. Because disaster recovery is based on experience and each disaster has different rules, the verification process should include an experience cycle. You may want to call on experts to develop and prove the concept, and product vendors to design and verify the plan.
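One simple verification criterion is to compare the recovery time measured in each test run against the expected downtime stated in the recovery strategy. The following is a minimal sketch of such a gap report. The service names and hour figures are hypothetical, and a real verification process would also cover data integrity, escalation procedures, and staff readiness.

```python
def recovery_gaps(expected_downtime_hours, measured_recovery_hours):
    """Report services that missed their target or were never tested.

    expected_downtime_hours: service -> maximum acceptable downtime in hours
    measured_recovery_hours: service -> recovery time observed in the last test
    """
    gaps = {}
    for service, expected in expected_downtime_hours.items():
        measured = measured_recovery_hours.get(service)
        if measured is None:
            gaps[service] = "not tested"
        elif measured > expected:
            gaps[service] = f"exceeded target by {measured - expected:g} h"
    return gaps


# Hypothetical targets and test results for illustration only.
expected = {"order-entry": 4, "email": 24, "intranet": 72}
measured = {"order-entry": 6.5, "email": 12}
gaps = recovery_gaps(expected, measured)
for service, finding in gaps.items():
    print(f"{service}: {finding}")
```

Feeding each test cycle's findings back into the plan is one way to build the experience cycle described above.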
Moving Forward
To move forward, some fundamental decisions need to be made including: How should your plan be implemented? Who are the critical staff members, and what are their roles? Leading up to the implementation of your plan, try to practice for disaster recovery using roundtable discussions, role playing, or disaster scenario training. Again, it's essential that your senior management approves the disaster recovery and implementation plans.
As with all major business initiatives, investing in requirements analysis and planning up front will improve your organization’s ability to manage change and to implement an effective disaster recovery plan and capability.
Yatish Mishra
President and CTO
RagingWire Enterprise Solutions, Inc.
Yatish Mishra has 19 years of technical management experience in the IT industry, including building and operating 12 data centers worldwide. Previously, he was vice president of information systems at Photronics, Inc., where he was responsible for all aspects of IT operations worldwide. His responsibilities included applications development, systems administration, security, facilities, network engineering, and strategic planning. Prior to Photronics, Mr. Mishra was vice president of engineering at Microlink Technologies, Inc. and has held various senior-level IT positions managing security, networking, systems, applications development, business development, and customer support. Mr. Mishra holds a bachelor’s degree in applied physics and physical electronics from the University of California, Davis. A forward-thinker in the areas of data center management and how technology is changing the way companies protect their mission-critical systems and data in a post-9/11 world, Mr. Mishra has spoken at industry trade shows and has been quoted in a number of national trade publications, including Computerworld, InfoWorld, Interactive Week, Information Week, Disaster Recovery Journal, Contingency Planning & Management, Byte and Switch, and InfoStor magazine.