Difference between revisions of "Disaster Preparedness"

From EITBOK
m (Daniel Robbins moved page Disaster Recovery to Disaster Preparedness)

Revision as of 23:01, 3 November 2015

1 Introduction

Disaster preparedness and disaster recovery (DR) support business continuity planning. They include planning for EIT resiliency as well as recovery from adversity, so that the critical business services affected are restored to a satisfactory working state within an acceptable timeframe after the event.

DR can be defined as “in computer system operations, the return to normal operation after a hardware or software failure.” [1] It is also defined as the “activities and programs designed to return the organization to an acceptable condition. The ability to respond to an interruption in services by implementing a disaster recovery plan to restore an organization's critical business functions.” [2]

This section defines these processes and deliverables, and who should be responsible for planning, creating the documents, and communicating should a disaster occur. The following are some examples for context:

  • Examples of disasters
    • Natural disaster affecting datacenters or EIT service operations (flood, fire, earthquake, wind)
    • Security breach resulting in disaster (destruction of data, admin password changes, virus/malware installation, sabotage)
    • Usage error (accidental deletion, unplug/turn off system resulting in corruption)
    • Utility failure affecting datacenters (loss of power even after UPS)
    • Vendor failure (cloud provider security failure, oil spill)
    • Staffing issue (employment dispute/walkout, epidemic)
  • Examples of unpreparedness
    • Requiring use of computers or printers when power is out
    • Requiring use of internet when power or connectivity is out
    • Single point of knowledge/control for administration access
    • Lack of offsite backup storage
    • Lack of working restoration from backups
    • Lack of failover datacenters in separate locations
    • Missing or out-of-date documentation for system interfaces
    • Requiring use of phones that are out of power
    • No designated leaders for restoration efforts (it is unclear who is in charge of restoring service, or those people do not know they are in charge)
    • In general, no cohesive, comprehensive EIT service restoration plan

2 Goals and Principles

EIT organizations are responsible for the following goals:

  • To document and plan for appropriate backup and recovery processes for all systems, and priority of systems for restoration.
  • To create and deploy an EIT disaster recovery plan.
  • To ensure that the business has business continuity processes in place in case of a disaster.

Fundamental principles of disaster recovery depend on the business functions within the enterprise, and on how critical each is to the health of the business. There are several methods for determining the criticality of functions:

  • Hierarchy of need, as stated in SLAs: the most critical business functions should be restored first, or in the first phase of disaster recovery.
  • Keep the lights on (KTLO) or keep the business running (KTBR), which are not the same thing.
  • All non-critical services are in the final phase of recovery.
  • Industry-specific priorities: for example, systems delivering lifesaving functions are the highest priority for recovery efforts, whereas administration systems wait for the second or third wave of recovery.

However, a fundamental recovery principle is that all systems to be recovered should be attended to within the specifications for recovery time objectives (RTO) and recovery point objectives (RPO) laid out by the business in the DR plan.

3 Context Diagram

Figure 1. Context Diagram for Disaster Preparedness and Recovery

3.1 Gather Inputs

The following inputs are necessary for this process to initiate or continue:

  • Budget for DR plan
  • Business risk factors/risk assessment
  • Business and EIT SLAs, OLAs, and contractual obligations to upstream/downstream systems
  • Business continuity plan
  • Configuration management database (CMDB) and asset inventory (see the Operations and Support chapter)
  • Current enterprise architecture artifacts/source code/document management systems
  • EIT service catalogue
  • EIT staff capabilities
  • Vendor service agreements/maintenance agreements

The obvious business driver is to reduce risk for the business, by providing both mitigation strategies and contingency plans. High-risk projects or operational inefficiencies can lead to lost business, which ultimately causes lost income for the business — this can be the high price of risk.

Another business driver for formal DR processes may be to meet regulatory (e.g., SOX) or sustainability objectives. Part of the information gathering includes conducting workshops or interviews to document the drivers, to ensure that deliverables meet these requirements.

A related information-gathering effort is to define and document the technical drivers for DR, including aging technology or lack of application support capabilities.

4 Description of Activities

4.1 Business Impact Analysis

4.1.1 Define Critical Business Services

With input from the business, such as the risk management group, business continuity management, audit departments, and executives, define the services critical to operations. Business impact analysis is a technique that can be used. Critical services are those without which the enterprise could no longer meet its commitments and deliver business products or services. Use business process diagrams to assist with the analysis.

The following list is a suggested structure for determining the service categories and corresponding criticality of the organization's services (for definitions of the categories, refer to the Standard Service Definitions section):

  • Mission critical
  • Business critical
  • Business operational
  • Administrative services [3]

Examples of typical critical services within an enterprise are safety processes, safety documentation management, communication policies and processes, and financial data and processes.

4.1.2 Map Critical Business Services to EIT Services

This function is often referred to as building an EIT service catalog, which is an important input to disaster recovery planning. A service catalog is “a database or structured document with information about all live EIT services, including those available for deployment…The service catalog includes information about deliverables, prices, contact points, ordering, and request processes.” [4] Templates exist to assist with this mapping. [5] See the Operations and Support chapter for more information on service catalogs.
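
As an illustration of the mapping, the sketch below derives a criticality for each EIT service from the business services that rely on it. It is a minimal sketch, not a catalog format: the service names are hypothetical, the categories are the four listed in section 4.1.1, and the rule that an EIT service inherits the highest criticality of any business service using it is an assumption made for the example.

```python
# Criticality categories from section 4.1.1, most critical first.
CRITICALITY_ORDER = [
    "mission critical",
    "business critical",
    "business operational",
    "administrative",
]

def eit_criticality(business_services):
    """Map each EIT service to the highest criticality among its consumers.

    business_services: list of dicts with keys "name", "criticality",
    and "eit_services" (the EIT services the business service relies on).
    """
    result = {}
    for service in business_services:
        rank = CRITICALITY_ORDER.index(service["criticality"])
        for eit in service["eit_services"]:
            current = result.get(eit)
            if current is None or rank < CRITICALITY_ORDER.index(current):
                result[eit] = service["criticality"]
    return result

# Hypothetical example data, for illustration only.
catalog = [
    {"name": "order fulfilment", "criticality": "mission critical",
     "eit_services": ["erp", "email"]},
    {"name": "expense reporting", "criticality": "administrative",
     "eit_services": ["email", "expense-app"]},
]
mapping = eit_criticality(catalog)
```

Here `email` is shared by both business services, so it is treated as mission critical even though one of its consumers is administrative.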

4.1.3 Define Relevant Disaster Scenarios and Responsible Parties

Clearly define who declares a disaster, and the criteria for when and how. Mature organizations assign responsible parties in advance so that, when a disaster occurs, there is a clear leader who determines which processes and procedures to implement and who follows the communication plan. Without this, people make invalid assumptions about who is in charge: either no one takes ownership or multiple parties compete for it, and neither outcome helps resolve the disaster and recover service.

4.1.4 Define Successive Waves for Extending Recovery Across the Business

Due to the complex nature of EIT systems within the enterprise today, it is unrealistic to provide recovery for all services in the initial recovery phase. There are different levels of recovery for different tiers of business services, and a corresponding, agreed-to timeframe for recovery of each service within the enterprise. These waves of recovery begin with the most critical services as defined, and move through to the least critical in a timeframe acceptable based on a risk-mitigation process. For example, level one (i.e., Tier 1) recovery may take place within 72 hours of a disaster and would include services such as product production, shipping, and customer service applications. Note: A non-critical service may be recovered in the first pass of recovery based solely on a critical service having it as a dependency.
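
The dependency rule in the note above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the service names, tier numbers, and dependency map are hypothetical, and every dependency is assumed to appear in the tier table.

```python
def assign_waves(tiers, depends_on):
    """Assign each service a recovery wave (1 = first wave).

    tiers: {service: tier from the business impact analysis}
    depends_on: {service: [services it needs in order to run]}
    A service is pulled into an earlier wave whenever a service in that
    wave depends on it, directly or through a chain of dependencies.
    """
    waves = dict(tiers)
    changed = True
    while changed:  # repeat until no dependency is scheduled after its dependent
        changed = False
        for service, deps in depends_on.items():
            for dep in deps:
                if waves[dep] > waves[service]:
                    waves[dep] = waves[service]
                    changed = True
    return waves

# Hypothetical services: shipping needs the directory, which needs DNS.
tiers = {"shipping": 1, "customer-service": 1,
         "directory": 3, "dns": 3, "intranet-wiki": 3}
depends_on = {"shipping": ["directory"], "directory": ["dns"]}
waves = assign_waves(tiers, depends_on)
```

In this example `directory` and `dns` move to wave 1 because `shipping` cannot run without them, while `intranet-wiki` stays in wave 3.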

Critical systems management is a useful process for identifying and documenting critical systems. [6] It also helps ensure that proper application life-cycle management is in place for these EIT services. [7]

Use risk-assessment techniques to analyze how disaster scenarios could adversely affect the business. One such process would be to tier possible risks into levels such as:

  • Affecting the entire enterprise
  • Affecting only certain business units
  • Affecting a single component (either a technology component or a business unit)
  • Affecting a single business function (such as processing credit card transactions)

4.2 Recovery Objectives and DR Plan

4.2.1 Determine Recovery Objectives and Develop Plan

In cooperation with the business, define the recovery point objectives (RPO) and recovery time objectives (RTO).

RPO is the point in time to which all integrated systems are recovered, taking into account backup schedules, sync points, and data transfer points to ensure data quality and integrity.

RTO is how long it will take to return an EIT service to active duty. This varies depending on the criticality of the service as well as how integrated the service is with other services.
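
One way to relate a backup schedule to an RPO is simple arithmetic on the backup interval. The sketch below reads the RPO as the maximum tolerable data-loss window, which is a common interpretation but an assumption here, and it also assumes each backup captures the system state at the moment it starts. The function names and figures are hypothetical.

```python
from datetime import timedelta

def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """Longest possible gap between the state held in the newest completed
    backup and the moment of failure.

    A failure just before a backup completes leaves only the previous
    backup usable, so the loss is one interval plus the backup's run time.
    """
    return backup_interval + backup_duration

def schedule_meets_rpo(backup_interval, rpo, backup_duration=timedelta(0)):
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo

# Nightly backups that take 2 hours, against a 24-hour RPO.
nightly_ok = schedule_meets_rpo(
    timedelta(hours=24), rpo=timedelta(hours=24),
    backup_duration=timedelta(hours=2))

# Hourly backups that take 10 minutes, against a 4-hour RPO.
hourly_ok = schedule_meets_rpo(
    timedelta(hours=1), rpo=timedelta(hours=4),
    backup_duration=timedelta(minutes=10))
```

Under these assumptions the nightly schedule misses a 24-hour RPO (the worst case is 26 hours), while the hourly schedule comfortably meets a 4-hour RPO.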

Configuration management is a process that helps document the business impact of a service, as well as documenting the backup and recovery requirements. Also, it provides an inventory of applications and supporting infrastructure needed in the restoration processes.

Organization and Culture

The risk tolerance and depth of capabilities within the organization have a large impact on its level of disaster preparedness. In other words, the business’s disaster tolerance is “the time gap the business can accept the non-availability of EIT facilities.” [2] The lower the tolerance, the more extensive and costly the DR practices and techniques that must be deployed.

Also, the business product deliveries determine the requirements of the planning effort and metrics.

4.2.2 Develop Communications Plan

An effective communication plan is an essential component to the successful implementation and adoption of the DR processes. The communication plan should include:

  • How to deliver communications when standard communication systems are unavailable (such as email or phone systems)
  • Who to contact in a disaster situation, including specific lists for specific situations or systems affected
  • What information each communication should include, and not include

Contact information lists should include the following stakeholders:

  • External partners (service providers and suppliers)
  • Police/fire/municipal departments
  • EIT management and staff
  • Business management and product owners
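
A contact list can record, for each stakeholder, an ordered set of channels to fall back on when standard systems are unavailable. The following is a minimal sketch: the roles, addresses, and channel names are hypothetical placeholders, not a recommended roster.

```python
def pick_channel(contact, unavailable):
    """Return the first (channel, address) for this contact whose channel
    is still working, or None if every listed channel is down."""
    for channel, address in contact["channels"]:
        if channel not in unavailable:
            return channel, address
    return None

# Hypothetical roster; channels are listed in order of preference.
contacts = [
    {"role": "DR team lead",
     "channels": [("email", "dr-lead@example.com"),
                  ("mobile", "+1-555-0100"),
                  ("radio", "channel 4")]},
    {"role": "Hot site provider",
     "channels": [("email", "support@example.com")]},
]

# During an email outage, the team lead is reached by mobile phone,
# and the provider has no working channel listed.
lead_channel = pick_channel(contacts[0], unavailable={"email"})
provider_channel = pick_channel(contacts[1], unavailable={"email"})
```

A `None` result for a stakeholder is a gap in the plan that is worth finding during a drill instead of during a disaster.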

The DR communication plan should describe the process to provide business updates to business continuity plans after the recovery has been completed.

A process for disaster declaration needs to be included in the DR plan and be well communicated to the team. In this section, all contact information and approval authority should be spelled out (i.e., who has the authority to declare a disaster within the company).

4.2.3 Develop Backup and Archive Strategies and Schedule

  • The area of EIT responsible for DR is either responsible for backup and recovery or works closely with the team that is. Backup types such as archiving and incremental backups need to be scheduled according to the varying needs of the systems being supported. Backup standards and recovery strategies should be defined to ensure that business requirements are met.
  • Backup and storage technology has a large role to play in the recoverability of applications and systems. Current backup utilities provide incremental-forever backup processes, which can help reduce the cost of storage used for holding backups. In addition, architecture features such as high-availability options and failover redundancies can both reduce the risk of service loss and provide mitigation strategies for unstable or unreliable systems.
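
To show how full and incremental backups combine at restore time, the sketch below selects the backups needed to recover to a given point. It assumes each incremental holds the changes since the previous backup of any kind (not since the last full backup), and the backup records and dates are hypothetical.

```python
from datetime import datetime

def restore_chain(backups, target):
    """Return the backups to apply, in order, to restore to `target`.

    backups: list of (timestamp, kind) tuples, kind "full" or "incremental".
    The chain is the latest full backup at or before the target, followed
    by every later incremental that is also at or before the target.
    """
    eligible = sorted(b for b in backups if b[0] <= target)
    full_positions = [i for i, (_, kind) in enumerate(eligible) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before the target time")
    return eligible[full_positions[-1]:]

# Hypothetical schedule: one full backup, then nightly incrementals.
backups = [
    (datetime(2015, 11, 1, 2), "full"),
    (datetime(2015, 11, 2, 2), "incremental"),
    (datetime(2015, 11, 3, 2), "incremental"),
    (datetime(2015, 11, 4, 2), "incremental"),
]
chain = restore_chain(backups, datetime(2015, 11, 3, 12))
```

The chain for midday on 3 November is the full backup plus the first two incrementals. Every link must be readable for the restore to succeed, which is one reason regular restore testing matters.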

4.2.4 Develop and Document DR Plan

Disaster recovery plan (DRP) is “a set of human, physical, technical, and procedural resources to recover, within a defined time and cost, an activity interrupted by an emergency or disaster.” [2]

The DR plan document needs to include all of the information needed to recover all critical systems that the business needs to operate. EIT must work with the business to develop and document a DR plan. See the template at the end of this section for recommended sections of a DR plan.

Data collection techniques are critical to the development of a meaningful DR plan that meets the needs of the business.

4.2.5 Interface with Business Continuity

EIT must communicate its processes to the business and make consistent updates to the business continuity plan (BCP). As new business components or services are added, the business assigns a criticality level, which then needs to be translated into EIT services that are internally assigned a tier to determine the disaster recovery requirements. This symbiotic relationship between business continuity and EIT disaster recovery is critical to the success of both functions within the enterprise.

4.3 Implement and Test DR plan (Drill or Simulation)

The first step in implementing a DR plan is to allocate resources and assign responsibilities. The DR team needs to be assigned early in the process to ensure accountability and understanding of roles at the time of a disaster. Many different roles are needed to define and execute a successful DR plan. The DR test is an opportunity to cross-train in the many roles needed, in order to mitigate the risk of key roles not being available should a disaster occur. It is likely that no one from the business DR team will be available for the recovery of the systems, so documentation, testing, and assigning a strategic partner are important to the recovery of business services.

4.3.1 Roles and Responsibilities

Input supplier roles are roles and teams that supply the inputs to the process:

  • Enterprise risk management team
  • BCP manager
  • EIT managers
  • Enterprise architecture team
  • Solution management team

Key roles are the responsible individuals or teams that perform the process:

  • DR team leads
  • Test team
  • Recovery center manager
  • System specialists (multiple)
  • Business management team
    • Facilities manager
    • Service manager

User roles expect and receive the deliverables:

  • Operations management team
    • Backup process manager
  • Test manager
  • Business management team

Stakeholder roles are informed or consulted on the process execution:

  • Enterprise risk management team
  • Operations management team
    • Business continuity manager
  • Business management team
    • Contract manager

4.3.2 Document Recovery Strategies

As mentioned above, there are many strategies for recovering the services the business needs to function, and different services call for different solutions. The most important element is to choose a strategy, then document and communicate it.

  • Use a third-party hot recovery site. The hot site should be in a geographically separate location to ensure that a natural disaster does not take out both the primary production location and the backup site. The required distance varies depending on geography as well as on infrastructure dependencies (such as shared power, water, and network).
  • Real-time mirroring is a technique used to replicate data to a geographically separate location to ensure data is available should a restore process be needed.
  • Manual, non-standard, or ad hoc (on-demand, unscheduled) procedures are also important. They are the responsibility of the business units, which use them to maintain business continuity while EIT is rebuilding system services. Recommend to business management that manual processes either be automated or be tested on a regular basis. Also recommend documenting the methods used to mitigate problems caused by aging technology, such as keeping parts inventories and redundant or cold standby equipment.
  • Offsite archiving of data is a technique used to ensure backups are available should a disaster make the primary site unavailable. Offsite services are available through many service providers. Due diligence by the DR team is important to ensure the offsite facilities can guarantee secure and proper handling of backup data, an important enterprise asset.
  • Action plans and recovery processes differ depending on what type of disaster has occurred. A single-component failure results in a standalone recovery of the failing component (e.g., application, server, or appliance). An enterprise-wide disaster results in a disaster declaration event, with the full DR plan being executed and the full DR team being mobilized.
  • Identify and document potential disaster scenarios which have a high probability of happening to the business. For example, intrusion or denial of service attacks could have very adverse effects on a technology company, whereas adverse environment conditions create higher risks to a construction company.
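
Geographic separation between a primary site and a recovery site can be checked with a great-circle distance calculation. The sketch below uses the haversine formula; the minimum distance is a hypothetical parameter the DR team would set, and distance alone does not reveal shared infrastructure dependencies such as a common power grid or network carrier.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def sufficiently_separated(primary, recovery, min_km):
    """primary, recovery: (latitude, longitude) tuples in degrees."""
    return great_circle_km(*primary, *recovery) >= min_km
```

For example, `sufficiently_separated(primary_site, hot_site, min_km=150)` would flag a hot site that sits in the same metropolitan area as the primary datacenter, given coordinates for both.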

4.3.3 Define a Schedule for Service Continuity Testing

Defining a schedule for disaster recovery testing is critical to the success of a plan. One method often used is to simulate a disaster to test system recovery. Another strongly recommended practice is to have production support refresh test environments from backups on a regular (e.g., monthly) basis. This not only ensures that backups are usable, but also that the processes are well documented and actually work, preserving data quality and integration integrity.
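
One simple usability check during a test refresh is to compare a checksum of the restored data with a checksum recorded when the backup was taken. The sketch below is a minimal illustration using throwaway files in place of real data; the file names and contents are hypothetical, and a matching checksum only shows the bytes are identical, not that the application can use them.

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_matches(expected_path, restored_path):
    """True when the restored file is byte-identical to the expected one."""
    return sha256_of(expected_path) == sha256_of(restored_path)

# Demonstration with temporary files standing in for real data.
with tempfile.TemporaryDirectory() as workdir:
    expected = os.path.join(workdir, "expected.dat")
    good = os.path.join(workdir, "restored_good.dat")
    bad = os.path.join(workdir, "restored_bad.dat")
    samples = [(expected, b"ledger v1"), (good, b"ledger v1"), (bad, b"ledger v")]
    for path, payload in samples:
        with open(path, "wb") as f:
            f.write(payload)
    good_restore_ok = restore_matches(expected, good)  # identical copy
    bad_restore_ok = restore_matches(expected, bad)    # truncated copy
```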

4.3.4 Implement and Test DR Plan (Drill or Simulation)

  • Implementation can take many forms. A hot site contract is an agreement with a third-party vendor to provide the facilities and infrastructure needed to restore agreed-to services in the timeframe specified. There are many variants of this type of contract, depending on the dollar value of the contract and the expected availability of internal staff at the time of a disaster. If the hot sites are geographically distant from the enterprise offices, it is likely that the contract includes staff to perform the recovery as well.
  • Due to the size and complexity of many enterprises, in-house DR facilities are often the norm, meaning these are secondary owned facilities used as recovery centers for primary facilities if needed.
  • Useful metrics from testing include the timing of actual recovery procedures, as well as measures of the capabilities of the DR team and of third-party or secondary facilities, and the level of maturity of both staff knowledge and process accuracy.

4.4 DR Plan — Change Management

Regular verification of, and updates to, backup processes are necessary to ensure that accurate and usable backups are delivered. This change management process needs to provide updates to the documentation of the backup and recovery processes. For example:

  • DR testing cycle changes as services change or risk tolerances change.
  • DR test results should lead to process improvements and lessons learned being added to the documentation.
  • Updates and changes to BCP go hand in hand with the changes to systems and services.

Mature organizations build continual improvement evaluation and activities into all processes.

4.4.1 Update DR Plan Based on DR Test Results and Validation

Validation metrics are the measurements that quantify the success of processes, based on the requirements and goals of the business. The following measures can be used to determine the success of a DR test or simple restore procedure execution:

  • Recovery point objectives met
  • Recovery time objectives met
  • Testing result measurements (for example, timing of restore, accuracy of data, and integration points)
  • Verification of backup usability
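
The measures above can be combined into a single pass/fail summary for a DR test. The sketch below is one possible shape for such a check; the field names in the result record and the sample figures are hypothetical, and record counts stand in for whatever data-accuracy measure the business agrees on.

```python
from datetime import timedelta

def evaluate_dr_test(result, rpo, rto):
    """Score one DR test against the recovery objectives.

    result: dict with "data_loss" and "downtime" (timedeltas),
    "records_expected" and "records_restored" (ints), and
    "backup_restored_ok" (bool).
    """
    checks = {
        "rpo_met": result["data_loss"] <= rpo,
        "rto_met": result["downtime"] <= rto,
        "data_accurate": result["records_restored"] == result["records_expected"],
        "backup_usable": result["backup_restored_ok"],
    }
    checks["passed"] = all(checks.values())
    return checks

# Hypothetical test: data is intact, but the restore took too long.
test_result = {
    "data_loss": timedelta(hours=11, minutes=30),
    "downtime": timedelta(hours=19, minutes=30),
    "records_expected": 120000,
    "records_restored": 120000,
    "backup_restored_ok": True,
}
summary = evaluate_dr_test(test_result, rpo=timedelta(hours=24), rto=timedelta(hours=12))
```

In this example the RPO is met but the RTO is not, so the test fails overall and the timing shortfall becomes an input to the next revision of the DR plan.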

5 Summary

Like most other processes, DR processes form a closed loop of plan > build > test > review with action. Continuous improvement and maturity of these processes come from regularly executing DR tests, measuring the results, and then revising the DR plan as necessary. Stakeholder involvement in setting requirements is critical to the success of DR processes.

6 Important Skills

  • Change management
  • Enterprise and business architecture
  • Risk management
  • Configuration management
  • Testing and validation

7 Standards

ISO/IEC 24765:2009 Systems and software engineering

8 References — Bibliography

[1] Systems and Software Engineering Vocabulary. (2009). ISO/IEC 24765

[2] ISACA. (n.d.). http://www.isaca.org/Pages/Glossary.aspx

[3] ITIL Service Catalogue: How to produce a Service Catalogue; http://www.itilnews.com/ITIL_Service_Catalogue_How_to_produce_a_Service_Catalogue.html

[4] “Introduction to the ITIL Service Lifecycle”, Second Edition, Office of Government Commerce, 2010

[5] Dwight Kayto, “Defining IT Services”, Art of Change; http://www.artofchange.ca/images/documents/defining%20it%20services.pdf

[6] British Computer Society (BCS), “Delivery mission critical system,” 2011; http://www.bcs.org/content/conWebDoc/43139
http://www.downloads.xdelta.co.uk/2011/2011_07_19-bcs-mission_critical-colin_butcher.pdf

[7] Realtech, “Application Lifecycle Management,” Diagram; http://www.realtech.com/wInternational/software/solutions/application-lifecycle-management/application-lifecycle-managementW3DnavanchorW262110100.php

9 Related and Informing Disciplines

  • Business continuity management
  • Application life cycle management
  • Risk management
  • Change management
  • Enterprise and business architecture
  • Configuration management
  • Testing and validation

10 Appendix

TEMPLATE for a disaster recovery plan

  1. Introduction
    • Scope of the plan
    • Objectives — RTO, RPO
    • Authority
    • Distribution
    • Disaster Declaration Process
    • Plan Review
  2. Recovery
    • Recovery Team
    • Recovery Plan
    • Disaster Preparation
    • Recovery Tasks (Short Term and Long Term)
  3. Backup
    • Backup Strategy for each Critical System
  4. Contact Information
    • Facilities Information
    • Recovery Team Information
    • Other Important Business Contacts