Difference between revisions of "Disaster Preparedness"

From EITBOK

Revision as of 23:59, 24 February 2017

Welcome to the trial version of the EITBOK wiki. Like all wikis, it is a work in progress and may contain errors. We welcome feedback, edits, and real-world examples. Click here for instruction about how to send us feedback.

 


1 Introduction

Disaster preparedness and disaster recovery (DR) support business-continuity planning. They include planning for enterprise information technology (EIT) resiliency, as well as recovery from adversity, so that affected critical business services are restored to a satisfactory working state within an acceptable timeframe after the event.

DR can be defined as “in computer system operations, the return to normal operation after a hardware or software failure.” [1] It can also be defined as the “activities and programs designed to return the organization to an acceptable condition. The ability to respond to an interruption in services by implementing a disaster recovery plan to restore an organization’s critical business functions.” [2]

This chapter defines these processes and deliverables, and who should be responsible for planning, creating the documents, and communicating if a disaster occurs. The following are some examples for context:

  • Examples of disasters
    • Natural disaster affecting datacenters or EIT service operations (flood, fire, earthquake, wind)
    • Security breach resulting in disaster (destruction of data, admin password changes, virus/malware installation, sabotage)
    • Usage error (accidental deletion, unplug/turn off system resulting in corruption)
    • Utility failure affecting datacenters (loss of power even after UPS)
    • Vendor failure (cloud provider security failure, oil spill)
    • Staffing issue (employment dispute/walkout, epidemic)
  • Examples of unpreparedness
    • Requiring use of computers or printers when power is out
    • Requiring use of internet when power or connectivity is out
    • Single point of knowledge/control for administration access
    • Lack of offsite backup storage
    • Lack of working restoration from backups
    • Lack of failover datacenters in separate locations
    • Missing or out-of-date documentation for system interfaces
    • Requiring use of phones that are out of power
    • Lack of designated leaders for restoration efforts (people who are in charge of restoring service and know that they are in charge)
    • In general, no cohesive, comprehensive EIT service restoration plan

2 Goals and Principles

EIT organizations are responsible for the following goals:

  • To document and plan for appropriate backup and recovery processes for all systems, and priority of systems for restoration.
  • To create and deploy an EIT disaster recovery plan.
  • To ensure that the business has business continuity processes in place in case of a disaster.

Fundamental principles of disaster recovery depend on the business functions within the enterprise, and how critical each is to the health of the business. There are several methods for determining criticality of functions.

  • Hierarchy of need, as stated in SLAs: the most critical business functions should be restored first, or in the first phase of disaster recovery.
  • Keep the lights on (KTLO) or keep the business running (KTBR), which are not the same thing.
  • All non-critical services are in the final phase of recovery.
  • Industry-specific, so all systems delivering lifesaving functions are the highest priority for recovery efforts, whereas administration systems wait for second or third wave of recovery.

However, a fundamental recovery principle is that all systems to be recovered should be attended to within the specifications for recovery time objectives (RTO) and recovery point objectives (RPO) laid out by the business in the DR plan.

3 Context Diagram

07 Disaster Preparedness CD.png

Figure 1. Context Diagram for Disaster Preparedness and Recovery

3.1 Gather Inputs

The following inputs are necessary for this process to initiate or continue:

The obvious business driver is to reduce risk for the business, by providing both mitigation strategies and contingency plans. High-risk projects or operational inefficiencies can lead to lost business, which ultimately causes lost income for the business — this can be the high price of risk.

Another business driver for formal DR processes may be to meet regulatory (e.g., SOX) or sustainability objectives. Part of the information gathering includes conducting workshops or interviews to document the drivers to ensure that deliverables meet these requirements.

Another related information-gathering effort is to define and document the technical drivers for DR, including aging technology and lack of application-support capabilities.

4 Description of Activities

4.1 Business Impact Analysis

4.1.1 Define Critical Business Services

The first activity is to define services critical to operations. Critical services are those that, if missing, would mean that the enterprise could no longer meet commitments and deliver business products or services. Use business impact analysis, and get input from the business, such as the risk management group, business continuity management, audit departments, and executives. Use business process diagrams to assist with analysis.

The following list is a suggested structure for determining the service categories and corresponding criticality of the organization's services (for definitions of the categories, refer to the Standard Service Definitions section):

  • Mission critical
  • Business critical
  • Business operational
  • Administrative services [3]

Examples of typical critical services within an enterprise are safety processes, safety documentation management, communication policies and processes, and financial data and processes.

4.1.2 Map Critical Business Services to EIT Services

This function is often referred to as building an EIT service catalog, which is an important input to disaster recovery planning. A service catalog is “a database or structured document with information about all live EIT services, including those available for deployment…The service catalog includes information about deliverables, prices, contact points, ordering, and request processes.” [4] Templates exist to assist with this mapping. [5] See the Operations and Support chapter for more information on service catalogs.

4.1.3 Define Relevant Disaster Scenarios and Responsible Parties

Clearly define the criteria for declaring a disaster: who declares it, when, and how. Mature organizations assign in advance who is in charge during a disaster, so that there is a clear leader who can decide which processes and procedures to implement and who knows to follow the communication plan. Without such a plan, people make invalid assumptions about who is in charge: either no one takes responsibility or multiple parties compete to lead, and neither outcome helps resolve the disaster and recover service.

4.1.4 Define Successive Waves for Extending Recovery Across the Business

Due to the complex nature of EIT systems within the enterprise today, it is unrealistic to provide recovery for all services in the initial recovery phase. There are different levels of recovery for different tiers of business services, and a corresponding, agreed-to timeframe for recovery of each service within the enterprise. These waves of recovery begin with the most critical services, and move through to the least critical in an acceptable timeframe based on a risk-mitigation process. For example, level one (i.e., Tier 1) recovery may take place within 72 hours of a disaster and would include services such as product production, shipping, and customer-service applications. Note: A non-critical service may be recovered in the first pass of recovery based solely on a critical service having it as a dependency.
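The note about dependencies can be illustrated with a minimal sketch. It assumes each service carries a numeric tier (1 is recovered first) and a list of the services it depends on. The service names, tiers, and the `plan_recovery_waves` function are hypothetical and serve only as illustration; they are not part of any EITBOK-defined method.

```python
from typing import Dict, List


def plan_recovery_waves(
    tiers: Dict[str, int],
    depends_on: Dict[str, List[str]],
) -> Dict[int, List[str]]:
    """Assign each service to a recovery wave (1 = first wave).

    A service starts in the wave given by its own tier and is pulled
    forward to the wave of any earlier service that depends on it.
    """
    wave = dict(tiers)
    changed = True
    while changed:  # repeat until no dependency has to move earlier
        changed = False
        for service, deps in depends_on.items():
            for dep in deps:
                if wave[dep] > wave[service]:
                    wave[dep] = wave[service]
                    changed = True

    # Group services by wave, in alphabetical order within each wave.
    waves: Dict[int, List[str]] = {}
    for service, number in sorted(wave.items()):
        waves.setdefault(number, []).append(service)
    return waves


# Hypothetical services and tiers, for illustration only.
tiers = {
    "shipping": 1,
    "customer-service": 1,
    "label-printing": 3,
    "intranet": 3,
}
# Shipping cannot run without label printing.
depends_on = {"shipping": ["label-printing"]}

waves = plan_recovery_waves(tiers, depends_on)
print(waves)
```

In this example, label printing is a Tier 3 service on its own, but it lands in the first wave because the Tier 1 shipping service depends on it. The intranet stays in the third wave.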

Critical systems management is a useful process in the identification and documentation of critical systems. [6] Also, it ensures that proper application life-cycle management is occurring for these EIT services. [7]

Use risk-assessment techniques to analyze how disaster scenarios could adversely affect the business. One such process would be to tier possible risks into levels such as:

  • Affecting the entire enterprise
  • Affecting only certain business units
  • Affecting a single component (either a technology component or a business unit)
  • Affecting a single business function (such as processing credit card transactions)

4.2 Recovery Objectives and DR Plan

4.2.1 Determine Recovery Objectives and Develop Plan

In cooperation with the business, define the recovery point objectives (RPOs) and recovery time objectives (RTOs).

RPO is the point in time to which all integrated systems are recovered, taking into account backup schedules, sync points, and data-transfer points to ensure data quality and integrity.

RTO is how long it will take to return an EIT service to active duty. This varies depending on the criticality of the service as well as how integrated the service is with other services.
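As a rough illustration, the two objectives can be checked against the timestamps of an actual or simulated outage. This sketch assumes the RPO is expressed as the maximum tolerable gap between the last recoverable copy of the data and the outage, and the RTO as the maximum tolerable downtime. The function name, field names, and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_outcome(last_good_copy, outage_start, service_restored, rpo, rto):
    """Compare one recovery against its RPO and RTO.

    data_loss_window: work done after the last recoverable copy is lost.
    downtime: how long the service was unavailable.
    """
    data_loss_window = outage_start - last_good_copy
    downtime = service_restored - outage_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical timestamps and objectives, for illustration only.
result = recovery_outcome(
    last_good_copy=datetime(2017, 2, 24, 2, 0),
    outage_start=datetime(2017, 2, 24, 9, 30),
    service_restored=datetime(2017, 2, 24, 15, 0),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(result)
```

Here the service was back within the 8-hour RTO (5.5 hours of downtime), but the last recoverable copy was 7.5 hours old, so the 4-hour RPO was missed. A result like this points at the backup schedule rather than at the restore procedure.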

RecoveryTimeline.jpg

Figure 2. Recovery Timeline

Configuration management is a process that helps document the business impact of a service, as well as documenting the backup and recovery requirements. Also, it provides an inventory of the applications and supporting infrastructure needed in the restoration processes.

Organization and Culture

The risk tolerance and depth of capabilities within the organization have a large impact on the organization’s disaster preparedness level. In other words, the business’s disaster tolerance is “the time gap the business can accept the non-availability of EIT facilities.” [2] The lower the tolerance, the more extensive and costly the DR practices and techniques that must be deployed.

Also, the business product deliveries determine the requirements of the planning effort and metrics.

4.2.2 Develop Communications Plan

An effective communication plan is an essential component to the successful implementation and adoption of the DR processes. The communication plan should include:

  • How to deliver communications when standard communication systems are unavailable (such as email or phone systems)
  • Who to contact in a disaster situation, including specific lists for specific situations or systems affected
  • What information each communication should and shouldn’t include

Contact information lists should include the following stakeholders:

  • External partners (service providers and suppliers)
  • Police/fire/municipal departments
  • EIT management and staff
  • Business management and product owners

The DR communication plan should describe the process to provide business updates to business-continuity plans after the recovery has been completed.

A process for disaster declaration needs to be included in the DR plan and be well communicated to the team. In this section, all contact information and approval authority should be spelled out (i.e., who has the authority to declare a disaster within the company).

4.2.3 Develop Backup and Archive Strategies and Schedule

  • The EIT team responsible for DR is either responsible for backup and recovery or works closely with the team who is. Archiving and incremental backups need to be scheduled for the varying needs of the systems being supported. Backup standards and recovery strategies should be defined to ensure the business requirements are met.
  • Backup and storage technology has a large role to play in the recoverability of applications and systems. Current backup utilities provide incremental-forever backup processes, which can help reduce the cost of storage used for holding backups. In addition, architecture features such as high-availability options and failover redundancies can both reduce the risk of service loss and provide mitigation strategies for unstable or unreliable systems.
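When full and incremental backups are combined, a restore needs the most recent full backup taken before the target point in time, plus every incremental taken after it up to that point. The sketch below shows that selection in simplified form. The `Backup` record and the `restore_chain` function are hypothetical, and real backup utilities track this chain themselves.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental"


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to apply, oldest first, to restore to `target`."""
    usable = sorted(
        (b for b in backups if b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    full_positions = [i for i, b in enumerate(usable) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup exists at or before the target time")
    # Start from the latest full backup and keep everything after it.
    return usable[full_positions[-1]:]


# Hypothetical backup history, for illustration only.
history = [
    Backup(datetime(2017, 2, 19, 1, 0), "full"),
    Backup(datetime(2017, 2, 20, 1, 0), "incremental"),
    Backup(datetime(2017, 2, 21, 1, 0), "incremental"),
    Backup(datetime(2017, 2, 22, 1, 0), "incremental"),
]
chain = restore_chain(history, target=datetime(2017, 2, 21, 12, 0))
for backup in chain:
    print(backup.taken_at, backup.kind)
```

The longer the chain of incrementals, the more pieces must all be readable for a restore to succeed, which is one reason regular restore tests matter.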

4.2.4 Develop and Document DR Plan

A disaster recovery plan (DRP) is “a set of human, physical, technical, and procedural resources to recover, within a defined time and cost, an activity interrupted by an emergency or disaster.” [2]

The DR plan document needs to include all of the information required to recover all critical systems that a business needs to operate. EIT must work with the business to develop and document a DR plan. See the template at the end of this chapter for recommended sections of a DR plan.

Data collection techniques are critical to the development of a meaningful DR plan that meets the needs of the business.

4.2.5 Interface with Business Continuity

The EIT team must communicate its processes to the business, and make consistent updates to the business-continuity plan (BCP). As new business components or services are added, the business assigns a criticality level, which then needs to be translated into EIT services that are assigned internally to a tier to determine the disaster recovery requirements. The relationship between business continuity and EIT disaster recovery is symbiotic and is critical to the success of both functions within the enterprise.

4.3 Implement and Test DR plan (Drill or Simulation)

The first step to implementing a DR plan is to allocate resources and assign responsibilities. The DR team needs to be assigned early in the process to ensure accountability and understanding of roles at the time of a disaster. Many different roles are needed to define and execute a successful DR plan. The DR test is an opportunity to cross-train roles, to mitigate the risk of key roles not being available should a disaster occur. It is likely that no one from the business DR team will be available for the recovery of the systems, so documentation, testing, and assigning a strategic partner are important to the recovery of business services.

4.3.1 Roles and Responsibilities

Input supplier roles are roles and teams that supply the inputs to the process:

  • Enterprise risk-management team
  • BCP manager
  • EIT managers
  • Enterprise architecture team
  • Solution management team

Key roles are the responsible individuals or teams that perform the process:

  • DR team leads
  • Test team
  • Recovery center manager
  • System specialists (multiple)
  • Business management team
    • Facilities manager
    • Service manager

User roles expect and receive the deliverables:

  • Operations management team
    • Backup process manager
  • Test manager
  • Business management team

Stakeholder roles are informed or consulted on the process execution:

  • Enterprise risk management team
  • Operations management team
    • Business continuity manager
  • Business management team
    • Contract manager

4.3.2 Document Recovery Strategies

As mentioned above, there are many strategies to recover the services that the business needs to function. There is a different solution for every different service out there. The most important element is to choose a strategy, then document and communicate it.

  • Use a third-party hot recovery site. The hot site should be in a geographically separate location to ensure that a natural disaster does not take out both the primary production location and the backup site location. These distances vary depending on geographic as well as infrastructure dependencies (such as power, water, and network commonalities).
  • Real-time mirroring is a technique used to replicate data to a geographically separate location to ensure data is available if a restore process is needed.
  • Manual, non-standard, or ad hoc/on-demand/unscheduled procedures are an important responsibility of the business units, because they ensure business continuity while EIT is rebuilding system services. Recommend to business management that manual processes either be automated or be tested on a regular basis. Document the methods used to mitigate problems caused by aging technology, such as having parts inventories, and redundant or cold standby equipment.
  • Offsite data archiving ensures that backups are available if a disaster makes the primary site unavailable. Offsite services are available through many service providers. Due diligence by the DR team is important to ensure that the offsite facilities can guarantee secure and proper handling of backup data, which is an important enterprise asset.
  • Action plans and recovery processes differ depending on what type of disaster has occurred. A single-component failure results in a standalone recovery of the failing component (such as an application, server, or appliance). An enterprise-wide disaster results in a disaster declaration event with a full DR plan being executed with the full DR team being mobilized.
  • Identify and document potential disaster scenarios that have a high probability. For example, intrusion or denial of service attacks could have adverse effects on a technology company, whereas adverse environment conditions create higher risks to a construction company.

4.3.3 Define a Schedule for Service Continuity Testing

For the success of the recovery plan, it is critical to define a schedule for disaster recovery testing. One method often used is to simulate a disaster to test system recovery. Another strongly recommended practice is to have production support test refreshes on a regular (e.g., monthly) basis. This not only ensures that backups are usable, but also that processes are well documented and functional, ensuring data quality and integration integrity.

4.3.4 Implement and Test DR Plan (Drill or Simulation)

  • Implementation can take many forms. A hot-site contract is an agreement with a third-party vendor to provide the facilities and infrastructure needed to restore agreed-to services in the timeframe specified. There are many variants of this type of contract, depending on the dollar value of the contract and the expected availability of internal staff at the time of a disaster. If the hot sites are geographically distant from the enterprise offices, it is likely the contract includes staff to perform the recovery as well.
  • Due to the size and complexity of many enterprises, in-house DR facilities are often the norm, meaning these are secondary facilities used as recovery centers for primary facilities if needed.
  • Useful metrics from testing processes are the timing of the actual recovery procedures, the capabilities of the DR team, third party, or secondary facilities, and the level of maturity of both staff knowledge and process accuracy.

4.4 DR Plan — Change Management

Regular verification and updates to backup processes are necessary to ensure that accurate and usable backups are delivered. This change-management process needs to provide updates to the documentation of the backup and recovery processes. For example:

  • DR testing cycle changes as services change or risk tolerances change.
  • DR test results lead to process improvements and lessons learned, which are added to the documentation.
  • Updates and changes to business-continuity plan (BCP) go hand in hand with the changes to systems and services.

Mature organizations build continual improvement evaluation and activities into all processes.

4.4.1 Update DR Plan Based on DR Test Results and Validation

Validation metrics are measurements that quantify the success of processes, based on the requirements and goals of the business. The following measures can be used to determine the success of a DR test or simple restore procedure execution:

  • Recovery point objectives met
  • Recovery time objectives met
  • Testing result measurements (for example, timing of restore, accuracy of data, and integration points)
  • Verification of backup usability
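One narrow part of verifying backup usability can be automated: confirming that a backup file is byte-for-byte identical to what was originally written, by comparing it with a checksum recorded at backup time. The sketch below uses SHA-256 from Python's standard `hashlib` module. The helper names and the temporary file are hypothetical, and a matching checksum shows only that the file is unchanged, not that it restores correctly, so test restores are still needed.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(path, expected_sha256):
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == expected_sha256.lower()


# Demonstration with a throwaway file standing in for a backup.
with tempfile.TemporaryDirectory() as workdir:
    backup_path = os.path.join(workdir, "example.bak")
    with open(backup_path, "wb") as handle:
        handle.write(b"example backup contents")

    recorded = sha256_of(backup_path)  # stored alongside the backup
    intact = backup_matches(backup_path, recorded)

    # Simulate corruption by appending stray bytes to the file.
    with open(backup_path, "ab") as handle:
        handle.write(b"corruption")
    corruption_detected = not backup_matches(backup_path, recorded)

print("intact before change:", intact)
print("corruption detected:", corruption_detected)
```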

5 Summary

Like most processes, DR processes are a closed loop of plan > build > test > review with action. Continuous improvement and maturity of these processes are obtained through the regular execution of DR tests, measuring results, and then revising the DR plan as necessary. Stakeholder involvement with setting requirements is critical to the success of DR processes.

6 Key Maturity Frameworks

Capability maturity for EIT refers to its ability to reliably perform. Maturity is measured by an organization’s readiness and capability expressed through its people, processes, data, and technologies and the consistent measurement practices that are in place. Please see Appendix F for additional information about maturity frameworks.

Many specialized frameworks have been developed since the original Capability Maturity Model (CMM) that was developed by the Software Engineering Institute in the late 1980s. This section describes how some of those apply to the activities described in this chapter.

6.1 IT-Capability Maturity Framework (IT-CMF)

The IT-CMF was developed by the Innovation Value Institute in Ireland. It helps organizations to measure, develop, and monitor their EIT capability maturity progression. It consists of 35 IT management capabilities that are organized into four macro capabilities:

  • Managing IT like a business
  • Managing the IT budget
  • Managing the IT capability
  • Managing IT for business value

The three most relevant critical capabilities are Technical Infrastructure Management (TIM), Information Security Management (ISM) and Enterprise Information Management (EIM).

6.1.1 Technical Infrastructure Management Maturity

The following statements provide a high-level overview of the Technical Infrastructure Management (TIM) capability at successive levels of maturity.

Level 1 Management of the IT infrastructure is reactive or ad hoc.
Level 2 Documented policies are emerging relating to the management of a limited number of infrastructure components. Predominantly manual procedures are used for IT infrastructure management. Visibility of capacity and utilization across infrastructure components is emerging.
Level 3 Management of infrastructure components is increasingly supported by standardized tool sets that are partly integrated, resulting in decreased execution times and improving infrastructure utilization.
Level 4 Policies relating to IT infrastructure management are implemented automatically, promoting execution agility and achievement of infrastructure utilization targets.
Level 5 The IT infrastructure is continually reviewed so that it remains modular, agile, lean, and sustainable.

6.1.2 Information Security Management Maturity

The following statements provide a high-level overview of the Information Security Management (ISM) capability at successive levels of maturity.

Level 1 The approach to information security tends to be localized. Incidents are typically not responded to in a timely manner.
Level 2 Defined security approaches, policies, and controls are emerging, primarily focused on complying with regulations.
Level 3 Standardized security approaches, policies, and controls are in place across the IT function, dealing with access rights, business continuity, budgets, toolsets, incident response management, audits, non-compliance, and so on.
Level 4 Comprehensive security approaches, policies, and controls are in place and are fully integrated across the organization.
Level 5 Security approaches, policies, and controls are regularly reviewed to maintain a proactive approach to preventing security breaches.

6.1.3 Enterprise Information Management Maturity

The following statements provide a high-level overview of the Enterprise Information Management (EIM) capability at successive levels of maturity.

Level 1 Management has limited awareness of information management opportunities.
Level 2 Basic and discrete information management approaches are in place, typically by function or line of business.
Level 3 Standardized information management policies, standards, and controls are in place across the IT function, enabling formal oversight of all aspects of information management.
Level 4 Comprehensive information management policies, standards, and controls are in place across the organization. Business intelligence and analysis are recognized as key to organizational success.
Level 5 Information management policies, standards, and controls are continually reviewed based on agreed risk tolerance factors. Their scope effectively extends to key business ecosystem partners.


7 Key Competence Frameworks

While many large companies have defined their own sets of skills for purposes of talent management (to recruit, retain, and further develop the highest quality staff members that they can find, afford and hire), the advancement of EIT professionalism will require common definitions of EIT skills that can be used not just across enterprises, but also across countries. We have selected 3 major sources of skill definitions. While none of them is used universally, they provide a good cross-section of options.

Creating mappings between these frameworks and our chapters is challenging, because they come from different perspectives and have different goals. There is rarely a 100% correspondence between the frameworks and our chapters, and, despite careful consideration, some subjectivity was used to create the mappings. Please take that into consideration as you review them.

7.1 Skills Framework for the Information Age

The Skills Framework for the Information Age (SFIA) has defined nearly 100 skills. SFIA describes 7 levels of competency which can be applied to each skill. Not all skills, however, cover all seven levels. Some reach only partially up the seven step ladder. Others are based on mastering foundational skills, and start at the fourth or fifth level of competency. It is used in nearly 200 countries, from Britain to South Africa, South America, to the Pacific Rim, to the United States. (http://www.sfia-online.org)

SFIA skills have not yet been defined for this chapter.


7.2 European Competency Framework

The European Union’s European e-Competence Framework (e-CF) has 40 competences and is used by a large number of companies, qualification providers and others in public and private sectors across the EU. It uses five levels of competence proficiency (e-1 to e-5). No competence is subject to all five levels.

The e-CF is published and legally owned by CEN, the European Committee for Standardization, and its National Member Bodies (www.cen.eu). Its creation and maintenance has been co-financed and politically supported by the European Commission, in particular, DG (Directorate General) Enterprise and Industry, with contributions from the EU ICT multi-stakeholder community, to support competitiveness, innovation, and job creation in European industry. The Commission works on a number of initiatives to boost ICT skills in the workforce. Version 1.0 to 3.0 were published as CEN Workshop Agreements (CWA). The e-CF 3.0 CWA 16234-1 was published as an official European Norm (EN), EN 16234-1. For complete information, please see http://www.ecompetences.eu.

e-CF Dimension 2: E.3. Risk Management (MANAGE)
Implements the management of risk across information systems through the application of the enterprise defined risk management policy and procedure. Assesses risk to the organisation’s business, including web, cloud and mobile resources. Documents potential risk and containment plans.

e-CF Dimension 3: Level 2-4

7.3 i Competency Dictionary

The Information Technology Promotion Agency (IPA) of Japan has developed the i Competency Dictionary (iCD), translated it into English, and describes it at https://www.ipa.go.jp/english/humandev/icd.html. It is an extensive skills and tasks database, used in Japan and southeast Asian countries. It establishes a taxonomy of tasks and the skills required to perform the tasks. The IPA is also responsible for the Information Technology Engineers Examination (ITEE), which has grown into one of the largest scale national examinations in Japan, with approximately 600,000 applicants each year.

The iCD consists of a Task Dictionary and a Skill Dictionary. Skills for a specific task are identified via a “Task x Skill” table. (Please see Appendix A for the task layer and skill layer structures.) EITBOK activities in each chapter require several tasks in the Task Dictionary.

The table below shows a sample task from iCD Task Dictionary Layer 2 (with Layer 1 in parentheses) that correspond to activities in this chapter. It also shows the Layer 2 (Skill Classification), Layer 3 (Skill Item), and Layer 4 (knowledge item from the IPA Body of Knowledge) prerequisite skills associated with the sample task, as identified by the Task x Skill Table of the iCD Skill Dictionary. The complete iCD Task Dictionary (Layer 1-4) and Skill Dictionary (Layer 1-4) can be obtained by returning the request form provided at http://www.ipa.go.jp/english/humandev/icd.html.

Task Dictionary
  Task Layer 1 (Task Layer 2): Formulation of business continuity plan (Business continuity management)

Skill Dictionary
  Skill Classification: Business continuity planning (BCP)
  Skill Item: BCP formulation methods
  Associated Knowledge Items:
    • Risk analysis
    • Business continuity and identification of bottlenecks
    • Clarification of implementation standards
    • Recovery prioritization
    • Setting of target recovery time


8 Key Roles

These roles are common to ITSM:

  • IT Service Continuity Manager
  • Risk Manager
  • Information Security Manager
  • Financial Manager

9 Standards

ANSI/ASIS SPC.1-2009. Organizational Resilience: Security, Preparedness and Continuity Management Systems—Requirements with Guidance for Use

ISO 22301:2012, Societal security -- Business continuity management systems -- Requirements

ISO/IEC 20000-1:2011, (IEEE Std 20000-1:2013) Information technology – Service management – Part 1: Service management system requirements

ISO/IEC 27031:2011, Information technology -- Security techniques -- Guidelines for information and communication technology readiness for business continuity


10 References

[1] Systems and Software Engineering Vocabulary. (2009). ISO/IEC 24765

[2] ISACA. (n.d.). http://www.isaca.org/Pages/Glossary.aspx

[3] ITIL Service Catalogue: How to produce a Service Catalogue; http://www.itilnews.com/ITIL_Service_Catalogue_How_to_produce_a_Service_Catalogue.html

[4] Introduction to the ITIL Service Lifecycle, Second Edition, Office of Government Commerce, 2010

[5] Dwight Kayto, Defining IT Services, Art of Change; http://www.artofchange.ca/images/documents/defining%20it%20services.pdf

[6] British Computing Society, BCS Delivery mission critical system, 2011; http://www.bcs.org/content/conWebDoc/43139
http://www.downloads.xdelta.co.uk/2011/2011_07_19-bcs-mission_critical-colin_butcher.pdf

[7] Realtech, Application Lifecycle Management, Diagram; http://www.realtech.com/wInternational/software/solutions/application-lifecycle-management/application-lifecycle-managementW3DnavanchorW262110100.php

11 Related and Informing Disciplines

  • Business continuity management
  • Application life-cycle management
  • Risk management
  • Change management
  • Enterprise and business architecture
  • Configuration management
  • Testing and validation

12 Disaster Recovery Plan Template

Here is an example template for a disaster recovery plan.

  1. Introduction
    • Scope of the plan
    • Objectives — RTO, RPO
    • Authority
    • Distribution
    • Disaster declaration process
    • Plan review
  2. Recovery
    • Recovery team
    • Recovery plan
    • Disaster preparation
    • Recovery tasks (short term and long term)
  3. Backup
    • Backup strategy for each critical system
  4. Contact information
    • Facilities information
    • Recovery team information
    • Other important business contacts