Disaster Preparedness
1 Introduction
Disaster preparedness and disaster recovery (DR) support business-continuity planning. They include planning for enterprise information technology (EIT) resiliency and for recovery from adversity, so that affected critical business services are restored to a satisfactory working state within an acceptable timeframe after an event.
DR can be defined as "in computer system operations, the return to normal operation after a hardware or software failure." [1] A second definition describes DR as the "activities and programs designed to return the organization to an acceptable condition" and "the ability to respond to an interruption in services by implementing a disaster recovery plan to restore an organization's critical business functions." [2]
This chapter defines these processes and deliverables, and identifies who should be responsible for planning, creating the documents, and communicating if a disaster occurs. The following examples provide context:
- Examples of disasters
  - Natural disaster affecting datacenters or EIT service operations (flood, fire, earthquake, wind)
  - Security breach resulting in a disaster (destruction of data, admin password changes, virus/malware installation, sabotage)
  - Usage error (accidental deletion, unplugging or turning off a system, resulting in corruption)
  - Utility failure affecting datacenters (loss of power that persists after UPS power is exhausted)
  - Vendor failure (cloud provider security failure, oil spill)
  - Staffing issue (employment dispute/walkout, epidemic)
- Examples of unpreparedness
  - Requiring use of computers or printers when power is out
  - Requiring use of Internet when power or connectivity is out
  - Single point of knowledge/control for administration access
  - Lack of offsite backup storage
  - Lack of working restoration from backups
  - Lack of failover datacenters in separate locations
  - Missing or out-of-date documentation for system interfaces
  - Requiring use of phones that are out of power
  - Lack of designated leaders for restoration efforts (people who are in charge of restoring service and know that they are in charge)
  - In general, no cohesive, comprehensive EIT service restoration plan
2 Goals and Principles
EIT organizations are responsible for the following goals:
- To document and plan for appropriate backup and recovery processes for all systems, and priority of systems for restoration.
- To create and deploy an EIT disaster recovery plan.
- To ensure that the business has business-continuity processes in place in case of a disaster.
The fundamental principles of disaster recovery depend on the business functions within the enterprise, and how critical each is to the health of the business. There are several methods for determining criticality of functions:
- Hierarchy of need as stated in SLAs, which is that the most critical business functions should be restored first, or in the first phase of disaster recovery.
- Keep the lights on (KTLO) or keep the business running (KTBR), which are not the same thing.
- All non-critical services are in the final phase of recovery.
- Industry-specific priorities. For example, all systems delivering lifesaving functions are the highest priority for recovery efforts, whereas administration systems wait for the second or third wave of recovery.
However, a fundamental recovery principle is that all systems to be recovered should be attended to within the specifications for recovery time objectives (RTOs) and recovery point objectives (RPOs) laid out by the business in the DR plan.
3 Context Diagram
Figure 1. Context Diagram for Disaster Preparedness and Recovery
3.1 Gather Inputs
The following inputs are necessary for this process to initiate or continue:
- Budget for DR plan
- Business risk factors/risk assessment
- Business and EIT service-level agreements (SLAs), operational-level agreements (OLAs), and contractual obligations to upstream/downstream systems
- Business-continuity plan
- Configuration management database (CMDB) and asset inventory (see the Operations and Support chapter)
- Current enterprise architecture artifacts/source code/document management systems
- EIT service catalog
- EIT staff capabilities
- Vendor service agreements/maintenance agreements
The obvious business driver is to reduce risk for the business, by providing both mitigation strategies and contingency plans. High-risk projects or operational inefficiencies can lead to lost business, which ultimately causes lost income for the business—this can be the high price of risk.
Another business driver for formal DR processes may be to meet regulatory (e.g., SOX) or sustainability objectives. Part of the information gathering includes conducting workshops or interviews to document the drivers, to ensure that deliverables meet these requirements.
A related information-gathering effort is to define and document the technical drivers for DR, including aging technology and lack of application-support capabilities.
4 Description of Activities
4.1 Business Impact Analysis
4.1.1 Define Critical Business Services
The first activity is to define services critical to operations. Critical services are those that, if missing, would mean that the enterprise could no longer meet commitments and deliver business products or services. Use business impact analysis, and get input from the business, such as the risk management group, the business continuity management, audit departments, and executives. Use business process diagrams to assist with analysis.
The following list is a suggested structure for determining the service categories and corresponding criticality of the organization's services (for definitions of the categories, refer to [3]):
- Mission critical
- Business critical
- Business operational
- Administrative services [3]
Examples of typical critical services within an enterprise are safety processes, safety documentation management, communication policies and processes, and financial data and processes.
4.1.2 Map Critical Business Services to EIT Services
This function is often referred to as building an EIT service catalog, which is an important input to disaster recovery planning. A service catalog is "a database or structured document with information about all live EIT services, including those available for deployment…The service catalog includes information about deliverables, prices, contact points, ordering, and request processes." [4] There are templates to assist with this mapping. [5] See the Operations and Support chapter for more information on service catalogs.
4.1.3 Define Relevant Disaster Scenarios and Responsible Parties
Clearly define criteria for who declares a disaster, including when and how. Mature organizations have assigned who is in charge during disasters so that there is a clear leader who can decide which processes and procedures to implement, and who knows to follow the communication plan. Without a plan, people make invalid assumptions about who is in charge: either no one takes responsibility, or multiple parties compete to be in charge. Neither outcome helps resolve the disaster and recover service.
4.1.4 Define Successive Waves for Extending Recovery Across the Business
Due to the complex nature of EIT systems within the enterprise today, it is unrealistic to provide recovery for all services in the initial recovery phase. There are different levels of recovery for different tiers of business services, and a corresponding, agreed-to timeframe for recovery of each service within the enterprise. These waves of recovery begin with the most critical services, and move through to the least critical in an acceptable timeframe based on a risk-mitigation process. For example, level one (i.e., Tier 1) recovery may take place within 72 hours of a disaster and would include services such as product production, shipping, and customer-service applications. Note: A non-critical service may be recovered in the first pass of recovery based solely on a critical service having it as a dependency.
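The note about dependencies can be illustrated with a minimal sketch. It assumes each service has a numeric tier (1 is the most critical) and a list of the services it depends on. The service names, the tier numbering, and the `assign_waves` helper are hypothetical and are not part of any DR standard.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    tier: int  # hypothetical numbering: 1 = most critical
    depends_on: list[str] = field(default_factory=list)


def assign_waves(services: list[Service]) -> dict[str, int]:
    """Return a recovery wave per service.

    Each service starts in the wave that matches its own tier and is
    pulled forward to the wave of any service that depends on it.
    """
    wave = {s.name: s.tier for s in services}
    changed = True
    while changed:  # repeat until no dependency needs to be promoted
        changed = False
        for s in services:
            for dep in s.depends_on:
                if dep in wave and wave[dep] > wave[s.name]:
                    wave[dep] = wave[s.name]
                    changed = True
    return wave


# Illustrative services only
services = [
    Service("order-processing", tier=1, depends_on=["auth-directory"]),
    Service("auth-directory", tier=3, depends_on=["dns"]),
    Service("dns", tier=3),
    Service("intranet-wiki", tier=3),
]
waves = assign_waves(services)
```

In this example the two lower-tier services that order processing relies on, directly or indirectly, move into the first wave, while the unrelated intranet wiki stays in the third.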
Critical systems management is a useful process in the identification and documentation of critical systems. [6] Also, it ensures that proper application lifecycle management is occurring for these EIT services. [7]
Use risk-assessment techniques to analyze how disaster scenarios could adversely affect the business. One such process would be to tier possible risks into levels such as:
- Affecting the entire enterprise
- Affecting only certain business units
- Affecting a single component (either a technology component or a business unit)
- Affecting a single business function (such as processing credit card transactions)
4.2 Recovery Objectives and DR Plan
4.2.1 Determine Recovery Objectives and Develop Plan
In cooperation with the business, define the recovery point objective (RPO) and recovery time objective (RTO).
RPO is the point in time to which all integrated systems are recovered, taking into account backup schedules, sync points, and data-transfer points to ensure data quality and integrity.
RTO is how long it will take to return an EIT service to active duty. This varies depending on the criticality of the service as well as how integrated the service is with other services.
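As a rough planning aid, the two objectives can be checked with simple arithmetic. The sketch below makes two simplifying assumptions: the worst-case data loss equals the interval between backups, and the recovery steps run one after another. The step names and durations are made up for illustration.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Assumption: a failure just before the next backup loses
    # up to one full interval of data.
    return backup_interval


def estimated_recovery_time(step_durations: dict[str, timedelta]) -> timedelta:
    # Assumption: recovery steps are sequential, so durations add up.
    return sum(step_durations.values(), timedelta())


# Objectives agreed with the business (illustrative values)
rpo = timedelta(hours=4)
rto = timedelta(hours=8)

backup_interval = timedelta(hours=6)
steps = {
    "declare disaster": timedelta(hours=1),
    "provision infrastructure": timedelta(hours=2),
    "restore data": timedelta(hours=3),
    "verify integrations": timedelta(hours=1),
}

rpo_ok = worst_case_data_loss(backup_interval) <= rpo
rto_ok = estimated_recovery_time(steps) <= rto
```

Here the six-hour backup interval cannot satisfy a four-hour RPO, so the backup schedule would need to change, while the estimated seven hours of recovery work fits within the eight-hour RTO.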
Configuration management is a process that helps document the business impact of a service, as well as documenting the backup and recovery requirements. Also, it provides an inventory of the applications and supporting infrastructure needed in the restoration processes.
Organization and Culture
The risk tolerance and depth of capabilities within the organization have a large impact on the organization's disaster preparedness level. The business's disaster tolerance is "the time gap the business can accept the non-availability of EIT facilities." [2] The lower the tolerance, the more extensive and costly the DR practices and techniques that must be deployed.
Also, the business product deliveries determine the requirements of the planning effort and metrics.
4.2.2 Develop Communications Plan
An effective communication plan is an essential component to the successful implementation and adoption of the DR processes. The communication plan should include:
- How to deliver communications when standard communication systems are unavailable (such as email or phone systems)
- Who to contact in a disaster situation, including specific lists for specific situations or systems affected
- What information each communication should and shouldn't include
Contact information lists should include the following stakeholders:
- External partners (service providers and suppliers)
- Police/fire/municipal departments
- EIT management and staff
- Business management and product owners
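One way to record contact lists so that they remain usable when standard systems are down is to store several channels per contact in order of preference. The following is a minimal sketch of that idea. The `Contact` structure, the channel names, and the contact details are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    role: str
    # (channel, address) pairs in order of preference
    channels: list[tuple[str, str]]


def first_reachable(contact: Contact, unavailable: set[str]):
    """Return the first (channel, address) that is not known to be down."""
    for channel, address in contact.channels:
        if channel not in unavailable:
            return channel, address
    return None  # no working channel is on record for this contact


# Placeholder contact details
lead = Contact(
    name="A. Example",
    role="DR team lead",
    channels=[
        ("email", "dr-lead@example.com"),
        ("mobile", "+1-555-0100"),
        ("satellite phone", "+1-555-0199"),
    ],
)
choice = first_reachable(lead, unavailable={"email", "mobile"})
```

A `None` result for any key contact is a gap that the communication plan should close before a disaster occurs.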
The DR communication plan should describe the process to provide business updates to business-continuity plans after the recovery has been completed.
A process for disaster declaration needs to be included in the DR plan and be well communicated to the team. In this section, all contact information and approval authority should be spelled out (i.e., who has the authority to declare a disaster within the company).
4.2.3 Develop Backup and Archive Strategies and Schedule
- The EIT team responsible for DR is either responsible for backup and recovery or works closely with the team that is. Archiving and incremental backups need to be scheduled for the varying needs of the systems being supported. Backup standards and recovery strategies should be defined to ensure that the business requirements are met.
- Backup and storage technology has a large role to play in the recoverability of applications and systems. Current backup utilities provide incremental-forever backup processes, which can help reduce the cost of storage used for holding backups. In addition, architecture features such as high-availability options and failover redundancies can both reduce the risk of service loss and provide mitigation strategies for unstable or unreliable systems.
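Incremental backups affect how a restore is carried out, because the restore needs a full backup plus every later incremental up to the chosen recovery point. The sketch below shows that selection under a simplifying assumption: each incremental depends on the backup immediately before it. The `Backup` record and `restore_chain` helper are hypothetical and do not describe any particular backup product.

```python
from datetime import datetime
from typing import NamedTuple


class Backup(NamedTuple):
    taken_at: datetime
    kind: str  # "full" or "incremental"


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Return the backups needed, in order, to restore to the target time."""
    candidates = sorted(b for b in backups if b.taken_at <= target)
    full_positions = [i for i, b in enumerate(candidates) if b.kind == "full"]
    if not full_positions:
        raise ValueError("no full backup at or before the target time")
    # Start at the most recent full backup and apply every later incremental.
    return candidates[full_positions[-1]:]


# Illustrative backup history
backups = [
    Backup(datetime(2024, 1, 1, 1, 0), "full"),
    Backup(datetime(2024, 1, 2, 1, 0), "incremental"),
    Backup(datetime(2024, 1, 3, 1, 0), "incremental"),
    Backup(datetime(2024, 1, 4, 1, 0), "incremental"),
]
chain = restore_chain(backups, target=datetime(2024, 1, 3, 12, 0))
```

The length of the chain is worth tracking, because a long chain of incrementals can lengthen the restore and therefore affect whether the RTO is achievable.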
4.2.4 Develop and Document DR Plan
A disaster recovery plan (DRP) is "a set of human, physical, technical, and procedural resources to recover, within a defined time and cost, an activity interrupted by an emergency or disaster." [2]
The DR plan document needs to include all the information required to recover all critical systems that a business needs to operate. EIT must work with the business to develop and document a DR plan. See the template at the end of this chapter for recommended sections of a DR plan.
Data collection techniques are critical to the development of a meaningful DR plan that meets the needs of the business.
4.2.5 Interface with Business Continuity
The EIT team must communicate their processes to the business, and make consistent updates to the business-continuity plan (BCP). As new business components or services are added, the business assigns a criticality level, which then needs to be translated into EIT services that are assigned internally to a tier to determine the disaster recovery requirements. The relationship between business continuity and EIT disaster recovery is symbiotic and is critical to the success of both functions within the enterprise.
4.3 Implement and Test DR plan (Drill or Simulation)
The first step in implementing a DR plan is to allocate resources and assign responsibilities. The DR team needs to be assigned early in the process to ensure accountability and an understanding of roles at the time of a disaster. Many different roles are needed to define and execute a successful DR plan. The DR test is an opportunity to cross-train roles, to mitigate the risk of key roles not being available if a disaster occurs. It is likely that no one from the business DR team will be available for the recovery of the systems, so documentation, testing, and assigning a strategic partner are important to the recovery of business services.
4.3.1 Roles and Responsibilities
Input supplier roles are roles and teams that supply the inputs to the process:
- Enterprise risk-management team
- BCP manager
- EIT managers
- Enterprise architecture team
- Solution management team
Key roles are the responsible individuals or teams that perform the process:
- DR team leads
- Test team
- Recovery center manager
- System specialists (multiple)
- Business management team
- Facilities manager
- Service manager
User roles expect and receive the deliverables:
- Operations management team
- Backup process manager
- Test manager
- Business management team
Stakeholder roles are informed or consulted on the process execution:
- Enterprise risk management team
- Operations management team
- Business continuity manager
- Business management team
- Contract manager
4.3.2 Document Recovery Strategies
As mentioned above, there are many strategies to recover the services that the business needs to function. There is a different solution for every service. The most important element is to choose a strategy, then document and communicate it.
- Use a third-party hot recovery site. The hot site should be in a geographically separate location to ensure that a natural disaster does not take out both the primary production location and the backup site. The required distance varies depending on geographic and infrastructure dependencies (such as power, water, and network commonalities).
- Real-time mirroring is a technique used to replicate data to a geographically separate location to ensure that data is available if a restore process is needed.
- Manual, non-standard, or ad hoc/on-demand/unscheduled procedures are the responsibility of the business units, and they help ensure business continuity while EIT is rebuilding system services. Recommend to business management that manual processes either be automated or be tested on a regular basis. Document the methods used to mitigate problems caused by aging technology, such as having parts inventories, and redundant or cold-standby equipment.
- Offsite data archiving ensures that backups are available if a disaster makes the primary site unavailable. Offsite services are available through many service providers. Due diligence by the DR team is important to ensure that the offsite facilities can guarantee secure and proper handling of backup data, which is an important enterprise asset.
- Action plans and recovery processes differ depending on what type of disaster has occurred. A single-component failure results in a standalone recovery of the failing component (such as an application, server, or appliance). An enterprise-wide disaster results in a disaster declaration event with a full DR plan being executed with the full DR team being mobilized.
- Identify and document potential disaster scenarios that have a high probability. For example, intrusion or denial of service attacks could have adverse effects on a technology company, whereas adverse environment conditions create higher risks to a construction company.
4.3.3 Define a Schedule for Service Continuity Testing
For the success of the recovery plan, it is critical to define a schedule for disaster recovery testing. One method often used is to simulate a disaster to test system recovery. Another strongly recommended process is to have production support test refreshes on a regular (e.g., monthly) basis. This not only ensures that backups are usable, but also that processes are well documented and functional, ensuring data quality and integration integrity.
4.3.4 Implement and Test DR Plan (Drill or Simulation)
- Implementation can take many forms. A hot-site contract is an agreement with a third-party vendor to provide the facilities and infrastructure needed to restore agreed-to services in the timeframe specified. There are many variants of this type of contract, depending on the dollar value of the contract and the expected availability of internal staff at the time of a disaster. If the hot sites are geographically distant from the enterprise offices, it is likely that the contract includes staff to perform the recovery as well.
- Due to the size and complexity of many enterprises, in-house DR facilities are often the norm, meaning these are secondary facilities used as recovery centers for primary facilities, if needed.
- Useful metrics from testing are the timing of the actual recovery procedures, a measure of the capabilities of the DR team, third party, or secondary facilities, and the level of maturity of both staff knowledge and process accuracy.
4.4 DR Plan—Change Management
Regular verification and updates to backup processes are necessary to ensure that accurate and usable backups are delivered. This change-management process needs to provide updates to the documentation of the backup and recovery processes. For example:
- DR testing cycle changes as services change or risk tolerances change.
- DR test results always cause process improvements and lessons learned to be added to the documentation.
- Updates and changes to the business-continuity plan (BCP) go hand in hand with the changes to systems and services.
Mature organizations build continual improvement evaluation and activities into all processes.
4.4.1 Update DR Plan Based on DR Test Results and Validation
Validation metrics are measurements that quantify the success of processes, based on the requirements and goals of the business. The following measures can be used to determine the success of a DR test or simple restore procedure execution:
- Recovery point objectives met
- Recovery time objectives met
- Testing result measurements (for example, timing of restore, accuracy of data, and integration points)
- Verification of backup usability
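The measures above can be recorded in a consistent way after each DR test. The following is a minimal sketch, assuming the test team notes three timestamps (the last usable recovery point, the start of the simulated outage, and the moment service was restored) plus whether the restored backup was verified. The function name and the sample values are hypothetical.

```python
from datetime import datetime, timedelta


def evaluate_dr_test(last_recovery_point: datetime,
                     outage_start: datetime,
                     service_restored: datetime,
                     rpo: timedelta,
                     rto: timedelta,
                     backup_verified: bool) -> dict:
    """Compare measured data loss and downtime with the agreed objectives."""
    data_loss = outage_start - last_recovery_point
    downtime = service_restored - outage_start
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
        "backup_usable": backup_verified,
    }


# Sample measurements from an imagined test
result = evaluate_dr_test(
    last_recovery_point=datetime(2024, 3, 1, 2, 0),
    outage_start=datetime(2024, 3, 1, 9, 30),
    service_restored=datetime(2024, 3, 1, 15, 0),
    rpo=timedelta(hours=8),
    rto=timedelta(hours=4),
    backup_verified=True,
)
```

In this sample the recovery point objective is met but the recovery time objective is missed, which is the kind of finding that should feed an update to the DR plan.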
5 Summary
Like most processes, DR processes are a closed loop of plan > build > test > review with action. Continuous improvement and maturity of these processes are obtained through the regular execution of DR tests, measuring results, and then revising the DR plan as necessary. Stakeholder involvement with setting requirements is critical to the success of DR processes.
6 Key Maturity Frameworks
Capability maturity for EIT refers to its ability to reliably perform. Maturity is measured by an organization's readiness and capability expressed through its people, processes, data, technologies, and the consistent measurement practices that are in place. See Appendix F for additional information about maturity frameworks.
Many specialized frameworks have been developed since the original Capability Maturity Model (CMM) that was developed by the Software Engineering Institute in the late 1980s. This section describes how some of those apply to the activities described in this chapter.
6.1 IT-Capability Maturity Framework (IT-CMF)
The IT-CMF was developed by the Innovation Value Institute in Ireland. This framework helps organizations to measure, develop, and monitor their EIT capability maturity progression. It consists of 35 EIT management capabilities that are organized into four macro capabilities:
- Managing EIT like a business
- Managing the EIT budget
- Managing the EIT capability
- Managing EIT for business value
The three most relevant critical capabilities are technical infrastructure management (TIM), information security management (ISM), and enterprise information management (EIM).
6.1.1 Technical Infrastructure Management Maturity
The following statements provide a high-level overview of the technical infrastructure management (TIM) capability at successive levels of maturity.
Level 1 | Management of the EIT infrastructure is reactive or ad hoc. |
Level 2 | Documented policies are emerging relating to the management of a limited number of infrastructure components. Predominantly manual procedures are used for EIT infrastructure management. Visibility of capacity and utilization across infrastructure components is emerging. |
Level 3 | Management of infrastructure components is increasingly supported by standardized tool sets that are partly integrated, resulting in decreased execution times and improving infrastructure utilization. |
Level 4 | Policies related to EIT infrastructure management are implemented automatically, promoting execution agility and achievement of infrastructure utilization targets. |
Level 5 | The EIT infrastructure is continually reviewed so that it remains modular, agile, lean, and sustainable. |
6.1.2 Information Security Management Maturity
The following statements provide a high-level overview of the information security management (ISM) capability at successive levels of maturity.
Level 1 | The approach to information security tends to be localized. Incidents are typically not responded to in a timely manner. |
Level 2 | Defined security approaches, policies, and controls are emerging, primarily focused on complying with regulations. |
Level 3 | Standardized security approaches, policies, and controls are in place across the EIT function, dealing with access rights, business continuity, budgets, toolsets, incident response management, audits, non-compliance, and so on. |
Level 4 | Comprehensive security approaches, policies, and controls are in place and are fully integrated across the organization. |
Level 5 | Security approaches, policies, and controls are regularly reviewed to maintain a proactive approach to preventing security breaches. |
6.1.3 Enterprise Information Management Maturity
The following statements provide a high-level overview of the enterprise information management (EIM) capability at successive levels of maturity.
Level 1 | Management has limited awareness of information management opportunities. |
Level 2 | Basic and discrete information management approaches are in place, typically by function or line of business. |
Level 3 | Standardized information management policies, standards, and controls are in place across the EIT function, enabling formal oversight of all aspects of information management. |
Level 4 | Comprehensive information management policies, standards, and controls are in place across the organization. Business intelligence and analysis are recognized as key to organizational success. |
Level 5 | Information management policies, standards, and controls are continually reviewed based on agreed risk tolerance factors. Their scope effectively extends to key business ecosystem partners. |
7 Key Competence Frameworks
While many large companies have defined their own sets of skills for purposes of talent management (to recruit, retain, and further develop the highest quality staff members that they can find, afford and hire), the advancement of EIT professionalism will require common definitions of EIT skills that can be used not just across enterprises, but also across countries. We have selected three major sources of skill definitions. While none of them is used universally, they provide a good cross-section of options.
Creating mappings between these frameworks and our chapters is challenging, because they come from different perspectives and have different goals. There is rarely a 100 percent correspondence between the frameworks and our chapters, and, despite careful consideration, some subjectivity was used to create the mappings. Please take that into consideration as you review them.
7.1 Skills Framework for the Information Age
The Skills Framework for the Information Age (SFIA) has defined nearly 100 skills. SFIA describes seven levels of competency that can be applied to each skill. However, not all skills cover all seven levels. Some reach only partially up the seven-step ladder. Others are based on mastering foundational skills, and start at the fourth or fifth level of competency. SFIA is used in nearly 200 countries, from Britain to South Africa, South America, the Pacific Rim, and the United States. (http://www.sfia-online.org)
SFIA skills have not yet been defined for this chapter.
7.2 European Competency Framework
The European Union's European e-Competence Framework (e-CF) has 40 competences and is used by a large number of companies, qualification providers, and others in public and private sectors across the EU. It uses five levels of competence proficiency (e-1 to e-5). No competence is subject to all five levels.
The e-CF is published and legally owned by CEN, the European Committee for Standardization, and its National Member Bodies (www.cen.eu). Its creation and maintenance have been co-financed and politically supported by the European Commission, in particular, DG (Directorate General) Enterprise and Industry, with contributions from the EU ICT multi-stakeholder community, to support competitiveness, innovation, and job creation in European industry. The Commission works on a number of initiatives to boost ICT skills in the workforce. Versions 1.0 to 3.0 were published as CEN Workshop Agreements (CWA). The e-CF 3.0 CWA 16234-1 was published as an official European Norm (EN), EN 16234-1. For complete information, see http://www.ecompetences.eu.
e-CF Dimension 2 | e-CF Dimension 3 |
---|---|
E.3. Risk Management (MANAGE) Implements the management of risk across information systems through the application of the enterprise-defined risk management policy and procedure. Assesses risk to the organization's business, including web, cloud, and mobile resources. Documents potential risk and containment plans. | Level 2-4 |
7.3 i Competency Dictionary
The Information Technology Promotion Agency (IPA) of Japan has developed the i Competency Dictionary (iCD), translated it into English, and describes it at https://www.ipa.go.jp/english/humandev/icd.html. The iCD is an extensive skills and tasks database, used in Japan and Southeast Asian countries. It establishes a taxonomy of tasks and the skills required to perform the tasks. The IPA is also responsible for the Information Technology Engineers Examination (ITEE), which has grown into one of the largest-scale national examinations in Japan, with approximately 600,000 applicants each year.
The iCD consists of a Task Dictionary and a Skill Dictionary. Skills for a specific task are identified via a "Task x Skill" table. (See Appendix A for the task layer and skill layer structures.) EITBOK activities in each chapter require several tasks in the Task Dictionary.
The table below shows a sample task from iCD Task Dictionary Layer 2 (with Layer 1 in parentheses) that corresponds to activities in this chapter. It also shows the Layer 2 (Skill Classification), Layer 3 (Skill Item), and Layer 4 (knowledge item from the IPA Body of Knowledge) prerequisite skills associated with the sample task, as identified by the Task x Skill Table of the iCD Skill Dictionary. The complete iCD Task Dictionary (Layer 1-4) and Skill Dictionary (Layer 1-4) can be obtained by returning the request form provided at http://www.ipa.go.jp/english/humandev/icd.html.
Task Dictionary | Skill Dictionary | ||
---|---|---|---|
Task Layer 1 (Task Layer 2) | Skill Classification | Skill Item | Associated Knowledge Items |
Formulation of business continuity plan (business continuity management) | Business continuity planning (BCP) | BCP formulation methods | |
8 Key Roles
These roles are common to ITSM:
- Financial Manager
- Facilities Manager
- EIT Service Continuity Manager
- Risk Manager
Other roles include:
- Disaster recovery team
- Information Security Manager
- Operations management team
- Service Manager
- System specialists
- Test team
9 Standards
ANSI/ASIS SPC.1-2009. Organizational Resilience: Security, Preparedness and Continuity Management Systems—Requirements with Guidance for Use
ISO 22301:2012, Societal security—Business continuity management systems—Requirements
ISO/IEC 20000-1:2011, (IEEE Std 20000-1:2013) Information technology—Service management—Part 1: Service management system requirements
ISO/IEC 27031:2011, Information technology—Security techniques—Guidelines for information and communication technology readiness for business continuity
10 References
[1] Systems and Software Engineering Vocabulary. (2009). ISO/IEC 24765
[2] ISACA. (n.d.). http://www.isaca.org/Pages/Glossary.aspx
[3] ITIL Service Catalogue: How to produce a Service Catalogue; http://www.itilnews.com/ITIL_Service_Catalogue_How_to_produce_a_Service_Catalogue.html
[4] Introduction to the ITIL Service Lifecycle, Second Edition, Office of Government Commerce, 2010
[5] Dwight Kayto, Defining IT Services, Art of Change; http://www.artofchange.ca/images/documents/defining%20it%20services.pdf
[6] British Computer Society, BCS Delivery mission critical system, 2011; http://www.bcs.org/content/conWebDoc/43139 and http://www.downloads.xdelta.co.uk/2011/2011_07_19-bcs-mission_critical-colin_butcher.pdf
[7] Realtech, Application Lifecycle Management, Diagram; http://www.realtech.com/wInternational/software/solutions/application-lifecycle-management/application-lifecycle-managementW3DnavanchorW262110100.php
11 Related and Informing Disciplines
- Application lifecycle management
- Business continuity management
- Change management
- Configuration management
- Enterprise and business architecture
- Risk management
- Testing and validation
12 Disaster Recovery Plan Template
Here is an example template for a disaster recovery plan.
- Introduction
  - Scope of the plan
  - Objectives: RTO, RPO
  - Authority
  - Distribution
  - Disaster declaration process
  - Plan review
- Recovery
  - Recovery team
  - Recovery plan
  - Disaster preparation
  - Recovery tasks (short term and long term)
- Backup
  - Backup strategy for each critical system
- Contact information
  - Facilities information
  - Recovery team information
  - Other important business contacts