Disaster Preparedness
Revision as of 20:59, 16 November 2015
1 Introduction
Disaster preparedness and disaster recovery (DR) support business-continuity planning. They include planning for enterprise information technology (EIT) resiliency, as well as recovery from adversity, so that affected critical business services are restored to a satisfactory working state within an acceptable timeframe after the event.
DR can be defined as “in computer system operations, the return to normal operation after a hardware or software failure.” [1] It can also be defined as the “activities and programs designed to return the organization to an acceptable condition. The ability to respond to an interruption in services by implementing a disaster recovery plan to restore an organization’s critical business functions.” [2]
This chapter defines these processes and deliverables, and identifies who should be responsible for planning, creating the documents, and communicating if a disaster occurs. The following are some examples for context:
- Examples of disasters
  - Natural disaster affecting datacenters or EIT service operations (flood, fire, earthquake, wind)
  - Security breach resulting in disaster (destruction of data, admin password changes, virus/malware installation, sabotage)
  - Usage error (accidental deletion, unplugging or turning off a system, resulting in corruption)
  - Utility failure affecting datacenters (loss of power even after UPS)
  - Vendor failure (cloud provider security failure, oil spill)
  - Staffing issue (employment dispute/walkout, epidemic)
- Examples of unpreparedness
  - Requiring use of computers or printers when power is out
  - Requiring use of internet when power or connectivity is out
  - Single point of knowledge/control for administration access
  - Lack of offsite backup storage
  - Lack of working restoration from backups
  - Lack of failover datacenters in separate locations
  - Missing or out-of-date documentation for system interfaces
  - Requiring use of phones that are out of power
  - Lack of designated leaders for restoration efforts (people who are in charge of restoring service and know that they are in charge)
  - In general, no cohesive, comprehensive EIT service restoration plan
2 Goals and Principles
EIT organizations are responsible for the following goals:
- To document and plan for appropriate backup and recovery processes for all systems, and priority of systems for restoration.
- To create and deploy an EIT disaster recovery plan.
- To ensure that the business has business continuity processes in place in case of a disaster.
Fundamental principles of disaster recovery depend on the business functions within the enterprise, and on how critical each is to the health of the business. There are several methods for determining the criticality of functions:
- Hierarchy of need, as stated in SLAs: the most critical business functions should be restored first, or in the first phase of disaster recovery.
- Keep the lights on (KTLO) or keep the business running (KTBR), which are not the same thing
- All non-critical services are in the final phase of recovery.
- Industry-specific: for example, all systems delivering lifesaving functions are the highest priority for recovery efforts, whereas administration systems wait for the second or third wave of recovery.
However, a fundamental recovery principle is that all systems to be recovered should be attended to within the specifications for recovery time objectives (RTO) and recovery point objectives (RPO) laid out by the business in the DR plan.
3 Context Diagram
Figure 1. Context Diagram for Disaster Preparedness and Recovery
3.1 Gather Inputs
The following inputs are necessary for this process to initiate or continue:
- Budget for DR plan
- Business risk factors/risk assessment
- Business and EIT service-level agreements (SLAs), operational-level agreements (OLAs), and contractual obligations to upstream/downstream systems
- Business-continuity plan
- Configuration management database (CMDB) and asset inventory (see the Operations and Support chapter)
- Current enterprise architecture artifacts/source code/document management systems
- EIT service catalogue
- EIT staff capabilities
- Vendor service agreements/maintenance agreements
The obvious business driver is to reduce risk for the business by providing both mitigation strategies and contingency plans. High-risk projects or operational inefficiencies can lead to lost business, which ultimately causes lost income; this can be the high price of risk.
Another business driver for formal DR processes may be to meet regulatory (e.g., SOX) or sustainability objectives. Part of the information gathering includes conducting workshops or interviews to document the drivers and ensure that deliverables meet these requirements.
Another related information-gathering effort is to define and document the technical drivers for DR, including aging technology and lack of application-support capabilities.
4 Description of Activities
4.1 Business Impact Analysis
4.1.1 Define Critical Business Services
The first activity is to define services critical to operations. Critical services are those that, if missing, would mean that the enterprise could no longer meet commitments and deliver business products or services. Use business impact analysis, and get input from the business, such as the risk management group, business continuity management, audit departments, and executives. Use business process diagrams to assist with analysis.
The following list is a suggested structure for determining the service categories and corresponding criticality of the organization’s services (for definitions of the categories, refer to the Standard Service Definitions section):
- Mission critical
- Business critical
- Business operational
- Administrative services [3]
Examples of typical critical services within an enterprise are safety processes, safety documentation management, communication policies and processes, and financial data and processes.
4.1.2 Map Critical Business Services to EIT Services
This function is often referred to as building an EIT service catalog, which is an important input to disaster recovery planning. A service catalog is “a database or structured document with information about all live EIT services, including those available for deployment…The service catalog includes information about deliverables, prices, contact points, ordering, and request processes.” [4] Templates exist to assist with this mapping. [5] See the Operations and Support chapter for more information on service catalogs.
4.1.3 Define Relevant Disaster Scenarios and Responsible Parties
Clearly define criteria for who declares a disaster, including when and how. Mature organizations have assigned who is in charge during disasters so that there is a clear leader who can decide which processes and procedures to implement, and who knows to follow the communication plan. Without a plan, people make invalid assumptions about who is in charge: either no one takes responsibility or multiple parties compete to be in charge. Neither outcome helps resolve the disaster and recover service.
4.1.4 Define Successive Waves for Extending Recovery Across the Business
Due to the complex nature of EIT systems within the enterprise today, it is unrealistic to provide recovery for all services in the initial recovery phase. There are different levels of recovery for different tiers of business services, and a corresponding, agreed-to timeframe for recovery of each service within the enterprise. These waves of recovery begin with the most critical services, and move through to the least critical in an acceptable timeframe based on a risk-mitigation process. For example, level one (i.e., Tier 1) recovery may take place within 72 hours of a disaster and would include services such as product production, shipping, and customer-service applications. Note: A non-critical service may be recovered in the first pass of recovery based solely on a critical service having it as a dependency.
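The note about a non-critical service being pulled into the first pass by a dependency can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the service names, wave numbers, and dependency map are hypothetical, and the rule applied (a dependency is promoted to the earliest wave of any service that needs it) is one reasonable reading of the note above.

```python
from typing import Dict, List


def effective_waves(
    assigned: Dict[str, int],
    depends_on: Dict[str, List[str]],
) -> Dict[str, int]:
    """Return the recovery wave per service (1 = recovered first).

    A dependency is promoted to the earliest wave of any service that
    needs it, directly or through a chain of dependencies.
    """
    waves = dict(assigned)  # copy so the assigned tiers stay untouched
    changed = True
    while changed:  # repeat until a full pass promotes nothing
        changed = False
        for service, dependencies in depends_on.items():
            for dependency in dependencies:
                if waves[dependency] > waves[service]:
                    waves[dependency] = waves[service]
                    changed = True
    return waves


# Hypothetical services and waves, for illustration only.
assigned_waves = {
    "shipping": 1,
    "customer_service": 1,
    "label_printing": 3,
    "print_server": 3,
    "intranet_wiki": 3,
}
dependencies = {
    "shipping": ["label_printing"],
    "label_printing": ["print_server"],
}

for name, wave in sorted(effective_waves(assigned_waves, dependencies).items()):
    print(f"wave {wave}: {name}")
```

In this example, label printing and the print server move into the first wave because shipping cannot run without them, while the intranet wiki stays in the last wave.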
Critical systems management is a useful process in the identification and documentation of critical systems. [6] Also, it ensures that proper application life-cycle management is occurring for these EIT services. [7]
Use risk-assessment techniques to analyze how disaster scenarios could adversely affect the business. One such process would be to tier possible risks into levels such as:
- Affecting the entire enterprise
- Affecting only certain business units
- Affecting a single component (either a technology component or a business unit)
- Affecting a single business function (such as processing credit card transactions)
4.2 Recovery Objectives and DR Plan
4.2.1 Determine Recovery Objectives and Develop Plan
In cooperation with the business, define the recovery point objectives (RPOs) and recovery time objectives (RTOs).
RPO is the point in time to which all integrated systems are recovered, taking into account backup schedules, sync points, and data-transfer points to ensure data quality and integrity.
RTO is how long it will take to return an EIT service to active duty. This varies depending on the criticality of the service as well as how integrated the service is with other services.
Figure 2. Recovery Timeline
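As a rough illustration of how the two objectives can be checked against an actual recovery, the sketch below measures the data-loss window and the downtime from three timestamps. It assumes the common convention of expressing the RPO and RTO targets as durations; the timestamps, targets, and function names are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss_window(disaster_at: datetime, recovered_to: datetime) -> timedelta:
    """Time between the recovery point and the disaster (work that is lost)."""
    return disaster_at - recovered_to


def downtime(disaster_at: datetime, service_restored_at: datetime) -> timedelta:
    """Time the service was unavailable."""
    return service_restored_at - disaster_at


# Hypothetical timeline for one service.
disaster_at = datetime(2015, 11, 16, 10, 0)
recovered_to = datetime(2015, 11, 16, 6, 0)  # last consistent sync point
service_restored_at = datetime(2015, 11, 17, 8, 0)

rpo_target = timedelta(hours=6)   # assumed business tolerance for data loss
rto_target = timedelta(hours=24)  # assumed business tolerance for downtime

rpo_met = data_loss_window(disaster_at, recovered_to) <= rpo_target
rto_met = downtime(disaster_at, service_restored_at) <= rto_target
print(f"RPO met: {rpo_met}, RTO met: {rto_met}")
```

For integrated systems, the recovery point would be the latest moment at which all of them are consistent with each other, which may be earlier than the newest backup of any single system.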
Configuration management is a process that helps document the business impact of a service, as well as documenting the backup and recovery requirements. Also, it provides an inventory of the applications and supporting infrastructure needed in the restoration processes.
Organization and Culture
The risk tolerance and depth of capabilities within the organization have a large impact on the organization’s disaster preparedness level. In other words, the business’s disaster tolerance is “the time gap the business can accept the non-availability of EIT facilities.” [2] The lower the tolerance, the more extensive and costly the DR practices and techniques that must be deployed.
Also, the business product deliveries determine the requirements of the planning effort and metrics.
4.2.2 Develop Communications Plan
An effective communication plan is an essential component to the successful implementation and adoption of the DR processes. The communication plan should include:
- How to deliver communications when standard communication systems are unavailable (such as email or phone systems)
- Who to contact in a disaster situation, including specific lists for specific situations or systems affected
- What information each communication should and shouldn’t include
Contact information lists should include the following stakeholders:
- External partners (service providers and suppliers)
- Police/fire/municipal departments
- EIT management and staff
- Business management and product owners
The DR communication plan should describe the process to provide business updates to business-continuity plans after the recovery has been completed.
A process for disaster declaration needs to be included in the DR plan and be well communicated to the team. In this section, all contact information and approval authority should be spelled out (i.e., who has the authority to declare a disaster within the company).
4.2.3 Develop Backup and Archive Strategies and Schedule
- The EIT team responsible for DR is either responsible for backup and recovery or works closely with the team that is. Archiving and incremental backups need to be scheduled for the varying needs of the systems being supported. Backup standards and recovery strategies should be defined to ensure the business requirements are met.
- Backup and storage technology has a large role to play in the recoverability of applications and systems. Current backup utilities provide incremental-forever backup processes, which can help reduce the cost of storage used for holding backups. In addition, architecture features such as high-availability options and failover redundancies can both reduce the risk of service loss and provide mitigation strategies for unstable or unreliable systems.
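When scheduling backups against business requirements, a quick sanity check is to compare the worst-case age of the newest offsite backup with the RPO. The sketch below uses a deliberately simple model that is an assumption, not something stated in this chapter: backups run at a fixed interval, and each one takes a fixed delay to reach offsite storage. All values and names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(
    backup_interval: timedelta,
    offsite_copy_delay: timedelta = timedelta(0),
) -> timedelta:
    """Oldest the newest offsite backup can be at the moment of a disaster."""
    return backup_interval + offsite_copy_delay


def schedule_meets_rpo(
    backup_interval: timedelta,
    rpo: timedelta,
    offsite_copy_delay: timedelta = timedelta(0),
) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, offsite_copy_delay) <= rpo


# Nightly backup shipped offsite in 2 hours, checked against a 24-hour RPO.
nightly_ok = schedule_meets_rpo(
    timedelta(hours=24), timedelta(hours=24), timedelta(hours=2)
)

# Incremental every 4 hours, copied in 30 minutes, against a 6-hour RPO.
incremental_ok = schedule_meets_rpo(
    timedelta(hours=4), timedelta(hours=6), timedelta(minutes=30)
)

print(f"nightly: {nightly_ok}, incremental: {incremental_ok}")
```

The nightly schedule fails the check because the copy delay pushes the worst case past 24 hours, which is the kind of gap that is easy to miss when only the backup interval is compared with the RPO.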
4.2.4 Develop and Document DR Plan
A disaster recovery plan (DRP) is “a set of human, physical, technical, and procedural resources to recover, within a defined time and cost, an activity interrupted by an emergency or disaster.” [2]
The DR plan document needs to include all of the information required to recover all critical systems that a business needs to operate. EIT must work with the business to develop and document a DR plan. See the template at the end of this chapter for recommended sections of a DR plan.
Data collection techniques are critical to the development of a meaningful DR plan that meets the needs of the business.
4.2.5 Interface with Business Continuity
The EIT team must communicate its processes to the business, and make consistent updates to the business-continuity plan (BCP). As new business components or services are added, the business assigns a criticality level. That level then needs to be translated into EIT services, which are assigned internally to a tier that determines the disaster recovery requirements. The relationship between business continuity and EIT disaster recovery is symbiotic and is critical to the success of both functions within the enterprise.
4.3 Implement and Test DR plan (Drill or Simulation)
The first step to implementing a DR plan is to allocate resources and assign responsibilities. The DR team needs to be assigned early in the process to ensure accountability and understanding of roles at the time of a disaster. Many different roles are needed to define and execute a successful DR plan. The DR test is an opportunity to cross-train roles, to mitigate the risk of key roles not being available should a disaster occur. It is likely that no one from the business DR team will be available for the recovery of the systems, so documentation, testing, and assigning a strategic partner are important to the recovery of business services.
4.3.1 Roles and Responsibilities
Input supplier roles are roles and teams that supply the inputs to the process:
- Enterprise risk-management team
- BCP manager
- EIT managers
- Enterprise architecture team
- Solution management team
Key roles are the responsible individuals or teams that perform the process:
- DR team leads
- Test team
- Recovery center manager
- System specialists (multiple)
- Business management team
- Facilities manager
- Service manager
User roles expect and receive the deliverables:
- Operations management team
- Backup process manager
- Test manager
- Business management team
Stakeholder roles are informed or consulted on the process execution:
- Enterprise risk management team
- Operations management team
- Business continuity manager
- Business management team
- Contract manager
4.3.2 Document Recovery Strategies
As mentioned above, there are many strategies to recover the services that the business needs to function. There is a different solution for every different service out there. The most important element is to choose a strategy, then document and communicate it.
- Use a third-party hot recovery site. The hot site should be in a geographically separate location to ensure that a natural disaster does not take out both the primary production location and the backup site. The required distance varies depending on geographic and infrastructure dependencies (such as power, water, and network commonalities).
- Real-time mirroring is a technique used to replicate data to a geographically separate location to ensure data is available if a restore process is needed.
- Manual, non-standard, or ad hoc/on-demand/unscheduled procedures are the responsibility of the business units, and they ensure business continuity while EIT is rebuilding system services. Recommend to business management that manual processes either be automated or be tested on a regular basis. Document the methods used to mitigate problems caused by aging technology, such as keeping parts inventories and redundant or cold standby equipment.
- Offsite data archiving ensures that backups are available if a disaster makes the primary site unavailable. Offsite services are available through many service providers. Due diligence by the DR team is important to ensure that the offsite facilities can guarantee secure and proper handling of backup data, which is an important enterprise asset.
- Action plans and recovery processes differ depending on what type of disaster has occurred. A single-component failure results in a standalone recovery of the failing component (such as an application, server, or appliance). An enterprise-wide disaster results in a disaster declaration event with a full DR plan being executed with the full DR team being mobilized.
- Identify and document potential disaster scenarios that have a high probability. For example, intrusion or denial of service attacks could have adverse effects on a technology company, whereas adverse environment conditions create higher risks to a construction company.
4.3.3 Define a Schedule for Service Continuity Testing
For the success of the recovery plan, it is critical to define a schedule for the disaster recovery testing. One method often used is to simulate a disaster to test system recovery. Another process strongly recommended is to have production support test refreshes on a regular (e.g., monthly) basis. This not only ensures that backups are usable, but also that processes are well documented and functional, ensuring data quality and integration integrity.
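A testing schedule is easier to enforce when overdue tests are flagged automatically. The following sketch assumes a simple log of the last successful restore test per system and a maximum age of 31 days to approximate a monthly cycle; the system names, dates, and threshold are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def overdue_restore_tests(
    last_tested: Dict[str, Optional[date]],
    today: date,
    max_age: timedelta = timedelta(days=31),
) -> List[str]:
    """Return systems never tested or not tested within max_age, sorted by name."""
    overdue = []
    for system in sorted(last_tested):
        tested_on = last_tested[system]
        if tested_on is None or today - tested_on > max_age:
            overdue.append(system)
    return overdue


# Hypothetical test log; None means no restore test has ever been recorded.
test_log = {
    "billing_db": date(2015, 11, 2),
    "order_entry": date(2015, 9, 14),
    "document_store": None,
}

print(overdue_restore_tests(test_log, today=date(2015, 11, 16)))
```

Treating a missing test date as overdue is a deliberate choice here: a system that has never been restored in a test is the one most likely to have unusable backups.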
4.3.4 Implement and Test DR Plan (Drill or Simulation)
- Implementation can take many forms. A hot-site contract is an agreement with a third-party vendor to provide the facilities and infrastructure needed to restore agreed-to services in the timeframe specified. There are many variants to this type of contract depending on the dollar value of the contract and the expected availability of internal staff at the time of a disaster. If the hot sites are geographically distant from the enterprise offices, it is likely the contract includes staff to perform the recovery as well.
- Due to the size and complexity of many enterprises, in-house DR facilities are often the norm, meaning these are secondary facilities used as recovery centers for primary facilities if needed.
- A useful metric from testing is the timing of the actual recovery procedures. Testing also provides a measure of the capabilities of the DR team, third-party, or secondary facilities, and of the maturity of both staff knowledge and process accuracy.
4.4 DR Plan — Change Management
Regular verification and updates to backup processes are necessary to ensure that accurate and usable backups are delivered. This change-management process needs to provide updates to the documentation of the backup and recovery processes. For example:
- DR testing cycle changes as services change or risk tolerances change.
- DR test results should always lead to process improvements, with lessons learned added to the documentation.
- Updates and changes to the business-continuity plan (BCP) go hand in hand with the changes to systems and services.
Mature organizations build continual improvement evaluation and activities into all processes.
4.4.1 Update DR Plan Based on DR Test Results and Validation
Validation metrics are measurements that quantify the success of processes, based on the requirements and goals of the business. The following measures can be used to determine the success of a DR test or simple restore procedure execution:
- Recovery point objectives met
- Recovery time objectives met
- Testing result measurements (for example, timing of restore, accuracy of data, and integration points)
- Verification of backup usability
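The measures above can be recorded per service and evaluated together after each test. The sketch below is one possible shape for such a record, not a standard format; the field names, targets, and sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class RestoreTestResult:
    service: str
    rpo_target: timedelta
    data_loss: timedelta      # measured gap between recovery point and outage
    rto_target: timedelta
    restore_time: timedelta   # measured time to return the service to use
    backup_usable: bool
    data_verified: bool       # data accuracy and integration points checked


def failed_measures(result: RestoreTestResult) -> List[str]:
    """List the validation measures this test result did not meet."""
    failures = []
    if not result.backup_usable:
        failures.append("backup not usable")
    if result.data_loss > result.rpo_target:
        failures.append("RPO missed")
    if result.restore_time > result.rto_target:
        failures.append("RTO missed")
    if not result.data_verified:
        failures.append("data or integration check failed")
    return failures


# Hypothetical result from a DR drill.
drill = RestoreTestResult(
    service="order_entry",
    rpo_target=timedelta(hours=4),
    data_loss=timedelta(hours=2),
    rto_target=timedelta(hours=24),
    restore_time=timedelta(hours=30),
    backup_usable=True,
    data_verified=True,
)

print(drill.service, failed_measures(drill) or "all measures met")
```

An empty list means the test met every measure; any entries point to the parts of the DR plan that need revision before the next test cycle.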
5 Summary
Like most processes, DR processes are a closed loop of plan > build > test > review with action. Continuous improvement and maturity of these processes are obtained through the regular execution of DR tests, measuring results, and then revising the DR plan as necessary. Stakeholder involvement with setting requirements is critical to the success of DR processes.
6 Important Skills
- Change management
- Enterprise and business architecture
- Risk management
- Configuration management
- Testing and validation
7 Standards
ISO/IEC 24765:2009 Systems and software engineering
8 Related and Informing Disciplines
- Business continuity management
- Application life-cycle management
- Risk management
- Change management
- Enterprise and business architecture
- Configuration management
- Testing and validation
9 Disaster Recovery Plan Template
Here is an example template for a disaster recovery plan.
- Introduction
- Scope of the plan
- Objectives — RTO, RPO
- Authority
- Distribution
- Disaster declaration process
- Plan review
- Recovery
- Recovery team
- Recovery plan
- Disaster preparation
- Recovery tasks (short term and long term)
- Backup
- Backup strategy for each critical system
- Contact information
- Facilities information
- Recovery team information
- Other important business contacts
10 References
[1] Systems and Software Engineering Vocabulary. (2009). ISO/IEC 24765
[2] ISACA. (n.d.). http://www.isaca.org/Pages/Glossary.aspx
[3] ITIL Service Catalogue: How to produce a Service Catalogue; http://www.itilnews.com/ITIL_Service_Catalogue_How_to_produce_a_Service_Catalogue.html
[4] Introduction to the ITIL Service Lifecycle, Second Edition, Office of Government Commerce, 2010
[5] Dwight Kayto, Defining IT Services, Art of Change; http://www.artofchange.ca/images/documents/defining%20it%20services.pdf
[6] British Computing Society, BCS Delivery mission critical system, 2011; http://www.bcs.org/content/conWebDoc/43139 ; http://www.downloads.xdelta.co.uk/2011/2011_07_19-bcs-mission_critical-colin_butcher.pdf
[7] Realtech, Application Lifecycle Management, Diagram; http://www.realtech.com/wInternational/software/solutions/application-lifecycle-management/application-lifecycle-managementW3DnavanchorW262110100.php