Operations and Support

From EITBOK
Revision as of 22:22, 26 May 2016 by Cwalrad

Note: This wiki is a work in progress, and may contain missing content, errors, or duplication.


1 Introduction

In enterprise information technology (EIT), operations and support coordinate and carry out the activities and processes required to deliver and manage services provided by technology assets to business users and customers, at agreed levels. A commonly used framework for operations activities is ITIL®, which is proprietary. Many IT service management terms have been redefined in ISO/IEC 20000, so it is important to be consistent with the source your organization prefers.

This knowledge area is all about maintaining an operation normal state in EIT environments, meaning that systems and processes execute according to expectations, with no surprises to users. Operation normal states are disrupted in two ways:

  1. Through proper channels via transition. Operation normal is suspended during a change, and then when the change is implemented, there is a new operation normal state. Changes can come via retirement of an asset, or from transition via construction, acquisition, or sustainment activities.
  2. Through incidents (ITIL), which interrupt the normal operational services. Incidents can be resolved with or without creating a problem (ITIL) to be handled through sustainment or strategy and governance. Incident resolution relies heavily on disaster preparedness.

There are three main elements involved in this function: assets (what), users (who), and services (how). Each of these has an inventory, catalog, or user list. Each of these also needs an overall function to manage and monitor the performance, regardless of asset, service, or user details. This chapter addresses how that overall function works.

Figure 1. Assets, Services, and Users

Overall management of assets, services, and users is composed of two main parts: operations (assets and services) and support (users).

The operations function manages the assets that deliver services to business processes via technology, including:

  • Technical administration — managing the assets and their connections within the organization
  • Operations monitoring — monitoring the assets as they function and provide services to the organization
  • Access management — providing access to users upon request or via linking to another system that manages access
  • Application certification — testing applications to ensure they operate successfully in the environment
  • Incident management — providing leadership and resolution when operational systems halt
  • Problem management — providing assistance when operational systems are not running optimally because of a defect, or because of a workaround put in place to resolve an incident

The support function manages the users, and may include technology that provides the following functions:

  • User management — providing the ability to manage user access to services and applications
  • Help desk — providing a service where users can talk to a person who can help them with their request, troubleshoot issues, and create trouble tickets, which, after analysis, are determined to be defect reports or change requests
  • Self-service support — providing a service where users perform selected support functions on their own

Although these processes and functions are associated with operations, most processes and functions have activities that take place across multiple stages of the service life cycle (see service operations).

This chapter covers topics in the Service Operation (SO) column, as shown in the ITIL Core Topics table below.

Service Strategy (SS)
  • Service management
  • Service life cycle
  • Service assets and value creation
  • Service provider types and structures
  • Strategy, markets, and offerings
  • Financial management
  • Service portfolio management
  • Demand management
  • Organizational design, culture, and development
  • Sourcing strategy
  • Service automation and interfaces
  • Strategy tools
  • Challenges and risks

Service Design (SD)
  • Balanced design
  • Requirements, drivers, activities, and constraints
  • Service-oriented architecture
  • Business management service
  • SD models
  • Service catalogue management
  • Service level management
  • Capacity and availability
  • EIT service continuity
  • Information security
  • Supplier management
  • Data and information management
  • Application management
  • Roles and tools
  • Business impact analysis
  • Challenges and risks
  • SD package
  • Service acceptance criteria
  • Documentation
  • Environmental issues
  • Process maturity framework

Service Transition (ST)
  • Goals, principles, policies, context, roles, and models
  • Planning and support
  • Change management
  • Service asset and configuration management
  • Release and deployment
  • Service validation and testing
  • Evaluation
  • Knowledge management
  • Managing communication and commitment
  • Stakeholder management
  • Configuration management system
  • Staged introduction
  • Challenges and risks
  • Asset types

Service Operation (SO)
  • Balance in SO
  • Operational health
  • Communication
  • Documentation
  • Events, incidents, and problems
  • Request fulfillment
  • Access management
  • Monitoring and control
  • Infrastructure and service management
  • Facilities and datacenter management
  • Information and physical security
  • Service desk
  • Technical, EIT operations, and application management
  • Roles, responsibilities, and organizational structures
  • Technology support to SO
  • Managing change, projects, and risk
  • Challenges
  • Complementary guidance

Continual Service Improvement (CSI)
  • Goals, methods, and techniques
  • Organizational change
  • Ownership
  • Drivers
  • Service-level management
  • Service management
  • Knowledge management
  • Benchmarks
  • Models, standards, and quality
  • CSI seven-step improvement process
  • Return on investment (ROI) and business issues
  • Roles
  • Authority matrix (RACI)
  • Support tools
  • Implementation
  • Governance
  • Communications
  • Challenges and risks
  • Innovation, correction, and improvement
  • Best practices supporting CSI

2 Goals and Principles

The main goals of operations and support are to maximize the use of resources and minimize the negative impact of changes to the environment. All organizations have an EIT environment, a sort of ecosystem where things do tasks for people. In EIT terms, this translates to assets providing services for users. The goals listed below illustrate the translation.

  1. Goal: Maintain an accurate inventory of all system assets, and users of those assets.
    • Output: What things and which people require support.
    • Translation: What assets and which users require support.
  2. Goal: Maintain an accurate catalog of all system services and maintenance needs, and corresponding user expectations.
    • Output: What tasks each thing performs, which people expect those tasks to happen, and what support each thing needs to keep functioning over time.
    • Translation: What services each asset performs, which users expect those services to happen, and what support each asset needs to keep functioning over time.
  3. Goal: Implement processes and procedures that efficiently and effectively monitor and support the assets, services, and users according to their needs and expectations.
    • Output: What each thing, task, and person is actually doing.
    • Translation: What each asset, service, and user is actually doing.
  4. Goal: Implement processes and procedures that manage requests for changes to the assets, services, and users.
    • Output: What actions manage changes to the things, tasks, and people, so the business processes continue operating properly.
    • Translation: What actions manage changes to the assets, services, and users, so the business processes continue operating properly.

In order to know if these goals are being achieved, attributes of each should be measured quantitatively and objectively.

2.1 Guiding Principles

DO

  • DO have standards for technology and data, in order to minimize confusion and rework, and reduce the opportunity for technology- or task-specific positions or services.
  • DO share resources where possible and practical.
  • DO make sure that backups and restores work by testing periodically.
  • DO make sure that all changes are properly approved, and scheduled to be implemented to cause the least disruption to business processes.
  • DO have procedures documented for when incidents occur that will guide the efforts to resolve the incident and restore services as quickly as possible.

DO NOT

  • DO NOT panic when incidents occur.
  • DO NOT allow emergency changes that have not been tested to enter your environment.
  • DO NOT allow changes to systems (patches) that do not use the normal application interface without very high level clearance. If necessary, make a backup of the system before running any patch.

3 Context Diagram

Figure 2. Context Diagram for Operations and Support

4 Asset Management

What cannot be measured cannot be managed. The first requirement for any system is knowledge of the system itself. It is critical that a centralized or federated authority exists that is responsible for asset management within the organization, ideally with a separate budget to handle on-going operational asset needs outside of projects. Manage assets to enable business processes to occur predictably.

Many organizations have suffered from project-only EIT organizations, letting hardware and software age out of support due to lack of funding. Operations and support organizations have a stake in ensuring that assets are up to date, and that maintenance assistance from system suppliers (either vendors or in-house developers) is available for emergencies. See the Sustainment chapter for more on this important topic.

Some organizations may outsource support; even then, the organization has heavily invested in the assets and depends on the services, so it follows that knowing the asset and service inventory is critical, regardless of what organization provides the support.

4.1 Asset and Facility Administration

There are four main categories of assets. The support teams that manage those assets vary depending on the asset types and the organization’s needs.

4.1.1 Types of Assets

There are four main types of assets to manage: facilities, hardware, software, and data. The first three types exist in order to create and use the fourth. Thus, the ultimate goal of EIT operations is to provide the correct data to the business processes and people needing the data.

  • Hardware and Facilities
    • Hardware is the physical machinery installed in a facility, which enables software to run and data to be stored and moved. This includes (but is not limited to) desktop/laptop workstations, servers (individual and farms), disks, disk arrays, solid state storage, modems, routers, network cards and appliances, backup media, and wiring. These types of assets can break down over time, and need to be repaired or replaced.
    • Facilities contain and protect the hardware, and include buildings, security systems, environment conditioning, and uninterruptible power supplies. These types of assets can break down over time, and need to be repaired or replaced. Facilities can also become insufficient for ongoing needs and require expansion, either in place, or with the addition of other facilities, which then requires network connections to join the facilities. Or, company needs might change and facility contents might need to be moved to new facilities.
  • Software is the collection of digital assets that are loaded onto the machinery to provide a service. These include (but are not limited to) operating systems, utilities (backup, restore, compression, search, etc.), daemons, drivers, compilers and kernels, email and calendar applications, office productivity applications, hosted third-party applications, software development environments, database management systems, database access systems, network connection software, user interfaces, browsers, browser interfaces, client/server applications, reporting and analysis applications, and asset management applications. These types of assets do not break down as they are used. However, they are patched or upgraded based on new hardware capabilities or programming changes. Such changes, although intended to improve the software, can cause the software to change its behavior for the worse, or to completely fail.
    Software does not always install and execute perfectly with no input. Most software requires some input or change to operate as expected. There are three main types of adjustments done to software: configuration, customization, and maintenance.
    • Configuration is the process of selecting or entering values for parameters the software needs to be installed and execute in the environment. Configuration values can be entered manually, or stored in files for the application to use. No programming in the software is changed in configuration.
    • Customization is the process of changing a vendor’s source code to change the functionality of software. Customization can be performed by the vendor owning the software, or by the user if they have a software development toolkit (SDK) or the actual source code. One risk with customization is that changes implemented this way can be overwritten and lost when performing maintenance using vendor-supplied files.
    • Maintenance is the process of applying changes to applications via files provided by the suppliers.
    Comparison of configuration, customization, and maintenance, by feature:
    • Frequency
      • Configuration: Installation; based on analysis or requirement changes
      • Customization: Based on development team release schedules
      • Maintenance: Based on maintenance schedules; vendor release schedule?
    • Changes to supplier-delivered files
      • Configuration: No change to source code or executable files
      • Customization: Changes to source code and executable files. NOTE: Customization can make installing patches or upgrades from the supplier difficult or impossible to do without losing the customization.
      • Maintenance: Changes to source code and executable files by overwriting them with new files created and provided by the supplier
    • Limits
      • Configuration: Changes execution in pre-determined ways
      • Customization: No change limits
      • Maintenance: Changes execution in pre-determined ways
    • Implementation
      • Configuration: Application data entry screens (wizards); local file with defined structure (config files)
      • Customization: Software development toolkits (SDK); source code modification and re-compilation
      • Maintenance: Follow supplier-supplied instructions; follow supplier-supplied update application
    • Content
      • Configuration: Parameters (data); selected options
      • Customization: Code (instructions)
      • Maintenance: Executable files; scripts
    • Skills required
      • Configuration: Ability to answer questions or follow directions
      • Customization: Familiarity with software used to create the application
      • Maintenance: Ability to answer questions or follow directions
  • Data — This type of asset is typically overlooked. This is the content that all the applications use, and the hardware stores. Data is loaded and stored on hardware, and moved over networks. Data is read, written, copied, manipulated, displayed, added, deleted, backed up, restored, archived, and sent via many protocols between systems. By itself, this type of asset does not break down over time, although it can become corrupted. Instead, the detail becomes less relevant as time passes. This asset is managed through policies determining what kind of data needs to be available in specified timeframes, and when data should be archived (put on slower/older/less expensive access media) and purged (erased from media permanently).
    Never patch data through the database management system alone! All data changes should occur through an application interface to reduce the chance of corruption.
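The archive-and-purge policy described for data can be sketched as a simple age-based rule. This is only an illustration under stated assumptions: the `RetentionPolicy` class, its field names, and the example thresholds are hypothetical and do not come from any particular tool or standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical policy: ages (in days) at which data is archived or purged."""
    archive_after_days: int
    purge_after_days: int


def retention_action(last_used: date, today: date, policy: RetentionPolicy) -> str:
    """Return 'keep', 'archive', or 'purge' for a data set based on its age."""
    age_days = (today - last_used).days
    if age_days >= policy.purge_after_days:
        return "purge"    # erase from media permanently
    if age_days >= policy.archive_after_days:
        return "archive"  # move to slower, less expensive media
    return "keep"         # leave on primary storage


# Example thresholds are illustrative only.
policy = RetentionPolicy(archive_after_days=365, purge_after_days=7 * 365)
today = date(2016, 5, 26)
print(retention_action(today - timedelta(days=30), today, policy))
print(retention_action(today - timedelta(days=400), today, policy))
```

In practice the thresholds would differ per kind of data, as set by the organization's policies.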

4.1.2 Types of Support Teams

There are four main types of support team organizations: in-house local, embedded, in-house remote, and offshore. Larger organizations may have sub-teams of each type, or combinations of these types, depending on assets, business needs, and costs.

  • In-house Local — In-house local support has the support personnel co-located with the assets, usually in a datacenter. These support staff are part of an EIT organization. Business users may or may not be in the same location, depending on the organization.
    Support teams are local to the assets, so incidents involving hardware can be resolved sooner by eliminating travel time for remote staff to get to the site.
  • Embedded Support — Embedded support means EIT support staff is co-located with the business unit (BU) directly, not with the asset location. Having the support team embedded in the business unit makes support convenient from the business side; however, from the EIT side, the main drawback is the possibility of resistance to EIT standards. If there are enterprise standards, co-location with the business unit can bring extra pressure to ignore standards in favor of business needs.
    Shadow EIT is a common term for BU-owned technical support and development. These organizational structures grow for a variety of reasons. The main drawbacks are below.
    • Asset life-cycle maintenance — Routine maintenance activities may not be performed consistently, if at all, resulting in assets going out of support over time, even though license and maintenance costs are still being incurred.
    • Future support — If in the future, support needs to come from EIT rather than the BU, non-standard assets may be more difficult or expensive to support, or must be converted to use standard assets, which may be costly.
    • Skills — Technical support staff that is part of a business unit may not receive the same technical training and cross training that an EIT support organization would provide. It may be that someone in the BU learned on the job instead of having formal technical training. When an incident (see Support Incidents and Recovery) occurs (something unexpected happens or an action performed without technical review has unintended consequences), they may not be prepared to resolve it without formal EIT assistance.
  • In-house Remote — EIT organizations provide in-house remote support outside of the corporate headquarters or business offices. Users contact the support center to report issues and receive support services. Some travel between the remote location, the datacenter, and the user locations occurs when necessary. This support model can be useful if there is little need for personal contact with the assets (such as when using a vendor-owned cloud implementation), so the support employees can work from home, or at a separate support center location.
  • Offshore — Offshore support occurs in another country. Many companies specialize in remote technical support. Additionally, offshore support can occur during work hours in that location, which could be overnight in the datacenter.

4.1.3 Understanding Asset Support Requirements

Each organization is different, and the support needs for all assets in the environment are different as well. Therefore, an analysis of needs must be carried out before looking for solutions (technical or otherwise).

  1. Priority — The importance of an asset is different for each organization, depending on the industry and the maturity of the organization. There may be several installations of the same asset in an environment (such as in a server farm); each instance is treated as a separate asset. This includes production environments, and any failover, testing, sandbox, or development environments that EIT is responsible for. These questions may help guide your asset support requirements analysis.
    • Which assets are mission critical?
    • Which assets are critical for teams to be effective?
    • Which assets can be non-functional in an emergency?
    • How long can each asset be non-functional during an incident, without significantly affecting normal business processes?
  2. Type — Each organization has different types of assets that need support. Below are the two ends of the levels of asset management.
    1. If the organization uses cloud computing, they may only need monitoring of applications and data, as the supplier handles all the rest of the monitoring needs according to contractual agreements.
    2. In a traditional EIT organization with an internal datacenter, asset types include machines, networks, and building features.
  3. Documentation — Each organization requires a different level of detail in documentation to ensure adequate support of the assets. In mature organizations, part of the turnover exercise is a review of all relevant documentation with the support team. Below are the two ends of the levels of documentation.
    1. HIGH — If the organization has an internal support staff with great expertise and the systems are not supplier supported (that is, they are home grown or highly customized), the level of detail required may be high enough to include information that allows staff to make significant changes to the internal workings of the asset, for example, software development toolkits (SDK) or parts lists.
    2. LOW — If the organization has an agreement for support with an off-site vendor, the level of detail may be low — for example, a chart of the escalation process with support phone numbers for the vendor and minimal instructions for issue troubleshooting.
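The answers to the priority, type, and documentation questions above can be recorded per asset instance and used to decide what to restore first. The following is a minimal sketch, not a prescribed data model: the `AssetRequirement` class, its fields, and the sample assets are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class AssetRequirement:
    """Hypothetical record of the support requirements for one asset instance."""
    name: str
    asset_type: str            # e.g. "hardware", "software", "data", "facility"
    mission_critical: bool
    max_downtime_hours: float  # how long the asset can be non-functional
    documentation_level: str   # "HIGH" or "LOW"


def recovery_order(assets):
    """Sort assets so mission-critical ones with the least tolerable downtime come first."""
    return sorted(assets, key=lambda a: (not a.mission_critical, a.max_downtime_hours))


# Sample data is invented for illustration.
assets = [
    AssetRequirement("test-sandbox", "software", False, 72.0, "LOW"),
    AssetRequirement("order-database", "data", True, 1.0, "HIGH"),
    AssetRequirement("mail-server", "hardware", True, 4.0, "LOW"),
]
for asset in recovery_order(assets):
    print(asset.name, asset.max_downtime_hours)
```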

4.2 Asset Support Processes

Support requirements include determining the processes needed to provide adequate support for the enterprise’s technology assets and users. There are three main types of support processes for assets: monitoring, auditing, and scheduling. The support process requirements need to be defined before looking for mechanisms (technical or otherwise) to support or automate part of the processes; in fact, they provide information that can be used as requirements in tool selection later.

  1. Monitoring — The kinds of processes needed to view the execution of the services operating on or using the assets vary by organization. Below are the two ends of the ranges of monitoring.
    1. Simple monitoring and alerting — The systems are simple and only kick out alarms when certain thresholds are met or exceeded.
    2. Complex monitoring and real-time management — The systems are complex and interdependent, so support team members may need to actually see messages or monitoring screens updating in order to keep the system operating within acceptable parameters.
  2. Auditing — The level and frequency of auditing or review corresponds to the maturity of the EIT organization and the industry requirements. This may be particularly important for certain organizations dealing with healthcare or finances, or parts of organizations dealing with human resources information.
    1. Simple annual review — The review of processes occurs annually as a checkpoint for owner or shareholder reports. There are no regulatory or legal requirements for reviews.
    2. Complex periodic auditing — This includes:
      1. Schedules for each type of audit to occur
      2. Published certifications of compliance with laws, regulations, and internal process rules
      3. Acknowledgements from employees that all required processes are executed correctly
      4. Evidence of acknowledgement from customers that they understand their rights
  3. Scheduling — Support schedules vary by each organization’s needs. There are two main support schedules: monitoring and maintenance. An optional schedule can exist for system audits (see above point 2b).
    1. Monitoring — Different types of monitoring can be required depending on the time of day, day of the week, or the day of the month or year. All monitoring should have instructions for what situations to look for, and what to do in those situations, such as who to inform if a situation arises, or a list of expected anomalies and how to handle them. Also, monitoring can be automated to a point of only sending alerts when expected thresholds are crossed.
    2. Maintenance windows — Includes standard and emergency maintenance preferences (such as normal nightly, and during lunchtime in an emergency to reduce work impact when possible). Each maintenance session should be planned with a minimum and maximum task limit, to prevent unnecessary shutdowns for tasks that can be completed at a later time (not enough work to justify the window). Each window should have a list of the processes, and the assets they run on, that need attention during that window (such as monthly cleanup of temporary storage areas). How many can be scheduled? How much of the system needs to be offline during this window? If one part is offline, what happens to systems downstream from or linked to that part? When are the preferred times for emergency maintenance if there is a choice?
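The simple monitoring case, where alerts are sent only when thresholds are met or exceeded and each alert names who to inform, can be sketched as follows. The `Threshold` class, the metric names, the limits, and the contacts are hypothetical examples, not the interface of any real monitoring product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alert rule: alert when a metric meets or exceeds `limit`."""
    metric: str
    limit: float
    contact: str  # who to inform when the threshold is crossed


def check_thresholds(readings: Dict[str, float], rules: List[Threshold]) -> List[str]:
    """Return one alert message per rule whose threshold is met or exceeded."""
    alerts = []
    for rule in rules:
        value = readings.get(rule.metric)  # None if the metric was not reported
        if value is not None and value >= rule.limit:
            alerts.append(
                f"ALERT {rule.metric}={value} (limit {rule.limit}): notify {rule.contact}"
            )
    return alerts


# Illustrative rules and readings only.
rules = [
    Threshold("disk_used_percent", 90.0, "storage on-call"),
    Threshold("cpu_percent", 95.0, "server on-call"),
]
readings = {"disk_used_percent": 93.5, "cpu_percent": 40.0}
for message in check_thresholds(readings, rules):
    print(message)
```

Keeping the contact in the rule itself is one way to keep the "who to inform" instruction next to the condition it applies to.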

4.2.1 Asset Support Levels

Define requirements for how to structure the support staff to handle the varying levels of support the assets require, especially depending on the support schedule. The monitoring staff would call for support when a system is not performing as expected or within defined thresholds.

  1. Support hierarchy — Most organizations have multiple levels of support, with increasing technical expertise. Larger organizations can provide support for a majority of issues with cheaper staff in the lower levels.
    1. Self-service (lowest) — This level of support consists of documentation websites or interactive voice response (IVR) systems with instructions for resolving common situations.
    2. Service representatives — This level of support has a person with access to scripts or instruction sets for common problems, and possibly some familiarity with the systems, who can talk through basic troubleshooting steps with a caller. This level of support is sometimes called Tier 1.
    3. Technical support — This level of support has familiarity with the systems and can work with the caller using their expertise and access to monitoring tools. This level of support is sometimes called Tier 2. In larger organizations, there may be more than one technical support layer, up to and including the development organization.
    4. On-site technician — This level of support is the highest level, and sends a technician to the site when all lower levels have failed to resolve the caller’s issue. This level of support is sometimes called Tier 3.
  2. Escalation — In organizations with multiple support levels, a process for moving an issue to a higher level is necessary, to minimize the time needed to resolve the issue effectively.
    1. When to escalate:
      1. All options at the current support level have been exhausted without being able to successfully resolve the issue.
      2. The issue is on a list of known issues that require a higher level of support expertise.
      3. Upon request from the caller.
      4. Upon request from the support staff.
    2. How to escalate:
      1. Call in the next higher level directly using a defined process, such as using an on-call pager or phone number.
      2. Reassign the issue in a ticket management system to the next level in the hierarchy for that system.
  3. Specialization — In larger organizations, support organizations can be specialized and dedicated to certain asset types (hardware vs. software vs. network) or systems/applications. The customer base can support the cost of dedicated staff maintaining a single or small number of systems.
  4. Generalization — Specialization is often not financially feasible in a smaller shop where the EIT staff covers all business systems and often has infrastructure support duties as well. In this case, it is important to assign at least one primary support person to each system as the go-to person for support. A secondary EIT staff person can be the primary’s backup for vacation and sick days. Cross-training staff, so that the primary for one system is the backup for another system, can spread the responsibility and work effectively.
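The support hierarchy and the "when to escalate" conditions above can be expressed as a small decision rule. This is a sketch under assumptions: the `Tier` names and the function signatures are invented here, and a real ticket management system would have its own model.

```python
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    """Support levels from lowest to highest, as listed in the hierarchy above."""
    SELF_SERVICE = 0
    SERVICE_REP = 1   # sometimes called Tier 1
    TECH_SUPPORT = 2  # sometimes called Tier 2
    ON_SITE = 3       # sometimes called Tier 3


def should_escalate(options_exhausted: bool, known_higher_level_issue: bool,
                    caller_requested: bool, staff_requested: bool) -> bool:
    """True when any of the four 'when to escalate' conditions holds."""
    return (options_exhausted or known_higher_level_issue
            or caller_requested or staff_requested)


def next_tier(current: Tier) -> Optional[Tier]:
    """Return the next higher tier, or None if already at the highest level."""
    if current == Tier.ON_SITE:
        return None
    return Tier(current + 1)


if should_escalate(True, False, False, False):
    print(next_tier(Tier.SERVICE_REP).name)
```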

4.2.2 Asset Support Tools

Support tools come in two categories: asset management (what exists) and asset monitoring (what occurs). Some toolsets may cover both categories. As the business is at least tangentially funding operations and support, there should be a business view of the assets and services provided. Only communicating with the business in technical terms may not be helpful or understandable when notifying business users of outages or the need for upgrades.

Figure 3. Business View of Operations

  • Asset Management Tools — Many software packages exist to manage asset information. The hard work lies in populating the package’s database with the organization’s asset catalog, and maintaining the catalog over time. No matter how comprehensive a tool is, it is useless until it has the organization’s data loaded.
    Each tool may have one or more sweet spots in the following areas: depth of monitoring (system detail), breadth of monitoring (system and network scans), or user interface (information display and reporting/graphical output).
    1. Asset catalog detail — Some packages include options for managing data for each asset (hardware, software, services, and data) such as configuration settings, license and compliance information, contract and contact information, cost and depreciation tracking, requisition and procurement information, and support agreement information. Some of this data must be entered or uploaded as it does not exist on the asset itself.
      Some packages also allow for business views of the services offered, so assets can be tied to business functions. Each EIT service uses one or more assets (or components), and should map to one or more business processes, which then feeds into importance rankings and service level agreement parameters for each EIT service.
    2. Asset scanning capabilities — Some packages focus on gathering information from the systems themselves, although there is usually some sort of processing cost while it runs. This can be a way to quickly get accurate asset data into the tool’s database.
      While system scans can replace some data entry effort up front, there is a lot of work to do making sure that the tool has the proper access to all the systems in order to run the scans, and each tool may not work with every system component or network connection in the organization. In those cases, some other tool can be used to dump data into a format that the asset management tool can import, but of course, that is a manual process with risks of errors or omissions.
      When the assets have been entered into the tool’s database, review the data to ensure that it meets expectations and that the tool has read the system data correctly. Some configuration is required to set up the frequency of updates and re-scan functions, as well as notification and alert thresholds and contacts.
    3. User interface and reporting/analysis — Some packages focus on the ability to track and manage support issues for asset items, with analysis for trends in incidents or problems with certain assets. These tools commonly also support tracking change requests and enhancement requests. Users may also report issues with assets that need to be resolved either via an enhancement change request or through sustainment. There should be a process for notifying the appropriate teams of relevant issues. See the Change initiatives and Sustainment chapters for more on this topic.
      User interfaces for lookup and reporting can either be installed clients or web-based. Client-installed versions can enable data entry and other auditable activities for authorized users, and may have more intuitive user interfaces for interacting with the tool in order to enter or modify asset data. However, web-based interfaces are much easier to deploy and manage for large numbers of users.
      Some packages include options for impact and dependency analysis to help determine if a change would have adverse or unintended effects, as well as alarms and notifications for license expiration or version updates from vendors.
  • Asset Monitoring Tools
    • Monitoring capabilities — Some packages focus on monitoring components in great detail. However, a tool may support that level of detail for only a limited list of systems.
    • User interface — Some packages focus on the ability to display monitoring data in useful formats, in both reports and in real-time on scrolling or continually updating monitors.
      These may also have more intuitive user interfaces for interacting with the tool in order to enter or modify system data or schedules.

4.3 Asset Support Incidents and Recovery

Support organizations exist not only to keep the systems operating normally, but to handle exceptions to normal operations — direct handling for small known exceptions, and calling another team within the organization, or calling in outside support from vendors for more serious incidents or unknown exception situations.

4.3.1 Incidents versus Problems

Incidents (as defined in ITIL) are interruptions in normal processing on production systems. If a scheduled process does not start when expected, unexpectedly stops while executing, or executes longer than expected and somehow prevents something else from executing at all, it is an incident. Incidents causing incidents in other processes are considered related. ITIL also includes as incidents the failure of an asset that has not yet caused an interruption.

It is possible to resolve an incident with a workaround, but the underlying cause needs to be recorded as a problem.

Problems are situations where processes are executing, but not as normally expected, and the difference is temporarily tolerable. Problems become either defect reports or enhancement requests submitted to the change request queue. Identification as a defect report means that the change request alerts those responsible for defect correction and patch release planning. Identification as an enhancement request means that the change request alerts those responsible for planning changes for future releases of the product.

Incidents can result in problems, but problems are not incidents. One way to understand the difference is to consider a traffic accident.

  1. A jack-knifed semi-truck across all lanes of traffic is an incident. Traffic is stopped.
  2. Incident management responders (police, firefighters, tow truck) are all called. Traffic is routed to the shoulder (workaround), so the incident is resolved. Traffic movement is restored, but not as much or as fast as normally expected.
  3. There are now two problems.
    1. The first problem is that some lanes are still blocked. Getting the truck moved to the side helps, and removing the truck entirely restores traffic to the pre-incident state.
    2. The second problem is the reason why the truck jack-knifed in the first place — the weather, the road, the traffic, the truck, the driver, or some combination.
      1. If caused by the weather (bad conditions), then resolve this by communicating that road conditions are bad while salting the road (preventing future incidents).
      2. If caused by the road, patch the bad area and schedule a repair.
      3. If the level of traffic or the traffic pattern is the reason, examine how the traffic deteriorated to cause the accident, and recommend changes to the traffic pattern, such as adding turn signals or turn lanes.
      4. If the truck is the reason, fix the truck.
      5. If the driver is the reason, educate the driver.

Incident management is the process of restoring normal operations as quickly as possible with minimal business impact. This is an emergency, where getting processes working now is more important than making the system permanently stable. A problem occurred, causing an incident.

Problem management is the process of removing causes of incidents that have occurred, or preventing potential incidents from occurring in the first place. Some problems may be so rare that a permanent remedy would cost more than the incidents they cause, so problem management also includes prioritization. Problems turn into change requests submitted to a change request queue for evaluation (see Change Request Queue Management).
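
The relationship between incidents and problems can be sketched as a small data model. This is a minimal illustration only, not a prescribed ITIL schema: the class names, fields, and the `resolve_with_workaround` helper are all hypothetical, and the sample data reuses the traffic analogy above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ProblemKind(Enum):
    """How a problem is routed in the change request queue."""
    DEFECT = "defect report"
    ENHANCEMENT = "enhancement request"


@dataclass
class Problem:
    summary: str
    kind: ProblemKind
    related_incident_ids: List[int] = field(default_factory=list)


@dataclass
class Incident:
    incident_id: int
    summary: str
    workaround: Optional[str] = None
    resolved: bool = False


def resolve_with_workaround(incident: Incident, workaround: str,
                            cause: str, kind: ProblemKind) -> Problem:
    """Resolve the incident with a workaround and record the cause as a problem."""
    incident.workaround = workaround
    incident.resolved = True
    return Problem(summary=cause, kind=kind,
                   related_incident_ids=[incident.incident_id])


# Hypothetical example based on the traffic analogy.
incident = Incident(1, "Jack-knifed truck blocks all lanes")
problem = resolve_with_workaround(
    incident,
    workaround="Route traffic to the shoulder",
    cause="Damaged road surface",
    kind=ProblemKind.DEFECT,
)
```

The incident closes as soon as the workaround is in place, while the problem record stays open and keeps a link back to the incident that exposed it.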

4.3.2 Incident Notification or Detection

Incidents are detected automatically by monitoring processes, or are reported by users. Support teams may have standard email addresses or phone numbers for support calls. A survey taken by Everbridge [1] includes the following findings.

  • Organizations averaged 150 EIT incidents a year, each taking an average of 2.25 hours to resolve.
  • Reporting outages by phone alone is insufficient.
  • One third of respondents reported difficulty quickly connecting to the right on-call support person to handle the outage.
  • Almost unanimously, respondents reported incidents of late or no response from the assigned support team.
  • Of the five most reported incident types (hardware failure, application outage, datacenter network outage or performance issues, or connectivity issues between sites and offices), network connectivity of some kind seems to be the most common issue. It follows that the survey also reported that network administrators were the most common type of support staff pulled into incident resolution efforts.
  • Two thirds said that incidents have caused staff personal stress or increased workloads.

Mature organizations have incident reporting processes, and automated responses for every step in the process of resolving the incident. These reporting mechanisms can be as simple as an email sent to a team mailbox, or as sophisticated as using a standard ticket assignment tool. Both of these can provide logs of the issue’s history and tracking for future reference.

It is very important to keep the person reporting the incident updated, as well as anyone affected by the outage. Lack of information leads to a bad impression of the support team, no matter how well they resolve the problem.

4.3.3 Incident Resolution

By nature, incidents are unpredictable, and therefore uncomfortable for all involved. Mature organizations plan for scheduling a good mix of skilled resources to be available at all times, and funding for staff enhancement when needed during incidents so that support staff members are not overwhelmed during a major incident.

Link known incidents to standard remedies to aid support staff in efficient resolution of certain expected situations. Below are examples from the two ends of the range of incidents:

  1. Minor incident — Process does not execute as expected. For example, an expected file is late in arriving from a supplier, preventing a process from executing. The problem is a file arrival issue.
    1. If the cause is that there was no data to send, then the incident is resolved by removing the requirement for the file for this one instance. The problem is that there may be no data for multiple execution instances, requiring a resolution that can automatically handle the situation without causing an incident, such as having the supplier send an empty file or having the application's wait time out.
    2. If the cause is that there was a delay in processing on the supplier's side, then the incident is resolved only when the file arrives. The problem is then that the SLA between the supplier and the application owner is not being met, requiring a resolution of some kind (revisiting the SLA and adjusting the schedule, removing the dependency on that file arriving on time, or some other solution).
    3. If this incident occurs frequently, the problem importance can be elevated to a higher priority for resolution.
  2. Major incident — A bug fix patch, upgrade, or enhancement implementation experienced an unexpected issue, causing an unexpected system restart. All processing stops without notice.
    1. Automatic restart — If the system processing restarts automatically, there could be problems with the data, depending on where in the processing the halt occurred, and how robust the processing is for handling exceptions like unexpected restarts. It could be that incidents or problems occur for processes and data unrelated to the upgrade or enhancement due to co-existence on the same system.
    2. Unsuccessful restart — If the system processing does not successfully restart, or the system does not restart at all, the incident continues until either a restore occurs from a backup taken prior to the change, or the system restarts successfully after some other action.
  3. Major incident — Power outage occurs in the datacenter (see the Disaster Recovery chapter for more on this topic).

Ideally, each asset has documentation on normal operations expectations, remedial steps for known exception occurrences, and a guide for troubleshooting when unexpected exceptions to normal operations occur.
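
Linking known incidents to standard remedies can be as simple as a lookup table with a fallback to the troubleshooting guide. The sketch below is illustrative only: the incident type keys, the remedy wording, and the `find_remedy` function are assumptions made for this example, not part of any particular tool.

```python
# Hypothetical catalog of known incident types and their standard remedies.
KNOWN_REMEDIES = {
    "supplier_file_late": (
        "Confirm with the supplier whether there is data to send; "
        "if not, waive the file requirement for this run."
    ),
    "unexpected_restart": (
        "Check data integrity; if the restart failed, restore from the "
        "backup taken before the change."
    ),
}

# Fallback used when no standard remedy is on file.
TROUBLESHOOTING_GUIDE = (
    "No standard remedy on file; follow the asset's troubleshooting guide."
)


def find_remedy(incident_type: str) -> str:
    """Return the standard remedy for a known incident type, else the fallback."""
    return KNOWN_REMEDIES.get(incident_type, TROUBLESHOOTING_GUIDE)
```

Calling `find_remedy("supplier_file_late")` returns the documented steps, while an unlisted type such as `"power_outage"` falls through to the troubleshooting guidance.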

4.3.4 Resolution Notification

In most organizations, whenever an incident occurs, all users affected by the incident receive a notice that includes what happened, what actions are occurring to resolve the incident, and an expected time of return to normal operations.

It is important to communicate with users at least every hour during an incident, even if the communication is only that work continues to determine how to restore operations as quickly as possible. Assign one person to handle user notifications during the incident to prevent confusing or conflicting messages to the users. Some service level agreements can specify the frequency of communications during an incident (see Service Level Agreements).

When the incident is resolved, communicate the status to the users as well as the next steps for determining the cause and resolution of any problems detected.
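
The hourly communication rule can be checked mechanically. The following sketch assumes a one-hour default interval taken from the guidance above; the function names and the sample timestamps are hypothetical, and an SLA may specify a different interval.

```python
from datetime import datetime, timedelta

# Default cadence from the guidance above; an SLA may override it.
UPDATE_INTERVAL = timedelta(hours=1)


def next_update_due(last_update: datetime,
                    interval: timedelta = UPDATE_INTERVAL) -> datetime:
    """Return the time by which the next user notification should be sent."""
    return last_update + interval


def update_overdue(last_update: datetime, now: datetime,
                   interval: timedelta = UPDATE_INTERVAL) -> bool:
    """Return True when the next notification is due or already late."""
    return now >= next_update_due(last_update, interval)


# Hypothetical incident: last notice sent at 09:00.
last_sent = datetime(2016, 5, 26, 9, 0)
```

The person assigned to user notifications could run a check like this on a timer so that a status message goes out even when there is no new information.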

4.3.5 Closed-loop Analysis

Closed-loop processes provide the data needed for process improvement. Monitoring of assets is not sufficient by itself; monitoring of support instances and resolutions is necessary to identify trends, and then develop improved remediation actions for frequently occurring issues. To improve resolution processes, it is essential to track the disposition of each incident and problem. For example, there must be a repository and a process that records whether each incident or problem resulted in a change (stating why and where the change was made), whether it was deferred and marked for future action, or was dismissed.

Asset support teams can track the frequency of issues, and reduce time spent resolving issues in several ways:

  1. Reduce the time spent for successful resolution.
    1. Develop and deploy better self-service instructions to reduce the Tier 1 calls.
    2. Modify the Tier 1 scripts to identify if the issue needs escalation immediately, rather than trying other solutions first.
    3. Modify the Tier 1 scripts to use the optimal solution, based on analysis of prior instances.
  2. Reduce the frequency of the issue.
    1. Create a problem ticket or enhancement request requiring a system change that prevents the problem from occurring.
    2. Create a problem ticket or enhancement request that handles the issue automatically, avoiding support calls.
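
A closed-loop report needs, at minimum, the frequency of each issue, the time spent resolving it, and the disposition of each ticket. The sketch below shows one way to derive those figures; the ticket tuples, category names, and the `summarize` function are invented for illustration and do not come from a real ticketing system.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical ticket history: (issue category, minutes to resolve, disposition).
tickets = [
    ("password_reset", 5, "dismissed"),
    ("password_reset", 7, "dismissed"),
    ("file_late", 90, "changed"),
    ("password_reset", 6, "deferred"),
]


def summarize(ticket_rows):
    """Return (issue, (count, mean minutes)) pairs, most frequent issue first."""
    minutes_by_issue = defaultdict(list)
    for issue, minutes, _disposition in ticket_rows:
        minutes_by_issue[issue].append(minutes)
    summary = {
        issue: (len(values), mean(values))
        for issue, values in minutes_by_issue.items()
    }
    return sorted(summary.items(), key=lambda item: item[1][0], reverse=True)


# Frequency and resolution time per issue.
report = summarize(tickets)

# How many tickets ended in each disposition.
dispositions = Counter(disposition for _, _, disposition in tickets)
```

Frequent, quick issues at the top of the report are candidates for self-service instructions, while rare but slow issues may justify a problem ticket.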

5 User Management

Users use assets. Asset management enables business processes to occur predictably. Therefore, EIT must proactively manage users to enable controlled access and prevent chaos. Depending on the organization, funding for user support may come from EIT as a provided service; otherwise, funding comes from the departments on a usage basis.

Always communicate with the business users in business terms to ensure clear understanding, as most business users do not know or use technical terms or understand technical details.

5.1 User Access and Security Administration

Users require access to systems and applications to do work. Systems and applications require information to authenticate users and enable access to authorized activity. Both sides need to agree. The most common method is to have profiles for users, and security authentication on the system side.

5.1.1 User Profiles

Profiles are collections of users with similar or identical qualities. Create user profiles by job or department, by usage patterns, by level of experience, by location, and so on. There are three common ways to assign profiles to users: user-based, job-based, and role-based.

  1. User-based — Each user has an individual profile with assigned rights and privileges to various systems. Mass changes are difficult and time consuming, but this method has the most flexibility.
  2. Job-based — Each defined job has rights and privileges assigned to it. Link users to jobs so they can inherit the rights and privileges. When job responsibilities change, all users linked to that job inherit the changes. Mass changes are easier, but some changes may not apply, and this method requires a link to a system maintaining the job definitions.
  3. Role-based — Each system has defined roles with access rights to certain parts of the system or application. Some systems may have identical roles, so one role can share access to multiple applications for the same rights. Roles attach to users, and if job functions or access needs change, roles are attached to or detached from the users. Mass changes are simply adding or removing users from roles.
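
The role-based approach can be sketched with two mappings: roles to permissions, and users to roles. This is a minimal in-memory illustration; the role names, permission strings, and user name are hypothetical, and a real deployment would use a directory or identity management system.

```python
from collections import defaultdict

# Hypothetical roles and the access rights each one grants.
role_permissions = {
    "ap_clerk": {"invoices:read", "invoices:create"},
    "ap_manager": {"invoices:read", "invoices:approve"},
}

# Roles currently attached to each user.
user_roles = defaultdict(set)


def attach_role(user: str, role: str) -> None:
    """Attach a defined role to a user."""
    if role not in role_permissions:
        raise KeyError(f"Undefined role: {role}")
    user_roles[user].add(role)


def detach_role(user: str, role: str) -> None:
    """Detach a role from a user; no error if it was not attached."""
    user_roles[user].discard(role)


def permissions_for(user: str) -> set:
    """Return the union of permissions from all of the user's roles."""
    granted = set()
    for role in user_roles.get(user, set()):
        granted |= role_permissions[role]
    return granted


# Example: a user holds the clerk role and is then promoted to manager.
attach_role("dana", "ap_clerk")
attach_role("dana", "ap_manager")
detach_role("dana", "ap_clerk")
```

Because permissions live on the role rather than on the user, a mass change is a matter of editing one role definition or attaching and detaching roles.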

5.1.1.1 Onboarding and Offboarding

When a user enters the organization, create a user profile. When a user leaves the organization, promptly disable their profile to prevent any possible negative actions.

5.1.1.2 Multiple Profiles

Some users may require multiple profiles for different situations, such as one for normal work and another for when performing maintenance or recovery operations.

5.1.2 User Access Security

Applications have access points where users can connect to the system. There are commonly three parts to enabling a secure connection between a user and a system:

  1. Authentication — Each access request must verify that the user has access to the system. This can be a user ID and password, a certificate, a token, or a notification or pass-through from another security system like Active Directory or LDAP.
  2. Authority — Each user has authority to perform functions within the application. Functions can be enabled or disabled based on the user’s assigned authority.
  3. Auditing or accounting — Every access request to the application is logged for review, regardless of success.
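
The three parts above can be combined into a single access check. The sketch below is an assumption-laden illustration: the in-memory token and authority tables, the user names, and the `request_access` function are hypothetical, and a production system would delegate authentication to a service such as Active Directory or LDAP rather than store tokens in code.

```python
import hmac
import logging

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("access_audit")

# Hypothetical stores for illustration only; never keep real secrets in code.
API_TOKENS = {"dana": "s3cr3t-token"}
AUTHORITY = {"dana": {"view_report"}}


def request_access(user: str, token: str, function: str) -> bool:
    """Authenticate the user, check authority for the function, and log the attempt."""
    expected = API_TOKENS.get(user)
    # Authentication: constant-time comparison of the presented token.
    authenticated = expected is not None and hmac.compare_digest(expected, token)
    # Authority: the function must be enabled for this user.
    authorized = authenticated and function in AUTHORITY.get(user, set())
    # Auditing: every request is logged, regardless of success.
    audit_log.info(
        "user=%s function=%s authenticated=%s authorized=%s",
        user, function, authenticated, authorized,
    )
    return authorized
```

Note that the audit entry is written before the result is returned, so failed attempts are recorded as well as successful ones.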

5.1.3 Types of User Support Teams

  • Embedded support — Embedded support means EIT support staff is co-located directly with the business unit. Having the support team embedded in the business unit makes support convenient from the business side; however, from the EIT side, the main drawback is the possibility of resistance to EIT standards, such as using a ticketing system consistently. If there are enterprise standards, co-location with the business unit can bring extra pressure to ignore standards in favor of business needs.
  • In-house remote — EIT organizations provide this type of support in a separate location from the users. Users contact the support center to report issues and receive support services. Some travel between the remote location, the datacenter, and the user locations occurs when necessary.
  • Offshore — Offshore support occurs in another country. Many companies specialize in remote user support.

5.1.4 User Support Requirements

Each organization is different, and the needs for all users in the organization are different as well. Needs analysis for user support is fundamental to hiring the right people and selecting the right tools.

  1. Task complexity — Organizations vary in how complex their support tasks are. Some user support tasks may be simple and frequent, like password resets. Other tasks may require specialized access from certain teams to accomplish. These questions may help guide your requirements design.
    • Which support tasks are common?
    • Which support tasks are deliverable through self-service functions?
    • Which support tasks can be disabled during an emergency?
    • What is the expected task turnaround time?
  2. User skill level — Each organization varies in the types of users needing support. Below are the two ends of the levels of user support.
    1. Some users are very familiar with the applications and can perform some support tasks themselves.
    2. Some users are very uncomfortable with computers and may need additional attention to make them comfortable and successful.

5.2 User Support Processes

There are three main types of support processes for users: help desk, self-service portals, and change request queues. Each of these support processes requires monitoring and auditing to ensure proper performance.

  1. User support monitoring — Help desks are monitored through side-by-side call monitoring, or after-call reviews. Monitoring ensures compliance with standard policies and that standard activities are executed properly.
    Self-service portals have logs that can be monitored.
    Change request queues are monitored by operations. Requests to change features or report defects come from users. See Change Request Queue Management.
  2. User support auditing — The level and frequency of auditing or review correspond to the maturity of the EIT organization and the industry requirements. Auditing may be especially important for organizations dealing with healthcare or finances, or for parts of organizations dealing with human resources information.

5.2.1 Help Desk Support Levels

Define requirements for how to structure the support staff to handle the varying levels of support the users require, especially depending on the support schedule. The help desk staff would call for support when a system is not performing as expected or within defined thresholds.

  1. Hierarchy — Most organizations have multiple levels of support, each increasing in authority. Larger organizations can provide support for a majority of issues with cheaper staff in the lower levels.
    1. Service representatives — This level of support has a person with access to scripts or instruction sets for common problems, and possibly some familiarity with the systems, who can talk through basic troubleshooting steps with a caller. This level of support is sometimes called Tier 1.
    2. Technical support — This level of support has familiarity with the systems and user support tasks. They can work with the caller using their expertise and access to additional functions and tools. In larger organizations, there may be more than one technical support layer, up to and including the development organization. This level of support is sometimes called Tier 2.
    3. On-site technician — This level of support is the highest level, and is dispatched to the site when all lower levels have failed to resolve the caller’s issue. This level of support is sometimes called Tier 3.
  2. Escalation — In organizations with multiple support levels, a process for moving an issue to a higher level is necessary to minimize the time spent resolving the issue.
    1. When to escalate:
      1. All options at a support level have been exhausted without being able to successfully resolve the issue.
      2. The issue is on a list of known issues that require a higher level of support expertise.
      3. Upon request from the caller.
      4. Upon request from the help desk staff.
    2. How to escalate:
      1. Call in the next higher level directly using a defined process, such as using an on-call pager or phone number.
      2. Reassign the issue in a ticket management system to the next level in the hierarchy for that system.
  3. Specialization — In larger organizations, support organizations can be specialized to certain subject areas or applications.
    1. Security access — Specialized teams evaluate requests for privileged access.
    2. Application deployment — Specialized teams deploy applications to users upon request.

Some organizations may outsource user support; however, the organization is heavily invested in the users and depends on the services, so it follows that knowing the user profiles is critical, regardless of where the support is provided and by what organization.

5.2.2 User Self Service Support

Examples of self-service functions are password resets or standardized software downloads and installations. Common self-service functions occur through interactive voice response (IVR) systems or websites where users request the support service without support staff involvement.

5.2.3 User Support Tools

Support tools come in two categories: user management (what exists) and help desk monitoring (what occurs). Some toolsets may cover both categories. As the business is at least tangentially funding operations and support, there should be a business view of the users and services consumed.

5.2.3.1 User Management Tools

Many software packages exist to manage user information and profile assignments. The hard work lies in populating the package’s database with the organization’s users, and maintaining the catalog over time. No matter how comprehensive a tool is, it is useless until it has the organization’s data loaded, and becomes useless over time if the data becomes stale.

Each tool may have one or more sweet spots in the following areas: profile management (user detail), remote connectivity to user workstations, or user interface and reporting (information display and/or reporting/graphical output).

  • Profile management — Some packages include options for managing data for each user and profile, such as permission and authorization settings, contact information, and support agreement information. While there may be some useful templates or default profiles included, an organization’s user data must be entered or uploaded into the tool’s repository.
  • Remote connectivity — Being able to log directly into the user’s workstation enables support staff to see what the user sees to determine and resolve the user’s issue.
  • User interface and reporting/analysis — Some packages focus on the ability to track and manage user support requests, with analysis for trends in incidents or problems with certain assets. These tools commonly also support tracking change requests and enhancement requests. Users may also report issues with assets that need to be resolved through either an enhancement via a change request or as a sustainment issue. See the Change initiatives and Sustainment chapters for more on this topic.

5.2.3.2 Help Desk Monitoring Tools

  • Monitoring capabilities — Many tools handle monitoring help desk or call center activity. This type of monitoring is not specific to EIT operations and support help desks.
  • User interface — Some packages focus on the ability to display monitoring data in useful formats, in both reports and in real-time on scrolling or continually updating monitors.

6 Service and Operational Level Agreements

Service level agreements (SLAs) and operational level agreements (OLAs) are contracts between teams that set expectations for how the teams cooperate with each other, and what each team is responsible for providing to the other team.

SLAs contain expectations between a service provider and users consuming the services. An example of an SLA is an agreement between a business unit and IT regarding desktop services.

OLAs contain expectations between service providers who support systems providing services. An example of an OLA is an agreement between network operations, security, and desktop services to manage virtual private network (VPN) software and access for workstations. The figure below illustrates the difference.

Some service providers may not always have direct interfaces with users, and therefore no SLAs.

SLAsOLAs.jpg

Figure 4. SLAs and OLAs

6.1 Service Level Agreements

There are two main types of SLAs: generic and negotiated.

  • Generic SLAs are created when there is no specific relationship between the provider and the users. Sometimes the users are not part of the organization (external) or are on large teams where negotiation would be impractical. In these cases, the service provider alone may create a generic SLA as a baseline. Users that find the generic SLA unsatisfactory can negotiate with the service provider to create an SLA for their team specific to their needs.
  • Negotiated SLAs are between providers and users to set expectations for each team’s deliverables to the other teams, what services are provided, the availability for those services, notifications of service interruptions, and if appropriate, rewards for compliance and/or penalties for non-compliance.

SLAs include multiple sections describing the agreement, the participants, and any required actions. Following are the most common sections included in SLAs.

6.1.1 Definitions

Each SLA needs to name and define the teams and services involved in the agreement. Any SLA negotiations should start with definitions. Each SLA should also have a version identifier and effective dates to reflect changes over time.

6.1.2 Services and Service Levels

SLAs do not need to be specific to only one service. For negotiated SLAs, include all services that the users consume and that the service providers in the agreement support. For generic SLAs created by providers, include all of the providers' available services.

For each service, include the following information:

  • Name of the service — the common name and any other terms used.
  • Benefit or use — what the service provides to the organization and users.
  • Cost or fee structure — costs for the service, if any. Some services involve purchasing or leasing hardware or software, while some services involve labor charges for specific activities. Some organizations use chargeback calculations where the business unit pays for EIT services through internal budget accounts.
  • Availability — when the service is available for access and usage. Monitor compliance and report this information to the users periodically.
  • Performance requirements — how long the service takes from request to response, how many events can be processed at once, how many users can access the system at once, etc. Monitor compliance and report this information to the users periodically.
  • Turnaround time for requests — when the users make service requests, how long it takes the provider team to respond (response time), and how long it takes to resolve (resolution time). Monitor compliance and report this information to the users periodically.
  • Scheduled maintenance windows — when maintenance activities can occur on a regular basis and not adversely affect the users. This section can also include preferred time for unscheduled maintenance windows in non-emergency situations.
  • (Optional) Scope — identification of systems in scope (involved in or related to the service), or outside of scope, is useful in situations involving multiple locations, providers, or external dependencies; some instances may have different SLAs. There is no need for much detail here — technical details such as specific server names or software versions do not belong in SLAs.
  • (Optional) Performance incentives and penalties — Some SLAs also include penalties for non-compliance or rewards for high levels of compliance. This is most common in SLAs with external users, and is best codified in a legal contract.
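
Availability compliance is usually reported as the percentage of scheduled service time during which the service was actually available. The sketch below shows that arithmetic; the function names, the 30-day month, and the target values are assumptions chosen for illustration, and each SLA defines its own measurement rules.

```python
def availability_percent(scheduled_minutes: float, downtime_minutes: float) -> float:
    """Return availability as a percentage of scheduled service time."""
    if scheduled_minutes <= 0:
        raise ValueError("scheduled_minutes must be positive")
    return 100.0 * (scheduled_minutes - downtime_minutes) / scheduled_minutes


def meets_target(scheduled_minutes: float, downtime_minutes: float,
                 target_percent: float) -> bool:
    """Return True when measured availability meets or exceeds the SLA target."""
    return availability_percent(scheduled_minutes, downtime_minutes) >= target_percent


# Hypothetical month: 30 days of scheduled service with 432 minutes of downtime.
SCHEDULED = 30 * 24 * 60  # 43,200 minutes
monthly_availability = availability_percent(SCHEDULED, 432)
```

With these figures the service was available 99 percent of the time, which would meet a 99.0 percent target but miss a 99.5 percent target.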

6.1.3 Standard Notifications

Service providers send standard notifications to users regarding the services. Each notification is one of four types: maintenance window reminders, incident and resolution notifications, enhancement or upgrade notifications, and metrics reports. The SLA should include content and timing guidelines for communications to users according to the situation.

  • Maintenance window reminders should include the time and duration of the window and what services are available or unavailable (whichever list is shorter). If the maintenance window is on a standard schedule, the reminder can be sent infrequently, or only to new users. If the maintenance window is not on the standard schedule, reminders should be sent one day, three hours, one hour, and fifteen minutes in advance, and another reminder should be sent when the maintenance window is concluded and all systems are back online.
  • Incident notifications should include incident information, including the systems affected and current resolution activities. The initial notification should be sent as soon as the incident is verified, with the time of discovery as well as when the incident started. While the incident is being researched, notices should be sent as specified in the SLA: every 15 to 30 minutes for critical systems, and less frequently for non-critical systems. When the incident is resolved, a final notification should be sent describing the resolution and any further actions necessary.
  • Enhancement or upgrade notifications should be sent when anything affecting services covered by the SLA is put on a schedule for implementation, and should include how the users will be affected. If possible, notifications for changes requiring user training should be sent far enough ahead of time for the users to sign up for and receive the training on the new functionality before implementation. Reminders should be sent one day before the change is implemented, and another afterwards, including training information.
  • Metrics reports should be sent as defined in the SLA to users who have opted in to getting the metrics reports. Some users may not be interested in these notifications and should be able to opt out as part of their user profile. Most metrics reports are generated monthly, and include data on service availability and performance including issues, and user requests including counts and resolution times.
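
The reminder schedule for a non-standard maintenance window can be computed from the window's start time. The sketch below uses the lead times listed above (one day, three hours, one hour, and fifteen minutes); the function name and the sample window are hypothetical.

```python
from datetime import datetime, timedelta

# Lead times for reminders before a non-standard maintenance window.
REMINDER_LEAD_TIMES = [
    timedelta(days=1),
    timedelta(hours=3),
    timedelta(hours=1),
    timedelta(minutes=15),
]


def reminder_times(window_start: datetime) -> list:
    """Return the send times for each advance reminder, earliest first."""
    return [window_start - lead for lead in REMINDER_LEAD_TIMES]


# Hypothetical window starting at 22:00 on 28 May 2016.
window = datetime(2016, 5, 28, 22, 0)
schedule = reminder_times(window)
```

The final "all systems back online" notice is not included here because it depends on when the window actually concludes rather than on the planned start.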

6.1.4 User Responsibilities

Users (identified or not) have certain responsibilities, even in generic SLAs. Users are mostly responsible for:

  • Requesting services, such as access to a system
  • Reporting issues with service performance

Users may also be responsible for evaluating service metrics to validate that the services perform according to expectations, whether for separate compliance reporting or for other internal audit reasons.

List and describe the responsibilities the user accepts in the SLA, and the process for submitting requests or reporting issues.

6.1.5 Provider Responsibilities

Providers have the primary responsibility of making the services available to the users. This may include aspects of asset management, user management, or both.

Providers have secondary responsibilities of responding to issue reports and user requests. Every communication from the users should immediately generate a response from the provider stating the request will be evaluated as soon as possible. Use of a ticketing system can provide tracking for both the original communication from the users and any responses or actions taken by the provider. Issue reports and user requests need different responses.

List and describe the responsibilities the provider accepts in the SLA. Include expected turnaround times for issue reports and user requests.

6.1.5.1 Responding to User Requests

Responding to user requests includes research, and one or more of the following actions:

  • Completion of the access or other configuration request, followed by notification of successful completion
  • Notification of issues with the user request

6.1.5.2 Responding to Issue Reports

Responding to issue reports includes research that results in one or more of the following actions:

  • Notification that the issue was unrepeatable, or an anomaly that should not recur
  • Notification that:
    • A permanent solution to resolve the issue or permanently prevent recurrence was identified and routed for consideration to the appropriate teams
    • The recommended solution was submitted to change request management
  • Notification that:
    • A workaround was identified, including instructions to use until a permanent solution can be identified and implemented
    • The issue was submitted to change request management for evaluation

6.1.6 Escalation Procedures

When users report issues, the provider must investigate and respond. If the user has an issue with the response (or lack thereof), the SLA must cover how users can escalate the issue with the provider, including the support hierarchy involved. Additionally, if the provider must escalate an issue with the user, that process and escalation hierarchy should be included in the SLA.
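
One way to make an escalation hierarchy unambiguous is to write it down as data. The sketch below assumes a time-based trigger, where an issue moves up one level after it has gone unresolved for a set period. The thresholds and the first two role names are hypothetical; an actual SLA may instead let users escalate on request.

```python
from datetime import timedelta
from typing import List, Tuple

# Hypothetical hierarchy: (role, unresolved time after which this role is contacted).
ESCALATION_PATH: List[Tuple[str, timedelta]] = [
    ("Service Desk", timedelta(0)),
    ("Support Team Lead", timedelta(hours=4)),
    ("Service Level Manager", timedelta(hours=24)),
    ("IT Operations Manager", timedelta(hours=72)),
]


def escalation_contact(unresolved_for: timedelta) -> str:
    """Return the highest role whose threshold the waiting time has reached."""
    contact = ESCALATION_PATH[0][0]
    for role, threshold in ESCALATION_PATH:
        if unresolved_for >= threshold:
            contact = role
    return contact
```

For example, `escalation_contact(timedelta(hours=30))` returns the Service Level Manager under these assumed thresholds.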

6.1.7 SLA Review Cycle

There should be a standard process for review described in the SLA. This review can occur on a regular basis, such as annually, or be triggered by an event, such as multiple performance issues or prolonged failure to comply with performance or availability requirements. The result of a review triggered by an event does not always generate a change to the SLA; instead, it may generate a project to upgrade or increase the resources on the systems providing the service.

If the SLA changes, then the version should be updated.
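
The two kinds of review trigger can be expressed as a single check. This is only a sketch: the function name, the 365-day interval, and the threshold of three breaches are assumptions for illustration, and each agreement should define its own values.

```python
from datetime import date, timedelta


def review_due(
    last_review: date,
    today: date,
    breaches_since_review: int,
    review_interval_days: int = 365,   # assumed annual cycle
    breach_threshold: int = 3,         # assumed meaning of "multiple" issues
) -> bool:
    """Return True if a scheduled or an event-triggered review is due."""
    scheduled = today - last_review >= timedelta(days=review_interval_days)
    event_triggered = breaches_since_review >= breach_threshold
    return scheduled or event_triggered
```

A review that comes due this way does not by itself imply a change to the agreement; the outcome may be a resourcing project instead.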

6.1.8 Approvals

The SLA must be approved by the appropriate leaders of the teams involved, and should be recorded in a document repository.

6.2 Operational Level Agreements

OLAs are between multiple technical teams (participants), and therefore have different contents and cover different areas. OLAs are negotiated between participating teams, and set expectations for each team’s deliverables to the other teams, including services provided, the availability for those services, and notifications of service interruptions. Some service providers only have agreements with other service providers, and so do not participate in SLAs with users.

OLAs include multiple sections describing the agreement, the participants, and any required actions. Following are the most common sections included in OLAs.

6.2.1 Definitions

Each OLA needs to name and define the teams and services involved in the agreement. Any OLA negotiations should start with definitions. Each OLA should also have a version identifier and effective dates to reflect changes over time.

6.2.2 Services and Service Levels

Include all services that the participating teams provide to or consume from each other.

For each service, include the following information:

  • Name of the service: the common name and any other terms used.
  • Benefit or use: what the service provides to the organization and the participant teams.
  • Availability for the service: when the service is available for access and usage. Monitor compliance and report this information to the OLA participants periodically.
  • Performance requirements for the service: how long from request to response, how many events can be processed at once, how many users can access the system at once, etc. Monitor compliance and report this information to the OLA participants periodically.
  • Turnaround time for requests: when a participant team makes a request, how long it takes the appropriate team to respond (response time), and how long it takes to resolve the request (resolution time). Monitor compliance and report this information to the OLA participants periodically.
  • Scheduled maintenance windows: when maintenance activities can occur on a regular basis and not adversely affect the other participants. Some maintenance windows could be used by multiple participants simultaneously to minimize overall downtime. This section can also include preferred times for unscheduled maintenance windows in non-emergency situations.
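
The service attributes listed above map naturally onto a small record, and availability compliance reduces to simple arithmetic. The following is a hedged sketch: the `ServiceLevel` fields and function names are invented for this example, and it assumes that scheduled maintenance windows are already excluded from the scheduled minutes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    availability_target: float      # fraction of scheduled time, e.g. 0.995
    response_time_hours: float      # time allowed to respond to a request
    resolution_time_hours: float    # time allowed to resolve a request


def measured_availability(scheduled_minutes: int, outage_minutes: int) -> float:
    """Fraction of scheduled service time during which the service was up.

    Assumes maintenance windows are not counted in scheduled_minutes.
    """
    if scheduled_minutes <= 0:
        raise ValueError("scheduled_minutes must be positive")
    return (scheduled_minutes - outage_minutes) / scheduled_minutes


def is_compliant(level: ServiceLevel, scheduled_minutes: int, outage_minutes: int) -> bool:
    """Compare measured availability with the target agreed in the OLA."""
    return measured_availability(scheduled_minutes, outage_minutes) >= level.availability_target
```

With a 99.5 percent target over a 30-day month (43,200 minutes), 100 minutes of unplanned outage is compliant and 300 minutes is not.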

6.2.3 Standard Notifications

Participants send standard notifications about the services to the other participants. Each notification is one of four types: maintenance window reminders, incident and resolution notifications, enhancement or upgrade notifications, and metrics reports. The OLA should include content and timing guidelines for communications to participants according to the situation.

  • Maintenance window reminders should include the time and duration of the window and what services are available or unavailable (whichever list is shorter). If the maintenance window is not on the standard schedule, reminders should be sent one day, three hours, one hour, and fifteen minutes in advance, and another reminder should be sent when the maintenance window is concluded and all systems are back online.
  • Incident notifications should include incident information, including the systems and participant teams affected and the current resolution activities. The initial notification should be sent as soon as the incident is verified, and should state the time of discovery as well as when the incident started. While the incident is being researched, notices should be sent as specified in the OLA: every 15 to 30 minutes for critical systems, and less frequently for non-critical systems. When the incident is resolved, a final notification should be sent describing the resolution and any further actions necessary.
  • Enhancement or upgrade notifications should be sent when anything affecting services covered by the OLA is put on a schedule for implementation, and should include how the participant teams are affected. A reminder should be sent one day before the change is implemented, and another afterwards confirming installation.
  • Metrics reports should be sent as defined in the OLA to participants who have opted in to receiving them. Participants who also have SLAs may use these reports as part of their reports to their users.
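
The timing guidelines above are easy to compute once the start of a maintenance window or the verification time of an incident is known. The sketch below is illustrative only: the function names are invented here, the lead times are those given above for windows outside the standard schedule, and the closing "back online" notice is not modeled.

```python
from datetime import datetime, timedelta
from typing import List

# Lead times for reminders about a maintenance window outside the standard schedule.
REMINDER_LEAD_TIMES = [
    timedelta(days=1),
    timedelta(hours=3),
    timedelta(hours=1),
    timedelta(minutes=15),
]


def reminder_times(window_start: datetime, now: datetime) -> List[datetime]:
    """Return the reminder send times that are still in the future."""
    times = [window_start - lead for lead in REMINDER_LEAD_TIMES]
    return [t for t in times if t > now]


def incident_update_times(
    verified_at: datetime, interval: timedelta, count: int
) -> List[datetime]:
    """Return the first `count` follow-up notice times after the initial notification."""
    return [verified_at + interval * i for i in range(1, count + 1)]
```

Filtering out send times that have already passed means a window announced late still gets every remaining reminder.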

6.2.4 Participant Responsibilities

Participants have the primary responsibility of making their service available to the other participants. This may include aspects of asset management.

Participants have certain responsibilities to the other participants, most commonly for:

  • Reporting issues with service performance
  • Submitting requests for change or enhancement to services
  • Submitting requests for changes to the OLA terms

Participants may also be responsible for evaluating service metrics to validate that the services perform according to expectations, whether for separate compliance reporting or for other internal audit reasons.

List and describe the responsibilities each participant team accepts in the OLA. Include expected turnaround times for issue reports.

6.2.5 OLA Review Cycle

There should be a standard process for review described in the OLA. This review can occur on a regular basis, such as annually, or be triggered by an event, such as multiple performance issues or prolonged failure to comply with performance or availability requirements. The result of a review triggered by an event does not always generate a change to the OLA; instead, it may generate a project to upgrade or increase the resources on the systems providing the service.

If the OLA changes, then the version should be updated.

6.2.6 Approvals

The OLA must be approved by the appropriate leaders of the teams involved, and should be recorded in a document repository.

7 Change Request Queue Management

Change request queues contain requests from multiple areas to correct defects, improve system functionality, or comply with changing external drivers. Most organizations use a ticketing system to record incidents, defect reports, and user requests. These systems can be configured to handle requests for compliance changes required by external drivers. Operations and support teams are logically placed to handle intake of these requests.

There are three main sources for change requests: defects, external drivers, and feature change requests.

  • Defects drive change requests because they degrade operations and need to be removed. Some defects may affect large systems or large parts of smaller systems. Care must be taken to evaluate how frequently the defect occurs and its overall effect on the organization’s operations. Some defects may not be significant enough to be assigned a high priority, and thus may be deferred for later evaluation, although in some cases deferring a defect may mean a greater effect later and a higher cost to remove.
  • External drivers demand changes occur to comply with laws, regulations, or contractual obligations, such as supplier support agreements. Other external drivers may be industry standards, such as for security processing, which are deemed mandatory due to fines or loss of business resulting from non-compliance.
  • Feature change requests are strategic changes from the business side that improve how the organization does business, either by reducing expenses or by increasing revenue.

Change requests go through a lifecycle, starting with the origination of the need for a change. The figure below shows the lifecycle of change requests.

Figure 5. Change Life Cycle

All of these requests are held in a single repository, whether a spreadsheet or a sophisticated tool. As shown above, all requests are categorized and queued after having been evaluated for priority, effort, and effect on the organization. As each request is reviewed, a decision is made to approve it for near-term or immediate action, to defer it, or to reject it. The decisions and their rationales are logged in the request.
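
A single queue with logged decisions can be sketched with the Python standard library. The class and field names below are hypothetical, and the numeric priority scale (1 is highest) is an assumption; the three sources and three decisions follow the text above.

```python
import heapq
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Source(Enum):
    DEFECT = "defect"
    EXTERNAL_DRIVER = "external driver"
    FEATURE = "feature change request"


class Decision(Enum):
    APPROVED = "approved"
    DEFERRED = "deferred"
    REJECTED = "rejected"


@dataclass
class ChangeRequest:
    title: str
    source: Source
    priority: int                  # 1 = highest (assumed scale)
    effort_days: float
    decision: Optional[Decision] = None
    rationale: str = ""


class ChangeQueue:
    """Single repository that hands out requests in priority order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, ChangeRequest]] = []
        self._counter = 0          # tie-breaker: preserves intake order

    def submit(self, request: ChangeRequest) -> None:
        heapq.heappush(self._heap, (request.priority, self._counter, request))
        self._counter += 1

    def next_for_review(self) -> ChangeRequest:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def record_decision(request: ChangeRequest, decision: Decision, rationale: str) -> None:
    """Log the decision and its rationale in the request itself."""
    if not rationale.strip():
        raise ValueError("a rationale must be logged with every decision")
    request.decision = decision
    request.rationale = rationale
```

Requiring a non-empty rationale enforces the logging rule at the point where the decision is made.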

8 Summary

Operations and support is mainly concerned with keeping production systems running as expected. Both assets and users need to be supported, but in different ways. Changes to either assets or users need to be managed appropriately.

9 Key Competence Frameworks

While many large companies have defined their own sets of skills for purposes of talent management (to recruit, retain, and further develop the highest quality staff members that they can find, afford, and hire), the advancement of EIT professionalism will require common definitions of EIT skills that can be used not just across enterprises, but also across countries. We have selected three major sources of skill definitions. While none of them is used universally, they provide a good cross-section of options.

9.1 Skills Framework for the Information Age

The Skills Framework for the Information Age (SFIA) has defined nearly 100 skills. SFIA describes seven levels of competency that can be applied to each skill. Not all skills, however, cover all seven levels. Some reach only partway up the seven-step ladder. Others are based on mastering foundational skills, and start at the fourth or fifth level of competency. SFIA is used in nearly 200 countries, from Britain to South Africa, South America, the Pacific Rim, and the United States. (http://www.sfia-online.org)

SFIA skills have not yet been defined for the security chapter.

9.2 European Competency Framework

The European Union's European Competency Framework (e-CF) has 40 “competencies” and is used in the EU. (http://www.ecompetences.eu/) It uses five levels of competency. As in SFIA, not all skills are subject to all 5 levels. The EU has also created a mapping between e-CF and SFIA.

E-CF skills have not yet been defined for the security chapter.

9.3 i-Competency Dictionary

The Information Technology Promotion Agency in Japan has developed the i-CD, translated it into English, and describes it at https://www.ipa.go.jp/english/humandev/icd.html. It is an extensive skills and tasks database, used in Japan and southeast Asian countries. Like SFIA, it establishes seven levels of competencies for skills. An example showcasing the skills and tasks relevant to this chapter is given below.

10 Key Roles

These roles are common to ITSM:

  • IT Operations Manager
  • Service Level Manager
  • Availability Manager
  • Capacity Manager
  • Incident Manager
  • Problem Manager
  • Change Manager
  • Configuration Manager
  • Release Manager
  • Financial Manager

11 Standards

ANSI/AIAA G-043A-2012e, ANSI/AIAA Guide to the Preparation of Operational Concept Documents

IEEE Std 828™-2012, IEEE Standard for Configuration Management in Systems and Software Engineering

ISO 10004:2012, Quality management - Customer satisfaction - Guidelines for monitoring and measuring

ISO 10007:2003, Quality management systems - Guidelines for configuration management

ISO 18238:2015, Space systems - Closed loop problem solving management

ISO/IEC 16350:2015, Information technology - Systems and software engineering - Application management

ISO/IEC 19770-1:2012, Information technology - Software asset management - Part 1: Processes and tiered assessment of conformance

ISO/IEC 20000-1:2011 (IEEE Std 20000-1:2013), Information technology - Service management - Part 1: Service management system requirements

ISO/IEC 20000-2:2012, Information technology - Service management - Part 2: Guidance on the application of service management systems

ISO/IEC TR 20000-10:2015, Information technology - Service management - Part 10: Concepts and terminology

ISO/IEC TR 20000-11:2015, Information technology - Service management - Part 11: Guidance on the relationship between ISO/IEC 20000-1:2011 and related frameworks: ITIL®

ISO/IEC TR 20000-12, Information technology - IT Service management - Part 12: Guidance on the relationship between ISO/IEC 20000-1:2011 and service management frameworks: CMMI-SVC®

ISO/IEC/IEEE 14764-2006, Software Engineering - Software Life Cycle Processes - Maintenance

ISO/IEC 15939:2007, Systems and software engineering - Measurement process

12 References

[1] http://www.cioinsight.com/it-strategy/infrastructure/slideshows/how-mishandled-it-incidents-spiral-out-of-control.html taken from http://go.everbridge.com/ITCommunicationeBook-web.html.