Operations and Support

From EITBOK

1 Introduction

In Enterprise information technology (EIT), operations and support coordinate and carry out the activities and processes required to deliver and manage services provided by technology assets to business users and customers, at agreed levels. A commonly used framework for operations activities is ITIL®, which is proprietary. ISO/IEC and IEEE standard 20000 is similar in many aspects, and may be used by any organization to develop and manage EIT operations. Many EIT service management terms have been redefined in ISO/IEC 20000, so it is important to be consistent with the source your organization prefers.

This knowledge area is about maintaining an operation normal state in EIT environments, meaning that systems and processes execute according to expectations, with no surprises to users. Operation normal states are disrupted in two ways:

  • Through proper channels during the transition of an asset. Operation normal is suspended during a change, and then when the change is implemented, there is a new operation normal state. Changes can come when an asset is retired, or when construction, acquisition, or sustainment activities are transitioned into operations.
  • Through incidents (ITIL) that interrupt the normal operational services. Incidents can be resolved with or without creating a problem (ITIL) to be handled through sustainment or strategy and governance (see Strategy and Governance). Incident resolution relies heavily on disaster preparedness (see Disaster Preparedness).

There are three main elements involved in this function: assets (what), users (who), and services (how). Each of these has an inventory, catalog, or user list. Each of these also needs an overall function to manage and monitor the performance, regardless of asset, service, or user details. This chapter addresses how that overall function works.

Figure 1. Assets, Services, and Users

Overall management of assets, services, and users is composed of two main parts: operations (assets and services) and support (users).

The operations function manages the assets used to deliver services used by business processes via technology including:

  • Technical administration—managing the assets and their connections within the organization
  • Operations monitoring—monitoring the assets as they function and provide services to the organization
  • Access management—providing access to users upon request or via linking to another system that manages access
  • Application certification—testing applications to ensure they operate successfully in the environment
  • Incident management—providing leadership and resolution when operational systems halt
  • Problem management—providing assistance when operational systems are not running optimally because of a defect, or because of a workaround put in place to resolve an incident

The support function manages the users, and may include technology that provides the following functions:

  • User management—providing the ability to manage user access to services and applications
  • Help desk—providing a service where users can talk to a person who can help them with their request, troubleshoot issues, and create trouble tickets, which after analysis, are determined to be defect reports or change requests
  • Self-service support—providing a service where users perform selected support functions on their own

Although these processes and functions are associated with operations, most processes and functions have activities that take place across multiple stages of the service lifecycle (see Service Operation).

This chapter covers topics in the Service Operation (SO) column, as shown in the ITIL Core Topics table below.

Service Strategy (SS)
  • Service management
  • Service lifecycle
  • Service assets and value creation
  • Service provider types and structures
  • Strategy, markets, and offerings
  • Financial management
  • Service portfolio management
  • Demand management
  • Organizational design, culture, and development
  • Sourcing strategy
  • Service automation and interfaces
  • Strategy tools
  • Challenges and risks

Service Design (SD)
  • Balanced design
  • Requirements, drivers, activities, and constraints
  • Service-oriented architecture
  • Business management service
  • SD models
  • Service catalog management
  • Service level management
  • Capacity and availability
  • EIT service continuity
  • Information security
  • Supplier management
  • Data and information management
  • Application management
  • Roles and tools
  • Business impact analysis
  • Challenges and risks
  • SD package
  • Service acceptance criteria
  • Documentation
  • Environmental issues
  • Process maturity framework

Service Transition (ST)
  • Goals, principles, policies, context, roles, and models
  • Planning and support
  • Change management
  • Service asset and configuration management
  • Release and deployment
  • Service validation and testing
  • Evaluation
  • Knowledge management
  • Managing communication and commitment
  • Stakeholder management
  • Configuration management system
  • Staged introduction
  • Challenges and risks
  • Asset types

Service Operation (SO)
  • Balance in SO
  • Operational health
  • Communication
  • Documentation
  • Events, incidents, and problems
  • Request fulfillment
  • Access management
  • Monitoring and control
  • Infrastructure and service management
  • Facilities and datacenter management
  • Information and physical security
  • Service desk
  • Technical, EIT operations, and application management
  • Roles, responsibilities, and organizational structures
  • Technology support to SO
  • Managing change, projects, and risk
  • Challenges
  • Complementary guidance

Continual Service Improvement (CSI)
  • Goals, methods, and techniques
  • Organizational change
  • Ownership
  • Drivers
  • Service-level management
  • Service management
  • Knowledge management
  • Benchmarks
  • Models, standards, and quality
  • CSI seven-step improvement process
  • Return on investment (ROI) and business issues
  • Roles
  • Authority matrix (RACI)
  • Support tools
  • Implementation
  • Governance
  • Communications
  • Challenges and risks
  • Innovation, correction, and improvement
  • Accepted practices supporting CSI

2 Goals and Principles

2.1 Goals

The main goals of operations and support are to maximize the use of resources and minimize the negative impact of changes to the environment. All organizations have an EIT environment, a sort of ecosystem where things do tasks for people. In EIT terms, assets provide services for users. The table below illustrates the translation.

Goals | Output | Translation
Maintain an accurate inventory of all system assets and users of those assets. | What things and which people require support | What assets and which users require support
Maintain an accurate catalog of all system services and maintenance needs and corresponding user expectations. | What tasks each thing performs, which people expect those tasks to happen, and what support each thing needs to keep functioning over time | What services each asset performs, which users expect those services to happen, and what support each asset needs to keep functioning over time
Implement processes and procedures that efficiently and effectively monitor and support the assets, services, and users according to their needs and expectations. | What each thing, task, and person is actually doing | What each asset, service, and user is actually doing
Implement processes and procedures that manage requests for changes to the assets, services, and users. | What actions manage changes to the things, tasks, and people, so the business processes continue operating properly | What actions manage changes to the assets, services, and users, so the business processes continue operating properly

To know whether these goals are being achieved, attributes of each should be measured quantitatively and objectively.

2.2 Guiding Principles

DO ...

  • Have standards for technology and data that minimize confusion and rework, and reduce the opportunity for technology- or task-specific positions or services.
  • Share resources where possible and practical.
  • Make sure that backups and restores work by testing periodically.
  • Make sure that all changes are properly approved, and scheduled to be implemented to cause the least disruption to business processes.
  • Have procedures documented for when incidents occur that will guide the efforts to resolve the incident and restore services as quickly as possible.

DO NOT ...

  • Panic when incidents occur.
  • Allow emergency changes that have not been tested to enter your environment.
  • Allow changes to systems (patches) that do not use the normal application interface without very high-level clearance. If necessary, make a backup of the system before running any patch.

3 Context Diagram

Figure 2. Context Diagram for Operations and Support

4 Asset Management

What cannot be measured cannot be managed. The first requirement for any system is knowledge of the system itself. It is critical that a centralized or federated authority exists that is responsible for asset management within the organization, ideally with a separate budget to handle on-going operational asset needs outside of projects. Manage assets to enable business processes to occur predictably.

Many organizations have suffered from project-only EIT organizations, letting hardware and software age out of support due to lack of funding. Operations and support organizations have a stake in ensuring that assets are up to date, and that maintenance assistance from system suppliers (either vendors or in-house developers) is available for emergencies. See the Sustainment chapter for more on this important topic.

Some organizations may outsource support; even then, the organization has heavily invested in the assets and depends on the services, so it follows that knowing the asset and service inventory is critical, regardless of what organization provides the support.

4.1 Asset and Facility Administration

There are four main categories of assets. Support teams to manage those assets vary depending on the asset type and the organization's needs.

4.1.1 Types of Assets

There are four main types of assets to manage: facilities, hardware, software, and data. The first three types exist in order to create and use the fourth. Thus, the ultimate goal of EIT operations is to provide the correct data to the business processes and people who need the data.

  • Facilities contain and protect the hardware, and include buildings, security systems, environment conditioning, and uninterruptible power supplies. These types of assets can break down over time, and need to be repaired or replaced. Facilities can also become insufficient for ongoing needs and require expansion, either in place, or with the addition of other facilities, which then requires network connections to join the facilities. Or, company needs might change and facility contents might need to be moved to new facilities.
  • Hardware is the physical machinery installed in a facility, which enables software to run and data to be stored and moved. This includes (but is not limited to) desktop/laptop workstations, servers (individual and farms), disks, disk arrays, solid state storage, modems, routers, network cards and appliances, backup media, and wiring. These types of assets can break down over time, and need to be repaired or replaced.
  • Software is the collection of digital assets that are loaded onto the hardware to provide a service. These include (but are not limited to) operating systems, utilities (backup, restore, compression, search, etc.), daemons, drivers, compilers and kernels, email and calendar applications, office productivity applications, hosted third-party applications, software development environments, database management systems, database access systems, network connection software, user interfaces, browsers, browser interfaces, client/server applications, reporting and analysis applications, and asset-management applications. These types of assets do not break down as they are used. However, they are patched or upgraded based on new hardware capabilities or programming changes. Such changes, although intended to improve the software, can cause the software to change its behavior for the worse, or to completely fail. Software does not always install and execute perfectly with no input. Most software requires some input or change to operate as expected. There are three main types of adjustments done to software: configuration, customization, and maintenance.
    • Configuration is the process of selecting or entering values for parameters that the software needs to be installed and execute in the environment. Configuration values can be entered manually, or stored in files for the application to use. No programming in the software is changed in configuration.
    • Customization is the process of changing a vendor's source code to change the functionality of software. Customization can be performed by the vendor that owns the software, or by the user if they have a software development toolkit (SDK) or the actual source code. One risk with customization is that implemented changes can be overwritten and lost when performing maintenance using vendor-supplied files.
    • Maintenance is the process of applying changes to applications via files provided by the suppliers.
      Feature | Configuration | Customization | Maintenance
      Frequency | Installation; based on analysis or requirement changes | Based on development team release schedules | Based on maintenance schedules; vendor release schedule
      Changes to supplier-delivered files | No change to source code or executable files | Changes to source code and executable files. Note: Customization can make installing patches or upgrades from the supplier difficult or impossible to do without losing the customization. | Changes to source code and executable files by overwriting them with new files created and provided by the supplier
      Limits | Changes execution in pre-determined ways | No change limits | Changes execution in pre-determined ways
      Implementation | Application data entry screens (wizards); local file with defined structure (config files) | Software development toolkits (SDK); source code modification and re-compilation | Follow supplier-supplied instructions; follow supplier-supplied update application
      Content | Parameters (data); selected options | Code (instructions) | Executable files; scripts
      Skills required | Ability to answer questions or follow directions | Familiarity with software used to create the application | Ability to answer questions or follow directions
  • Data is an asset that is typically overlooked. Data is the content that all the applications use. Data is loaded and stored on hardware, and moved over networks. Data is read, written, copied, manipulated, displayed, added, deleted, backed up, restored, archived, and sent via many protocols between systems. By itself, this type of asset does not break down over time, although it can become corrupted. Instead, the detail becomes less relevant as time passes. This asset is managed through policies determining what kind of data needs to be available in specified timeframes, and when data should be archived (put on slower/older/less expensive access media) and purged (erased from media permanently). Never patch data through the database management system alone! All data changes should occur through an application interface to reduce the chance of corruption.
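The archive-and-purge policy described for data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RetentionPolicy` class, the `classify` function, and the day counts are hypothetical names and values invented for the example, and real thresholds come from the organization's own data policies.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical policy: ages (in days) at which data is archived or purged."""
    archive_after_days: int
    purge_after_days: int


def classify(last_used: date, today: date, policy: RetentionPolicy) -> str:
    """Return 'keep', 'archive', or 'purge' for a record based on its age."""
    age_days = (today - last_used).days
    if age_days >= policy.purge_after_days:
        return "purge"      # erase from media permanently
    if age_days >= policy.archive_after_days:
        return "archive"    # move to slower, less expensive media
    return "keep"           # stays on primary storage


# Illustrative values only: archive after one year, purge after seven.
policy = RetentionPolicy(archive_after_days=365, purge_after_days=7 * 365)
today = date(2024, 1, 1)
for last_used in (date(2023, 10, 1), date(2021, 6, 1), date(2015, 1, 1)):
    print(last_used, classify(last_used, today, policy))
```

In keeping with the warning above, a sketch like this would only decide what should happen to each record; the actual move or erase would still go through an application interface rather than directly through the database management system.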

4.1.2 Types of Support Teams

There are four main types of support team organizations: in-house local, embedded, in-house remote, and offshore. Larger organizations may have sub-teams of each type, or combinations of these types, depending on assets, business needs, and costs.

  • In-house local—In-house local support has the support personnel co-located with the assets, usually in a datacenter. This support staff is part of an EIT organization. Business users may or may not be in the same location, depending on the organization. Support teams are local to the assets, so incidents involving hardware can be resolved sooner by eliminating travel time for remote staff to get to the site.
  • Embedded support—Embedded support means EIT support staff is co-located with the business unit (BU) directly, not with the asset location. Having the support team embedded in the business unit makes support convenient from the business side; however, from the EIT side, the main drawback is the possibility of resistance to EIT standards. If there are enterprise standards, co-location with the business unit can bring extra pressure to ignore standards in favor of business needs. Shadow EIT is a common term for BU-owned technical support and development. These organizational structures grow for a variety of reasons. The primary drawbacks are:
    • Asset lifecycle maintenance—Routine maintenance activities may not be performed consistently, if at all, resulting in assets going out of support over time, even though license and maintenance costs are still being incurred.
    • Future support—If in the future, support needs to come from EIT rather than the BU, non-standard assets may be more difficult or expensive to support, or must be converted to use standard assets, which may be costly.
    • Skills—Technical support staff that is part of a business unit may not receive the same technical training and cross training that an EIT support organization would provide. It may be that someone in the BU learned on the job instead of having formal technical training. When an incident occurs (something unexpected happens or an action performed without technical review has unintended consequences), they may not be prepared to resolve it without formal EIT assistance (see Support Incidents and Recovery).
  • In-house remote—EIT organizations provide in-house remote support outside of the corporate headquarters or business offices. Users contact the support center to report issues and receive support services. Some travel between the remote location, the datacenter, and the user locations occurs when necessary. This support model can be useful if there is little need for personal contact with the assets (such as when using a vendor-owned cloud implementation), so the support employees can work from home, or at a separate support center location.
  • Offshore—Offshore support occurs in another country. Many companies specialize in remote technical support. Additionally, offshore support can occur during work hours in that location, which could be overnight in the datacenter.

4.1.3 Understanding Asset Support Requirements

Each organization is different, and the support needs for all assets in the environment are different as well. Therefore, an analysis of needs must be carried out before looking for solutions (technical or otherwise).

  1. Assess priority—The importance of an asset varies for each organization depending on the industry and the maturity of the organization. There may be several installations of the same asset in an environment (such as in a server farm); each instance is treated as a separate asset. This includes production environments, and any failover, testing, sandbox, or development environments that EIT is responsible for. The following questions can help guide asset support requirements analysis:
    • Which assets are mission critical?
    • Which assets are critical for teams to be effective?
    • Which assets can be non-functional in an emergency?
    • During an incident, how long can each asset be non-functional without significantly affecting normal business processes?
  2. Determine asset type—Each organization has different types of assets that need support. Below are the two ends of the asset management spectrum.
    • If the organization uses cloud computing, they may only need monitoring of applications and data, as the supplier handles all the rest of the monitoring needs according to contractual agreements.
    • In a traditional EIT organization with an internal datacenter, asset types include machines, networks, and building features.
  3. Review documentation needs—Each organization requires a different level of detail in documentation to ensure adequate support of the assets. In mature organizations, part of the turnover exercise is a review of all relevant documentation with the support team. Below are the two ends of the documentation spectrum.
    • LOW—If the organization has an agreement for support with an off-site vendor, the level of detail may be low—for example, a chart of the escalation process with support phone numbers for the vendor and minimal instructions for issue troubleshooting.
    • HIGH—If the organization has an internal support staff with great expertise and the systems are not supplier supported (that is, they are home grown or highly customized), the level of detail required may be high enough to include information that allows staff to make significant changes to the internal workings of the asset—for example, software development toolkits (SDK) or parts lists.
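One way to record the answers to the priority questions in step 1 is a simple asset record that can be sorted for support planning. The sketch below is an assumption-laden illustration: the `Asset` fields, the `support_order` function, and the sample assets are all hypothetical, and a real inventory would carry far more detail.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical inventory record; each installed instance is its own asset."""
    name: str
    environment: str          # e.g. "production", "failover", "test"
    mission_critical: bool
    max_outage_hours: float   # how long it can be down without hurting the business


def support_order(assets):
    """Mission-critical assets first, then by least tolerable downtime."""
    return sorted(assets, key=lambda a: (not a.mission_critical, a.max_outage_hours))


# Sample data is invented for illustration.
assets = [
    Asset("intranet wiki", "production", False, 72.0),
    Asset("order database", "production", True, 1.0),
    Asset("order database (failover)", "failover", True, 4.0),
    Asset("test sandbox", "test", False, 168.0),
]

for asset in support_order(assets):
    print(asset.name, asset.max_outage_hours)
```

Sorting on a tuple keeps the rule readable: the first element separates mission-critical assets from the rest, and the second breaks ties by how quickly each asset must be restored.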

4.2 Asset Support Processes

Support requirements include determining the processes needed to provide adequate support for the enterprise's technology assets and users. There are three main types of support processes for assets: monitoring, auditing, and scheduling. The process of defining support process requirements needs to be finished before looking for mechanisms (technical or otherwise) to support or automate part of the processes, and in fact, provides information that can be used as requirements in tool selection later.

  1. Monitoring—The kinds of processes needed to view the execution of the services operating on or using the assets vary by organization. Below are the two ends of the monitoring spectrum.
    • Simple monitoring and alerting—The systems are simple and only trigger alarms when certain thresholds are met or exceeded.
    • Complex monitoring and real-time management—The systems are complex and interdependent, so support team members may need to actually see messages or monitoring screens updating in order to keep the system operating within acceptable parameters.
  2. Auditing—The level and frequency of auditing or review corresponds to the maturity of the EIT organization and the industry requirements. This may be particularly important for certain organizations dealing with healthcare or finances, or parts of organizations dealing with human resources information.
    • Simple annual review—The review of processes occurs annually as a checkpoint for owner or shareholder reports. There are no regulatory or legal requirements for reviews.
    • Complex periodic auditing—This includes:
      • Schedules for each type of audit to occur
      • Published certifications of compliance with laws, regulations, and internal process rules
      • Acknowledgements from employees that all required processes are executed correctly
      • Evidence of acknowledgement from customers that they understand their rights
  3. Scheduling—Support schedules vary by each organization's needs. There are two main support schedules: monitoring and maintenance. An optional schedule can exist for system audits (see item 2, above).
    • Monitoring—Different types of monitoring can be required depending on the time of day, day of the week, or the day of the month or year. All monitoring should have instructions for what situations to look for, and what to do in those situations, such as who to inform if a situation arises, or a list of expected anomalies and how to handle them. Also, monitoring can be automated to a point of only sending alerts when expected thresholds are crossed.
    • Maintenance windows—This includes standard and emergency maintenance preferences (such as nightly for normal maintenance, and during lunchtime in an emergency to reduce work impact when possible). Each planned maintenance session should have a minimum and maximum task limit, so that systems are not shut down for too little work when the tasks could wait for a later window. Each session should list the processes that will run and the assets that need attention during the window (such as monthly cleanup of temporary storage areas). How many can be scheduled? How much of the system needs to be offline during this window? If one part is offline, what happens to systems downstream from or linked to that part? When are the preferred times for emergency maintenance if there is a choice?
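The simple end of the monitoring spectrum, where an alarm is raised only when a threshold is met or exceeded, can be sketched as follows. The metric names, threshold values, and the `check_thresholds` function are hypothetical and chosen only for illustration; a real monitoring tool would supply its own metrics and alert routing.

```python
def check_thresholds(readings, thresholds):
    """Return one alert message per metric that meets or exceeds its threshold.

    Metrics without a configured threshold are ignored.
    """
    alerts = []
    for metric, value in readings.items():
        limit = thresholds.get(metric)
        if limit is not None and value >= limit:
            alerts.append(f"{metric}: {value} >= {limit}")
    return alerts


# Illustrative values only.
thresholds = {"cpu_percent": 90, "disk_percent": 85, "queue_depth": 500}
readings = {"cpu_percent": 97, "disk_percent": 62, "queue_depth": 500, "fan_rpm": 1200}

for alert in check_thresholds(readings, thresholds):
    print("ALERT", alert)
```

Each alert would normally be paired with the documented instructions mentioned above: what the situation means, who to inform, and what to do about it.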

4.2.1 Asset Support Levels

Define requirements for how to structure the support staff to handle the varying levels of support the assets require, especially depending on the support schedule. The monitoring staff would call for support when a system is not performing as expected or within defined thresholds.

  • Support hierarchy—Most organizations have multiple levels of support, with increasing technical expertise. Larger organizations can provide support for the majority of issues with cheaper staff in the lower levels.
    Level | Description
    Self service (lowest) | This level of support consists of documentation websites or interactive voice response (IVR) systems with instructions for resolving common situations.
    Service representatives | This level of support has a person with access to scripts or instruction sets for common problems, and possibly some familiarity with the systems, who can talk through basic troubleshooting steps with a caller. This level of support is sometimes called Tier 1.
    Technical support | This level of support has familiarity with the systems and user support tasks. They can work with the caller using their expertise and access to additional functions and tools. In larger organizations, there may be more than one technical support layer, up to and including the development organization. This level of support is sometimes called Tier 2.
    On-site technician | This level of support is the highest level, and sends a technician to the site when all lower levels have failed to resolve the caller's issue. This level of support is sometimes called Tier 3.
  • Escalation—In organizations with multiple support levels, a process for moving an issue to a higher level is necessary to minimize the time spent resolving the issue.

    When to escalate:

    • All options at the current support level have been exhausted without being able to successfully resolve the issue.
    • The issue is on a list of known issues that require a higher level of support expertise.
    • Upon request from the caller.
    • Upon request from the support staff.

    How to escalate:

    • Call in the next higher level directly using a defined process, such as using an on-call pager or phone number.
    • Reassign the issue in a ticket management system to the next level in the hierarchy for that system.
  • Specialization—In larger organizations, support organizations can be specialized and dedicated to certain asset types (hardware vs. software vs. network) or systems/applications. The customer base can support the cost of dedicated staff maintaining a single or small number of systems.
  • Generalization—Specialization is often not financially feasible in a smaller shop where the EIT staff covers all business systems and often has infrastructure support duties as well. In this case, it is important to assign at least one primary support person to each system as the go-to person for support. A secondary EIT staff person can be the primary's backup for vacation and sick days, and cross-training staff where the primary for one system is the backup for another system can spread the responsibility and work effectively.
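The escalation rules above can be expressed as a small decision function. This is a minimal sketch, not a description of any particular ticket management system: the `Ticket` fields and the `should_escalate` and `escalate` functions are hypothetical, and the level names simply follow the support hierarchy described earlier in this section.

```python
from dataclasses import dataclass
from typing import Optional

# Support levels from lowest to highest, following the support hierarchy.
LEVELS = [
    "self service",
    "service representatives (Tier 1)",
    "technical support (Tier 2)",
    "on-site technician (Tier 3)",
]


@dataclass
class Ticket:
    """Hypothetical ticket record; field names are illustrative."""
    level: int = 1                            # index into LEVELS
    options_exhausted: bool = False           # current level has run out of options
    known_issue_level: Optional[int] = None   # level a listed known issue requires
    caller_requested: bool = False
    staff_requested: bool = False


def should_escalate(ticket: Ticket) -> bool:
    """Apply the four 'when to escalate' conditions."""
    needs_higher = (
        ticket.known_issue_level is not None
        and ticket.known_issue_level > ticket.level
    )
    return (
        ticket.options_exhausted
        or needs_higher
        or ticket.caller_requested
        or ticket.staff_requested
    )


def escalate(ticket: Ticket) -> Ticket:
    """Reassign the ticket to the next level up, if one exists."""
    if should_escalate(ticket) and ticket.level < len(LEVELS) - 1:
        return Ticket(level=ticket.level + 1)
    return ticket


ticket = escalate(Ticket(level=1, options_exhausted=True))
print(LEVELS[ticket.level])
```

The sketch moves one level at a time, which mirrors reassigning a ticket to the next level in the hierarchy; an organization that pages a specific level directly for known issues would adjust `escalate` accordingly.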

4.2.2 Asset Support Tools

Support tools come in two categories: asset management (what exists) and asset monitoring (what occurs). Some toolsets may cover both categories. As the business is at least tangentially funding operations and support, there should be a business view of the assets and services provided. Simply communicating with the business in technical terms may not be helpful or understandable when notifying business users of outages or the need for upgrades.

Figure 3. Business View of Operations

  • Asset management tools—Many software packages manage asset information. The hard work lies in populating the package's database with the organization's asset catalog, and maintaining the catalog over time. No matter how comprehensive a tool is, it is useless until it has the organization's data loaded.
    Each tool may have one or more sweet spots in the following areas: depth of monitoring (system detail), breadth of monitoring (system and network scans), or user interface (information display and reporting/graphical output).
    • Asset detail cataloging—Some packages include options for managing data for each asset (hardware, software, services, and data) such as configuration settings, license and compliance information, contract and contact information, cost and depreciation tracking, requisition and procurement information, and support agreement information. Some of this data must be entered or uploaded as it does not exist on the asset itself.
      Some packages also allow for business views of the services offered, so assets can be tied to business functions. Each EIT service uses one or more assets (or components), and should map to one or more business processes, which then feeds into importance rankings and service level agreement parameters for each EIT service.
    • Asset scanning capabilities—Some packages focus on gathering information from the systems themselves, although there is usually some sort of processing cost while they run. This can be a way to quickly get accurate asset data into the tool's database.
      While system scans can replace some data entry effort up front, there is a lot of work to do making sure that the tool has the proper access to all the systems in order to run the scans, and each tool may not work with every system component or network connection in the organization. In those cases, some other tool can be used to dump data into a format that the asset management tool can import, but of course, that is a manual process with risks of errors or omissions.
      When the assets have been entered into the tool's database, review the data to ensure that it meets expectations and that the tool has read the system data correctly. Some configuration is required to set up the frequency of updates and re-scan functions, as well as notification and alert thresholds and contacts.
    • User interface and reporting/analysis—Some packages focus on the ability to track and manage support issues for asset items, with analysis for trends in incidents or problems with certain assets. These tools commonly support tracking change requests and enhancement requests. Users may also report issues with assets that need to be resolved either via an enhancement change request or through sustainment. There should be a process for notifying the appropriate teams of relevant issues. See the Change initiatives and Sustainment chapters for more on this topic.
      User interfaces for lookup and reporting can either be installed clients or web-based. Client-installed versions can enable data entry and other auditable activities for authorized users, and may have more intuitive user interfaces for interacting with the tool in order to enter or modify asset data. However, web-based interfaces are much easier to deploy and manage for large numbers of users.
      Some packages include options for impact and dependency analysis to help determine if a change would have adverse or unintended effects, as well as alarms and notifications for license expiration or version updates from vendors.
  • Asset monitoring tools
    • Monitoring capabilities—Some packages focus on monitoring components in great detail. However, the tool may be able to function at that level of detail on only a limited list of systems.
    • User interface—Some packages focus on the ability to display monitoring data in useful formats, in both reports and in real-time on scrolling or continually updating monitors. These may also have more intuitive user interfaces for interacting with the tool in order to enter or modify system data or schedules.

4.3 Asset Support Incidents and Recovery

Support organizations exist not only to keep systems operating normally, but also to handle exceptions to normal operations. They handle small, known exceptions directly, and call in another team within the organization or outside support from vendors for more serious incidents or unknown exception situations.

4.3.1 Incidents versus Problems

Incidents (as defined in ITIL) are interruptions in normal processing on production systems. If a scheduled process does not start when expected, unexpectedly stops while executing, or executes longer than expected and somehow prevents something else from executing, it is an incident. Incidents causing incidents in other processes are considered related. ITIL also includes as incidents the failure of an asset that has not yet caused an interruption.

It is possible to resolve an incident with a workaround, but the underlying cause needs to be recorded as a problem.

Problems are situations where processes are executing, but not as normally expected, and the difference is temporarily tolerable. Problems become either defect reports or enhancement requests submitted to the change request queue. Identification as a defect report means that the change request alerts those responsible for defect correction and patch release planning. Identification as an enhancement request means that the change request alerts those responsible for planning changes for future releases of the product.

Incidents can result in problems, but problems are not incidents. One way to understand the difference is to consider a traffic accident.

  1. A jack-knifed semi-truck lies across all lanes of traffic (an incident). Traffic is stopped.
  2. Incident management (police, firemen, tow truck) is called.
  3. Traffic is routed to the shoulder (workaround), so the incident is resolved. Traffic movement is restored, but not as much or as fast as normally expected.

There are now two remaining issues:

  • The first problem is that some lanes are still blocked. Getting the truck moved to the side helps, and removing the truck entirely restores traffic to the pre-incident state.
  • The second problem is to understand and address the reason why the truck jack-knifed in the first place—the weather, the road, the traffic, the truck, the driver, or some combination.
    • If caused by the weather (bad conditions), the incident management team can resolve the issue by communicating that road conditions are bad, while salting the road (preventing future incidents).
    • If caused by an issue with the road, the team can patch the bad area and schedule a more permanent repair.
    • If the level of traffic or the traffic pattern is the reason for the incident, the team can examine how the traffic deteriorated to cause the accident, and recommend changes to the traffic pattern, such as adding turn signals or turn lanes.
    • If the truck caused the issue, the owner can fix the truck.
    • If the driver caused the issue, the driver can be retrained.

Incident management is the process of restoring normal operations as quickly as possible with minimal business impact. This is an emergency, where getting processes working now is more important than making the system permanently stable. A problem occurred, causing an incident.

Problem management is the process of removing causes of incidents that have occurred, or preventing potential incidents from occurring in the first place. Some problems may be so rare that a permanent remedy could be more costly than the incident causes, so problem management also includes prioritization. Problems turn into change requests submitted to a change request queue for evaluation (see Change Request Queue Management).
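The relationship between incidents, workarounds, problems, and change requests can be sketched as a small data model. This is an illustrative sketch only: the class names, fields, and the `to_change_request` helper are assumptions made for this example, and are not taken from ITIL, ISO/IEC 20000, or any ticketing tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Problem:
    """Underlying cause that remains after service is restored (hypothetical model)."""
    problem_id: str
    description: str
    kind: Optional[str] = None  # "defect" or "enhancement" once triaged


@dataclass
class Incident:
    """Interruption of normal processing on a production system (hypothetical model)."""
    incident_id: str
    description: str
    resolved: bool = False
    workaround: Optional[str] = None
    problems: List[Problem] = field(default_factory=list)

    def resolve_with_workaround(self, workaround: str, cause: Problem) -> None:
        # Service is restored, but the underlying cause is recorded as a problem.
        self.workaround = workaround
        self.resolved = True
        self.problems.append(cause)


def to_change_request(problem: Problem) -> dict:
    """Turn a triaged problem into an entry for the change request queue."""
    if problem.kind not in ("defect", "enhancement"):
        raise ValueError("problem must be triaged as 'defect' or 'enhancement'")
    return {"source": problem.problem_id, "type": problem.kind,
            "summary": problem.description}


# The traffic accident example: the workaround resolves the incident,
# while the cause lives on as a problem that feeds the change request queue.
incident = Incident("INC-1", "Jack-knifed truck blocks all lanes")
cause = Problem("PRB-1", "Damaged road surface", kind="defect")
incident.resolve_with_workaround("Route traffic onto the shoulder", cause)
request = to_change_request(cause)
```

Keeping the incident and the problem as separate records means the incident can be closed as soon as service is restored, while the problem stays open until it is prioritized.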

4.3.2 Incident Notification or Detection

Incidents are detected automatically by monitoring processes, or are reported by users. Support teams may have standard email addresses or phone numbers for support calls. A survey taken by Everbridge [1] includes the following findings.

  • Organizations averaged 150 EIT incidents a year, each taking an average of 2.25 hours to resolve.
  • Reporting outages by phone alone is insufficient.
  • One third of respondents reported difficulty quickly connecting to the right on-call support person to handle the outage.
  • Almost unanimously, respondents reported incidents of late or no response from the assigned support team.
  • Of the five most reported incident types (hardware failure, application outage, datacenter network outage or performance issues, or connectivity issues between sites and offices), network connectivity of some kind seems to be the most common issue. It follows that the survey also reported that network administrators were the most common type of support staff pulled into incident resolution efforts.
  • Two thirds said that incidents caused personal stress or increased workloads for staff members.

Mature organizations have reporting processes, and automated responses for every step in the process of resolving the incident. These reporting mechanisms can be as simple as an email sent to a team mailbox, or as sophisticated as using a standard ticket assignment tool. Both of these can provide logs of the issue's history and tracking for future reference.

It is important to keep the person reporting the incident updated, as well as anyone affected by the outage. Lack of information leads to a bad impression of the support team, no matter how well they resolve the problem.

4.3.3 Incident Resolution

By nature, incidents are unpredictable, and therefore uncomfortable for all involved. Mature organizations plan for scheduling a good mix of skilled resources to be available at all times, and funding for staff enhancement when needed during incidents so that support staff members are not overwhelmed during a major incident.

Link known incidents to standard remedies to help support staff resolve expected situations efficiently. Below are examples from both ends of the range of incidents:

  • Minor incident—Process does not execute as expected. For example, an expected file is late in arriving from a supplier, preventing a process from executing. The problem is a file arrival issue.
    • If the cause is that there was no data to send, then the incident is resolved by removing the requirement for the file for this one instance. The problem is that there may be no data for multiple execution instances, requiring a resolution of some kind that can automatically handle the situation without causing an incident, such as having the supplier send an empty file or having the application's wait time out.
    • If the cause is that there was a delay in processing on their side, then the incident is resolved only when the file arrives. The problem is then that the SLA between the supplier and the application owner is not being met, requiring a resolution of some kind (the SLA to be revisited and the schedule adjusted, the dependency on that file arriving on time is removed, or some other solution).
    • If this incident occurs frequently, the problem importance can be elevated to a higher priority for resolution.
  • Major incident—A bug fix patch, upgrade, or enhancement implementation experienced an unexpected issue, causing an unexpected system restart. All processing stops without notice.
    • Automatic restart—If the system processing restarts automatically, there could be problems with the data, depending on where in the processing the halt occurred, and how robust the processing is for handling exceptions like unexpected restarts. It could be that incidents or problems occur for processes and data unrelated to the upgrade or enhancement due to co-existence on the same system.
    • Unsuccessful restart—If the system processing does not successfully restart, or the system does not restart at all, the incident continues until either a restore occurs from a backup taken prior to the change, or the system restarts successfully after some other action.
  • Major incident—Power outage occurs in the datacenter (see the Disaster Recovery chapter for more on this topic).
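The automatic handling suggested for the minor incident (the supplier sends an empty file, or the application's wait times out) could look like the following sketch. The function name, the returned action strings, and the convention that a size of `None` means "not yet arrived" are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional


def check_supplier_file(size_bytes: Optional[int], now: datetime,
                        deadline: datetime) -> str:
    """Decide what a waiting process should do about an expected file.

    size_bytes is None when the file has not arrived (hypothetical convention).
    """
    if size_bytes is not None:
        # An empty file is the supplier's explicit "no data" signal.
        return "process" if size_bytes > 0 else "skip_no_data"
    if now < deadline:
        return "keep_waiting"
    # The wait timed out: stop blocking dependent work and record a problem.
    return "timed_out"
```

With a rule like this in place, a missing file no longer blocks downstream processing indefinitely, and repeated `timed_out` results give problem management evidence for revisiting the SLA with the supplier.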

Ideally, each asset has documentation on normal operations expectations, remedial steps for known exception occurrences, and a guide for troubleshooting when unexpected exceptions to normal operations occur.

4.3.4 Resolution Notification

In most organizations, whenever an incident occurs, all users affected by the incident receive a notice that includes what happened, what actions are occurring to resolve the incident, and an expected time of return to normal operations.

It is important to communicate with users at least every hour during an incident, even if the communication is only that work is continuing to determine how to restore operations as quickly as possible. Assign one person to handle user notifications during the incident to prevent confusing or conflicting messages to the users. Some service level agreements can specify the frequency of communications during an incident (see Service Level Agreements).
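As a minimal sketch, the notice contents and the hourly cadence described above could be modeled as follows. The class and function names are hypothetical, and the one-hour default is only the guideline from this section; an SLA may specify a different interval.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentNotice:
    """The three items each user notice should contain (hypothetical model)."""
    what_happened: str
    current_actions: str
    expected_restore: str  # free text, for example "14:30" or "not yet known"

    def render(self) -> str:
        return (f"What happened: {self.what_happened}\n"
                f"Current actions: {self.current_actions}\n"
                f"Expected return to normal operations: {self.expected_restore}")


def next_update_due(last_sent: datetime,
                    interval: timedelta = timedelta(hours=1)) -> datetime:
    """Time by which the next user update should go out."""
    return last_sent + interval


def is_update_overdue(last_sent: datetime, now: datetime,
                      interval: timedelta = timedelta(hours=1)) -> bool:
    """True when the communication interval has elapsed since the last notice."""
    return now >= next_update_due(last_sent, interval)
```

The person assigned to user notifications could check `is_update_overdue` on a timer and send a "work is continuing" notice even when there is no new information.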

When the incident is resolved, communicate the status to the users as well as the next steps for determining the cause and resolution of any problems detected.

4.3.5 Closed-loop Analysis

Closed-loop processes provide the data needed for process improvement. Monitoring of assets is not sufficient by itself; monitoring of support instances and resolutions is necessary to identify trends, and then develop improved remediation actions for frequently occurring issues. To improve resolution processes, it is essential to track the disposition of each incident and problem. For example, there must be a repository and a process that records whether each incident or problem resulted in a change (stating why and where the change was made), whether it was deferred and marked for future action, or was dismissed.

Asset support teams can track the frequency of issues, and reduce time spent resolving issues in several ways:

  • Reduce the time spent for successful resolution:
    • Develop and deploy better self-service instructions to reduce Tier 1 calls.
    • Modify Tier 1 scripts to identify if the issue needs escalation immediately, rather than trying other solutions first.
    • Modify Tier 1 scripts to use the optimal solution, based on analysis of prior instances.
  • Reduce the frequency of the issue:
    • Create a problem ticket or enhancement request requiring a system change that prevents the problem from occurring.
    • Create a problem ticket or enhancement request that handles the issue automatically, avoiding support calls.
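The tracking described above can be sketched as a small summary over closed support records. The record fields (`issue`, `minutes`, `disposition`) and the three disposition values are assumptions for this example, chosen to mirror the changed, deferred, and dismissed outcomes mentioned earlier; a real repository would have its own schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

DISPOSITIONS = {"changed", "deferred", "dismissed"}


def summarize_issues(records: Iterable[dict]) -> List[dict]:
    """Group support records by issue and rank them by total time spent."""
    totals: Dict[str, dict] = defaultdict(
        lambda: {"count": 0, "minutes": 0.0, "dispositions": defaultdict(int)})
    for rec in records:
        if rec["disposition"] not in DISPOSITIONS:
            raise ValueError(f"unknown disposition: {rec['disposition']}")
        entry = totals[rec["issue"]]
        entry["count"] += 1
        entry["minutes"] += rec["minutes"]
        entry["dispositions"][rec["disposition"]] += 1
    summary = [
        {"issue": issue,
         "count": e["count"],
         "total_minutes": e["minutes"],
         "mean_minutes": e["minutes"] / e["count"],
         "dispositions": dict(e["dispositions"])}
        for issue, e in totals.items()
    ]
    # Issues that consume the most support time come first.
    summary.sort(key=lambda row: row["total_minutes"], reverse=True)
    return summary


records = [
    {"issue": "password reset", "minutes": 10, "disposition": "dismissed"},
    {"issue": "password reset", "minutes": 20, "disposition": "dismissed"},
    {"issue": "late supplier file", "minutes": 90, "disposition": "deferred"},
    {"issue": "password reset", "minutes": 15, "disposition": "changed"},
]
summary = summarize_issues(records)
```

Ranking by total time rather than by count highlights both frequent small issues (candidates for self-service) and rare expensive ones (candidates for a system change).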

5 User Management

Users use assets. Asset management enables business processes to occur predictably. Therefore, EIT must proactively manage users to enable controlled access and prevent chaos. Depending on the organization, funding for user support may come from EIT as a provided service; otherwise, funding comes from the departments on a usage basis.

Always communicate with the business users in business terms to ensure clear understanding, as most business users do not know technical terms or understand technical details.

5.1 User Access and Security Administration

Users require access to systems and applications to do their work. Systems and applications require information to authenticate users and enable access to authorized activity. Both sides need to agree. The most common method is to have profiles for users, and security authentication on the system side.

5.1.1 User Profiles and User Logins

A user profile is a collection of data about (a digital profile of) an individual or identity. Users within a single database typically share similar or identical qualities/characteristics. A profile often has information about the user's job or department, usage patterns, level of experience, location, and an ID. Often, user credentials are associated with a user profile and that is how the user is given access to applications, systems, or networks.

When a user enters an organization, the service department creates a user profile and associated login credentials. When a user leaves the organization, the EIT team should promptly disable their profile to prevent unauthorized access and potential negative actions. Some users might require multiple user profiles (IDs) each covering a different situation, such as one for normal work and another for when performing maintenance or recovery operations.

Applications have access points where users connect to the application or system. There are three aspects to enabling a secure connection between a user and a system:

  • Authentication—Each access request must verify that the user has access to the system. This can be a user ID and password, a certificate, a token, or a notification or pass-through from another security system like Active Directory or LDAP.
  • Authority—Each user has authority to perform functions within the application. Functions can be enabled or disabled based on the user's assigned authority.
  • Auditing or accounting—Every access request to the application is logged for review, regardless of success.
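A minimal sketch of how the three aspects fit together at a single access point is shown below. The `AccessPoint` class and its methods are hypothetical, and the in-memory password store is an assumption for illustration only; a production system would normally delegate authentication to a directory service such as the ones named above.

```python
import hashlib
import hmac
import os
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Set, Tuple


def hash_password(password: str, salt: bytes) -> bytes:
    # Salted PBKDF2 hash, so plain-text passwords are never stored.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class AccessPoint:
    """Hypothetical access point combining authentication, authority, and auditing."""

    def __init__(self) -> None:
        self._credentials: Dict[str, Tuple[bytes, bytes]] = {}  # user -> (salt, hash)
        self._authority: Dict[str, Set[str]] = {}               # user -> functions
        self.audit_log: List[dict] = []

    def register(self, user_id: str, password: str, functions: Iterable[str]) -> None:
        salt = os.urandom(16)
        self._credentials[user_id] = (salt, hash_password(password, salt))
        self._authority[user_id] = set(functions)

    def _authenticate(self, user_id: str, password: str) -> bool:
        record = self._credentials.get(user_id)
        if record is None:
            return False
        salt, expected = record
        return hmac.compare_digest(hash_password(password, salt), expected)

    def request(self, user_id: str, password: str, function: str) -> bool:
        authenticated = self._authenticate(user_id, password)
        allowed = authenticated and function in self._authority.get(user_id, set())
        # Every request is logged, regardless of success.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user_id,
            "function": function,
            "authenticated": authenticated,
            "allowed": allowed,
        })
        return allowed
```

Logging the authentication result separately from the authorization result lets a reviewer tell a mistyped password apart from an attempt to use a function the user has no authority for.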

There are three common ways to assign access rights to users:

  • User-based—Each user has an individual login/profile with assigned rights and privileges to various systems. Mass changes are difficult and time consuming, but this method has the most flexibility.
  • Job-based—Each defined job has rights and privileges assigned to it. Users are linked to jobs so they can inherit the rights and privileges associated with that job. When job responsibilities change, all users linked to that job inherit the changes. Mass changes are easier, but some changes may not apply. This method requires a way to maintain the job definitions.
  • Role-based—Each system has defined roles with access rights to certain parts of the system or application. Some systems can have identical roles, so one role can share access to multiple applications for the same rights. Roles attach to users, and if job functions or access needs change, roles are attached to or detached from the users. Mass changes are simply adding or removing users from roles.
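The role-based approach can be illustrated with a short sketch. The `RoleDirectory` class, the role names, and the permission strings below are all hypothetical; the point is only that permissions hang off roles, so a mass change is a role membership change rather than an edit to each user.

```python
from typing import Dict, Iterable, Set


class RoleDirectory:
    """Hypothetical role-based access model: users hold roles, roles hold rights."""

    def __init__(self) -> None:
        self.role_permissions: Dict[str, Set[str]] = {}
        self.user_roles: Dict[str, Set[str]] = {}

    def define_role(self, role: str, permissions: Iterable[str]) -> None:
        self.role_permissions[role] = set(permissions)

    def attach(self, user: str, role: str) -> None:
        if role not in self.role_permissions:
            raise KeyError(f"undefined role: {role}")
        self.user_roles.setdefault(user, set()).add(role)

    def detach(self, user: str, role: str) -> None:
        self.user_roles.get(user, set()).discard(role)

    def permissions_for(self, user: str) -> Set[str]:
        # A user's rights are the union of the rights of all attached roles.
        granted: Set[str] = set()
        for role in self.user_roles.get(user, set()):
            granted |= self.role_permissions[role]
        return granted


directory = RoleDirectory()
directory.define_role("payroll_clerk", {"payroll:read", "payroll:enter"})
directory.define_role("report_viewer", {"reports:read"})
directory.attach("dana", "payroll_clerk")
directory.attach("dana", "report_viewer")
```

If Dana moves out of payroll, a single `directory.detach("dana", "payroll_clerk")` removes every payroll right at once while leaving the reporting access intact.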

5.1.2 User Support Requirements

Each organization is different, and the needs for all users in the organization are different as well. Needs analysis for user support is fundamental to hiring the right people and selecting the right tools. Two aspects should be considered when determining support requirements:

  • Task complexity—Organizations vary in how complex their support tasks are. Some user support tasks are simple and frequent, like password resets. Other tasks can require specialized access from certain teams to accomplish. These questions may help guide your requirements design:
    • Which support tasks are common?
    • Which support tasks are deliverable through self-service functions?
    • Which support tasks can be disabled during an emergency?
    • What is the expected task turnaround time for each task?
  • User skill level—Each organization varies in the types of users needing support. Some users are familiar with the applications and can perform many support tasks themselves. At the other end of the spectrum, some users are uncomfortable with computers and need additional attention to make them comfortable and successful.

5.2 User Support Processes

There are three main types of support processes for users: help desk, self-service portals, and change request queues. Each of these support processes requires monitoring and auditing to ensure proper performance. Help desks are monitored through side-by-side call monitoring, or after-call reviews. Monitoring ensures standard policy compliance and the proper execution of standard activities. Self-service portals have logs that can be monitored by the support team and change requests are typically monitored by operations. See Change Request Queue Management.

User support auditing is also important. The level and frequency of auditing or review corresponds to the maturity of the EIT organization and the industry requirements. Auditing can be critical for organizations dealing with healthcare or finances, or organizations that deal with human resources information.

5.2.1 Types of User Support Teams

User support can be provided in a number of ways, including embedded support, in-house remote support, and offshore support.

Embedded support means that the EIT support staff is co-located directly with the business unit. Having the support team embedded in the business unit makes support convenient from the business side; however, from the EIT side, the main drawback is the possibility of resistance to EIT standards, such as using a ticketing system consistently. If there are enterprise standards, co-location with the business unit can bring extra pressure to ignore standards in favor of business needs.

In-house remote support means that the EIT organization provides support in a separate location from the users. Users contact the support center to report issues and receive support services. Some travel between the remote location, the datacenter, and the user locations occurs when necessary.

Offshore support is provided by an organization in another country. Many companies specialize in remote user support.

5.2.2 Help Desk Support Levels

When setting up a support system, the operations team needs to structure the support staff in the best way to handle the varying levels of support the users require. The team also needs to understand the timing of those needs to set up a support schedule. The help desk staff needs to call for support when a system is not performing as expected or when the help needs are out of their area of expertise. A support team can be structured by hierarchy, escalation, and specialization.

  • Hierarchy—Most organizations have multiple levels of support, each increasing in authority. Larger organizations can provide support for a majority of issues with cheaper staff in the lower levels.
    • Service representatives—This level of support has a person with access to scripts or instruction sets for common problems, and possibly some familiarity with the systems, who can talk through basic troubleshooting steps with a caller. This level of support is sometimes called Tier 1.
    • Technical support—This level of support has familiarity with the systems and user support tasks. They can work with the caller using their expertise and access to additional functions and tools. In larger organizations, there may be more than one technical support layer, up to and including the development organization. This level of support is sometimes called Tier 2.
    • On-site technician—This level of support is the highest level, and sends a technician to the site when all lower levels have failed to resolve the caller's issue. This level of support is sometimes called Tier 3.
  • Escalation—In organizations with multiple support levels, a process for moving an issue to a higher level is necessary to minimize the time spent resolving the issue.
    • When to escalate:
      • All options at a support level have been exhausted without being able to successfully resolve the issue.
      • The issue is on a list of known issues that require a higher level of support expertise.
      • Upon request from the caller.
      • Upon request from the help desk staff.
    • How to escalate:
      • Call in the next higher level directly using a defined process, such as using an on-call pager or phone number.
      • Reassign the issue in a ticket management system to the next level in the hierarchy for that system.
  • Specialization—In larger organizations, support organizations can be specialized to certain subject areas or applications.
    • Security access—Specialized teams evaluate requests for privileged access.
    • Application deployment—Specialized teams deploy applications to users upon request.
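The escalation rules above can be sketched as two small functions: one that decides when to escalate, and one that reassigns a ticket to the next level. The tier names follow the hierarchy described in this section; the ticket structure and function names are assumptions for illustration, not the interface of any particular ticket management system.

```python
from typing import Dict, List

TIERS: List[str] = ["Tier 1", "Tier 2", "Tier 3"]


def should_escalate(options_exhausted: bool, known_escalation_issue: bool,
                    caller_requested: bool, staff_requested: bool) -> bool:
    """Any one of the four conditions listed above is enough to escalate."""
    return any([options_exhausted, known_escalation_issue,
                caller_requested, staff_requested])


def escalate(ticket: Dict) -> Dict:
    """Return a copy of the ticket reassigned to the next support level."""
    current = TIERS.index(ticket["tier"])
    if current == len(TIERS) - 1:
        raise ValueError("ticket is already at the highest support level")
    next_tier = TIERS[current + 1]
    # Keep a history so the path of the issue can be reviewed later.
    history = list(ticket.get("history", [])) + [f"{ticket['tier']} -> {next_tier}"]
    return {**ticket, "tier": next_tier, "history": history}


ticket = {"id": "T-100", "tier": "Tier 1"}
if should_escalate(options_exhausted=True, known_escalation_issue=False,
                   caller_requested=False, staff_requested=False):
    ticket = escalate(ticket)
```

Checking the known-issues list first lets Tier 1 escalate immediately instead of working through scripted steps that are known not to help.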

Some organizations may outsource user support; however, the organization is heavily invested in the users and depends on the services, so it follows that knowing the user profiles is critical, regardless of where the support is provided and by what organization.

5.2.3 User Self-Service Support

Examples of self-service functions are password resets or standardized software downloads and installations. Common self-service functions occur through IVR systems or websites where users request the support service without support staff involvement.

5.2.4 User Support Tools

Support tools come in two categories: user management (what exists) and help desk monitoring (what occurs). Some toolsets may cover both categories. As the business is at least tangentially funding operations and support, there should be a business view of the users and services consumed.

User Management Tools

Many software packages manage user information and profile assignments. The hard work lies in populating the package's database with the organization's users, and maintaining the catalog over time. No matter how comprehensive a tool is, it is useless until it has the organization's data loaded, and becomes useless over time if the data becomes stale.

Each tool may have one or more sweet spots in the following areas:

  • Profile management—Some packages include options for managing data for each user and profile, such as permission and authorization settings, contact information, and support agreement information. While there may be some useful templates or default profiles included, an organization's user data must be entered or uploaded into the tool's repository.
  • Remote connectivity—Being able to log directly in to the user's workstation enables support staff to see what the user sees to determine and resolve the user's issue.
  • User interface and reporting/analysis—Some packages focus on the ability to track and manage user support requests, with analysis for trends in incidents or problems with certain assets. These tools commonly support tracking change requests and enhancement requests. Users may also report issues with assets that need to be resolved through either an enhancement via a change request or as a sustainment issue. See the Change initiatives and Sustainment chapters for more on this topic.

Help Desk Monitoring Tools

Other tools are designed to help the help desk support team:

  • Monitoring capabilities—Many tools handle monitoring help desk or call center activity. This type of monitoring is not specific to EIT operations and support help desks.
  • User interface—Some packages focus on the ability to display monitoring data in useful formats, in both reports and in real-time on scrolling or continually updating monitors.

6 Service and Operational Level Agreements

Service level agreements (SLAs) and operational level agreements (OLAs) are contracts between teams that set expectations for how the teams cooperate with each other, and what each team is responsible for providing to the other team. SLAs contain expectations between a service provider and users consuming the services. An example of an SLA is an agreement between a business unit and EIT regarding desktop services. OLAs contain expectations between service providers who support systems providing services. An example of an OLA is an agreement between network operations, security, and desktop services to manage virtual private network (VPN) software and access for workstations. The figure below illustrates the difference.

Some service providers may not always have direct interfaces with users, and therefore no SLAs.

Figure 4. SLAs and OLAs

6.1 Service Level Agreements (SLAs)

There are two main types of SLAs: generic and negotiated.

  • Generic SLAs are created when there is no specific relationship between the provider and the users. Sometimes the users are not part of the organization (external) or are on large teams where negotiation would be impractical. In these cases, the service provider alone may create a generic SLA as a baseline. Users that find the generic SLA unsatisfactory can negotiate with the service provider to create an SLA that is specific to their team's needs.
  • Negotiated SLAs are between providers and users to set expectations for each team's deliverables to the other teams, what services are provided, the availability of those services, notifications of service interruptions, and if appropriate, rewards for compliance and penalties for non-compliance.

SLAs include multiple sections describing the agreement, the participants, and any required actions. Following are the most common sections included in SLAs.

Definitions
Each SLA needs to name and define the teams and services involved in the agreement. Any SLA negotiations should start with definitions. Each SLA should also have a version identifier and effective dates to reflect changes over time.
Services and Service Levels
SLAs do not need to be specific to only one service. For negotiated SLAs, all services that the users consume supported by the service providers in the agreement should be included in the SLA. For generic SLAs created by providers, all their available services should be included in the SLA. For each service, include the following information:
  • Name of the service—The common name and any other terms used.
  • Benefit or use—What the service provides to the organization and users.
  • Cost or fee structure—Costs for the service, if any. Some services involve purchasing or leasing hardware or software, while some services involve labor charges for specific activities. Some organizations use chargeback calculations where the business unit pays for EIT services through internal budget accounts.
  • Availability—When the service is available for access and usage. Monitor compliance and report this information to the users periodically.
  • Performance requirements—How long the service takes from request to response, how many events can be processed at once, how many users can access the system at once, etc. Monitor compliance and report this information to the users periodically.
  • Turnaround time for requests—When the users make service requests, how long it takes the provider team to respond (response time), and how long it takes to resolve (resolution time). Monitor compliance and report this information to the users periodically.
  • Scheduled maintenance windows—When maintenance activities can occur on a regular basis and not adversely affect the users. This section can also include preferred time for unscheduled maintenance windows in non-emergency situations.
  • Scope (Optional)—Identification of systems in scope (involved in or related to the service), or outside scope, is useful in situations involving multiple locations, providers, or external dependencies; some instances may have different SLAs. There is no need for much detail here—technical details such as specific server names or software versions do not belong in SLAs.
  • Performance incentives and penalties (Optional)—Some SLAs include penalties for non-compliance or rewards for high levels of compliance. This is most common in SLAs with external users, and is best codified in a legal contract.
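The availability and turnaround figures that are reported to users can be computed in a few lines. The formulas below are a common convention rather than something this chapter prescribes: availability is taken as the share of scheduled service time that was not lost to downtime, and turnaround compliance as the share of requests handled within the target. The function names and sample values are hypothetical.

```python
from datetime import timedelta
from typing import Iterable


def availability_percent(scheduled: timedelta, downtime: timedelta) -> float:
    """Percentage of scheduled service time during which the service was up."""
    if scheduled <= timedelta(0):
        raise ValueError("scheduled time must be positive")
    if not timedelta(0) <= downtime <= scheduled:
        raise ValueError("downtime must be between zero and the scheduled time")
    return 100.0 * ((scheduled - downtime) / scheduled)


def turnaround_compliance_percent(durations: Iterable[timedelta],
                                  target: timedelta) -> float:
    """Percentage of requests whose turnaround met the target."""
    durations = list(durations)
    if not durations:
        raise ValueError("no requests in the reporting period")
    met = sum(1 for d in durations if d <= target)
    return 100.0 * met / len(durations)


# A 30-day month (720 hours) with 3.6 hours of unplanned downtime.
monthly = availability_percent(timedelta(hours=720), timedelta(hours=3.6))

# Four requests measured against a 30-minute response target.
responses = [timedelta(minutes=m) for m in (10, 20, 45, 30)]
compliance = turnaround_compliance_percent(responses, timedelta(minutes=30))
```

Whether scheduled maintenance windows count as downtime is a definition the SLA itself should settle; this sketch assumes they are excluded from the scheduled time.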
Standard Notifications
Service providers generate four types of standard notifications to users regarding the services: maintenance window reminders, incident and resolution notifications, enhancement or upgrade notifications, and metrics reports. The SLA should include content and timing guidelines for communications to users according to the situation.
  • Maintenance window reminders should include the time and duration of the window and what services are available or unavailable (whichever list is shorter). If the maintenance window is on a standard schedule, the reminder can be sent infrequently, or only to new users. If the maintenance window is not on the standard schedule, reminders should be sent one day, three hours, one hour, and fifteen minutes in advance, and another reminder should be sent when the maintenance window is concluded and all systems are back online.
  • Incident notifications should include incident information, including the systems affected and current resolution activities. The initial notification should be sent as soon as the incident is verified, with the time of discovery as well as when the incident started. While the incident is being researched, notices should be sent as specified in the SLA: every 15 to 30 minutes for critical systems, and less frequently for non-critical systems. When the incident is resolved, a final notification should be sent describing the resolution and any further actions necessary.
  • Enhancement or upgrade notifications should be sent when anything affecting services covered by the SLA is put on a schedule for implementation, including how users will be affected. If possible, notifications for changes requiring user training should be sent far enough ahead of time for users to sign up for and receive the training on the new functionality before implementation. Reminders should be sent one day before the change is implemented, and another afterward, including training information.
  • Metrics reports should be sent as defined in the SLA to users who have opted in to getting the metrics reports. Some users may not be interested in these notifications and should be able to opt out as part of their user profile. Most metrics reports are generated monthly, and include data on service availability and performance including issues, and user requests including counts and resolution times.
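The reminder schedule for a non-standard maintenance window (one day, three hours, one hour, and fifteen minutes in advance) can be generated from the window start time. This is a small sketch; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Lead times for a maintenance window that is not on the standard schedule.
REMINDER_LEADS = [timedelta(days=1), timedelta(hours=3),
                  timedelta(hours=1), timedelta(minutes=15)]


def reminder_times(window_start: datetime,
                   now: Optional[datetime] = None) -> List[datetime]:
    """Return the reminder send times, skipping any that have already passed."""
    times = [window_start - lead for lead in REMINDER_LEADS]
    if now is not None:
        times = [t for t in times if t > now]
    return times


window = datetime(2024, 6, 15, 22, 0)
schedule = reminder_times(window)
remaining = reminder_times(window, now=datetime(2024, 6, 15, 20, 0))
```

The closing notice, sent when the window ends and all systems are back online, depends on the actual completion time and so is not part of the precomputed schedule.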
User Responsibilities
Users (identified or not) have certain responsibilities, even in generic SLAs. Users are primarily responsible for:
  • Requesting services, such as access to a system
  • Reporting issues with service performance

Users may also be responsible for evaluating service metrics to validate that services occur according to expectations. This is for separate compliance reporting, or for other internal audit reasons.

It is important to list and describe the responsibilities that the user accepts in the SLA, and the process for submitting requests or reporting issues.

Provider Responsibilities
Providers have the primary responsibility of making the services available to the users. This may include aspects of asset management, user management, or both.

Providers have the secondary responsibilities of responding to issue reports and user requests. Every communication from the users should immediately generate a response from the provider stating that the request will be evaluated as soon as possible. Use of a ticketing system can provide tracking for both the original communication from the users and any responses or actions taken by the provider. Issue reports and user requests need different responses.

List and describe the responsibilities that the provider accepts in the SLA. Include expected turnaround times for issue reports and user requests.

Responding to user requests includes research, and one or more of the following actions:

  • Completion of the access or other configuration request, and notification that it was completed successfully
  • Notification of issues with the user request

Responding to issue reports includes research that results in one or more of the following actions:

  • Notification that the issue was unrepeatable, or an anomaly that should not recur
  • Notification that a permanent solution to resolve the issue or permanently prevent recurrence was identified and routed for consideration to the appropriate teams, along with submission of the recommended solution to change request management
  • Notification that a workaround was identified, including instructions to use the workaround until a permanent solution can be identified and implemented, as well as submission of the issue to change request management for evaluation.
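The provider responsibilities above (an immediate acknowledgment, tracking in a ticketing system, and one of three outcomes for an issue report) can be sketched in a few lines of Python. This is only an illustration: the class names, field names, and message wording are hypothetical and do not come from any particular ticketing product.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Outcome(Enum):
    UNREPEATABLE = "unrepeatable or an anomaly"
    PERMANENT_SOLUTION = "permanent solution identified"
    WORKAROUND = "workaround identified"


# Per the list above, these two outcomes are also submitted to
# change request management.
NEEDS_CHANGE_REQUEST = {Outcome.PERMANENT_SOLUTION, Outcome.WORKAROUND}


@dataclass
class IssueTicket:
    reporter: str
    summary: str
    opened_at: datetime
    log: list = field(default_factory=list)  # (event, detail) pairs for tracking

    def acknowledge(self) -> str:
        """Record and return the immediate response sent to the reporter."""
        message = (
            f"To {self.reporter}: your report '{self.summary}' was received "
            "and will be evaluated as soon as possible."
        )
        self.log.append(("acknowledged", message))
        return message

    def close(self, outcome: Outcome) -> bool:
        """Log the outcome; return True if a change request should follow."""
        self.log.append(("closed", outcome.value))
        return outcome in NEEDS_CHANGE_REQUEST
```

Keeping the acknowledgment and the outcome in the same log gives the provider a single record of the original communication and every response to it.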
Escalation Procedures
When users report issues, the provider must investigate and respond. If the user has an issue with the response (or lack thereof), the SLA must cover how users can escalate issues with providers, including the support hierarchy involved. Additionally, if the provider must escalate an issue with a user, that process and escalation hierarchy should be included in the SLA.
SLA Review Cycle
There should be a standard process for review described in the SLA. This review can occur on a regular basis, such as annually, or be triggered by an event, such as multiple performance issues or prolonged failure to comply with performance or availability requirements. The result of a review triggered by an event does not always generate a change to the SLA; instead, it may generate a project to upgrade or increase the resources on the systems providing the service. If the SLA changes, then the version should be updated.
Approvals
The SLA must be approved by the appropriate leaders of the teams involved, and should be recorded in a document repository.

6.2 Operational Level Agreements (OLAs)

OLAs are agreements between multiple technical teams (participants), and therefore have different contents and cover different areas than SLAs. OLAs are negotiated between participating teams, and set expectations for each team's deliverables to the other teams, including services provided, the availability for those services, and notifications of service interruptions. Some service providers only have agreements with other service providers, and so do not participate in SLAs with users.

OLAs include multiple sections describing the agreement, the participants, and any required actions. Following are the most common sections included in OLAs.

Definitions
Each OLA needs to name and define the teams and services involved in the agreement. Any OLA negotiations should start with definitions. Each OLA should also have a version identifier and effective dates to reflect changes over time.
Services and Service Levels
Include all services that the participating teams provide to or consume from each other. For each service, include the following information:
  • Name of the service—The common name and any other terms used.
  • Benefit or use—What the service provides to the organization and the participant teams.
  • Availability for the service—When the service is available for access and usage. Monitor compliance and report this information to the OLA participants periodically.
  • Performance requirements for the service—How long from request to response, how many events can be processed at once, how many users can access the system at once, etc. Monitor compliance and report this information to the OLA participants periodically.
  • Turnaround time for requests—When participant teams make requests, how long it takes the appropriate team to respond (response time), and how long it takes to resolve requests (resolution time). Monitor compliance and report this information to the OLA participants periodically.
  • Scheduled maintenance windows—When maintenance activities can occur on a regular basis and not adversely affect the other participants. Some maintenance windows could be used by multiple participants simultaneously to minimize overall downtime. This section can also include preferred times for unscheduled maintenance windows in non-emergency situations.
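The availability and turnaround items above both call for monitoring compliance and reporting it periodically. As a rough sketch of how those two figures might be computed for a reporting period, consider the following Python. The function names are hypothetical, and the choice to exclude scheduled maintenance from the agreed service time is an assumption; an actual OLA may define availability differently.

```python
from datetime import timedelta


def availability_percent(period: timedelta,
                         unplanned_downtime: timedelta,
                         scheduled_maintenance: timedelta = timedelta(0)) -> float:
    """Percentage of agreed service time during which the service was up.

    Assumption: scheduled maintenance windows do not count against
    availability, so they are removed from the agreed service time.
    """
    agreed = period - scheduled_maintenance
    if agreed <= timedelta(0):
        raise ValueError("agreed service time must be positive")
    return 100.0 * ((agreed - unplanned_downtime) / agreed)


def turnaround_compliance(durations, target: timedelta) -> float:
    """Fraction of requests whose response (or resolution) time met the target."""
    durations = list(durations)
    if not durations:
        return 1.0  # nothing was requested, so nothing missed the target
    return sum(d <= target for d in durations) / len(durations)
```

For example, a 30-day period with 8 hours of scheduled maintenance and 2 hours of unplanned downtime gives an availability of roughly 99.72 percent under this definition.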
Standard Notifications
Participants send standard notifications to the other participants regarding the services. Each notification is one of four types: maintenance window reminders, incident and resolution notifications, enhancement or upgrade notifications, and metrics reports. The OLA should include content and timing guidelines for communications to participants according to the situation.
  • Maintenance window reminders should include the time and duration of the window and what services are available or unavailable (whichever list is shorter). If the maintenance window is on a standard schedule, the reminder can be sent infrequently, or only to new users. If the maintenance window is not on the standard schedule, reminders should be sent one day, three hours, one hour, and fifteen minutes in advance, and another reminder should be sent when the maintenance window is concluded and all systems are back online.
  • Incident notifications should include incident information including systems and participant teams affected, and current resolution activities. The initial notification should be sent as soon as the incident is verified, with time of discovery as well as when the incident started. While researching the incident, notices should be sent as specified in the OLA—every 15 to 30 minutes for critical systems, and less frequently for non-critical systems. When the incident is resolved, a final notification should be sent describing the resolution and any further actions necessary.
  • Enhancement or upgrade notifications should be sent when anything affecting services covered by the OLA is put on a schedule for implementation, and include how the participant teams are affected. Reminders should be sent one day before the change is implemented, and another afterward confirming installation.
  • Metrics reports should be sent as defined in the OLA to participants who have opted in to getting the metrics reports. Participants who also have SLAs may use these reports as part of their reports to their users.
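The reminder timing given above for maintenance windows that fall outside the standard schedule (one day, three hours, one hour, and fifteen minutes in advance) lends itself to a small scheduling helper. The sketch below is illustrative only; the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

# Lead times named in the text for windows outside the standard schedule.
REMINDER_LEAD_TIMES = (
    timedelta(days=1),
    timedelta(hours=3),
    timedelta(hours=1),
    timedelta(minutes=15),
)


def reminder_times(window_start: datetime, now: datetime) -> list[datetime]:
    """Return the advance reminder times that are still in the future."""
    times = [window_start - lead for lead in REMINDER_LEAD_TIMES]
    return [t for t in times if t > now]
```

The closing notice, sent when the window is concluded and all systems are back online, is not included here because it depends on when the work actually finishes rather than on the planned start time.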
Participant Responsibilities
Participants have the primary responsibility of making their service available to the other participants. This may include aspects of asset management.

Participants have certain responsibilities to the other participants, most commonly for:

  • Reporting issues with service performance
  • Submitting requests for change or enhancement to services
  • Submitting requests for changes to the OLA terms

Participants may also be responsible for evaluating service metrics to validate that the services perform according to expectations. This may be for separate compliance reporting, or for other internal audit reasons.

List and describe the responsibilities each participant team accepts in the OLA. Include expected turnaround times for issue reports.

OLA Review Cycle
There should be a standard process for review described in the OLA. This review can occur on a regular basis, such as annually, or be triggered by an event, such as multiple performance issues or prolonged failure to comply with performance or availability requirements. The result of a review triggered by an event does not always generate a change to the OLA; instead, it may generate a project to upgrade or increase the resources on the systems providing the service.

If the OLA changes, then the version should be updated.

Approvals
The OLA must be approved by the appropriate leaders of the teams involved, and should be recorded in a document repository.

7 Change Request Queue Management

Change request queues contain requests from multiple areas to correct defects, improve system functionality, or comply with changing external drivers. Most organizations use a ticketing system to record incidents, defect reports, and user requests. These systems can be configured to handle requests for compliance changes required by external drivers. Operations and support teams are logically placed to handle the intake of these requests.

There are three main sources for change requests:

  • Defects drive change requests because they degrade operations and need to be removed. Some defects may affect large systems or large parts of smaller systems. Care must be taken to evaluate how frequently the defect occurs and its overall effect on the organization's operations. Some defects may not be significant enough to be assigned a high priority, and thus may be deferred for later evaluation, although in some cases deferring a defect may mean a greater effect later and a higher cost to remove.
  • External drivers demand that changes occur to comply with laws, regulations, or contractual obligations, such as supplier support agreements. Other external drivers may be industry standards, such as for security processing, which are deemed mandatory due to fines or loss of business resulting from non-compliance.
  • Strategic changes are those from the business side that improve how the organization does business, either reducing expenses or increasing revenue.

Change requests go through a lifecycle, starting with the origination of the need for the change. The figure below shows the lifecycle of change requests.

Figure 5. Change Request Lifecycle

All of these requests are recorded in a single repository, whether a spreadsheet or a sophisticated tool. As shown above, all requests are categorized and queued after having been evaluated for priority, effort, and effect on the organization. As each request is reviewed, a decision is made either to approve it for near-term or immediate action, to defer it, or to reject it. The decisions and their rationales are logged in the request.
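A single repository of this kind can be as simple as a list of records. The Python sketch below shows one possible shape for a change request record and its review queue. The 1-to-5 scales, the field names, and the ordering rule (priority first, then effect, then effort) are assumptions made for illustration; the text does not prescribe them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    DEFECT = "defect"
    EXTERNAL_DRIVER = "external driver"
    STRATEGIC = "strategic change"


class Decision(Enum):
    APPROVE = "approve"
    DEFER = "defer"
    REJECT = "reject"


@dataclass
class ChangeRequest:
    title: str
    source: Source
    priority: int  # hypothetical scale: 1 (highest) to 5 (lowest)
    effort: int    # hypothetical scale: 1 (small) to 5 (large)
    effect: int    # hypothetical scale: 1 (minor) to 5 (organization-wide)
    decision: Optional[Decision] = None
    rationale: str = ""

    def decide(self, decision: Decision, rationale: str) -> None:
        """Log the decision together with its rationale in the request."""
        if not rationale.strip():
            raise ValueError("a rationale must be logged with every decision")
        self.decision = decision
        self.rationale = rationale


def review_order(queue):
    """Undecided requests: highest priority, then largest effect, then least effort."""
    pending = [r for r in queue if r.decision is None]
    return sorted(pending, key=lambda r: (r.priority, -r.effect, r.effort))
```

Requiring a rationale in `decide` mirrors the practice of logging each decision and its reasoning in the request itself, so deferred or rejected requests can be revisited later with their history intact.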

8 Summary

Operations and support is mainly concerned with keeping production systems running as expected. Both assets and users need to be supported, but in different ways. Changes to either assets or users need to be managed appropriately.

9 Key Maturity Frameworks

Capability maturity for EIT refers to its ability to reliably perform. Maturity is measured by an organization's readiness and capability expressed through its people, processes, data, technologies, and the consistent measurement practices that are in place. See Appendix F for additional information about maturity frameworks.

Many specialized frameworks have been developed since the original Capability Maturity Model (CMM) that was developed by the Software Engineering Institute in the late 1980s. This section describes how some of those apply to the activities described in this chapter.

9.1 IT-Capability Maturity Framework (IT-CMF)

The IT-CMF was developed by the Innovation Value Institute in Ireland. This framework helps organizations to measure, develop, and monitor their EIT capability maturity progression. It consists of 35 EIT management capabilities that are organized into four macro capabilities:

  • Managing EIT like a business
  • Managing the EIT budget
  • Managing the EIT capability
  • Managing EIT for business value

Each has five levels of maturity, ranging from initial to optimizing. The two most relevant critical capabilities are service provisioning (SRP) and technical infrastructure management (TIM).

9.1.1 Service Provisioning Maturity

The following statements provide a high-level overview of the service provisioning (SRP) capability at successive levels of maturity.

Level 1: The service provisioning processes are ad hoc, resulting in unpredictable EIT service quality.
Level 2: Service provisioning processes are increasingly defined and documented, but execution is dependent on individual interpretation of the documentation. Service level agreements (SLAs) are typically defined at the technical operational level only.
Level 3: Service provisioning is supported by standardized tools for most EIT services, but may not yet be adequately integrated. SLAs are typically defined at the business operational level.
Level 4: Customers have access to services on demand. Management and troubleshooting of services are highly automated.
Level 5: Customers experience zero downtime or delays, and service provisioning is fully automated.

9.1.2 Technical Infrastructure Management Maturity

The following statements provide a high-level overview of the technical infrastructure management (TIM) capability at successive levels of maturity.

Level 1: Management of the EIT infrastructure is reactive or ad hoc.
Level 2: Documented policies are emerging relating to the management of a limited number of infrastructure components. Predominantly manual procedures are used for EIT infrastructure management. Visibility of capacity and utilization across infrastructure components is emerging.
Level 3: Management of infrastructure components is increasingly supported by standardized tool sets that are partly integrated, resulting in decreased execution times and improving infrastructure utilization.
Level 4: Policies related to EIT infrastructure management are implemented automatically, promoting execution agility and achievement of infrastructure utilization targets.
Level 5: The EIT infrastructure is continually reviewed so that it remains modular, agile, lean, and sustainable.

10 Key Competence Frameworks

While many large companies have defined their own sets of skills for purposes of talent management (to recruit, retain, and further develop the highest quality staff members that they can find, afford and hire), the advancement of EIT professionalism will require common definitions of EIT skills that can be used not just across enterprises, but also across countries. We have selected three major sources of skill definitions. While none of them is used universally, they provide a good cross-section of options.

Creating mappings between these frameworks and our chapters is challenging, because they come from different perspectives and have different goals. There is rarely a 100 percent correspondence between the frameworks and our chapters, and, despite careful consideration, some subjectivity was used to create the mappings. Please take that into consideration as you review them.

10.1 Skills Framework for the Information Age

The Skills Framework for the Information Age (SFIA) has defined nearly 100 skills. SFIA describes seven levels of competency that can be applied to each skill. However, not all skills cover all seven levels. Some reach only partially up the seven-step ladder. Others are based on mastering foundational skills, and start at the fourth or fifth level of competency. SFIA is used in nearly 200 countries, from Britain to South Africa, South America, to the Pacific Rim, to the United States. (http://www.sfia-online.org)

Skill | Skill Description | Competency Levels
Service level management | The planning, implementation, control, review, and audit of service provision, to meet customer business requirements. This includes negotiation, implementation, and monitoring of service level agreements (SLAs), and the ongoing management of operational facilities to provide the agreed levels of service, seeking continually and proactively to improve service delivery and sustainability targets. | 2-5
Service acceptance | The achievement of formal confirmation that service acceptance criteria have been met, and that the service provider is ready to operate the new service when it has been deployed. (Service acceptance criteria are used to ensure that a service meets the defined service requirements, including functionality, operational support, performance and quality requirements). | 4-5
Configuration management | The lifecycle planning, control, and management of the assets of an organization (such as documentation, software, and service assets, including information relating to those assets and their relationships). This involves identification, classification, and specification of all configuration items (CIs) and the interfaces to other processes and data. Required information relates to storage, access, service relationships, versions, problem reporting, and change control of CIs. The application of status accounting and auditing, often in line with acknowledged external criteria such as ISO 9000, ISO/IEC 20000, ISO/IEC 27000, and security throughout all stages of the CI lifecycle, including the early stages of system development. | 2-5
Asset management | The management of the lifecycle for all managed assets (hardware, software, intellectual property, licenses, warranties, etc.) including security, inventory, compliance, usage, and disposal, aiming to protect and secure the corporate assets portfolio, optimize the total cost of ownership and sustainability by minimizing operating costs, improve investment decisions, and capitalize on potential opportunities. Knowledge and use of international standards for asset management and close integration with security, change, and configuration management are examples of enhanced asset management development. | 2-5
Change management | The management of change to the service infrastructure including service assets, configuration items, and associated documentation. Change management uses requests for change (RFC) for standard or emergency changes, and changes due to incidents or problems to provide effective control and reduction of risk to the availability, performance, security, and compliance of the business services impacted by the change. | 2-5
Release and deployment | The management of the processes, systems, and functions to package, build, test, and deploy changes and updates (which are bounded as "releases") into a live environment, establishing or continuing the specified service, to enable controlled and effective handover to operations and the user community. | 3-5
System software | The provision of specialist expertise to facilitate and execute the installation and maintenance of system software such as operating systems, data management products, office automation products, and other utility software. | 3
Capacity management | The management of the capability, functionality, and sustainability of service components (including hardware, software, network resources, and software/infrastructure as a service) to meet current and forecast needs in a cost-efficient manner aligned to the business. This includes predicting both long-term changes and short-term variations in the level of capacity required to execute the service, and deployment, where appropriate, of techniques to control the demand for a particular resource or service. | 4-5
Security administration | The provision of operational security management and administrative services. This typically includes the authorization and monitoring of access to EIT facilities or infrastructure, the investigation of unauthorized access, and compliance with relevant legislation. | 1-5
Penetration testing | The assessment of organizational vulnerabilities through the design and execution of penetration tests that demonstrate how an adversary can either subvert the organization's security goals (e.g., the protection of specific intellectual property) or achieve specific adversarial objectives (e.g., establishment of a covert command and control infrastructure). Pen test results provide deeper insight into the business risks of various vulnerabilities. | 4-6
Application support | The provision of application maintenance and support services, either directly to users of the systems or to service delivery functions. Support typically includes investigation and resolution of issues and may also include performance monitoring. Issues may be resolved by providing advice or training to users, devising corrections (permanent or temporary) for faults, making general or site-specific modifications, updating documentation, manipulating data, or defining enhancements. Support often involves close collaboration with the system's developers or with colleagues specializing in different areas, such as database administration or network support. | 2-5
EIT infrastructure | The operation and control of the EIT infrastructure (typically hardware, software, data stored on various media, and all equipment within wide and local area networks) required to deliver and support EIT services and products to meet the needs of a business. Includes preparation for new or changed services, operation of the change process, the maintenance of regulatory, legal, and professional standards, the building and management of systems and components in virtualized computing environments, and the monitoring of performance of systems and services in relation to their contribution to business performance, their security, and their sustainability. | 1-4
Database administration | The installation, configuration, upgrade, administration, monitoring, and maintenance of databases. | 2-4
Storage management | The planning, implementation, configuration, and tuning of storage hardware and software covering online, offline, remote, and offsite data storage (backup, archiving, and recovery) and ensuring compliance with regulatory and security requirements. | 3-5
Network support | The provision of network maintenance and support services. Support may be provided both to users of the systems and to service delivery functions. Support typically takes the form of investigating and resolving problems and providing information about the systems. It may also include monitoring their performance. Problems may be resolved by providing advice or training to users about the network's functionality, correct operation or constraints, by devising work-arounds, correcting faults, or making general or site-specific modifications. | 2-5
Problem management | The resolution (both reactive and proactive) of problems throughout the information system lifecycle, including classification, prioritization, and initiation of action, documentation of root causes, and implementation of remedies to prevent future incidents. | 3-5
Incident management | The processing and coordination of appropriate and timely responses to incident reports, including channeling requests for help to appropriate functions for resolution, monitoring resolution activity, and keeping clients apprised of progress towards service restoration. | 2-5
Facilities management | The planning, control, and management of all the facilities which, collectively, make up the EIT estate. This involves provision and management of the physical environment, including space and power allocation, and environmental monitoring to provide statistics on energy usage. Encompasses physical access control, and adherence to all mandatory policies and regulations concerning health and safety at work. | 3-5
Customer service support | The management and operation of one or more customer service or service desk functions. Acting as a point of contact to support service users and customers reporting issues, requesting information, access, or other services. | 1-5

10.2 European Competency Framework

The European Union's European e-Competence Framework (e-CF) has 40 competences and is used by a large number of companies, qualification providers, and others in public and private sectors across the EU. It uses five levels of competence proficiency (e-1 to e-5). No competence is subject to all five levels.

The e-CF is published and legally owned by CEN, the European Committee for Standardization, and its National Member Bodies (www.cen.eu). Its creation and maintenance have been co-financed and politically supported by the European Commission, in particular, DG (Directorate General) Enterprise and Industry, with contributions from the EU ICT multi-stakeholder community, to support competitiveness, innovation, and job creation in European industry. The Commission works on a number of initiatives to boost ICT skills in the workforce. Versions 1.0 to 3.0 were published as CEN Workshop Agreements (CWA). The e-CF 3.0 CWA 16234-1 was published as an official European Norm (EN), EN 16234-1. For complete information, see http://www.ecompetences.eu.

e-CF Dimension 2 | e-CF Dimension 3
A.2. Service Level Management (PLAN): Defines, validates and makes applicable service level agreements (SLAs) and underpinning contracts for services offered. Negotiates service performance levels taking into account the needs and capacity of stakeholders and business. | Level 3-4
C.1. User Support (RUN): Responds to user requests and issues, recording relevant information. Ensures resolution or escalates incidents and optimizes system performance in accordance with predefined service level agreements (SLAs). Understands how to monitor solution outcome and resultant customer satisfaction. | Level 1-3
C.2. Change Support (RUN): Implements and guides the evolution of an ICT solution. Ensures efficient control and scheduling of software or hardware modifications to prevent multiple upgrades creating unpredictable outcomes. Minimizes service disruption as a consequence of changes and adheres to defined service level agreements (SLAs). Ensures consideration and compliance with information security procedures. | Level 2-3
C.3. Service Delivery (RUN): Ensures service delivery in accordance with established service level agreements (SLAs). Takes proactive action to ensure stable and secure applications and ICT infrastructure to avoid potential service disruptions, attending to capacity planning and to information security. Updates operational document library and logs all service incidents. Maintains monitoring and management tools (i.e., scripts, procedures). Maintains IS services. Takes proactive measures. | Level 1-3
C.4. Problem Management (RUN): Identifies and resolves the root cause of incidents. Takes a proactive approach to avoidance or identification of root cause of ICT problems. Deploys a knowledge system based on recurrence of common errors. Resolves or escalates incidents. Optimizes system or component performance. | Level 2-4
D.9. Personnel Development (ENABLE): Diagnoses individual and group competence, identifying skill needs and skill gaps. Reviews training and development options and selects appropriate methodology taking into account the individual, project, and business requirements. Coaches/mentors individuals and teams to address learning needs. | Level 2-4
E.4. Relationship Management (MANAGE): Establishes and maintains positive business relationships between stakeholders (internal or external) deploying and complying with organizational processes. Maintains regular communication with customer/partner/supplier, and addresses needs through empathy with their environment and managing supply chain communications. Ensures that stakeholder needs, concerns, or complaints are understood and addressed in accordance with organizational policy. | Level 3-4

10.3 i Competency Dictionary

The Information Technology Promotion Agency (IPA) of Japan has developed the i Competency Dictionary (iCD) and translated it into English, and describes it at https://www.ipa.go.jp/english/humandev/icd.html. The iCD is an extensive skills and tasks database, used in Japan and southeast Asian countries. It establishes a taxonomy of tasks and the skills required to perform the tasks. The IPA is also responsible for the Information Technology Engineers Examination (ITEE), which has grown into one of the largest scale national examinations in Japan, with approximately 600,000 applicants each year.

The iCD consists of a Task Dictionary and a Skill Dictionary. Skills for a specific task are identified via a "Task x Skill" table. (See Appendix A for the task layer and skill layer structures.) EITBOK activities in each chapter require several tasks in the Task Dictionary.

The table below shows a sample task from iCD Task Dictionary Layer 2 (with Layer 1 in parentheses) that corresponds to activities in this chapter. It also shows the Layer 2 (Skill Classification), Layer 3 (Skill Item), and Layer 4 (knowledge item from the IPA Body of Knowledge) prerequisite skills associated with the sample task, as identified by the Task x Skill Table of the iCD Skill Dictionary. The complete iCD Task Dictionary (Layer 1-4) and Skill Dictionary (Layer 1-4) can be obtained by returning the request form provided at http://www.ipa.go.jp/english/humandev/icd.html.

Task Dictionary | Skill Dictionary
Task Layer 1 (Task Layer 2) | Skill Classification | Skill Item | Associated Knowledge Items
System user support (service desk) | Operation of services | Service desk operation methods | See the list that follows
  • Methods for CS improvement activities (formulated by a plan)
  • EIT services standards
  • Functions of incident management process
  • Understanding of customer operations
  • Customer regular reporting
  • Complaint level understanding
  • Complaint management
  • Functions of call tracking system
  • Service desk
  • Service level monitoring and evaluation
  • Service continuity management
  • System audit standards
  • Management of stakeholders
  • Security standards
  • Software configuration
  • Utilization of knowledge base
  • Network configuration
  • Hardware configuration
  • Facility configuration
  • License management
  • Resource management
  • Release management
  • Availability management
  • Operation monitoring and trend analysis
  • Understanding of operation influence level
  • Knowledge related to analysis of business operations and current EIT environment analysis
  • Contract information management
  • Customer satisfaction management
  • Broad area customer management
  • Asset management
  • Information assets management (knowledge management)
  • Corrective measures
  • Measurement, analysis, and improvement
  • Knowledge related to system building and schedule planning
  • Knowledge on in-person/interview/conversation skills
  • Knowledge and utilization of intellectual property
  • Tools and techniques for quality planning and management
  • Knowledge related to quality requirements, system building, and schedule planning
  • Preventive measures

11 Key Roles

These roles are common to ITSM:

  • Availability Manager
  • Capacity Manager
  • Change Manager
  • Configuration Manager
  • Financial Manager
  • EIT Operations Manager
  • Incident Manager
  • Problem Manager
  • Release Manager
  • Service Level Manager

Other key roles are:

  • Database Administrator
  • Development Team
  • Network Manager
  • Product Management

12 Standards

ANSI/AIAA G-043A-2012e, ANSI/AIAA Guide to the Preparation of Operational Concept Documents

IEEE Std 828™-2012, IEEE Standard for Configuration Management in Systems and Software Engineering

ISO 10004:2012, Quality management—Customer satisfaction—Guidelines for monitoring and measuring

ISO 10007:2003, Quality management systems—Guidelines for configuration management

ISO 18238:2015, Space systems—Closed loop problem solving management

ISO/IEC 16350:2015, Information technology—Systems and software engineering—Application management

ISO/IEC 19770-1:2012, Information technology—Software asset management—Part 1: Processes and tiered assessment of conformance

ISO/IEC 20000-1:2011, (IEEE Std 20000-1:2013) Information technology—Service management—Part 1: Service management system requirements

ISO/IEC 20000–2:2012, Information technology—Service management—Part 2: Guidance on the application of service management systems

ISO/IEC TR 20000–10:2015, Information technology—Service management—Part 10: Concepts and terminology

ISO/IEC TR 20000–11:2015, Information technology—Service management—Part 11: Guidance on the relationship between ISO/IEC 20000-1:2011 and related frameworks: ITIL®

ISO/IEC TR 20000-12—Information technology—IT Service management—Part 12: Guidance on the relationship between ISO/IEC 20000-1:2011 and service management frameworks: CMMI-SVC®

ISO/IEC/IEEE 14764-2006, Software Engineering—Software Lifecycle Processes—Maintenance

ISO/IEC 15939:2007, Systems and software engineering—Measurement process

13 References

[1] http://www.cioinsight.com/it-strategy/infrastructure/slideshows/how-mishandled-it-incidents-spiral-out-of-control.html taken from http://go.everbridge.com/ITCommunicationeBook-web.html.