Maintenance and Control
1 Introduction
Maintenance and Control are two sides of the same process. Maintenance activities ensure that a system remains operational and does not degrade over time. Maintenance activities preserve existing function. Control of a system manages the approval process for requested changes to a system, including defect fixes, evolution of third-party components, and in-house enhancements. Control activities evaluate and determine approval and schedules for changes to existing function, which are then implemented through the transition of a solution to operations (see Transition into Operations).
Considerable research over the past several decades has shown that the majority of expenses (75-80%) reported as maintenance costs are actually due to changes to systems in response to enhancement requests.[6] [7] Changes that add functionality almost always add value, and therefore, should not properly be called “maintenance.” They are instead properly seen as stages in the system’s evolution.
1.1 Maintenance
An EIT system (asset or solution) is maintained by activities performed to ensure that the system continues to be operational over time [1]. The maintenance team is responsible for two main functions:
- The maintenance team plans for, designs, and directs the maintenance necessary to prevent the deterioration and failure of a system, which can be due to defects, obsolescence, or environmental conditions [2].
- The maintenance team recommends changes to assure continuing environmental compatibility of a system due to evolution of its hardware and software as provided by vendors.
Maintenance is different from operations; however, the understanding of a system’s maintenance needs is closely connected to its operational function.[2]
- Operations and Support activities ensure that production systems operate consistently in a steady state of defined functionality. Operational support focuses outwardly on preserving execution of systems providing service to users.
- Maintenance evaluates the system’s operations over time in comparison to changing environment component standards (upgrades from vendors) and increasing age of components (failures due to normal use). Maintenance focuses inwardly on proactively preserving a system’s ability to provide defined functionality services to users over time.
Maintenance is different from evolution or enhancement (as shown in Table 1). In short, evolutionary development and enhancement change the functionality of a system or add functionality to it. Maintenance does not.
Table 1. Comparison of Maintenance with Operational Support and Evolutionary Development and Enhancement
Aspect | Operational Support | Maintenance | Evolutionary Development and Enhancement |
---|---|---|---|
Functionality effect | Provides consistent functionality and recovers from failures | Predicts activities necessary to preserve functionality or prevents failures | Changes or adds functionality |
Focus | Activity | Process | Process and activity |
User effect | Low to none | Low to none | High |
Active/Passive | Passive | Active | Active |
Frequency | Continual | Regularly scheduled | Scheduled for inclusion in development projects |
Standard Activities | Executes maintenance processes | Creates and implements maintenance processes | N/A |
To illustrate the differences, consider these concepts as they relate to a hot air balloon.
- Maintenance is an important part of risk management. It scans for and analyzes changes in current safety regulations, wind and weather conditions, and the condition of the heater, basket, and balloon fabric, which will determine what parts need to be replenished, repaired, or replaced. Maintenance determines the schedule and defines the standard checklist of pre- and post-flight tasks, as well as the schedule and criteria for performing occasional maintenance tasks, such as examining parts for signs of wear or failure, resulting in activities to patch the basket or balloon, or replace hoses, straps, heating elements, or ropes before they reach end-of-life.
- Operational support executes the scheduled standard tasks from Maintenance that keep the balloon safely airborne when in use, such as pre-flight unpacking, fuel tank loading, fuel consumption, equipment inventory, and post-flight packing.
- Evolution or enhancement adds new features, or changes out the heater, basket, or balloon to enable longer flights, more passengers, or more comfort during flight.
1.2 Control
Control is the process of ensuring only expected and approved changes are implemented in any system. Change requests may come through defect reports, external drivers such as patches or revisions from third-party software providers, changes to relevant laws or regulations (such as for tax or payroll systems, or privacy protection), or through feature enhancement requests. Change requests are collected and reviewed by a team of stakeholders in the organization, including members of the Operations, Finance, and Architecture teams, other business users, and software portfolio managers. Changes may be deferred (to be done later when more feasible), rejected (will never be done), or approved and prioritized for implementation (see Figure 1: Change Request Life Cycle).
Approved changes are then developed (see Construction) or acquired (see Acquisition), and placed into operations through the transition process.
Figure 1: Change Request Life Cycle
2 Goals & Principles
The basic goal for all of Enterprise IT (EIT) is to keep systems operating to provide value to the organization, despite defects discovered after installation, changes in laws, and advances in technology. The main goal for Maintenance and Control is to preserve operations over time through asset lifecycle management and control of changes to assets. This goal has three parts:
- Ensure service levels and stability through standard maintenance activities to prevent service disruption.
- Preserve service levels and functionality through approved changes to assets as provided by suppliers, required by law or regulation, or to repair a defect.
- Reduce risk to assets by designing achievable maintenance activities, removing obsolete assets, reviewing all requested changes, implementing only stakeholder-approved changes, and ensuring changes occur through defined standard construction, acquisition, and transition processes.
EIT risk management responsibilities include coordinating with disaster recovery planning, testing, and evaluation. This responsibility is especially important for EIT, because the business of the enterprise almost certainly cannot be conducted without EIT systems operation.
2.1 Guiding Principles
DO:
- Design maintenance activities to be simple, straightforward, and consistent across systems to minimize the need for specialized skills.
- Have good relationships with suppliers to enable open and honest discussions about any offered upgrades or fixes, and to ensure that no side-effects or unintended consequences occur.
- Determine how much risk the organization is willing to take on when considering an upgrade – early adopters might get considerations for any defects they discover, while late adopters usually have fewer implementation issues.
- Ensure that the Business stakeholders have a presence in change request reviews.
- Ensure that change request reviews include prioritization based upon the value to the business.
DO NOT:
- Change more than necessary to reduce risk of failure, fix a defect, or meet a requirement.
- Approve changes to data that are not implemented through existing applications or interfaces.
3 Context Diagram
Figure 1: The Context Diagram for Maintenance and Control
4 Maintenance Responsibilities
Maintenance is defined as activities required to keep a system operational and responsive after it is accepted and placed into production. The maintenance of EIT systems includes preventive actions (risk reduction) and corrective actions (fixes) that preserve consistent operations. In EIT systems, maintenance can be performed on hardware, software, and data.
As part of portfolio management, EIT and the Enterprise are expected to set policies about support levels for operational systems. The support level for any given system will be determined by weighing the value of the service provided against the cost to support it. The goal is to not spend more on a system than its value to the enterprise.
The capability to be maintainable must be designed into all systems. This includes the system’s architecture, and by extension, its maintainability requirements. The ability to maintain a system is determined by the processes required (a function of the system’s design) and the availability of resources to execute those processes [2]. Maintenance processes must be monitored and measured for continuous improvement.
System evaluations must include maintenance tools and processes. A system's package should include its expected maintenance activities (much like regular oil changes on cars), tools to monitor system behavior, and possibly tools to perform maintenance activities.
4.1 Define Maintenance Activities
Four types of maintenance are defined: corrective, preventive, adaptive, and perfective.
- Corrective Maintenance: Corrective maintenance can be either unscheduled (emergency) or scheduled.
- In an incident (ITIL) or emergency situation, maintenance activities occur to recover and bring a system back into operation. These occurrences can be reduced by proper implementation and execution of the other types of maintenance, as well as proper control, which reduces the risk of emergencies. [8]
- Scheduled corrective maintenance occurs to remove existing defects in a system, which are related to a problem, or are due to an issue with a change applied to the system.
- Preventive Maintenance: Preventive maintenance is scheduled. It is set up based upon analysis of similar systems to find patterns of flaws and to replace components before they fail. This type of maintenance may be required by vendors, regulations, or laws, especially in safety-related systems. Even when not so required, preventive maintenance tracks such things as the aging of components relative to their expected useful lives, and inspects wires and connections for signs of wear. This is an important part of risk management. By extension, preventive maintenance activities require certain interactions with facilities management, for example, with regard to provisions for power back-ups for EIT systems in case of power failures.
- Adaptive Maintenance: The less common adaptive maintenance occurs when EIT changes one system to adapt to changes in another system. This is actually a type of enhancement, because the entire environment is enhanced when one part is upgraded. An adaptive maintenance task can be as simple as changing a configuration in one system to adapt to an upgrade in another system, using a different driver to connect databases because the other system’s database software was upgraded, or increasing data capacity via a parameter change. On the other hand, it could be a complex set of operations such as ones that would enable increasing the number of concurrent users.
- Perfective Maintenance: Perfective maintenance is a misnomer, and the term is used less often. It is defined as the process of improving or evolving a system in some manner, which is actually enhancement, not maintenance (see Evolutionary Responsibilities).
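In summary:
Corrective | Preventive | Adaptive | Perfective (Enhancements) |
---|---|---|---|
Restores operation after an incident, or removes existing defects on a schedule | Inspects and replaces components on a schedule, before they fail | Changes one system to adapt to a change in another system | Improves or evolves a system; properly an enhancement, not maintenance |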
Defining maintenance activities depends on:
- The type of system and component. The activities required vary with the system and with the component type within the system. For example, maintenance of disk drives may vary depending on whether they are installed in individual servers or in a storage cluster.
- Expected support levels and system priorities. As assigned by the organization, some systems will be designated as ‘mission-critical’ and therefore will be maintained to minimize the risk of failure, whereas non-production systems may be on a slower maintenance schedule, or have lower priority for resource assignment during times of peak production usage or risk. A high-priority system may be assigned maintenance activities that cycle out components that have a predictable lifecycle, while a low-priority system may be assigned a support level that allows only mandatory support activities in response to component failure.
- System economics. The economics of a system can be described as the difference between the benefits the service brings to the business and the cost to maintain and support it.
Some measurements that determine the benefits of a solution are:
- Criticality of the business processes supported
- Number of users accessing the system
- Number of new transactions being processed
- Amount of time saved by using the functions (e.g., versus manual procedures)
These measurements should be compared with the total cost of maintenance and support of the system, which includes:
- Vendor support costs (maintenance, subscription or licensing, or leasing)
- Infrastructure costs (server, storage, rack space, power, cooling)
- Technical stack costs (operating system, utilities, printing, data transfer, reporting, and so on)
- FTE costs of the support, maintenance, and operations [3]
When the cost becomes greater than the benefits, it is time to retire or replace the service. Aging out a system, and thereby eliminating its utility, maintenance, and licensing costs, should reduce EIT costs, as long as the functional replacement does not add to the maintenance burden for EIT staff or provide a reduced benefit to the business (see Transition into Operations for the planning and processes for these functions).
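The benefit-versus-cost comparison can be sketched as a small calculation. This is only an illustration under stated assumptions: the `SystemEconomics` class, its field names, and the sample figures are hypothetical, and the sketch assumes the benefits have already been expressed in the same currency units as the costs.

```python
from dataclasses import dataclass


@dataclass
class SystemEconomics:
    """Annual figures for one system, all in the same currency (an assumption)."""
    annual_benefit: float    # estimated business value delivered per year
    vendor_support: float    # maintenance, subscription or licensing, or leasing
    infrastructure: float    # server, storage, rack space, power, cooling
    technical_stack: float   # operating system, utilities, printing, reporting
    staff: float             # FTE cost of support, maintenance, and operations

    def total_cost(self) -> float:
        return (self.vendor_support + self.infrastructure
                + self.technical_stack + self.staff)

    def net_value(self) -> float:
        return self.annual_benefit - self.total_cost()

    def should_review_for_retirement(self) -> bool:
        # Cost exceeding benefit is the signal for retiring or replacing the service.
        return self.total_cost() > self.annual_benefit


# Hypothetical figures for an aging system.
legacy = SystemEconomics(annual_benefit=120_000, vendor_support=60_000,
                         infrastructure=30_000, technical_stack=15_000,
                         staff=40_000)
print(legacy.total_cost(), legacy.net_value(), legacy.should_review_for_retirement())
```

In practice the hard part is estimating the benefit figure, which is why the measurements listed above (criticality, users, transactions, time saved) matter more than the arithmetic.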
Generally, maintenance activity design should take into account:
- The importance of the system to the organization (priority) – systems vary in importance to the organization, depending on the function each provides and on whether the organization can continue to transact business without that system in operation. Mission-critical systems will have a higher priority for recovery from failure, and therefore will have more monitoring and maintenance activities performed to prevent any failures from occurring at all.
- Maintenance requirements due to regulations and laws – Some industries have monitoring and maintenance regulations, rather than letting each organization define their own. In these cases, there may be reporting of monitoring and maintenance activities to an outside organization as well, to manage compliance.
- The risk and business cost of a failure occurring –
- Some components may have a low risk of failure, but once a failure occurs, the business cost and recovery costs are high. Older systems may need parts that become scarce or expensive over time. Technical debt (the additional cost of maintaining and/or upgrading systems that lag behind current releases or technology) increases as components age, which may make the cost of a system failure catastrophic.
- Some components may have a high risk of failure, but recovery is cheap and quick, such as by swapping out drives or cards in arrays, such that the system overall has a low risk of failure, even though components are replaced frequently.
- Maintenance recommendations from vendors – Any system purchased or leased will have vendor-recommended activities to keep the system operational. Of course, the vendor will probably err on the side of more frequent and/or expensive activities, so each organization must determine for itself its needs. Cloud systems remove this responsibility from the lessee as part of the Platform-as-a-Service, although maintenance activity costs are built into the contract.
- The expected life of each component (lifecycle) under normal use – all components will have an expected life under normal conditions. Some components are less durable than others or consumable, and therefore will need more frequent maintenance.
- The probability of each component’s failure at or before scheduled maintenance – as a function of its expected life, the probability that a component will fail grows over time, or after extraordinary use or strain.
- The cost of the maintenance activity in both parts and labor – a balance must be found between overprotection: continuous monitoring and replacement at the first sign of trouble (costly) and negligence: inadequate attention to monitoring or delayed maintenance (which leads to technical debt).
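One way to weigh failure risk against maintenance cost is an expected-cost comparison. The sketch below is a simplification, not a prescribed method: the function names and figures are hypothetical, and it assumes that performing the maintenance activity fully prevents the failure during the period considered.

```python
def expected_failure_cost(failure_probability: float,
                          business_cost: float,
                          recovery_cost: float) -> float:
    """Expected loss over the period that failure_probability refers to."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    return failure_probability * (business_cost + recovery_cost)


def maintenance_is_justified(failure_probability: float,
                             business_cost: float,
                             recovery_cost: float,
                             maintenance_cost: float) -> bool:
    """True when maintenance costs less than the loss it is assumed to prevent."""
    return maintenance_cost < expected_failure_cost(
        failure_probability, business_cost, recovery_cost)


# Low risk of failure, but a failure is expensive: maintenance pays off.
rare_but_costly = maintenance_is_justified(0.02, 500_000, 100_000, 5_000)

# High risk of failure, but recovery is cheap and quick (for example, swapping a drive).
frequent_but_cheap = maintenance_is_justified(0.5, 200, 300, 1_000)

print(rare_but_costly, frequent_but_cheap)
```

The two sample calls mirror the two cases described under the risk and business cost of a failure.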
4.2 Define Maintenance Schedules
Maintenance activities almost always interfere with normal processing. Components must be made unavailable or incur additional strain from both production and maintenance activities occurring simultaneously. Only in extreme situations should maintenance activities occur on online components. All systems must be designed to enable offline maintenance, even if infrequent, as that ability will also be used in disaster (Incident) situations (see [8] and Disaster Recovery).
Maintenance activities occur either on a schedule or in response to an event. For some components, the recommended maintenance includes both scheduled and event-driven activities. Either kind can be automated, so that a maintenance activity runs at a set time or when an event occurs.
- Scheduled maintenance activities have defined time periods for activities, and each activity is placed on the schedule according to the organization’s needs. This type of maintenance is proactive.
- Event-based maintenance occurs when a monitoring threshold is reached (for example, when it is time to clean out the shared temp area), or a component signals a need for attention. This type of maintenance is reactive.
Almost all vendors provide suggested maintenance schedules or monitoring thresholds for the components they support. Schedules are expressed as activities to be performed per time period (week, month, etc.). Otherwise, a list of events and suggested actions is provided.
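The two kinds of trigger can be sketched as follows. This is a minimal illustration: the function names, the weekly interval, and the 85 percent threshold are assumptions chosen for the example, not values from any vendor.

```python
from datetime import datetime, timedelta


def scheduled_task_due(last_run: datetime, interval: timedelta, now: datetime) -> bool:
    """Scheduled trigger: the defined time period since the last run has elapsed."""
    return now - last_run >= interval


def event_task_due(measured_value: float, threshold: float) -> bool:
    """Event-based trigger: a monitored value has reached its threshold."""
    return measured_value >= threshold


now = datetime(2024, 3, 15, 2, 0)

# A weekly inspection last performed eight days ago.
weekly_check_due = scheduled_task_due(datetime(2024, 3, 7, 2, 0), timedelta(weeks=1), now)

# A shared temp area at 91 percent of capacity, with an 85 percent threshold.
temp_cleanup_due = event_task_due(measured_value=91.0, threshold=85.0)

print(weekly_check_due, temp_cleanup_due)
```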
In distributed (failover) or high-availability systems, maintenance on components may occur while the system is online (even if the component is not), as other components will take over processing from the ones undergoing maintenance. So maintenance activities may occur during business hours as an option.
For systems without failover, plan the most intrusive maintenance activities to occur during non-peak times, and include options for taking components offline for a time to perform activities that may adversely affect business processing. Most maintenance activities occur during times of lower business processing, such as overnight or on weekends, although with the advances in distributed systems and networks, these activities require less downtime and lower frequency.
4.3 Design and Implement Standard Maintenance Processes
Maintenance processes should be designed as simple, easy-to-perform, repeatable activities that occur on a schedule or in response to an event (see Define Maintenance Schedules) and that are automated as much as possible. In many cases, those performing manual maintenance activities are not knowledgeable about the component or system; instead, these operators follow the instructions provided.
Manual maintenance activities that are onerous, intricate, difficult, or not clearly documented (and so not understood by the operator) are less likely to be performed correctly without monitoring. To increase success, design each task to be simple, break complex tasks into simpler parts, document the activity clearly, automate as much as possible, and train the operators on any manual tasks.
For example, one important maintenance activity is the routine purging of temporary data areas, such as shared databases, along with routine cleanup of old files and logs.
- An automatic activity can be scheduled to remove data older than a specific time regularly. The maintenance team develops and tests the scripts that do the cleanup, and then submits them to the Operations and Support teams to run on a regular basis (in a smaller shop this may be the same team).
- If DBAs report that temporary work areas are running out of space, maintenance activities need to identify acceptable remedial actions, such as by temporarily adding space, temporarily restricting access to the shared space, or automatically dropping temporary data objects based on previously identified criteria.
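A cleanup script of the kind described in the first item might look like the sketch below. It is illustrative only: the function name and the seven-day retention period are assumptions, and a real script would follow the organization's own retention rules. The function defaults to a dry run so that it can be tested before it is handed to Operations.

```python
import time
from pathlib import Path


def purge_old_files(directory: Path, max_age_days: float, dry_run: bool = True) -> list:
    """Return files in directory older than max_age_days; delete them unless dry_run."""
    cutoff = time.time() - max_age_days * 86_400  # seconds per day
    purged = []
    for path in sorted(directory.iterdir()):
        # Only plain files are considered; subdirectories are left untouched.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            purged.append(path)
    return purged


# Example call (the path is hypothetical):
# purge_old_files(Path("/var/tmp/shared"), max_age_days=7, dry_run=False)
```

Running with `dry_run=True` first lets the maintenance team review exactly what would be removed, which fits the develop-and-test step before the script is scheduled.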
4.4 Design Standard Alert Thresholds, Reports, and Forecasts
Over time, the use of infrastructure resources such as storage, CPUs, network bandwidth, or data transfer rates grows. This ‘capacity’ growth should be monitored, and trends reported, to help determine the future capacity needs of the system.
Commonly, system capacity is designed to meet an initial workload, and to handle projected usage changes going forward. There are two main usage patterns:
- Slow steady growth – for certain industries (healthcare for example), there are few times when the average usage spikes either up or down greater than a standard deviation. Capacity increases can then be planned in advance to be ahead of the usage slope.
- Peaks and Valleys – for certain industries (such as retail), there are standard times when the capacity needs to handle several times the average usage. Capacity can be planned to either always be able to handle the highest peak (which means most of the year there is unused capacity being supported), or to handle an average capacity with the ability to temporarily expand capacity during peak periods (which can also be expensive if the capacity is leased from a vendor).
Capacity limits affect availability: a system that cannot allocate storage areas or CPU power when needed begins to experience downtime, which in turn influences the amount of resources that must be allocated to the system to meet future business demands. Alert thresholds, reports, and forecasts should therefore meet these criteria:
- Thresholds must provide enough headroom or lead time to allow for analysis before taking action.
- Reports must provide enough information to enable appropriate decisions.
- Forecasts must be based on enough data to rule out anomalies, which would otherwise lead to either excessive or inadequate capacity growth, both of which are costly.
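For the slow, steady growth pattern, a simple linear trend is one way to estimate how much lead time remains before a threshold is reached. The sketch below is an assumption-laden illustration: the function names, the monthly sampling, and the figures are hypothetical, and a straight-line fit is not suitable for the peaks-and-valleys pattern.

```python
from typing import List, Optional, Tuple


def fit_linear_trend(samples: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for evenly spaced samples (one per period)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(threshold: float, samples: List[float]) -> Optional[float]:
    """Periods after the latest sample until the trend reaches threshold.

    Returns None when usage is flat or shrinking, and 0.0 when the trend
    has already reached the threshold.
    """
    slope, intercept = fit_linear_trend(samples)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (len(samples) - 1))


# Hypothetical storage use in TB, sampled monthly; alert threshold at 80 TB.
monthly_usage_tb = [50, 52, 54, 56, 58]
print(periods_until(80, monthly_usage_tb))
```

More samples make the fit less sensitive to a single anomalous month, which is the point of basing forecasts on enough data.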
Excessive capacity may have unintended consequences such as:
- More time needed for back-ups, or other scheduled maintenance
- Increased power needed for equipment
- Need to upgrade computer chassis to support capacity increases
- Increased floor space / footprint for equipment
- Increased cooling needs to maintain preferred data center temperatures
- Changes in UPS needs
- Assumptions by users that space is infinite and therefore efficiency in storage and processing is unnecessary
Inadequate capacity may result in more frequent system additions, which carry their own costs: reduced bulk discounts and more frequent maintenance events to add the capacity.
Today’s technologies are providing great advancements in on-demand capacity allocation in CPU processing (compute capacity) and storage capacity. This capability is available both for in-house delivered infrastructure, as well as with cloud computing.
5 Evolutionary Responsibilities
Requests to add functionality or to change the way existing functionality works are Enhancement Requests (ERs). Acting on enhancement requests without sufficient analysis can be very dangerous to the overall health of the system. In fact, enhancement requests, done in isolation, contribute to the problem of spaghetti code often encountered in legacy systems. For that reason, best practice is now to recognize that enhancement activities evolve systems. In other words, evolution is not the same as simply maintaining a system. Enhancement requests should be collected and addressed in groups, within development projects. See Construction for the tools and techniques.
An EIT organization can submit ERs to third-party vendors, which may or may not be acted upon. Vendors have their own internal systems for evaluating ERs, whether from customers or generated internally. Thus, vendor-provided components evolve independently from any customer using those components. When vendors notify organizations of upgrades (new versions or patches), the maintenance team must ensure that all changes to the component, and their potential impacts, are well understood before recommending installation of an upgrade and going through the Transition process (see Transition into Operations).
If a component has been customized by the organization (not just configured, but significantly changed from the Off-The-Shelf installed version), it can become increasingly difficult to retain those customizations in the component as new versions become available. This leads to components falling behind, which increases Technical Debt both in opportunity cost from the inability to take advantage of functional improvements (for example, security improvements), and in increasing the eventual cost when an upgrade is unavoidable. Often, local modifications of a third-party system make it very difficult to accept new versions, because so much work would be required to carry those modifications forward to the new version, and the vendor may not be inclined (without significant cost) to include the customizations in their base product.
Evolution is a continuous change from a lesser, simpler, or worse state to a higher or better state. However, acting on vendor notifications without sufficient analysis can be very dangerous to the overall health of the system. No organization has components from only a single vendor, so the entire environment must be evaluated to assess the impact an upgrade to a component may have on other systems (ripple effect). Some upgrades may require other systems to change how they interface or connect to the component being upgraded (adaptive maintenance).
It is also often the case that one component has an upgrade available, but other connected components will not be compatible with the upgrade until a later time. Careful evaluation of the entire component inventory is essential to prevent an upgrade from causing an Incident in another system, disrupting business, and requiring remediation, such as backout (see Transition into Operations).
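Part of that evaluation can be sketched as a lookup against the component inventory. The sketch is hypothetical throughout: the inventory layout, the component names, and the version strings are assumptions for illustration, and a real inventory would usually live in a configuration management database rather than a dictionary.

```python
from typing import Dict, List, Set


def incompatible_dependents(component: str,
                            new_version: str,
                            supported: Dict[str, Dict[str, Set[str]]]) -> List[str]:
    """List systems that connect to component but are not known to work with new_version.

    supported[system][component] holds the versions of component that the
    system has been confirmed to work with.
    """
    blocked = []
    for system, requirements in supported.items():
        versions = requirements.get(component)
        if versions is not None and new_version not in versions:
            blocked.append(system)
    return sorted(blocked)


# Hypothetical inventory: which database versions each connected system supports.
inventory = {
    "payroll": {"database": {"12", "13"}},
    "reporting": {"database": {"12"}},
    "intranet": {"webserver": {"2.4"}},
}

print(incompatible_dependents("database", "13", inventory))
```

A non-empty result means the upgrade should wait, or the listed systems need adaptive maintenance first.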
Both EIT management and the business product owners have a responsibility to ensure that a solution does not fall behind in either service currency (i.e., meeting the business need) or product currency (i.e., vendor support and maintenance). Allowing a system to lapse from vendor supportability is negligence on the organization’s part, unless the component is placed in "sunset status" with a defined retirement date. In that case, there may be little point in upgrading to the latest version.
Keeping a system around only to be used for historical reference is a waste of resources — convert the data to a currently-readable archive and disconnect the system.
6 Change Control Systems and Processes
The maintenance function has the responsibility for establishing change control mechanisms for any and all types of changes requested for installed systems. While Operations and Support uses the change management system for recording and tracking, the Maintenance function assures the orderly progression of requests through to resolution.
6.1 Define and Implement a Standard Change Management Process
A Change Management (CM) system is a set of processes that defines, at a high level, how subsystems can be introduced or changed. The CM tracking system includes a change request process and a defect handling process. These processes are generic across all component types. For example, a change request to add new hardware is treated exactly the same as a change request to enhance functionality (i.e., both CRs are assigned, approved, etc.).
In order for this process to work effectively, a number of Change Management mechanisms must be established and consistently used. First and foremost, a change control authority (a Change Control Board (CCB) or Change Advisory Board (CAB)) has to be defined and established. It is typically chaired by the maintenance function, and includes representatives of all stakeholders: product owners, developers, testers, users, operations and support. In addition, for some types of requests, specialists (such as enterprise architects and the original business analysts) may be called in.
In order to support and facilitate the functioning of the Board, specific mechanisms need to be in place. These include things such as:
- A numbering scheme for defect reports, enhancement requests, adaptive requests, and prevention requests
- A scheme for categorizing, assessing risk, and prioritizing the requests taking into account how severe an incident is and how many users it might affect
- A scheme for siphoning off enhancement requests into queues for bundling requests into development projects
- Assuring that all requests are entered and tracked in a change management system
- Defining a closed-loop change management process so that:
- All requests are tracked through resolution (Deferred, Rejected, Approved, Change Made, Change Tested, Change Released)
- A clear path of request reporting, reviews, approvals, and resolution is defined
- All requests come through the same system that the Operations help desk (aka Support) uses
- Tracking and reporting provide trend analysis for such things as error-prone areas or modules and the volatility of change requests (especially defect reports)
- Assuring that action on requests is reflected in the Operations CM database system and the development CM system
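A closed-loop process of this kind can be sketched as a small state machine. The state names come from the list above, with Submitted added as the starting point; the numbering scheme, the class name, and especially the permitted transitions are assumptions for illustration, since each organization defines its own.

```python
from dataclasses import dataclass, field
from itertools import count

# Which state may follow which. These transitions are an assumption.
ALLOWED_TRANSITIONS = {
    "Submitted": {"Deferred", "Rejected", "Approved"},
    "Deferred": {"Approved", "Rejected"},
    "Approved": {"Change Made"},
    "Change Made": {"Change Tested"},
    "Change Tested": {"Change Released"},
    "Rejected": set(),
    "Change Released": set(),
}
CLOSED_STATES = {"Rejected", "Change Released"}

_sequence = count(1)  # simple sequential numbering scheme (hypothetical)


@dataclass
class ChangeRequest:
    kind: str      # "defect", "enhancement", "adaptive", or "preventive"
    summary: str
    number: int = field(default_factory=lambda: next(_sequence))
    state: str = "Submitted"
    history: list = field(default_factory=lambda: ["Submitted"])

    def move_to(self, new_state: str) -> None:
        """Advance the request, refusing any transition the process does not define."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.state = new_state
        self.history.append(new_state)

    @property
    def is_closed(self) -> bool:
        return self.state in CLOSED_STATES


cr = ChangeRequest("enhancement", "Add CSV export to the monthly report")
cr.move_to("Deferred")
cr.move_to("Approved")
print(cr.number, cr.state, cr.history)
```

Because every transition is recorded in `history`, the same structure supports the trend reporting mentioned above.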
6.2 Define Request Intake and Evaluation Process
A typical change control process looks something like this:
- Operations receives software change requests via its Help Desk function and enters them into the incident-tracking system. Operations does not make changes to software.
- Defects and adaptive requests are automatically sent to the responsible development team (or vendor relationship manager for acquired components or systems), where they are assigned by the appropriate manager according to relative priority.
- Approved enhancement requests are periodically reviewed by the Product Manager for inclusion in later releases of the system. In EIT organizations that do not have a Product Manager function, a suitable user representative is tapped for this role in the CCB.
- Preventive requests are reviewed by the Product Manager and the CCB.
- The status of all changes to a system is reviewed by the CCB prior to release.
- The CCB is composed of representatives of all stakeholders (development, testing, documentation and training, operations, and product management).
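The routing in the steps above can be sketched as a single function. The function name, the request-kind strings, and the `acquired_component` flag are hypothetical; the destinations are the ones named in the list.

```python
from typing import List


def route_request(kind: str, acquired_component: bool = False) -> List[str]:
    """Return who handles or reviews a request of the given kind."""
    if kind in ("defect", "adaptive"):
        # Acquired components or systems go to the vendor relationship manager.
        if acquired_component:
            return ["vendor relationship manager"]
        return ["development team"]
    if kind == "enhancement":
        return ["Product Manager"]
    if kind == "preventive":
        return ["Product Manager", "CCB"]
    raise ValueError(f"unknown request kind: {kind!r}")


print(route_request("defect"))
print(route_request("adaptive", acquired_component=True))
print(route_request("preventive"))
```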
6.3 Change Request Processing and Approval Flow
Configuration Management is the foundation of a software project. It is the management of change to components and systems. Without it, no matter how talented the staff, how large the budget, how robust the development and test processes, or how technically superior the development tools, project discipline will collapse and success will be left to chance.
Once a Change Request (CR) is assigned and approved, its “owner” manages the necessary change via the defined CM process. Procedures may differ depending on the type of change. For example, a developer may be required to apply an operating system patch. An operating system patch will be applied differently than a system release.
The owner, at various stages of the configuration control process, will be able to identify where in the high-level CM process the change is. For example, if an operating system patch is ready to be applied in a test environment, the CR should be marked as Ready_for_testing. Once the patch has been successfully tested, the CR should be marked Tested.
The CM process and its supporting mechanisms should provide a clear, documented trail of change requests, their disposition, and changes introduced into the system, enabling better team communication and the collection of meaningful project metrics. The request itself, the requestor, the approvers, and all actions taken in response to the CR should be available through the CM process and tools to everyone on the project.
The generic approval flow defines a generic CR. The CR may be used to represent and track defects, enhancements, greenfield development, documentation, etc. The Change Control Board (CCB) is a central control mechanism to ensure that every change request is properly considered, authorized, and coordinated. The full CCB should meet on a regular basis, probably once a week. Emergency meetings can be called as necessary. All decisions made by the CCB should be documented in the CM system.
A CCB member is the top level of the change management hierarchy and can also act as every role defined lower in the hierarchy. For example, if a team leader is not present, a CCB member can act on behalf of the team leader.
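As an illustration of this delegation rule, the check below is a minimal sketch under stated assumptions: the `Role` names are taken from roles mentioned in this section, but the enumeration itself and the function `can_act_as` are hypothetical and not part of any particular CM tool.

```python
from enum import Enum


class Role(Enum):
    """Roles mentioned in this section; the enumeration itself is illustrative."""
    SUBMITTER = "Submitter"
    ASSIGNED_USER = "Assigned user"
    TESTER = "Tester"
    TEAM_LEADER = "Team leader"
    CCB_MEMBER = "CCB member"


def can_act_as(actor: Role, required: Role) -> bool:
    # A CCB member sits at the top of the hierarchy and may stand in for any
    # other role; every other role may act only as itself in this sketch.
    return actor is required or actor is Role.CCB_MEMBER


print(can_act_as(Role.CCB_MEMBER, Role.TEAM_LEADER))  # True
print(can_act_as(Role.TEAM_LEADER, Role.CCB_MEMBER))  # False
```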
The CCB includes the following members:
- Configuration process manager
- CM system administrator
- Respective system/component development managers
- Key stakeholders, such as Operations & Support, and user representatives
Figure 1: Change Request Life Cycle
This table describes basic actions performed on a change request.
Action | Description | Role |
---|---|---|
Submit CR | Any stakeholder on the project can submit a Change Request (CR). This logs the CR in the CM system, places it into the CCB Review Queue, and sets its state to Submitted. | Submitter |
Review CR | This CCB action reviews Submitted Change Requests. The CR’s content is initially reviewed in the CCB Review meeting to determine if it is a valid request. If it is, a determination is made if the change is in or out of scope for the current release(s), based on priority, schedule, resources, level-of-effort, risk, severity and any other relevant criteria as determined by the group. The state of a valid CR is set to Assigned or Postponed accordingly. | CCB |
Confirm Duplicate or Reject | If a CR is suspected of being a duplicate or an invalid request (e.g., operator error, not reproducible, works as designed, etc.), a delegate of the CCB is assigned to confirm the duplicate or rejected CR and to gather more information from the submitter, if necessary. The CR state is set to Duplicate or Closed as appropriate. | CCB Delegate |
Re-open | If more information is needed, or if a CR is rejected at any point in the process, the submitter is notified and may update the CR with new information. The updated CR is then re-submitted to the CCB Review Queue for consideration of the new data. | Submitter |
Open & Work-on | Once a CR is assigned by the CCB, the Project Lead will assign the work to the appropriate user – depending on the type of request (enhancement request, defect, documentation change, test defect, etc.) – and make any needed updates to the project schedule. The CR state is set to Opened. | Configuration Manager |
Resolve | The assigned worker performs the set of activities defined within the appropriate section of the process (e.g., requirements, analysis & design, implementation, produce user-support materials, design test, etc.) to make the changes requested. These activities will include all normal review and unit test activities as described within the normal development process. The CR will then be marked as Resolved. | Assigned user |
Validate | After the changes are Resolved by the assigned user (analyst, developer, tester, etc.), the changes are placed into a test queue to be assigned to a tester and validated in a test build of the product. | Tester |
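The actions in the table can be read as a small state machine. The following is a minimal sketch under stated assumptions: the state names are those given in the table, but the transition map `ALLOWED`, the function `advance`, and the exception `InvalidTransition` are hypothetical. The table does not name a state after validation, so the sketch stops at Resolved. The return paths from Postponed and Duplicate to Submitted are assumptions as well.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "Submitted"
    ASSIGNED = "Assigned"
    POSTPONED = "Postponed"
    DUPLICATE = "Duplicate"
    CLOSED = "Closed"
    OPENED = "Opened"
    RESOLVED = "Resolved"


# Allowed transitions, keyed by current state.
ALLOWED = {
    # Review CR / Confirm Duplicate or Reject
    State.SUBMITTED: {State.ASSIGNED, State.POSTPONED, State.DUPLICATE, State.CLOSED},
    # Assumption: a postponed CR returns to the review queue for a later release.
    State.POSTPONED: {State.SUBMITTED},
    # Re-open: the submitter adds information and the CR is re-submitted.
    State.DUPLICATE: {State.SUBMITTED},  # assumption for Duplicate
    State.CLOSED: {State.SUBMITTED},
    # Open & Work-on
    State.ASSIGNED: {State.OPENED},
    # Resolve
    State.OPENED: {State.RESOLVED},
    # The table names no state after validation, so none is modeled here.
    State.RESOLVED: set(),
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the transition map."""


def advance(current: State, target: State) -> State:
    # Reject any move the transition map does not list for the current state.
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target


state = State.SUBMITTED
for step in (State.ASSIGNED, State.OPENED, State.RESOLVED):
    state = advance(state, step)
print(state.value)  # Resolved
```

Encoding the transitions as data rather than as scattered conditionals keeps the flow easy to audit and to adjust when a project adds states such as Ready_for_testing or Tested.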
7 Types of System Change Releases
7.1 "Patch" Release (patch only)
A patch is a relatively small change, generally to source code, to fix a defect. However, data fixes may also be required to rectify invalid data created by bad code or user error. Either patch type, although small, can still have widespread impact on the system, especially if the source code change affects a critical component of a system or a data fix changes millions of data records. Therefore, all patches must be fully tested before being scheduled for production implementation.
7.2 Full Release
A full release generally means that most or all system components are packaged on a release medium. This is usually called a version upgrade.
7.3 Traceability and Auditability
Processes and systems that assist in the management of patches and version upgrades are a part of configuration management. They track past, current, and future versions of software and infrastructure components (e.g., databases, utilities, hardware) that have been, or will be, implemented. For large systems, like ERPs, with thousands of modules, manual processes become error-prone and unmanageable, so automated tracking is required to ensure that major downtime is not experienced due to user error.
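One simple form of automated tracking is to compare what is actually deployed against the approved configuration baseline. The following is a minimal sketch, assuming versions are recorded as plain strings per component; the function `find_drift` and the component names and version numbers are hypothetical.

```python
from typing import Optional


def find_drift(
    baseline: dict[str, str], deployed: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return components whose deployed version differs from the approved baseline.

    Each value is (expected, actual); None means the component is missing
    from that side.
    """
    drift = {}
    for name in sorted(baseline.keys() | deployed.keys()):
        expected = baseline.get(name)
        actual = deployed.get(name)
        if expected != actual:
            drift[name] = (expected, actual)
    return drift


# Illustrative data only.
baseline = {"database": "12.4", "os_patch_level": "2024-03"}
deployed = {"database": "12.5", "os_patch_level": "2024-03", "report_utility": "1.0"}

for component, (expected, actual) in find_drift(baseline, deployed).items():
    print(f"{component}: expected {expected}, found {actual}")
```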
A configuration management system can provide a thorough audit trail for implemented changes; however, in most cases this is not the only tracking necessary. Many companies today must comply with accounting and other regulations or standards, such as Sarbanes-Oxley (SOX), or with internal auditing control functions. Both internal and external EIT auditors will use this history to ensure that control processes are followed by IT. Specific EIT staff are allocated responsibility and oversight for these control processes, and are accountable for ensuring not only that the defined processes are followed, but also that the audits are completed on a timely basis and are accurate.
8 Summary
Systems should be designed and built to be easily maintained. Maintenance is the responsibility of EIT, and should be an auditable process, with mechanisms for tracking and reporting. Systems need to be monitored, measured, and validated to ensure this happens.
9 Key Competence Frameworks
While many large companies have defined their own sets of skills for purposes of talent management (to recruit, retain, and further develop the highest quality staff members that they can find, afford and hire), the advancement of EIT professionalism will require common definitions of EIT skills that can be used not just across enterprises, but also across countries. We have selected 3 major sources of skill definitions. While none of them is used universally, they provide a good cross-section of options.
Creating mappings between these frameworks and our chapters is challenging, because they come from different perspectives and have different goals. There is rarely a 100% correspondence between the frameworks and our chapters, and, despite careful consideration, some subjectivity was used to create the mappings. Please take that into consideration as you review them.
9.1 Skills Framework for the Information Age
The Skills Framework for the Information Age (SFIA) has defined nearly 100 skills. SFIA describes 7 levels of competency which can be applied to each skill. Not all skills, however, cover all seven levels. Some reach only partially up the seven-step ladder. Others are based on mastering foundational skills, and start at the fourth or fifth level of competency. SFIA is used in nearly 200 countries, from Britain to South Africa, South America, the Pacific Rim, and the United States. (http://www.sfia-online.org)
Skill | Skill Description | Competency Levels |
---|---|---|
Application support | The provision of application maintenance and support services, either directly to users of the systems or to service delivery functions. Support typically includes investigation and resolution of issues and may also include performance monitoring. Issues may be resolved by providing advice or training to users, by devising corrections (permanent or temporary) for faults, making general or site-specific modifications, updating documentation, manipulating data, or defining enhancements. Support often involves close collaboration with the system's developers and/or with colleagues specialising in different areas, such as Database administration or Network support. | 2 - 5 |
Business risk management | The planning and implementation of organisation-wide processes and procedures for the management of risk to the success or integrity of the business, especially those arising from the use of information technology, reduction or non-availability of energy supply or inappropriate disposal of materials, hardware or data. | 4 - 7 |
Capacity management | The management of the capability, functionality and sustainability of service components (including hardware, software, network resources and software/infrastructure as a Service) to meet current and forecast needs in a cost efficient manner aligned to the business. This includes predicting both long-term changes and short-term variations in the level of capacity required to execute the service, and deployment, where appropriate, of techniques to control the demand for a particular resource or service. | 4 - 6 |
Conformance review | The independent assessment of the conformity of any activity, process, deliverable, product or service to the criteria of specified standards, best practice, or other documented requirements. May relate to, for example, asset management, network security tools, firewalls and internet security, sustainability, real-time systems, application design and specific certifications. | 3 - 6 |
Customer service support | The management and operation of one or more customer service or service desk functions. Acting as a point of contact to support service users and customers reporting issues, requesting information, access, or other services. | 1 - 6 |
Database administration | The installation, configuration, upgrade, administration, monitoring and maintenance of databases. | 2 - 5 |
Digital forensics | The collection, processing, preserving, analysing, and presenting of computer-related evidence in support of security vulnerability mitigation and/or criminal, fraud, counterintelligence, or law enforcement investigations. | 4 - 6 |
Facilities management | The planning, control and management of all the facilities which, collectively, make up the IT estate. This involves provision and management of the physical environment, including space and power allocation, and environmental monitoring to provide statistics on energy usage. Encompasses physical access control, and adherence to all mandatory policies and regulations concerning health and safety at work. | 3 - 6 |
Financial management | The overall financial management, control and stewardship of the IT assets and resources used in the provision of IT services, including the identification of materials and energy costs, ensuring compliance with all governance, legal and regulatory requirements. | 4 - 6 |
Incident management | The processing and coordination of appropriate and timely responses to incident reports, including channelling requests for help to appropriate functions for resolution, monitoring resolution activity, and keeping clients apprised of progress towards service restoration. | 2 - 5 |
IT Infrastructure | The operation and control of the IT infrastructure (typically hardware, software, data stored on various media, and all equipment within wide and local area networks) required to deliver and support IT services and products to meet the needs of a business. Includes preparation for new or changed services, operation of the change process, the maintenance of regulatory, legal and professional standards, the building and management of systems and components in virtualised computing environments and the monitoring of performance of systems and services in relation to their contribution to business performance, their security and their sustainability. | 1 - 4 |
IT management | The management of the IT infrastructure and resources required to plan for, develop, deliver and support IT services and products to meet the needs of a business. The preparation for new or changed services, management of the change process and the maintenance of regulatory, legal and professional standards. The management of performance of systems and services in terms of their contribution to business performance and their financial costs and sustainability. The management of bought-in services. The development of continual service improvement plans to ensure the IT infrastructure adequately supports business needs. | 5 - 7 |
Network support | The provision of network maintenance and support services. Support may be provided both to users of the systems and to service delivery functions. Support typically takes the form of investigating and resolving problems and providing information about the systems. It may also include monitoring their performance. Problems may be resolved by providing advice or training to users about the network's functionality, correct operation or constraints, by devising work-arounds, correcting faults, or making general or site-specific modifications. | 2 - 5 |
Problem management | The resolution (both reactive and proactive) of problems throughout the information system lifecycle, including classification, prioritisation and initiation of action, documentation of root causes and implementation of remedies to prevent future incidents. | 3 - 5 |
Security administration | The provision of operational security management and administrative services. Typically includes the authorisation and monitoring of access to IT facilities or infrastructure, the investigation of unauthorised access and compliance with relevant legislation. | 1 - 6 |
Storage management | The planning, implementation, configuration and tuning of storage hardware and software covering online, offline, remote and offsite data storage (backup, archiving and recovery) and ensuring compliance with regulatory and security requirements. | 3 - 6 |
System software | The provision of specialist expertise to facilitate and execute the installation and maintenance of system software such as operating systems, data management products, office automation products and other utility software. | 3 - 5 |
9.2 European Competency Framework
The European Union’s European e-Competence Framework (e-CF) has 40 competences and is used by a large number of companies, qualification providers and others in public and private sectors across the EU. It uses five levels of competence proficiency (e-1 to e-5). No competence is subject to all five levels.
The e-CF is published and legally owned by CEN, the European Committee for Standardization, and its National Member Bodies (www.cen.eu). Its creation and maintenance have been co-financed and politically supported by the European Commission, in particular, DG (Directorate General) Enterprise and Industry, with contributions from the EU ICT multi-stakeholder community, to support competitiveness, innovation, and job creation in European industry. The Commission works on a number of initiatives to boost ICT skills in the workforce. Versions 1.0 to 3.0 were published as CEN Workshop Agreements (CWA). The e-CF 3.0 CWA 16234-1 was published as an official European Norm (EN), EN 16234-1. For complete information, please see http://www.ecompetences.eu.
e-CF Dimension 2 | e-CF Dimension 3 |
---|---|
C.3. Service Delivery (RUN) Ensures service delivery in accordance with established service level agreements (SLA's). Takes proactive action to ensure stable and secure applications and ICT infrastructure to avoid potential service disruptions, attending to capacity planning and to information security. Updates operational document library and logs all service incidents. Maintains monitoring and management tools (i.e. scripts, procedures). Maintains IS services. Takes proactive measures. | Level 1-3 |
C.4. Problem Management (RUN) Identifies and resolves the root cause of incidents. Takes a proactive approach to avoidance or identification of root cause of ICT problems. Deploys a knowledge system based on recurrence of common errors. Resolves or escalates incidents. Optimises system or component performance. | Level 2-4 |
E.3. Risk Management (MANAGE) Implements the management of risk across information systems through the application of the enterprise defined risk management policy and procedure. Assesses risk to the organisation’s business, including web, cloud and mobile resources. Documents potential risk and containment plans. | Level 2-4 |
E.4. Relationship Management (MANAGE) Establishes and maintains positive business relationships between stakeholders (internal or external) deploying and complying with organisational processes. Maintains regular communication with customer / partner / supplier, and addresses needs through empathy with their environment and managing supply chain communications. Ensures that stakeholder needs, concerns or complaints are understood and addressed in accordance with organisational policy. | Level 3-4 |
E.8. Information Security Management (MANAGE) Implements information security policy. Monitors and takes action against intrusion, fraud and security breaches or leaks. Ensures that security risks are analysed and managed with respect to enterprise data and information. Reviews security incidents, makes recommendations for security policy and strategy to ensure continuous improvement of security provision. | Level 2-4 |
9.3 i Competency Dictionary
The Information Technology Promotion Agency (IPA) of Japan has developed the i Competency Dictionary (iCD), translated it into English, and describes it at https://www.ipa.go.jp/english/humandev/icd.html. It is an extensive skills and tasks database, used in Japan and southeast Asian countries. It establishes a taxonomy of tasks and the skills required to perform the tasks. The IPA is also responsible for the Information Technology Engineers Examination (ITEE), which has grown into one of the largest scale national examinations in Japan, with approximately 600,000 applicants each year.
The iCD consists of a Task Dictionary and a Skill Dictionary. Skills for a specific task are identified via a “Task x Skill” table. (Please see Appendix A for the task layer and skill layer structures.) EITBOK activities in each chapter require several tasks in the Task Dictionary.
The table below shows a sample task from iCD Task Dictionary Layer 2 (with Layer 1 in parentheses) that corresponds to activities in this chapter. It also shows the Layer 2 (Skill Classification), Layer 3 (Skill Item), and Layer 4 (knowledge item from the IPA Body of Knowledge) prerequisite skills associated with the sample task, as identified by the Task x Skill Table of the iCD Skill Dictionary. The complete iCD Task Dictionary (Layer 1-4) and Skill Dictionary (Layer 1-4) can be obtained by returning the request form provided at http://www.ipa.go.jp/english/humandev/icd.html.
Task Dictionary | Skill Dictionary | ||
---|---|---|---|
Task Layer (Task Area) | Skill Classification | Skill Item | Associated Knowledge Items |
System operation design (Operation design) | System maintenance, operation, and evaluation | System operations management requirements definition | |
10 Key Roles
Both SFIA and the e-CF have described profiles (similar to roles) for providing examples of skill sets (skill combinations) for various roles. The iCD has described tasks performed in EIT and associated those with skills in the IPA database.
The following roles are common to ITSM.
- 1st, 2nd, 3rd Level Support
- Access Manager
- Facilities Manager
- Incident Manager
- IT Operations Manager
- IT Operator
- Major Incident Team
- Problem Manager
- Service Request Fulfillment
11 Standards
- ISO/IEC 24765:2016 Systems and software engineering—Vocabulary (also available free online as SEVOCAB at https://pascal.computer.org/)
- ISO 20000 series
- IEEE Std 14764-2006, ISO/IEC 14764-2006 - International Standard for Software Engineering - Software Life Cycle Processes - Maintenance
- IEEE Std 982.1-2005 - Standard Dictionary of Measures of the Software Aspects of Dependability
12 References
[1] (2010). SEVOCAB at https://pascal.computer.org/
[2] Retrieved from Guide to the Systems Engineering Body of Knowledge (SEBoK), version 1.3.: http://www.sebokwiki.org/w/index.php?title=Logistics&oldid=48199
[3] Ibid., Section 2.3 Maintenance Cost Estimation
[4] International Institute of Business Analysis. (2009). A Guide to the Business Analysis Body of Knowledge® Version 2.0. IIBA. [IIBA-BABOK 7.5.2, p.134]
[5] [Ibid., 7.6.2, p. 137]
[6] Pigoski, Thomas M., 1997. Practical software maintenance: Best practices for managing your software investment. Wiley Computer Pub. (New York).
[7] Lehman, M. M., 1980. Programs, Life Cycles, and Laws of Software Evolution. In Proceedings of the IEEE, 68, 9, 1060-1076.
[8] Introduction to the ITIL Service Lifecycle, Second Edition, Office of Government Commerce, 2010. p. 52; p. 215