IT Availability Risk

From Open Risk Manual

Definition

IT Availability Risk (also IT Continuity Risk) is the risk that the performance and availability of IT systems and data are adversely impacted, including the inability to recover the institution’s services in a timely manner, due to a failure of IT hardware or software components, weaknesses in IT system management, or any other event.[1]

Risk Sub-types

Inadequate capacity management

A lack of resources (e.g. hardware, software, staff, service providers) can result in an inability to scale the service to meet business needs, system interruptions, degradation of service and/or operational mistakes.

Examples:

  • A capacity shortfall may affect transmission rates and the availability of the network (internet) for services like internet banking.
  • A lack of staff (internal or third party) can result in system interruptions and/or operational mistakes.
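
One way to spot a capacity shortfall before it degrades a service is to project the utilisation trend forward. The sketch below is illustrative only and is not taken from the source: the function name `periods_until_threshold`, the sample values and the 80% threshold are all assumptions, and a straight-line fit is a deliberate simplification that ignores seasonality and sudden load spikes.

```python
from __future__ import annotations

from typing import Optional


def periods_until_threshold(samples: list[float], threshold: float) -> Optional[float]:
    """Estimate how many periods remain before utilisation reaches `threshold`.

    Fits a least-squares straight line through equally spaced samples.
    Returns 0.0 if the latest sample is already at or above the threshold,
    and None if the fitted trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are needed to fit a trend")
    if samples[-1] >= threshold:
        return 0.0

    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    # Position where the fitted line crosses the threshold, measured
    # from the most recent sample (index n - 1).
    return max(0.0, (threshold - intercept) / slope - (n - 1))


# Hypothetical monthly peak utilisation of a network link, in percent.
monthly_peak = [50.0, 55.0, 60.0, 65.0]
print(periods_until_threshold(monthly_peak, threshold=80.0))
```

With these made-up figures the trend rises by five points per month, so the estimate gives roughly three months of headroom before the assumed 80% limit is reached.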

ICT system failures

A loss of availability due to hardware failures or to software failures and bugs.

Examples:

  • Failure/malfunction of storage (hard disks), server or other IT equipment caused by e.g. lack of maintenance.
  • Infinite loop in application software prevents transaction execution.
  • Outages due to the continued use of outdated IT systems and solutions that no longer meet present availability and resilience requirements and/or are no longer supported by their vendors.

Inadequate IT continuity and disaster recovery planning

Failure of planned IT availability, continuity and/or disaster recovery solutions (e.g. a fall-back recovery datacentre) when activated in response to an incident.

Examples:

  • Configuration differences between the primary and secondary datacentre may leave the fall-back datacentre unable to provide the planned continuity of service.
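
Configuration differences of this kind can be surfaced by comparing the settings of the two sites on a regular basis. The following is a minimal sketch under stated assumptions: the helper `config_drift`, the `MISSING` marker and the example settings are hypothetical, and real configurations would normally be exported from a configuration management tool rather than typed in as dictionaries.

```python
from __future__ import annotations

from typing import Any

MISSING = "<missing>"  # marker for a setting that is absent on one site


def config_drift(primary: dict[str, Any], secondary: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return each setting whose value differs between the two sites.

    The result maps the setting name to (primary value, secondary value).
    """
    drift = {}
    for key in sorted(primary.keys() | secondary.keys()):
        p = primary.get(key, MISSING)
        s = secondary.get(key, MISSING)
        if p != s:
            drift[key] = (p, s)
    return drift


# Hypothetical settings, for illustration only.
primary = {"db_version": "14.9", "tls_min": "1.2", "max_connections": 500}
secondary = {"db_version": "14.5", "tls_min": "1.2"}

for name, (p, s) in config_drift(primary, secondary).items():
    print(f"{name}: primary={p} secondary={s}")
```

An empty result means the compared settings match; it does not prove that the fall-back site will work, which is why fail-over tests remain necessary.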

Disruptive and destructive cyber attacks

Attacks for different purposes (e.g. activism, blackmail) that overload systems and the network, preventing online computer services from being accessed by their legitimate users.

Examples:

  • Distributed Denial of Service (DDoS) attacks are performed by means of a multitude of computer systems on the internet controlled by a hacker, which send a large number of apparently legitimate service requests to internet (e.g. e-banking) services.

Factors

  • an institution may be subject to IT availability and continuity risks due to internet dependencies, high adoption of innovative IT solutions or other business distribution channels that may make it a more likely target for cyber-attacks
  • an institution may be more exposed to IT availability and continuity risks due to the complexity (e.g. as a result of mergers or acquisitions) or outdated nature of its IT systems
  • an institution that is implementing material changes to its IT systems and/or IT function (e.g. as a result of mergers, acquisitions, divestments or the replacement of its core IT systems) may be more exposed to IT availability and continuity risks
  • the location of important IT operations/data centres (e.g. regions, countries) may expose the institution to natural disasters (e.g. flooding, earthquakes), political instability or labour conflicts and civil disturbances which can lead to a material increase of IT availability and continuity risks

Controls

  • identify critical IT processes and the relevant supporting IT systems that should be part of the business resilience and continuity plans with:
    • comprehensive analysis of dependencies between the critical business processes and supporting systems;
    • determination of recovery objectives for the supporting IT systems (e.g. typically determined by the business and/or regulations in terms of RTO and RPO);
    • appropriate contingency planning to enable the availability, continuity, and recovery of critical IT systems and services to minimize disruption to an institution’s operations within acceptable limits.
  • have business resilience and continuity policies, standards and operational controls, which include:
    • measures to prevent a single scenario, incident or disaster from impacting both IT production and recovery systems;
    • IT system backup and recovery procedures for critical software and data, that ensure that these backups are stored in a secure and sufficiently remote location, so that an incident or disaster cannot destroy or corrupt these critical data;
    • monitoring solutions for the timely detection of IT availability or continuity incidents;
    • a documented incident management and escalation process, that also provides guidance on the different incident management and escalation roles and responsibilities, the members of the crisis committee(s) and the chain of command in case of emergency;
    • physical measures to both protect the institution’s critical IT infrastructure (e.g. data centres) from environmental risks (e.g. flooding and other natural disasters) and ensure an appropriate operating environment for IT systems (e.g. air conditioning);
    • processes, roles and responsibilities to ensure that outsourced IT systems and services are also covered by adequate business resilience and continuity solutions and plans;
    • IT performance and capacity planning and monitoring solutions for critical IT systems and services with defined availability requirements, to detect important performance and capacity constraints in a timely manner;
    • solutions to protect critical internet activities or services (e.g. e-banking services), where necessary and appropriate, against denial of service and other cyber-attacks from the internet, aimed at preventing or disturbing access to these activities and services.
  • test IT availability and continuity solutions against a range of realistic scenarios, including cyber-attacks, fail-over tests and tests of back-ups for critical software and data, which:
    • are planned, formalised and documented, and the test results used to strengthen the effectiveness of the IT availability and continuity solutions;
    • include stakeholders and functions within the organisation, such as business line management including business continuity, incident and crisis response teams, as well as relevant external stakeholders in the ecosystem;
    • involve the management body and senior management appropriately (e.g. as part of crisis management teams) and keep them informed of test results.
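
The recovery objectives mentioned in the controls above (RTO and RPO) can be compared with what the backup schedule and the fail-over tests actually deliver. The sketch below is an illustration, not a prescribed method: the class `RecoveryObjective`, the function `check_objectives` and all figures are assumptions, and it treats the backup interval as the worst-case data loss, which ignores the time a backup takes to complete.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    system: str
    rto: timedelta  # Recovery Time Objective: maximum tolerable time to restore service
    rpo: timedelta  # Recovery Point Objective: maximum tolerable window of data loss


def check_objectives(
    objective: RecoveryObjective,
    backup_interval: timedelta,
    tested_recovery_time: timedelta,
) -> list[str]:
    """Return one finding per objective that current arrangements do not meet."""
    findings = []
    # Simplification: data written since the last backup is assumed lost.
    if backup_interval > objective.rpo:
        findings.append(
            f"{objective.system}: backup interval {backup_interval} exceeds RPO {objective.rpo}"
        )
    if tested_recovery_time > objective.rto:
        findings.append(
            f"{objective.system}: tested recovery time {tested_recovery_time} exceeds RTO {objective.rto}"
        )
    return findings


# Hypothetical objectives and measurements, for illustration only.
payments = RecoveryObjective("payments", rto=timedelta(hours=4), rpo=timedelta(minutes=15))
findings = check_objectives(
    payments,
    backup_interval=timedelta(hours=1),
    tested_recovery_time=timedelta(hours=3),
)
for finding in findings:
    print(finding)
```

In this made-up case the tested recovery time is within the RTO, but hourly backups cannot meet a 15-minute RPO, so the check reports a single finding.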

References

  1. EBA, Final Guidelines on ICT Risk Assessment under SREP