Data Integrity Validation

From Open Risk Manual


Data (Integrity) Validation is the process of analysing a data set in order to establish certain aspects of Data Quality and decide on possible remediation steps.[1]

Validation Checks

The range of possible validation checks varies greatly. The following types of checks are meaningful in particular in the context of validating credit data (Loan Tape data) but should also be useful more generally:

  • Reconciliation checks (Checking number of lines, totals etc.)
  • Field-specific checks (Checking for presence and uniqueness of fields, formatting, numerical bounds etc.)
  • Cross-field checks (Checking consistency of values within a given time snapshot where there are dependencies)
  • Cross-time checks (Checking consistency of values when datasets reflect different snapshots, e.g. when there is a monotonicity requirement)
  • Sense-check of distribution of observations (Checking the frequency distributions for unexpected outliers)
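The check types above can be sketched as simple rule functions over tabular loan records. This is a minimal illustration, not a reference implementation: the field names (loan_id, outstanding_balance, drawn_amount, committed_amount, origination_date) are hypothetical and not taken from any specific loan-tape standard.

```python
# Illustrative validation checks over a list of loan records.
# All field names are hypothetical examples.

def reconciliation_check(records, expected_count):
    """Reconciliation: the file should contain the declared number of lines."""
    return len(records) == expected_count

def field_check(record):
    """Field-specific: loan_id present; balance numeric and non-negative."""
    return (record.get("loan_id") is not None
            and isinstance(record.get("outstanding_balance"), (int, float))
            and record["outstanding_balance"] >= 0)

def cross_field_check(record):
    """Cross-field: the drawn amount cannot exceed the committed amount."""
    return record["drawn_amount"] <= record["committed_amount"]

def cross_time_check(old_record, new_record):
    """Cross-time: a static field such as the origination date
    must not change between two snapshots of the same loan."""
    return new_record["origination_date"] == old_record["origination_date"]

records = [
    {"loan_id": "A1", "outstanding_balance": 100.0,
     "drawn_amount": 80.0, "committed_amount": 120.0,
     "origination_date": "2020-01-01"},
]

print(reconciliation_check(records, 1))          # True
print(all(field_check(r) for r in records))      # True
print(all(cross_field_check(r) for r in records))  # True
```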

In many cases the validation checks can be formally defined via Validation Rules.

Validation Levels

One can define different validation levels on the basis of the amount of data that is necessary to perform the data integrity audit.[2]

Validation Level 0

For these quality checks, only the structure of the file or the format of the variables is necessary as input. Examples:

  • The correct number of tables / datasheets
  • The correct number of columns
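A Level 0 check only inspects the shape of the file, never the values. The sketch below assumes, purely for illustration, a CSV file expected to have three columns:

```python
import csv
import io

# Level 0 sketch: validate file structure only, not content.
# The expected layout (3 columns) is an illustrative assumption.
EXPECTED_COLUMNS = 3

raw = "loan_id,balance,rate\nA1,100.0,0.05\n"
rows = list(csv.reader(io.StringIO(raw)))

header_ok = len(rows[0]) == EXPECTED_COLUMNS               # header width
width_ok = all(len(row) == EXPECTED_COLUMNS for row in rows)  # every row
print(header_ok and width_ok)  # True
```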

Validation Level 1

Quality checks that require only the data point itself. Examples:

  • Whether the data point is populated, has the right type, and falls within the right range
  • Whether data points are consistent across records and the entire datafile
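Each Level 1 check needs nothing beyond the individual data point. A minimal sketch, where the field (an interest rate) and its bounds are illustrative assumptions:

```python
# Level 1 sketch: checks that require only the data point itself.
# Field meaning and bounds are hypothetical examples.

def check_populated(value):
    """The data point must be present and non-empty."""
    return value is not None and value != ""

def check_type(value):
    """The data point must have the expected numeric type."""
    return isinstance(value, float)

def check_range(value, lo=0.0, hi=1.0):
    """The data point must fall within expected bounds."""
    return lo <= value <= hi

interest_rate = 0.05
print(check_populated(interest_rate)
      and check_type(interest_rate)
      and check_range(interest_rate))  # True
```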

Validation Level 2

Defined as "intra-domain, intra-source checks". Examples:

  • Revision checks and time series checks
  • Inter-dataset checks comparing different datasets from the same data source
  • Checks between correlated datasets
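A typical intra-source, inter-dataset check reconciles a detail dataset against a summary dataset delivered by the same source. The sketch below assumes a hypothetical loan-level table and a separately reported total; the tolerance is an illustrative choice:

```python
# Level 2 sketch: inter-dataset check within the same data source.
# Dataset contents and tolerance are illustrative assumptions.

detail = [{"loan_id": "A1", "balance": 100.0},
          {"loan_id": "A2", "balance": 250.0}]

# Total reported in a separate datasheet from the same source.
summary_total = 350.0

computed = sum(r["balance"] for r in detail)
print(abs(computed - summary_total) < 1e-6)  # True
```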

Validation Level 3

Validation level 3 involves mirror checks that verify the consistency between declarations from different sources referring to the same phenomenon. Examples:

  • "inter-source intra-dataset checks"
  • "inter-source inter-dataset checks"

Validation Level 4

Validation level 4 is defined as plausibility or consistency checks between separate domains available in the same Institution. These checks could be based on the plausibility of results describing the "same" phenomenon from different statistical domains or data generating processes.

Validation Level 5

Validation level 5 could be defined as plausibility or consistency checks between the data available in the Institution and the data / information available outside the Institution. This implies no "control" over the methodology on the basis of which the external data are collected, and sometimes a limited knowledge of it.

ECB TRIM Remediation Requirements

A process for the identification and remediation of data quality deficiencies should be in place in order to constantly improve data quality and promote compliance with the data quality standards.[3]

Data quality assessments should be carried out by an independent unit whose recommendations are issued with an indication of their priority, based on the materiality of the incidents identified. All such data quality incidents should be recorded and monitored by an independent data quality unit.

For each of the data quality incidents, an owner responsible for resolving the incident should be appointed and an action plan for dealing with the incident drawn up on the basis of the priority assigned.

Remediation timelines should depend on the severity and impact of the incident and the implementation timelines required to resolve it. Data quality incidents should be resolved at source level, rather than merely mitigated, by taking a prudent approach.


  1. ECB, Asset Quality Review - Phase 2 Manual
  2. Eurostat, D2.5 - Definition of validation levels and other related concepts
  3. ECB guide to internal models - Credit Risk, Sep 2018