Data Cleansing

From Open Risk Manual


Data Cleansing (also Data Editing) is a Data Quality augmentation process that aims to correct, and possibly transform, data in order to produce a dataset that is suitable for use, e.g., in Model Development and/or Model Validation.

Data Cleansing Activities

In a formal data quality framework, data cleansing will typically follow a documented assessment of data quality. The type of activity depends on the specific issue causing the data quality problem. Activities may be manual or automated and are usually applied in combination.

Issues and Challenges

  • Inappropriate data cleansing steps can Bias the sample, for example through the removal of meaningful, informative Data Outliers or the right censoring of observations such as Withdrawn Ratings.
  • Data cleansing may hide issues with the dataset that would prevent building a model that is fit for purpose, for example by engineering a less representative sample.
  • Data cleansing may implicitly introduce further Model Assumptions. For example, filling in missing data will typically require assuming some distribution, which may or may not coincide with the actual distribution.
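
The first and third points above can be illustrated with a small sketch. It is not a recommended cleansing procedure. It uses synthetic, right-skewed data drawn from a lognormal distribution (an assumption made purely for illustration), and the helper names `mean_impute` and `trim_outliers`, the 30% missingness rate and the 3-standard-deviation cut-off are all hypothetical choices. The sketch shows that filling gaps with the observed mean shrinks the measured dispersion, and that trimming "outliers" from a skewed sample lowers its mean.

```python
import random
import statistics


def mean_impute(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def trim_outliers(values, k=3.0):
    """Drop observations lying more than k sample standard deviations from the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) <= k * sigma]


random.seed(0)

# Synthetic right-skewed sample (illustrative assumption, not real data).
complete = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Blank out roughly 30% of the values completely at random.
with_gaps = [None if random.random() < 0.3 else v for v in complete]
observed = [v for v in with_gaps if v is not None]

# Cleansing step 1: mean imputation. The mean is preserved, but every
# filled-in value sits exactly at the mean, so dispersion is understated.
imputed = mean_impute(with_gaps)

# Cleansing step 2: outlier removal. For skewed data the "outliers" are
# genuine large observations, so dropping them shifts the mean downwards.
trimmed = trim_outliers(complete)

print("std dev, observed values only:", statistics.stdev(observed))
print("std dev, after mean imputation:", statistics.stdev(imputed))
print("mean, full sample:", statistics.fmean(complete))
print("mean, after trimming:", statistics.fmean(trimmed))
print("observations removed by trimming:", len(complete) - len(trimmed))
```

Mean imputation implicitly assumes that the missing values sit at the centre of the observed distribution, and the trimming rule implicitly assumes that extreme values are errors. If either assumption fails for the data at hand, the cleansed sample will misrepresent the population it is meant to describe.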

