Data Cleansing (also known as Data Editing) is a Data Quality augmentation process that aims to correct, and possibly transform, data in order to produce a dataset that is suitable for use, e.g., in Model Development and/or Model Validation.
Data Cleansing Activities
In a formal data quality framework, data cleansing typically follows a documented assessment of data quality. The activities required depend on the specific issue causing the data quality problem. They may be manual or automated and include a combination of the following:
- Imputing missing Values
- Correcting wrong Representations (e.g. Dates, Percentages)
- Correcting wrong Values (e.g. Summations)
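The three activities above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed procedure: the record layout, the field names (`booked`, `rate`, `income`, `parts`, `total`), the two date formats, the choice of median imputation, and the rule that the components are trusted over the reported total are all hypothetical choices made for the example.

```python
from datetime import date, datetime
from statistics import median

# Hypothetical raw records; all field names and values are illustrative only.
raw = [
    {"booked": "2021-03-15", "rate": "4.5%", "income": 50.0, "parts": (100.0, 5.0), "total": 105.0},
    {"booked": "15/04/2021", "rate": "0.05", "income": None, "parts": (200.0, 10.0), "total": 210.0},
    {"booked": "2021-05-01", "rate": "3%", "income": 70.0, "parts": (300.0, 9.0), "total": 390.0},
]

# Assumption: only these two date layouts occur (ISO and day/month/year).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Correct a wrong representation: map known date layouts to a date object."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def parse_rate(text):
    """Correct a wrong representation: express every rate as a fraction."""
    text = text.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


def cleanse(records, tol=1e-9):
    """Return cleaned copies of the records plus a log of every change made."""
    # Imputation value: median of the observed incomes (an assumed choice).
    fill = median(r["income"] for r in records if r["income"] is not None)
    cleaned, log = [], []
    for i, r in enumerate(records):
        row = dict(r)  # work on a copy so the raw data stays untouched
        row["booked"] = parse_date(r["booked"])
        row["rate"] = parse_rate(r["rate"])
        if r["income"] is None:
            row["income"] = fill
            log.append((i, "income", "imputed with median"))
        # Assumption: the components are trusted over the reported total.
        expected = sum(r["parts"])
        if abs(expected - r["total"]) > tol:
            row["total"] = expected
            log.append((i, "total", "reset to sum of parts"))
        cleaned.append(row)
    return cleaned, log


cleaned, log = cleanse(raw)
for entry in log:
    print(entry)
```

The sketch returns a change log alongside the cleaned records and leaves the raw input unmodified, so that each correction remains traceable to the documented data quality assessment.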
Issues and Challenges
- Inappropriate data cleansing steps can potentially Bias the sample, for example by removing meaningful, informative Data Outliers or by right censoring observations such as Withdrawn Ratings.
- Data cleansing may hide issues with the dataset that would prevent building a model that is fit for purpose, for example by engineering a less representative sample.
- Data cleansing may implicitly introduce further Model Assumptions. For example, filling in missing data typically requires assuming some distribution, which may or may not coincide with the actual distribution.
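Two of the risks listed above can be made concrete with toy numbers. The sketch below is illustrative only: the data, the cutoff of 10.0, and the helper names `drop_above` and `impute_mean` are hypothetical. It shows that removing an informative outlier shifts the sample mean, and that filling gaps with the observed mean leaves the mean unchanged but shrinks the sample variance, which amounts to an implicit assumption about the missing values.

```python
from statistics import mean, variance


def drop_above(values, cutoff):
    """Naive outlier removal: discard every value above the cutoff."""
    return [v for v in values if v <= cutoff]


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Toy loss data in which the large value is a genuine, informative observation.
losses = [1.0, 2.0, 3.0, 4.0, 50.0]
trimmed = drop_above(losses, cutoff=10.0)
print("mean before / after trimming:", mean(losses), mean(trimmed))

# Toy scores with two missing entries.
scores = [2.0, 4.0, None, None, 6.0, 8.0]
observed = [s for s in scores if s is not None]
filled = impute_mean(scores)
print("mean observed / filled:", mean(observed), mean(filled))
print("variance observed / filled:", variance(observed), variance(filled))
```

With these numbers, trimming lowers the mean from 12.0 to 2.5, while mean imputation keeps the mean at 5.0 but reduces the sample variance from roughly 6.67 to 4.0. Whether such effects are acceptable depends on the intended use of the data.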