Data Cleansing

Latest revision as of 21:01, 11 September 2020

Definition

Data Cleansing (also Data Editing) is a Data Quality augmentation process that aims to correct, and possibly transform, data in order to produce a data set that is suitable for use, e.g., in Model Development and/or Model Validation.

Data Cleansing Activities

In a formal data quality framework, data cleansing will typically follow a documented assessment of data quality. The type of activity depends on the specific issue causing the data quality problem. Activities may be manual or automated and include a combination of the following:

Imputing missing Values
Correcting wrong Representations (e.g. Dates, Percentages)
Correcting wrong Values (e.g. Summations)
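The three activities listed above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the record layout, the field names (`date`, `rate`, `parts`, `total`, `balance`), the accepted date formats, the choice of median imputation and the rule that the components are trusted over the reported total are all assumptions made for the example.

```python
from datetime import datetime
from statistics import median

# Hypothetical raw records; field names and values are illustrative assumptions.
raw = [
    {"date": "2020-09-11", "rate": "5%", "parts": [40.0, 60.0], "total": 100.0, "balance": 1200.0},
    {"date": "11/09/2020", "rate": "0.07", "parts": [10.0, 15.0], "total": 30.0, "balance": None},
    {"date": "2020-09-12", "rate": "3.5%", "parts": [20.0, 5.0], "total": 25.0, "balance": 800.0},
]


def parse_date(text):
    """Correct a wrong representation: normalize a date string to ISO format."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):  # assumed input formats
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def parse_rate(text):
    """Correct a wrong representation: convert '5%' or '0.05' to a fraction."""
    text = text.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


def cleanse(records):
    """Return cleaned copies of the records, flagging every change made."""
    # Impute missing balances with the median of the observed ones (an assumption).
    observed = [r["balance"] for r in records if r["balance"] is not None]
    fill = median(observed)
    cleaned = []
    for r in records:
        # Correct a wrong value: recompute the total from its components,
        # assuming the components are right and the reported total is wrong.
        parts_sum = sum(r["parts"])
        cleaned.append({
            "date": parse_date(r["date"]),
            "rate": parse_rate(r["rate"]),
            "parts": list(r["parts"]),
            "total": parts_sum,
            "balance": fill if r["balance"] is None else r["balance"],
            "balance_imputed": r["balance"] is None,
            "total_corrected": abs(parts_sum - r["total"]) > 1e-9,
        })
    return cleaned


cleaned = cleanse(raw)
```

Keeping explicit flags such as `balance_imputed` and `total_corrected` is a design choice of this sketch: it leaves a trace of every automated change, so that later modelling steps can check whether the cleansing itself influenced the results.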

Issues and Challenges

Inappropriate data cleansing steps can potentially Bias the sample, for example by removing meaningful, informative Data Outliers or by right censoring observations such as Withdrawn Ratings.
Data cleansing may hide issues with the dataset that would prevent building a model that is fit for purpose, for example by engineering a less representative sample.
Data cleansing may implicitly introduce further Model Assumptions. For example, filling in missing data will typically require assuming some distribution, which may or may not coincide with the actual distribution of the data.
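The effect of these issues can be shown on a toy sample. The sketch below uses made-up numbers and two deliberately naive steps (mean imputation and dropping the largest value as an "outlier"); both are assumptions chosen for illustration, not recommended practice.

```python
from statistics import mean, pstdev

# Hypothetical complete sample (illustrative numbers, not real data).
full = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Case 1: the two largest values are missing and are filled with the
# mean of the observed values (mean imputation).
observed = full[:6]
imputed = observed + [mean(observed)] * 2

# Case 2: the largest value is treated as an outlier and removed.
trimmed = [x for x in full if x < 9.0]

spread_full = pstdev(full)        # spread of the true sample
spread_imputed = pstdev(imputed)  # spread after mean imputation
mean_full = mean(full)
mean_imputed = mean(imputed)
mean_trimmed = mean(trimmed)
```

In this toy sample the standard deviation falls from 2.0 to roughly 0.87 after mean imputation, and both the imputed and the trimmed sample have a lower mean than the full one. The cleansed data looks tidier, but it no longer reflects the distribution that generated the observations.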
