Latest revision as of 13:31, 31 October 2019

How to Identify Data Outliers

A standardized procedure for systematically identifying univariate data outliers comprises the following steps:

Import the data into a statistical analysis framework (e.g. a Python- or R-based platform)
Compute the summary statistics (Descriptive Statistics) that capture stylized statistical properties of the data set
Compute the z-score of every observation (its deviation from the mean, in units of the standard deviation)
Set z-score limits (one or two-sided as appropriate)
Optionally calculate a kernel density estimate
Plot the histogram along with the z-score boundaries
Inspect the data visually
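
The numerical steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `zscore_outliers`, the default limits of plus or minus 3, the use of the sample standard deviation, and the synthetic data are all assumptions made for the example. The z-score of an observation x is taken as (x - mean) / standard deviation.

```python
import random
import statistics


def zscore_outliers(values, lower=-3.0, upper=3.0):
    """Return the z-scores and the indices of values outside [lower, upper].

    The limits are assumptions; pass float("-inf") or float("inf")
    for one of them to obtain a one-sided test.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    if std == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    z_scores = [(v - mean) / std for v in values]
    flagged = [i for i, z in enumerate(z_scores) if z < lower or z > upper]
    return z_scores, flagged


# Synthetic data (hypothetical): 500 draws around 50 plus one extreme value.
random.seed(0)
data = [random.gauss(50.0, 5.0) for _ in range(500)] + [120.0]

# Summary statistics of the data set.
print("mean:", statistics.mean(data), "stdev:", statistics.stdev(data))

z_scores, flagged = zscore_outliers(data)
print("flagged indices:", flagged)
```

The flagged indices are only candidates. For the plotting steps, one option is to draw the histogram with `matplotlib.pyplot.hist` and mark the z-score boundaries with `axvline`, optionally overlaying a density estimate such as `scipy.stats.gaussian_kde`, before inspecting the result visually.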

Issues and Challenges

This methodology aims to provide a powerful filter that can quickly identify candidate outliers in large sets of variables, but it does not provide an automatic solution.

Outliers are ultimately defined in the context of a particular Data Generation Process, Data Collection Process, and data modelling and usage. Hence what counts as an outlier can change depending on that context.
The above methodology does not apply to detecting outliers in a multivariate sense
The above methodology is less suited to detecting outliers in categorical data
The above methodology is less suited to data with complicated multi-modal distributions
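
The multi-modal limitation can be illustrated with a small constructed example. The data below are hypothetical: two tight clusters around 0 and 10, plus a single value of 5 that belongs to neither. Because the overall mean sits between the clusters, the isolated value receives a z-score close to zero and is not flagged.

```python
import statistics

# Hypothetical bimodal data: two tight clusters and one value in the gap.
cluster_a = [0.0, 0.1, -0.1, 0.2, -0.2] * 20
cluster_b = [10.0, 10.1, 9.9, 10.2, 9.8] * 20
isolated = 5.0
data = cluster_a + cluster_b + [isolated]

mean = statistics.mean(data)
std = statistics.stdev(data)  # sample standard deviation

# The isolated value coincides with the overall mean, so its z-score is
# near zero even though no other observation lies close to it.
z_isolated = (isolated - mean) / std
print("z-score of the isolated value:", z_isolated)
```

In a case like this the histogram shows the gap that the z-score misses, which is one reason visual inspection remains part of the procedure.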
