From Open Risk Manual
Latest revision as of 13:31, 31 October 2019
How to Identify Data Outliers
A standardized procedure for systematically identifying data outliers in a univariate sense comprises the following steps:
- Import the data into a statistical analysis framework (e.g. a Python or R based platform)
- Compute the summary statistics (Descriptive Statistics) that capture stylized statistical properties of the data set
- Compute the z-score for all realizations
- Set z-score limits (one or two-sided as appropriate)
- Optionally calculate a kernel density estimate (see Wikipedia: Kernel density estimation)
- Plot the histogram along with the z-score boundaries
- Inspect the data visually
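The numerical steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function name `zscore_outliers`, the sample values and the limits of ±3 are assumptions chosen for the example, and the z-score is taken as (value − mean) / sample standard deviation. Plotting and kernel density estimation are left out.

```python
import statistics


def zscore_outliers(values, lower=-3.0, upper=3.0):
    """Return (summary, z_scores, flagged) for a univariate sample.

    lower / upper are illustrative z-score limits. Pass None for either
    one to obtain a one-sided test.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    if std == 0:
        raise ValueError("z-scores are undefined when all values are equal")

    # Summary statistics describing the data set
    summary = {
        "n": len(values),
        "mean": mean,
        "median": statistics.median(values),
        "std": std,
        "min": min(values),
        "max": max(values),
    }

    # z-score for every realization
    z_scores = [(v - mean) / std for v in values]

    # Keep (index, value, z-score) for realizations outside the limits
    flagged = [
        (i, v, z)
        for i, (v, z) in enumerate(zip(values, z_scores))
        if (lower is not None and z < lower) or (upper is not None and z > upper)
    ]
    return summary, z_scores, flagged


# Hypothetical sample: eleven values near 10 and one suspicious value
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 10.0, 25.0]
summary, z_scores, flagged = zscore_outliers(data, lower=-3.0, upper=3.0)

for index, value, z in flagged:
    print(f"index={index} value={value} z={z:.2f}")
```

Note that in a small sample a single extreme value inflates the standard deviation it is measured against, so its z-score is bounded; the limits should therefore be chosen with the sample size in mind, and the flagged values still need the visual inspection described in the last step.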
Issues and Challenges
This methodology aims to provide a powerful filter that can quickly flag candidate outliers across large sets of variables, but it does not provide an automatic solution.
- Outliers are ultimately defined relative to a specific Data Generation Process, Data Collection Process, and data modelling and usage context. Hence what counts as an outlier can change depending on that context
- The above methodology does not apply to detecting outliers in a multivariate sense
- The above methodology is less suited to detect outliers in categorical data
- The above methodology is less suited for data with complicated multi-modal distributions
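The last point can be illustrated with a small constructed example. The sketch below uses a hypothetical bimodal sample (two tight clusters near 0 and 10, plus a single value at 5); all values and variable names are assumptions made for the illustration. The isolated value sits far from every other observation, yet its z-score is close to zero because it coincides with the overall mean, so z-score limits alone would not flag it.

```python
import statistics

# Hypothetical bimodal sample: two tight clusters and one isolated value
cluster_low = [-0.2, -0.1, 0.0, 0.1, 0.2] * 4
cluster_high = [9.8, 9.9, 10.0, 10.1, 10.2] * 4
isolated = 5.0
sample = cluster_low + cluster_high + [isolated]

mean = statistics.mean(sample)
std = statistics.stdev(sample)

# z-score of the isolated value relative to the whole sample
z_isolated = (isolated - mean) / std

# Distance from the isolated value to its nearest neighbour
nearest_gap = min(abs(isolated - v) for v in cluster_low + cluster_high)

print(f"z-score of isolated value: {z_isolated:.3f}")
print(f"distance to nearest other observation: {nearest_gap:.1f}")
```

In a case like this, the histogram or kernel density estimate from the procedure above is what reveals the isolated value, which is one reason visual inspection remains part of the method.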