From Open Risk Manual

Latest revision as of 13:28, 3 September 2019
How to Identify Data Outliers
A standardized procedure for systematically identifying data outliers in a univariate sense comprises the following steps:
 Import the data into a statistical analysis framework (e.g. a Python- or R-based platform)
 Compute the summary statistics (Descriptive Statistics) that capture stylized statistical properties of the data set
 Compute the z-score for all realizations
 Set z-score limits (one- or two-sided, as appropriate)
 Optionally calculate a kernel density estimate
 Plot the Histogram along with the z-score boundaries
 Inspect the data visually
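The numerical steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names (summarize, z_scores, flag_outliers), the sample data and the limits of -3 and +3 are all assumptions made for the example. The z-score of a realization x is taken here as (x - mean) / std, with std the sample standard deviation.

```python
import statistics


def summarize(values):
    """Descriptive statistics for a univariate sample."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


def z_scores(values):
    """z-score of every realization: (x - mean) / std."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    if std == 0:
        raise ValueError("z-scores are undefined for constant data")
    return [(x - mean) / std for x in values]


def flag_outliers(values, lower=-3.0, upper=3.0):
    """Return (index, value, z) for realizations outside the z-score limits.

    Pass lower=None or upper=None for a one-sided limit.
    The default limits are an illustrative assumption, not a recommendation.
    """
    flagged = []
    for i, (x, z) in enumerate(zip(values, z_scores(values))):
        below = lower is not None and z < lower
        above = upper is not None and z > upper
        if below or above:
            flagged.append((i, x, z))
    return flagged


# Hypothetical sample: eleven values near 10 and one suspicious value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 10.0, 25.0]

print(summarize(data))
for index, value, z in flag_outliers(data):
    print(f"index={index} value={value} z={z:.2f}")
```

The flagged points are candidates for the visual inspection step, not confirmed outliers. The histogram and the optional kernel density estimate are left out of the sketch because they need a plotting or statistics library beyond the standard library.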
Issues and Challenges
This methodology aims to provide a powerful filter that can quickly identify candidate outliers in large sets of variables, but it does not provide an automatic solution.
 Outliers are ultimately defined relative to a specific Data Generation Process, Data Collection Process, and data modelling and usage context. Hence what counts as an outlier can change depending on that context.
 The above methodology does not apply to detecting outliers in a multivariate sense.
 The above methodology is less suited to detecting outliers in categorical data.
 The above methodology is less suited to data with complicated multimodal distributions.
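The multimodal limitation can be illustrated with a small Python sketch. The sample is synthetic and chosen purely for illustration: two tight clusters around 0 and 10, plus one isolated value at 5 that belongs to neither cluster. Because the isolated value sits close to the overall mean, its z-score is close to zero and a z-score filter does not flag it.

```python
import statistics

# Synthetic bimodal sample (an assumption made for illustration only).
cluster_a = [0.0, 0.1, -0.1, 0.2, -0.2] * 4
cluster_b = [10.0, 10.1, 9.9, 10.2, 9.8] * 4
isolated = 5.0  # far from both clusters, yet near the overall mean
sample = cluster_a + cluster_b + [isolated]

mean = statistics.fmean(sample)
std = statistics.stdev(sample)
zs = [(x - mean) / std for x in sample]
z_isolated = zs[-1]

print(f"mean={mean:.2f} std={std:.2f}")
print(f"z-score of isolated point: {z_isolated:.3f}")
print(f"largest |z| in sample: {max(abs(z) for z in zs):.3f}")
```

In this sample the isolated value has the smallest absolute z-score of all points, so visual inspection of the histogram, or a method suited to multimodal data, is needed to notice it.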