Exploratory Data Analysis

Definition

Exploratory Data Analysis (EDA) in the context of Risk Management is the process of systematically analysing Risk Data in order to identify and summarize their main characteristics in text-based (tabular) or visual reports.

EDA can be employed as a standalone process that produces management information, or as one of the first steps towards Model Development within Quantitative Risk Management. Because it relies on data and is sensitive to Data Quality, EDA conceptually overlaps with Risk Data Review activities and is usually preceded by them (or iterated with them).

EDA is distinct from regular Risk Reporting, the activity of producing standardized monitoring reports to help manage already identified risk factors.

Objectives

The analysis objectives can span a wide range depending on context:

Provide suggestions for further data collection and/or data quality improvements (e.g. Missing Data)
Define and identify outliers
Create an overall map of the underlying structure of the data (identifying clusters of similar variables, reducing dimensionality)
Identify potential hypotheses about the causes underlying the observed risks (important / statistically significant factors)
Support the selection of appropriate models
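
As an illustration of the outlier objective, the following is a minimal sketch that flags outliers using Tukey's fences (values lying more than 1.5 times the interquartile range beyond the quartiles). This is only one common convention, not a definition prescribed by this entry. The function name `tukey_outliers`, the multiplier `k` and the exposure figures are hypothetical and chosen for illustration.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartiles computed with the "inclusive" method (data treated as the full population)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical loan exposures (illustrative numbers only)
exposures = [10, 12, 11, 13, 12, 14, 11, 95]
outliers = tukey_outliers(exposures)
print(outliers)
```

Whether a flagged point is a data error or a genuine extreme observation still needs to be judged case by case, which is why outlier investigation is usually linked to data quality review.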

NB: While exploratory data analysis does not aim to produce a model, it may use auxiliary models (e.g. regressions) to obtain preliminary insights into potential causal relationships.

EDA Activities

The activities included in the EDA scope depend on the nature of the data. Indicatively, in the context of Credit Scorecard Development the following activities are typical:

univariate analysis of predictor variables
investigation of outlier data points
multivariate analysis of the predictor variables, correlation estimates and scatter plots
association between default and borrower characteristics
contingency tables, odds ratios and visual mosaic plots
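
The association between default and a borrower characteristic can be sketched with a 2x2 contingency table and an odds ratio, as below. The records, the `renter` / `owner` categories and the helper `odds` are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter

# Hypothetical records: (borrower characteristic, defaulted?)
records = (
    [("renter", True)] * 30 + [("renter", False)] * 170
    + [("owner", True)] * 20 + [("owner", False)] * 280
)

# The contingency table, stored as counts per (characteristic, default flag) cell
counts = Counter(records)

def odds(group):
    """Odds of default within a group: defaults divided by non-defaults."""
    return counts[(group, True)] / counts[(group, False)]

# Odds ratio of default for renters relative to owners
odds_ratio = odds("renter") / odds("owner")

for group in ("renter", "owner"):
    print(group, counts[(group, True)], counts[(group, False)])
print(round(odds_ratio, 3))
```

An odds ratio well above or below 1 suggests an association worth investigating further. It does not establish causality, and this sketch assumes that no cell of the table is empty.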

Techniques

Qualitatively, exploratory data analysis techniques can be classified according to:

whether the summary information generated is numerical (e.g. tabulated numerical results) or graphical (e.g. plots and charts) in nature (keeping in mind that many analyses can be expressed in both forms)
the dimensionality of the data (sub)set being investigated (1D, 2D, or multidimensional). Univariate and Bivariate analysis in particular are highly standardized subcategories
whether the analysis is model-free (non-parametric) or involves some modelling assumptions (generally simple)
whether the data are panel data, Timeseries Data or some more complex data cube

Summary Statistics

Univariate analysis
- Moments (Mean, Standard Deviation and higher moments)
- Quantile statistics (min / max value, Q1, median, Q3 etc)
- Other Measures (Range, Mode(s), Most frequent values)
Bivariate analysis
- Correlations (Spearman, Pearson and Kendall)
- Measures of Association for Categorical variables (Chi-Squared, Information Value)
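
The summary statistics listed above can be sketched using only the Python standard library. The variable names and sample figures are hypothetical. The Information Value is computed here with the commonly used formula, the sum over bins of (share of goods - share of bads) * ln(share of goods / share of bads), which assumes that every bin contains at least one good and one bad observation.

```python
import math
import statistics

def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def information_value(good_counts, bad_counts):
    """Information Value of a binned variable (assumes no empty bins)."""
    total_good, total_bad = sum(good_counts), sum(bad_counts)
    iv = 0.0
    for g, b in zip(good_counts, bad_counts):
        pg, pb = g / total_good, b / total_bad
        iv += (pg - pb) * math.log(pg / pb)
    return iv

# Hypothetical borrower data (illustrative numbers only)
income = [20, 35, 50, 65, 80]
ltv = [0.95, 0.90, 0.80, 0.85, 0.60]

# Univariate summaries
mean_income = statistics.fmean(income)
std_income = statistics.stdev(income)
q1, median_income, q3 = statistics.quantiles(income, n=4, method="inclusive")

# Bivariate summaries
rho_pearson = pearson(income, ltv)
rho_spearman = spearman(income, ltv)

# Information Value for a two-bin characteristic: goods and bads per bin
iv = information_value([100, 200], [30, 20])
print(mean_income, median_income, round(rho_spearman, 3), round(iv, 3))
```

Comparing the Pearson and Spearman coefficients is a quick check on whether a relationship is monotonic but non-linear, or driven by a few extreme points.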

Visualization

The most suitable visualization choices depend on the data set. In many cases, alternative visualizations provide complementary insights.

Examples

Open Source Software

KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse.
Orange, an open-source data mining and machine learning software suite.
Python, an open-source programming language widely used in data mining and machine learning.
R, an open-source programming language for statistical computing and graphics. Together with Python, it is one of the most popular languages for Data Science.
Weka, an open-source data mining package that includes visualization and EDA tools such as targeted projection pursuit.

Issues and Challenges

EDA is not very well defined in scope, hence there can be confusion as to what constitutes a good analysis.
EDA may already expose researchers and data scientists to potential bias.