# Exploratory Data Analysis

## Revision as of 17:11, 5 September 2019

## Definition

**Exploratory Data Analysis** (EDA) in the context of Risk Management is the process of systematically analysing Risk Data in order to identify and summarize their main characteristics in text-based (tabular) or visual reports.

EDA can be employed as a standalone process that produces management information, or within Quantitative Risk Management as one of the first steps towards Model Development. Because it relies on data and is sensitive to Data Quality, EDA conceptually overlaps with Risk Data Review activities, which usually precede it or are iterated alongside it.

EDA is distinct from regular Risk Reporting, the activity of producing standardized monitoring reports to help manage already identified risk factors.

## Objectives

The analysis objectives can span a wide range depending on context:

- Provide suggestions for further data collection and/or data quality improvements (e.g. Missing Data)
- Define and identify outliers
- Create an overall map of the underlying structure of the data (identifying clusters of similar variables, reducing dimensionality)
- Identify potential hypotheses about the causes underlying the observed risks (important / statistically significant factors)
- Support the selection of appropriate models

NB: While exploratory data analysis does not aim to produce a model, it may use auxiliary models (e.g. regressions) to obtain preliminary insights into potential causal relationships.
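As an illustration of such an auxiliary model, the following is a minimal sketch of a one-variable least-squares regression using only the Python standard library. The function name `fit_line` and the two data series are hypothetical and invented purely for illustration; they do not come from any real risk data set.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variance of y explained by the fitted line
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical illustration data (percent): not real observations
unemployment_rate = [4.0, 5.0, 6.0, 7.0, 8.0]
default_rate = [1.1, 1.4, 2.1, 2.4, 3.0]

slope, intercept, r_squared = fit_line(unemployment_rate, default_rate)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

In an exploratory setting the fitted slope and the explained variance serve only as a first indication that a factor may matter; they are not a validated model.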

## Techniques

Qualitatively, exploratory data analysis techniques can be classified according to

- whether the summary information generated is numerical (e.g. tabulated numerical results) or graphical (e.g. plots and charts) in nature (*keeping in mind that many analyses can be expressed in both forms*)
- the dimensionality of the data (sub)set being investigated (1D, 2D, or multidimensional). Univariate and Bivariate analysis in particular are highly standardized subcategories
- whether the analysis is model-free (non-parametric) or involves some modelling assumptions (generally simple)
- whether the data are panel data, time series data or some more complex data cube

### Summary Statistics

- Univariate analysis, Moments
- Bivariate analysis, Correlations
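The two items above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the helper names `moments` and `pearson_correlation` and the sample series are hypothetical, and the moments are computed as population (not sample) moments, which is an assumption of this sketch.

```python
import math


def moments(xs):
    """Univariate summary: mean, variance, skewness and kurtosis.

    Population moments are used (division by n, no bias correction).
    """
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return {
        "mean": mean,
        "variance": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,  # equals 3 for a normal distribution
    }


def pearson_correlation(xs, ys):
    """Bivariate summary: Pearson linear correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


# Hypothetical illustration data: not real observations
exposure = [1.0, 2.0, 3.0, 4.0, 5.0]
loss = [2.0, 4.0, 6.0, 8.0, 10.0]

exposure_moments = moments(exposure)
exposure_loss_corr = pearson_correlation(exposure, loss)
print(exposure_moments)
print(exposure_loss_corr)
```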

### Visualization

The most suitable visualization choices depend on the data set. In many cases alternative visualizations provide complementary insights.

#### Core Examples

#### Additional Examples

- Multi-vari chart
- Run chart
- Pareto chart
- Stem-and-leaf plot
- Parallel coordinates
- Odds ratio
- Targeted projection pursuit
- Dimensionality reduction:
  - Multidimensional scaling
  - Principal component analysis
  - Multilinear PCA
  - Nonlinear dimensionality reduction
- Median polish
- Trimean
- Ordination
- Box plot
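Two of the entries above, the trimean and the box plot, rest on the same quartile calculation. The following is a minimal sketch of the numbers behind them using the Python standard library. The function name `box_plot_summary`, the sample values, the 1.5 × IQR fence multiplier (a common convention) and the "inclusive" quartile method are all assumptions of this sketch; other quartile conventions give slightly different values.

```python
import statistics


def box_plot_summary(values, k=1.5):
    """Quartiles, trimean, fences and flagged outliers for one variable."""
    data = sorted(values)
    # Quartiles; method="inclusive" treats the data as the full population
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        # Trimean: weighted average of the quartiles, robust to outliers
        "trimean": (q1 + 2 * q2 + q3) / 4,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # Points beyond the fences are usually drawn individually
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


# Hypothetical illustration data with one extreme value
losses = [1, 2, 3, 4, 5, 6, 7, 8, 50]
summary = box_plot_summary(losses)
print(summary)
```

For this sample the arithmetic mean is pulled upwards by the single extreme value, whereas the trimean stays at the centre of the bulk of the data, which is why robust summaries of this kind are common in exploratory work.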

## Open Source Software

- KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse.
- Orange, an open-source data mining and machine learning software suite.
- Python, an open-source programming language widely used in data mining and machine learning.
- R, an open-source programming language for statistical computing and graphics. Together with Python, it is one of the most popular languages for data science.
- Weka, an open-source data mining package that includes visualization and EDA tools such as targeted projection pursuit.