Difference between revisions of "Exploratory Data Analysis"

From Open Risk Manual
(Visualization)
Line 28: Line 28:
  === Visualization ===
- * [[Box plot]]
+ The most suitable visualization choices depend on the data set. In many cases alternative visualizations provide complementary insights
+
+ ==== Core Examples ====
  * [[Histogram]]
+ * [[Scatter plot]]
+
+ ==== Additional Examples ====
  * [[Multi-vari chart]]
  * [[Run chart]]
  * [[Pareto chart]]
- * [[Scatter plot]]
+
  * [[Stem-and-leaf plot]]
  * [[Parallel coordinates]]
Line 46: Line 50:
  * [[Trimean]]
  * [[Ordination]]
+ * [[Box plot]]

  == Open Source Software ==

Revision as of 13:45, 4 September 2019

Definition

Exploratory Data Analysis (EDA) in the context of Risk Management is the process of systematically analysing Risk Data in order to identify and summarize their main characteristics in text-based (tabular) or visual reports.

EDA can be employed as a standalone process producing management information, or as part of Quantitative Risk Management, as one of the first steps towards Model Development. Because it relies on data and is sensitive to Data Quality, EDA conceptually overlaps with Risk Data Review activities and is usually preceded by (or iterated with) them.

EDA is distinct from regular Risk Reporting, the activity of producing standardized monitoring reports to help manage already identified risk factors.

Objectives

The analysis objectives can span a wide range depending on context:

  • Provide suggestions for further data collection and/or data quality improvements (e.g. Missing Data)
  • Define and identify outliers
  • Create an overall map of the underlying structure of the data (identifying clusters of similar variables, reducing dimensionality)
  • Identify potential hypotheses about the causes underlying the observed risks (important / statistically significant factors)
  • Support the selection of appropriate models
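
Two of these objectives, flagging missing data and identifying outliers, can be illustrated with a minimal sketch. It assumes missing records are encoded as None and uses the common interquartile-range (Tukey fence) rule as one possible outlier definition; the function names and the sample loss amounts are hypothetical and not part of the original article.

```python
from statistics import quantiles


def missing_share(values):
    """Fraction of observations that are missing (encoded here as None)."""
    return sum(v is None for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Observed values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    observed = [v for v in values if v is not None]
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in observed if v < low or v > high]


# Hypothetical loss amounts; None marks a missing record.
losses = [12.0, 15.0, None, 14.0, 13.0, 250.0, 16.0, None, 11.0, 15.0]
print("share missing:", missing_share(losses))
print("outliers:", iqr_outliers(losses))
```

The multiplier k = 1.5 is a convention, not a requirement; what counts as an outlier is itself one of the definitions an EDA exercise has to settle.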


NB: While exploratory data analysis does not aim to produce a model, it may use auxiliary models (e.g. regressions) to obtain preliminary insights into potential causal relationships.
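
As an illustration of such an auxiliary model, the sketch below fits an ordinary least-squares line through two variables. The variable names (leverage, default_rate) and the data are invented for the example; the fitted slope is only a first indication of association and does not establish causality.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# Hypothetical observations: a risk driver and an observed risk outcome.
leverage = [1.0, 2.0, 3.0, 4.0, 5.0]
default_rate = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(leverage, default_rate)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```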

Techniques

Qualitatively, exploratory data analysis techniques can be classified according to:

  • whether the summary information generated is numerical (e.g. tabulated numerical results) or graphical (e.g. plots and charts) in nature, keeping in mind that many analyses can be expressed in both forms
  • the dimensionality of the data (sub)set being investigated (1D, 2D, or multidimensional). Univariate and Bivariate analysis in particular are highly standardized subcategories
  • whether the analysis is model-free (non-parametric) or involves some (generally simple) modelling assumptions

Summary Statistics

  • Univariate analysis, Moments
  • Bivariate analysis, Correlations
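
A minimal sketch of both kinds of summary follows: the first four moments of a single variable and the Pearson correlation between two variables. It uses population (divide-by-n) moments, which is one of several conventions; the function names are illustrative assumptions.

```python
from math import sqrt
from statistics import mean


def central_moment(values, order):
    """Population central moment of the given order."""
    m = mean(values)
    return sum((v - m) ** order for v in values) / len(values)


def univariate_summary(values):
    """Mean, variance, skewness and excess kurtosis of one variable."""
    var = central_moment(values, 2)
    return {
        "mean": mean(values),
        "variance": var,
        "skewness": central_moment(values, 3) / var ** 1.5,
        "excess_kurtosis": central_moment(values, 4) / var ** 2 - 3.0,
    }


def pearson_correlation(x, y):
    """Pearson linear correlation coefficient between two variables."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm


# Hypothetical sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * v + 1.0 for v in x]
print(univariate_summary(x))
print(pearson_correlation(x, y))
```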

Visualization

The most suitable visualization choices depend on the data set. In many cases, alternative visualizations of the same data provide complementary insights.
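
As a small illustration, a histogram can be produced even in a text-based report. The sketch below bins a sample into equal-width intervals and draws one bar per bin; the function name, the bin count and the sample values are assumptions made for the example. In practice a plotting library would normally be used instead.

```python
def text_histogram(values, bins=5, width=30):
    """Equal-width histogram; returns the bin counts and printable lines."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin index `bins`; clamp it to the last bin.
        idx = min(int((v - lo) / step), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        left = lo + i * step
        bar = "#" * round(width * c / peak)
        lines.append(f"{left:8.2f} - {left + step:8.2f} | {bar} {c}")
    return counts, lines


# Hypothetical sample with one large value in the right tail.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 10]
counts, lines = text_histogram(sample, bins=3)
print("\n".join(lines))
```

Changing the number of bins can change the apparent shape of the distribution, which is one reason to compare several views of the same data.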

Core Examples

  • Histogram
  • Scatter plot

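Additional Examples

  • Multi-vari chart
  • Run chart
  • Pareto chart
  • Stem-and-leaf plot
  • Parallel coordinates
  • Trimean
  • Ordination
  • Box plot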

Open Source Software

  • KNIME (Konstanz Information Miner), an open-source data exploration platform based on Eclipse.
  • Orange, an open-source data mining and machine learning software suite.
  • Python, an open-source programming language widely used in data mining and machine learning.
  • R, an open-source programming language for statistical computing and graphics. Together with Python, one of the most popular languages for data science.
  • Weka, an open-source data mining package that includes visualization and EDA tools such as targeted projection pursuit.

Contributors to this article

» Wiki admin