Garbage In Garbage Out

From Open Risk Manual
 

Latest revision as of 11:30, 27 June 2020

Definition

Garbage in, garbage out (GIGO), in the context of Quantitative Risk Management, refers to the fact that mathematical algorithms may accept flawed, even nonsensical, data as Model Inputs ("Garbage In") and as a consequence produce nonsensical, unusable Model Outputs ("Garbage Out").

Types of GIGO

Missing data or incorrect data formats

Different IT systems (databases, programming languages, etc.) strike different compromises between the need to allow flexible handling of data and the need to enforce strict data types. The result may be:

  • Allowing for Missing Data when the downstream algorithms require valid data
  • Misinterpretation of values (e.g. assigning a numerical value to a string)
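
Both failure modes above can be caught at the point of ingestion. The following is a minimal sketch, assuming the raw field arrives as an optional string; the function name parse_exposure is hypothetical and not part of any system described here. It rejects missing or non-numeric input instead of letting a permissive conversion pass it downstream.

```python
import math
from typing import Optional


def parse_exposure(raw: Optional[str]) -> float:
    """Parse a raw field into a float, refusing missing or malformed input.

    Hypothetical helper for illustration only.
    """
    if raw is None or raw.strip() == "":
        raise ValueError("missing value")
    try:
        value = float(raw)
    except ValueError:
        # Strings such as "N/A" or "1,000" are rejected, not guessed at.
        raise ValueError(f"not a number: {raw!r}") from None
    # float() accepts "nan" and "inf"; this sketch treats them as missing data.
    if not math.isfinite(value):
        raise ValueError(f"non-finite value: {raw!r}")
    return value
```

Failing loudly is a design choice for this sketch; a real pipeline might instead route rejected records to a review queue.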

Erroneous data values

Erroneous data values are data values that are nominally valid but are nevertheless wrong entries, typically as a symptom of manual data processing. Infamous examples are Fat-finger errors, where a keystroke slip produces, for instance, an amount with an extra zero.
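
Because such values pass type and format checks, detecting them usually requires a plausibility test against other data. The sketch below is one possible heuristic, not an established method: it flags a candidate value that differs from the median of recent values by more than a chosen factor. The function name flag_fat_finger and the default threshold of 10 are assumptions made for illustration.

```python
from statistics import median


def flag_fat_finger(history, candidate, max_ratio=10.0):
    """Return True if candidate looks implausible relative to history.

    Hypothetical heuristic: compares candidate with the median of a
    non-empty list of recent positive values.
    """
    reference = median(history)  # raises StatisticsError if history is empty
    if reference <= 0 or candidate <= 0:
        # This sketch only handles positive amounts; flag anything else.
        return True
    ratio = candidate / reference
    return ratio > max_ratio or ratio < 1.0 / max_ratio
```

For example, with a history of [100, 105, 98, 102] a candidate of 10500 is flagged while 103 is not. A flag only signals that the entry deserves review; a genuinely large value would be flagged as well.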

Inaccurate data values

Data Accuracy refers to the degree to which the available data represents the phenomenon that is being modelled. Accuracy may be difficult to establish. Some typical indicators of increased risk are:

  • Stale data. For dynamic (evolving) data sets, Data Timeliness may be critical for accuracy.
  • Extensive use of Data Proxies due to a lack of more relevant / representative data.
  • Complex domain with many alternative indicators. For example, financial reports contain hundreds of variables with varying degrees of suitability for any given question.
  • Challenging modelling requirement. In some use cases it may be intrinsically difficult to base any quantitative model on observed data. In this instance data accuracy overlaps with Model Risk.
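
Of the indicators above, staleness is the easiest to test mechanically. A minimal sketch, assuming each observation carries an as-of timestamp and that an acceptable maximum age has been agreed for the data set (the name is_stale and its parameters are hypothetical):

```python
from datetime import datetime, timedelta


def is_stale(as_of: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when an observation dated as_of is older than max_age at time now."""
    return now - as_of > max_age
```

With a seven-day tolerance, an observation dated 1 June 2020 would count as stale on 27 June 2020, whereas one dated 25 June 2020 would not. The other indicators (proxies, domain complexity, modelling difficulty) generally call for expert judgement and are not reducible to a check of this kind.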

GIGO in Development versus GIGO in Production

The nature of the GIGO pathology becomes more specific when one considers the phases of Model Development and Model Usage (in production). The risk of GIGO may refer to either:

  • The Model Estimation phase, where poor selection of datasets may lead to flawed model selection or model parameter estimation
  • Model use in production, where problematic model inputs may lead to flawed model outcomes (predictions)
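
One way to connect the two phases is to record, at estimation time, the range of each input the model was fitted on, and to compare production inputs against those ranges before scoring. The sketch below illustrates the idea under simple assumptions: inputs are dictionaries of numeric features, and the function names fit_input_ranges and out_of_range_features are hypothetical.

```python
def fit_input_ranges(training_rows):
    """Record the min and max of each feature seen during estimation."""
    ranges = {}
    for row in training_rows:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def out_of_range_features(row, ranges):
    """List features that are missing or outside the range seen in estimation."""
    problems = []
    for name, (lo, hi) in ranges.items():
        value = row.get(name)
        if value is None or not (lo <= value <= hi):
            problems.append(name)
    return problems
```

A production input outside the estimation range is not necessarily garbage, but the model has no empirical basis for it, so it is a reasonable trigger for review. The check cannot help if the estimation data set was itself poorly selected.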

XKCD

  • Garbage Math (https://www.explainxkcd.com/wiki/index.php/2295:_Garbage_Math)

See Also

  • BCBS 239

Contributors to this article

» Wiki admin