This page aims to be a comprehensive collection of publicly available models and algorithms used for credit scoring.
The credit scoring model collection focuses on the classic one-period credit assessment / classification problem, which typically produces a credit score and/or a probabilistic estimate of credit risk on the basis of selected characteristics of a borrower.
Out of scope for this page are the following categories of credit risk models which, albeit related to credit scoring, are distinct:
The following characteristics define more precisely the credit scoring model collection documented on this page:
Credit scoring models have been used globally for decades and in a variety of contexts. The significant overlap of credit scoring methodology with other statistical disciplines means that the entire arsenal of statistical methods has been available and tried with varying degrees of success, usability and adoption. We identify here some key model attributes that can help categorize the variety of models.
These attributes are focused on characterizing the models themselves and not the domain to which they are applied. For example, a logistic regression based credit score model applied to individuals might differ from one applied to SMEs in the number and type of characteristics used. For the purposes of this catalog these two instances belong to the same category.
Generative models produce distributions for the entire set of variables, that is, also for the population characteristics. In classic credit scoring the population characteristics are typically analyzed statistically but are not modeled jointly with the outcome variable. Examples of generative models: Hidden Markov Models, Naive Bayes (https://en.wikipedia.org/wiki/Generative_model). Examples of discriminative models: linear/logistic regression, random forests, support vector machines, boosting (meta-algorithm), conditional random fields, neural networks (https://en.wikipedia.org/wiki/Discriminative_model).
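As a rough illustration of the generative approach, a minimal Gaussian Naive Bayes sketch is shown below (all data are hypothetical and one-dimensional; this is not a production implementation). The model estimates the full joint distribution, i.e. a class prior and a class-conditional density for the borrower characteristic, and then applies Bayes' rule; a discriminative model would instead fit p(default | characteristic) directly.

```python
import math

# Hypothetical toy data: a single characteristic (e.g. a debt ratio) per
# borrower and a binary label (1 = default observed, 0 = no default).
X = [0.1, 0.2, 0.15, 0.8, 0.9, 0.85]
y = [0, 0, 0, 1, 1, 1]

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class Gaussian likelihood parameters."""
    params = {}
    for c in set(y):
        xs = [x for x, label in zip(X, y) if label == c]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[c] = (len(xs) / len(y), mean, var)  # (prior, mean, variance)
    return params

def posterior_default(x, params):
    """Bayes' rule: p(c | x) proportional to p(x | c) * p(c)."""
    def joint(c):
        prior, mean, var = params[c]
        dens = math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        return prior * dens
    return joint(1) / (joint(0) + joint(1))

params = fit_gaussian_nb(X, y)
print(posterior_default(0.12, params))  # low debt ratio -> low default probability
```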
Parametric models posit an explicit functional form governed by a finite number of parameters, whereas non-parametric models infer the functional form directly from the data, implicitly allowing an infinite number of parameters (https://en.wikipedia.org/wiki/Parametric_model, https://en.wikipedia.org/wiki/Nonparametric_statistics). There can also be mixtures (semi-parametric models, combining an explicit parametric component with a non-parametric treatment of other components). Examples: models employing Kernel Density Estimation, k-NN.
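The non-parametric case can be sketched with a minimal k-NN classifier (hypothetical toy data; in practice characteristics would need scaling before distances are meaningful). Note that no functional form is estimated: the "model" is the training data itself.

```python
from collections import Counter

# Hypothetical toy data: (income, debt_ratio) pairs with binary default labels.
borrowers = [((30, 0.9), 1), ((25, 0.8), 1), ((80, 0.2), 0),
             ((90, 0.1), 0), ((70, 0.3), 0)]

def knn_classify(query, data, k=3):
    """Classify by majority vote among the k nearest borrowers."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    nearest = sorted(data, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((85, 0.15), borrowers))  # resembles the non-defaulters -> 0
```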
Linear models impose linear relations between the variables of the model. Generalized linear models relax this constraint only in the relationship between input and output variables (via a link function), thereby retaining significant tractability versus a fully non-linear model (https://en.wikipedia.org/wiki/Generalized_linear_model). Examples: logistic regression (GLM); neural networks (non-linear).
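The GLM structure can be sketched as follows (the coefficients are hypothetical, not fitted to any data): a linear predictor over the borrower characteristics is mapped through the inverse link function, here the logistic sigmoid, to a default probability.

```python
import math

def logistic_score(characteristics, weights, intercept):
    """GLM sketch: linear predictor + logistic (inverse logit) link."""
    linear_predictor = intercept + sum(w * x for w, x in zip(weights, characteristics))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical coefficients for two characteristics:
# income (negative effect on default) and debt ratio (positive effect).
weights = [-0.04, 3.0]
intercept = -1.0
print(round(logistic_score([50, 0.4], weights, intercept), 3))  # -> 0.142
```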
Predictive models allow the estimation of a continuous variable, whereas classification models predict membership of a class (expressed by a category). In classic credit scoring the response variable is binary, hence most algorithms can be seen as addressing a classification problem, even if they are implemented as regressions. Example: logistic regression. Clustering algorithms provide as their primary output an identification of similarity classes.
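The relation between the two views can be made concrete in a short sketch: a continuous (regression-type) risk estimate becomes a class (classification-type) prediction by applying a cutoff. The cutoff value is hypothetical and in practice reflects business policy and misclassification costs.

```python
def classify(probability_of_default, cutoff=0.5):
    """Map a continuous credit risk estimate to an accept/reject class."""
    return "reject" if probability_of_default >= cutoff else "accept"

print(classify(0.14))               # -> accept
print(classify(0.14, cutoff=0.1))   # a stricter cutoff rejects the same applicant
```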
Supervised models require the presence of labels (e.g. realized credit events) in the training data set. Unsupervised models do not require such information (and therefore classify or predict credit events only indirectly). Unsupervised models are further sub-divided into clustering (identifying population groupings) and association rules. Example: k-means. Semi-supervised machine learning corresponds in credit scoring to a situation of a censored dataset.
In the first category all variables are in principle observable (manifest). In the second category there is an assumption that important dependencies between observable variables are mediated by latent (hidden, unobservable) variables. Such variables may represent an internal "state" that has its own well defined meaning, e.g. creditworthiness (https://en.wikipedia.org/wiki/Latent_variable_model), or be hidden layers (sets of intermediate variables) as in neural network models (https://en.wikipedia.org/wiki/Multilayer_perceptron).
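The hidden-layer case can be sketched with a forward pass through a tiny multilayer perceptron (all weights are hypothetical, not trained): the hidden activations are intermediate, unobserved quantities standing between the borrower characteristics and the output score.

```python
import math

def mlp_forward(x, w_hidden, w_out):
    """Single hidden layer: the 'hidden' list holds the latent activations."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    score = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-score)), hidden

# Hypothetical weights: two inputs, two hidden units, one output.
prob, latent = mlp_forward([0.5, 0.2],
                           w_hidden=[[1.0, -1.0], [0.5, 0.5]],
                           w_out=[2.0, -1.0])
print(len(latent))  # two latent (hidden) activations
```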
Elementary algorithms consist of a single defined set of statistical relationships. Composite algorithms are instead constructed out of ensembles or averages of more elementary models (https://en.wikipedia.org/wiki/Ensemble_learning). There are various options for constructing the ensemble: bootstrap aggregation (bagging), AdaBoost, etc.
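Bootstrap aggregation can be sketched in a few lines (hypothetical data; the elementary model here is deliberately trivial, namely the mean default rate of a bootstrap resample, whereas in practice it would be e.g. a tree): the composite prediction is the average over the ensemble.

```python
import random

def bagged_default_rate(labels, n_models=100, seed=0):
    """Average the predictions of n_models elementary models,
    each fit to a bootstrap resample of the data."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_models):
        resample = [rng.choice(labels) for _ in labels]
        estimates.append(sum(resample) / len(resample))  # elementary model
    return sum(estimates) / len(estimates)               # ensemble average

labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]  # hypothetical realized credit events
print(round(bagged_default_rate(labels), 2))  # near the raw rate of 0.2
```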
In a frequentist approach models are fit to data without any use of prior knowledge about model parameters (hence assuming uniform, or non-informative, priors). A Bayesian approach allows the systematic incorporation of prior information into the model estimation (https://en.wikipedia.org/wiki/Bayesian_inference). Example: Markov Chain Monte Carlo estimation.
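A conjugate Beta-Binomial sketch shows the contrast in the simplest setting (the prior parameters are hypothetical): prior knowledge about a segment's default rate is encoded as a Beta prior and updated with the observed defaults, whereas a frequentist estimate would use only the observed frequency.

```python
def posterior_default_rate(defaults, total, prior_a=2.0, prior_b=38.0):
    """Posterior mean of the default rate under a Beta(prior_a, prior_b) prior.

    Conjugacy: Beta prior + Binomial likelihood -> Beta posterior."""
    a = prior_a + defaults
    b = prior_b + (total - defaults)
    return a / (a + b)

# With only 10 observations the prior (mean 0.05) dominates;
# the raw observed frequency would be 2/10 = 0.2.
print(posterior_default_rate(2, 10))  # -> 0.08
```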
This is a live catalog of credit scoring models (algorithms). The granularity of both model coverage and model characteristics may increase over time.
|Model ||Generative||Parametric||Type||Linear||Supervised||Manifest||Elementary||Frequentist|
|Linear Discriminant Analysis (LDA) ||No||Yes||Regr.||Yes||Yes||Yes||Yes||Yes|
|Logistic Regression ||No||Yes||Regr.||Yes (GLM)||Yes||Yes||Yes||Yes|
|Tobit / Probit Regression ||No||Yes||Regr.||Yes (GLM)||Yes||Yes||Yes||Yes|
|Classification Tree ||No||No||Clas.||No||Yes||Yes||Yes||Yes|
|Random Forest ||No||No||Clas.||No||Yes||Yes||No||Yes|
|Support Vector Machine ||No||No||Clas.||No||Yes||No||Yes||Yes|
|k-Nearest Neighbors (k-NN) ||No||No||Clas.||No||Yes||Yes||Yes||Yes|
|Multilayer Perceptron ||No||No||Clas.||No||Yes||No||Yes||Yes|
|k-Means Clustering ||No||No||Clus.||No||No||Yes||Yes||Yes|
|Naive Bayes Classifier ||Yes||Yes||Regr.||Yes (GLM)||Yes||Yes||Yes||Yes|
|Bayesian Network ||Yes||Yes||Regr.||Yes (GLM)||Yes||Yes||Yes||Yes|
List of references (academic / other publications). Preference should be given to:
The list does not aim to establish academic priority but to provide sufficient documentation for each listed model. Multiple references are fine if they complement each other.
The usual disclaimer applies: inclusion in the list does not imply any assurances about correctness, completeness or suitability.