Credit Scoring with Python

A catalog of python packages that can be used for building a Credit Scorecard.

Motivation and Scope

The objective is to assist with the development of digital Credit Scoring processes that are built around open source software. There is currently no single python framework that covers the full Model Development and Model Validation of Credit Scoring Models, especially not to a standard that would be required if such models where to used in actual production environments. The catalog aims to associate the available functionality of various existing packages with the various steps of the model development / validation process.

The focus of the the catalog is on the variety of statistical scoring models that can be developed quantitatively using historical performance data.

Out of scope are:

low data approaches such as Expert Based Models
models that use Financial Market Information such as observed credit spreads
Credit Rating Model

Catalog of Python Libraries

Data Collection

Data Collection is a highly context dependent process (depends on the existing systems, databases and their schemas, operating environments etc that hold credit data). Hence it is not possible to pin down concrete packages that would be sufficient in every case. The table is thus only indicative

Procedure	Pandas	Scikit-learn	Other	Remarks
Connect to SQL database	read_sql		SQLAlchemy	In-memory only workflows, see Dask for scaling
Connect to NoSQL database			pymongo
Load from csv / tsv files	read_csv	numpy arrays only	csv package	sklearn.datasets.load_files
Load from json files	read_json	numpy arrays only	json
Load from xls / xlsx / ods files	read_excel		xlrd, openpyxl
Merge, join, transform operations	dataframe operations

Data Review

Risk Data Review is a collection of procedures that aim to

establish Data Quality and, where appropriate, identify actions that will improve it
perform Exploratory Data Analysis to help generate insights about the available data sets

The objective of these procedures is to create a collection of data objects that will conceptually support the next step of identifying useful features.

Procedure	Pandas	Scikit-learn	Other	Remarks
Descriptive Statistics	DataFrame.describe, pandas_profiling		stats.describe
Visualization	matplotlib API, pandas.plotting	matplotlib API	Yellowbrick

Data Cleansing

Data Cleansing is a collection of procedures that aim to

Imputing missing Values
Correcting wrong Representations (e.g. Date, Percentages)
Correcting wrong Values (Summations)

The objective of these procedures is to create a collection of data objects that will practically support the next step of identifying useful features by making sure that all quantitative data meet quality requirements for automated processing.

Procedure	Pandas	Scikit-learn	Other	Remarks
Missing Data	fillna (via numpy datatypes)	sklearn.impute		numpy.NaN is the fundamental missing data datatype

Feature Selection and Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. In the context of credit scoring, a feature is usually a Credit Score Factor, that is a variable (attribute, characteristic) that has a potential relationship with the outcome variable. The objective of these procedures is to create a master data table that meets the criteria for supporting the main model development step.

Procedure	Scikit-learn	Other
Feature Selection	sklearn.feature_selection
Standardization / Normalization	sklearn.preprocessing
Transformations	sklearn.pipeline	featuretools

Model Selection and Fit

Model selection and fit is the process of selecting a suitable model (and model hyperparameters) and the subsequent estimation of the model. We assume here that the selection process in made manually

Procedure	Pandas	Scikit-learn	Other	Remarks
Model Fit		scikit-learn.model.fit

Model Validation

Model Validation of a Credit Scorecard aims to contain the Model Risk associated with using the scorecard (the potential for error in the development and implementation of the model and/or the application or interpretation of model results). The nature of possible errors is linked to the nature of the model and its production use (e.g. Type I / Type II classification errors when accepting a new client)

Procedure	Pandas	Scikit-learn	Other	Remarks
Cross -Validation		sklearn.cross_validation
Accuracy Score		sklearn.metrics

NB: Model Selection, Fit and Validation may be an iterative process

Model Deployment

Model Deployment entails (in its most basic form) to make available the credit scorecard to users. It is very common that developed scorecards are re-programmed in other languages when deployed in production. Here we only cover some options of using python also as a deployment platform.

Procedure	Pandas	Scikit-learn	Other	Remarks
As a desktop application			PyQT, wxWidgets, Kivy
Attaching to a web service			Flask, Bottle, Django

Further Notes

Criteria for Inclusion

We make no differentiation by license (provided it is an open source license)
We make no detailed assessment of maturity / testing (we will revisit this in due course)
Initial preference is to the "well known" / widely used projects of the python ecosystem
For some tasks there might be multiple packages that offer the same functionality. In those instances we will want to catalogue all good alternatives.

Method

The structure of the catalog is to decompose the required functionality in a roughly linear fashion following the steps of the Risk Model Lifecycle (see also How to Build a Credit Scorecard). In practice these steps might be performed by different teams, at different times, in different sequences, using different tools etc. It is not for this entry to document best practices in this respect.

Issues and Challenges

Actual credit scorecards may vary significantly in structure as they need to operate in different organizational, operational and regulatory contexts. The aim here is to capture a typical quantitative and in particular machine learning oriented development workflow that uses modern open source libraries available for that purpose
This entry is not a credit model catalog. There is a separate entry for that purpose, although in the model development segment we do list the packages that offer relevant models for credit scorecard development
This entry is not a workflow for complete end-to-end credit scorecard development. How to Build a Credit Scorecard is a separate entry that covers that task
Many tasks in the catalog can be coded from scratch (using e.g numpy) instead of using an existing python library. Using a package introduces additional dependencies but also reduces the risk of code errors and speeds up development.

Credit Scoring with Python

Contents