Credit Scoring with Python

From Open Risk Manual
Revision as of 13:48, 21 September 2020 by Wiki admin (talk | contribs) (→‎Data Cleansing)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Credit Scoring with Python

A catalog of python packages that can be used for building a Credit Scorecard.

Motivation and Scope

The objective is to assist with the development of digital Credit Scoring processes that are built around open source software. There is currently no single python framework that covers the full Model Development and Model Validation of Credit Scoring Models, especially not to a standard that would be required if such models where to used in actual production environments. The catalog aims to associate the available functionality of various existing packages with the various steps of the model development / validation process.

The focus of the the catalog is on the variety of statistical scoring models that can be developed quantitatively using historical performance data.

Out of scope are:

Catalog of Python Libraries

Data Collection

Data Collection is a highly context dependent process (depends on the existing systems, databases and their schemas, operating environments etc that hold credit data). Hence it is not possible to pin down concrete packages that would be sufficient in every case. The table is thus only indicative

Procedure Pandas Scikit-learn Other Remarks
Connect to SQL database read_sql SQLAlchemy In-memory only workflows, see Dask for scaling
Connect to NoSQL database pymongo
Load from csv / tsv files read_csv numpy arrays only csv package sklearn.datasets.load_files
Load from json files read_json numpy arrays only json
Load from xls / xlsx / ods files read_excel xlrd, openpyxl
Merge, join, transform operations dataframe operations

Data Review

Risk Data Review is a collection of procedures that aim to


The objective of these procedures is to create a collection of data objects that will conceptually support the next step of identifying useful features.

Procedure Pandas Scikit-learn Other Remarks
Descriptive Statistics DataFrame.describe, pandas_profiling stats.describe
Visualization matplotlib API, pandas.plotting matplotlib API Yellowbrick

Data Cleansing

Data Cleansing is a collection of procedures that aim to

  • Imputing missing Values
  • Correcting wrong Representations (e.g. Date, Percentages)
  • Correcting wrong Values (Summations)


The objective of these procedures is to create a collection of data objects that will practically support the next step of identifying useful features by making sure that all quantitative data meet quality requirements for automated processing.

Procedure Pandas Scikit-learn Other Remarks
Missing Data fillna (via numpy datatypes) sklearn.impute numpy.NaN is the fundamental missing data datatype

Feature Selection and Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. In the context of credit scoring, a feature is usually a Credit Score Factor, that is a variable (attribute, characteristic) that has a potential relationship with the outcome variable. The objective of these procedures is to create a master data table that meets the criteria for supporting the main model development step.

Procedure Pandas Scikit-learn Other Remarks
Feature Selection sklearn.feature_selection
Standardization / Normalization sklearn.preprocessing
Transformations sklearn.pipeline featuretools

Model Selection and Fit

Model selection and fit is the process of selecting a suitable model (and model hyperparameters) and the subsequent estimation of the model. We assume here that the selection process in made manually

Procedure Pandas Scikit-learn Other Remarks
Model Fit scikit-learn.model.fit

Model Validation

Model Validation of a Credit Scorecard aims to contain the Model Risk associated with using the scorecard (the potential for error in the development and implementation of the model and/or the application or interpretation of model results). The nature of possible errors is linked to the nature of the model and its production use (e.g. Type I / Type II classification errors when accepting a new client)

Procedure Pandas Scikit-learn Other Remarks
Cross -Validation sklearn.cross_validation
Accuracy Score sklearn.metrics

NB: Model Selection, Fit and Validation may be an iterative process

Model Deployment

Model Deployment entails (in its most basic form) to make available the credit scorecard to users. It is very common that developed scorecards are re-programmed in other languages when deployed in production. Here we only cover some options of using python also as a deployment platform.

Procedure Pandas Scikit-learn Other Remarks
As a desktop application PyQT, wxWidgets, Kivy
Attaching to a web service Flask, Bottle, Django

Further Notes

Criteria for Inclusion

  • We make no differentiation by license (provided it is an open source license)
  • We make no detailed assessment of maturity / testing (we will revisit this in due course)
  • Initial preference is to the "well known" / widely used projects of the python ecosystem
  • For some tasks there might be multiple packages that offer the same functionality. In those instances we will want to catalogue all good alternatives.

Method

The structure of the catalog is to decompose the required functionality in a roughly linear fashion following the steps of the Risk Model Lifecycle (see also How to Build a Credit Scorecard). In practice these steps might be performed by different teams, at different times, in different sequences, using different tools etc. It is not for this entry to document best practices in this respect.

Issues and Challenges

  • Actual credit scorecards may vary significantly in structure as they need to operate in different organizational, operational and regulatory contexts. The aim here is to capture a typical quantitative and in particular machine learning oriented development workflow that uses modern open source libraries available for that purpose
  • This entry is not a credit model catalog. There is a separate entry for that purpose, although in the model development segment we do list the packages that offer relevant models for credit scorecard development
  • This entry is not a workflow for complete end-to-end credit scorecard development. How to Build a Credit Scorecard is a separate entry that covers that task
  • Many tasks in the catalog can be coded from scratch (using e.g numpy) instead of using an existing python library. Using a package introduces additional dependencies but also reduces the risk of code errors and speeds up development.

See Also