From Open Risk Manual

Latest revision as of 13:22, 3 September 2019

Credit Scoring with Python

This manual entry aims to offer (in due course) a complete catalog of Python packages that can be used to build a Credit Scorecard, in support of digital Credit Scoring processes built around open source software.

There is currently no single Python framework that covers the full Model Development and Model Validation of credit scoring models, as would be required if such models were to be used in actual production environments. Hence the catalog aims to map the functionality of existing packages to the steps of the model development / validation process.

Scope

The focus of the catalog is on the variety of statistical scoring models that can be developed quantitatively using historical performance data. Hence out of scope are other approaches such as Expert Based Models or models that use market information. Similarly out of scope are other related Credit Risk Modelling activities.

Criteria for Inclusion

  • No differentiation by license (provided of course it is an open source license)
  • No detailed assessment of maturity / testing (we will revisit this in due course) but initial preference to the "well known" projects of the Python ecosystem
  • For some tasks there might be multiple packages that offer the same functionality. In those instances we might want to catalogue a few alternatives.

Method

The structure of the catalog is to decompose the required functionality in a roughly linear fashion following the steps of the Risk Model Lifecycle. In practice these steps might be performed by different teams, at different times, in different sequences, using different tools etc. It is not for this table to document best practices in this respect.

Notes

  • Actual credit scorecards may vary significantly in structure as they need to operate in different organizational, operational and regulatory contexts. The aim here is to capture a typical quantitative and in particular machine learning oriented development workflow that uses modern open source libraries available for that purpose
  • This entry is not a credit model catalog. There is a separate entry for that purpose, although in the model development segment we do list the packages that offer relevant models for credit scorecard development
  • This entry is not a workflow for complete end-to-end credit scorecard development. How to Build a Credit Scorecard is a separate entry that covers that task
  • Many tasks in the catalog can be coded from scratch (using e.g. numpy) instead of using an existing Python library. Using a package introduces additional dependencies but also reduces the risk of code errors and speeds up development.
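
As an illustration of the last note, the following minimal sketch performs the same task (standardizing numeric columns) once from scratch with numpy and once with a library call. The feature matrix is hypothetical and the choice of task is only an assumption made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: rows are borrowers, columns are numeric factors.
X = np.array([[25.0, 1200.0],
              [40.0, 3100.0],
              [58.0, 2500.0]])

# From scratch: subtract the column mean, divide by the population
# standard deviation (numpy's default, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same operation through the library.
X_library = StandardScaler().fit_transform(X)

# Both routes should agree up to floating point error.
same_result = np.allclose(X_manual, X_library)
```

The from-scratch version is one line here, but it carries hidden choices (such as the `ddof` convention) that a well-tested package has already settled.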

Catalog of Python Libraries

Data Collection

Data collection is a highly context-dependent process: it depends on the existing systems, databases and their schemas, operating environments, etc. that hold credit data. Hence it is not possible to pin down concrete packages that would be sufficient in every case. The table is thus only indicative.

{|
! Procedure !! Pandas !! Scikit-learn !! Other !! Remarks
|-
| Connect to SQL database || read_sql ||  || SQLAlchemy || In-memory only workflows, see Dask for scaling
|-
| Connect to NoSQL database ||  ||  || pymongo ||
|-
| Load from csv, json, xls files || read_csv, read_json || numpy arrays only || csv, json, xlrd, openpyxl ||
|-
| Merge, join, transform operations || dataframe operations ||  ||  ||
|}
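
A minimal sketch of the pandas route through this table might look as follows. It assumes an in-memory SQLite database (via the standard library `sqlite3` module) in place of a real credit data store accessed through SQLAlchemy, and the table and column names (`loans`, `customer_id`, `defaulted`, etc.) are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a production credit data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (loan_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [(1, 10, 5000.0), (2, 11, 12000.0), (3, 10, 800.0)],
)
conn.commit()

# read_sql pulls the full query result into an in-memory DataFrame.
loans = pd.read_sql("SELECT * FROM loans", conn)

# read_csv accepts a path or any file-like object; a string buffer is used here.
csv_text = "customer_id,age,defaulted\n10,34,0\n11,51,1\n"
customers = pd.read_csv(io.StringIO(csv_text))

# Merge the two sources on the shared key to get one row per loan.
merged = loans.merge(customers, on="customer_id", how="left")
conn.close()
```

Everything above is held in memory, which is the limitation the Remarks column points to.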

Data Review and Validation

Data review is a collection of procedures whose objective is to create a collection of data objects that will support the next step of identifying useful features.

{|
! Procedure !! Pandas !! Scikit-learn !! Other !! Remarks
|-
| Missing Data || fillna (via numpy datatypes) || sklearn.impute ||  || numpy.nan is the fundamental missing data datatype
|-
| Descriptive Statistics || DataFrame.describe, pandas_profiling ||  || scipy.stats.describe ||
|-
| Visualization || matplotlib API, pandas.plotting || matplotlib API || Yellowbrick ||
|}
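
The missing data and descriptive statistics rows can be sketched as below. The applicant data is hypothetical, and median imputation is only one possible treatment, chosen here as an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical applicant data with gaps marked as numpy.nan.
df = pd.DataFrame({
    "income": [32000.0, np.nan, 54000.0, 41000.0],
    "age": [29.0, 45.0, np.nan, 38.0],
})

# Count missing values per column before deciding how to treat them.
missing_counts = df.isna().sum()

# pandas route: fill each column with its own median.
filled_pandas = df.fillna(df.median())

# scikit-learn route: the same median strategy as a reusable transformer.
imputer = SimpleImputer(strategy="median")
filled_sklearn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Summary statistics (count, mean, std, quartiles) of the cleaned table.
summary = filled_pandas.describe()
```

The scikit-learn transformer has the practical advantage that the medians learned on development data can be re-applied unchanged to new data.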

Feature Selection and Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. In the context of credit scoring, a feature is usually a Credit Score Factor, that is a variable (attribute, characteristic) that has a potential relationship with the outcome variable. The objective of these procedures is to create a master data table that meets the criteria for supporting the main model development step.

{|
! Procedure !! Pandas !! Scikit-learn !! Other !! Remarks
|-
| Feature Selection ||  || sklearn.feature_selection ||  ||
|-
| Standardization / Normalization ||  || sklearn.preprocessing ||  ||
|-
| Transformations ||  || sklearn.pipeline || featuretools ||
|}
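
A minimal sketch combining the three scikit-learn modules listed above is shown below. It assumes synthetic data from `make_classification` in place of a real master data table, and the choice of `SelectKBest` with `f_classif` and `k=4` is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a master table of candidate score factors.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Chain the steps so they are always applied in the same order.
features = Pipeline([
    ("scale", StandardScaler()),              # zero mean, unit variance
    ("select", SelectKBest(f_classif, k=4)),  # keep the 4 highest-scoring columns
])

X_selected = features.fit_transform(X, y)

# Indices of the original columns that survived selection.
kept_columns = features.named_steps["select"].get_support(indices=True)
```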

Model Selection and Fit

Model selection and fit is the process of selecting a suitable model (and model hyperparameters) and the subsequent estimation of the model. We assume here that the selection process is made manually.

{|
! Procedure !! Pandas !! Scikit-learn !! Other !! Remarks
|-
| Model Fit ||  || estimator.fit (the fit method of scikit-learn estimators) ||  ||
|}
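
As a sketch of the fit step, the example below estimates a logistic regression, which is an assumption made for illustration (the entry does not prescribe a model family). The data is synthetic and the hyperparameter `C=1.0` is set by hand, in line with the manual selection assumed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary outcome (1 = default) with six numeric features.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)

# Hold back a quarter of the data for later validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Manually chosen model and hyperparameters.
model = LogisticRegression(C=1.0, max_iter=1000)
model.fit(X_train, y_train)

# Estimated probability of the positive class for each held-out case.
default_probability = model.predict_proba(X_test)[:, 1]
```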

Model Validation

Model Validation of a Credit Scorecard aims to contain the Model Risk associated with using the scorecard (the potential for error in the development and implementation of the model and/or the application or interpretation of model results). The nature of possible errors is linked to the nature of the model and its production use (e.g. Type I / Type II classification errors when accepting a new client).

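{|
! Procedure !! Pandas !! Scikit-learn !! Other !! Remarks
|-
| Cross-Validation ||  || sklearn.model_selection ||  || Replaces the former sklearn.cross_validation module
|-
| Accuracy Score ||  || sklearn.metrics ||  ||
|}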
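
The two rows above can be sketched as follows, again assuming synthetic data and a logistic regression as a placeholder model. The confusion matrix is included as one way to count the two kinds of classification error mentioned earlier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the development sample: one accuracy per fold.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")

# Refit on the full development sample and score the hold-out sample.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
holdout_accuracy = accuracy_score(y_test, y_pred)

# For a binary outcome, ravel() yields the four cells in this order.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
```

A large spread across `cv_scores`, or a hold-out accuracy well below their mean, would be a signal to return to the selection and fit step.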

NB: Model Selection, Fit and Validation may be an iterative process

Model Deployment

Model Deployment entails (in its most basic form) making the credit scorecard available to users. It is very common for developed scorecards to be re-programmed in other languages when deployed in production. Here we only cover some options for using Python as the deployment platform as well.

{|
! Procedure !! Pandas !! Scikit-learn !! Other !! Remarks
|-
| As a desktop application ||  ||  || PyQt, wxPython (wxWidgets), Kivy ||
|-
| Attaching to a web service ||  ||  || Flask, Bottle, Django ||
|}
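
A minimal sketch of the web service option using Flask is given below. The endpoint name `/score`, the `score_applicant` function and its point values are all hypothetical; a real deployment would load a fitted scorecard instead of hard-coded rules.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical scorecard: base points plus points per attribute band.
BASE_POINTS = 600


def score_applicant(age, income):
    """Return an illustrative score from two applicant attributes."""
    points = BASE_POINTS
    points += 20 if age >= 30 else 0
    points += 30 if income >= 40000 else 0
    return points


@app.route("/score", methods=["POST"])
def score():
    # Expects a JSON body such as {"age": 35, "income": 50000}.
    payload = request.get_json()
    points = score_applicant(payload["age"], payload["income"])
    return jsonify({"score": points})

# To serve locally one would call app.run(); omitted here so the
# sketch can be imported without starting a server.
```

The scoring logic is kept in a plain function so that it can be tested, or reused by a desktop front end, independently of the web framework.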


Contributors to this article

» Wiki admin