Introduction
A credit score is a key indicator of an individual's financial trustworthiness. Lenders and other financial institutions rely on it to gauge the risk of extending loans or other financial services. In this blog, we show how logistic regression, a widely used statistical tool, can be applied to build a credit scorecard.
For this purpose, we use a dataset sourced from Kaggle (dataset link: https://www.kaggle.com/datasets/laotse/credit-risk-dataset). Note that the credit scoring model we develop may deviate from conventional industry standards; this blog simply walks through the process of building a credit scorecard with logistic regression. You can access the full source code on GitHub:
https://github.com/Rian021102/credit-scoring-analysis
1. Load the Dataset, Split into Train and Test, and Perform EDA
After loading the dataset, the first step is to split it into training and testing sets. We then perform Exploratory Data Analysis (EDA), a critical step in which we investigate the dataset: checking for missing data, examining how the data is distributed, summarizing statistics, identifying outliers, and inspecting the correlation heatmap, among other valuable insights that EDA can provide.
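The snippet below is a minimal sketch of this step, assuming the raw CSV has been saved as credit_risk_dataset.csv and that loan_status is the good/bad response column; the file name, split ratio, and variable names are illustrative, not the project's actual code.

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw data (file name is an assumption)
data = pd.read_csv('credit_risk_dataset.csv')

# Separate predictors and response, then split into train and test sets
X = data.drop(columns='loan_status')
y = data['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)

# Quick EDA on the training set
data_train = pd.concat([X_train, y_train], axis=1)
print(data_train.describe())                # summary statistics
print(data_train.isna().sum())              # missing values per column
print(data_train.corr(numeric_only=True))   # correlation matrix (heatmap input)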
2. Characteristic Binning
Characteristic binning, at its core, involves the process of grouping diverse attributes or traits that describe individuals or entities. These attributes encompass elements like income, age, credit history, and other pertinent factors. While these attributes play a vital role in assessing an individual’s creditworthiness, they frequently appear in different formats, such as numerical data or categorical labels. By subjecting these characteristics to binning, we streamline the intricacy of the data, rendering it more accessible and understandable.
import pandas as pd

def create_binning(data, predictor_label, num_of_bins):
    """
    Function for binning a numerical predictor.

    Parameters
    ----------
    data : DataFrame
        The dataset.
    predictor_label : str
        The label of the predictor variable.
    num_of_bins : int
        The number of bins.

    Returns
    -------
    data : DataFrame
        The transformed dataset.
    """
    # Create a new column containing the binned predictor
    data[predictor_label + "_bin"] = pd.qcut(data[predictor_label],
                                             q = num_of_bins)
    return data

# Numerical predictors in the dataset
num_columns = ['person_age',
               'person_income',
               'person_emp_length',
               'loan_amnt',
               'loan_int_rate',
               'loan_percent_income',
               'cb_person_cred_hist_length']

# Bin each numerical predictor in the training set into 4 quantile-based bins
for column in num_columns:
    data_train_binned = create_binning(data = data_train,
                                       predictor_label = column,
                                       num_of_bins = 4)

print(data_train_binned.T)
print(data_train_binned.isna().sum())

# Define binned columns that still contain missing values
missing_columns = ['person_emp_length_bin',
                   'loan_int_rate_bin']

# Handle missing values in each of these columns
for column in missing_columns:
    # Add category 'Missing' to replace the missing values
    data_train_binned[column] = data_train_binned[column].cat.add_categories('Missing')
    # Replace missing values with category 'Missing'
    data_train_binned[column] = data_train_binned[column].fillna('Missing')

print(data_train_binned.T)
Characteristic binning also provides a way to handle missing data: as shown in the code above, missing values are assigned to a separate 'Missing' category. By dedicating a specific bin to missing data, we ensure that no valuable information is lost.
3. Weight of Evidence and Information Value (WoE and IV)
In the process of building a credit scorecard, two important concepts come into play: Weight of Evidence (WOE) and Information Value (IV). These statistical techniques are integral for evaluating the predictive power of individual characteristics in credit scoring models.
WoE measures the predictive power of each attribute (bin) within a characteristic, while IV measures the total strength of a characteristic.
WoE can be formulated as follows:
WoE = ln(% of good / % of bad)
where % of good and % of bad are the proportions of all good and bad borrowers, respectively, that fall into a given attribute. Information Value (IV) can then be formulated as follows:
IV = sum over all attributes of (% of good - % of bad) * WoE
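As a quick illustration with made-up numbers: if an attribute contains 20% of all good borrowers and 10% of all bad borrowers, its WoE is ln(0.20/0.10) ≈ 0.69, and its contribution to the characteristic's IV is (0.20 - 0.10) * 0.69 ≈ 0.07.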
We can use the following function to compute WoE and IV:
import numpy as np

# Categorical predictors in the dataset
cat_columns = ['person_home_ownership',
               'loan_intent',
               'loan_grade',
               'cb_person_default_on_file']

def create_woe_iv(crosstab_list):
    """
    Calculate WoE and IV from a list of crosstabs
    (one crosstab of attribute vs. good/bad response per characteristic).
    """
    # Define the initial list for WOE
    WOE_list = []
    # Define the initial list for IV
    IV_list = []

    # Perform the algorithm for all crosstabs
    for crosstab in crosstab_list:
        # Calculate % Good
        crosstab['p_good'] = crosstab[0]/crosstab[0]['All']
        # Calculate % Bad
        crosstab['p_bad'] = crosstab[1]/crosstab[1]['All']
        # Calculate the WOE
        crosstab['WOE'] = np.log(crosstab['p_good']/crosstab['p_bad'])
        # Calculate the contribution value for IV
        crosstab['contribution'] = (crosstab['p_good']-crosstab['p_bad'])*crosstab['WOE']
        # Calculate the IV (sum over all rows except the 'All' margin)
        IV = crosstab['contribution'][:-1].sum()

        add_IV = {'Characteristic': crosstab.index.name,
                  'Information Value': IV}

        WOE_list.append(crosstab)
        IV_list.append(add_IV)

    print(IV_list)
    print(WOE_list)
    return IV_list, WOE_list
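For context, here is a minimal sketch of how the crosstab_list consumed by this function might be built, assuming the response column is loan_status (0 = good, 1 = bad) and that the binned training data from the previous step is available; the exact construction in the project's source may differ.

crosstab_list = []

# Crosstabs for binned numerical predictors
for column in num_columns:
    crosstab = pd.crosstab(data_train_binned[column + '_bin'],
                           data_train_binned['loan_status'],
                           margins = True)
    crosstab_list.append(crosstab)

# Crosstabs for categorical predictors
for column in cat_columns:
    crosstab = pd.crosstab(data_train_binned[column],
                           data_train_binned['loan_status'],
                           margins = True)
    crosstab_list.append(crosstab)

IV_list, WOE_list = create_woe_iv(crosstab_list)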
The WoE list is then transformed into a WoE table that looks like this:
Since IV sums up the strength of a characteristic, we can interpret it using the rule of thumb from Naeem Siddiqi: an IV below 0.02 indicates an unpredictive characteristic, 0.02 to 0.1 weak, 0.1 to 0.3 medium, and 0.3 to 0.5 strong predictive power, while values above 0.5 are suspicious (too good to be true).
4. Forward Selection Using Logistic Regression
Forward selection is a type of stepwise regression: we start with no variables in the model, then add variables one by one, at each step adding the variable that improves the model the most, until no remaining variable improves the model significantly. We start with a function that performs a single forward-selection step:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

def forward(X, y, predictors, scoring='roc_auc', cv=5):
    # Define sample size and number of all predictors
    n_samples, n_predictors = X.shape

    # Define list of all predictors
    col_list = np.arange(n_predictors)

    # Define remaining predictors for each k
    remaining_predictors = [p for p in col_list if p not in predictors]

    # Initialize list of predictor combinations and their CV scores
    pred_list = []
    score_list = []

    # Cross-validate each possible combination of remaining predictors
    for p in remaining_predictors:
        combi = predictors + [p]

        # Extract predictors combination
        X_ = X[:, combi]
        y_ = y

        # Define the estimator
        model = LogisticRegression(penalty = None,
                                   class_weight = 'balanced')

        # Cross-validate the model with the chosen scoring metric
        cv_results = cross_validate(estimator = model,
                                    X = X_,
                                    y = y_,
                                    scoring = scoring,
                                    cv = cv)

        # Calculate the average CV score
        score_ = np.mean(cv_results['test_score'])

        # Append predictors combination and its CV score to the list
        pred_list.append(list(combi))
        score_list.append(score_)

    # Tabulate the results
    models = pd.DataFrame({"Predictors": pred_list,
                           "Recall": score_list})

    # Choose the best model
    best_model = models.loc[models['Recall'].argmax()]

    return models, best_model
Next, we perform forward selection for increasing numbers of predictors and keep track of the best model at each step, along with its predictors and recall score. This determines the optimal set of predictors for the model.
def list_predictors(X_train, y_train, forward_models):
    # Define list of predictors
    predictors = []
    n_predictors = X_train.shape[1]

    # Perform forward selection procedure for k = 1, ..., n_predictors
    for k in range(n_predictors):
        _, best_model = forward(X = X_train,
                                y = y_train,
                                predictors = predictors,
                                scoring = 'recall',
                                cv = 10)

        # Tabulate the best model of each k predictors
        forward_models.loc[k+1] = best_model
        predictors = best_model['Predictors']

    print(forward_models)
    return forward_models, predictors
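As a usage sketch (the variable names and the WoE transformation step are assumptions, not shown in the snippets above), X_train would be a NumPy array of WoE-transformed predictors and forward_models an empty table to be filled in:

# Empty table indexed by the number of predictors k
forward_models = pd.DataFrame(columns = ['Predictors', 'Recall'])

# X_train: NumPy array of WoE-transformed predictors, y_train: good/bad labels
forward_models, best_predictors = list_predictors(X_train,
                                                  y_train,
                                                  forward_models)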
5. Scaling and Creating Scorecard
Scorecard scaling refers to the range and format of scores in a scorecard, along with the rate at which the odds change as the score increases. In mathematical form it can be described as:
Score = Offset + Factor * ln(odds of good)
Score + pdo = Offset + Factor * ln(2 * odds of good)
where pdo refers to the points to double the odds. Solving these two equations, Factor and Offset can be described as follows:
Factor = pdo / ln(2)
Offset = Score - Factor * ln(odds of good)
In this case we use a base score of 1000, base odds of 35, a pdo of 80, and a rate of 2 (referring to this blog by Bruce Yang):
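A minimal sketch of the scaling computation with these values is shown below; the variable names are illustrative, and the per-attribute points additionally depend on the fitted model's coefficients and each attribute's WoE, which are omitted here.

import numpy as np

# Scaling parameters
pdo = 80           # points to double the odds
base_score = 1000
base_odds = 35     # odds of good at the base score

# Factor and Offset from the formulas above
factor = pdo / np.log(2)                          # ~115.4
offset = base_score - factor * np.log(base_odds)  # ~589.7

# A total score for a predicted log-odds of good would then be:
# score = offset + factor * log_odds_good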
And this is the scorecard after scaling:
The last step in this project is deployment. We have saved the scorecard as a pickle file named "scorecards.pkl" and serve the model with FastAPI, making it accessible on Heroku.
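As an illustration only, a minimal FastAPI service might look like the sketch below; the endpoint name, the request schema, and the assumed scorecard structure (a DataFrame with 'Characteristic', 'Attribute', and 'Points' columns) are hypothetical, and the actual app in the repository may differ.

import pickle
from typing import Dict
from fastapi import FastAPI
from pydantic import BaseModel

# Load the scaled scorecard (structure assumed, see note above)
with open('scorecards.pkl', 'rb') as file:
    scorecard = pickle.load(file)

app = FastAPI()

class Applicant(BaseModel):
    # Hypothetical request schema: each characteristic's binned attribute,
    # e.g. {"person_home_ownership": "RENT", "loan_grade": "B"}
    attributes: Dict[str, str]

@app.post('/score')
def score(applicant: Applicant):
    total = 0.0
    for characteristic, attribute in applicant.attributes.items():
        # Look up the points assigned to this attribute of the characteristic
        mask = ((scorecard['Characteristic'] == characteristic) &
                (scorecard['Attribute'] == attribute))
        total += float(scorecard.loc[mask, 'Points'].sum())
    return {'credit_score': total}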
That's it. The end result is not perfect and may require some revision in the future, but the objective of this blog is to capture the process of creating a credit scorecard. We hope you enjoyed this piece, and thank you for taking the time to read it.