Using machine learning to predict how likely a loan applicant is to default, so that creditors can lower their risk

Credit risk is the risk of financial loss resulting from a borrower's failure to repay a loan. Creditors need to minimize this risk to prevent cash flow interruptions and to avoid the additional cost of collecting a defaulted loan. This is why conducting credit risk analysis based on historical data is crucial before granting any loan. Given the huge amount of historical data involved, machine learning comes in handy to make the process easier and faster.

Since this is a binary classification problem (default or not), logistic regression is probably the most common go-to method. Logistic regression differs from linear regression in that it uses the sigmoid function to estimate a probability, which is bounded between 0 and 1. This makes it well suited to predictive modelling. If the probability of default for a particular loan applicant is high, they can automatically be classified as high risk. The creditor can then refuse to grant the loan, raise the interest rate, or demand assets as collateral.
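As a quick illustration of how the sigmoid maps any real-valued score to a probability between 0 and 1 (the weights below are made up for illustration, not fitted values):

```python
import math

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear score: intercept + w_rate * rate + w_amount * amount,
# with standardized inputs and illustrative weights
def default_probability(rate, amount, b=-1.0, w_rate=2.0, w_amount=0.3):
    return sigmoid(b + w_rate * rate + w_amount * amount)

print(sigmoid(0))                     # 0.5: the decision boundary
print(default_probability(1.5, 0.5))  # above-average rate -> high risk
```

Any score above 0 maps to a probability above 0.5, which is why the sign of the linear score alone decides the predicted class.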

The dataset is obtained from Kaggle and contains essential information and characteristics of loan applicants, such as their age, income, home ownership status, loan amount, loan interest rate, etc. The data is imbalanced, with significantly more non-default records than default records. Undersampling will be used to balance this out: randomly selecting only as many samples from the majority class as there are in the minority class.

```python
import pandas as pd

# Undersample: keep all defaults, then sample an equal number of non-defaults
data_def = data[data["Default"] == "Y"]
data_non_def = data[data["Default"] == "N"]
data_non_def = data_non_def.sample(n=len(data_def), random_state=123)
data = pd.concat([data_def, data_non_def])
```

For this analysis, we'll focus on how loan amount and interest rate affect the risk of default. The data will be split into train and test sets with an 80:20 ratio.

```python
from sklearn.model_selection import train_test_split

X = data[["Amount", "Rate"]]
y = data["Default"].apply(lambda x: 1 if x == "Y" else 0)

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=123)
```

Before running the logistic regression, standardization of the input variables is required, as the two have significantly different scales: the loan amounts are in the thousands, while the loan rates are percentages. Standardization is done by subtracting the mean from the data and dividing by its standard deviation.
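The same z-score transform can be written out by hand; a minimal check (on made-up numbers) that the manual formula matches scikit-learn's `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column standing in for loan amounts (illustrative values)
x = np.array([[1000.0], [5000.0], [12000.0], [20000.0]])

# Manual z-score: subtract the mean, divide by the standard deviation
manual = (x - x.mean()) / x.std()

scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))  # True
```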

```python
from sklearn.preprocessing import StandardScaler

def scaler_transform(data, scaler=StandardScaler()):
    scaler.fit(data)
    data_scaled = scaler.transform(data)
    data_scaled = pd.DataFrame(data_scaled)
    data_scaled.columns = data.columns
    data_scaled.index = data.index
    return data_scaled

# Note: the test set should be transformed with the scaler
# fitted on the training data, to avoid leaking test statistics
X_train_scaled = scaler_transform(data=X_train)
```

Next, the best hyperparameters for logistic regression can be determined using grid search:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

logreg = LogisticRegression(random_state=123)
parameters = {"solver": ["liblinear", "saga"],
              "C": np.logspace(0, 10, 10)}

logreg_cv = GridSearchCV(estimator=logreg,
                         param_grid=parameters,
                         cv=5)
logreg_cv.fit(X=X_train_scaled, y=y_train)
```

Use the best parameters to build the model, fit it to the training data, and check its performance:

```python
from sklearn.metrics import confusion_matrix, classification_report

logreg = LogisticRegression(C=logreg_cv.best_params_["C"],
                            solver=logreg_cv.best_params_["solver"],
                            random_state=123)
logreg.fit(X_train_scaled, y_train)

y_pred_train = logreg.predict(X_train_scaled)
print(confusion_matrix(y_true=y_train, y_pred=y_pred_train))
print(classification_report(y_true=y_train,
                            y_pred=y_pred_train,
                            target_names=["Not default", "Default"]))
```

Result:

It shows that our model has above 80% precision for both the default and not-default classes. This looks like a great result, but it still needs to be validated on the test data:
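Validation follows the same pattern as the training check above. A self-contained sketch on synthetic data stands in here; the real code would reuse `logreg` and `y_test` from the steps above, with the test features scaled by the scaler fitted on the training set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled loan features and default labels
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

logreg = LogisticRegression(random_state=123).fit(X_train, y_train)

# Evaluate on the held-out test set only
y_pred_test = logreg.predict(X_test)
print(confusion_matrix(y_true=y_test, y_pred=y_pred_test))
print(classification_report(y_true=y_test, y_pred=y_pred_test,
                            target_names=["Not default", "Default"]))
```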

The model works well on the test data too, performing even slightly better for the not-default class at 91% precision. Since the model's performance seems good enough, let's visualize it on our data:

The black dashed line shows the decision boundary generated by the logistic regression. If the standardized loan rate is above 0, i.e. above average, the loan is more likely to default.
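The boundary itself can be recovered from the fitted coefficients: logistic regression predicts default when `b + w1*x1 + w2*x2 > 0`, so the dashed line is `x2 = -(b + w1*x1) / w2`. A sketch on synthetic data (the real plot would use `logreg` and the scaled loan features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data standing in for amount and rate
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=123)
model = LogisticRegression(random_state=123).fit(X, y)

b = model.intercept_[0]
w1, w2 = model.coef_[0]

# Points on the line b + w1*x1 + w2*x2 = 0 sit exactly on the boundary,
# so the model assigns them a probability of ~0.5
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 5)
x2 = -(b + w1 * x1) / w2

boundary_points = np.column_stack([x1, x2])
probs = model.predict_proba(boundary_points)[:, 1]
print(probs)  # all ~0.5
```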

The code used is available on GitHub.

Using logistic regression, it was found that the interest rate has a more pronounced effect on default risk than the loan amount. As the interest rate rises above average, the loan will most probably default. In contrast, a higher loan amount at the same interest rate doesn't really increase the default risk. In fact, applicants with higher loan amounts can handle a slightly higher interest rate before risking default.

However, this doesn't necessarily mean that lowering the interest rate will lower the risk of default. Creditors often assign higher interest rates to individuals who already have a bad credit history, such as past late payments. Therefore, more information on the past credit behavior of these individuals is required for a better understanding.
