The fourth part of this series focuses on how I performed feature engineering and feature selection and chose the best model. The feature engineering was done to improve the baseline model scores, and the feature selection was done to eliminate the multicollinearity problems identified in the previous section.
If you have not yet read the introduction part of this project, I strongly suggest you follow this link and read it to give you a clearer picture of what I’m doing.
Note to the readers: I ran two experiments for this phase. One focused on eliminating the multicollinearity problems, while the other used both the original and the engineered features. Since the steps were the same for both, I will only describe the first one, which also produced the better results. If you want to see the full experiments, please follow the notebook by clicking here.
My approach to engineering the features for the first experiment was to eliminate the multicollinearity problems and simplify some columns. By doing so, I hoped to improve the models’ performance and eventually select the best model to make the predictions.
Below is the outline of what I did during this phase:
- Simplified the purpose column. Judging from the feature importances in the previous notebook, the values in this column didn’t add much to the models’ predictive power, possibly because it contained too many unique values. Therefore, I mapped the values into ‘debt_consolidation’, ‘business_loans’, ‘personal_loans’, and ‘other’.
- Created the debt_equity_ratio column. This feature was derived by dividing monthly debt by annual income. I didn’t have each applicant’s net worth/equity data, so annual income served as the equity proxy. Although the correlation between these two variables was not greater than 0.5, it came quite close (0.47).
- Created the credit_utilization_ratio column. This feature represented the percentage of a person’s available credit that they were currently using, calculated by dividing current_credit_balance by maximum_open_credit. A high credit utilization ratio indicates that a borrower is using most of their available credit, which signals financial stress. Creating this feature also resolved the high correlation between the two original variables.
- Simplified the months_since_last_delinquent and years_in_current_job columns into is_months_delinquent_missing, which indicated whether a borrower had ever been late repaying debt (based on whether months_since_last_delinquent was missing), and has_stable_job, which indicated whether a borrower had a stable job, judging by how many years they had been working at their current company.
Simplifying the purpose column
This was the mapping I used to simplify the purpose column:
purpose_mapping = {
    'debt_consolidation': 'debt_consolidation',
    'business_loan': 'business_loans',
    'small_business': 'business_loans',
    'other': 'other',
    'home_improvements': 'personal_loans',
    'buy_a_car': 'personal_loans',
    'medical_bills': 'personal_loans',
    'buy_house': 'personal_loans',
    'take_a_trip': 'personal_loans',
    'major_purchase': 'personal_loans',
    'moving': 'personal_loans',
    'wedding': 'personal_loans',
    'educational_expenses': 'personal_loans',
    'vacation': 'personal_loans',
    'renewable_energy': 'personal_loans',
}
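The dictionary alone doesn’t show how the mapping was applied. A minimal sketch, assuming the column is named purpose as in the one-hot encoding step later on:
#apply the mapping; .replace leaves any value not listed in the dict unchanged,
#whereas .map would turn it into NaN
df['purpose'] = df['purpose'].replace(purpose_mapping)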
The first five rows of the resulting dataframe were:
Creating new features
I used this code to create the new features and simplify the ‘months_since_last_delinquent’ and ‘years_in_current_job’ columns:
#create new features
df['debt_equity_ratio'] = df['monthly_debt'] / df['annual_income']
df['credit_utilization_ratio'] = df['current_credit_balance'] / df['maximum_open_credit']
df['is_months_delinquent_missing'] = df['months_since_last_delinquent'].isnull().astype(int)
df['has_stable_job'] = (df['years_in_current_job'] > 2).astype(int)

#drop columns that are no longer needed
df.drop(['bankruptcies', 'monthly_debt', 'annual_income', 'current_credit_balance',
         'years_in_current_job', 'maximum_open_credit', 'months_since_last_delinquent',
         'tax_liens'], axis=1, inplace=True)
Since there were infinity values in the ‘credit_utilization_ratio’ column (produced when maximum_open_credit was zero), I replaced them with NaN values and imputed them with a KNN imputer.
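In code, that cleanup might look roughly like the sketch below; the choice of n_neighbors and imputing every numeric column are my assumptions, not necessarily the exact settings from the notebook.
import numpy as np
from sklearn.impute import KNNImputer

#replace infinite ratios (from dividing by a zero maximum_open_credit) with NaN
df['credit_utilization_ratio'] = df['credit_utilization_ratio'].replace([np.inf, -np.inf], np.nan)

#impute the missing values with a KNN imputer
numeric_cols = df.select_dtypes(include='number').columns
imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])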
To check whether there were still multicollinearity problems, I created another correlation heatmap plot:
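A heatmap like that can be produced with seaborn; this is just a sketch, and the figure size and styling are mine:
import matplotlib.pyplot as plt
import seaborn as sns

#correlation heatmap of the remaining numerical features
plt.figure(figsize=(10, 8))
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()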
There were no correlation values greater than 0.5, so it was safe to assume that multicollinearity was no longer a problem. Next, I performed one-hot encoding on the ‘purpose’ and ‘home_ownership’ columns, followed by standardizing the numerical columns using StandardScaler.
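A sketch of that encoding and scaling step, assuming pandas.get_dummies alongside scikit-learn’s StandardScaler (the exact list of scaled columns below is illustrative, not the notebook’s):
import pandas as pd
from sklearn.preprocessing import StandardScaler

#one-hot encode the two categorical columns
df = pd.get_dummies(df, columns=['purpose', 'home_ownership'])

#standardize the continuous numerical columns (column list is an assumption)
num_cols = ['current_loan_amount', 'credit_score', 'debt_equity_ratio', 'credit_utilization_ratio']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])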
The resulting columns were:
The data was ready, so I moved on to the next step, which was the model training.
The steps for the model training were the same as those for the baseline models, so I will not go into detail here. However, I will explain whether the scores improved and how I chose the final model to be deployed.
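For context, the H2O training and evaluation loop looks roughly like the sketch below; the target column name, split ratio, and hyperparameter grid here are placeholders rather than the actual settings used in the notebook.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

#convert the prepared pandas dataframe to an H2OFrame ('loan_status' is a placeholder target name)
hf = h2o.H2OFrame(df)
hf['loan_status'] = hf['loan_status'].asfactor()
train, test = hf.split_frame(ratios=[0.8], seed=42)
predictors = [c for c in hf.columns if c != 'loan_status']

#small, illustrative hyperparameter grid for the GBM
grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(seed=42),
    hyper_params={'max_depth': [3, 5, 7], 'ntrees': [100, 200]},
)
grid.train(x=predictors, y='loan_status', training_frame=train)

#pick the best model by AUC and evaluate it on the held-out set
best_gbm = grid.get_grid(sort_by='auc', decreasing=True).models[0]
perf = best_gbm.model_performance(test_data=test)
print(perf.auc(), perf.F1())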
The resulting models had these metrics:
Below are the model comparison bar charts:
The Gradient Boosting Model
The key takeaways from the gradient boosting model summary were:
- The F1 score indicated that the model achieved a good balance between making correct positive predictions and capturing positive instances. The F1 score increased compared to the base model.
- Based on the precision score, 98.38% of the loans this model predicted as approved were actually supposed to be approved. This was a good score, and the precision improved slightly over the base model.
- Based on the recall score, this model recognized 72.72% of eligible loan applicants. This was also a slight improvement.
- The AUC score indicated that the model’s power to discriminate between positive and negative cases (loan given and loan refused) was reasonably good, although the score decreased compared to the base model.
The Deep Learning Model
The key takeaways from the deep learning model summary were:
- The F1 score indicated that the model achieved a reasonably good balance between making correct positive predictions and capturing positive instances. The F1 score improved considerably compared to the base model.
- Based on the precision score, 92.67% of the loans this model predicted as approved were actually supposed to be approved. This was still a good score, but the precision dropped significantly from 0.9984 to 0.9267.
- Based on the recall score, this model recognized 74.47% of eligible loan applicants, an increase from the base model.
- The AUC score indicated that the model’s power to discriminate between positive and negative cases (loan given and loan refused) was decent, though it left room for improvement; it also dropped from 0.678 to 0.6611.
XGBoost Model
The key takeaways from the XGBoost model summary were:
- The F1 score indicated that the model achieved a reasonably good balance between making correct positive predictions and capturing positive instances.
- Based on the precision score, 70.47% of the loans this model predicted as approved were actually supposed to be approved. This was a reasonably good score for the model.
- Based on the recall score, this model recognized 73.12% of eligible loan applicants.
- The AUC score indicated that the model’s power to discriminate between positive and negative cases (loan given and loan refused) was not very good. It was too close to 0.5, meaning the model was closer to a random guess than an accurate predictor.
- All of the metric scores decreased compared to the baseline XGBoost model.
The results from the base models and the models generated from these experiments were mixed.
The best AUC scores came from the base models, with the highest belonging to the base GBM model. However, the GBM model with no multicollinearity held the best precision score and outperformed the base GBM model on every metric apart from AUC. The deep learning model with no multicollinearity had the best F1 and recall scores, although its precision, while still quite high, dropped from the base deep learning model.
The best model to choose depends on the user’s requirements. If the user demands a high precision score, I would use the GBM model with no multicollinearity. If the user wants the highest recall or F1 score, I would use the deep learning model with no multicollinearity. If the main concern is the AUC score, I would use the baseline GBM model.
I chose the deep learning model with no multicollinearity (dl_grid_model_66) for this project since it generally performed quite well, although that meant sacrificing some precision and AUC. The model explainability was as follows:
The bar chart indicated that the top 3 most important features were credit_score, credit_utilization_ratio, and current_loan_amount; they affected the model’s predictions the most. It has to be noted, however, that deep learning models are considerably more challenging to interpret. As far as I know, H2O’s variable importance plot works better with tree-based models than with deep learning models. To interpret this model properly, I might need to rebuild it from scratch in TensorFlow or PyTorch and use SHAP values or Grad-CAM techniques.
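To illustrate that idea, a model-agnostic SHAP explanation of a hypothetical TensorFlow/Keras rebuild could look roughly like this; the architecture and data below are placeholders (and the network is untrained), so this only shows the mechanics, not the project’s actual setup.
import numpy as np
import shap
import tensorflow as tf

#placeholder data standing in for the processed loan features
X_sample = np.random.rand(200, 12).astype('float32')

#placeholder architecture; a real rebuild would mirror the H2O deep learning model and be trained
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(12,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

#model-agnostic SHAP values, using a small background set to keep the computation tractable
explainer = shap.KernelExplainer(lambda x: model.predict(x, verbose=0).flatten(), X_sample[:50])
shap_values = explainer.shap_values(X_sample[50:60])
shap.summary_plot(shap_values, X_sample[50:60])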
From these experiments, I got the best model to make predictions. Since this was just a project to hone my skills, I limited the experiments to two because training the models took a long time. Usually, I would try to push the scores as high as possible by getting new data, engineering more features, or mixing and matching the features. For example, I might run more experiments using the engineered features plus bankruptcies, the engineered features plus tax_liens, or the engineered features with both. I would also strive to get demographic features (such as age, sex, location, type of job, etc.) from the data engineers and create more features.
Usually, I select the best model based on the task requirements. Some users might want the highest precision so they can be confident that the loans given to applicants will be repaid, while others might want the highest recall so that the chance of losing business is the lowest. I chose the deep learning model with no multicollinearity for this project since it generally performed well.
This part explained how I performed feature engineering and feature selection and chose the best model for this loan classification project. The next part will focus on model deployment using FastAPI. If you are interested, please follow this link.