The objective of this exercise is to develop a machine learning model (XGBoost) to predict the probability that a customer will opt for balance consolidation. Balance consolidation is the act of combining several high-interest loans into a single low-interest loan.

The analysis proceeds in three stages:

· Feature selection (Section 2)

· Model selection (Section 3)

· Explainable output (Section 4)

There are 179,235 observations and 14 columns (**please refer to Appendix A**): 6 categorical variables, 7 numerical variables, and 1 dependent variable.

It was observed that MonthlyPayment and DebtToIncome have missing values. The missing values are imputed with the column mean.
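A minimal sketch of the mean imputation step, assuming the data sits in a pandas DataFrame; the values below are illustrative, not the real data.

```python
import pandas as pd

# Toy stand-in for the raw data with missing entries in the two affected columns
df = pd.DataFrame({
    "MonthlyPayment": [100.0, None, 300.0],
    "DebtToIncome": [0.2, 0.4, None],
})

# Replace each column's missing values with that column's mean
for col in ["MonthlyPayment", "DebtToIncome"]:
    df[col] = df[col].fillna(df[col].mean())
```
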

The model is developed on 552K observations and tested on 179,235 observations (**please refer to Appendix B**).

· **Full data (Test data):** There are 179,235 observations. **60%** of the observations did not opt for balance consolidation and **40%** opted for balance consolidation.

· **Unique observations (Train data):** There are 552K observations. **47%** of the observations did not opt for balance consolidation and **53%** opted for balance consolidation.

A correlation matrix is used to identify whether the variables are correlated with each other. It is observed that there is very high correlation between a few of the variables. Hence, MaritalStatus, EmploymentStatus, OccupationArea and UseOfLoan are dropped.

It is observed that the correlations between the remaining variables all lie between -0.7 and 0.7. Hence multicollinearity is not a concern.
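The correlation screen above can be sketched as follows; the column names and data are placeholders, and the 0.7 threshold is the one stated in the text.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a" and "b" are deliberately near-duplicates, "c" is independent
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = df.corr()
# Flag every pair whose absolute correlation exceeds 0.7; one of each pair is dropped
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and abs(corr.loc[i, j]) > 0.7]
```
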

A Random Forest model is developed to identify the key predictors. A Random Forest is an ensemble of decision trees: many trees are trained and their predictions are averaged.

The optimal hyper-parameters are determined using an iterative process (**please refer to Appendix C**). The best hyper-parameters are:

· **Max depth:** 5 (maximum depth of the individual regression estimators)

· **Min samples leaf:** 100 (minimum number of samples in the leaf node)

· **Number of estimators:** 1000 (number of trees in the forest)

Feature importance assigns a score to each input feature based on how useful it is at predicting the target (balance consolidation). It is observed that a few features have very low importance. Hence, Amount, MonthlyPayment, IncomeTotal and LiabilitiesTotal_bin are dropped.

The green bars show the feature importances of the forest. Five independent variables remain in the model.
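A sketch of the Random Forest fit with the tuned hyper-parameters above and the feature-importance read-out; `X` and `y` are synthetic stand-ins, so the importance values will not match the report's chart.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only features 0 and 1 actually drive the target
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(
    n_estimators=1000,      # number of trees in the forest
    max_depth=5,            # maximum depth of each tree
    min_samples_leaf=100,   # minimum samples per leaf node
    random_state=0,
)
rf.fit(X, y)

# Importances sum to 1; features with very low scores are candidates to drop
importances = rf.feature_importances_
```
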

Gradient boosting combines a set of weak learners to deliver improved prediction accuracy. At each boosting stage, correctly predicted outcomes are given lower weight than misclassified outcomes. The hyper-parameters of this ensemble model can be divided into 3 categories:

· **Tree-Specific Parameters:** These affect each individual tree in the model.

· **Boosting Parameters:** These affect the boosting operation in the model.

· **Miscellaneous Parameters:** Other parameters for overall functioning.

GridSearchCV exhaustively considers all parameter combinations and is used for tuning the hyper-parameters of an estimator. The GridSearchCV instance implements the usual estimator API: when it is "fitted" on a dataset, all possible combinations of parameter values are evaluated and the best combination is retained.
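A minimal GridSearchCV sketch for the gradient-boosting model; the grid here is a toy subset of the values actually searched, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic classification data standing in for the binned training set
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Small illustrative grid; the real search covered more values
param_grid = {"max_depth": [3, 5], "n_estimators": [50, 100]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",   # model selection on AUROC
    cv=3,
)
search.fit(X, y)         # evaluates every combination, keeps the best
best = search.best_params_
```
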

The optimal hyper-parameters are determined using an iterative process (**please refer to Appendix C**). The best hyper-parameters are:

· **Loss function:** deviance

· **Max depth:** 5 (maximum depth of the individual regression estimators)

· **Max features:** sqrt (number of features to consider when looking for the best split)

· **Min samples leaf:** 100 (minimum number of samples in the leaf node)

· **Number of estimators:** 200 (number of boosting stages to perform)

· It is observed that the optimal cut-off is **0.45**. The cut-off is chosen to maximise the F1-score.

· The model has an accuracy of **0.884** (Accuracy = (True Positives + True Negatives) / All)

· The model has an F1 score of **0.889** (F1 = 2 * (precision * recall) / (precision + recall))

· The model has an AUROC of **0.974** (measures how well the model distinguishes between good and bad)

· The model has a KS statistic of **0.612** (maximum separation between the cumulative good and cumulative bad distributions)
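The metrics above can be computed as sketched below on a toy model; the numbers produced here will not match the report's figures, but the formulas are the same (KS is taken as the maximum gap between the TPR and FPR curves).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, roc_curve

# Toy model standing in for the tuned gradient-boosting classifier
X, y = make_classification(n_samples=1000, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
proba = model.predict_proba(X)[:, 1]

pred = (proba >= 0.45).astype(int)      # apply the chosen cut-off
acc = accuracy_score(y, pred)           # (TP + TN) / All
f1 = f1_score(y, pred)                  # 2 * P * R / (P + R)
auroc = roc_auc_score(y, proba)

fpr, tpr, _ = roc_curve(y, proba)
ks = np.max(tpr - fpr)                  # max separation of cumulative distributions
```
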

For each of the binned variables, the count of observations (index) and the probability of event (Y) are shown below.

Local Interpretable Model-agnostic Explanations (LIME) works with most models. Surrogate models are trained to approximate the predictions of the underlying black-box model. LIME is best applied to classification problems, and it focuses on explaining individual predictions (**please refer to Appendix D**).
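The LIME idea can be sketched by hand (this is not the `lime` package itself): perturb one instance, weight the perturbed points by proximity, and fit a local linear surrogate to the black-box model's predicted probabilities.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge

# Black-box model on synthetic data (feature 0 drives the outcome)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

x0 = X[0]                                         # instance to explain
Z = x0 + rng.normal(scale=1.0, size=(200, 3))     # perturbations around x0
pz = black_box.predict_proba(Z)[:, 1]             # black-box predictions
w = np.exp(-np.sum((Z - x0) ** 2, axis=1))        # proximity kernel weights

# The surrogate's coefficients are the local, per-feature explanation
surrogate = Ridge().fit(Z, pz, sample_weight=w)
local_weights = surrogate.coef_
```
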

SHapley Additive exPlanations (SHAP) were introduced by Lundberg and Lee. SHAP explains each instance by computing the marginal contribution of each feature to the prediction, using coalitional game theory.

· **Global Interpretability:** The SHAP values can show how much each predictor contributes, either positively or negatively, to the target variable

· **Local Interpretability:** Each observation gets its own set of SHAP values, which helps explain why a case receives its prediction and what each predictor contributed (**please refer to Appendix E**).
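The coalitional-game computation behind SHAP can be illustrated by brute force on a tiny toy model (a linear function, not the report's XGBoost model): each feature's Shapley value is its weighted average marginal contribution over all coalitions of the other features.

```python
from itertools import combinations
from math import factorial

def f(active, x, baseline):
    # Toy model value when only the 'active' features take x's values,
    # and all other features are held at their baseline values
    vals = [x[i] if i in active else baseline[i] for i in range(len(x))]
    return 2 * vals[0] + 3 * vals[1] + vals[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
n = len(x)

shap_values = []
for i in range(n):
    others = [j for j in range(n) if j != i]
    phi = 0.0
    for k in range(n):
        for S in combinations(others, k):
            # Shapley weight |S|! (n - |S| - 1)! / n!
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (f(set(S) | {i}, x, baseline) - f(set(S), x, baseline))
    shap_values.append(phi)
# For a linear model the Shapley values recover the coefficients
# (2, 3, 1) up to floating-point error
```
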

**Categorical Variables:**

· **UseOfLoan** — if use of loan is -1 then -1, else 1

· **Education** — if education is -1 then -1, if education is 2 then 1, else 0

· **MaritalStatus** — if marital status is -1 then -1, else 1

· **EmploymentStatus** — if employment status is -1 then -1, else 1

· **OccupationArea** — if occupation area is -1 then -1, else 1

· **NewCreditCustomer** — if new credit customer is True then 1, else -1
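The recoding rules above can be sketched in pandas; three representative columns are shown, with toy values that assume the raw data carries the same codes as in the report.

```python
import numpy as np
import pandas as pd

# Toy rows exercising each branch of the recoding rules
df = pd.DataFrame({
    "UseOfLoan": [-1, 3, 7],
    "Education": [-1, 2, 4],
    "NewCreditCustomer": [True, False, True],
})

# -1 stays -1, everything else maps to 1
df["UseOfLoan"] = np.where(df["UseOfLoan"] == -1, -1, 1)
# three-way rule: -1 -> -1, 2 -> 1, otherwise 0
df["Education"] = np.select(
    [df["Education"] == -1, df["Education"] == 2], [-1, 1], default=0
)
# True -> 1, False -> -1
df["NewCreditCustomer"] = np.where(df["NewCreditCustomer"], 1, -1)
```
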

**Numerical Variables:** Decision trees are used to bin the numerical variables (max depth of the tree is 3 and min samples leaf is 42)

· **Amount (in thousands)** — if Amount <= 530.35 then bin is 0, if Amount <= 532 then bin is -1, if Amount <= 2100 then bin is 0, if Amount <= 2125 then bin is 1, else bin is 0

· **Interest (in percentage)** — if Interest <= 48.22 then bin is -1, if Interest <= 79.87 then bin is 0, else bin is 1

· **LoanDuration** — if LoanDuration <= 30 then bin is -1, else bin is 1

· **MonthlyPayment** — if MonthlyPayment <= 0 then bin is 1, if MonthlyPayment <= 16.17 then bin is -1, else bin is 0

· **IncomeTotal** — if IncomeTotal <= 304 then bin is -1, else bin is 1

· **LiabilitiesTotal** — if LiabilitiesTotal <= 349.99 then bin is -1, if LiabilitiesTotal <= 350.11 then bin is 1, else bin is -1

· **DebtToIncome** — if DebtToIncome < 0.61 then bin is -1, if DebtToIncome <= 62.35 then bin is 0, else bin is 1
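The tree-based binning that produced the thresholds above can be sketched as follows: fit a shallow decision tree on one numeric column and read its split thresholds off as bin edges. The data here is synthetic, so the threshold will not match any value in the list.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic single-feature data standing in for e.g. Interest, with a
# target that flips at 50
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(1000, 1))
y = (X[:, 0] > 50).astype(int)

# Same settings as the report: max depth 3, min samples leaf 42
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=42).fit(X, y)

# Leaves are marked with feature index -2; internal nodes hold the split
# thresholds, which become the bin edges
thresholds = sorted(
    t for f, t in zip(tree.tree_.feature, tree.tree_.threshold) if f >= 0
)
```
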

There are 179,235 observations in the raw data. The categorical and numerical variables are converted into bins (coarse classing).

· **Train data **— 552K observations with binned variables. These are the unique observations in the dataset.

· **Test data** — 179,235 observations with binned variables

Hyperparameters control the overfitting and underfitting of the model. Each proposed hyperparameter setting is evaluated, and the setting that gives the best model is selected. To find the best hyperparameters the following steps are followed:

· Starting point for hyperparameters tuning — [a, b, c]

· If a is selected then in next step — [a/2, a, (a+b)/2]

· If b is selected then in next step — [(a+b)/2, b, (b+c)/2]

· If c is selected then in next step — [(b+c)/2, c, 2*c]

· The above process is continued until there is no further improvement in the model performance (e.g. AUROC)
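The refinement rule above can be sketched as a small helper: re-centre the three-point candidate grid around whichever value scored best, then repeat until the score stops improving.

```python
def refine(grid, best):
    """Return the next three-point candidate grid, re-centred on `best`."""
    a, b, c = grid
    if best == a:
        return [a / 2, a, (a + b) / 2]
    if best == b:
        return [(a + b) / 2, b, (b + c) / 2]
    return [(b + c) / 2, c, 2 * c]

# Example: starting grid for n_estimators, with the middle value winning
grid = refine([100, 200, 400], best=200)   # -> [150.0, 200, 300.0]
```
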

**Random Forest**

**Gradient Boosting**

· **Inputs:** The average value of the inputs (bins) is considered. Since all the inputs (bins) are between -1 and 1, the average value is close to 0.

· **Outputs (0 — blue & 1 — yellow):** LIME provides the predicted probability for each instance

**Inputs:** The average value of the inputs (bins) is considered. Since all the inputs (bins) are between -1 and 1, the average value is close to 0.

**Outputs:** SHAP provides the predicted probability for each instance

· Feature 0 — DebtToIncome_bin

· Feature 1 — Education

· Feature 2 — Interest_bin

· Feature 3 — NewCreditCustomer

· Feature 4 — LoanDuration_bin