Marketing campaigns are a crucial part of any business strategy. In the banking industry, marketing campaigns are used to promote various products, including savings accounts. However, not all customers respond to these campaigns, and it can be challenging to determine who is more likely to open a savings account. This is where machine learning comes in. By using a machine learning algorithm, we can predict which customers are more likely to open a savings account, allowing us to optimize our marketing campaigns.
We have a dataset that provides information on a local bank’s marketing campaign. The dataset contains information on 45211 customers, including their age, job, marital status, education, and more. The target variable is whether the customer opened a savings account or not. The data is divided into a training set and a test set.
We start by reading the input files.
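A minimal sketch of this step, assuming the data ships as two CSV files (the file names below are placeholders, not the actual paths):

```python
import pandas as pd

# Placeholder file names -- substitute the actual paths to the campaign data.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape, test.shape)
train.head()
```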
Types of variables (a quick dtype check is sketched after this list):
- 10 categorical variables
- 3 discrete numerical variables
- 6 continuous numerical variables
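The breakdown above can be reproduced with a quick dtype check. This is a sketch: it assumes the outcome column is named 'y' and that the categorical columns are read in as the pandas object dtype.

```python
# The outcome column name ('y') is an assumption; exclude it from the feature breakdown.
features = train.drop(columns=["y"])

# Categorical columns come in as object dtype, numerical ones as int64/float64.
print(features.dtypes.value_counts())

categorical_cols = features.select_dtypes(include="object").columns.tolist()
numerical_cols = features.select_dtypes(include="number").columns.tolist()
print(len(categorical_cols), "categorical |", len(numerical_cols), "numerical")
```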
Next, we check for duplicate records in the train set.
The train dataset contains 905 duplicate rows.
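A one-liner along these lines is enough for the duplicate check (a sketch, continuing from the reading step above):

```python
# Count fully duplicated rows in the training data (reported above as 905).
print("Duplicate rows in train:", train.duplicated().sum())
```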
Then, we check for any missing values in the train and test dataset.
Some categorical variables contain an 'unknown' class; rather than being imputed during preprocessing, these values are handled within our models.
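A sketch of this check, assuming the implicit missing category is literally the string "unknown":

```python
# Explicit missing values in both data sets.
print(train.isnull().sum().sum(), test.isnull().sum().sum())

# 'unknown' acts as an implicit missing category in some categorical columns.
unknown_counts = (train[categorical_cols] == "unknown").sum()
print(unknown_counts[unknown_counts > 0])
```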
We then check the train set for imbalance. When working with datasets, it’s common to encounter imbalanced data, where one class is heavily overrepresented compared to the others. This can be a significant problem for machine learning algorithms, which tend to be biased towards the majority class, leading to poor performance on the minority class. Later in this post, we address the imbalance by assigning class weights when fitting the model.
The outcome variable (indicating customer opened a savings account or not) is highly imbalanced.
~11% for class 1 (yes) and ~89% for class 0 (no)
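One way to obtain these proportions, assuming the outcome column is named 'y' with raw labels 'yes'/'no':

```python
# Class distribution of the outcome variable.
print(train["y"].value_counts(normalize=True))
# roughly: no -> 0.89, yes -> 0.11
```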
As part of the EDA, we also check for any outliers in the numerical variables.
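A quick visual check with box plots, continuing with the numerical_cols list defined earlier (a sketch, not the exact plotting code used in the analysis):

```python
import matplotlib.pyplot as plt

# Box plots give a quick visual check for outliers in each numerical variable.
train[numerical_cols].plot(kind="box", subplots=True, layout=(3, 3),
                           figsize=(12, 8), sharex=False, sharey=False)
plt.tight_layout()
plt.show()
```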
Next, we use a heat map to plot the correlation matrix between the numerical variables. This is done to check for and deal with multicollinearity among them. Multicollinearity is a common issue that can arise when working with numerical variables in a dataset. It can lead to unstable coefficient estimates, inflated standard errors, and unreliable p-values, making it difficult to interpret the results of your model. However, there are several techniques for handling multicollinearity, including removing correlated variables, combining them into a single variable, or using regularization.
In conclusion, it’s important to understand multicollinearity and its impact on your machine learning model. By identifying and handling multicollinearity in your dataset, you can improve the performance and accuracy of your model, leading to better predictions and insights.
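A typical way to produce such a heat map with seaborn (a sketch under the same assumptions as above):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between the numerical variables.
corr = train[numerical_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of numerical variables")
plt.show()
```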
We drop the duplicate records from the train dataset and combine the train and test data sets so that preprocessing is applied consistently to both.
We then separate the numerical and categorical variables and the outcome column.
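A sketch of these two steps, again assuming the outcome column is named 'y' with 'yes'/'no' labels:

```python
# Drop duplicate rows from the training data.
train = train.drop_duplicates().reset_index(drop=True)

# Combine train and test so preprocessing is applied consistently to both.
combined = pd.concat([train, test], ignore_index=True)

# Separate the outcome column (assumed 'yes'/'no') and the two variable groups.
outcome = combined.pop("y").map({"no": 0, "yes": 1})
categorical_cols = combined.select_dtypes(include="object").columns.tolist()
numerical_cols = combined.select_dtypes(include="number").columns.tolist()
```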
Next, we check the variation of numerical variables and eliminate any variables with a standard deviation lower than 1.
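A minimal version of this filter (a sketch, continuing from the preprocessing snippet above):

```python
# Standard deviation of each numerical variable; drop the low-variation ones.
stds = combined[numerical_cols].std().sort_values(ascending=False)
print(stds)

low_variation = stds[stds < 1].index.tolist()
combined = combined.drop(columns=low_variation)
numerical_cols = [c for c in numerical_cols if c not in low_variation]
```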
The variable ‘lcdays’ has the highest variation among all numerical variables. ‘lcdays’ represents the number of days that have passed since the client was last contacted by a previous campaign (numeric; 999 means the client was not previously contacted). The column contains values from 0–27 plus the sentinel value 999, which inflates its standard deviation.
To handle this, we create a dummy variable that takes the value ‘1’ where ‘lcdays’ ranges from 0–27 and ‘0’ where it is 999.
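A sketch of this step; the new column name 'lcdays_contacted' is our own choice for illustration, not a name from the original analysis:

```python
import numpy as np

# 1 if the client was contacted in a previous campaign (0-27 days ago), 0 if never (999).
# The column name 'lcdays_contacted' is a placeholder for this sketch.
combined["lcdays_contacted"] = np.where(combined["lcdays"] == 999, 0, 1)
```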
We used the CatBoostClassifier algorithm to predict savings account openings. CatBoost is a machine learning algorithm that is known for its ability to handle categorical variables and produce accurate results. The algorithm uses gradient boosting and can handle missing values and categorical features without the need for one-hot encoding.
CatBoostClassifier is a powerful model for this problem: it copes with class imbalance, unknown values, and categorical variables out of the box. Because it is a gradient-boosted tree model with native handling of categorical features, it needs neither one-hot encoding nor feature scaling. It also resists overfitting through L2 (Ridge) regularization and early stopping, and its ordered boosting scheme reduces the target leakage and prediction shift that affect naive gradient boosting implementations.
As we’re using CatBoostClassifier(), we have not performed any standardization on the numerical variables or one-hot encoding on the categorical variables.
Next, we split the combined data back into the train and test sets.
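Since train and test were concatenated in order during preprocessing, they can be split back apart by row count (a sketch):

```python
# Train and test were concatenated in order, so split them back by row count.
n_train = len(train)                     # rows remaining after dropping duplicates
X_train, y_train = combined.iloc[:n_train], outcome.iloc[:n_train]
X_test, y_test = combined.iloc[n_train:], outcome.iloc[n_train:]
```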
To counter the imbalance in the data set, we assign weights to the two classes; the class weights can be passed as a hyper-parameter when fitting the CatBoostClassifier model.
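One way to derive the weights from the class frequencies (a sketch; the exact weights used in the analysis are not specified):

```python
# Weight the minority class more heavily, roughly inversely to its frequency.
counts = y_train.value_counts()
class_weights = [1.0, counts[0] / counts[1]]   # [weight for class 0, weight for class 1]
print(class_weights)
```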
Next, we use hyper-parameter tuning to select the best model to fit.
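A sketch of the tuning step using CatBoost's built-in grid search; the parameter grid below is illustrative, not the grid actually searched in the original analysis:

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(
    cat_features=categorical_cols,   # no one-hot encoding needed
    class_weights=class_weights,     # counter the class imbalance
    eval_metric="AUC",
    verbose=False,
)

# Illustrative grid -- not the exact grid searched in the original analysis.
param_grid = {
    "depth": [4, 6, 8],
    "learning_rate": [0.03, 0.1],
    "l2_leaf_reg": [1, 3, 5],
    "iterations": [500, 1000],
}

# CatBoost's built-in grid search refits the model with the best parameters found.
model.grid_search(param_grid, X=X_train, y=y_train, cv=3, verbose=False)
```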
We trained the CatBoostClassifier on the training set and achieved an AUC of 79%. We then tested the model on the test set and achieved an AUC of 82%, indicating that the model is not overfitting.
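The evaluation can be reproduced along these lines (a sketch using scikit-learn's roc_auc_score):

```python
from sklearn.metrics import roc_auc_score

# Similar AUC on train and test suggests the model is not overfitting.
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Train AUC: {train_auc:.2f} | Test AUC: {test_auc:.2f}")
```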
Next, we plot the feature importance graph.
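A typical way to plot CatBoost's feature importances (a sketch):

```python
import matplotlib.pyplot as plt

# Rank features by the model's importance scores and plot them.
importances = pd.Series(model.get_feature_importance(), index=X_train.columns)
importances.sort_values().plot(kind="barh", figsize=(8, 10))
plt.title("CatBoost feature importance")
plt.tight_layout()
plt.show()
```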
The most important predictors of opening a savings account identified by our CatBoost model are socio-economic variables such as the number of employees (employees), the euribor 3-month rate (euri3), and the consumer confidence index (cconf). Employees and euri3 are exogenous variables outside the bank’s control, while cconf depends on historic performance and customer confidence, so these rankings are plausible. Among the variables the bank can act on, future campaigns should focus on the last contact month of the year (month), the mode of contact (ctype), and the number of times the customer is contacted during the current marketing campaign (ccontact), which is also in line with intuition.
Using the CatBoostClassifier algorithm, we were able to accurately predict which customers are more likely to open a savings account. This information can be used to optimize marketing campaigns by targeting customers who are more likely to open one, saving the bank time and resources that would otherwise be spent marketing to customers who are unlikely to respond.
In conclusion, machine learning algorithms like CatBoostClassifier can be used to optimize marketing campaigns in the banking industry. By predicting savings account openings, banks can save time and resources and focus their efforts on customers who are more likely to open an account.