Given the recent interest rate hikes by the central bank, many people are facing a tougher financial future.
This shift in monetary policy is testing the adaptability of individuals and businesses alike. But despite the difficulties that rate hikes introduce, the adjustment is also an opportunity to reevaluate our financial strategies.
Although I could easily spend hours talking about the ins and outs of personal finances, my main focus today is different. I’m going to show you how regular people and businesses can use machine learning to help predict the future. This way, you can better understand and prepare for the changes that are coming your way.
Before I get to my model, let's review what machine learning is all about. Machine learning is a subset of artificial intelligence (AI) focused on developing algorithms that let computers learn to perform tasks from data rather than from explicit instructions.
Basically, we are teaching a computer to recognize patterns in specific datasets and asking it to make decisions based on those patterns. As the computer encounters new situations (new data, new feedback, etc.), it learns to improve the decisions it makes. That's machine learning!
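To make that concrete, here is a toy sketch, completely separate from the model I build later: a handful of made-up income and debt numbers, a simple scikit-learn classifier, and a prediction on a case it has never seen. Every value here is invented purely for illustration.
# Toy illustration: learn a simple pattern from made-up examples.
from sklearn.tree import DecisionTreeClassifier
# Each row: [monthly income, monthly debt payments]; label 1 = struggled with debt, 0 = did fine
X_toy = [[4000, 500], [3000, 2500], [5000, 800], [2500, 2300], [6000, 1000], [2800, 2600]]
y_toy = [0, 1, 0, 1, 0, 1]
clf = DecisionTreeClassifier().fit(X_toy, y_toy)
print(clf.predict([[3200, 2400]]))  # the learned pattern flags this case as likely to struggle: [1]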
Why is machine learning useful? Think of the world as one gigantic dataset. It can often be unpredictable and needs a lot of cleaning up. Our day-to-day lives contain a lot of "noise": coincidences, random series of events, and misunderstandings that happen to us all with no rhyme or reason. To use machine learning to understand our world, we ask it to follow only the consistent patterns in our data while minimizing the "noise".
But even when things get unpredictable and chaotic, like the COVID-19 pandemic shattering what we considered normal in 2020, patterns never fail to emerge. As the pandemic progressed, we saw changes in the economy: rising real estate prices, the emergence of new types of money (cryptocurrencies), and more people borrowing.
Borrowing is a powerful tool because you can get the money you need now and pay it back later. But with great power comes great responsibility: if you borrow recklessly, you can easily put yourself in a dangerous financial situation. For example, the average Canadian household owes $1.83 for every $1 of disposable income, meaning that for every dollar left over after taxes, there is $1.83 of debt waiting to be repaid. That's even worse than living paycheck to paycheck.
When an individual consistently fails to repay their loans within the agreed time frame, lending companies may classify their loan as a write-off or charge-off. In such cases, these companies typically transfer the loan file to collections agencies to recover at least a portion of the outstanding balance. As a result, this negatively impacts the individual’s credit score and history for a considerable period, typically at least 7 years, depending on the circumstances. Subsequently, individuals with bad credit scores will encounter greater difficulty obtaining favourable interest rates (or any loans at all, for that matter), even if they manage to improve their financial situation. Those with bad credit may also experience relentless harassment from collection agencies, difficulty renting or buying real estate, and many other struggles.
Now that we are all up to speed, the goal of my project is to observe people’s borrowing habits and determine if a machine learning algorithm can accurately predict one’s financial status by analyzing their loan data prior to the application process. The aim is to prevent individuals from facing financial burdens by identifying potential risks in advance.
Diving Into Neural Networks
I used an Artificial Neural Network (ANN) to try to predict whether someone would have trouble paying off their loan. ANNs are made up of interconnected nodes: each node takes in information, processes it, and passes it on to the nodes it is connected to. The network then adjusts the strength of these connections based on the given data.
I found ANNs appealing because they are fantastic at handling large datasets, they can capture nonlinear relationships, and, most importantly, they can learn from historical data to predict future outcomes.
To put this into practice, I used the "All Lending Club loan data" dataset published by Nathan George on Kaggle. After filtering out the information I needed, I created my final dataset file.
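I won't reproduce the full cleaning code here, but a rough sketch of the kind of filtering involved looks something like this (the raw file name depends on which version of the Kaggle dataset you download, and the columns and renaming steps are simplified for illustration):
import pandas as pd
# Rough sketch of the cleaning step: keep only loans with a final outcome,
# select the columns of interest, one-hot encode the categoricals, and save.
raw = pd.read_csv('accepted_2007_to_2018Q4.csv', low_memory=False)
finished = raw[raw['loan_status'].isin(['Fully Paid', 'Charged Off'])]
cols = ['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'term',
        'home_ownership', 'verification_status', 'purpose', 'loan_status']
df = finished[cols].dropna()
# Turn the text categories into the numeric dummy columns the network expects
df = pd.get_dummies(df, columns=['term', 'home_ownership', 'verification_status', 'purpose'])
df.to_csv('my_final_dataset.csv')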
The Dataset
import pandas as pd

df_final_data = pd.read_csv('my_final_dataset.csv')
del df_final_data['Unnamed: 0']  # drop the leftover index column from the CSV export
pd.set_option("display.max_columns", None) # I wanted to see everything.
df_final_data.head()

X = df_final_data.drop('loan_status', axis=1)  # features
y = df_final_data['loan_status']               # target: was the loan charged off?
I decided to focus on "loan_status" for my prediction model. It's a handy way to tell, from the information in the dataset, whether someone previously had trouble with their loans, so we'll use it as our target variable. Framing this as binary classification, a "0" indicates a client who paid off their loan, while a "1" indicates a client who has been charged off by the Lending Club.
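For reference, that encoding is a one-liner; here is a sketch of how it could be done, assuming the column still holds the original text labels at this point:
# Map the two outcomes to a binary target: 0 = Fully Paid, 1 = Charged Off
df_final_data['loan_status'] = df_final_data['loan_status'].map({'Fully Paid': 0, 'Charged Off': 1})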
The Purposes Of Each Import
Let me break down each line of code, with a brief explanation of the reasoning behind it.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train_test_split: to divide the data into training and testing sets, so the model learns from the training data and is then evaluated on data it has never seen before. Here the test size is 0.2, meaning 20% of the total data is set aside for testing.
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import regularizers
Sequential: to set up a neural network as a stack of layers. Each layer can be configured with various parameters such as the ones below.
Dense & Dropout: to create fully connected layers and introduce a method to prevent overfitting by ignoring some neurons during training.
Overfitting means a model is trained so well it memorizes the dataset instead of learning an overall pattern. We want to avoid this at all costs!
Regularizers: to help further prevent overfitting by adding penalties to complex models.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
StandardScaler: to standardize the features in the dataset. It calculates the mean and standard deviation of each feature and transforms the data so every feature has a mean of zero and a standard deviation of one.
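The line above only creates the scaler; here is a sketch of how it would typically be applied, fitting on the training set only and reusing the same statistics on the test set (and, later, on any new client the model scores):
# Fit the scaler on the training data only, then apply the same
# transformation to the test data so no information leaks between the two.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)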
Coming Together
model = Sequential()
model.add(Dense(64, input_shape=(37,), activation='relu', kernel_regularizer=regularizers.l1(0.001)))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu', kernel_regularizer=regularizers.l1(0.001)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))
relu: to introduce non-linearity so the network can capture complex patterns in the dataset.
kernel_regularizer: to prevent overfitting by adding a penalty term.
sigmoid: to use in binary classification problems.
adam: to train neural networks by adjusting the learning rate dynamically during training.
binary_crossentropy: to minimize the difference between predicted and actual outcomes.
accuracy: to evaluate the model’s performance using accuracy values.
The Final Verdict: Unveiling the Outcome
How can we determine the performance of a model? Data scientists employ various techniques to assess its effectiveness and accuracy. Here are some of the approaches commonly used:
import matplotlib.pyplot as plt

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Loss Graph Visualization')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train','valid'], loc='upper left')
plt.show()
Looking at the graph above, we see not only a decreasing trend but also convergence. Both are great signs: a decreasing loss over the epochs indicates that the model is improving, while the train and valid curves converging and stabilizing together at a low, fairly consistent level suggests the model isn't just memorizing the training data.
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Accuracy Graph Visualization')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train','valid'], loc='upper left')
plt.show()
Another great way to see how well the model's predictions are performing is by visualizing an accuracy graph. We want two key characteristics: high accuracy and consistency. A higher accuracy value indicates better model performance, and consistency between the train and valid curves implies the model is generalizing well to data it hasn't seen.
from sklearn.metrics import classification_report
predictions = (model.predict(X_test) > 0.5).astype("int32")  # threshold the sigmoid outputs at 0.5
print(classification_report(y_test,predictions))
The purpose of the classification report is to provide an evaluation of the model’s performance by showing values such as precision, recall, F1 score, and support for each class. In this example, we have an accuracy of 96%!
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
The purpose of the confusion matrix is to visually summarize the performance of the model by showcasing the number of true positive, true negative, false positive, and false negative prediction values, to observe the model’s accuracy and misclassifications.
True Positive (TP) = 76,763
False Positive (FP) = 0
True Negative (TN) = 3,748
False Negative (FN) = 17,432
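If you prefer a picture over raw numbers, scikit-learn can also draw the same matrix as a labelled heatmap; here is a minimal sketch (the display labels are just my own naming of the two classes):
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# Render the confusion matrix as a heatmap with readable class labels
ConfusionMatrixDisplay.from_predictions(y_test, predictions,
                                        display_labels=['Fully Paid', 'Charged Off'])
plt.show()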
The Deployment
import pickle
import numpy as np

# Save the trained model to disk
Pkl_Filename = "model.pkl"
with open(Pkl_Filename, 'wb') as file:
    pickle.dump(model, file)

# Load the model back and build a sample client profile
model = pickle.load(open('/Users/amirsaccount/Desktop/ConcordiaBootcamp/model.pkl', 'rb'))

input_dict = {"loan_amnt":25000.0,"emp_length": 6.0,"int_rate": 9.0,"installment": 450,
"annual_inc": 45000.0,"delinq_2yrs":0.0,"pub_rec": 0.0, "total_pymnt": 13000,
"collections_12_mths_ex_med":0.0,"acc_now_delinq":0.0,
"tot_cur_bal": 250000.0,"mort_acc": 1.0,"pub_rec_bankruptcies": 0.0,"fico_range": 650.0,
"term__36_months": 0,"term__60_months": 1,
"home_ownership_ANY": 0,"home_ownership_MORTGAGE": 1,"home_ownership_NONE":0,
"home_ownership_OWN": 0,"home_ownership_RENT": 0,
"verification_status_Not_Verified": 0,
"verification_status_Source_Verified": 0,"verification_status_Verified": 1,
"purpose_car": 0,"purpose_credit_card": 0,"purpose_debt_consolidation": 1,
"purpose_home_improvement":0,"purpose_house": 0,"purpose_major_purchase": 0,
"purpose_medical": 0,"purpose_moving": 0,"purpose_other":0,
"purpose_renewable_energy":0,"purpose_small_business":0,"purpose_vacation":0,
"purpose_wedding":0}
input_features = np.array(list(input_dict.values())).reshape(1, -1)
prediction = model.predict(input_features)
if prediction[0][0] > 0.5:  # the sigmoid output is a probability of being charged off
    print("High likelihood to get charged off by LoanViewer")
else:
    print("Low likelihood to get charged off by LoanViewer")
So, this client wishes to take out a $25,000 debt consolidation loan at an interest rate of 9.0%, with an annual income of $45,000 and a mortgage debt of $250,000.
Unsurprisingly, this client has a "High likelihood to get charged off by LoanViewer", which is exactly what the model predicted!
The Limitations
Unfortunately, the model didn’t actually perform as perfectly as I had hoped. While it achieved a 96% accuracy and the graphs showed promising results, I suspect the model may have still overfitted the dataset despite my efforts to prevent it. It struggled with new information but performed well on data similar to its training set. I’m determined to find a solution and unlock the great potential of this idea, making it accessible and beneficial for everyone.
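One direction I want to explore is early stopping, which ends training when the validation loss stops improving instead of running a fixed number of epochs. Here is a sketch of how it could be wired into the existing fit call (the patience value is just a guess, not something I have tuned):
from keras.callbacks import EarlyStopping
# Stop training once validation loss hasn't improved for a few epochs,
# and roll back to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    validation_data=(X_test, y_test),
                    callbacks=[early_stop])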
Additionally, I plan to refine the model by adding more informative criteria, such as job stability, credit utilization ratios, and debt-to-income ratios. These gaps are just a few of the limitations that currently hold the model back from being the best possible predictor.
Considering all these factors, it's evident that machine learning algorithms will become an integral part of our future. Imagine having a personal banker at your disposal, capable of providing tailored, informed advice on the market, regardless of time or location. With machine learning, risky decisions made without accounting for the full picture could become a thing of the past, along with the frustrating wait times at banks that too often end in disappointing outcomes. That is the potential of machine learning in banking: a future where intelligent systems give individuals unprecedented financial insight and decision-making support at any time, from just a few pieces of information.
Thank you for reading 🙂