Remaking my first hackathon challenge
A few months ago I published one of my first articles, right here on Medium, about my first contact with Data Science. The project was built with a few teammates during a hackathon offered by a course I took last semester, and the goal was to create a model that predicts whether a bank customer will leave the service or not. In other words, the challenge was to create a Machine Learning model and a deployment for a banking institution.
It’s funny to look back at that article and see how I dealt with my limited knowledge of data science, my programming skills, and many other things. I tried to impress my colleagues with Python libraries such as PyCaret for AutoML and Sweetviz for the EDA part.
Today I understand the need to get the best insights from the data through my own effort, not to get those insights from a single line of code. That’s why I decided to remake this challenge with other interesting libraries and modules that Python has to offer.
So I decided to focus on different libraries, such as SHAP, Streamlit, and Plotly, and to apply the loop structures that come with Python.
This dataset is quite famous (ok, not Titanic or Iris famous), but there are many notebooks and very interesting studies about it, and it was very easy to find on Kaggle.
The project is divided into a few files. I decided to do this so I wouldn’t lose focus on each step of the development, and to better organize my own code. To begin, I started by looking for patterns in the dataset, and there were some very interesting insights to take a look at.
First of all, our dataset has ten thousand records and fourteen columns: clients spread across three European countries, age, credit score, up to four products, and a few columns with binary values such as HasCrCard, IsActiveMember, and the target, Exited.
Whatever the dataset, my greatest concerns are simple questions such as: Is there duplicated data? Are there null values? How many outliers are there? Are there too many zeroed values?
Maybe these questions belong further along in the data process; it’s possible to begin with something easier, like: Are there numeric values? Are there text values? To start answering these questions, I always like to run a simple piece of code:
df.info(). Then it’s possible to start addressing my concerns. One of the biggest problems with Kaggle datasets is that a large part of the data available is already uniform enough to run Machine Learning scripts, so you don’t have to care about much else besides understanding the data and running the machine learning code.
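As a minimal sketch of those first checks (assuming a small toy DataFrame with a few of the dataset’s column names, since the real file isn’t loaded here):

```python
import pandas as pd

# Toy sample mimicking a few columns of the churn dataset (illustrative values)
df = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "France", "Germany"],
    "Age": [42, 41, 42, 39],
    "Balance": [0.0, 83807.86, 159660.80, 0.0],
    "Exited": [1, 0, 1, 0],
})

df.info()                                # dtypes and non-null counts per column
print(df.duplicated().sum())             # how many fully duplicated rows
print(df.isna().sum())                   # null values per column
print((df.select_dtypes("number") == 0).sum())  # zeroed values per numeric column
```

On the real Kaggle file you would load with `pd.read_csv` instead of building the frame by hand; the checks are the same.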
Anyway, one of my newest discoveries about pandas is that there is a way to make tables more interesting and visually appealing; after all, charts are not the only way to show data, right? The pandas Styler is a very interesting way to display data. The downside of this method is that all the data will be displayed, so if you are dealing with ten thousand rows, it’s probably not a good idea to show every row and every column, but it’s very nice for checking small tables packed with information.
With the code below, it’s possible to check how many zeroed values there are per column in the dataset.
And I thought, why not try to apply this to one of my favorite pandas methods, df.describe()? Of course, it takes a few more lines of code to get the result below…
Enjoying it? Because at this point it’s time to plot some charts and go deeper into the exploratory data analysis.
So, thanks to Python’s built-in functions combined with some data viz libraries like seaborn, it’s possible to figure this out.
All clients with four products, without exception, had left the bank. If there was any doubt about this observation, I executed the code df.groupby('NumOfProducts')['Exited'].mean() * 100, just to have a quick view of it.
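Run against a toy sample (illustrative values, not the real dataset), that one-liner looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "NumOfProducts": [1, 2, 4, 4, 1, 3],
    "Exited":        [0, 0, 1, 1, 1, 0],
})

# Churn percentage per number of products; in this sample,
# every 4-product client exited, mirroring the finding in the article
exit_rate = df.groupby("NumOfProducts")["Exited"].mean() * 100
print(exit_rate)
```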
Throughout the Exploratory Data Analysis (EDA), it was possible to see that although the great majority of clients were in France and Spain, the proportion of German customers who left is more worrying than in the other two countries. In numerical terms, 16.2% of French customers had left the bank and 16.7% of Spanish customers, while 32.4% of German customers had left. The other columns don’t provide results as telling as these.
If I weren’t going to develop a Machine Learning model, my job would be finished here, but it wasn’t finished yet.
Also, there is a range in the Age column where the exits stand out. There is a lot of information here and it could go unnoticed, but if we look more closely, we can see it: it is the only part of this chart where the number of clients who left the bank is bigger than the number of clients who stayed with the bank’s services.
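One way to confirm that age band numerically is to bucket Age with pd.cut and count leavers against stayers per bucket (a sketch on toy values; the bin edges are my own choice, not the article’s):

```python
import pandas as pd

df = pd.DataFrame({
    "Age":    [25, 33, 47, 52, 58, 61, 70, 38],
    "Exited": [0, 0, 1, 1, 1, 0, 0, 0],
})

# Bucket ages and cross-tabulate stayed (0) vs. exited (1) per bucket
df["AgeBand"] = pd.cut(df["Age"], bins=[18, 35, 45, 55, 65, 95])
counts = df.groupby(["AgeBand", "Exited"], observed=True).size().unstack(fill_value=0)
print(counts)  # in this toy sample only one band has more leavers than stayers
```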
So if I had to put together a report for the C-level, that information would certainly be in it.