Data Science projects to add to your resume to land your dream job.
Data Science is one of the fastest-growing career fields, as the rising number of data-related jobs shows. However, the requirements for these jobs keep expanding and changing rapidly.
According to a report from Glassdoor, data-driven jobs are the most popular among Gen Z, so competition is fierce. As a result, it’s getting harder and harder to land a job in this growing field of data science.
If you are a newbie applying for a job in Data Science, you’ll need to show your skills. Many people take dozens of courses to learn them. But how can you prove what you’ve learned?
The simple answer: projects, and a well-built Data Science portfolio!
Yes, projects show your understanding of a business problem, your approach to solving it with data, and your technical skills. Regardless of which subfield you are interested in, here are the 3 projects that should not be missing from your Data Science portfolio.
As a data analyst, you will spend most of your time cleaning the data. Real-world data is always messy, and you need to clean it up and organize it well for further processing.
Data cleaning may involve removing irrelevant columns, which requires extensive research to understand the purpose of each column in the dataset and assess its importance going forward.
Often you will have to deal with missing values. Although finding missing values is a simple task, dealing with them can be frustrating. Completely removing records with missing values is not always a good option; you need to deal with the missing values intelligently.
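As a minimal sketch of this idea, here is how missing values might be handled intelligently in pandas. The small sales table is hypothetical; the point is to impute rather than drop rows wholesale.

```python
import pandas as pd
import numpy as np

# Hypothetical example data: a small sales table with gaps.
df = pd.DataFrame({
    "price": [10.0, np.nan, 12.5, 11.0],
    "quantity": [1, 2, np.nan, 4],
    "city": ["Berlin", None, "Munich", "Berlin"],
})

# 1. Inspect how many values are missing per column.
missing_counts = df.isna().sum()

# 2. Impute numeric columns with the median (robust to outliers)
#    instead of dropping whole rows.
for col in ["price", "quantity"]:
    df[col] = df[col].fillna(df[col].median())

# 3. Fill a categorical column with an explicit "unknown" label
#    so the gap stays visible in later analysis.
df["city"] = df["city"].fillna("unknown")
```

Median imputation is only one option; depending on the problem, forward-filling, group-wise imputation, or a model-based approach may be more appropriate, and being able to justify the choice is exactly what interviewers look for.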
You also need to check the data types of all the columns in the dataset. Sometimes a column with date and time values is stored as a string data type, and you need to correct it manually.
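The date-column fix described above can be sketched with pandas; the `order_date` column below is a hypothetical example of dates loaded as plain strings.

```python
import pandas as pd

# Hypothetical example: dates loaded as plain strings.
df = pd.DataFrame({"order_date": ["2023-01-05", "2023-02-17", "2023-03-02"]})
# At this point df["order_date"].dtype is object (string), not datetime.

# Convert the column to a proper datetime dtype;
# errors="coerce" turns unparseable entries into NaT instead of failing.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Now datetime-specific operations work, e.g. extracting the month.
df["month"] = df["order_date"].dt.month
```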
Well, the process of data cleaning depends on the problem you want to solve.
However, when selecting a dataset for a data cleaning project, keep the following points in mind so that you have the best possible chance to show your technical skills.
- It is practical and based on real (non-dummy) data or events, so you can discuss it in your interview if needed.
- It contains various impurities, such as missing values, wrong data types, and irrelevant columns, so you can apply a range of cleanup methods.
- The data has to be assembled from multiple files, so you can also show your data wrangling skills.
Here are some resources where you can find such data.
- Data.gov: open data offered by the US government
- opendata.aws: datasets offered through AWS resources
- worldbank.org: datasets offered by the World Bank
- FiveThirtyEight: a privately owned website; most of its datasets are available for public use
Since you want to show your technical skills, do not use the common and simple data sets you find in any course on data analysis.
In almost 50% of my job interviews I was asked the question: “How will you deal with missing values in your data?” So, besides explaining the process in theory, I discussed my projects and demonstrated my skills in handling missing data.
Once you have cleaned the data, the next important part is to explore it, and you can create a completely new project for your data science portfolio.
As the name implies, exploratory data analysis (EDA) helps you examine the data for underlying patterns, trends, and correlations between different variables. After EDA, you can describe the dataset in summary form.
You can perform exploratory data analysis at any stage of the project. However, it is a good practice to explore the data in the first step, immediately after cleaning it. This gives you the opportunity to investigate anomalies in the data set and gain useful but hidden insights into the data.
You can use many approaches to achieve this, for example:
- Generate statistical summary with built-in and user-defined functions.
- Quickly and easily create graphs of raw data that help you see patterns and correlations between different variables.
- Combine the above two items to get a statistical summary in graphs.
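The steps above can be sketched in a few lines of pandas. The dataset here is synthetic (randomly generated heights and weights) purely to illustrate the workflow; the plotting call is commented out so the script runs headless.

```python
import pandas as pd
import numpy as np

# Hypothetical example dataset: synthetic heights and weights.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, 200),
    "weight_kg": rng.normal(70, 8, 200),
})
# A derived variable to explore alongside the raw ones.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# 1. Statistical summary: count, mean, std, quartiles per column.
summary = df.describe()

# 2. Pairwise correlations between the numeric variables.
corr = df.corr()

# 3. Quick plots (histograms of every column) reveal distributions;
#    commented out here so the script needs no display.
# df.hist(bins=20)
```

Even this tiny sketch surfaces a hidden insight: `corr` shows that the derived BMI column is strongly correlated with weight, which you would then investigate and discuss in your write-up.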
Whether you want to be a data scientist, analyst, or machine learning engineer, EDA skills are a must.
Here’s an interesting project I did as part of my Data Analyst Nanodegree, which is also on my resume. You can get it for free from my GitHub repo.
You can always start with a new dataset or use the dataset which you cleaned up in the first project.
There are also a variety of projects on Kaggle.
Although EDA covers some data visualization skills, you can always create a complete data visualization project to show off your data visualization skills.
Visual representation of the data makes it easy to understand and interpret. However, this is not an easy task.
A correct visual can simplify even complex data, while an incorrect visual can confuse the simplest things.
How should you choose the right chart type then?
There are four basic but broad types of charts, depending on what kind of insight you want to present.
If you want to plot the distribution of a single variable, a histogram is the standard choice; for the joint distribution of two variables, you can use a scatter plot.
Scatter plots and bubble plots are always good for visualizing a relationship between two or three variables.
Bar and column charts are best for comparing two or more variables. If the variables change over time, you can use line graphs. If the variables vary across categories, you can combine bar or column charts with scatter plots.
If you want to visualize the composition of a whole, such as the market share of all companies in the market adding up to 100%, then you can use pie charts.
Pie charts become confusing when all slices have nearly the same share or when there are more than 5–6 slices; stacked bar or column charts are a better way to show composition.
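As an illustration of the stacked-bar alternative, here is a minimal matplotlib sketch. The market-share numbers are hypothetical; each bar represents a whole year summing to 100%.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

# Hypothetical market-share data for two years; each year sums to 100%.
years = ["2022", "2023"]
shares = {
    "Company A": [40, 35],
    "Company B": [35, 40],
    "Company C": [25, 25],
}

# Stacked bar chart: each bar is a whole (100%), and the segments
# show how that whole is composed across companies.
fig, ax = plt.subplots()
bottom = [0, 0]
for company, values in shares.items():
    ax.bar(years, values, bottom=bottom, label=company)
    bottom = [b + v for b, v in zip(bottom, values)]

ax.set_ylabel("Market share (%)")
ax.legend()
fig.savefig("market_share.png")
```

Unlike a pie chart, this layout also makes it easy to compare composition across periods side by side.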
You can also use BI tools such as Power BI and Tableau for data visualization.
In my last interview I was asked: “How would you visualize data with 3 variables in a 2D graph?” This matched exactly a question I had solved in my project, Communicate Data Findings.
Also, you can always improve your visualization skills by participating in various data visualization competitions like the Maven Analytics Challenges.
Depending on which subfield you are interested in, e.g. machine learning, deep learning, NLP, or data engineering, you should also include 1–2 projects from those areas.
You can find some projects in data analytics, machine learning and data engineering on my GitHub repo!