Exploratory Data Analysis (EDA), as the name suggests, is analysis done to explore the data. It consists of a number of components; not all of them are essential every time, nor are they all equally important. Below, I list a few components based on my experience.
Please note that this is by no means an exhaustive list, but rather a guiding framework.
1. Understand the lay of the land.
You don’t know what you don’t know — but you can explore!
The first and foremost thing to do is to get a feel for the data: look at the data entries, eyeball the column values, and check how many rows and columns you have. Try to read what a single record tells you:
- a retailer dataset might tell you: Mr X visited store #2000 on the 1st of Aug 2023 and purchased a can of Coke and one pack of Walkers crisps.
- a social media dataset might tell you: Mrs Y logged onto the social networking website at 09:00 am on the 3rd of June, browsed sections A, B, and C, searched for her friend Mr A, and then logged out after 20 mins.
It’s beneficial to get the business context of the data you have, and to know the source and mechanism of data collection (e.g. survey data vs. digitally collected data).
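A first look can be sketched in a few lines of pandas. The tiny retail table below is made up purely for illustration (the column names and values are assumptions, not a real dataset); in practice you would load your own data instead.

```python
import pandas as pd

# Hypothetical retail transactions, invented for this sketch.
df = pd.DataFrame({
    "customer": ["X", "X", "Y", "Z"],
    "store_id": [2000, 2000, 2001, 2000],
    "date": pd.to_datetime(["2023-08-01", "2023-08-01", "2023-08-02", "2023-08-03"]),
    "item": ["Coke can", "Walkers crisps", "Coke can", "Bread"],
    "quantity": [1, 1, 2, 1],
})

n_rows, n_cols = df.shape   # how many rows and columns you have
print(n_rows, n_cols)
print(df.head())            # eyeball the first few entries
print(df.dtypes)            # data type of each column
```

Reading a handful of rows from `head()` is usually enough to start forming the "Mr X visited store #2000" kind of story for each record.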
2. Double-click into variables
Variables are the talking tongue of a dataset; they are continuously talking to you. You just need to ask the right questions and listen carefully.
→ Questions to ask:
– What do the variables mean/represent?
– Are the variables continuous or categorical? Is there any inherent order?
– What are the possible values they can take?
- For continuous variables: check distributions using histograms and box plots, and carefully study the mean, median, standard deviation, etc.
- For categorical / ordinal variables: find their unique values and build a frequency table to check the most and least frequent ones.
You may or may not understand all variables, labels and values, but try to get as much information as you can.
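The checks above might look like this in pandas. This is a minimal sketch on invented data; `basket_value` and `store_type` are hypothetical column names.

```python
import pandas as pd

# Hypothetical data: one continuous and one categorical variable.
df = pd.DataFrame({
    "basket_value": [12.5, 8.0, 15.0, 9.5, 120.0],
    "store_type": ["city", "city", "suburb", "city", "rural"],
})

# Continuous variable: count, mean, standard deviation, min, quartiles, max.
summary = df["basket_value"].describe()
median = df["basket_value"].median()
print(summary)

# Categorical variable: unique values and a frequency table,
# sorted from most to least frequent.
unique_values = df["store_type"].unique()
freq = df["store_type"].value_counts()
print(freq)
```

Here the mean is pulled well above the median by a single large basket, which is exactly the kind of thing a histogram or box plot would make visible.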
3. Look for patterns/relationships in your data
Through EDA, you can discover patterns, trends, and relationships within the data.
→ Questions to ask:
– Do you have any prior assumptions or hypotheses about relationships between variables?
– Any business reason for some variables to be related to one another?
– Do variables follow any particular distributions?
Data Visualisation techniques, summaries, and correlation analysis help reveal hidden patterns that may not be apparent at first glance. Understanding these patterns can provide valuable insights for decision-making or hypothesis generation.
Think visual bi-variate analysis.
- For pairs of continuous variables: use scatter plots, and create a correlation matrix or heat map.
- For a mixture of continuous and ordinal/categorical variables: consider plotting bar or pie charts, and for pairs of categorical variables create good old contingency tables to visualise the co-occurrence.
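As a rough sketch of the numeric side of bi-variate analysis, the snippet below builds a correlation matrix, a contingency table, and a per-category mean. The data and column names are invented for illustration, and the plots themselves are left out to keep the example self-contained.

```python
import pandas as pd

# Hypothetical data: two continuous and two categorical variables.
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "units_sold": [50, 40, 30, 20, 10],
    "store_type": ["city", "city", "suburb", "suburb", "city"],
    "promo": ["yes", "no", "yes", "no", "no"],
})

# Continuous vs continuous: correlation matrix (Pearson by default).
corr = df[["price", "units_sold"]].corr()
print(corr)

# Categorical vs categorical: contingency table of co-occurrences.
table = pd.crosstab(df["store_type"], df["promo"])
print(table)

# Continuous vs categorical: compare the mean of one across the other.
mean_by_type = df.groupby("store_type")["units_sold"].mean()
print(mean_by_type)
```

A correlation matrix like `corr` is also what you would typically pass to a heat-map plotting function.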
EDA allows you to validate statistical assumptions, such as normality, linearity, or independence, for analysis or data modelling.
4. Detecting anomalies.
Here’s your chance to become Sherlock Holmes on your data and look for anything out of the ordinary! Ask yourself:
– Are there any duplicate entries in the dataset?
Duplicates are entries that represent the same sample point multiple times. Duplicates are not useful in most cases as they do not give any additional information. They might be the result of an error and can mess up your mean, median and other statistics.
→ Check with your stakeholders and remove such errors from your data.
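A minimal sketch of a duplicate check in pandas, on made-up rows. It treats two rows as duplicates only when every column matches, which is an assumption; your data may need a subset of key columns instead.

```python
import pandas as pd

# Hypothetical transactions; the first two rows are exact copies.
df = pd.DataFrame({
    "customer": ["X", "X", "Y", "X"],
    "item": ["Coke", "Coke", "Coke", "Crisps"],
    "quantity": [1, 1, 2, 1],
})

# Show every row that has at least one exact copy, to review with stakeholders.
dup_mask = df.duplicated(keep=False)
print(df[dup_mask])

# Count the extra copies (beyond the first occurrence) and drop them.
n_extra = int(df.duplicated().sum())
deduped = df.drop_duplicates()
print(n_extra, len(deduped))
```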
– Labelling errors for categorical variables?
Look at the unique values of categorical variables and create a frequency chart. Look for misspellings and for labels that might represent the same thing.
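One way to surface labelling problems is to compare frequency tables before and after simple normalisation. The labels below are invented, and the final mapping of a misspelling is a manual, hypothetical fix that you would only apply after confirming it.

```python
import pandas as pd

# Hypothetical product labels with inconsistent case, whitespace and a typo.
labels = pd.Series(["Coke", "coke ", "COKE", "Pepsi", "Pespi", "Pepsi"])

# The raw frequency table exposes the variants.
raw_counts = labels.value_counts()
print(raw_counts)

# Normalising whitespace and case merges the trivial variants.
normalised = labels.str.strip().str.lower()
norm_counts = normalised.value_counts()
print(norm_counts)

# Remaining misspellings need an explicit, reviewed mapping.
fixed = normalised.replace({"pespi": "pepsi"})
print(fixed.value_counts())
```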
– Do some variables have Missing Values?
This can happen to both numeric and categorical variables. Check:
- Are there rows which have missing values for a lot of variables (columns)? This means there are data points which have blanks across the majority of columns → they are not very useful, we may need to drop them.
- Are there variables (or columns) which have missing values across multiple rows? This means there are variables which do not have any values/labels across most data points → they cannot add much to our understanding, we may need to drop them.
– Count the proportion of NULL or missing values for all variables. Variables with more than 15-20% of their values missing should make you suspicious.
– Filter down to the rows where a column is missing and check how the rest of the columns look. Do the majority of columns have missing values together? Is there a pattern?
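The missing-value checks above can be sketched as follows. The data is invented, and `age`, `income` and `city` are hypothetical column names.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data where age and income are missing together.
df = pd.DataFrame({
    "age": [34, np.nan, 51, np.nan, 29],
    "income": [52000, np.nan, 61000, np.nan, 48000],
    "city": ["Leeds", "York", None, "Bath", "Hull"],
})

# Proportion of missing values per column and per row.
missing_share_by_col = df.isna().mean()
missing_share_by_row = df.isna().mean(axis=1)
print(missing_share_by_col)

# Keep only rows where 'age' is missing and see which other columns
# are missing in those same rows.
rows_missing_age = df[df["age"].isna()]
co_missing = rows_missing_age.isna().mean()
print(co_missing)
```

In this toy example `income` is missing in every row where `age` is missing, which hints at a shared cause rather than random gaps.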
– Are there Outliers in my dataset?
Outlier detection is about identifying data points that do not fit the norm. You may see very high or extremely low values for certain numerical variables, or an unusually high or low frequency for some classes of a categorical variable.
- What seems an outlier can be a data error.
While outliers are data points that are unusual for a given feature distribution, unwanted entries or recording errors are samples that shouldn’t be there in the first place.
- What seems an outlier can just be an outlier.
In other cases, we might just have data points with extreme values and perfectly fine reasoning behind them.
Study the histograms, scatter plots, and frequency bar charts to understand if there are a few data points which are farther from the rest. Think through:
– Can they be true and take these extreme values?
– Is there a business reason or justification for these extreme values?
– Would they add value to your analysis at a later stage?
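Alongside the plots, a simple numeric rule can shortlist candidates to think through. The sketch below uses the common 1.5 × IQR convention, which is my assumption here and not something prescribed above; the data is invented. It only flags points for review and does not decide whether they are errors.

```python
import pandas as pd

# Hypothetical basket values with one suspiciously large entry.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="basket_value")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are candidates to investigate, not to delete blindly.
flagged = values[(values < lower) | (values > upper)]
print(flagged)
```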
5. Data Cleaning.
Data cleaning refers to the process of removing unwanted variables and values from your dataset and getting rid of any irregularities in it. These anomalies can disproportionately skew the data and hence adversely affect the results of our analysis from this dataset.
Remember: Garbage In, Garbage Out
– Course correct your data.
- Remove duplicate entries if you find any, along with missing values and outliers that do not add value to your dataset. Get rid of unnecessary rows and columns.
- Correct any mis-spellings, or mis-labelling you observe in the data.
- Any data errors you spot which are not adding value to the data also need to be removed.
– Cap Outliers or let them be.
- In some data modelling scenarios, we may need to cap outliers at either end. Capping is often done at the 95th or 99th percentile at the upper end, and at the 1st or 5th percentile at the lower end.
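Percentile capping can be sketched with `quantile` and `clip`. The series below is made up, and the 1st/99th percentile choice is just one of the options mentioned above.

```python
import pandas as pd

# Hypothetical values 1..99 plus one extreme value of 1000.
values = pd.Series(range(1, 101), dtype=float)
values.iloc[-1] = 1000.0

# Caps at the 1st and 99th percentiles.
low, high = values.quantile(0.01), values.quantile(0.99)

# clip() replaces anything below 'low' with 'low' and anything above 'high' with 'high'.
capped = values.clip(lower=low, upper=high)
print(values.max(), capped.max())
```

Capping keeps every row in the dataset, which is the main difference from dropping outliers outright.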
– Treat Missing Values.
We generally drop data points (rows) with a lot of missing values across variables. Similarly, we drop variables (columns) which have missing values across a lot of data points.
If there are only a few missing values, we might look to plug those gaps or just leave them as they are.
- For continuous variables with missing values, we can plug them using the mean or median (maybe within a particular stratum).
- For categorical missing values, we might assign the most used ‘class’ or maybe create a new ‘not defined’ class.
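Both imputation options might look like this in pandas. The data is invented, and using `region` as the stratum is an assumption made for the sake of the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a continuous and a categorical variable.
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "spend": [10.0, np.nan, 30.0, 100.0, 120.0, np.nan],
    "segment": ["a", "a", None, "b", "a", None],
})

# Continuous: fill each gap with the median of its own region (stratum).
region_median = df.groupby("region")["spend"].transform("median")
df["spend_filled"] = df["spend"].fillna(region_median)

# Categorical, option 1: assign the most frequent class.
most_common = df["segment"].mode().iloc[0]
df["segment_mode"] = df["segment"].fillna(most_common)

# Categorical, option 2: create an explicit 'not defined' class.
df["segment_flagged"] = df["segment"].fillna("not defined")
print(df)
```

Writing the results to new columns keeps the original values available, so you can still tell which entries were imputed.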
– Data enrichment.
Based on the needs of the future analysis, you can add more features (variables) to your dataset, such as (but not restricted to):
- Creating binary variables indicating the presence or absence of something.
- Creating additional labels/classes by using IF-THEN-ELSE clauses.
- Scale or encode your variables as per your future analytics needs.
- Combine two or more variables: use a range of mathematical functions like sum, difference, mean, log and many other transformations.
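Each of these enrichment ideas can be sketched in a line or two of pandas. The columns, the threshold of 100 and the choice of min-max scaling are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical customer spend by channel.
df = pd.DataFrame({
    "online_spend": [0.0, 20.0, 150.0],
    "store_spend": [50.0, 0.0, 350.0],
})

# Binary variable: presence or absence of online shopping.
df["shops_online"] = (df["online_spend"] > 0).astype(int)

# Combine two variables, then apply a log transformation.
df["total_spend"] = df["online_spend"] + df["store_spend"]
df["log_total_spend"] = np.log1p(df["total_spend"])

# IF-THEN-ELSE label: 'high' if total spend is at least 100, else 'low'.
df["spend_band"] = np.where(df["total_spend"] >= 100, "high", "low")

# Scale to the 0-1 range (min-max scaling).
spend_range = df["total_spend"].max() - df["total_spend"].min()
df["total_spend_scaled"] = (df["total_spend"] - df["total_spend"].min()) / spend_range

# Encode the new label as one indicator column per class.
encoded = pd.get_dummies(df, columns=["spend_band"])
print(encoded)
```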