A study of correlation, how it can be misleading, and why it is important to dive deep into the data before taking any business decision.
The Correlation Coefficient between variables is one of the very first numbers, sometimes the only number, we check while doing any data analysis or building models. Decisions like which features can go into the model and which should be dropped, which features can strongly impact our target variable, which features can or cannot go together in a model, etc. are often taken solely on the basis of correlation analysis. But is it always the right thing to do?
Tyler Vigen and his machine of correlation
Tyler Vigen was a student at Harvard Law School. A week before his finals, he set up an automated system that could pair up thousands of different randomly selected variables and calculate the Correlation Coefficient between them. He also built a website where he published all the results along with the charts. By letting his correlation-hunting machine loose in the data-rich world of the internet and publishing its findings on his website, he constantly reminds us of the danger of mindlessly taking correlation analysis (or any data or analysis, for that matter) at face value. Since he kept the computer running, you can find loads of crazy correlations on his website. A few examples: the Correlation Coefficient between “Divorce rate in Maine and Per capita consumption of margarine” is 99.26%, between “Nicolas Cage appearing in a movie and No. of people who drowned by falling into a pool” it is 66.6%, between “Age of Miss America and Murders by steam, hot vapors and hot objects” it is 87.02%, and between “Per capita cheese consumption and No. of people who died by becoming tangled in their bedsheets” it is 94.71%.
Why it happens — The Reason
- A lot of the time, it is pure coincidence. Two features just happen to be similar to each other, leading to a high correlation and convincing everyone that a pattern exists between them when actually there isn’t any.
- Small sample size can also lead to deceptive correlations. Two features with a small number of observations can have their crests and troughs coincide with each other. Given enough samples, this effect might disappear.
- Data cleaning can generate a high correlation where there is none in reality. Often, we smooth the data by binning or by taking ranges, averages, rates, or ratios of features. All of these remove the necessary variability from features, making two features seem closely related to each other and thus producing a high spurious correlation. Removing outliers without proper analysis may also reduce variability or shrink the sample size, further leading to a high spurious Correlation Coefficient.
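The small-sample effect above is easy to reproduce. The sketch below (my own illustration, not from Tyler Vigen's site) pairs up many series of pure random noise and records the largest correlation found; with few observations per series, strikingly high "patterns" appear by chance alone:

```python
import numpy as np

rng = np.random.default_rng(42)

def max_abs_correlation(n_pairs, n_obs):
    """Pair up independent random series and return the largest
    absolute Pearson correlation found among the pairs."""
    best = 0.0
    for _ in range(n_pairs):
        x = rng.normal(size=n_obs)
        y = rng.normal(size=n_obs)  # independent of x by construction
        r = np.corrcoef(x, y)[0, 1]
        best = max(best, abs(r))
    return best

# With only 10 observations per series, some of the 1000 pairs of
# *independent* variables show a very high correlation...
small = max_abs_correlation(1000, 10)
# ...while with 1000 observations the spurious correlations shrink.
large = max_abs_correlation(1000, 1000)
print(f"max |r| with n=10:   {small:.2f}")
print(f"max |r| with n=1000: {large:.2f}")
```

This is exactly what a "correlation machine" exploits: search enough variable pairs and some will correlate by accident, especially when each series is short.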
Spurious, or is it? — Confounding factors and the need for Bayesian and/or Causal AI
So, all high correlations between apparently unrelated variables are spurious, and we can just detect and ignore them. But wait, is even that correct?
At first glance, some of the correlations may seem spurious, for example, the correlation between “Sales of sunscreen lotion and Sales of ice cream” or “Total revenue generated by golf courses in the US and Amount of money spent on spectator sports”. In both of these cases, it may appear that the events are independent of each other and the high Correlation Coefficient between them is just spurious. However, a little thought reveals that there is a hidden confounder that ties them together. In the first case, it is the summer season, and in the second case, it is people’s interest in watching golf firsthand that ties the events together.
Often, it requires prior human knowledge about the situation to understand this type of confounding effect. Hence, a Bayesian approach might help. Having said that, confounding factors can be elusive at times, and it might be very difficult to detect them. On the other hand, it is absolutely necessary to detect them: if they are not detected and taken care of properly, they might influence big business decisions, often wrongly taken due to the flawed analysis. A lot of the time, even human knowledge might not be enough to detect them. Causal AI, or Causal Inference, may be useful in those situations, as it can detect the confounders from the data itself so that they can be accounted for in the analysis.
In this context, it is also useful to remember the age-old saying that “correlation does not imply causation”. Sales of sunscreen might be correlated with sales of ice cream through the confounding effect of the season, but one is definitely not causing the other in any way.
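The sunscreen/ice-cream situation can be simulated. In this hypothetical setup (the variable names and coefficients are invented for illustration), a temperature confounder drives both sales series; the raw correlation is high, but the partial correlation, computed by regressing each variable on temperature and correlating the residuals, collapses once the confounder is controlled for:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounder: temperature drives both sales series;
# neither series causes the other.
temperature = rng.normal(25, 5, size=500)
sunscreen = 2.0 * temperature + rng.normal(0, 3, size=500)
ice_cream = 1.5 * temperature + rng.normal(0, 3, size=500)

# The raw correlation looks impressive...
r = np.corrcoef(sunscreen, ice_cream)[0, 1]

# ...but vanishes once we control for the confounder: regress each
# variable on temperature and correlate the residuals.
def residuals(y, x):
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

r_partial = np.corrcoef(residuals(sunscreen, temperature),
                        residuals(ice_cream, temperature))[0, 1]
print(f"raw correlation:     {r:.2f}")
print(f"partial correlation: {r_partial:.2f}")
```

Partial correlation only adjusts for confounders you already know about and have measured, which is precisely why the harder cases call for causal-inference methods.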
Nonlinearity and Correlation
So far, we have seen some high correlations between variables that in reality might not be related to each other at all. However, a lot of the time, zero correlation also does not imply that the features are unrelated. It is to be noted here that the Correlation Coefficient essentially measures the strength of the linear association between variables. But the variables can still be associated in a nonlinear way. For example, take a look at the following graph:
The Correlation Coefficient between them is very low (-1.74%). However, they are not independent: they have a very strong nonlinear relationship. Hence, zero correlation does not always imply no relationship.
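A minimal reproduction of this idea (not the exact data behind the graph above) is a quadratic relationship over a symmetric range: y is completely determined by x, yet Pearson's r is essentially zero because the association is not linear:

```python
import numpy as np

# y is a deterministic function of x, yet the *linear* association
# measured by Pearson's r is essentially zero over a symmetric range.
x = np.linspace(-1, 1, 201)
y = x ** 2
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r for y = x^2: {r:.4f}")  # ~0, despite perfect dependence
```

Measures that can pick up such dependence include the Predictive Power Score discussed below, as well as mutual information.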
Have you looked at your data? Now understand it thoroughly
From all of the above, it is clear that before carrying out any analysis, building a model, or taking important decisions based on correlation, it is extremely important to have an in-depth understanding of the data first. Below are a few non-exhaustive ways to do that.
- Understand each variable: Don’t use any variable without understanding its business origin, meaning, and implications. If there is any doubt about any of the variables, it is always wise to go back to the business and get an in-depth understanding of them. Try to understand the source of each variable. Names can often be deceiving; if we understand how a feature was generated, we may find out that two highly correlated features have nothing in common and should not have any relationship at all. The reverse is also true. This also helps us find any confounding factors between two features.
- Plot the data: Both univariate and bivariate plots of the data are important. While univariate plots might help us understand the trend, variance, or existence of extreme values in the data, bivariate plots help us understand the relationship between two features, whether it is linear or nonlinear, and what transformation might help establish a linear relationship between them.
- Do proper EDA: EDA or Exploratory Data Analysis is a must in any analytical work. Identifying missing values and outliers, understanding their origin, and treating them in a proper way is extremely important. Also, knowing each variable’s distribution (mean, variance, etc. along with probability distribution) helps in properly using them in the decision-making process. For example, a feature with very low variance might not be suitable for a classification task.
- Predictive Power Score: Methods like the Predictive Power Score (PPS) help identify how useful one feature is in predicting the values of another. It is asymmetric in nature: how well feature A predicts feature B is different from how well B predicts A, which is not true of correlation analysis. Also, it is data-type agnostic, i.e., it works for categorical as well as numerical variables. Additionally, unlike the Correlation Coefficient, it can also identify nonlinear relationships.
- Feature importance and other interpretability methods: If it is a modeling task, it is often useful to see the feature importance and other explainability methods like SHAP, LIME, etc. to know which features are most important in the decision-making process and whether they have any spurious relationship or not.
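The asymmetry behind the Predictive Power Score can be demonstrated with a toy score of my own (the real `ppscore` library uses cross-validated decision trees; this sketch only imitates the idea with per-bin medians): on a quadratic relationship, x predicts y well, y cannot recover the sign of x, and Pearson's r sees nothing at all.

```python
import numpy as np

rng = np.random.default_rng(7)

def pps_like(x, y, bins=10):
    """Toy, PPS-inspired score: how much better do per-bin medians of x
    predict y than the single global median of y?  1 = perfect,
    0 = no better than the naive baseline."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
    bin_medians = np.array([np.median(y[idx == b]) for b in range(bins)])
    mae_model = np.mean(np.abs(y - bin_medians[idx]))
    mae_naive = np.mean(np.abs(y - np.median(y)))
    return max(0.0, 1.0 - mae_model / mae_naive)

x = rng.uniform(-3, 3, 2000)
y = x ** 2 + rng.normal(0, 0.3, 2000)

print(f"score(x -> y): {pps_like(x, y):.2f}")  # high: x predicts y well
print(f"score(y -> x): {pps_like(y, x):.2f}")  # low: y can't recover x's sign
print(f"Pearson r:     {np.corrcoef(x, y)[0, 1]:.2f}")  # near zero
```

In practice you would use the `ppscore` package itself rather than a hand-rolled score; the point here is only why an asymmetric, nonlinearity-aware measure complements the Correlation Coefficient.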
In this article, we saw how the Correlation Coefficient can be misleading in many cases and why it is important to fully analyze each feature (sometimes using Bayesian methods or Causal AI) before coming to any conclusion about their relationships and taking decisions based on them. Any decision taken without properly understanding the data can be extremely dangerous. We also saw a few generic methods to understand the data.
So, next time, before dropping or combining features, concluding anything based on correlation analysis, or feeding your data directly to your favorite AutoML library, understand, analyze, and explore the data thoroughly.