What is Causal Inference? Learn how you can know what works and use it effectively to optimize business processes. A user-friendly guide to Causal Inference!
Almost all of you will have learned in school the favorite mantra of statistics: “correlation doesn’t imply causation”. Just because two phenomena are highly correlated, it doesn’t mean that one causes the other. For example, there is a high correlation between sunrise and the call of the rooster; however, the correlation alone doesn’t tell us whether it’s the call of the rooster that causes the sun to rise or the rising of the sun that causes the rooster to call. Businesses and industries operate with limited resources, so it is important for managers to know what causes the desired change. For example, which of the marketing campaigns is driving sales? Which drug is more effective in treating headaches? Which app version leads to higher user retention? All of these questions are answered by what researchers, economists, and data scientists call causal inference. In this article, I am going to tell you what causal inference is, why you should care, how you can use it in your work, which methods exist, and lastly which considerations you need to keep in mind while applying it. Lots of stuff to unpack here, so grab a cup of coffee (or tea) and start sipping the knowledge right away!
In very simple words, causal inference is the study of cause-and-effect relationships. Those who practice causal inference ask questions such as: does X cause Y, and what is the effect on Y of changing X? For example, what is the effect of changing the format of the website on user retention? Do we see increased customer satisfaction when we use in-person conversations instead of automated chats on the website? To answer these kinds of questions, we need to see two different worlds in parallel: a world in which the “intervention” is given to a person and one in which the intervention is not given to that person. Let us take the example of a very simple A/B test. Ideally, we would want to measure the behavior of a person, X, while using our website version A, and the behavior of the same person, X, while using our website version B, both at the same time. Because we do not live in a world where clones of a person can visit both versions at the same time, we cannot directly measure the impact of website version A vs website version B on that person. This is a problem. What do we do now? Enter counterfactual!
The word counterfactual can be divided into two words: counter and factual. The meaning of counterfactual is therefore something that is opposite to the facts of a given situation. For example, if a group of students in a university is given additional job training, then this group is called the factual group. The counterfactual is the IDENTICAL group that is not given the training. The question of the counterfactual concerns itself with what would have happened if the group had not received the training. Because we cannot give and not give training to the same student at the same time, we need an ESTIMATE of the counterfactual, that is to say, a group that is as close as possible to the group that is being trained. This group is also called the control group, that is, a group that serves as the baseline and does not receive the training. Sounds complex? Let me try to explain with the help of an example. Let’s say we work for a bank that has demographic and income data on the individuals who hold accounts with it. We would like to roll out a new personal service to those customers who earn more than one million rupees a year, to help them understand the bank’s various products, and we want to know whether that personal service leads to the uptake of other financial products. The intervention is therefore the personal service, and the outcome is uptake, as measured by the amount of financial products purchased. The estimated counterfactual group will be the group that is as similar as possible to the original group. How will you define this group? Let’s proceed further!
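To make the idea of a missing counterfactual concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the customers, the outcome scale, and the size of the effect are invented, not taken from any real bank. Inside a simulation we can generate both potential outcomes for every customer, which is exactly what we can never do in reality, and then hide one of them to mimic what an analyst actually sees.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

n = 10_000

# Each simulated customer has TWO potential outcomes (products purchased):
# y0 = outcome without the personal service, y1 = outcome with it.
# Both numbers are hypothetical and exist only inside this simulation.
customers = []
for _ in range(n):
    y0 = random.gauss(10, 3)
    individual_effect = random.gauss(2, 1)  # assumed effect, varies by person
    customers.append((y0, y0 + individual_effect))

# The true average effect is knowable only because we simulated both worlds.
true_average_effect = mean(y1 - y0 for y0, y1 in customers)

# In reality each customer reveals only ONE outcome. Here a coin flip decides
# which world we observe, so the untreated customers act as the estimated
# counterfactual for the treated ones.
observed_treated, observed_control = [], []
for y0, y1 in customers:
    if random.random() < 0.5:
        observed_treated.append(y1)  # y0 for this customer stays unseen
    else:
        observed_control.append(y0)  # y1 for this customer stays unseen

estimated_effect = mean(observed_treated) - mean(observed_control)

print(f"true average effect: {true_average_effect:.2f}")
print(f"estimated effect:    {estimated_effect:.2f}")
```

No individual effect is ever observed in the second half of the script, yet the comparison of the two groups recovers the average effect closely, because the coin flip makes the groups similar on average.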
The most important reason is that our world works in cause-and-effect relationships. The rising of the sun causes the rooster to call. Choosing a particular angle while hitting the ball causes the ball to fall in the hole. The second important reason is that the interactions of our users with our products are not random, but deliberate and thought-out. Unless we know why they do what they do, we will not be able to offer them what they need. This requires us to have empathy for our users and then to test methodically what works and under what conditions. The third is that with the rise of machine learning, it is possible for us to take an even more nuanced approach to causal inference when making business decisions. Machines are great at doing repeatable tasks fast, but they’re still not great at tasks that require an understanding of cause and effect. With the world having limited resources, we need to test and then roll out the policies that are most impactful; the field of econometrics allows us to do that.
The field of Econometrics concerns itself with the structured study of Causal Inference. However, the whole of that knowledge can be boiled down to the following core steps.
- Defining the intervention/treatment/policy/service: The first step is defining what it is that you are going to roll out. This could be a new welfare scheme, a new marketing approach, or a new sales technique.
- Defining the outcome of interest: The next step is defining the outcome of interest. What is it that you want to understand the effect on? For example, does the new sales technique lead to a higher average basket value? Does the marketing campaign lead to higher audience engagement? Does the child welfare scheme lead to lower child mortality? These are the kinds of outcomes you may be interested in.
- Defining the group that receives the treatment/intervention/policy/service: This group can be decided randomly, which gives rise to one of the most influential experimental methods, the Randomized Controlled Trial. However, randomized controlled trials are not always possible because of monetary, social, political, or geographical constraints. The other way to decide this group is to use a cut-off or a boundary for who is included. For example, a policy may be rolled out only to farms located in one province and not the neighbouring one, or preferential treatment may be given only to those customers who spend more than a specified amount.
- Estimating the counterfactual: As I have described above, we can never have a perfect counterfactual, but we can have a near-perfect counterfactual using the methods of econometrics. Once this counterfactual is constructed, we can proceed to the next step.
- Doing the comparison: Once we have a clear understanding of the counterfactual and the factual group, we need to compare the outcome of interest for both groups. This helps us make statistical decisions on whether the policy/initiative helped meet the strategic goals of the organization or not. At this stage, it is also important to state the assumptions and possible errors that could arise in the analysis. These are detailed in the last section of this article.
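The five steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the sales technique, the basket values, and the size of the uplift are all invented, and the group is assigned at random, which is the simplest of the options described in the third step. The comparison uses a plain difference in means with a normal-approximation 95% confidence interval.

```python
import random
from math import sqrt
from statistics import mean, variance

random.seed(42)  # fixed seed so the sketch is reproducible

# Step 1: the intervention is a (hypothetical) new sales technique.
# Step 2: the outcome of interest is the basket value of each customer.
# Step 3: the treated group is decided by a coin flip per customer.
n = 4000
gets_new_technique = [random.random() < 0.5 for _ in range(n)]

# Simulated outcomes. The uplift of 5 and the spread of 20 are assumptions.
assumed_uplift = 5.0
basket_value = [
    random.gauss(100 + (assumed_uplift if treated else 0.0), 20)
    for treated in gets_new_technique
]

# Step 4: with random assignment, the untreated customers serve as the
# estimated counterfactual for the treated ones.
treated = [v for v, t in zip(basket_value, gets_new_technique) if t]
control = [v for v, t in zip(basket_value, gets_new_technique) if not t]

# Step 5: compare the outcome across the two groups.
difference = mean(treated) - mean(control)
standard_error = sqrt(
    variance(treated) / len(treated) + variance(control) / len(control)
)
ci_low = difference - 1.96 * standard_error
ci_high = difference + 1.96 * standard_error

print(f"estimated uplift: {difference:.2f}")
print(f"approx. 95% CI:   [{ci_low:.2f}, {ci_high:.2f}]")
```

If the interval excludes zero, the data are hard to reconcile with "no effect at all". With a cut-off or boundary instead of a coin flip, step 4 would need one of the methods described in the next section.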
One of the most influential figures in the field of Causal Inference, Joshua Angrist, has coined the term “Furious Five” to describe the five most frequently used methods of causal inference. For the sake of simplicity, I will mention only those five here (along with short descriptions); however, there are many more methods, each with its own intricacies, that a researcher or data scientist can use in their work. I suggest that the motivated learner open any influential econometrics journal to discover more methods. Alright, what are the furious five?
- Random Assignment: This method involves randomly dividing the sampling frame into two groups. One of them receives the intervention and the other doesn’t. This is not always possible because of the ethical issues involved in withholding the intervention from some people.
- Regression: This is our old friend OLS regression, which measures how the outcome changes with one variable while holding the other included variables fixed. For example, someone may be interested in the difference in wages that each additional year of education leads to in an individual’s life. This isn’t my favourite method, so I will ask the reader to delve deeper into it on their own.
- Instrumental Variables: In simple words, an instrumental variable (IV) is a third variable that is correlated with the intervention/policy/activity/service variable but affects the outcome of interest only through that variable, and is unrelated to any unobserved factors that drive the outcome. It is mostly used with observational data when there is a risk of omitted variable bias.
- Regression Discontinuity Design: This method uses an artificial cut-off to divide people into two groups. For example, in a school only those who score more than 90% receive the training and the others do not. The assumption here is that, apart from being on different sides of the 90% cut-off, those who score in the range of 88–90% and those in the range of 90–92% are similar to each other.
- Difference-in-differences Method: This method compares the change in the outcome over time between two groups. One group receives the intervention and the other doesn’t, and the outcome is measured for both groups before and after the intervention. The final estimate is the change in the treated group minus the change in the untreated group; because it is a difference of two differences, it’s called difference-in-differences.
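The difference-in-differences idea from the list above is easy to see with numbers. The sketch below uses simulated data with invented values: the two groups start at different levels, both share the same upward trend over time (this "parallel trends" condition is the key assumption of the method), and only the treated group gets an extra effect. The helper `simulate_group` and all parameter values are hypothetical.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible


def simulate_group(n, baseline, trend, effect):
    """Return (before, after) outcome lists for one hypothetical group."""
    before = [random.gauss(baseline, 1.0) for _ in range(n)]
    after = [random.gauss(baseline + trend + effect, 1.0) for _ in range(n)]
    return before, after


assumed_effect = 1.5  # the effect we hope to recover
shared_trend = 2.0    # both groups drift upward by this much anyway

treated_before, treated_after = simulate_group(
    5000, baseline=12.0, trend=shared_trend, effect=assumed_effect
)
control_before, control_after = simulate_group(
    5000, baseline=10.0, trend=shared_trend, effect=0.0
)

# First difference: change over time within each group.
treated_change = mean(treated_after) - mean(treated_before)
control_change = mean(control_after) - mean(control_before)

# Second difference: the change in the treated group beyond the change
# that happened in the control group anyway.
did_estimate = treated_change - control_change

# For contrast, a naive after-only comparison mixes the effect with the
# pre-existing gap between the groups.
naive_after_gap = mean(treated_after) - mean(control_after)

print(f"difference-in-differences estimate: {did_estimate:.2f}")
print(f"naive after-only gap:               {naive_after_gap:.2f}")
```

The naive gap is inflated by the initial difference in levels, and the treated group's own before/after change is inflated by the shared trend. Subtracting one difference from the other removes both.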
As mentioned earlier, these are not the only methods of causal inference. There are some highly advanced methods such as Synthetic Control Methods, Bayesian Optimization, and Interrupted Time Series Design.
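Of the five methods listed above, instrumental variables is probably the least intuitive, so here is a small simulated sketch of it. All variables and coefficients are invented for illustration: `hidden` is an unobserved factor that drives both the treatment and the outcome, and `instrument` shifts the treatment but has no direct path to the outcome. The estimator used is the simple ratio of covariances (the single-instrument case), not a full two-stage least squares routine from any library.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the sketch is reproducible


def covariance(a, b):
    """Sample covariance, written out to keep the sketch dependency-free."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


n = 20_000
assumed_effect = 2.0  # true effect of treatment on outcome in this simulation

instrument, treatment, outcome = [], [], []
for _ in range(n):
    z = random.gauss(0, 1)       # instrument: unrelated to the hidden factor
    hidden = random.gauss(0, 1)  # omitted variable we pretend not to observe
    x = 0.8 * z + hidden + random.gauss(0, 1)
    y = assumed_effect * x + 3.0 * hidden + random.gauss(0, 1)
    instrument.append(z)
    treatment.append(x)
    outcome.append(y)

# Plain regression slope of outcome on treatment: biased, because the
# hidden factor pushes treatment and outcome in the same direction.
ols_slope = covariance(treatment, outcome) / covariance(treatment, treatment)

# IV estimate: how much the outcome moves with the instrument, divided by
# how much the treatment moves with the instrument.
iv_estimate = covariance(instrument, outcome) / covariance(instrument, treatment)

print(f"regression slope (biased): {ols_slope:.2f}")
print(f"IV estimate:               {iv_estimate:.2f}")
```

The IV estimate works here only because the simulation guarantees that the instrument affects the outcome solely through the treatment. With real data that condition cannot be checked from the data alone and has to be argued from subject-matter knowledge.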
With all the promises, you must be feeling excited to try out causal inference on your data right away. However, there are important considerations you need to keep in mind while using the methods. The first and foremost is that correlation between two variables doesn’t mean causation. Apart from this, there are four more that require special mention.
- Omitted Variables: Sometimes, there is a lurking third variable that affects both of the variables you are studying. You may have read about this famous data that asserted that eating more ice cream is linked to more deaths by shark attack. What is easy to miss here is that higher temperature is the omitted variable: it leads more people to swim in the sea and also leads more people to buy ice cream. Similarly, an apparent link between drinking more coke and higher violence may be due to poverty.
- Reverse Causality: One needs to be careful and have subject-matter expertise to argue clearly that the issue at hand is not one of reverse causality. For example, this article mentioned that more sex leads to more income. However, some people said that it’s the other way round, as people with more income are more likely to have dates that result in sex.
- Sampling Bias: The classical statistical issue of sampling bias plays out in causal inference as well, especially when users/customers/people self-select into a particular intervention. Have you ever come across something like “98% of the people who took this online survey find that online surveys are helpful”? That’s the anomaly of self-selection.
- Measurement Error: Many times, it’s just not possible to have accurate measurements of what we are trying to measure. In that case, both inferential statistics and causal inference can run into problems. Another issue is that people are sometimes reserved when describing the reality of their situation. For example, in countries with a less accepting attitude towards homosexual relationships, people are less likely to report accurately. These are some of the issues one needs to keep in mind.
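The omitted-variable problem from the list above can be reproduced with a short simulation. The numbers below are invented and are not the data behind the ice cream and shark attack story: temperature drives both ice cream sales and shark attacks, and ice cream has no effect on attacks at all. The sketch first computes the naive regression slope, then removes the part of each variable that temperature explains and recomputes the slope on what is left, which is one simple way to "control for" a variable.

```python
import random
from statistics import mean

random.seed(3)  # fixed seed so the sketch is reproducible


def slope(x, y):
    """Slope of the least-squares line of y on x."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def residuals(y, x):
    """What is left of y after removing its linear relationship with x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


n = 5000
temperature = [random.gauss(25, 5) for _ in range(n)]
# Both series depend on temperature only; neither depends on the other.
ice_cream = [2.0 * t + random.gauss(0, 5) for t in temperature]
shark_attacks = [0.3 * t + random.gauss(0, 2) for t in temperature]

# Naive analysis: attacks appear to rise with ice cream sales.
naive_slope = slope(ice_cream, shark_attacks)

# Adjusted analysis: strip out temperature from both, then compare.
adjusted_slope = slope(
    residuals(ice_cream, temperature),
    residuals(shark_attacks, temperature),
)

print(f"naive slope:    {naive_slope:.3f}")
print(f"adjusted slope: {adjusted_slope:.3f}")
```

The adjustment only works because temperature was measured. If the lurking variable is not in the data, no amount of regression will remove its influence, which is where methods such as instrumental variables come in.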
If you have read this far, it means you are very interested in learning more about causal inference and the methods that are used for causal analysis. Three of my favorite books are:
- Causal Inference: The Mixtape by Scott Cunningham
- Causal Inference: What If by Miguel A. Hernán and James M. Robins
- Mostly Harmless Econometrics by Joshua D. Angrist and Jörn-Steffen Pischke
- Special Mention: Causal Inference for the Brave and True
If you are interested in knowing more, or if you are a practitioner who would like to network, this is my LinkedIn profile. I will be happy to connect.