How to use random forests to do policy targeting
In my previous blog post, we saw how to use causal trees to estimate the heterogeneous treatment effects of a policy. If you haven’t read it, I recommend reading it first, since we will take that article’s content for granted and start from there.
Why heterogeneous treatment effects (HTE)? First of all, estimating heterogeneous treatment effects allows us to select which individuals (patients, users, customers, …) to offer a treatment (a drug, ad, product, …), depending on their expected outcome of interest (a disease, firm revenue, customer satisfaction, …). In other words, estimating HTE allows us to do targeting. In fact, as we will see later in the article, a treatment can be ineffective or even counterproductive on average while bringing positive benefits to a subset of the users. The opposite can also be true: a drug can be effective on average, but its effectiveness can be improved if we identify users on whom it has side effects.
In this article, we will explore an extension of causal trees: causal forests. Exactly as random forests extend regression trees by averaging multiple bootstrapped trees together, causal forests extend causal trees. The main difference comes from the inference perspective, which is less straightforward. We are also going to see how to compare the outputs of different HTE estimation algorithms and how to use them for policy targeting.
For the rest of the article, we resume the toy example used in the causal trees article: we assume we are an online store, and we are interested in understanding whether offering discounts to new customers increases their expenditure in the store.
To understand whether the discount is cost-effective, we have run the following randomized experiment or A/B test: every time a new customer browses our online store, we randomly assign them to a treatment condition. To treated users we offer the discount, to control users we do not. I import the data-generating process from src.dgp. I also import some plotting functions and libraries from src.utils. To include not only code but also data and tables, I use Deepnote, a Jupyter-like web-based collaborative notebook environment.
We have data on 100,000 online store visitors, for whom we observe the time of the day they accessed the website, the device they use, their browser, and their geographical region. We also see whether they were offered the discount, our treatment, and their spend, the outcome of interest.
Since the treatment was randomly assigned, we can use a simple difference-in-means estimator to estimate the treatment effect. We expect the treatment and control group to be similar, except for the discount, therefore we can causally attribute any difference in spend to the discount.
The discount seems to be effective: on average the spending in the treatment group increases by 1.95$. But are all customers equally affected?
To answer this question, we would like to estimate heterogeneous treatment effects, possibly at the individual level.
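The difference-in-means estimate mentioned above can be sketched in a few lines. This is only an illustration on simulated data: the function names, the sample size, and the true effect of 2 are assumptions made for the example, not the article's data-generating process.

```python
import random
from statistics import mean

def simulate_experiment(n=20_000, seed=0):
    """Simulate a toy A/B test: (discount, spend) pairs with a made-up true effect of 2."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        discount = rng.random() < 0.5            # randomized assignment, 50/50
        spend = 5 + 2 * discount + rng.gauss(0, 1)
        data.append((discount, spend))
    return data

def difference_in_means(data):
    """Average spend of treated customers minus average spend of control customers."""
    treated = [spend for discount, spend in data if discount]
    control = [spend for discount, spend in data if not discount]
    return mean(treated) - mean(control)

data = simulate_experiment()
ate_hat = difference_in_means(data)
print(f"Estimated average treatment effect: {ate_hat:.2f}")
```

Because assignment is random, the two groups differ only by the discount, so the gap in average spend estimates the average effect.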
There are many different options to compute heterogeneous treatment effects. The simplest one is to interact the treatment with a dimension of heterogeneity in a regression of the outcome of interest. The problem with this approach is which variable to pick. Sometimes we have prior information that might guide our actions; for example, we might know that mobile users on average spend more than desktop users. Other times, we might be interested in one dimension for business reasons; for example, we might want to invest more in a certain region. However, when we do not have extra information, we would like this process to be data-driven.
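To make the interaction approach concrete, here is a minimal sketch for one binary dimension (mobile versus desktop). With a single binary covariate, the interaction coefficient of a saturated regression equals the difference between the two subgroup difference-in-means estimates, so no regression library is needed. The data and effect sizes are made up for illustration.

```python
import random
from statistics import mean

def simulate(n=40_000, seed=1):
    """Toy data: the discount effect is 3 for mobile and 1 for desktop users (made-up values)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        mobile = rng.random() < 0.5
        discount = rng.random() < 0.5
        effect = 3 if mobile else 1
        spend = 5 + 1 * mobile + effect * discount + rng.gauss(0, 1)
        rows.append((mobile, discount, spend))
    return rows

def subgroup_effect(rows, mobile_value):
    """Difference in means between treated and control within one device group."""
    treated = [y for m, d, y in rows if m == mobile_value and d]
    control = [y for m, d, y in rows if m == mobile_value and not d]
    return mean(treated) - mean(control)

rows = simulate()
effect_desktop = subgroup_effect(rows, False)
effect_mobile = subgroup_effect(rows, True)
# In the saturated regression spend ~ discount + mobile + discount:mobile,
# the interaction coefficient equals this difference of subgroup effects.
interaction = effect_mobile - effect_desktop
print(f"desktop: {effect_desktop:.2f}, mobile: {effect_mobile:.2f}, interaction: {interaction:.2f}")
```

The limitation discussed above is visible here: we had to decide in advance that device was the dimension worth splitting on.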
In the previous article, we explored one data-driven approach to estimate heterogeneous treatment effects: causal trees. We will now expand them to causal forests. However, before we start, we have to introduce their non-causal cousin: random forests.
Random Forests, as the name suggests, are an extension of regression trees that adds two separate sources of randomness on top of them. In particular, a random forest algorithm takes the predictions of many different regression trees, each trained on a bootstrapped sample of the data, and averages them together. This procedure is generally known as bagging (bootstrap aggregating); it can be applied to any prediction algorithm and is not specific to Random Forests. The additional source of randomness comes from feature selection: at each split, only a random subset of all the features X is considered for the optimal split.
These two extra sources of randomness are extremely important and contribute to the superior performance of random forests. First of all, bagging allows random forests to produce smoother predictions than regression trees by averaging multiple discrete predictions. Random feature selection instead allows random forests to explore the feature space more in-depth, allowing them to discover more interactions than simple regression trees. In fact, there might be interactions between variables that are on their own not very predictive (and therefore would not generate splits) but together very powerful.
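The smoothing effect of bagging can be seen in a small sketch. It covers only the bootstrap-and-average part (with a single feature there is nothing to subsample), and it uses one-split trees ("stumps") written from scratch; all names and the simulated data are assumptions for illustration.

```python
import random

def fit_stump(points):
    """Fit a one-split regression tree (a stump) on (x, y) pairs by least squares."""
    points = sorted(points)
    best_sse, best_rule = float("inf"), None
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in points[:i]]
        right = [y for _, y in points[i:]]
        m_left, m_right = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - m_left) ** 2 for y in left) + sum((y - m_right) ** 2 for y in right)
        if sse < best_sse:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best_sse, best_rule = sse, (threshold, m_left, m_right)
    if best_rule is None:  # all x identical: predict the overall mean everywhere
        overall = sum(y for _, y in points) / len(points)
        best_rule = (float("inf"), overall, overall)
    return best_rule

def predict_stump(stump, x):
    threshold, m_left, m_right = stump
    return m_left if x <= threshold else m_right

def fit_bagged_stumps(points, n_trees=50, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement) of the data."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_trees)]

def predict_bagged(stumps, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

# Simulated data: y increases linearly in x, plus noise
rng = random.Random(42)
points = []
for _ in range(100):
    x = rng.random()
    points.append((x, x + rng.gauss(0, 0.1)))

grid = [i / 100 for i in range(101)]
single = fit_stump(points)
bagged = fit_bagged_stumps(points)
single_values = {round(predict_stump(single, x), 6) for x in grid}
bagged_values = {round(predict_bagged(bagged, x), 6) for x in grid}
print(len(single_values), "distinct predictions from one stump;",
      len(bagged_values), "from the bagged average")
```

A single stump can only predict two values, while the bagged average changes every time one of the bootstrapped thresholds is crossed, which is what produces the smoother prediction.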
Causal Forests are the equivalent of random forests for the estimation of heterogeneous treatment effects, just as causal trees are the equivalent of regression trees. As with Causal Trees, we have a fundamental problem: we are interested in predicting an object that we do not observe, the individual treatment effects τᵢ. The solution is to create an auxiliary outcome variable Y* whose expected value for every single observation is exactly the treatment effect.
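For reference, one common way to write such an auxiliary outcome is the inverse-probability-weighted form below. The notation is an assumption introduced here: Dᵢ is the binary treatment indicator and p(Xᵢ) is the probability of treatment given the covariates, which is a known constant in a randomized experiment.

```latex
Y_i^* \;=\; \frac{Y_i \,\bigl(D_i - p(X_i)\bigr)}{p(X_i)\,\bigl(1 - p(X_i)\bigr)}
      \;=\; \frac{D_i \, Y_i}{p(X_i)} \;-\; \frac{(1 - D_i)\, Y_i}{1 - p(X_i)},
\qquad
\mathbb{E}\bigl[\, Y_i^* \mid X_i \,\bigr] \;=\; \tau(X_i)
```

The second expression shows why the expectation works out: treated outcomes are scaled up by the inverse probability of being treated, control outcomes by the inverse probability of being a control, and the two terms are differenced.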
If you want to know more about why this variable is unbiased for the individual treatment effect, have a look at my previous post, where I go into more detail. In short, you can interpret Yᵢ* as the difference-in-means estimator for a single observation.
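A quick simulation can illustrate the unbiasedness claim. The sketch assumes the common inverse-probability form Y* = Y (D − p) / (p (1 − p)), with p the treatment probability; the function name and the simulated numbers are made up for the example.

```python
import random

def transformed_outcome(y, d, p):
    """Auxiliary outcome Y* = Y * (D - p) / (p * (1 - p)) for treatment probability p."""
    return y * (d - p) / (p * (1 - p))

rng = random.Random(0)
p = 0.5            # treatment probability in the simulated experiment
true_effect = 2.0  # made-up average treatment effect
y_star = []
for _ in range(50_000):
    d = 1 if rng.random() < p else 0
    y = 5 + true_effect * d + rng.gauss(0, 1)
    y_star.append(transformed_outcome(y, d, p))

mean_y_star = sum(y_star) / len(y_star)
print(f"Average Y*: {mean_y_star:.2f} (true effect: {true_effect})")
```

Each individual Yᵢ* is very noisy (here it is roughly +14 for treated and −10 for control customers), but its average is close to the true effect, which is what makes it usable as a prediction target.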
Once we have an outcome variable, there are still a couple of things we need to do in order to use Random Forests to estimate heterogeneous treatment effects. First, we need to build trees that have an equal number of treated and control units in each leaf. Second, we need to use different samples to build the tree and evaluate it, i.e. compute the average outcome per leaf. This procedure is often referred to as honest trees and it’s extremely helpful for inference since we can treat the sample of each leaf as independent from the tree structure.
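The honest-estimation idea can be sketched with a single split. One half of the data chooses the split, and the other half estimates the effect inside each leaf. The splitting rule used here (maximize the gap between leaf effects over a small grid of thresholds) is a simplification chosen for illustration, not the criterion used by EconML, and the data are simulated.

```python
import random

def diff_in_means(rows):
    """Treated-minus-control mean outcome; rows are (x, d, y) tuples."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def choose_split(rows, candidates):
    """Pick the threshold whose two leaves have the most different estimated effects."""
    def effect_gap(c):
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        return abs(diff_in_means(right) - diff_in_means(left))
    return max(candidates, key=effect_gap)

# Simulated data: the effect is 3 when x > 0.5 and 0 otherwise (made-up values)
rng = random.Random(0)
rows = []
for _ in range(20_000):
    x = rng.random()
    d = 1 if rng.random() < 0.5 else 0
    effect = 3.0 if x > 0.5 else 0.0
    rows.append((x, d, 5 + effect * d + rng.gauss(0, 1)))

# Honest split: the rows are i.i.d., so slicing gives two independent samples
build_sample, estimation_sample = rows[:10_000], rows[10_000:]

threshold = choose_split(build_sample, [i / 10 for i in range(2, 9)])
effect_left = diff_in_means([r for r in estimation_sample if r[0] <= threshold])
effect_right = diff_in_means([r for r in estimation_sample if r[0] > threshold])
print(f"split at {threshold}: left effect {effect_left:.2f}, right effect {effect_right:.2f}")
```

Since the leaf estimates come from observations that played no role in choosing the threshold, they are not biased by the search over splits.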
Before we go into the estimation, let’s first generate dummy variables for our categorical variables.
We can now estimate the heterogeneous treatment effects using the Causal Forest algorithm. Luckily, we don’t have to do all this by hand: there is a great implementation of Causal Trees and Forests in Microsoft’s EconML package. We will use the CausalForestDML function.
Differently from Causal Trees, Causal Forests are harder to interpret since we cannot visualize every single tree. We can use the
SingleTreeCateInterpreter function to plot an equivalent representation of the Causal Forest algorithm.
We can interpret the tree diagram exactly as for the Causal Tree model. On the top, we can see the average Y* in the data, 1.917$. Starting from there, the data gets split into different branches according to the rules highlighted at the top of each node. For example, the first node splits the data into two groups of size 46,878 and 53,122 depending on whether the time is later than 11.295. At the bottom, we have our final partitions with the predicted values. For example, the leftmost leaf contains 40,191 observations with time earlier than 11.295 and non-Safari browser, for which we predict an effect on spend of 0.264$. Darker node colors indicate higher prediction values.
The problem with this representation is that, differently from the case of Causal Trees, it is only an interpretation of the model. Since Causal Forests are made of many bootstrapped trees, there is no way to directly inspect each decision tree. One way to understand which feature is most important in determining the tree split is the so-called feature importance.
time is the first dimension of heterogeneity, followed by
device (mobile in particular) and
browser (safari in particular). Other dimensions do not matter much.
Let’s now check the model performance.
Normally, we would not be able to directly assess the model performance since, differently from standard machine learning setups, we do not observe the ground truth. Therefore, we cannot use a test set to compute a measure of the model’s accuracy. However, in our case, we control the data-generating process and therefore we have access to the ground truth. Let’s start by analyzing how well the model estimates heterogeneous treatment effects along the categorical dimensions of the data.
For each categorical variable, we plot the actual and estimated average treatment effect.
The Causal Forest algorithm is pretty good at predicting the treatment effects related to the categorical variables. As for Causal Trees, this is expected since the algorithm has a very discrete nature. However, differently from Causal Trees, the predictions are more nuanced.
We can now do a more relevant test: how well does the algorithm perform with a continuous variable such as time? First, let’s again isolate the predicted treatment effects on time and ignore the other covariates.
We can now replicate the previous figure, but for the
time dimension. We plot the average true and estimated treatment effect for each time of the day.
We can now fully appreciate the difference between Causal Trees and Forests: while, in the case of Causal Trees, the estimates were essentially a very coarse step function, we can now see how Causal Forests produce smoother estimates.
Now that we have explored the model, it’s time to use it!
Suppose that we were considering offering a 4$ discount to new customers that visit our online store.
For which customers is the discount effective? We have estimated an average treatment effect of 1.9492$, which means that the discount is not really profitable on average. However, we are now able to target single individuals and we can offer the discount only to a subset of the incoming customers. We will now explore how to do policy targeting. To get a better understanding of the quality of the targeting, we will use the Causal Tree model as a reference point.
We build a Causal Tree using the same
CausalForestDML function but restricting the number of estimators and the forest size to 1.
Next, we split the dataset into a train and a test set. The idea is very similar to cross-validation: we use the training set to train the model (in our case, the estimator for the heterogeneous treatment effects) and the test set to assess its quality. The main difference is that we do not observe the true treatment effects in the test dataset. But we can still use the train-test split to compare in-sample predictions with out-of-sample predictions.
We put 80% of all observations in the training set and 20% in the test set.
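A random 80/20 split like the one described above takes only a few lines. The helper below is a hypothetical sketch using only the standard library, with integers standing in for customer records.

```python
import random

def train_test_split(rows, test_share=0.2, seed=0):
    """Shuffle the rows and put a `test_share` fraction of them in the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]

customers = list(range(1_000))  # stand-in for customer records
train, test = train_test_split(customers)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before cutting matters whenever the rows are stored in a meaningful order, for example by time of arrival.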
First, let’s retrain the models only on the training sample.
Now we can decide on a targeting policy, i.e. decide to which customers we offer the discount. The answer seems simple: we offer the discount to all the customers for whom we anticipate a treatment effect larger than the cost, 4$.
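The targeting rule just described can be written down directly. The predicted effects below are made-up numbers, and the function name is hypothetical.

```python
def targeting_policy(predicted_effects, cost):
    """Offer the discount only where the predicted effect exceeds its cost."""
    return [effect > cost for effect in predicted_effects]

predicted_effects = [0.5, 6.2, 3.9, 4.0, 9.1]   # made-up effect predictions, in dollars
cost = 4
offer = targeting_policy(predicted_effects, cost)
share_treated = sum(offer) / len(offer)
# Predicted profit: effect minus cost, summed over the customers we target
expected_profit = sum(e - cost for e, o in zip(predicted_effects, offer) if o)
print(offer, share_treated)
```

Note the strict inequality: a customer whose predicted effect exactly equals the cost brings no profit, so this sketch leaves them untreated.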
A visualization tool that helps us understand on whom the treatment is effective, and by how much, is the so-called Targeting Operator Characteristic (TOC) curve. The name is reminiscent of the much more famous receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate for different thresholds of a binary classifier. The idea is similar: we plot the average treatment effect for different shares of the treated population. At one extreme, when all customers are treated, the curve takes value equal to the average treatment effect, while at the other extreme, when only one customer is treated, the curve takes value equal to the maximum treatment effect.
Now let’s compute the curve.
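As a minimal sketch of the computation, assuming the curve ranks customers by their predicted effect and averages those same predictions, the TOC is a cumulative mean over the sorted effects. The input values are made up.

```python
def toc_curve(predicted_effects):
    """Average predicted effect among the top-q share of customers, for each q.

    Returns (share, average_effect) pairs, ranking customers from the highest
    to the lowest predicted effect.
    """
    ranked = sorted(predicted_effects, reverse=True)
    curve, running_total = [], 0.0
    for k, effect in enumerate(ranked, start=1):
        running_total += effect
        curve.append((k / len(ranked), running_total / k))
    return curve

effects = [0.5, 6.2, 3.9, 4.0, 9.1]   # made-up effect predictions
curve = toc_curve(effects)
for share, average_effect in curve:
    print(f"top {share:.0%}: average effect {average_effect:.2f}")
```

The first point equals the largest predicted effect and the last point equals the overall average, matching the two extremes described above.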
Now we can plot the Targeting Operator Characteristic curves for the two CATE estimators.
As expected, the TOC curve is decreasing for both estimators since the average effect decreases as we increase the share of treated customers. In other words, the more selective we are in releasing discounts, the higher the effect of the coupon, per customer. I have also plotted a horizontal line with the discount cost so that we can interpret the shaded area below the TOC curve and above the cost line as the expected profits.
The two algorithms predict a similar share of treated, around 20%, with the Causal Forest algorithm targeting slightly more customers. However, they predict very different profits. The Causal Tree algorithm predicts a small and constant margin, while the Causal Forest algorithm predicts a larger and steeper margin. Which algorithm is more accurate?
In order to compare them, we can evaluate them in the test set. We take the model trained on the training set, we predict the treatment effects and we compare them with the predictions from a model trained on the test set. Note that, differently from machine learning standard testing procedures, there is a substantial difference: in our case, we cannot evaluate our predictions against the ground truth, since the treatment effects are not observed. We can only compare two predictions with each other.
It seems that the Causal Tree model performs better than the Causal Forest model, with a total net effect of 8,386$ against 4,948$. From the plot, we can also understand the source of the discrepancy. The Causal Forest algorithm tends to be more restrictive and treats fewer customers, making no false positives but also having a lot of false negatives. On the other hand, the Causal Tree algorithm is much more generous and distributes the discount to many more new customers. This translates into more true positives but also more false positives. The net effect seems to favor the Causal Tree algorithm.
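One way such a comparison could be tallied is sketched below. It assumes a customer is targeted when the training-sample prediction exceeds the cost and counts as a correct target when the reference effect also exceeds it; the function name and all numbers are hypothetical.

```python
def evaluate_policy(predicted, reference, cost):
    """Compare a targeting rule built on `predicted` effects against `reference` effects."""
    summary = {"true_positive": 0, "false_positive": 0, "false_negative": 0, "net_effect": 0.0}
    for pred, ref in zip(predicted, reference):
        treat, should_treat = pred > cost, ref > cost
        if treat:
            summary["net_effect"] += ref - cost   # gain (or loss) from treating this customer
            summary["true_positive" if should_treat else "false_positive"] += 1
        elif should_treat:
            summary["false_negative"] += 1        # profitable customer we failed to target
    return summary

predicted = [5.0, 4.5, 1.0, 3.0, 8.0]   # made-up effects from the training-sample model
reference = [6.0, 3.0, 0.5, 7.0, 9.0]   # made-up effects used as the benchmark
result = evaluate_policy(predicted, reference, cost=4)
print(result)
```

False positives lower the net effect directly, while false negatives are forgone profit that never shows up in it, which is why a generous policy can come out ahead of a restrictive one.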
Normally, we would stop here since there is not much more we can do. However, in our case, we have access to the true data-generating process. Therefore we can check the ground-truth accuracy of the two algorithms.
First, let’s compare them in terms of the prediction error of the treatment effects. For each algorithm, we compute the mean squared error of the treatment effects.
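The error metric itself is simple. The sketch below uses made-up true effects and two made-up sets of predictions, one coarse and one smoother, to show the calculation.

```python
def mean_squared_error(true_effects, predicted_effects):
    """Average squared gap between true and predicted treatment effects."""
    return sum((t - p) ** 2 for t, p in zip(true_effects, predicted_effects)) / len(true_effects)

true_effects = [0.0, 2.0, 4.0, 6.0]         # known only because we control the simulation
coarse_predictions = [1.0, 1.0, 5.0, 5.0]   # step-like predictions (made up)
smooth_predictions = [0.5, 2.0, 4.5, 5.5]   # smoother predictions (made up)

mse_coarse = mean_squared_error(true_effects, coarse_predictions)
mse_smooth = mean_squared_error(true_effects, smooth_predictions)
print(mse_coarse, mse_smooth)
```

This comparison is only possible with simulated data, since real experiments never reveal the individual treatment effects.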
The Causal Forest model better predicts the treatment effects, with a mean squared error of 0.5555 instead of 0.9035.
Does this map into better targeting? We can now replicate the same barplot we did above, to understand how well the two algorithms perform in terms of policy targeting.
The plot is very similar, but the result differs substantially. In fact, the Causal Forest algorithm now outperforms the Causal Tree algorithm with a total effect of 10,395$ compared to 8,828$. Why this sudden difference?
To better understand the source of the discrepancy let’s plot the TOC based on the ground truth.
As we can see, the TOC is very skewed and there exist a few customers with very high treatment effects. The Causal Forest algorithm is better able to identify them and therefore is overall more effective, despite targeting fewer customers.