## A guide to avoiding the common pitfalls of event studies

Event studies are useful tools for causal inference in **quasi-experimental situations**, where the treatment is not randomly assigned. In contrast to randomized experiments (i.e., A/B tests), one cannot rely on a simple comparison of group means to make **causal inferences** in these situations, and event studies become very useful.

Event studies are also frequently used to see if there are any pre-treatment differences between the treated and nontreated groups as a way to pretest parallel trends, a critical assumption of a popular causal inference method called difference-in-differences (DiD).

**However, recent literature illustrates a variety of pitfalls in event studies. If ignored, these pitfalls can have significant consequences when using event studies for causal inference or as a pretest for parallel trends.**

In this article, I will discuss these pitfalls and recommendations on how to avoid them. I will focus on applications in the context of panel data, where units are observed over time. I will use a toy example to illustrate the pitfalls and recommendations. You can find the full code used to simulate and analyze the data here. In this article, I limit the code shown to the most crucial parts to avoid clutter.

## An Illustrative Example

Event studies are commonly used to investigate the impact of an event such as a new regulation in a country. A recent example of such an event is the implementation of lockdowns due to the pandemic. Many businesses were affected by the lockdowns because people started spending more time at home. For example, a music streaming platform may want to know whether people’s music consumption patterns have changed due to lockdowns so that it can address these changes and serve its customers better.

A researcher working for this platform can investigate whether the amount of music consumed has changed after the lockdown, using the countries that never imposed a lockdown, or imposed one later, as control groups. An event study would be appropriate in this situation. Assume for this article that the countries that impose a lockdown remain locked down until the end of our observation period and that the lockdown is binary (i.e., ignore that the strictness of the lockdown can vary).

## Event Study Specification

I will focus on event studies of the following form:
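Written out (a standard form consistent with the variable definitions below, using relative period *l* = -1 as the reference):

```latex
Y_{it} = \alpha_i + \gamma_t + \sum_{l \neq -1} \beta_l D_{it}^{l} + \epsilon_{it}
```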

*Yᵢₜ* is the outcome of interest. *αᵢ* are the unit fixed effects, which control for time-constant unit characteristics. *γₜ* are the time fixed effects, which control for time trends or seasonality. *l* is the time relative to the treatment: it indicates how many periods have passed since the treatment at a given time *t*. For example, *l* = -1 indicates one period before the treatment, and *l* = 2 indicates two periods after the treatment. *Dˡᵢₜ* is the treatment dummy for relative period *l* at time *t* for unit *i*. Basically, we include both the leads and lags of the treatment. *ϵᵢₜ* is the random error.

The coefficient of interest, *βₗ*, indicates the average treatment effect in a given relative period *l*. The observation window has T periods, so the periods range from 0 to T-1. Units get treated in different periods, and each group of units treated at the same time forms a treatment cohort. This type of event study is a difference-in-differences (DiD) design in which units receive the treatment at different points in time (Borusyak et al. 2021).

*Illustrative example continued:*

In line with our illustrative example, I simulate a panel dataset. In this dataset, there are 10,000 customers (or units) and 5 periods (from period 0 to 4). I sample unit- and time-fixed effects at random for these units and periods, respectively. Overall, we have 50,000 (10,000 units x 5 periods) observations at the customer-period level. The outcome of interest is the music consumption measured in hours.

I randomly assign the customers to 3 different countries. One of these countries imposed a lockdown in period 2, another in period 3, and one never imposed a lockdown. Thus, **customers from these different countries are treated at different times**. To make it easy to follow, **I will refer to the customers by their treatment cohorts** depending on when they have been treated: *cohort period 2* and *cohort period 3* for customers treated in periods 2 and 3, respectively. One of the cohorts is never treated, and I refer to it as *cohort period 99* for ease of coding.

In the simulation, after these customers are randomly assigned to one of these cohorts, I create the treatment dummy variable `treat`, which equals 1 if `period >= cohort_period` and 0 otherwise. `treat` indicates whether a unit is treated in a given period. Next, I create a **dynamic treatment effect** that grows in each treated period (e.g., 1 hour in the period in which the treatment happens and 2 hours in the period after that). Treatment effects are zero for pre-treatment periods.

**I calculate the outcome of interest** `hrs_listened` as the sum of a constant that I chose (80), the unit and time fixed effects, the treatment effect, and error (random noise) for each unit and period. **By construction, the treatment (lockdowns) has a growing positive impact on music consumption.**
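As a rough, self-contained sketch of this data-generating process (the article’s actual `make_data()` is not shown; the names below mirror the data snapshot, and I omit the fixed effects for brevity):

```r
set.seed(123)

# 9 example units over periods 0-4; cohorts 2, 3, and 99 (never treated)
d <- expand.grid(unit = 1:9, period = 0:4)
d$cohort_period <- c(2, 3, 99)[(d$unit - 1) %% 3 + 1]

# treat = 1 from the cohort's treatment period onward
d$treat <- as.integer(d$period >= d$cohort_period)

# dynamic effect: 1 hour in the treatment period, growing by 1 hour per period
d$tau_cum <- ifelse(d$treat == 1, d$period - d$cohort_period + 1, 0)

# outcome: constant (80) + treatment effect + noise (fixed effects omitted)
d$hrs_listened <- 80 + d$tau_cum + rnorm(nrow(d), sd = 0.5)
```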

I skip some of the setup and simulation parts of the code to avoid clutter, but you can find the full code here.

In the following image, I show a snapshot of the data. `unit` refers to customers, and `cohort_period` refers to when a unit was treated. `hrs_listened` is the dependent variable; it measures a customer’s music consumption in a given period, in hours.

```r
rm(list = ls())

library(data.table)
library(fastDummies)
library(tidyverse)
library(ggthemes)
library(fixest)
library(kableExtra)

data <- make_data(...)
kable(head(data[, ..select_cols]), 'simple')
```

In the following image, I illustrate the trends in the average music listening by cohort and period. I also mark when the countries have imposed lockdowns for the first time. You can see that there seems to be a positive impact of the lockdowns for both the earlier- and later-treated countries compared to the customers from the untreated cohort.

```r
# Graph average music listening by cohort and period
avg_dv_period <- data[, .(mean_hrs_listened = mean(hrs_listened)),
                      by = c('cohort_period', 'period')]

ggplot(avg_dv_period, aes(fill = factor(cohort_period), y = mean_hrs_listened, x = period)) +
  geom_bar(position = "dodge", stat = "identity") +
  coord_cartesian(ylim = c(79, 85)) +
  labs(x = "Period", y = "Hours", title = 'Average music listening (hours)',
       caption = 'Cohort 2 is the early treated, cohort 3 is the late treated and cohort 99 is the never treated group.') +
  theme(legend.position = 'bottom',
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)) +
  scale_fill_manual(values = cbPalette) +  # cbPalette is defined in the omitted setup code
  geom_vline(xintercept = 1.5, color = '#999999', lty = 5) +
  geom_vline(xintercept = 2.5, color = '#E69F00', lty = 5) +
  geom_text(label = 'Cohort period 2 is treated', aes(1.4, 83), color = '#999999', angle = 90) +
  geom_text(label = 'Cohort period 3 is treated', aes(2.4, 83), color = '#E69F00', angle = 90) +
  guides(fill = guide_legend(title = "Treatment cohort period"))
```

Since this dataset is simulated, I know the true treatment effect of lockdowns for each cohort and each period. In the following graph, I present the true treatment effect of the lockdowns.

In the treatment period itself (relative period 0), both cohorts increase their listening by 1 hour. One period after the treatment (relative period 1), the treatment effect is 2 hours for both cohorts. Two periods after the treatment (relative period 2, observed only for the early-treated cohort), the treatment effect is 3 hours.

One thing to notice here is that the treatment effect is homogeneous across cohorts within each relative period (e.g., 1 hour in relative period 0; 2 hours in relative period 1). Later, we will see what happens when this is not the case.

```r
# Graph the true treatment effects
avg_treat_period <- data[treat == 1, .(mean_treat_effect = mean(tau_cum)),
                         by = c('cohort_period', 'period')]

ggplot(avg_treat_period, aes(fill = factor(cohort_period), y = mean_treat_effect, x = period)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Period", y = "Hours", title = 'True treatment effect (hrs)',
       caption = 'Cohort 2 is the early treated, cohort 3 is the late treated and cohort 99 is the never treated group.') +
  theme(legend.position = 'bottom',
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)) +
  scale_fill_manual(values = cbPalette) +
  guides(fill = guide_legend(title = "Treatment cohort period"))
```

Now, we do an event study by regressing `hrs_listened` on the relative period dummies. The relative period is the difference between `period` and `cohort_period`: negative relative periods indicate the periods before the treatment, and non-negative ones indicate the treatment period and after. We use unit fixed effects (αᵢ) and period fixed effects (γₜ) in all the event study regressions.

In the following table, I report the results of this event study. Unsurprisingly, there are no effects detected pre-treatment. Post-treatment effects are precisely and correctly estimated as 1, 2, and 3 hours. So everything works so far! Let’s see situations where things don’t work as well…

```r
# Create relative time dummies to use in the regression
data <- data %>%
  # make relative period indicator (99 = never treated)
  mutate(rel_period = ifelse(cohort_period == 99, 99, period - cohort_period))
summary(data$rel_period)

data <- data %>%
  dummy_cols(select_columns = "rel_period")
rel_per_dummies <- colnames(data)[grepl('rel_period_', colnames(data))]

# Rename dummies with minuses to handle them more easily
rel_per_dummies_new <- gsub('-', 'min', rel_per_dummies)
setnames(data, rel_per_dummies, rel_per_dummies_new)

# Event study: drop the never-treated dummy and use relative period -1 as the reference
covs <- setdiff(rel_per_dummies_new, c('rel_period_99', 'rel_period_min1'))
covs_collapse <- paste0(covs, collapse = '+')

formula <- as.formula(paste0('hrs_listened ~ ', covs_collapse))
model <- feols(formula,
               data = data, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
```

Everything has worked well so far. Here are the top four pitfalls to watch out for when using the event study approach:

**1. No anticipation assumption**

Many applications of event studies in the literature impose a no-anticipation assumption, which means that **treated units do not change their behavior in expectation of the treatment before the treatment**. When the no-anticipation assumption holds, one can use the period before the event as (one of) the reference period(s) and compare the other periods to it.

However, the no-anticipation assumption might not hold in some cases, e.g., when the treatment is announced before it is imposed and the units can respond to the announcement by adjusting their behavior. In this case, **one needs to choose the reference periods carefully to avoid bias**. If you have an idea of when the subjects start to anticipate the treatment and change their behavior, you can use that period as the de facto beginning of the treatment and use the period(s) before that as the reference period (Borusyak et al. 2021).

For example, if you suspect that the subjects change their behavior in *l* = -1 (one period before the treatment) because they anticipate the treatment, you can use *l* = -2 (two periods before the treatment) as your reference period. You do this by dropping *Dˡᵢₜ* for *l* = -2 from the equation while keeping the dummy for *l* = -1. This way, the *l* = -2 period serves as the reference period. To check whether your hunch about units changing their behavior in *l* = -1 is right, you can check whether the estimated treatment effect in *l* = -1 is statistically significant.

*Illustrative example continued:*

Going back to our illustrative example, lockdowns are usually announced a bit before the imposition of the lockdown, which might affect the units’ pre-treatment behavior. For example, people might already start working from home once the lockdown is announced but not yet imposed.

As a result, people can change their music-listening behavior even before the actual implementation of the lockdown. If the lockdown is announced one period before the actual implementation, one can use relative period -2 as the reference period by dropping its dummy from the specification (and including the dummy for relative period -1).

In line with this example, I copy and modify the original data to introduce some anticipation effects: a 0.5-hour increase in the hours listened for all units in relative period -1. I call this new dataset with anticipation `data_anticip`.
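The modification itself can be sketched as follows (a base-R illustration on a hypothetical toy frame; the article’s actual code for this step is not shown):

```r
# Toy frame: two units observed in relative periods -2..0
d <- data.frame(unit = rep(1:2, each = 3),
                rel_period = rep(-2:0, times = 2),
                hrs_listened = 80)

# Copy the data and add a 0.5-hour anticipation bump in relative period -1
data_anticip <- d
data_anticip$hrs_listened[data_anticip$rel_period == -1] <-
  data_anticip$hrs_listened[data_anticip$rel_period == -1] + 0.5
```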

The next graph shows the average music listening time over relative periods. It is easy to notice that the listening time already starts to pick up in the relative period -1 compared to the relative periods -2 and -3. Ignoring this significant change in the listening time can create misleading results.

```r
# Summarize the hours listened over relative period (excluding the untreated cohort)
avg_dep_anticip <- data_anticip[rel_period != 99,
                                .(mean_hrs_listened = mean(hrs_listened)),
                                by = rel_period]
setorder(avg_dep_anticip, 'rel_period')
rel_periods <- sort(unique(avg_dep_anticip$rel_period))

ggplot(avg_dep_anticip, aes(y = mean_hrs_listened, x = rel_period)) +
  geom_bar(position = "dodge", stat = "identity", fill = 'deepskyblue') +
  coord_cartesian(ylim = c(79, 85)) +
  labs(x = "Relative period", y = "Hours",
       title = 'Average music listening over relative time period',
       caption = 'Only for the treated units') +
  theme(legend.position = 'bottom',
        legend.title = element_blank(),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12)) +
  scale_x_continuous(breaks = min(rel_periods):max(rel_periods))
```

Now, let’s do an event study as we did before by regressing the hours listened on the relative time period dummies. Keep in mind that the only thing I changed is the effect in the relative period -1 and the rest of the data is exactly the same as before.

You can see in the following table that the pre-treatment effects are negative and significant even though there are no real treatment effects in these periods. The reason is that we use the relative period -1 as the reference period and this messes up all the effect estimations. What we need to do is to use a period where there is no anticipation as the reference period.

```r
# Event study on the anticipation data (reference period still -1)
formula <- as.formula(paste0('hrs_listened ~ ', covs_collapse))
model <- feols(formula,
               data = data_anticip, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
```

In the following table, I report the event study results from the new regression where I use relative period -2 as the reference period. Now, we have the right estimates! There is no effect detected in the relative period -3, though an effect is correctly detected for the relative period -1. Furthermore, the effect sizes for the post-treatment periods are now correctly estimated.

```r
# Use relative period -2 as the reference period instead
covs_anticip <- setdiff(c(covs, 'rel_period_min1'), 'rel_period_min2')
covs_anticip_collapse <- paste0(covs_anticip, collapse = '+')

formula <- as.formula(paste0('hrs_listened ~ ', covs_anticip_collapse))
model <- feols(formula,
               data = data_anticip, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
```

**2. Assumption of homogeneous treatment effects across cohorts**

In the equation shown before, the treatment effect can only vary by relative time period. **The implicit assumption here is that these treatment effects are homogeneous across treatment cohorts**. However, if this implicit assumption is wrong, the estimated treatment effects can differ significantly from the actual treatment effects, causing bias (Borusyak et al. 2021). An example is a situation where earlier-treated cohorts benefit more from the treatment than later-treated groups. This means that the treatment effects differ across cohorts.

The simplest solution to address this issue is to **allow for heterogeneity**. To allow for treatment effect heterogeneity between cohorts, one can estimate relative-time- and cohort-specific treatment effects, as seen in the following specification, where *c* stands for the treatment cohort. Everything is the same as in the previous specification, except that a treatment effect *βₗ,c* is now estimated for each combination of relative time and treatment cohort. *Dᵢᶜ* is the treatment-cohort dummy for a given unit *i*.
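A form consistent with these definitions (again with relative period *l* = -1 as the reference) is:

```latex
Y_{it} = \alpha_i + \gamma_t + \sum_{c} \sum_{l \neq -1} \beta_{l,c} \, \left( D_i^{c} \cdot D_{it}^{l} \right) + \epsilon_{it}
```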

*Illustrative example continued:*

In the lockdown example, it might be that the effect of lockdowns is different across treated countries for different reasons (e.g., maybe in one of the countries, people are more likely to comply with the new regulation). Thus, one should estimate the country and relative time-specific treatment effects instead of merely estimating the relative time-specific treatment effect.

In the original simulated dataset, I introduce cohort heterogeneity in the treatment effects and call this new dataset `data_hetero`. The treatment effect for cohort period 2 is 1.5 times that of cohort period 3 in all treated periods, as illustrated in the next graph.

Now, as we did before, let’s run an event study on `data_hetero`. The results of this event study are reported in the following table. Even though there are no treatment or anticipation effects in the pre-treatment periods, the event study detects statistically significant effects! This is because we do not account for the heterogeneity across cohorts.

```r
# Event study ignoring cohort heterogeneity
formula <- as.formula(paste0('hrs_listened ~ ', covs_collapse))
model <- feols(formula,
               data = data_hetero, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
```

Let’s account for the heterogeneity in treatment effects across cohorts by regressing the hours listened on cohort-specific relative period dummies. In the following table, I report the results of this event study: the treatment effect estimates for each cohort and relative period. By allowing the treatment effects to vary per cohort, we account for the heterogeneity, and as a result, we have the right estimates! No pre-treatment effects are detected, as it should be.

```r
# Create dummies for the cohort period
data_hetero <- data_hetero %>%
  dummy_cols(select_columns = "cohort_period")
cohort_dummies <- c('cohort_period_2', 'cohort_period_3')

# Create interactions between relative period and cohort dummies
interact <- as.data.table(expand_grid(cohort_dummies, covs))
interact[, interaction := paste0(cohort_dummies, ':', covs)]
interact_covs <- interact$interaction
interact_covs_collapse <- paste0(interact_covs, collapse = '+')

# Run the event study with cohort-specific relative period dummies
formula <- as.formula(paste0('hrs_listened ~ ', interact_covs_collapse))
model <- feols(formula,
               data = data_hetero, panel.id = "unit",
               fixef = c("unit", "period"))
summary(model)
```

**3. Under-identification in the fully dynamic specification in the absence of a never-treated group**

**In a fully dynamic event study specification**, where one includes all leads and lags of the treatment (usually only relative time -1 is dropped to avoid perfect multicollinearity), **the treatment effect coefficients are not identified in the absence of a never-treated group**. The reason is that the dynamic causal effects cannot be distinguished from a combination of unit and time effects (Borusyak et al. 2021). The **practical solution is to drop another pre-treatment dummy** (i.e., one more of the lead treatment dummies) to avoid the under-identification problem.

*Illustrative example continued:*

Imagine that we do not have data on any untreated countries. Thus, we only have the treated countries in our sample. We can still do an event study utilizing the variation in the treatment timing. In this case, however, we have to use not only one but at least two reference periods to avoid under-identification. One can do this by dropping the period right before the treatment and the most negative relative period dummies from the specification.

In the simulated dataset, I drop the observations from the untreated cohort and call this new dataset `data_under_id`. Now, we have only treated cohorts in our sample; the rest is the same as the original simulated dataset. Thus, we have to use at least two reference periods, dropping two of the pre-treatment relative period dummies. I choose to exclude the dummies for relative periods -1 and -3. I report the results from this event study below. As you can see, only one pre-treatment relative period (-2) is now estimated in the model, and the estimates are correct. Great!
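Although the article’s code for this step is omitted, the under-identification itself is easy to reproduce on a hypothetical toy panel with only treated cohorts (a base-R sketch; `lm()` reports non-identified, aliased coefficients as `NA`):

```r
set.seed(1)

# Toy panel: two treated cohorts (periods 2 and 3), no never-treated group
d <- expand.grid(unit = 1:6, period = 0:4)
d$cohort <- ifelse(d$unit <= 3, 2, 3)
d$rel <- d$period - d$cohort            # relative periods -3..2
d$y <- rnorm(nrow(d))

# Helper: build relative period dummies for a chosen set of kept periods
dummies <- function(df, keep) {
  nms <- paste0("rel_", gsub("-", "min", keep))
  for (i in seq_along(keep)) df[[nms[i]]] <- as.integer(df$rel == keep[i])
  list(df = df, terms = nms)
}

# Fully dynamic: drop only relative period -1 -> one coefficient is not identified
full <- dummies(d, setdiff(-3:2, -1))
m1 <- lm(reformulate(c("factor(unit)", "factor(period)", full$terms), "y"), full$df)
sum(is.na(coef(m1)))   # 1: one aliased (non-identified) coefficient

# Drop a second pre-treatment dummy (-3) as well -> fully identified
restr <- dummies(d, setdiff(-3:2, c(-1, -3)))
m2 <- lm(reformulate(c("factor(unit)", "factor(period)", restr$terms), "y"), restr$df)
sum(is.na(coef(m2)))   # 0: all coefficients identified
```

The aliasing arises because, without a never-treated group, the relative period is an exact linear function of the period and cohort dummies; removing a second lead breaks that dependency.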

**4. Using event studies as a pretest for parallel trends assumption**

**It is a common strategy to use event studies as a pretest for the parallel trends assumption (PTA), a crucial assumption of the difference-in-differences (DiD) approach**. PTA states that in the absence of the treatment, the treated and untreated units would follow parallel trends in terms of the outcome of interest. Event studies are used to see whether the treated group behaves differently than the non-treated group before the treatment occurs. It is thought that if a statistically significant difference is not detected between the treated and untreated groups the PTA is likely to hold.

However, Roth (2022) shows that **this approach can be problematic**. One issue is that **these types of pretests often have low statistical power**, which makes it harder to detect differing trends. Another issue is that with high statistical power, you might detect differing pre-treatment trends (pre-trends) even when they are not practically important.

Roth (2022) recommends a **few approaches to address this problem**:

- **Do not rely solely on the statistical significance** of the pretest coefficients and take the statistical power of the pretest into account. If the power is low, the event study won’t be very informative with regard to the existence of a pre-trend. If the power is high, the results of the pretest might still be misleading, as you might find a statistically significant pre-trend that is not practically important.
- **Consider approaches that avoid pretesting altogether**, e.g., use economic knowledge in the given context to choose the right PTA, such as a conditional PTA. Another way is to use the later-treated group as the control group if you think the treated and untreated groups follow different trends and are not as comparable. See Callaway & Sant’Anna’s 2021 paper for potential ways to relax the PTA.

*Illustrative example continued:*

Going back to the original example where we have three countries, let’s say that we want to perform a DiD analysis and we want to find support indicating that the PTA holds in this context. This would mean that if the treated countries were not to be treated the music consumption would move in parallel to the music consumption in the untreated country.

We consider using an event study as a way to pretest the PTA because there is no way to test the PTA directly. First, we need to take the statistical power of the test into account. Roth (2022) provides some tools to do this. Although this is out of the scope of this article, I can say that we have relatively high statistical power in this simulated dataset, because the random noise is low and we have a relatively large sample with not that many coefficients to estimate. Still, it can be good to run scenario analyses to see how big a pre-treatment effect one could correctly detect.

Second, regardless of the statistical significance of the pre-treatment estimates, take the specific context into account: do I expect the treated countries to follow the same trends as the untreated country? In my simulated data, I know this for sure, as I determine what the data looks like. In the real world, however, it is unlikely that this would hold unconditionally. Thus, I would consider a conditional PTA, conditioning on covariates that make the countries more comparable to each other.

## Conclusion

Event studies are powerful tools, but one should be aware of their potential pitfalls. In this article, I explored the most commonly encountered pitfalls and provided recommendations on how to address them, using a simulated dataset. I discussed issues relating to the no-anticipation assumption, heterogeneity of treatment effects across cohorts, under-identification in the absence of an untreated cohort, and the use of event studies as a pretest for the PTA.