#### Three rules, illustrated with a real-life consulting example

In the real world, after studying data science and/or statistics, this is what might matter most: **how to deal with imperfect real-life data?** As a statistical consultant and teacher of statistics to researchers in in-vivo labs, I am often asked to process data after it has been collected. The fact is that there is a gap between the ideal paper experiment designed by a statistician and what people have done or can do.

This article will first present the main idea (three rules) and then illustrate the idea with a recent real-life example from my practice.

**My three rules to deal with imperfect experimental designs**, from corporate A/B testing to scientific research:

1. **Accept imperfection:** Define the optimal study design and try to get as close to it as possible. Often, we cannot implement the ideal experiment we have defined on paper. However, comparing what you actually do with your initial ideal experimental design can help you identify and assess potential limitations.

2. **Be realistic:** Objectively assessing what you have may be one of the most difficult tasks, but it can also save you the most time. Trying to get something that does not exist is often very costly: you might give up later anyway, you might come to the wrong conclusion or recommendation, or other people might find out (and you might lose credibility even for other work). For my part, there have been times when my co-authors and I have thrown away five years of research. Then we had to start from scratch with a different angle. I am still convinced that this was the best decision and that we saved a lot of time by doing so. On the other hand, as we will see in the following example, the client thought they had the perfect experiment in place, and indeed had a very promising idea. However, a major flaw rendered one of the key features of the study obsolete, and thus downgraded the research plan. Although the client tried to revive the initial idea several times during our Zoom exchange, I advised taking another (less “sexy”) approach and objectively assessing the limitation.

3. **Be conservative:** Statistical models rely on assumptions, and if the mechanism is based on a theoretical model, there are even more assumptions made before arriving at the results. Therefore, even if the results are statistically significant, there may be a gap between reality and the model estimates. Assessing the limitations does not usually disqualify the author, but rather strengthens their credibility. Let me give you two examples. First, I have seen academic presentations where the author starts by “claiming” causality. Usually this early assertion was followed by the audience attacking the speaker for the rest of the talk with (sometimes very unrealistic) stories and examples of why causality might not be achieved in this case. A more conservative approach is to explain how you address the problems preventing a causal effect from being measured, while leaving the reader or audience to conclude for themselves whether this is sufficient. Second, think of someone who is systematically overconfident on every issue: if you have been around the world of statistics for a while, you know that everything is complex and that there are very rarely simple, unconditional answers, so blanket confidence quickly erodes trust.

### Illustration with a recent consulting question:

A client contacted me with a simple statistical question about a recent experiment they had conducted. To protect privacy, I will not disclose the names or the exact content of the question.

**Research question:** Does method B to measure performance yield similar results to method A? (Method A is the current state-of-the-art but very costly, while method B would be significantly cheaper.)

**Statistical hypothesis to test:** Is method B correlated with method A? (There are several ways to answer this question; for parsimony, I will focus on one approach.)

**Set-up:** The client collected eight performance measurements per participant with each method, one per day over eight days, for 27 participants. The idea was to do a pairwise comparison (instead of comparing individuals, we compare the measures within each individual).

**Key feature:** One of the main aims of this design was to see whether the new performance test was sensitive enough to capture individual variation. These were physical tests. So if you compare an athlete to a casual runner, both tests might easily detect differences, but the main purpose of the test was to measure recovery time (among other things). So smaller differences (within individuals rather than between them) should be captured.

**Question from the client:** The client asked me in this context what test they should perform. I suggested a pairwise correlation test with repeated observations (cf. Bakdash and Marusich, 2017). So the client handed me the data, and before running the test, I took a look at it using my usual method (fully described in this TDS article).

My usual method follows five steps (variable selection, sample selection, univariate analysis, bivariate analysis, and conclusion). To get straight to the point, I will skip ahead to step 4, the bivariate analysis. However, if you are also interested in the first steps, you can publicly access my Deepnote file here, which includes all of them (the first steps are at the end of the notebook).

#### 4. Bivariate analysis:

We are interested in the relationship between two variables. The initial idea would be to use a repeated-measures correlation test. The correlation coefficient measures a linear relationship, so I first wanted to take a look at the data to see whether the linearity assumption is credible.

*(Embedded code and scatterplot: method B measurements plotted against method A, points coloured by participant.)*

**Observation:** There is a positive correlation. The overall relationship seems to be well approximated by a linear trend. The relationship between individuals is clear (those who score higher with method A tend to score higher with method B), while it is not clear within individuals (each group of points of the same colour does not have a clear trend).

**Disclaimer:** This is a first quick analysis that I would do in a few minutes. Of course, I could explore different functional forms, make appropriate tests for extreme values, etc.

Since the linear assumption is plausible and the two variables are continuous, I will indeed follow my initial suggestion: a repeated-measures correlation of the matched measures.
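A minimal sketch of this repeated-measures correlation, on synthetic data in place of the client's (the participant and day counts follow the set-up above, and the within-participant centering mirrors the rmcorr coefficient of Bakdash and Marusich; their full ANCOVA formulation is equivalent for the point estimate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data mimicking the set-up: 27 participants, 8 days, two methods.
# Each participant has a stable ability level plus independent day-to-day noise.
n_subj, n_days = 27, 8
ability = rng.normal(50, 10, n_subj)
method_a = ability[:, None] + rng.normal(0, 2, (n_subj, n_days))
method_b = ability[:, None] + rng.normal(0, 2, (n_subj, n_days))

# Repeated-measures correlation: remove each participant's mean from both
# variables, then correlate the within-participant deviations.
a_within = method_a - method_a.mean(axis=1, keepdims=True)
b_within = method_b - method_b.mean(axis=1, keepdims=True)
r_within, _ = stats.pearsonr(a_within.ravel(), b_within.ravel())

# Error degrees of freedom for the rmcorr test: N*(k-1) - 1.
df = n_subj * (n_days - 1) - 1
t_stat = r_within * np.sqrt(df / (1 - r_within**2))
p_value = 2 * stats.t.sf(abs(t_stat), df)

print(f"within-participant r = {r_within:.3f}, p = {p_value:.3f}")
```

Because the simulated day-to-day noise is independent across methods, the within-participant correlation comes out near zero, which is exactly the pattern described below for the client's data.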

*(Embedded code: repeated-measures correlation test output.)*

**Observation:** As expected, the correlation is very far from being statistically significant (p-value=0.52) and is even negative (correlation=-0.048).

**Early signs of a problem:** Looking at this scatterplot, I thought that the variability within individuals was indeed low compared to the variability between subjects. This made me think of a potential problem.
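The intuition about low within-individual variability can be checked with a simple variance decomposition. A sketch on hypothetical data shaped like the study (the variance magnitudes are assumptions for illustration, not the client's numbers):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical measurements: 27 participants x 8 days, where differences
# between people are large and day-to-day fluctuations are small.
n_subj, n_days = 27, 8
scores = rng.normal(50, 10, n_subj)[:, None] + rng.normal(0, 2, (n_subj, n_days))

grand_mean = scores.mean()
subj_means = scores.mean(axis=1, keepdims=True)

# Between-subject variance: spread of participant means around the grand mean.
var_between = ((subj_means - grand_mean) ** 2).mean()
# Within-subject variance: day-to-day spread around each participant's mean.
var_within = ((scores - subj_means) ** 2).mean()

share_within = var_within / (var_between + var_within)
print(f"between = {var_between:.1f}, within = {var_within:.1f}, "
      f"within share = {share_within:.1%}")
```

When the within share is small, a within-individual correlation is essentially computed on noise, which is the problem diagnosed next.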

**Finding the issue:** On the next call with the client, I asked whether we really expected a change in performance within individuals over the eight days (even with the same method). Basically, if you measure your ability to do a task that you are used to doing every day for a week, would you expect a significant spontaneous change? We concluded that we would not. Change would be observed in a person only if something significant happened (for example, as a result of training). Indeed, in the literature on method A, this performance test captures within-individual differences only when there is actually a treatment (e.g. training, competition, etc.). Therefore, the design was missing something very important: a treatment. A treatment would allow for pre- and post-treatment observations on the same individual, and thus make it possible to verify whether the two performance tests capture the differences.

#### 5. Conclusion:

**Applying the three rules presented above:**

1. **Ideal vs. reality:** In the ideal set-up, individuals would be exposed to a treatment that allows for pre- and post-measurements. With this in mind, let’s look at the following two rules.

2. **Be realistic:** In the current set-up, exploiting within-individual differences does not work. If you remove the between-individual variation, you are left with noise. So… That’s a bummer. However, it is still possible to exploit between-individual variation: the client could already see whether both tests capture inter-subject variation. The question the client asked has not been explored in the literature, and therefore, although the study has some flaws, it is worth publishing these results. If this were the twentieth paper on the subject, it would be much harder to ‘sell’. In addition, the client had a set of important control variables (gender, age, height, weight). I therefore advised a linear regression including the control variables (with errors clustered at the individual level due to the repeated measurements). Linear regression is a very intuitive and simple approach to capture the association between the two methods conditional on the control variables.
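A sketch of the recommended regression with standard errors clustered by participant, written directly in NumPy so the cluster-robust sandwich formula is explicit (the variable names, the single `age` control, and the simulated effect sizes are illustrative assumptions, not the client's data; in practice a library such as statsmodels would do this in one call):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated long-format data: 27 participants x 8 days.
n_subj, n_days = 27, 8
n = n_subj * n_days
subject = np.repeat(np.arange(n_subj), n_days)

age = np.repeat(rng.integers(20, 60, n_subj), n_days).astype(float)
ability = np.repeat(rng.normal(50, 10, n_subj), n_days)
method_a = ability + rng.normal(0, 2, n)
method_b = ability + 0.1 * age + rng.normal(0, 2, n)

# OLS of method B on method A plus the control (intercept first).
X = np.column_stack([np.ones(n), method_a, age])
beta, *_ = np.linalg.lstsq(X, method_b, rcond=None)
resid = method_b - X @ beta

# Cluster-robust (sandwich) variance: sum the score outer products
# within each participant, so repeated measures are not treated as
# independent observations.
XtX_inv = np.linalg.inv(X.T @ X)
meat = np.zeros((X.shape[1], X.shape[1]))
for g in np.unique(subject):
    Xg, eg = X[subject == g], resid[subject == g]
    score = Xg.T @ eg
    meat += np.outer(score, score)
cluster_se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients:", np.round(beta, 3))
print("clustered SEs:", np.round(cluster_se, 3))
```

Note that this sketch omits the small-sample cluster correction; with only 27 clusters, a correction (or wild-cluster bootstrap) would be advisable in the actual analysis.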

3. **Be conservative:** Finally, my advice was to clearly explain the limitations and how to improve the research.

Three rules to deal with imperfect experimental designs was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
