Knowing these common logic traps will help you avoid mistakes in your analysis.
In a recent article, I defined data literacy by starting with the general definition of literacy and adapting it to the data world.
This article follows a similar thread — it was inspired by the logical fallacies I learned in high school. If there are logical errors you can make in your argumentative reasoning, then there are also logical errors you can make in your data analysis and statistical reasoning.
A blog post by Geckoboard was a helpful starting point for my research.
From there, I dove into the fallacies I have the most experience with. The six I picked for this article are common and easy to fall into. So keep reading to learn more about the logical traps that await you when working with data.
To add some color to the explanations of each fallacy, I included my own experiences and pulled in some interesting examples I found from various sources online.
Before you can perform analysis, you need data! And if you have to collect the data yourself, here are some fallacies to avoid.
Observer Effect (a.k.a. Hawthorne Effect)
Do you perform differently when you are being watched? I know I do. And if researchers aren’t careful, this human tendency can affect their data. The observer effect happens when the presence of an observer, or the knowledge of being observed, affects the data collected.
I used to think about this often as an Industrial Engineering intern because I was asked to collect time study data on the manufacturing lines. I was hyper-aware that if workers knew I was timing them, they might perform differently than usual (even if I made it clear that my measurements were not intended to evaluate their performance in any way).
Here’s another example in a manufacturing setting:
“The Western Electric Company was the sole supplier of telephone equipment to AT&T at the time and the Hawthorne plant was a state-of-the-art plant that employed about 35,000 people. The experiments were intended to study the effects lighting levels had on output. The hypothesis evolved and groups of workers were studied to see if different lighting levels, levels of cleanliness or different placement of workstations affected output.
The major finding was that no matter what change the workers were exposed to, output improved. But, production went back to normal at the end of the study. This suggested watched employees worked harder.” 
Sampling Bias
Sampling bias occurs when the sample you collect data from is not representative of the population you want to draw conclusions about. It’s not always easy to get a representative sample, because it can cost extra time and money. But if the data will be used to make decisions that affect the lives of people outside your sample, then you have to avoid this error.
Knowing about this logical fallacy makes me look back on my early college days and cringe. Sometimes we had to collect data and draw conclusions, and I would just survey my friends. If I was trying to draw a conclusion about the general American population, or even the general college population, my sample was not representative of the whole. (Good thing it was just for college credit and not a new product or policy.)
Here’s an example of how a surveying technique could fall short:
“For example, a ‘man on the street’ interview which selects people who walk by a certain location is going to have an overrepresentation of healthy individuals who are more likely to be out of the home than individuals with a chronic illness. This may be an extreme form of biased sampling, because certain members of the population are totally excluded from the sample (that is, they have zero probability of being selected).”
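To see how much a convenience sample can distort an estimate, here is a minimal Python sketch of the “man on the street” scenario. Every number in it is an assumption chosen for illustration (a 20% illness rate, and a 5% versus 50% chance of walking past the interviewer), not real survey data.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: True means the person has a chronic illness.
# The 20% rate is an assumed value, not a real statistic.
population = [rng.random() < 0.20 for _ in range(100_000)]


def is_on_street(has_illness):
    """Assumed chance of walking past the interviewer: 5% if ill, 50% otherwise."""
    return rng.random() < (0.05 if has_illness else 0.50)


# "Man on the street" sample: only the people who happen to walk by.
street_sample = [ill for ill in population if is_on_street(ill)]

# Simple random sample: every person has the same chance of being selected.
random_sample = rng.sample(population, 1_000)

true_rate = sum(population) / len(population)
street_rate = sum(street_sample) / len(street_sample)
random_rate = sum(random_sample) / len(random_sample)

print(f"True illness rate:     {true_rate:.1%}")
print(f"Street-interview rate: {street_rate:.1%}")
print(f"Random-sample rate:    {random_rate:.1%}")
```

Under these assumptions, the street interview should report an illness rate far below the true one, while the simple random sample should land close to it.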
Survivorship Bias
Some people are lucky: they survive tough situations where the odds are stacked against them, such as natural disasters, economic downturns, or risky business ventures. After they get through it, they may look back and think that success is more common than it really is, simply because they were successful. Or, people looking from the outside may only hear the success stories and not the countless failures or tragedies.
This is survivorship bias: when the group that succeeds is mistaken for the whole group. Since the surviving/successful group is more visible, people begin to think that it is really the whole group (the population).
Here’s an example of how this bias can affect the data we encounter in school:
“Students in business school can recall how unicorn start-ups were commonly applauded within the classroom, serving as an example of what students should strive for — an archetypal symbol of success. Even though Forbes reported that 90% of start-ups fail, entire degrees are dedicated to entrepreneurship, with dozens of students claiming that they will one day found a start-up and become successful.” 
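The gap between “the survivors” and “everyone” is easy to simulate. The sketch below is a toy model, not real start-up data: the return distribution and its parameters are assumptions picked so that most hypothetical ventures lose money.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed for reproducibility

# Hypothetical ventures: each ends with a return multiple on the money invested.
# The log-normal parameters are assumptions, chosen so most ventures lose money.
returns = [rng.lognormvariate(-1.0, 1.0) for _ in range(10_000)]

# "Survivors" are the ventures that at least broke even (multiple >= 1).
survivors = [r for r in returns if r >= 1.0]

survival_rate = len(survivors) / len(returns)
mean_all = statistics.mean(returns)
mean_survivors = statistics.mean(survivors)

print(f"Share of ventures that survive: {survival_rate:.1%}")
print(f"Average multiple, all ventures: {mean_all:.2f}")
print(f"Average multiple, survivors:    {mean_survivors:.2f}")
```

If you only ever hear from the survivors, the average outcome looks profitable, even though the average across all ventures in this toy model is a loss.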
Hopefully, you completed your data gathering without any errors. You still need to be careful as you interpret the results from your analysis!
Cherry-Picking
You may have heard the phrase “cherry-picking data.” It refers to selecting only the data points or results that support your argument and conveniently leaving out the data that supports the counterargument.
When I think of this term, I think of politicians using data. You can put a strongly positive or negative spin on a study just by selecting only part of its findings to include in a speech. Presenting data this way serves the speaker’s interests, and unfortunately, it is relatively common (on all sides of the political spectrum).
You will also see cherry-picking in the media:
“For example, consider a situation where a new study, which is based on the input of thousands of scientists in a certain field, finds that 99% of them agree with the consensus position on a certain phenomenon, and only 1% of them disagree with it. When reporting on this study, a reporter who engages in cherry picking might say the following:
‘A recent study found that there are plenty of scientists who disagree with the consensus position on this phenomenon.’
This statement represents an example of cherry picking, because it only mentions the fact that the study found that some scientists disagree with the consensus position on the phenomenon in question, while ignoring the fact that the study in question also found that the vast majority of scientists support this position.” 
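The same selective framing works with numbers. As a small illustration, the Python sketch below uses made-up monthly values for a hypothetical metric and compares the change over the full year with the change over a hand-picked window.

```python
# Hypothetical monthly values of some metric (made-up numbers for illustration).
monthly_values = [100, 104, 109, 107, 103, 108, 113, 118, 116, 121, 125, 130]


def percent_change(values, start, end):
    """Percent change from values[start] to values[end] (0-based indices)."""
    return (values[end] - values[start]) / values[start] * 100


# Honest summary: first month to last month.
full_period = percent_change(monthly_values, 0, len(monthly_values) - 1)

# Cherry-picked summary: only months 3 to 5, where the metric happened to dip.
cherry_picked = percent_change(monthly_values, 2, 4)

print(f"Full year:  {full_period:+.1f}%")    # +30.0%
print(f"Months 3-5: {cherry_picked:+.1f}%")  # -5.5%
```

Both numbers are “true,” but quoting only the second one would suggest a decline in a year that actually ended 30% higher.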
Gambler’s Fallacy
“Wow, he’s rolled three fours in a row now. There’s no way the next roll will be a four!”
Have you ever heard someone say something like this in a game? It makes intuitive sense at first, but when you look at it logically, you have to recognize that each die roll is an independent event (statistically speaking). Previous rolls have no effect on the probability of the next roll: with a fair six-sided die, the chance of a four is still 1 in 6.
This example takes us to the gambler’s fallacy: the “belief that the probability of a random event occurring in the future is influenced by previous instances of that type of event.” 
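You can check the independence claim with a quick simulation. The sketch below assumes a fair six-sided die, rolls it many times, and looks only at the rolls that come right after three fours in a row. If the streak mattered, the share of fours among those rolls would differ noticeably from 1/6.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility

# Simulate many rolls of a fair six-sided die.
rolls = [rng.randint(1, 6) for _ in range(600_000)]

# Collect every roll that comes immediately after three fours in a row.
next_rolls = [
    rolls[i]
    for i in range(3, len(rolls))
    if rolls[i - 3] == rolls[i - 2] == rolls[i - 1] == 4
]

p_four_overall = rolls.count(4) / len(rolls)
p_four_after_streak = next_rolls.count(4) / len(next_rolls)

print(f"Streaks of three fours found: {len(next_rolls)}")
print(f"P(four) overall:              {p_four_overall:.3f}")
print(f"P(four) after three fours:    {p_four_after_streak:.3f}")
```

Both estimates should come out close to 1/6 (about 0.167), up to sampling noise. The die has no memory of the streak.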
Aside from actual gambling, this fallacy can be seen in other applications where historical data is heavily relied on, like in financial analysis:
“Gambler’s fallacy has been shown to affect financial analysis. Investors tend to hold onto stocks that have depreciated and sell stocks that have appreciated. For instance, they may see the continual rise of a stock’s value as an indication that it will soon crash, therefore deciding to sell. Gambler’s fallacy may be at work here, as investors are making decisions based on the probability of a fairly random event (the stock’s price) based on the history of similar past events (the trend in its previous price points). The two are not necessarily related. Its past price trajectory in itself does not determine its future trajectory.”
Many factors determine a stock’s price, so reducing the data to the price history and basing buy and sell decisions on that alone falls into the realm of the gambler’s fallacy.
False Causality
Just because two variables are correlated does not mean one caused the other. Correlation does not equal causation! The false causality fallacy occurs when you assume that one variable’s trend caused the other variable’s trend without looking at other possible factors and causes.
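One common source of misleading correlations is a third variable that drives both of the others. The Python sketch below is a toy model with invented numbers: a hypothetical daily temperature pushes up both ice cream sales and sunburn counts, and neither one affects the other. Because the data are simulated, the true effect of temperature is known and can be subtracted out directly.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical confounder: daily temperature in degrees Celsius.
temperature = [rng.uniform(10, 35) for _ in range(2_000)]

# Both variables depend on temperature plus independent noise (assumed model).
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburns = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw_corr = pearson(ice_cream, sunburns)

# Subtract the known temperature effect from each variable, then re-check.
ice_cream_resid = [i - 2.0 * t for i, t in zip(ice_cream, temperature)]
sunburns_resid = [s - 0.5 * t for s, t in zip(sunburns, temperature)]
resid_corr = pearson(ice_cream_resid, sunburns_resid)

print(f"Correlation, raw:                     {raw_corr:.2f}")
print(f"Correlation, temperature removed:     {resid_corr:.2f}")
```

In this toy model, the raw correlation should be strong, and it should fall to roughly zero once the temperature effect is removed. With real data you rarely know the confounder’s effect exactly, so you would have to estimate it, for example with a regression.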
For some weird examples, check out Tyler Vigen’s Spurious Correlations website, which charts pairs of unrelated variables that happen to move together. Hopefully, none of you would say that one of these variables caused the other. Nick Cage doesn’t deserve that.
If you have ever fallen into one of these data fallacies, you are not alone! This is why it can be constructive to have people look over your data analysis, whether it’s for work or school, to point out the blind spots you might have in your methodology or reasoning.
Let’s all work together to improve our data literacy skills and stay vigilant against data fallacies!