For data scientists and analysts in business
There is only a one in a million chance that the accused would match the DNA found at the crime scene. So the accused is guilty beyond reasonable doubt. Sound ok? It isn’t!
Based on this evidence alone, the accused may have a good chance of being innocent and this is the logical trap described by the unintuitive but wonderfully illuminating Prosecutor’s Fallacy.
When we understand this, we start to see it everywhere in advertising, the media and people’s day-to-day decision making. It also underpins a common difficulty in correctly interpreting p-values, and getting a better intuition for this brings a lot more clarity to how we can approach hypothesis testing in business settings.
One more example to get the juices going. A medical test you have taken for a serious disease has come back positive. You feel a rush of panic as you recall that the test has an accuracy rate of 95%. And yet your real odds may be a lot better than that — perhaps a 10% chance that you actually have it. How can that be?
The first objective of this article is to give a general introduction to the Prosecutor’s Fallacy and explain it in plain language, targeting the reasons that make it difficult to intuit. If you haven’t come across it or are hazy on the concept, it’s a great tool for life whatever path you walk, and I hope you will find learning about it as enjoyable as I did.
The second part of the article will go into how the Prosecutor’s Fallacy plays into the interpretation of p-values in hypothesis testing and is targeted at the practising data professional. If you struggle to explain why a p-value is not the probability of your null hypothesis being true, but rather the probability of getting results at least as extreme as those observed given the null hypothesis, it’s worth understanding, so read on.
So the chance that the accused would match the DNA from the crime scene is one in a million. This is a highly accurate DNA test, and there has been a positive match. And yet, however compelling this evidence may seem, this alone does not incriminate the accused. Not because there is still a small amount of uncertainty in this probability (there is never absolute certainty), but because the probability of the evidence given innocence is not the same as the probability of innocence given the evidence.
To understand what this means, let’s first suppose we are looking for the offender in a city of 10 million people (plus the one offender). And there is a 100% certainty that the true offender’s DNA will match the DNA at the crime scene.
Putting these numbers in a table, we get this:
Now, we have been told that there is only a one in a million chance that an innocent person would get the DNA match. So if we were to test all 10 million people in the city, we would expect 10 of those to match.
Adding this into the table:
Now look at all the innocent people who would match the DNA in this scenario. In fact, the innocent matches outnumber the one correct guilty match ten to one! That is 10 false positives to the one true positive.
Expressed as a percentage, the probability of guilt given the evidence across the whole city is then 1 / (1 + 10) = 9.1%. On the basis of this DNA evidence alone, it now looks like the accused has a 100% – 9.1% = 90.9% chance of innocence, which is in stark contrast to the 0.0001% chance of innocence that was initially claimed by the prosecutor.
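The arithmetic above can be reproduced with a short Python sketch. The figures come straight from the example; the constant and function names are illustrative choices, and the 100% match rate for the true offender is the assumption already stated in the text.

```python
# Contingency-table sketch of the DNA example (figures taken from the text).
INNOCENT_POPULATION = 10_000_000        # innocent people in the city
GUILTY_POPULATION = 1                   # the one true offender
P_MATCH_GIVEN_INNOCENT = 1 / 1_000_000  # chance an innocent person matches
P_MATCH_GIVEN_GUILTY = 1.0              # assumed: the offender always matches


def dna_table():
    """Return expected match counts and the probability of guilt given a match."""
    innocent_matches = INNOCENT_POPULATION * P_MATCH_GIVEN_INNOCENT
    guilty_matches = GUILTY_POPULATION * P_MATCH_GIVEN_GUILTY
    p_guilty_given_match = guilty_matches / (guilty_matches + innocent_matches)
    return innocent_matches, guilty_matches, p_guilty_given_match


innocent_matches, guilty_matches, p_guilty = dna_table()
print(f"Expected innocent matches: {innocent_matches:.0f}")
print(f"Expected guilty matches:   {guilty_matches:.0f}")
print(f"P(guilty | match):   {p_guilty:.1%}")
print(f"P(innocent | match): {1 - p_guilty:.1%}")
```

Changing `INNOCENT_POPULATION` shows how strongly the answer depends on the size of the pool being searched: with 100,000 people instead of 10 million, the same match would point to guilt about 91% of the time.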
The implication of the original claim was that, since only one in a million innocent people would match the DNA and the accused matched the DNA, the probability of their guilt is 1 – 0.000001, a near certainty. But this logic is flawed, and particularly problematic when the population the probability applies to is large; the conclusion we’d draw from the data can flip dramatically, as in this example. The probability of innocence given the DNA match must not be confused with the probability of a DNA match given innocence.
So is that to say we can’t do anything with DNA evidence in criminal trials? We can, but the key here is to understand that there need to be other reasons to believe in the guilt of the accused. If we relied just on the DNA match, we would expose ourselves to spurious matches, but if we already had other suspicions about someone, then this additional evidence would be extremely compelling. We could also use DNA to narrow down suspects to help focus the investigation (as per the hunt for the Golden State killer), but it is not conclusive, and this has important parallels to the use of p-values in hypothesis testing which we will explore in the second part.
An extension to the Prosecutor’s Fallacy, worth quickly mentioning here, is the Defender’s Fallacy, in which it is incorrectly argued that a piece of evidence incriminating the defendant should be discarded because it could be matched against many others. This was used in the trial of OJ Simpson where the defending lawyers argued that the blood samples from the scene matching OJ’s (1 in 400 match rate) could also be matched against thousands of others across the LA population and so was not useful evidence. This is a misrepresentation of the Prosecutor’s Fallacy and the evidence is useful because it is considered in combination with other evidence, despite not being conclusive on its own.
It was the news you had been dreading. You are healthy but have been screened for a rare disease and the ‘95% accurate’ test you took has come back positive. You feel numb. But should you?
Here we first run into the question of what ‘accuracy’ means. In medical testing, the terms sensitivity and specificity are used to define two different aspects of accuracy:
- Sensitivity: the likelihood that people with the disease will be correctly identified as having the disease (the true positive rate)
- Specificity: the likelihood that people without the disease will be correctly identified as not having the disease (the true negative rate)
For this example we will assume the 95% accuracy applies to both the sensitivity and specificity. And we will assume the disease affects 1 in 1,000 of the population.
In a population of 10 million therefore, we would expect to see 10,000 people with the disease. Putting these numbers in a table gives us this picture:
Another way of representing this which might be a little more intuitive is using a decision tree:
With these numbers, the probability that you have the disease given you have just had a positive test (and have no other reasons to believe you have the disease) works out to be 9,500 / (9,500 + 499,500) = 1.87%.
Even if the disease were 10 times more prevalent in the population (1 in 100), your likelihood of having it after getting a positive test would go up to just 16.1%. On top of this, increase the specificity of the test to 99%, and your chance is still only about 49%. (For a bit of extra context, in Australia, COVID at-home rapid antigen test kits have to have been certified for a sensitivity of >80% and a specificity of >98%).
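The three scenarios above can be checked with a small Python sketch that fills in the contingency table from a prevalence, a sensitivity and a specificity. The function name and the scenario labels are illustrative; the third scenario assumes the higher 1-in-100 prevalence is kept when the specificity is raised to 99%.

```python
def positive_test_table(population, prevalence, sensitivity, specificity):
    """Expected true and false positives, plus P(disease | positive test)."""
    diseased = population * prevalence
    healthy = population - diseased
    true_positives = diseased * sensitivity
    false_positives = healthy * (1 - specificity)
    p_disease_given_positive = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, p_disease_given_positive


# (prevalence, sensitivity, specificity) for each scenario in the text.
scenarios = {
    "1 in 1,000 prevalence, 95% specificity": (0.001, 0.95, 0.95),
    "1 in 100 prevalence, 95% specificity": (0.01, 0.95, 0.95),
    "1 in 100 prevalence, 99% specificity": (0.01, 0.95, 0.99),
}

for label, (prevalence, sensitivity, specificity) in scenarios.items():
    tp, fp, p = positive_test_table(10_000_000, prevalence, sensitivity, specificity)
    print(f"{label}: {tp:,.0f} true positives, {fp:,.0f} false positives, "
          f"P(disease | positive) = {p:.1%}")
```

The population size cancels out of the final probability; it is only there to make the counts match the table.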
The initial 95% ‘accuracy’ measure did not give you enough information on the actual probability of having the disease because you were missing information on the prevalence of the disease in the population, and therefore the prior likelihood of you having the disease.
The probability of you getting a positive test result given you have the disease (95%) is different to the probability that you have the disease given the test result, and to make any judgement on your likelihood of having the disease, you need some sense of how likely you are to have it in the first place, regardless of the test result (great commentary from the ABC using real-world COVID scenarios here).
Prior probabilities, and the idea of updating probabilities as new information (e.g. COVID test results) comes in, are formalised in the Bayesian framework, and the famous Bayes formula provides a shortcut for working out what we have just been illustrating through contingency tables. This is a topic for another day.
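For reference only, here is a sketch of how the formula reproduces the screening result from the contingency table. The notation is introduced here for illustration and is not used elsewhere in the article: D stands for ‘has the disease’ and + for ‘tests positive’.

```latex
\[
P(D \mid +)
  = \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid \neg D)\,P(\neg D)}
  = \frac{0.95 \times 0.001}{0.95 \times 0.001 + 0.05 \times 0.999}
  \approx 0.0187
\]
```

The numerator is the true positive share and the denominator is the share of all positives, which is the same 9,500 / (9,500 + 499,500) calculation expressed as proportions rather than counts.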
From this point, I assume you have some background in hypothesis testing and have working knowledge of concepts such as the test statistic, the null hypothesis and p-values. This might be a good place to go next if not and you’re interested.
The focus of this section will be on getting a better intuition for p-values and specifically the idea that the p-value is not the probability that the null hypothesis is true; a p-value of 0.05 is not a 5% probability that the null hypothesis is true, nor a 95% probability that the alternative hypothesis is true. Rather, it is the probability of getting the result observed (or one more extreme), given the null hypothesis. This difference is difficult to intuit, let alone to help others make sense of in practice when organising tests in the real world.
Does it matter?
First of all, does this difference in wording even really matter? For all intents and purposes, isn’t it the same thing? It isn’t, and the reason comes from the same distinctions that make the Prosecutor’s Fallacy a trap with serious practical consequences.
The danger with not understanding the difference is that we may draw unfounded conclusions from our test results, much like the prosecutor made a flawed claim against the defendant’s innocence.
Concretely, the problem comes when we think it’s ok to:
- Repeat tests across multiple segments until we land a p-value under 0.05
- Split our initial test results into subsets and call out any that come up with a p-value under 0.05 as meaningful
Imagine you are a supermarket testing an ad for Coke on your website and have split your online customers so that half see the ad and half don’t in an A/B test. You carefully calculate the sample sizes you need and let the data accumulate, but at the conclusion of the test you are disappointed to see that there is no significant difference in Coke sales between the two groups.
But, you think, what if it resonated with certain groups of customers? So you go exploring different breakdowns and start to notice results with p < 0.05 here and there: customers who live in Newcastle, retirees who have a leaning towards value products, people who buy a lot of cheese…
The marketers get excited. Maybe we have something here. Yes, it makes sense that people from Newcastle would respond well to Coke ads because we are currently running a large billboard ad in the city centre there and it is already front of mind for the customers. Let’s double down and keep targeting this group online.
This is faulty reasoning akin to the prosecutor randomly picking people from the population until there is a positive DNA match and claiming guilt, even though the real chance of guilt is only 9.1%. The fact that there seems to be evidence in support of this (the billboard) is confirmation bias; if we go looking for justifications, we often find them, but just because they are plausible in hindsight doesn’t make them relevant. Be careful not to post-rationalise.
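To see how easily segment-hunting produces ‘significant’ results, here is a minimal Python sketch of the chance of at least one p-value under 0.05 when the ad has no effect in any segment. It rests on two simplifying assumptions that are not from the original example: the segments are tested independently (real customer segments usually overlap), and under the null hypothesis each p-value behaves like a uniform random draw between 0 and 1. The function names are illustrative.

```python
import random

ALPHA = 0.05


def chance_of_false_alarm(num_segments, alpha=ALPHA):
    """P(at least one p-value < alpha) across independent tests with no real effect."""
    return 1 - (1 - alpha) ** num_segments


def simulate_false_alarms(num_segments, num_experiments=20_000, alpha=ALPHA, seed=0):
    """Monte Carlo check: draw a uniform p-value per segment and count false alarms."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_experiments):
        p_values = [rng.random() for _ in range(num_segments)]
        if min(p_values) < alpha:
            hits += 1
    return hits / num_experiments


for k in (1, 5, 20, 50):
    print(f"{k:>2} segments: analytic {chance_of_false_alarm(k):.0%}, "
          f"simulated {simulate_false_alarms(k):.0%}")
```

Under these assumptions, slicing the results 20 ways gives roughly a two-in-three chance of finding at least one segment with p < 0.05 even though nothing is going on.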
To unpack this, we will use the contingency tables again. Taking again the numbers from our first example, the shaded section is essentially what p-values represent.
But really what we’re interested in is:
So why can’t we just measure this? Simply because we don’t have a way to measure the number for that bottom ‘guilty’ box when we are conducting hypothesis tests. The best we can do is essentially fill out the top row by collecting sample data and make an inference.
Transposing this to our Coke ad A/B test example we get something that might look like the following if we did indeed observe a ‘significant’ difference from the ad with a p-value of 0.05:
This gets confusing with the double negatives, but the top row is saying that we would have got the sales result we did (or one more extreme) only 5 out of 100 times if the Coke ad really made no difference to sales. It is a very surprising result, given the null hypothesis.
But crucially, what this doesn’t say is anything about whether the Coke ad actually makes a difference to sales (the true effect under the alternative hypothesis). In order to get a position on that, we would need a number in the bottom left box and this is not something we can collect data on through A/B testing or any other way.
To help us conceptualise why the 0.05 p-value bears no direct relationship with the effectiveness of the Coke ad, let’s try putting in some hypothetical numbers.
What could go in here? Let’s start with ‘1’. It’s an arbitrary number that only has meaning in relation to the ‘5’ in the box above but we’re following the same logic that we used in our earlier examples to illustrate the Prosecutor’s Fallacy.
So now, despite the p-value of 0.05, we get a probability of the Coke ad making a difference of just 1 / (1 + 5) = 17%.
Now let’s try putting in a bigger number; ‘45’, let’s say. This gives us a probability that the ad makes a difference of 45 / (45+5) = 90%. Same p-value, two starkly opposed conclusions we could draw from the data.
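The same calculation can be run for any hypothetical value in the bottom-left box. This is a minimal Python sketch; the function name is illustrative, and the extra value of 5 is added here only to show a midpoint between the two cases in the text.

```python
def share_of_real_effects(real_effect_hits, null_hits=5):
    """Fraction of 'significant' results that reflect a real effect.

    null_hits: significant results expected when the ad does nothing (the '5').
    real_effect_hits: hypothetical number placed in the bottom-left box.
    """
    return real_effect_hits / (real_effect_hits + null_hits)


for hypothetical in (1, 5, 45):
    print(f"bottom-left box = {hypothetical:>2}: "
          f"P(ad makes a difference | significant result) = "
          f"{share_of_real_effects(hypothetical):.0%}")
```

The p-value is held fixed at 0.05 throughout; only the hypothetical number changes, and the conclusion moves from 17% to 90% with it.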
But what’s the point of throwing around these hypothetical numbers that aren’t based on real, measurable data? This is the crux of this discussion and the key to the better practice of hypothesis testing in the real world. The numbers you put in here represent your prior confidence in the hypothesis.
The higher the number, the higher your confidence. And what we have demonstrated by plugging in different numbers here is that a p-value of 0.05 does not, on its own, tell us the probability of our hypothesis being true. If the test is a moonshot to begin with, you would still have cause to be sceptical even with favourable p-values. Prior conviction matters in hypothesis testing.
As touched on earlier, the Bayesian model of thinking provides an approach for handling this prior conviction mathematically, but the frequentist model of p-values and confidence intervals is the de facto standard for A/B testing in the business world today, and this trap in reasoning is too often overlooked. P-values don’t replace the need for human reasoning and we have to start with some prior conviction in the hypothesis for the test results and p-values to be useful. If ever this becomes unintuitive, think back to the prosecutor’s example.
- A p-value tells you little about your hypothesis being true. Don’t draw conclusions purely on whether a test passed a certain p-value threshold.
- A p-value is the probability of observing the results or more extreme, given the null hypothesis. This is different to the probability of the null hypothesis given the evidence. The Prosecutor’s Fallacy illustrates why this distinction matters.
- A p-value from an A/B test helps to validate our hypothesis, if we had good reason to believe in it in the first place.
- Don’t go looking for attractive p-values and reasoning backwards from them. The best thing to do if you spot an unexpectedly low p-value for a segment is to validate it with a new test specific to that segment.
When practising hypothesis testing with p-values in the business world, be clear-minded about their correct use and be prepared to explain it, but don’t feel you need to fiercely oppose every misuse of the p-value to be a good adviser.
We are working in environments where decisions need to be made fast and data is imperfect, so don’t get hung up if the decision-makers need to make decisions on their hunch from time to time based on unrigorous data; after all, confidence is part of the equation.
So even in situations where you know the right thing to do is to rerun a test, if the cost of doing so outweighs the cost of going ahead with the decision, then it might not make good practical sense. Be clear on your position but no need to fight for rigour on every decision.
The Art of Statistics describes p-values as ‘weak evidence’ (i.e. a factor in decision making but far from the only factor) and provides some great commentary on how we might layer other evidence on top to build a strong case.
In courts of law, for example, different pieces of evidence can be itemised and each given different likelihood ratios (probability of evidence if accused is guilty / probability of evidence if accused is innocent), with a possibility of multiplying them all together to quantify the strength of a case, though courts in the UK stop short of doing so as that is the role of the jury.
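As a rough illustration of how likelihood ratios combine, here is a Python sketch that reuses the DNA example from the first part. Multiplying the ratios assumes the pieces of evidence are independent of one another, and the second piece of evidence with a ratio of 400 is purely hypothetical; the function and variable names are illustrative.

```python
from math import prod


def posterior_probability(prior_odds, likelihood_ratios):
    """Posterior odds = prior odds x product of likelihood ratios, as a probability."""
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1 + posterior_odds)


# Prior odds of guilt in the DNA example: 1 offender to 10 million innocent people.
prior_odds = 1 / 10_000_000
# Likelihood ratio of the DNA match: P(match | guilty) / P(match | innocent).
dna_lr = 1.0 / (1 / 1_000_000)
# Hypothetical second, independent piece of evidence (not from the article).
other_lr = 400

print(f"DNA alone:            {posterior_probability(prior_odds, [dna_lr]):.1%}")
print(f"DNA + other evidence: {posterior_probability(prior_odds, [dna_lr, other_lr]):.1%}")
```

With the DNA match alone this returns the same 9.1% as the contingency table; layering a second independent piece of evidence on top is what moves the case towards being compelling.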
The chapters on Answering Questions and Claiming Discoveries and Learning from Experience the Bayesian Way are full of great real-world examples such as this and the different schools of thought for weighing up evidence.