When working with data and many different variables, it is easy to assume that one variable or value matters more than the others. We may assume that a specific variable or data point had the most impact on the output, but how sure are we that the other variables did not have an equal impact?
In statistics, the base rate can be seen as the probabilities of classes that are unconditional on “featural evidence”, that is, how common each class is before any case-specific evidence is considered. You can think of the base rate as your prior probability.
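As a minimal sketch of this definition, the base rate of each class can be estimated as its relative frequency in a set of labels, before looking at any features. The function name `base_rates` and the label values are illustrative assumptions, not part of any library.

```python
from collections import Counter


def base_rates(labels):
    """Return the unconditional probability of each class in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical labels: 1 = positive class, 0 = negative class.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rates = base_rates(labels)
print(rates)
```

Here the positive class has a base rate of 1 in 10, which is the prior you would hold about any new case before seeing its features.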
Base rates are important tools in research. For example, suppose we are a pharmaceutical company developing and distributing a new vaccine, and we want to look into the success of the treatment. We have 4000 people who are willing to take it, and our base rate of success is 1/25.
This means that only 160 of the 4000 people (4000 × 1/25) will be successfully cured by the treatment. In the pharmaceutical world, this is a very low success rate. This is how base rates can be used to improve research and accuracy, and to ensure that the product will perform well.
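The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation described above; the variable names are illustrative.

```python
from fractions import Fraction

participants = 4000          # people willing to take the treatment
base_rate = Fraction(1, 25)  # base rate of success from the example

# Expected number of successes = participants x base rate.
expected_successes = participants * base_rate
print(expected_successes)    # 160
print(float(base_rate))      # 0.04, i.e. a 4% success rate
```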
Splitting the term up gives us a better understanding. A fallacy is a mistaken belief or a piece of faulty reasoning. We can now combine that with our definition of the base rate above.
The base rate fallacy, also known as base rate bias or base rate neglect, is the tendency to judge the likelihood of a specific situation without taking all relevant data into consideration.
The base rate fallacy arises when both base rate information and other, more specific information are available, but the base rate is given too little weight. This can happen for various reasons, such as not examining and analyzing the data thoroughly, or favouring a specific part of the data.
In other words, the base rate fallacy describes the tendency to disregard existing base rate information in favour of new, case-specific information. This goes against the fundamental rules of evidence-based reasoning.
You will typically hear about this happening in the financial industry. For example, investors may base their buying or selling tactics on irrational information, which leads to fluctuations in the market, despite knowing the base rate.
So now we have a better understanding of the base rate and the base rate fallacy. What is their relevance and impact in Data Science?
We’ve spoken about ‘probabilities of classes’ and ‘taking into consideration all relevant data’. If you are a data scientist or machine learning engineer, or are just getting your foot in the door, you will know how important probabilities and relevant data are to the learning process of your machine learning model and to producing accurate, high-performance models.
To analyse data and make predictions about it, or for your machine learning model to produce accurate outputs, you need to take all of the relevant data into consideration. When you scan through your data for the first time, you might consider some parts relevant and other parts irrelevant. However, this is only your judgement, and it is not yet factual until proper analysis has taken place.
As mentioned above, the initial base rate helps you ensure accuracy and produce high-performance models. So how can we do this in Data Science?
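A confusion matrix is a performance measurement that provides a summary of prediction results on a classification problem. Every entry in a confusion matrix is based on one of four outcomes, formed by combining True or False with Positive or Negative.
The confusion matrix represents our model’s predictions during the testing phase. The false negatives and false positives in the confusion matrix are where the base rate fallacy tends to show up: when a class is rare, even a small false-positive rate can produce more false positives than true positives, so judging a positive prediction without the base rate is misleading.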
- True Positive (TP) – your model predicted positive and it’s positive
- True Negative (TN) – your model predicted negative and it’s negative
- False Positive (FP) – your model predicted positive and it’s negative
- False Negative (FN) – your model predicted negative and it’s positive
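The four outcomes above can be counted directly from a list of actual labels and a list of predicted labels. The following is a minimal sketch in plain Python; the function name `confusion_counts` and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted positive and it's positive
        elif predicted != positive and actual != positive:
            tn += 1  # predicted negative and it's negative
        elif predicted == positive:
            fp += 1  # predicted positive and it's negative
        else:
            fn += 1  # predicted negative and it's positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical test-set labels and model predictions.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
```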
Five different metrics can be calculated from a confusion matrix to help us measure the validity of our model:
- Misclassification = (FP + FN) / (TP + TN + FP + FN)
- Precision = TP / (TP + FP)
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Specificity = TN / (TN + FP)
- Sensitivity aka Recall = TP / (TP + FN)
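As a rough sketch, these five metrics can be computed from the four counts as follows. The function name `confusion_metrics` is illustrative, and the counts are hypothetical: they assume 4000 cases with a base rate of 1/25 (160 positives), and a model that catches 90% of the positives while correctly rejecting 95% of the negatives.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute five metrics from confusion matrix counts.

    Assumes every denominator is non-zero; a production version
    would need to handle empty classes explicitly.
    """
    total = tp + tn + fp + fn
    return {
        "misclassification": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
    }


# Hypothetical counts: 160 actual positives, 3840 actual negatives.
metrics = confusion_metrics(tp=144, tn=3648, fp=192, fn=16)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, accuracy is about 0.948 while precision is only about 0.429. Because positives are rare, most positive predictions are false positives, which is exactly what gets missed when the base rate is ignored.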
To better understand a confusion matrix, it helps to look at a visualisation:
Image by Author
As you’re going through this article, you can probably think of a variety of causes of base rate fallacy, such as not taking all the relevant data into consideration, human error, or lack of precision.
Although these are all true and contribute to the base rate fallacy, they all relate to the biggest problem: ignoring the base rate information in the first place. Base rate information is often ignored because it is considered irrelevant. However, it can save people a lot of time and money. Using the base rate information available allows you to be more precise when estimating the probability that a given event will occur.
Using the base rate information will help you avoid base rate fallacy.
Being aware of the sources of the fallacy, such as personal opinions and automatic thought processes, will allow you to combat the base rate fallacy and reduce potential errors. When you are measuring the probability of a certain event occurring, Bayesian methods can help reduce the base rate fallacy, because they explicitly combine the base rate (the prior) with the new evidence.
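To illustrate how a Bayesian calculation brings the base rate back in, here is a minimal sketch of Bayes' theorem applied to a positive test result. The function name `posterior_positive` and all the numbers are hypothetical assumptions chosen for illustration.

```python
def posterior_positive(base_rate, sensitivity, specificity):
    """P(condition | positive result) via Bayes' theorem."""
    # P(positive result and condition present)
    true_pos = sensitivity * base_rate
    # P(positive result and condition absent)
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)


# Hypothetical: 1% base rate, test is 90% sensitive and 90% specific.
p = posterior_positive(base_rate=0.01, sensitivity=0.9, specificity=0.9)
print(round(p, 3))  # 0.083
```

Even though the assumed test is right 90% of the time, a positive result implies only about an 8% chance of actually having the condition, because the base rate is so low. Ignoring the 1% prior and answering "90%" is the base rate fallacy.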
The base rate is important in data science as it equips you with a base understanding of how to assess your study or project, and fine-tune your model – providing an overall increase in accuracy and performance.
If you would like to watch a video about base rate fallacy in the medical field, check out this video: Medical Test Paradox
Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.