Essential Knowledge for Data Scientists
Today we have access to a continuous flow of data from all over the place. Classification models are one of the most popular machine learning tools for finding patterns among data and making sense of it so that we can reveal relevant insights for decision-making. They are a form of supervised learning in which we train a model to group data points based on predetermined characteristics. In return, the model outputs the likelihood or probability for a data point to belong to a specific category.
Use cases are endless and widely spread across industries — speech recognition, spam detection, anomaly/fraud detection, customer churn prediction, client segmentation, and credit-worthiness assessment.
Therefore, as a Data Scientist, it is essential to master the art of classification models.
In this article, we will be focusing on one of the last steps of creating a model in Data Science: assessing the model performance or, in other words, evaluating how good or bad the classification is.
What’s better than a good story to explain the metrics and how to use them?
Let’s say you are the Head of the Antenatal Department at the hospital of your city. Your ambition is to offer the most positive experience possible to future parents. In that regard, you have hired the best Doctor and built up a dream team of nurses and midwives to support him.
The Doctor is incredibly busy and has no time to examine every patient to confirm their pregnancy. So he relies on different blood markers for a first diagnosis. The role of the nurses is to visit the patients to verify his predictions. Here, 4 cases are possible:
- The Doctor says the patient is pregnant, and the nurses confirm it
→ True Positive (TP)
- The Doctor says the patient is pregnant, but the nurses invalidate it
→ False Positive (FP)
- The Doctor says the patient is not pregnant, and the nurses confirm it
→ True Negative (TN)
- The Doctor says the patient is not pregnant, but the nurses invalidate it
→ False Negative (FN)
As Head of the Department, you are focused on offering the best quality of services, so you want to evaluate how good the Doctor is at identifying early pregnancies. For that purpose, you can use 5 key metrics:
Accuracy is perhaps the most common metric because it is relatively intuitive to understand. It is the ratio of correct predictions divided by the total number of predictions.
(TP+TN) / (TP + FP + TN + FN)
In other words, accuracy will tell us how good the Doctor is at categorizing patients.
An accuracy of 50% means that the model is as good as flipping a coin.
In general, and depending on the field of application, we aim for accuracy above 90%, 95%, or 99%.
Remember: Some say we have a good model if we have high accuracy. That is true ONLY IF your dataset is balanced — meaning that the classes are relatively homogeneous in size.
Suppose you have many more patients that are NOT pregnant among the group of patients (i.e., pregnant patients are in the minority). In that case, we say that the sample is imbalanced, and accuracy is NOT the best metric for evaluating performance.
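A quick sketch in Python makes the pitfall concrete. The patient counts below are hypothetical, chosen to mimic an imbalanced ward; `accuracy_score` is scikit-learn's standard accuracy helper:

```python
from sklearn.metrics import accuracy_score

# Hypothetical ward of 100 patients, only 10 of whom are actually pregnant
y_true = [1] * 10 + [0] * 90
# A "lazy doctor" who simply declares every patient not pregnant
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.9 -- 90% accuracy, yet not a single pregnancy was detected
```

Despite never detecting a pregnancy, the model scores 90%, which is exactly why accuracy misleads on imbalanced samples.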
Precision is the number of positive elements appropriately predicted divided by the total number of positive elements predicted. So it’s a measure of exactness and quality — it tells us how good the Doctor is at predicting pregnancy.
TP / (TP + FP)
Depending on the model application, having high precision can be critical. Always evaluate the risk of being wrong to decide whether the precision value is good enough.
If the Doctor announces a pregnancy (and he is wrong), this could impact the patients because they might make life-changing decisions regarding the big news (e.g., buying a new house or changing cars).
Recall — a.k.a. sensitivity, true positive rate, or hit rate — is the number of positive elements correctly predicted divided by the actual number of positives. It tells us how good the Doctor is at detecting pregnancies.
TP / (TP + FN)
Similarly to precision, depending on the model application, having high recall can be critical. Sometimes, we cannot afford to miss a positive case (fraud, cancer detection).
Suppose the Doctor misses a case and does not predict a pregnancy. The patient might keep some unhealthy habits like smoking or drinking while pregnant.
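Recall follows the same pattern, this time penalizing missed cases. The counts are again hypothetical:

```python
# Hypothetical counts from the nurses' follow-up visits
tp = 8   # pregnancies the Doctor detected (True Positives)
fn = 2   # pregnancies the Doctor missed (False Negatives)

recall = tp / (tp + fn)
print(recall)  # 0.8 -- the Doctor detects 80% of the actual pregnancies
```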
Specificity — a.k.a. selectivity or true negative rate — is the proportion of actual negatives that are correctly predicted as negative. Its complement (1 − specificity, the false positive rate) can be understood as a false alarm indicator.
TN / (TN + FP)
Ideally, a model should have both high specificity and high recall, but there is a tradeoff between the two: every classifier must pick a decision threshold, and moving it improves one at the expense of the other.
For the reasons mentioned above, we don’t want to miss a pregnancy case. At the same time, we also don’t want to alert a patient if we do not have reliable blood markers to confirm the pregnancy. As Head of the Department, you need to decide the cut-off point at which the Doctor’s prognosis is not good enough and a medical examination by the nurses is required.
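Specificity is computed the same way from the negative-class counts. The numbers below are hypothetical:

```python
# Hypothetical counts for the non-pregnant patients
tn = 85  # patients correctly cleared (True Negatives)
fp = 5   # false alarms (False Positives)

specificity = tn / (tn + fp)
print(round(specificity, 3))  # 0.944 -- few false alarms among actual negatives
```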
F-measure | F1 score
The F1 score is the harmonic mean of precision and recall. It reflects the overall effectiveness of a model — how performant the Doctor is when missing a case (FN) and wrongly announcing a pregnancy (FP) are equally risky.
2 x (Precision x Recall) / (Precision + Recall)
When the sample is imbalanced and accuracy becomes inappropriate, we can use the F1 score to evaluate the performance of a model.
If there are many more patients NOT pregnant than pregnant patients, we consider the sample imbalanced and will use the F1 score to evaluate the Doctor’s performance.
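The F1 formula is a one-liner. The precision and recall values below are the hypothetical ones from the earlier counts:

```python
# Hypothetical precision and recall for the Doctor
precision, recall = 0.8, 0.8

# Harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.8
```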
Now that we have set up the critical metrics for evaluating classification models, we can look closer at how to visualize them in a compelling way.
The confusion matrix or error matrix is a simple table compiling the prediction results from a classification model: True Positive, True Negative, False Positive & False Negative. It helps visualize the types of errors made by the classifier by breaking down the number of correct and incorrect predictions for each class.
The confusion matrix highlights where the model is confused when it makes predictions.
Therefore, it is a more useful visualization than accuracy alone because it shows where the model is weak and offers the possibility of improving it.
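With scikit-learn, the matrix is one call away. The labels below are hypothetical (1 = pregnant, 0 = not pregnant):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth (nurses) and predictions (Doctor)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1]
           #  [1 3]]
```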
ROC curve & precision-recall curve
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) across decision thresholds. It should be used when the number of patients is roughly equal for each class (i.e., a balanced dataset), while the precision-recall curve should be preferred for imbalanced cases.
A good model is represented by a curve that rises quickly toward the top-left corner: it achieves a high true positive rate while keeping the false positive rate low.
A lousy model — a.k.a. a no-skill classifier — cannot discriminate between the classes and therefore predicts a random or constant class in all cases. Such models are represented by the diagonal line from the bottom left of the plot to the top right.
So you probably get now that the shape of the curve gives us precious information to debug a model:
- The closer the curve stays to the diagonal, the weaker the model; the more it bends toward the top-left corner, the higher the true positives and the lower the false positives.
- If the curve is not smooth, it suggests the model is not stable.
These curves are also great tools for comparing models and choosing the best ones.
The area under the curve (AUC or AUROC) is often used to summarize the model skill. It typically ranges from 0.5 (no-skill classifier) to 1 (perfect model).
The higher the AUROC, the better the model.
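Computing the ROC curve and its AUC takes two scikit-learn calls. The probability scores below are hypothetical model outputs for four patients:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical predicted pregnancy probabilities for four patients
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve, one per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75 -- better than the 0.5 of a no-skill classifier
```

Plotting `fpr` against `tpr` gives the ROC curve itself; the single AUC number is what you would use to compare several candidate models.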