Exploring simple flaws that can hamper model performance
Modeling is arguably the most fun part of a machine learning task.
It’s where you get to use the data that you’ve painstakingly processed and transformed to build a model that can generate predictions.
Although model building requires less time than data preprocessing and feature engineering, it is easy to make seemingly harmless decisions that lead to suboptimal model performance.
Here, we explore a few common mistakes that are detrimental to a user’s efforts to train and tune a reliable model.
A typical example
Compared to data preprocessing and feature engineering, the modeling phase is relatively easy to execute.
As an example, let’s work with a toy dataset from the scikit-learn package.
Suppose that we are looking to create a gradient boosting classifier that can generate predictions with this data. The code used to train and tune the model could look something like this:
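Since the original snippet isn’t reproduced here, the following is a minimal sketch of what it might look like; the breast cancer toy dataset and the exact grid values are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Load a toy dataset bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Search over a (deliberately narrow) grid of hyperparameter values
param_grid = {"learning_rate": [0.1, 1, 10]}
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42), param_grid, cv=5
)
grid.fit(X_train, y_train)

# Generate predictions with the best-performing model
y_pred = grid.predict(X_test)
```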
With just this snippet alone, we can:
- Test models with different hyperparameter values
- Implement a cross-validation splitting strategy to avoid overfitting
- Generate predictions with the best-performing model
So much work is done with so little code. Such is the power of the scikit-learn package.
At face value, this approach seems faultless, but it contains a few flaws that can hamper model performance. Let’s cover the mistakes committed here, which tend to be common in machine learning tasks.
Mistake #1: Not using a baseline model
In the modeling phase, many are eager to jump straight into using algorithms that invite more complexity. After all, complexity is often associated with effectiveness.
Unfortunately, more is not necessarily better. Complex models often perform no better than simpler algorithms, and sometimes worse.
It’s possible that the underlying assumptions of the algorithm of interest don’t suit the data. Perhaps the dataset used to train the model doesn’t have much predictive power to begin with.
To account for such situations, it is important to be able to contextualize the results of a tuned model. Users can properly evaluate the performance of their complex models by establishing a baseline model.
A baseline model is a simple model that serves as an indicator of whether a complex model-building approach is actually reaping benefits.
Overall, it may be tempting to immediately build sophisticated models right after preparing the data, but the first model you build should always be the baseline model.
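As a quick sketch, scikit-learn even ships a ready-made baseline in its `DummyClassifier`; the dataset choice below is an assumption for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Predict the most frequent class every time; any model worth
# keeping should comfortably beat this score
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
```

Any tuned model that fails to clear `baseline_acc` is adding complexity without adding value.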
Mistake #2: Using a small range of hyperparameters
A common flaw in hyperparameter tuning is to only explore a small range of hyperparameter values.
Hyperparameter tuning should enable users to identify the hyperparameter values that yield the best results. However, this process can only serve its purpose when the range of values being tested is large enough.
How can a tuning procedure identify the optimal hyperparameter values if the values aren’t even in the search space?
In the previous example, the search space for the learning_rate hyperparameter only comprises 3 values between 0.1 and 10. If the optimal value lies outside this range, it won’t be detected during the tuning process.
For many machine learning algorithms, model performance is heavily dependent on certain hyperparameters. Thus, tuning models with a small range of values is not advisable.
Users may opt to use a smaller range of hyperparameter values since a greater number of values will require more computation and will incur greater run time.
For such cases, instead of using a smaller range of values, it would be preferable to switch to a hyperparameter tuning approach that needs less time and computation to execute.
The grid search is a valid hyperparameter tuning method, but there are other suitable alternatives.
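One such alternative is scikit-learn’s `RandomizedSearchCV`, which samples a fixed number of candidates from a distribution instead of exhaustively evaluating every combination. A sketch, where the distribution bounds and iteration count are illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Sample 10 candidates from a wide, log-uniform range rather than
# exhaustively evaluating a coarse, hand-picked grid
param_distributions = {"learning_rate": loguniform(1e-3, 10)}
search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    param_distributions,
    n_iter=10,
    cv=3,
    random_state=42,
)
search.fit(X, y)
```

The cost is fixed by `n_iter`, so widening the search range no longer inflates the run time the way it does with an exhaustive grid.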
Mistake #3: Using the wrong evaluation metric
It’s always nice to see a model that scores highly. Unfortunately, a model that is trained based on the wrong evaluation metric is useless.
It is not uncommon for users to perform a grid search while leaving the scoring parameter at its default value. The default scoring metric in the grid search is accuracy, which certainly isn’t ideal for many cases.
For example, the accuracy metric is a poor evaluation metric for imbalanced datasets. Precision, recall, or the f1-score can be more appropriate.
That being said, even a robust metric like the f1-score may not be ideal. The f1-score weighs precision and recall equally. However, in many applications, a false negative can be much more harmful than a false positive, or vice versa.
For such instances, users would benefit from creating their own custom evaluation metric tailored to the business case. This way, users can tune their models to achieve the desired type of performance.
Rectifying the previous mistakes
Here’s a quick example of what addressing these mistakes with code could look like.
First, we can create a baseline model that will be used to gauge the trained model’s performance. A simple K-nearest neighbors model with default parameters will suffice.
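A sketch of that baseline, again assuming the breast cancer toy dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: K-nearest neighbors with default hyperparameters
baseline = KNeighborsClassifier()
baseline.fit(X_train, y_train)
baseline_score = baseline.score(X_test, y_test)
```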
Now we can move on to training and tuning the gradient boosting classifier with the GridSearchCV object. This time, we will consider a greater range of values for the learning_rate hyperparameter.
Moreover, instead of using accuracy as the evaluation metric, suppose we want to consider both precision and recall while placing a greater weight on recall (i.e., penalizing false negatives more).
A solution to this would be to create a custom metric with the make_scorer wrapper in scikit-learn, which enables us to use the f-beta score (beta=2) as the evaluation metric in the grid search.
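A minimal sketch of that wrapper:

```python
from sklearn.metrics import fbeta_score, make_scorer

# beta=2 weighs recall more heavily than precision in the F-beta
# score, so false negatives are penalized more than false positives
f2_scorer = make_scorer(fbeta_score, beta=2)
```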
With the improved hyperparameter search space and evaluation metric, we can now carry out the grid search and tune the gradient boosting classifier.
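Putting the pieces together, the corrected search might look like the following sketch; the widened grid values and the dataset remain assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Custom scorer: F-beta with beta=2 favors recall over precision
f2_scorer = make_scorer(fbeta_score, beta=2)

# A much wider search space than the original 3-value grid
# (n_estimators kept small here just to keep the sketch fast)
param_grid = {"learning_rate": [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=5,
    scoring=f2_scorer,
)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
```

The resulting model can then be compared against the K-nearest neighbors baseline to confirm the extra complexity is actually paying off.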
The modeling phase of a machine learning task is much less time-consuming than the preprocessing and feature engineering phases, but it is still easy to make mistakes in this phase that may hamper overall model performance.
Thankfully, Python’s powerful machine learning frameworks do most of the heavy lifting. As long as you contextualize the results of your tuned models with a baseline, consider a wide range of hyperparameter values, and use evaluation metrics that fit the application, your model is much more likely to yield satisfactory results.
I wish you the best of luck in your data science endeavors!