Explore how a text-based model’s performance could be improved by incorporating simple data and/or NLP techniques coupled with hyperparameter tuning for ANNs using KerasTuner
In my previous article on Multiclass Text Classification Using Keras to Predict Emotions, I compared a deep text classifier given a head start with word embeddings learnt via skip-grams against another deep text classifier that learnt its embeddings from scratch. The classifier that learnt the embeddings from scratch was slightly better, and it is the one I will consider as my baseline in this article. In this revisit to the same dataset, which is publicly available on Kaggle here and on Hugging Face datasets here, I will try to improve the model that yielded the better weighted-avg recall (81% on the test data).
In the image below, the performance of the models on the test set is reported. Here, Model 1 refers to the model trained with word2vec embeddings, and Model 2 refers to the model trained without word2vec embeddings. Recall that RECALL 😏was important in this use-case, so it was concluded that the second model performed marginally better. Therefore, Model 2 is our baseline for this article.
- Appending more data which eventually gives an ML model more examples to learn and generalize from.
- Feature engineering for extracting helpful information from the given data that allows the model to easily and efficiently find patterns for predictions… and feature selection for treating the garbage-in-garbage-out (GIGO) problem. Basically, this lets models work with only a few useful features, removes noise, and saves computation time and resources.
- Try Multiple Algorithms to find the best-suited one for the predictions.
- Use Cross-Validation for a robust and well-generalized model. Using cross-validation, you can train and test a model’s performance on multiple chunks of the dataset, average the performances, and figure out whether a model is at its best or not.
- Tune the Hyperparameters to identify the best combination suited for the dataset since they have a pivotal influence on the outcome of the model training process.
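The cross-validation idea above can be sketched in a few lines with scikit-learn. The texts, labels, and the per-fold “score” below are hypothetical placeholders, just to show the train/evaluate/average loop:

```python
# A minimal k-fold cross-validation sketch; data and scoring are placeholders.
import numpy as np
from sklearn.model_selection import KFold

texts = np.array(["i feel great", "i am so sad", "this is lovely", "i am furious"])
labels = np.array([0, 1, 0, 2])

kf = KFold(n_splits=2, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kf.split(texts):
    # In practice: fit a model on texts[train_idx] / labels[train_idx],
    # then evaluate it on texts[test_idx] / labels[test_idx].
    fold_scores.append(len(test_idx) / len(texts))  # placeholder "score"

avg_score = sum(fold_scores) / len(fold_scores)
```

Averaging the per-fold scores is what gives cross-validation its robustness: a single lucky split can no longer flatter the model.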
1. Revised the Data Strategy — In the previous article, I had used two-thirds of the data to train and reserved one-third to validate. This time, I used the entire train.txt for training and used the separate validation set for validating and improving the model performance during training.
2. Text Cleaning and Normalization — Previously, I had used the text data straight away, without cleaning or normalizing it. This time, I removed stopwords and obtained the stems of the words using the Porter Stemmer. It brought the vocab size down to 10375.
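The cleaning step can be sketched as below: lowercase, drop stopwords, then stem with NLTK’s Porter Stemmer. The tiny stopword set here is only illustrative; the actual run used a full stopword list.

```python
# Sketch of the cleaning/normalization step: lowercase, drop stopwords, stem.
import re
from nltk.stem import PorterStemmer

STOPWORDS = {"i", "am", "and", "the", "a", "so", "is", "this"}  # illustrative subset
stemmer = PorterStemmer()

def clean_text(text):
    # Keep only alphabetic tokens (and apostrophes), drop stopwords, stem the rest.
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in STOPWORDS)

print(clean_text("I am running and jumping"))  # → "run jump"
```

Applied over the whole corpus, stemming collapses inflected forms onto one stem, which is what shrinks the vocabulary.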
3. As a primer, I redesigned the model, using a lower dropout rate and more units in the two LSTM layers. Additionally, the vocab size was changed from 15000 to 10000 as a result of the cleaning in the previous step. Besides LSTM, I also tried out SimpleRNN and Gated Recurrent Units (GRU).
… And told the model what metrics to focus on.
Previously, the model was compiled using the metric ‘accuracy’. Since we are working with an imbalanced dataset, accuracy is not the right metric; precision and/or recall are better. I chose recall over precision earlier, but I want to improve both anyway, so I specified both as the metrics to evaluate the model on. Also note that F1-score is not readily available as a metric in Keras, so to keep it simple, I have used precision and recall directly in a list.
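The compile step then looks roughly like this. The toy binary model here is only a stand-in so the snippet runs on its own; the point is the metrics list:

```python
# Compiling with precision and recall as the monitored metrics instead of accuracy.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),                 # stand-in input shape
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)
```

Naming the metrics explicitly also gives you stable keys (`val_precision`, `val_recall`) to monitor in callbacks later.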
Revised Model Performance Evaluation:
There is a significant boost in model performance with these updated configurations. Check out the performance of the LSTM model below –
Recall that our weighted RECALL 😉on the test data was 81% for the baseline model which improves to 89% on the same data. The three classes with the highest number of instances in the test data have recall equal to or higher than 90%.
Next, I wanted to try SimpleRNN and GRU among the other recurrent layers. GRU performs similarly to the LSTM, with a recall of 90% on the test data, while SimpleRNN manages only 81%, equivalent to the baseline model!
The performance of SimpleRNN is poor because the model is evidently overfitted. Here is the training-validation loss curve for the model –
Note that the two loss curves cross at epoch 4, and the validation loss keeps improving until epoch 7. After that, the validation loss becomes unstable, increasing and decreasing at intervals. The early stopping callback uses a patience of 5, and due to this instability of the validation loss, the model keeps on training, eventually picking epoch 19 as the best one. This is again due to the limit I had set in the code:
epochs=20. Had I allowed more than 20 epochs, the model would have trained for about 5 more. Clearly, this model won’t be able to generalize on the test set or any unseen data, which is manifested in the classification report.
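For reference, the early-stopping setup described above can be written as below. Monitoring validation loss and `restore_best_weights` are my assumptions about the exact configuration:

```python
# Early stopping with patience 5: training continues until 5 consecutive
# epochs pass with no improvement in validation loss.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,  # roll back to the best epoch's weights
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=20, callbacks=[early_stop])
```

With an unstable validation loss, each small dip resets the patience counter, which is exactly why the SimpleRNN run dragged on to epoch 19.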
Besides, the training times of the models are as follows:
- LSTM — 6mins
- SimpleRNN — 19mins
- GRU — 8mins
Evidently, SimpleRNN takes the longest to train, while the LSTM and GRU models are considerably faster.
Finally, here is the confusion matrix for the performance of the LSTM model on the test data:
The model still gets slightly confused between anger and sadness, joy and love, and somewhat between joy and sadness as well! But the percentages of misclassification per class have reduced significantly.
- Append more data✅— The training set has more data now since I used the separate validation set for validating the model performance while training.
- Feature engineering and Feature Selection✅— I cleaned and normalized the texts, and the vocab size was reduced from 15k to 10k.
- Multiple Algorithms And Altered Hyperparameters ✅ — I modified the model’s network configurations and updated the focus of improvement from accuracy to precision and recall. I also tested SimpleRNN and GRU.
- Use Cross-Validation ❌
- Tune the Hyperparameters ❌
These three changes improved the model’s performance from 81% to 89% (LSTM) and 90% (GRU) weighted-avg recall on the test data. So, it’s time to explore cross-validation and hyperparameter tuning to see what happens.
KerasTuner is a library for finding the optimal set of hyperparameters (that is, tuning the hyperparameters) for your neural network model.
Install Keras Tuner using the following command:
pip install -q -U keras-tuner
Now, after prepping the text data into padded sequences, the model building procedure using LSTM for tuning is as below:
KerasTuner allows us to define hyperparameters inline while building a model. For example, I used ‘vector size’ as a hyperparameter to tune, specifying that it is supposed to be an integer (hp.Int) whose minimum value is 100 and maximum is 500. It will be incremented by a step size of 100 in the search space.
In the model_builder function, I defined five hyperparameters –
- vector_size — Integer| range 100 to 500, step size: 100
- dropout_rate — Float | range 0.6 to 0.9, step size: 0.1
- lstm_units1 — Integer| range 32 to 512, step size: 32
- lstm_units2 — Integer| range 16 to 512, step size: 32
- learning_rate — Choice| 1e-2, 1e-3, 1e-4
The HyperParameters class gives a few choices to design the search space. We can specify whether a hyperparameter is a boolean (hp.Boolean), an integer (hp.Int | line 3, 16, 21), a float (hp.Float | line 11), a choice among a few options (hp.Choice | line 27), a fixed value, and so on.
So now, we have our hyperparameter search space defined, which means the hypermodel is ready for tuning. To find the optimum hyperparameters, I first instantiated the tuner using the code below:
To instantiate the Hyperband tuner, I specified the following parameters:
- hypermodel: the model_builder function
- objective: the metric to optimize, as compiled in the hypermodel. I used validation recall for this use case.
- max_epochs: Maximum number of epochs to train one model
- factor: An integer defining the reduction factor for the number of models and epochs per bracket. The number of models to train in a bracket is calculated as 1 + log_factor(max_epochs), rounded to the nearest integer.
- directory/project: To log each trial’s configurations, checkpoints, and scores during hyperparameter tuning
In line 11, I added an early stopping configuration to monitor validation recall. As a result, once the best validation recall is achieved in an epoch, the model keeps training for 5 more epochs to look for any further improvement before stopping. Finally, I started the search for the best hyperparameters in line 21.
How does the HyperBand Tuner Algorithm work?
The hyperband tuner is an extension of the Successive Halving Algorithm(SHA) for adaptive resource allocation with early stopping.
The working principle of the Successive Halving Algorithm for adaptive resource allocation can be summarized as:
- Uniformly allocate resources across the hyperparameter sets and train each of them using only a fraction of the total resources/time. This strategy is visible when running the tuner: notice that the initial models get trained for only about 3 or 4 epochs, way fewer than the specified maximum number of epochs.
- The best-performing half of the hyperparameter sets is then “progressed” to the next stage, where the resulting models are trained with more resources/time allocated to them. This is why, towards the end of the tuner run, the number of epochs is higher.
- Repeat until there is only one configuration.
This algorithm is tweaked a little to make it more flexible in the Hyperband tuner. It uses η, the rate of elimination, where only 1/η of the hyperparameter sets are progressed to the next bracket for training and evaluation. In this Keras implementation of the algorithm, η corresponds to the factor argument, and the number of models per bracket follows the formula 1 + log_factor(max_epochs), rounded to the nearest integer.
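As a quick sanity check of that formula for this article’s settings (max_epochs=20, with the two factors tried later):

```python
# Worked example of the Hyperband sizing formula: round(1 + log_factor(max_epochs)).
import math

def bracket_size(max_epochs, factor):
    return round(1 + math.log(max_epochs, factor))

print(bracket_size(20, 3))  # → 4
print(bracket_size(20, 5))  # → 3
```

A larger factor therefore means fewer, more aggressively pruned brackets, which is why tuning with factor 5 finishes faster than with factor 3.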
Read more about this in the original paper here.
Now… Getting back to the Code
I experimented with two factors, 3 and 5, and here are the results for the best hyperparameter values.
Next, using the best hyperparameters to build the final model (line 2), I reused the same ANN callbacks and training code to train it.
The outcomes are here:
The performances of these two models are similar. Compared to the baseline model, the overall performance on the unseen dataset improves by 11%, although compared to the previous untuned model, the improvement is only marginal.
And… I also repeated the procedure using GRUs
Here is the updated code for the model_builder function:
The hyperparameters and the performance of the model on test data are:
Notice how the weighted-avg recall is the same as the tuned LSTM’s, and the macro-avg varies only marginally. However, the vector size for the tuned GRU model is twice that of the tuned LSTM with factor 3 and four times that of the one with factor 5. Besides, the GRU network units are larger than those of the LSTMs, which indicates that this model will occupy more memory than either of the tuned LSTM models.
Are these models immune to overfitting?
Overfitting is when a model has learnt the training data too closely and is unable to generalize to unseen data, while underfitting is when the model hasn’t learnt enough to draw useful conclusions from new data.
Here are the training and validation loss graphs for the final models. In both the graphs, we can see the model getting overfitted as the training loss gets lower and lower with increasing epochs.
The validation loss for the tuned model with factor=3 hovers between 0.25 and 0.28, and for the other one it stagnates at around 0.22. The overfitting does show in the validation loss. In the first graph, after the 4th epoch, where the two losses are equal, the training loss keeps decreasing while the validation loss slowly increases. However, this model is trained until the 8th epoch, where the validation loss is the lowest, i.e. the model’s generalizing capability is highest. Still, the model at the 8th epoch carries a risk of overfitting, since the value at the best epoch breaks the increasing pattern of the validation loss with a sudden drop (so it is basically an outlier).
If I had to choose… I would discard the tuned GRU model, as it has the same performance on the unseen data but utilizes more memory for computation. Between the two LSTM models, I would prefer the one with factor 5 if I focus on resources and computational aspects. The macro and weighted averages of the recalls on the test set are high for the tuned LSTM with factor 5, so I would choose this one. Besides, the difference between the training loss and validation loss at the best epoch is lower for this model (and hence the chance of overfitting is lower): 0.05 (0.21 - 0.16), while for the other model it is 0.09 (0.23 - 0.14). Additionally, the difference between the training loss at the crossover point and at the best epoch is also smaller for the tuned LSTM with factor 5.
Yet… the best models, after all the tuning and searching and refining, are still confused about love❤️ and joy😄 as in the heatmaps below:
… but compared to the heatmap of our baseline, the baseline’s definitely looks more colorful, which indicates more misclassifications:
Finally, revisiting the top 5 most common approaches for model improvement, I am at 4.5/5.
- Append more data✅
- Feature engineering and Feature Selection✅
- Multiple Algorithms And Altered Hyperparameters✅
- Cross-Validation ❌ Validation ✅
- Tune the Hyperparameters ✅
The result is a boost from 74% to 86% macro-recall, and from 81% to 90% weighted-recall, on the test dataset.
The only thing missing from the list is cross-validation; however, a validation set has been used to validate the model’s performance during training. Unlike Scikit-learn’s hyperparameter tuning implementations, KerasTuner doesn’t have CV built in: I could not find anything about it in the docs. Also, to perform K-fold cross-validation (CV), I might have to mix up the training and validation data. The SklearnTuner in KerasTuner offers the option, but the model to be tuned is supposed to be a Scikit-learn model. Hope to explore this in another blog!💡