
## Explore how a text-based model’s performance could be improved by incorporating simple data and/or NLP techniques coupled with hyperparameter tuning for ANNs using KerasTuner

In my previous article on Multiclass Text Classification Using Keras to Predict Emotions, I compared a deep text classifier given a head start with skip-gram word embeddings against another deep text classifier that learned the embeddings from scratch. **The ANN classifier that learned the embeddings from scratch was slightly better** than the other one, and it is the one I will consider as my baseline in this article. In this revisit to the same dataset, which is publicly available on Kaggle here and on Hugging Face datasets here, I will try to improve the model that yielded the better weighted-avg recall (81% on the test data).

In the image below, the performance of the models on the test set is reported. Here, *Model 1* refers to the model trained **with** word2vec embeddings, and *Model 2* refers to the model trained **without** word2vec embeddings. Recall that RECALL 😏 was important in this use case, so it was concluded that the second model performed marginally better; therefore, **Model 2 is our baseline for this article.**

The most common approaches for improving an ML model's performance are:

- **Append more data**, which gives an ML model more examples to learn and generalize from.
- **Feature engineering** to extract helpful information from the given data so the model can easily and efficiently find patterns for predictions, and **feature selection** to treat the GIGO problem. Basically, this allows models to work with only a few useful features, removes noise, and saves computation time and resources.
- Try **multiple algorithms** to find the one best suited for the predictions.
- Use **cross-validation** for a robust, well-generalized model. Using cross-validation, you can train and test a model's performance on multiple chunks of the dataset, average the performances, and figure out whether the model is at its best or not.
- **Tune the hyperparameters** to identify the combination best suited to the dataset, since they have a pivotal influence on the outcome of the model-training process.

Referring to the same notebook I had in the previous article, I made a few changes —

**1. Revised the Data Strategy** — Previously, I had used two-thirds of the data to train and reserved one-third to validate. This time, I used the entire train.txt for training and used the separate validation set for validating and improving model performance during training.

- **Append more data ✅** — the training set has more data now, since I used the separate validation set to validate model performance during training.
- **Feature engineering and feature selection ✅** — I cleaned and normalized the texts, and the vocab size was reduced from 15k to 10k.
- **Multiple algorithms and altered hyperparameters ✅** — I modified the model's network configuration and shifted the focus of improvement from accuracy to precision and recall. I also tested SimpleRNN and GRU.
- **Cross-validation ❌**
- **Tune the hyperparameters ❌**

These three changes improved the model’s performance from 81% to 89% (LSTM) and 90% (GRU) weighted-avg recall on the test data. So, it’s time to explore cross-validation and hyperparameter tuning to see what happens.

KerasTuner is a library for finding the optimal set of hyperparameters (i.e., *tuning the hyperparameters*) for your neural network model.

Install KerasTuner using the following command:

`pip install -q -U keras-tuner`

Now, after prepping the text data into padded sequences, the model building procedure using LSTM for tuning is as below:

KerasTuner allows us to define hyperparameters inline while building a model. For example, I used *vector_size* as a hyperparameter to tune, specifying that it must be an **integer** (`hp.Int`) whose minimum value is 100 and maximum is 500, incremented by a step size of 100 in the search space.

In the model_builder function, I defined five hyperparameters –

- vector_size — Integer | range 100 to 500, step size 100
- dropout_rate — Float | range 0.6 to 0.9, step size 0.1
- lstm_units1 — Integer | range 32 to 512, step size 32
- lstm_units2 — Integer | range 16 to 512, step size 32
- learning_rate — Choice | 1e-2, 1e-3, 1e-4

The hyperparameter class offers several ways to design the search space. We can specify whether a hyperparameter is a boolean (*hp.Boolean*), an integer (*hp.Int | lines 3, 16, 21*), a float (*hp.Float | line 11*), a choice among a few options (*hp.Choice | line 27*), a fixed value (*hp.Fixed*), and so on.

So now, our hyperparameter search space is defined, which means the **hypermodel** is ready for us to start tuning. To find the optimum hyperparameters, I first instantiated the tuner using the code below:

To instantiate the Hyperband tuner, I specified the following parameters:

- hypermodel: the model_builder function
- objective: the metric to optimize during the search (not necessarily the model's loss). I used validation recall for this use case.
- max_epochs: Maximum number of epochs to train one model
- factor: An integer defining the reduction factor for the number of models and epochs per bracket. The number of models to train in a bracket is calculated as 1 + log_factor(max_epochs), rounded to the nearest integer.
- directory/project: To log each trial’s configurations, checkpoints, and scores during hyperparameter tuning

In *line 11*, I added an early-stopping configuration to monitor validation recall. As a result, after the best validation recall is achieved in an epoch, the model keeps training for up to 5 more epochs in search of further improvement. Finally, I started the best-hyperparameter search in *line 21*.

**How does the HyperBand Tuner Algorithm work?**

The Hyperband tuner is an extension of the Successive Halving Algorithm (SHA) for adaptive resource allocation with early stopping.

The working principle of the Successive Halving Algorithm for adaptive resource allocation can be summarized as:

- Uniformly allocate resources across the hyperparameter sets and tune each using only a fraction of the total resources/time. You can see this strategy when running the tuner: the initial models train for only about 3 or 4 epochs, far fewer than the specified maximum.
- The best-performing half of the hyperparameter sets is then “progressed” to the next stage, where the resulting models are trained with more resources/time allocated to them. This is why, towards the end of the tuner run, the number of epochs is higher.
- Repeat until only one configuration remains.
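The steps above can be sketched as a toy, pure-Python illustration (not KerasTuner's internals): configurations compete on a growing budget, and only the best half survives each round. The scoring function is a hypothetical stand-in for "train this configuration for `budget` epochs and evaluate it".

```python
import random

def toy_score(config, budget):
    # Hypothetical stand-in for "train `config` for `budget` epochs and score it".
    random.seed(hash((config, budget)) % 10_000)
    return config / 100 + random.random() * 0.1

def successive_halving(n_configs=8, min_budget=3):
    configs = list(range(n_configs))
    budget = min_budget
    while len(configs) > 1:
        # Score every surviving configuration with the current budget...
        ranked = sorted(configs, key=lambda c: toy_score(c, budget), reverse=True)
        # ...keep the top half, and double the budget for the next round.
        configs = ranked[: max(1, len(configs) // 2)]
        budget *= 2
    return configs[0]

best = successive_halving()
```

With 8 configurations, the rounds shrink the pool 8 → 4 → 2 → 1 while the per-configuration budget doubles each time, which is exactly why early trials are cheap and later ones train longer.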

This algorithm is tweaked a little to make it more flexible for the Hyperband tuner. It uses η, the rate of elimination, where only 1/η of the hyperparameter sets are progressed to the next bracket for training and evaluation. In this Keras implementation of the algorithm, η corresponds to the `factor` argument, and the number of models per bracket is 1 + log_factor(max_epochs), rounded to the nearest integer.

Read more about this in the original paper here.

## Now… Getting back to the Code

I experimented with two factors, 3 and 5, and here are the results for the best hyperparameter values.

Next, using the best hyperparameters to build the final model (*line 2*), I reused the same ANN callbacks and training code to train it.

The outcomes are here:

The performances of these two models are similar; however, the overall performance is improved by 11% over the baseline model on the unseen dataset, although the improvement over the previous (untuned) model is only marginal.

## And… I also repeated the procedure using GRUs

Here is the updated code for the model_builder function:

The hyperparameters and the performance of the model on test data are:

Notice how the weighted-avg recall is the same as the tuned LSTM’s and the macro-avg varies only marginally. However, the **vector size for the tuned GRU model is twice that of the tuned LSTM with factor 3 and four times that of the one with factor 5**. Besides, the GRU network’s units are larger than those of the LSTMs, which indicates that this model will occupy more memory than either of the tuned LSTM models.

**Are these models immune to overfitting?**

Overfitting is when a model has learnt the training data too closely and is unable to generalize to unseen data, while underfitting is when the model hasn’t learnt enough to draw useful conclusions from new data.

Here are the training and validation loss graphs for the final models. In both graphs, we can see the model starting to overfit as the training loss keeps decreasing with increasing epochs.

The validation loss for the tuned model with factor=3 hovers between 0.25 and 0.28, while for the other one it stagnates at around 0.22. The overfitting does affect the validation loss, as seen in the graphs. In the first one, after the 4th epoch, where the two losses are equal, the training loss keeps decreasing while the validation loss slowly increases. However, this model is trained until the 8th epoch, where the validation loss is lowest, i.e., where the model’s generalizing capability is highest. Still, the model at the 8th epoch carries a risk of overfitting, since that validation-loss value breaks the increasing pattern and is a sudden drop (so, basically, an outlier).
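For readers reproducing such curves, here is a minimal sketch of plotting training vs. validation loss from a Keras History-style dict; the numbers in `history` are made up for illustration, not the article's actual losses.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Stand-in for `model.fit(...).history` with invented values.
history = {"loss": [0.9, 0.5, 0.3, 0.2], "val_loss": [0.8, 0.5, 0.4, 0.45]}

fig, ax = plt.subplots()
ax.plot(history["loss"], label="training loss")
ax.plot(history["val_loss"], label="validation loss")
ax.set_xlabel("epoch")
ax.set_ylabel("loss")
ax.legend()
fig.savefig("loss_curves.png")
```

A widening gap between the two curves, like the one after the crossover point described above, is the visual signature of overfitting.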

**If I had to choose…** Discarding the tuned GRU model, as it has the same performance on the unseen data but utilizes more *memory for computation*, I would choose one of the LSTM models, preferably the one with factor 5 if I focus on resources and computation. The *macro and weighted averages of recall* on the test set are high for the tuned LSTM with factor 5, and hence I would choose this one. Besides, the *difference between the training loss and validation loss at the best epoch* is lower for this model (and hence the chance of overfitting is lower): 0.05 (*0.21 − 0.16*) versus 0.09 (*0.23 − 0.14*) for the other model. Additionally, the *difference between the training loss at the crossover point and at the best epoch* is also lower for the tuned LSTM with factor 5.

Yet… the best models, after all the tuning, searching, and refining, still confuse love ❤️ and joy 😄, as in the heatmaps below:

… but when compared to the heatmap of our baseline, the latter definitely looks more colorful, which indicates more misclassifications:

Finally, revisiting the top 5 most common approaches for model improvement, I am at 4.5/5.

- **Append more data ✅**
- **Feature engineering and feature selection ✅**
- **Multiple algorithms and altered hyperparameters ✅**
- **Cross-validation ❌ Validation ✅**
- **Tune the hyperparameters ✅**

The result is a boost from 74% macro-recall to 86%, and 81% weighted-recall to 90% on the test dataset.

The only item missed from the list is cross-validation; however, a validation set was used to validate the model’s performance during training. Unlike Scikit-learn’s hyperparameter-tuning implementations, KerasTuner doesn’t have CV built in, and I could not find anything about it in the docs. Also, to perform cross-validation, I would have to mix the training and validation data to use K-fold cross-validation (CV) techniques. KerasTuner’s Sklearn tuner offers the option, but the model to be tuned must be a scikit-learn model. Hope to explore this in another blog! 💡
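For completeness, a manual K-fold loop around a Keras model could look like the hedged sketch below. The data, the `build_model` helper, and the fold count are toy stand-ins, not the article's setup; the key idea is rebuilding the model with fresh weights for every fold.

```python
import numpy as np
from sklearn.model_selection import KFold
import tensorflow as tf

def build_model():
    # Toy stand-in for the tuned model built from best_hps.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(4,)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

X = np.random.rand(40, 4).astype("float32")
y = np.random.randint(0, 2, size=40)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = build_model()  # fresh weights every fold
    model.fit(X[train_idx], y[train_idx], epochs=1, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)

mean_score = float(np.mean(scores))
```

Averaging the per-fold scores gives the robust performance estimate that a single train/validation split cannot.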
