A Siamese Network structure to fine-tune the RoBERTa model and train a Natural Language Inference Classification task
This article uses a Siamese architecture to fine-tune the RoBERTa model. We will also observe how the dataset’s embedding distribution changes, using PCA and TSNE, as we fine-tune and train. The fine-tuned RoBERTa model is then trained for Natural Language Inference classification on a subset of SNLI.
A subset of the SNLI dataset is used: 20,000 training samples, 10,000 validation samples, and 10,000 test samples. Each sample in an NLI (Natural Language Inference) dataset has a premise and a hypothesis. The goal of the model is to predict whether the hypothesis contradicts the premise, is entailed by it, or is neutral to it.
The original dataset has ~550,000 training samples. Running this experiment on the full set would likely yield better results.
Let’s view the dataset analysis:
We can observe from the above graph that the dataset is balanced. Max sentence length can be estimated as 64 words after concatenating the premise with the hypothesis.
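As a rough illustration of the two checks described above (class balance and the word length of a premise joined with its hypothesis), here is a minimal sketch. The `TOY_PAIRS` list and the `summarize_pairs` helper are hypothetical names made up for this example, and the sentences are invented rather than taken from SNLI.

```python
from collections import Counter

# Toy pairs invented for illustration; they are not taken from SNLI.
TOY_PAIRS = [
    ("A man is playing a guitar.", "A person makes music.", "entailment"),
    ("A dog is running.", "The dog is asleep.", "contradiction"),
    ("Two kids are outside.", "The kids are siblings.", "neutral"),
]


def summarize_pairs(pairs):
    """Return (label counts, longest premise + hypothesis length in words)."""
    label_counts = Counter(label for _, _, label in pairs)
    # Length is measured in whitespace-separated words after concatenation.
    max_words = max(
        len(f"{premise} {hypothesis}".split())
        for premise, hypothesis, _ in pairs
    )
    return label_counts, max_words


label_counts, max_words = summarize_pairs(TOY_PAIRS)
print(label_counts, max_words)
```

Note that this counts words, not tokens. A subword tokenizer such as RoBERTa's will usually produce more tokens than words, so the maximum sequence length passed to the tokenizer may need some headroom.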
We will be using 2 models for this. The first uses a Siamese architecture to fine-tune RoBERTa. The second adds a classification head to predict whether the hypothesis is an entailment of, a contradiction of, or neutral to the premise.
Sentence Transformer Model:
The output of this model is a 768-dimensional feature (context) vector. We will compute embeddings for the premise and the hypothesis simultaneously and fine-tune the model with a cosine similarity loss, using these target similarities per label:
- -1 for contradiction
- 0 for neutral
- 1 for entailment
The above diagram represents the architecture. After the fine-tuning is done, the weights of the RoBERTa model are saved for further downstream tasks.
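To make the objective concrete, here is a minimal pure-Python sketch of a cosine similarity loss with the label targets listed above. It assumes the squared-error form, which is what the sentence-transformers `CosineSimilarityLoss` uses by default. The names `TARGET_SIMILARITY`, `cosine_similarity`, and `cosine_similarity_loss` are hypothetical and exist only for this illustration.

```python
import math

# Target cosine similarity per label, as listed in the article.
TARGET_SIMILARITY = {"contradiction": -1.0, "neutral": 0.0, "entailment": 1.0}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(premise_vecs, hypothesis_vecs, labels):
    """Mean squared error between predicted and target cosine similarity.

    Assumption: squared error is used as the distance; the article only
    says "cosine similarity loss".
    """
    errors = [
        (cosine_similarity(p, h) - TARGET_SIMILARITY[label]) ** 2
        for p, h, label in zip(premise_vecs, hypothesis_vecs, labels)
    ]
    return sum(errors) / len(errors)
```

With this form, a pair of identical embeddings labeled as entailment contributes zero loss, while the same pair labeled as contradiction contributes the maximum error of (1 - (-1))² = 4.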
Auto Model For Sequence Classification:
After fine-tuning the model with the Siamese architecture, we will use the same RoBERTa weights to define a sequence classification model. In this model, only the last “classifier” layer is initialized with new weights. The model is then trained further on the training data with a cyclic learning rate and the Adam optimizer.
Training uses a learning rate of 2e-5 with a cosine scheduler that runs for 2 cycles. This is intended to reduce overfitting.
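A sketch of how the classification model could be built on top of the fine-tuned encoder with the Hugging Face `transformers` library is shown below. It is not necessarily the exact code used here. `encoder_dir` is a hypothetical path to the saved RoBERTa weights, `load_classifier` is a made-up helper name, and the label ids are assumed to follow the order used by the Hugging Face `snli` dataset.

```python
# Assumed label ids, following the Hugging Face `snli` dataset order.
ID2LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def load_classifier(encoder_dir):
    """Build a 3-way classifier on top of previously fine-tuned RoBERTa weights.

    `encoder_dir` is a hypothetical directory holding the fine-tuned encoder
    in Hugging Face format (config.json plus weight files).
    """
    # Imported here so the label maps above stay usable without transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(encoder_dir)
    # The encoder weights are read from disk; the classification head has no
    # saved counterpart, so transformers initializes it with new weights.
    model = AutoModelForSequenceClassification.from_pretrained(
        encoder_dir,
        num_labels=len(ID2LABEL),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

When loading this way, `transformers` typically warns that some weights were newly initialized. That is expected here, since only the encoder was saved after the Siamese fine-tuning step.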
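The shape of such a schedule can be sketched in a few lines. This assumes no warmup and a hard restart at the start of the second cycle; neither detail is stated in the article, and the function name `cyclic_cosine_lr` is hypothetical. Only the base learning rate (2e-5) and the number of cycles (2) come from the text.

```python
import math

BASE_LR = 2e-5   # from the article
NUM_CYCLES = 2   # from the article


def cyclic_cosine_lr(step, total_steps, base_lr=BASE_LR, num_cycles=NUM_CYCLES):
    """Learning rate at `step` for a cosine schedule restarted `num_cycles` times.

    Assumptions: no warmup phase, and each cycle restarts at the full base
    learning rate before decaying towards zero.
    """
    if step >= total_steps:
        return 0.0
    progress = step / total_steps                # fraction of training done, in [0, 1)
    cycle_pos = (num_cycles * progress) % 1.0    # position within the current cycle
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))
```

For example, with 1,000 training steps the rate starts at 2e-5, decays to about 1e-5 at step 250, approaches zero just before step 500, and then jumps back to 2e-5 for the second cycle.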
The above graph displays the loss and accuracy curves for the sequence classification. A test accuracy of 87% is achieved.
After training, we will save the weights. These weights can then be loaded back into the Sentence Transformer model and trained for a couple of epochs to get even better sentence embeddings.
Sentence Embeddings Analysis
We have trained the model 3 times:
- Load the original RoBERTa weights into Sentence Transformers to train with Siamese architecture. We can save these fine-tuned weights.
- Load the fine-tuned weights from the above training into a sequence classification model with RoBERTa. After training, we save these weights.
- Load the weights from the above step into Sentence Transformers to train with Siamese architecture for a couple of epochs and then save the weights.
The sentence embeddings for all the sentence pairs in the training set are saved 3 times:
- Original RoBERTa weights.
- After fine-tuning with Siamese architecture.
- After sequence classification and fine-tuning with Siamese architecture again.
On these 3 versions of sentence embeddings, we will fit PCA and TSNE to analyze the effect of fine-tuning on the dataset’s embeddings. The following graphs represent the analysis:
Graph 1: We run PCA with n_components = 3 on all the 3 instances of sentence embeddings and plot the graph for 2 dimensions.
Graph 2: We run PCA with n_components = 3 on all the 3 instances of sentence embeddings and plot a 3D graph for all the data points.
Graph 3: We run TSNE fitting with n_components = 2 on all the 3 instances of sentence embeddings and plot the graph.
Graph 4: We run PCA with n_components = 50 and then run TSNE fitting on top of it with n_components = 2 on all the 3 instances of sentence embeddings. In this graph, we compare the plots for the 3 instances side by side.
Graph 4.1: For original RoBERTa weights. The data points are scattered, and there is no clear separation between sentence pair embeddings based on their label.
Graph 4.2: After fine-tuning with the Siamese architecture, we can observe some separation between sentence pair embeddings based on their label.
Graph 4.3: After sequence classification and fine-tuning with the Siamese architecture again, we can observe a clear separation between sentence pair embeddings based on their label.
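The PCA-then-TSNE projection used for Graph 4 can be sketched with scikit-learn as follows. The function name `pca_then_tsne`, the `perplexity` default, and the fixed `seed` are assumptions for illustration; the article only specifies n_components = 50 for PCA and n_components = 2 for TSNE.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def pca_then_tsne(embeddings, pca_components=50, tsne_components=2,
                  perplexity=30.0, seed=0):
    """Reduce embeddings with PCA, then lay the result out with t-SNE.

    `perplexity` and `seed` are illustrative choices, not values from the
    article. `perplexity` must be smaller than the number of samples.
    """
    embeddings = np.asarray(embeddings)
    # Step 1: compress the 768-dimensional vectors to `pca_components` dims.
    reduced = PCA(n_components=pca_components, random_state=seed).fit_transform(embeddings)
    # Step 2: map the compressed vectors to a 2D layout for plotting.
    tsne = TSNE(n_components=tsne_components, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(reduced)
```

Running PCA first reduces noise and makes the TSNE step much cheaper than fitting it directly on 768 dimensions. Calling the function once per saved set of embeddings, and coloring the resulting points by label, gives three plots that can be compared directly.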
I have presented the impact of training a transformers model with a semi-supervised approach that, in this experiment, led to a more robust model. This experiment was done on a small set of 20,000 data points. Doing this on the full dataset would likely have a bigger impact. There are already models fine-tuned with a Siamese architecture on Hugging Face that can be used directly.
Other techniques can be used to better cluster or visualize the sentence pair embeddings. These visualizations can also be done for any other NLP task.