
Watch Out For Your Beam Search Hyperparameters

January 11, 2023


The default values are never the best

Photo by Paulius Dragunas on Unsplash

When developing applications using neural models, it is common to try different hyperparameters for training the models.

For instance, the learning rate, the learning rate schedule, and the dropout rates are important hyperparameters that have a significant impact on the learning curve of your models.

What is much less common is the search for the best decoding hyperparameters. If you read a deep learning tutorial or a scientific paper tackling natural language processing applications, there is a high chance that the hyperparameters used for inference are not even mentioned.

Most authors, including myself, do not bother searching for the best decoding hyperparameters and use default ones.

Yet, these hyperparameters can have a significant impact on the results, and whatever decoding algorithm you use, there are always some hyperparameters that should be fine-tuned to obtain better results.

In this blog article, I show the impact of decoding hyperparameters with simple Python examples and a machine translation application. I focus on beam search, since it is by far the most popular decoding algorithm, and on two of its hyperparameters in particular.

To demonstrate the effect and importance of each hyperparameter, I will show some examples produced using the Hugging Face Transformers package, in Python.

To install this package, run the following command in your terminal (I recommend doing it in a separate conda environment):

pip install transformers

I will use GPT-2 (MIT licence) to generate simple sentences.

I will also run other examples in machine translation using Marian (MIT licence). I installed it on Ubuntu 20.04, following the official instructions.

Beam search is probably the most popular decoding algorithm for language generation tasks.

At each time step, i.e., for each new token generated, it keeps the k most probable hypotheses according to the model used for inference, and discards the remaining ones.

Finally, at the end of decoding, the hypothesis with the highest probability is returned as the output.

k, usually called the “beam size”, is a very important hyperparameter.

With a higher k you get a more probable hypothesis. Note that when k=1, we talk about “greedy search” since we only keep the most probable hypothesis at each time step.
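
To make the procedure concrete, here is a minimal sketch of the beam search loop in plain Python. It assumes a hypothetical next_token_log_probs function that returns the model's log-probabilities for the next token given a prefix; it illustrates the algorithm itself, not the Transformers implementation:

def beam_search(next_token_log_probs, start_token, k, max_len, eos="</s>"):
    # Each hypothesis is a (token list, cumulative log-probability) pair.
    beams = [([start_token], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            # Finished hypotheses are carried over unchanged.
            if tokens[-1] == eos:
                candidates.append((tokens, score))
                continue
            # Expand the hypothesis with every possible next token.
            for token, lp in next_token_log_probs(tokens).items():
                candidates.append((tokens + [token], score + lp))
        # Keep only the k most probable hypotheses; discard the rest.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    # The hypothesis with the highest probability is the output.
    return max(beams, key=lambda c: c[1])[0]

Note that with k=1 the sort-and-truncate step degenerates to picking the single most probable continuation, which is exactly greedy search.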

By default, in most applications, k is arbitrarily set between 1 and 10, values that may seem very low.

There are two main reasons for this:

  • Increasing k increases the decoding time and the memory requirements. In other words, it gets more costly.
  • Higher k may yield more probable but worse results. This is mainly, but not only, due to the length of the hypotheses. Longer hypotheses tend to have lower probability, so beam search will tend to promote shorter hypotheses that may be undesirable for some applications.

The first point can be straightforwardly fixed by performing better batch decoding and investing in better hardware.

The length bias can be controlled through another hyperparameter that normalizes the probability of a hypothesis by its length (number of tokens) at each time step. There are numerous ways to perform this normalization. One of the most widely used equations was proposed by Wu et al. (2016):

lp(Y) = (5 + |Y|)^α / (5 + 1)^α

Where |Y| is the length of the hypothesis and α a hyperparameter, usually set between 0.5 and 1.0.

Then, the score lp(Y) is used to modify the probability of the hypothesis, biasing the decoding toward longer or shorter hypotheses depending on α.
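
As a sketch, this is what the normalization looks like in code; the rescoring function is illustrative, not the exact Transformers implementation:

def length_penalty(length, alpha):
    # Wu et al. (2016): lp(Y) = (5 + |Y|)^alpha / (5 + 1)^alpha
    return ((5 + length) ** alpha) / ((5 + 1) ** alpha)

def normalized_score(log_prob, length, alpha):
    # Dividing the (negative) log-probability by lp(Y) penalizes short
    # hypotheses more as alpha grows, biasing the search toward longer outputs.
    return log_prob / length_penalty(length, alpha)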

The implementation in Hugging Face Transformers might be slightly different, but there is such an α that you can pass as “length_penalty” to the generate function, as in the following example (adapted from the Transformers documentation):

from transformers import AutoTokenizer, AutoModelForCausalLM

# Download and load the tokenizer and model for GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Prompt that will initiate the inference
prompt = "Today I believe we can finally"

# Encode the prompt with the tokenizer
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate up to 20 tokens with a beam size of 4 and a length penalty of 0.5
outputs = model.generate(input_ids, length_penalty=0.5, num_beams=4, max_length=20)

# Decode the output into something readable
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

“num_beams” in this code sample is our other hyperparameter k.

With this code sample, the prompt “Today I believe we can finally”, k=4, and α=0.5, we get:

outputs = model.generate(input_ids, length_penalty=0.5, num_beams=4, max_length=20)
Today I believe we can finally get to the point where we can make the world a better place.

With k=50 and α=1.0, we get:

outputs = model.generate(input_ids, length_penalty=1.0, num_beams=50, max_length=30)
Today I believe we can finally get to where we need to be," he said.\n\n"

You can see that the results are not quite the same.

k and α should be fine-tuned independently on your target task, using some development dataset.

Let’s take a concrete example in machine translation to see how to do a simple grid search to find the best hyperparameters and their impact in a real use case.

For these experiments, I use Marian with a machine translation model trained on the TILDE RAPID corpus (CC-BY 4.0) to do French-to-English translation.

I used only the first 100k lines of the dataset for training and the last 6k lines as devtest. I split the devtest into two parts of 3k lines each: the first part is used for validation and the second part for evaluation.

Note: the RAPID corpus has its sentences ordered alphabetically, so my train/devtest split is not ideal for a realistic use case. I recommend shuffling the lines of the corpus, preserving the sentence pairs, before splitting it. In this article, I kept the alphabetical order and did not shuffle, to make the following experiments more reproducible.
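
For reference, here is a minimal sketch of this split; the filenames are hypothetical and assume the corpus is stored as two parallel text files:

# Hypothetical filenames for the French-English RAPID corpus.
with open("rapid.fr", encoding="utf-8") as f_src, open("rapid.en", encoding="utf-8") as f_tgt:
    pairs = list(zip(f_src, f_tgt))

train = pairs[:100_000]        # first 100k sentence pairs for training
valid = pairs[-6_000:-3_000]   # first half of the last 6k lines, for validation
test = pairs[-3_000:]          # second half of the last 6k lines, for evaluation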

I evaluate the translation quality with the metric COMET (Apache License 2.0).

To search for the best pair of values for k and α with grid search, we have to first define a set of values for each hyperparameter and then try all the possible combinations.

Since here we are searching for decoding hyperparameters, this search is quite fast and straightforward, in contrast to searching for training hyperparameters.

The sets of values I chose for this task are as follows:

  • k: {1, 2, 4, 10, 20, 50, 100}
  • α: {0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2}

These sets include the values most commonly used by default in machine translation. For most natural language generation tasks, these sets of values should be tried, except maybe k=100, which is unlikely to yield the best results while being costly to decode.

We have 7 values for k and 8 values for α. We want to try all the combinations, so we have 7*8=56 decodings of the evaluation dataset to do.
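
The combinations themselves are easy to enumerate in Python, for instance with itertools.product; the actual decodings are then launched by the bash script below:

from itertools import product

ks = [1, 2, 4, 10, 20, 50, 100]
alphas = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]

# 7 * 8 = 56 (k, alpha) pairs to decode and evaluate
for k, alpha in product(ks, alphas):
    print(k, alpha)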

We can do that with a simple bash script:

for k in 1 2 4 10 20 50 100 ; do
  for a in 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 ; do
    # -b is the beam size k and -n the length normalization exponent (our alpha).
    # Write one output file per (k, alpha) pair so the decodings are not overwritten.
    marian-decoder -m model.npz -n $a -b $k -c model.npz.decoder.yml < test.fr > test.en.b$k.n$a
  done;
done;

Then for each decoding output we run COMET to evaluate the translation quality.
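
For instance, with the unbabel-comet package, scoring can be sketched as follows; the exact model name and the return type of predict() depend on the COMET version you install, so treat this as an outline rather than a definitive recipe:

from comet import download_model, load_from_checkpoint

# Download a COMET model (name assumed; check the COMET documentation).
model_path = download_model("Unbabel/wmt20-comet-da")
model = load_from_checkpoint(model_path)

def comet_score(src_lines, hyp_lines, ref_lines):
    data = [{"src": s, "mt": h, "ref": r}
            for s, h, r in zip(src_lines, hyp_lines, ref_lines)]
    # predict() computes segment-level scores and a corpus-level system score.
    output = model.predict(data, batch_size=32, gpus=1)
    return output.system_score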

With all the results we can draw the following table of COMET scores for each pair of values:

Table by the author

As you can see, the result obtained with the default hyperparameters (underlined in the table) is lower than 26 of the results obtained with other hyperparameter values.

Actually, all the results in bold are statistically significantly better than the default one.

Note: in these experiments, I used the test set to compute the results shown in the table. In a realistic scenario, these results should be computed on another development/validation set to decide on the pair of values to use on the test set, or in a real-world application.

Hence, for your applications, it is definitely worth fine-tuning the decoding hyperparameters to obtain better results at the cost of a very small engineering effort.

In this article, we only played with two hyperparameters of beam search. Many more should be fine-tuned.

Other decoding algorithms, such as sampling with temperature and nucleus sampling, have their own hyperparameters that you may want to examine instead of using the default ones.

Obviously, as we increase the number of hyperparameters to fine-tune, the grid search becomes more costly. Only your experience and experiments with your application will tell you whether it is worth fine-tuning a particular hyperparameter, and which values are more likely to yield satisfying results.


