As we have seen, more parameters do not equate to better performance. For better performance we also need quality tokens (texts), but these are in short supply. How can we obtain them? Can artificial intelligence help us?
Why aren't we using ChatGPT to produce text?
If we humans are not producing enough text, why not automate the process? A recent study shows why this approach is not optimal: Stanford Alpaca was trained on 52,000 examples derived from GPT-3, but only apparently achieved comparable performance. In reality, the model learns the style of the target model but not its knowledge.
Why not train longer?
For PaLM, Gopher, and LLaMA (as well as for other LLMs) the papers clearly state that the models were trained for only a few epochs (one, or at most a few). This is not a limitation of the Transformer architecture: Vision Transformers (ViT), for example, have been trained for 300 epochs on ImageNet (1 million images), as shown in the table:
Because it is extremely expensive. In the LLaMA article, the authors trained for only one epoch (and for two epochs on only part of the dataset). Nevertheless, they report:
When training a 65B-parameter model, our code processes around 380 tokens/sec/GPU on 2048 A100 GPU with 80GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. (source)
Training an LLM for even a few epochs is extremely expensive. As calculated by Dmytro Nikolaiev (Dimid), this means around 4.0 million dollars to train a model similar to Meta's LLaMA on the Google Cloud Platform.
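As a quick sanity check of these figures, here is a back-of-envelope calculation based on the LLaMA quote above. The per-GPU-hour price is an assumption (cloud prices vary by provider and contract), so the result is only an order-of-magnitude estimate.

```python
# Back-of-envelope estimate of LLaMA-65B training cost.
# Throughput and token count come from the LLaMA quote above;
# the GPU price is an ASSUMPTION (A100-80GB on-demand prices vary by provider).

tokens_total = 1.4e12          # 1.4T training tokens
tokens_per_sec_per_gpu = 380   # reported throughput
num_gpus = 2048                # A100 80GB GPUs

train_seconds = tokens_total / (tokens_per_sec_per_gpu * num_gpus)
print(f"Training time: {train_seconds / 86400:.1f} days")   # ~21 days, matching the paper

assumed_price_per_gpu_hour = 4.0                             # USD, hypothetical cloud price
gpu_hours = num_gpus * train_seconds / 3600
print(f"Estimated cost: ${gpu_hours * assumed_price_per_gpu_hour / 1e6:.1f}M")  # roughly the $4M figure cited above
```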
So training for additional epochs would lead to a steep increase in costs (each extra epoch over the full dataset costs roughly as much as the first). Moreover, we do not know whether this additional training is actually useful: until recently, it had not been tested.
Recently, a group of researchers at the University of Singapore studied what happens if we train an LLM for multiple epochs:
So far we know that a model's performance depends not only on the number of parameters but also on the number of quality tokens used for training. At the same time, these quality tokens are not infinite, and we are approaching the limit. If we cannot find enough quality tokens and generating them with AI is not an option, what could we do?
Can we use the same training set and train longer?
There is a Latin saying that repeating things helps (repetita iuvant), but over time someone added "continuing them bores" (continuata secant).
The same is true for neural networks: increasing the number of epochs improves performance (the loss decreases); at some point, however, while the loss on the training set continues to fall, the loss on the validation set begins to rise. The network has started to overfit, learning patterns that are present only in the training set and losing the ability to generalize.
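As a reminder of what this looks like in practice, here is a minimal sketch of early stopping, the standard response to this behavior. The loss curves are synthetic (simulated purely for illustration), and the `early_stopping` helper is a generic pattern, not code from the study discussed here.

```python
# Minimal sketch of the pattern described above: training loss keeps falling,
# validation loss eventually starts rising, and early stopping catches that point.
# The loss curves are SYNTHETIC, simulated only for illustration.

def early_stopping(train_losses, val_losses, patience=3):
    """Return the epoch at which validation loss stopped improving."""
    best_val, best_epoch, waited = float("inf"), 0, 0
    for epoch, val in enumerate(val_losses):
        if val < best_val:
            best_val, best_epoch, waited = val, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Simulated curves: train loss decreases monotonically; val loss is U-shaped.
train_losses = [2.0 * 0.9 ** e for e in range(30)]
val_losses = [2.0 * 0.9 ** e + 0.01 * max(0, e - 10) ** 2 for e in range(30)]

stop = early_stopping(train_losses, val_losses)
print(f"Validation loss bottoms out around epoch {stop}; training longer only overfits.")
```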
Ok, this has been studied extensively for small neural networks, but what about huge transformers?
The authors of this study used the T5 model (an encoder-decoder model) on the C4 dataset. They trained several versions of the model, increasing the number of parameters until the larger model outperformed the smaller one (indicating that the larger model had received a sufficient number of tokens, as predicted by Chinchilla's law). They also noted a linear relationship between the number of tokens required and the size of the model (confirming what DeepMind observed with Chinchilla).
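For reference, the Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per parameter. The snippet below applies that approximate ratio to a few example model sizes; the 20:1 figure is an approximation, not an exact law.

```python
# Rough "Chinchilla-optimal" token budget: ~20 tokens per parameter
# (an approximation of the DeepMind scaling result, not an exact law).

TOKENS_PER_PARAM = 20

for params_in_billions in [1, 7, 65, 175]:
    optimal_tokens = params_in_billions * 1e9 * TOKENS_PER_PARAM
    print(f"{params_in_billions:>4}B params -> ~{optimal_tokens / 1e12:.2f}T tokens")
```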
The C4 dataset is limited (it does not have infinite tokens), so as they increased the number of parameters the authors found themselves in a token-scarcity condition. They therefore decided to simulate what happens when an LLM sees repeated data: they sampled a certain number of tokens, so that the model ended up seeing the same tokens again during training (a minimal sketch of this setup follows the list below). This showed:
- Repeated tokens lead to degraded performance.
- Larger models are more susceptible to overfitting under token-crisis conditions (so, despite consuming more computational resources, they end up with degraded performance).
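To make the setup concrete, here is a minimal sketch of how a token-crisis can be simulated: keep the total training budget fixed, but draw only a fraction of unique tokens and cycle through them. This is a generic illustration of the idea, not the authors' actual pipeline; the `repeated_token_stream` helper and the numbers are made up for the example.

```python
import itertools

# Simulate token scarcity: a fixed training budget of tokens, but only a
# fraction of them are unique; the rest are repetitions (i.e. extra epochs
# over the same data). Generic illustration, not the paper's pipeline.

def repeated_token_stream(corpus_tokens, unique_fraction, budget):
    """Yield `budget` tokens, cycling over only a subset of the corpus."""
    n_unique = max(1, int(len(corpus_tokens) * unique_fraction))
    subset = corpus_tokens[:n_unique]        # subset of unique tokens (a random sample in a real setup)
    return list(itertools.islice(itertools.cycle(subset), budget))

corpus = list(range(1000))                   # stand-in for real token ids
budget = 1000

for fraction in [1.0, 0.5, 0.25]:            # 1, 2 and 4 "epochs" worth of repetition
    stream = repeated_token_stream(corpus, fraction, budget)
    repeats = budget / len(set(stream))
    print(f"unique fraction {fraction:.2f} -> each token seen ~{repeats:.1f} times")
```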
In addition, these models are used for downstream tasks. Often an LLM is trained unsupervised on a large amount of text and then fine-tuned on a smaller dataset for a downstream task. Or it may go through a process called alignment (as in the case of ChatGPT).
When an LLM is trained on repeated data, performance degrades even if it is subsequently fine-tuned on another dataset. So downstream tasks are impacted as well.
We just saw that repeated tokens harm training. But why does this happen?
The authors decided to investigate by keeping the number of repeated tokens fixed and increasing the number of total tokens in the dataset. The results show that a larger dataset alleviates multi-epoch degradation issues.
Last year Galactica was published (a model that was supposed to help scientists, but whose public demo lasted only three days). Apart from the spectacular debacle, the article suggested that part of its results came from the quality of the data. According to the authors, data quality reduced the risk of overfitting:
We are able to train on it for multiple epochs without overfitting, where upstream and downstream performance improves with use of repeated tokens. (source)
According to the Galactica authors, repeated tokens not only did not harm model training but actually improved downstream performance.
In this new study, the authors used the Wikipedia dataset, which is considered higher quality than C4, and added repeated tokens. The results show a similar level of degradation, contrary to what is stated in Galactica's article.
The authors also investigated whether the degradation was related to model scaling. When a model is scaled up, both the number of parameters and the computational cost increase. To disentangle these two factors, the authors studied them individually:
- Mixture-of-Experts (MoE), because although it increases the number of parameters, it keeps the computational cost roughly constant (see the sketch after this list).
- ParamShare, on the other hand, reduces the number of parameters but maintains the same computational cost.
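To see why MoE decouples parameter count from per-token compute, here is a minimal top-1 routing sketch in PyTorch. It is a generic illustration of the technique, not the architecture used in the study; the layer sizes and number of experts are arbitrary.

```python
import torch
import torch.nn as nn

# Minimal top-1 Mixture-of-Experts layer: parameters grow with the number of
# experts, but each token is processed by only ONE expert, so per-token compute
# stays close to that of a single dense feed-forward block.

class Top1MoE(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (num_tokens, d_model)
        expert_idx = self.router(x).argmax(dim=-1)       # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])              # only that expert runs on these tokens
        return out

dense = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
moe = Top1MoE()
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"dense params: {count(dense):,}  moe params: {count(moe):,}")  # roughly 8x more parameters in the MoE
```

Real MoE layers also weight each expert's output by the router probabilities and add a load-balancing loss to keep experts evenly used; the point here is only that parameters scale with the number of experts while each token still passes through a single feed-forward block.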
The results show that the model with fewer parameters is less affected by repeated tokens, while the MoE model (with more parameters) is more prone to overfitting. The result is interesting because MoE has been used successfully in many AI models: the authors therefore suggest that although MoE is a useful technique when there is enough data, it can hurt performance when there are not enough tokens.
The authors also explored whether the training objective impacts performance degradation. In general, there are two training objectives: next-token prediction (causal language modeling, as in GPT-style decoders) and masked language modeling, where the model reconstructs masked-out spans (as in T5-style denoising).
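To make the distinction concrete, here is a toy illustration of how the training targets differ under the two objectives. Real pipelines operate on token ids with sentinel tokens and span corruption; this is only a simplified sketch.

```python
# Toy illustration of the two objectives on a small token sequence.
# Real pipelines work on token ids with special sentinel tokens.

tokens = ["the", "cat", "sat", "on", "the", "mat"]

# 1) Next-token prediction (causal LM): predict each token from its prefix.
causal_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(causal_pairs[0])   # (['the'], 'cat')

# 2) Masked language modeling: hide some positions and predict what was there.
masked_positions = {1, 4}                                   # chosen arbitrarily
corrupted = [t if i not in masked_positions else "<mask>" for i, t in enumerate(tokens)]
targets = {i: tokens[i] for i in masked_positions}
print(corrupted)         # ['the', '<mask>', 'sat', 'on', '<mask>', 'mat']
print(targets)           # {1: 'cat', 4: 'the'}
```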
Recently, Google introduced UL2 (an approach also used for PaLM 2), which mixes these two training objectives. UL2 has been shown to accelerate model training; interestingly, however, UL2 is more prone to overfitting and shows greater multi-epoch degradation.
The authors next explored how they could try to alleviate multi-epoch degradation. Since regularization techniques are used precisely to prevent overfitting, the authors tested whether these techniques had a beneficial effect here as well.
Dropout proves to be one of the most effective techniques for alleviating the problem. This is not surprising: it is one of the most effective regularization techniques, it is easy to parallelize, and it is used in many models.
Moreover, the authors find that it works best to start training without dropout and only add dropout at a later point in training.
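A simple way to implement such a schedule is to keep the dropout probability at zero and raise it after a chosen step. The sketch below is a generic PyTorch illustration; the threshold step and dropout probability are placeholder values, not the settings used by the authors.

```python
import torch.nn as nn

# Delayed dropout: train without dropout at first, then switch it on later.
# The step threshold and dropout probability are placeholder values.

def set_dropout(model, p):
    """Set the dropout probability of every nn.Dropout module in the model."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.0), nn.Linear(512, 512))

DROPOUT_START_STEP = 10_000   # hypothetical point at which dropout is enabled
for step in range(20_000):
    if step == DROPOUT_START_STEP:
        set_dropout(model, 0.1)          # enable dropout partway through training
    # ... forward pass, loss, backward pass and optimizer step would go here ...
```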
On the other hand, the authors note that using dropout in some models, especially larger ones, can lead to a slight reduction in performance. So although dropout may help against overfitting, it can lead to unexpected behavior in other contexts. So much so that models such as GPT-3, PaLM, LLaMA, Chinchilla, and Gopher do not use it in their architectures.
As the table below shows, the models the authors used in their experiments would now be considered almost small. Even so, testing different hyperparameters when designing an LLM is expensive:
For instance, in our specific scenario, training T5-XL five times would require approximately $37,000 USD for renting Google Cloud TPUs. Considering even larger models like PaLM and GPT-4, trained on even larger datasets, this cost becomes unmanageable (source)
Since, in their experiments, a sparse MoE model approximates the behavior of a dense model (which is more computationally expensive), it can be used to search for the best hyperparameters.
For example, the authors show that one can test different learning rates on the MoE model and it exhibits the same behavior as the equivalent dense model. So, according to the authors, one can tune the hyperparameters with the MoE model and then train the dense model with the chosen values, thus saving cost:
sweeping the MoE Large model incurred an expenditure of approximately 10.6K USD on the Google Cloud Platform. Conversely, training the Dense XL model only once required 7.4K USD. Consequently, the entire development process, including sweeping, amounted to a total cost of 18K USD, which is only 0.48 times the expense of directly tuning the Dense XL model (source)
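Putting the quoted figures together, the arithmetic checks out:

```python
# Cost comparison using only the figures quoted above (USD).

dense_xl_five_runs = 37_000        # "training T5-XL five times would require approximately $37,000"
moe_sweep = 10_600                 # sweeping the MoE Large model
dense_xl_once = 7_400              # training the Dense XL model a single time

with_moe_proxy = moe_sweep + dense_xl_once        # sweep on MoE, then one final dense run
print(f"Per dense run: ${dense_xl_five_runs / 5:,.0f}")          # ~$7,400, consistent with the quote
print(f"MoE sweep + final dense run: ${with_moe_proxy:,.0f}")    # $18,000
print(f"Ratio vs. sweeping the dense model: {with_moe_proxy / dense_xl_five_runs:.2f}")
# ~0.49 with these rounded figures; the paper reports 0.48
```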