Steps to debug BERT’s (and other LLMs’) slow prediction times on a personal computer
This all started when I was playing around with BERT models, and I got the ominous message all Data Scientists hope to avoid:
The dreaded “Kernel Died” message 💀
This happened to me while I was running my TensorFlow BERT model in my Jupyter Notebook. Training large language models (LLMs) notoriously takes a large amount of data and compute, so it could make sense for my comparatively puny laptop to crash here…
… except this crash occurred during prediction, rather than training, which was strange given my assumption that more memory was used during training than prediction.
The “Kernel Died” error is unfortunately not very descriptive, and debugging line-by-line through the TensorFlow source code sounded like a daunting exercise.
A few quick searches around Stack Overflow did not completely answer my outstanding questions either. But I still needed a path forward.
This is my exploration of the Kernel dying problem and how I found a solution. 🚀
Given the only thing I knew about my issue was that the kernel died, I had to gather more context. From a few other threads, it seemed clear that the kernel was dying because my model required more RAM than my machine had available (8GB), even during prediction.
Now, a very direct solution (the one most people would reach for) is to simply get or rent a GPU via Google Colab or a similar service. And I think that is certainly a viable solution.
But I wanted to know how far could I push my CPU on local ML projects before RAM became a problem. And with that in mind, we’ll need to explore a few aspects of the model and system itself.
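Before diving into the model itself, it's worth confirming the RAM hypothesis directly. Here's a minimal sketch (an illustration, not part of my original debugging session) that uses the psutil package to print the Python process's memory footprint around whatever prediction call you suspect:

import os
import psutil

def rss_gb():
    """Resident memory of the current Python process, in GB."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3

print(f"Before prediction: {rss_gb():.2f} GB")
# _ = model.predict(texts)  # <- swap in your own model and data here
print(f"After prediction:  {rss_gb():.2f} GB")

If that second number climbs toward your total RAM (8GB in my case), you've found your killer.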
Given it was a RAM issue, I figured batch size had a major role to play, so I wanted to stress-test this hyperparameter.
First, I wrote three simplified versions of BERT, changing only the size of the batches the model predicts on. I ran three versions:
- FULL: BERT predicting on the entire input at once
- SINGLE: BERT predicting on a single input at a time
- BATCH (100): BERT predicting in batches of 100 inputs at a time
Code for this below:
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
import numpy as np
import time


class BERT_model_full:
    """
    BERT model predicting on all inputs at once
    """
    def __init__(self):
        self.model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def predict(self, inputs):
        tf_batch = self.tokenizer(inputs, max_length=128, padding=True,
                                  truncation=True, return_tensors='tf')
        tf_outputs = self.model(tf_batch)
        return tf_outputs.logits.numpy()


class BERT_model_batch:
    """
    BERT model predicting on batches of 100 inputs at a time
    """
    def __init__(self):
        self.model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def predict(self, inputs):
        # Predict in fixed-size batches, then stitch the logits back together
        batch_size = 100
        preds = []
        i = 0
        while i < len(inputs):
            j = min(len(inputs), i + batch_size)
            tf_batch = self.tokenizer(inputs[i:j], max_length=128, padding=True,
                                      truncation=True, return_tensors='tf')
            tf_outputs = self.model(tf_batch)
            preds.append(tf_outputs.logits.numpy())
            i = j
        return np.concatenate(preds)


class BERT_model_single:
    """
    BERT model predicting on a single input at a time
    """
    def __init__(self):
        self.model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def predict(self, inputs):
        preds = []
        for text in inputs:
            tf_batch = self.tokenizer([text], max_length=128, padding=True,
                                      truncation=True, return_tensors='tf')
            tf_outputs = self.model(tf_batch)
            preds.append(tf_outputs.logits.numpy())
        return np.concatenate(preds)
I then ran each of these models through the same test cases with increasing input sizes, using the classic IMDB movie review dataset.
size_list = [1, 10, 100, 1000, 2000, 4000, 6000, 8000]
single_time_list = []
batch_time_list = []
full_time_list = []

BERT = BERT_model_single()
print("BERT Single Input:")
for s in size_list:
    data = list(imdb_data.sample(s)['DATA_COLUMN'])
    start = time.time()
    _ = BERT.predict(data)
    end = time.time()
    single_time_list.append(end - start)
    print(f"{s} samples: {(end - start)/60:.2f} minutes")

BERT = BERT_model_batch()
print("\nBERT Small Batch:")
for s in size_list:
    data = list(imdb_data.sample(s)['DATA_COLUMN'])
    start = time.time()
    _ = BERT.predict(data)
    end = time.time()
    batch_time_list.append(end - start)
    print(f"{s} samples: {(end - start)/60:.2f} minutes")

BERT = BERT_model_full()
print("\nBERT Full Batch:")
for s in size_list:
    data = list(imdb_data.sample(s)['DATA_COLUMN'])
    start = time.time()
    _ = BERT.predict(data)
    end = time.time()
    full_time_list.append(end - start)
    print(f"{s} samples: {(end - start)/60:.2f} minutes")
And graphing the output produced an interesting trend:
BATCH outperforming SINGLE made sense, because most machine learning models and packages, TensorFlow included, are designed to take advantage of vectorization.
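For intuition, here's a tiny, unscientific sketch of the effect: one vectorized matmul call vs. a Python loop of row-by-row matmuls on made-up BERT-sized tensors. The exact numbers will vary by machine, but the gap is usually dramatic.

import time
import tensorflow as tf

x = tf.random.normal((1000, 768))   # pretend: 1,000 BERT-sized input vectors
w = tf.random.normal((768, 768))    # pretend: one dense layer's weights

start = time.time()
_ = tf.matmul(x, w)                 # one vectorized call over all 1,000 rows
print(f"Vectorized:  {time.time() - start:.4f} s")

start = time.time()
_ = [tf.matmul(x[i:i+1], w) for i in range(1000)]   # one row at a time
print(f"Python loop: {time.time() - start:.4f} s")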
But what was surprising was how much worse FULL performed against BATCH.
I had assumed that FULL would perform the best, thanks to vectorization, right up until it crashed the kernel from memory constraints. In fact, the memory pressure from even a few thousand examples was so heavy on my laptop that it dramatically increased prediction time.
On larger inputs, FULL actually performed worse than processing one input at a time with no vectorization at all. 🤯
At about 2,000 examples, those RAM requirements start to take a toll on my machine. And what's amazing is that prior to hitting that 2K mark, BATCH and FULL perform almost identically.
Based on the above chart, I assumed using a batch size of 2,000 would yield the best results.
I was wrong.
It seems the best batch size is closer to 1K, because prediction time starts to creep up once we use a 2K batch size:
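If you want to zero in on the sweet spot for your own machine, a finer sweep is easy to sketch. This reuses the model/tokenizer from the classes above and the imdb_data DataFrame, with batch size as a parameter instead of the hard-coded 100:

def predict_in_batches(model, tokenizer, inputs, batch_size):
    # Same loop as BERT_model_batch.predict, but with a configurable batch size
    i = 0
    while i < len(inputs):
        j = min(len(inputs), i + batch_size)
        tf_batch = tokenizer(inputs[i:j], max_length=128, padding=True,
                             truncation=True, return_tensors='tf')
        _ = model(tf_batch)
        i = j

data = list(imdb_data.sample(4000)['DATA_COLUMN'])
BERT = BERT_model_full()   # only borrowing its .model and .tokenizer
for batch_size in [500, 1000, 1500, 2000]:
    start = time.time()
    predict_in_batches(BERT.model, BERT.tokenizer, data, batch_size)
    print(f"batch_size={batch_size}: {(time.time() - start)/60:.2f} minutes")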
The next piece of code I explored was the tokenizer. Given how many parameters that one call contains, I figured it would be a place to optimize as well:
tf_batch = self.tokenizer(inputs, max_length=128,
                          padding=True, truncation=True,
                          return_tensors='tf')
However, when I time-checked my FULL model performance, both on 1K inputs (where it performed on par with BATCH) and on 4K inputs (where it performed significantly worse), tokenizer time was an irrelevant fraction of the total:
1000 samples:
Tokenizer Time: 0.06 minutes
Prediction Time: 1.97 minutes
Tokenizer takes up 3.06% of prediction time

4000 samples:
Tokenizer Time: 0.29 minutes
Prediction Time: 27.25 minutes
Tokenizer takes up 1.06% of prediction time
While tokenizer time did grow slightly faster than the input size (quadrupling the input led to a 4.8x increase in tokenizer time), prediction time increased an astounding 13.8x!
Clearly, the problem is in the .predict() portion of my pipeline.
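For reference, a measurement along these lines is how you can split the two stages apart. This sketch reuses BERT_model_full and imdb_data from above; the percentages compare tokenizer time to prediction time:

BERT = BERT_model_full()
for s in [1000, 4000]:
    data = list(imdb_data.sample(s)['DATA_COLUMN'])

    start = time.time()
    tf_batch = BERT.tokenizer(data, max_length=128, padding=True,
                              truncation=True, return_tensors='tf')
    tokenizer_minutes = (time.time() - start) / 60

    start = time.time()
    _ = BERT.model(tf_batch)
    prediction_minutes = (time.time() - start) / 60

    print(f"{s} samples:")
    print(f"Tokenizer Time: {tokenizer_minutes:.2f} minutes")
    print(f"Prediction Time: {prediction_minutes:.2f} minutes")
    print(f"Tokenizer takes up {100 * tokenizer_minutes / prediction_minutes:.2f}% of prediction time")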
Back in the Stack Overflow threads I had found earlier, the most upvoted solution was to downgrade TensorFlow to speed up prediction.
I thought this was a questionable solution, as I assumed newer versions would have more optimizations and better runtimes, not worse. But I still tried it out.
Going to the tensorflow PyPI page, we can see older versions of the package. Choosing releases spaced approximately one year apart, we get the following versions:
- 2.10.0, released Sept 2022
- 2.6.1, released Nov 2021
- 1.15.4, released Sept 2020
- 1.15.0, released Oct 2019
To iteratively install different versions of the same package, we can use the os package, which lets us run terminal commands from Python code:
import os

data = list(imdb_data.sample(4000)['DATA_COLUMN'])
full_time_list = []
versions = ["2.10.0", "2.6.1", "1.15.4", "1.15.0"]

for version in versions:
    print(version, ":")
    os.system(f"pip install tensorflow=={version}")

    try:
        from transformers import BertTokenizer, TFBertForSequenceClassification
        import tensorflow as tf
    except ImportError:
        print("Cannot import relevant packages")
        continue

    BERT = BERT_model_full()
    start = time.time()
    _ = BERT.predict(data)
    end = time.time()
    minutes = (end - start) / 60
    full_time_list.append(minutes)
    print(f"TF {version}: {minutes:.2f} minutes")
- The try/except clause is in there because we don't know whether these functions existed in earlier versions of the packages. Luckily, they all do.
- Putting import statements inside a loop looks wrong, but it is necessary here: we need to re-import the functions once the correct package version has been installed.
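One caveat worth adding: Python caches imports, so a bare import inside the loop can silently hand you back the version that was already loaded. The safest move is to restart the kernel between versions; short of that, you can at least verify (and best-effort reload) the version you're actually running, roughly like this:

import importlib
import tensorflow as tf

tf = importlib.reload(tf)   # best-effort reload of the cached module
print("Running against TensorFlow", tf.__version__)   # should match the version just installed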
After iterating through each version, we find that downgrading TensorFlow improves runtime by as much as 15%!
My pet theory is that newer versions of TensorFlow are built assuming heavy GPU usage, so they are optimized for that use case at the cost of local CPU performance.
If anyone has the real answer as to why older versions of TF run faster, please let me know!
With the following insights about TensorFlow runtime:
- The optimal prediction batch size is about 1,000
- Tokenizer parameters do not play a large role in prediction time
- TensorFlow 1.x.x gives roughly a 15% improvement in prediction time
We can put all these together, and see how it performs against our original batch-size experiment:
In the largest case tested, our optimal run beats BATCH (100) by 20% and SINGLE by 57%!
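For completeness, the "optimal" setup is nothing exotic: it's just the batch predictor from earlier with a ~1,000-example batch size, run against a TensorFlow 1.15.x install. A condensed sketch (assuming a transformers version whose TF models return outputs with a .logits attribute):

import numpy as np
from transformers import BertTokenizer, TFBertForSequenceClassification

class BERT_model_optimal:
    """Batch predictor with the ~1K batch size that worked best on my CPU."""
    def __init__(self, batch_size=1000):
        self.model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.batch_size = batch_size

    def predict(self, inputs):
        preds = []
        i = 0
        while i < len(inputs):
            j = min(len(inputs), i + self.batch_size)
            tf_batch = self.tokenizer(inputs[i:j], max_length=128, padding=True,
                                      truncation=True, return_tensors='tf')
            preds.append(self.model(tf_batch).logits.numpy())
            i = j
        return np.concatenate(preds)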
Overall, this exercise was a simple and enjoyable expression of what it means to be a Data Scientist: you identify a problem, establish a hypothesis, develop a rigorous test, and analyze your results. In this case, the problem was my TensorFlow prediction runtime. In the future, I'm sure you will find perplexing data/issues/problems in your own work.
And next time, I hope rather than check Stack Overflow and give up if the answer isn’t there, you roll up your sleeves and explore the problem space yourself. You never know what you might learn 💡
Hope this was helpful in debugging your TensorFlow prediction time issues! 🎉