
Exploring TensorFlow Model Prediction Issues | by Adam Brownell | Feb, 2023



Steps to debug BERT’s (and other LLMs’) slow prediction times on a personal computer

This all started when I was playing around with BERT models, and I got the ominous message all Data Scientists hope to avoid:

The dreaded “Kernel Died” message 💀

This happened to me while I was running my TensorFlow BERT model in my Jupyter Notebook. Training large language models (LLMs) notoriously takes a large amount of data and compute, so it could make sense for my comparatively puny laptop to crash here…

… except this crash occurred during prediction, rather than training, which was strange given my assumption that more memory was used during training than prediction.

The “Kernel Died” error is unfortunately not very descriptive, and debugging line-by-line through the TensorFlow source sounded like a daunting exercise.

A few quick searches around Stack Overflow did not completely answer my outstanding questions either. But I still needed a path forward.

This is my exploration of the Kernel dying problem and how I found a solution. 🚀

Given that the only thing I knew about my issue was that the kernel died, I had to gather more context. From a few other threads, it seemed clear that the kernel was dying because my model's predictions required more RAM than my machine could provide (8GB), even during prediction.

Now, a very direct solution (which almost everyone would suggest) is to simply get or rent a GPU via Google Colab or something like that. And I think that is certainly a viable solution.

But I wanted to know how far I could push my CPU on local ML projects before RAM became a problem. With that in mind, we'll need to explore a few aspects of the model and the system itself.
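Before going further, one way to confirm that RAM (rather than something else) is what's killing the kernel is to log memory usage around the prediction call. A minimal sketch, assuming the psutil package is installed:

import psutil

def log_memory(label):
    """Print the current process's memory footprint and the system's available RAM, in GB."""
    process_gb = psutil.Process().memory_info().rss / 1e9
    available_gb = psutil.virtual_memory().available / 1e9
    print(f"{label}: process using {process_gb:.2f} GB, {available_gb:.2f} GB still free")

# Hypothetical usage around a prediction call:
# log_memory("before predict")
# _ = BERT.predict(data)
# log_memory("after predict")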

Given it was a RAM issue, I figured batch size had a major role to play, so I wanted to stress-test this hyperparameter.

First, I wrote three simplified versions of BERT, changing only the size of the batches the model was predicting on. I ran three versions:

  • FULL: BERT predicting on the entire input at once
  • SINGLE: BERT predicting on a single input at a time
  • BATCH (100): BERT predicting in batches of 100 inputs at a time

Code for this below:

from transformers import BertTokenizer, BertForSequenceClassification, TFBertForSequenceClassification
import tensorflow as tf


class BERT_model_full:
    """BERT model predicting on all inputs at once."""

    def __init__(self):
        self.model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def predict(self, inputs):
        # Tokenize and predict on the entire input list in a single call
        tf_batch = self.tokenizer(inputs, max_length=128, padding=True,
                                  truncation=True, return_tensors='tf')
        tf_outputs = self.model(tf_batch)
        return tf_outputs.logits.numpy()


class BERT_model_batch:
    """BERT model predicting on batches of 100 inputs at a time."""

    def __init__(self):
        self.model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def predict(self, inputs):
        # Tokenize and predict 100 inputs at a time, collecting the logits
        batch_size = 100
        logits = []
        i = 0
        while i < len(inputs):
            j = min(len(inputs), i + batch_size)
            tf_batch = self.tokenizer(inputs[i:j], max_length=128, padding=True,
                                      truncation=True, return_tensors='tf')
            tf_outputs = self.model(tf_batch)
            logits.append(tf_outputs.logits.numpy())
            i = j
        return logits


class BERT_model_single:
    """BERT model predicting on a single input at a time."""

    def __init__(self):
        self.model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def predict(self, inputs):
        # Tokenize and predict each input on its own, collecting the logits
        logits = []
        for text in inputs:
            tf_batch = self.tokenizer([text], max_length=128, padding=True,
                                      truncation=True, return_tensors='tf')
            tf_outputs = self.model(tf_batch)
            logits.append(tf_outputs.logits.numpy())
        return logits

I then ran each of these models on the same test cases, with increasing input sizes, using the classic IMDB reviews dataset.

import time

size_list = [1, 10, 100, 1000, 2000, 4000, 6000, 8000]
single_time_list = []
batch_time_list = []
full_time_list = []

# imdb_data is a pandas DataFrame of IMDB reviews, with the review text in 'DATA_COLUMN'

BERT = BERT_model_single()
print("BERT Single Input:")
for s in size_list:
    data = list(imdb_data.sample(s)['DATA_COLUMN'])

    start = time.time()
    _ = BERT.predict(data)
    end = time.time()

    single_time_list.append(end - start)
    print(f"{s} samples: {(end - start) / 60:.2f} minutes")

BERT = BERT_model_batch()
print("\nBERT Small Batch:")
for s in size_list:
    data = list(imdb_data.sample(s)['DATA_COLUMN'])

    start = time.time()
    _ = BERT.predict(data)
    end = time.time()

    batch_time_list.append(end - start)
    print(f"{s} samples: {(end - start) / 60:.2f} minutes")

BERT = BERT_model_full()
print("\nBERT Full Batch:")
for s in size_list:
    data = list(imdb_data.sample(s)['DATA_COLUMN'])

    start = time.time()
    _ = BERT.predict(data)
    end = time.time()

    full_time_list.append(end - start)
    print(f"{s} samples: {(end - start) / 60:.2f} minutes")

And graphing the output produced an interesting trend:

BATCH outperforming SINGLE made sense, because most Machine Learning models and packages like Tensorflow are designed to take advantage of vectorization.

But what was surprising was how much worse FULL performed against BATCH.

I had assumed that FULL would perform best, thanks to vectorization, right up until it crashed the kernel from memory constraints. In fact, the memory pressure from even a few thousand examples was so severe on my laptop that it dramatically increased prediction time.

FULL actually performed worse than processing one input at a time without any vectorization on larger inputs. 🤯

At about 2,000 examples, these RAM requirements start to take a toll on my machine. And what's striking is that before hitting that 2K mark, BATCH and FULL perform almost identically.

Based on the above chart, I assumed using a batch size of 2,000 would yield the best results.

I was wrong.

It seems the best batch size is closer to 1K, because prediction time starts to creep up if we use a 2K batch size:

batch size impact on prediction time for 4K inputs
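For reference, here is a minimal sketch of how that batch-size sweep can be run. It assumes a hypothetical batch_size argument added to BERT_model_batch's constructor (the class above hardcodes 100) and reuses the same imdb_data DataFrame:

import time

data = list(imdb_data.sample(4000)['DATA_COLUMN'])

for bs in [100, 500, 1000, 2000, 4000]:
    # Hypothetical batch_size constructor argument
    BERT = BERT_model_batch(batch_size=bs)
    start = time.time()
    _ = BERT.predict(data)
    minutes = (time.time() - start) / 60
    print(f"batch size {bs}: {minutes:.2f} minutes")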

The next piece of code I explored was the Tokenizer. Given how many hyperparameters the line contained, I figured it would be a place to optimize as well:

tf_batch = self.tokenizer(inputs, max_length=128, padding=True,
                          truncation=True, return_tensors='tf')

However, when I time-checked my FULL Model performance, both on 1K inputs where it performed on-par with BATCH, and on 4K where it performed significantly worse, Tokenizer performance time was an irrelevant fraction of total time:

1000 samples:
Tokenizer Time: 0.06 minutes
Prediction Time: 1.97 minutes
Tokenizer takes up 3.06% of prediction time

4000 samples:
Tokenizer Time: 0.29 minutes
Prediction Time: 27.25 minutes
Tokenizer takes up 1.06% of prediction time

While the Tokenizer's time did grow slightly faster than the input size (quadrupling the input led to a 4.8x increase in Tokenizer time), prediction time increased an astounding 13.8x!

Clearly, the problem is in the .predict() portion of my pipeline.
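For reference, this is roughly how that split can be measured. A minimal sketch, assuming the same tokenizer and model objects used in the classes above:

import time

def time_tokenize_vs_predict(model, tokenizer, inputs):
    """Time the tokenizer and the model call separately for one full pass."""
    start = time.time()
    tf_batch = tokenizer(inputs, max_length=128, padding=True,
                         truncation=True, return_tensors='tf')
    tokenize_minutes = (time.time() - start) / 60

    start = time.time()
    _ = model(tf_batch)
    predict_minutes = (time.time() - start) / 60

    print(f"Tokenizer Time: {tokenize_minutes:.2f} minutes")
    print(f"Prediction Time: {predict_minutes:.2f} minutes")
    print(f"Tokenizer takes up {100 * tokenize_minutes / predict_minutes:.2f}% of prediction time")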

Going back to the Stack Overflow threads from my earlier searches, the most upvoted solution was to downgrade Tensorflow to speed up prediction.

I thought this was a questionable solution, as I assumed that upgraded versions would have more optimizations and better runtimes, not worse. But I still tried it out.

Going to the tensorflow PyPI page, we can see older versions of the package. Choosing releases spaced roughly one year apart, we get the following versions:

  • 2.10.0, released Sept 2022
  • 2.6.1, released Nov 2021
  • 1.15.4, released Sept 2020
  • 1.15.0, released Oct 2019

To iteratively install different versions of the same package, we can use the os module, which lets us run terminal commands from Python code:

import os
import time

data = list(imdb_data.sample(4000)['DATA_COLUMN'])
full_time_list = []
versions = ["2.10.0", "2.6.1", "1.15.4", "1.15.0"]

for version in versions:
    print(version, ":")
    # Install the target tensorflow version before timing a full-batch run
    os.system(f"pip install tensorflow=={version}")

    try:
        from transformers import BertTokenizer, BertForSequenceClassification, TFBertForSequenceClassification
        import tensorflow as tf
    except ImportError:
        print("Cannot import relevant packages")
        continue

    BERT = BERT_model_full()

    start = time.time()
    _ = BERT.predict(data)
    end = time.time()

    minutes = (end - start) / 60
    full_time_list.append(minutes)
    print(f"{version}: {minutes:.2f} minutes")

  • The try/except clause is in there because we don't know whether these classes exist in earlier versions of the package. Luckily, they all do
  • import statements within a loop look wrong, but the intent is to re-import the functions once the correct package version has been installed. Note that Python caches imports, so in practice each version is most reliably timed in a fresh kernel or interpreter (see the sketch below)
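If you want to be certain each timing run really uses the version just installed, a minimal sketch of the fresh-interpreter approach, assuming the timing code is moved into a hypothetical time_bert.py script:

import subprocess
import sys

versions = ["2.10.0", "2.6.1", "1.15.4", "1.15.0"]

for version in versions:
    # Install the target version, then run the timing script in a brand-new
    # Python process so the freshly installed tensorflow is the one imported.
    subprocess.run([sys.executable, "-m", "pip", "install", f"tensorflow=={version}"], check=True)
    subprocess.run([sys.executable, "time_bert.py"], check=True)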

After iterating through each version, we find out that downgrading tensorflow improves runtime by as much as 15%!

My pet theory as to why this is the case is that newer versions of tensorflow are built assuming heavy GPU usage, which means they are optimized for that use case at the cost of local CPU performance.

If anyone has the real answer as to why older versions of TF run faster, please let me know!

With the following insights about Tensorflow runtime:

  • The optimal prediction batch-size is about 1,000
  • Tokenizer parameters do not play a large role in prediction time
  • Tensorflow 1.x gives roughly a 15% boost in prediction speed

We can put all these together, and see how it performs against our original batch-size experiment:

In the largest case tested, our optimal run beats the Batch(100) by 20% and Single by 57%!
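For reference, a minimal sketch of what that combined setup looks like, reusing the batching logic from earlier with a 1,000-input batch size (and assuming a Tensorflow version where the transformers classes import cleanly, as they did in the experiment above):

import numpy as np
from transformers import BertTokenizer, TFBertForSequenceClassification


def predict_optimized(inputs, batch_size=1000):
    """Predict with BERT in ~1,000-input batches, the sweet spot found above."""
    model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    all_logits = []
    i = 0
    while i < len(inputs):
        j = min(len(inputs), i + batch_size)
        tf_batch = tokenizer(inputs[i:j], max_length=128, padding=True,
                             truncation=True, return_tensors='tf')
        all_logits.append(model(tf_batch).logits.numpy())
        i = j

    return np.concatenate(all_logits, axis=0)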

Overall, this exercise was a simple and enjoyable expression of what it means to be a Data Scientist. You need to identify a problem, establish a hypothesis, develop a rigorous test, and analyze your results. In this case, the problem was my Tensorflow runtime. In the future, I'm sure you will find perplexing data/issues/problems in your own work.

And next time, rather than checking Stack Overflow and giving up if the answer isn't there, I hope you roll up your sleeves and explore the problem space yourself. You never know what you might learn 💡

Hope this was helpful in debugging your Tensorflow prediction time issues! 🎉


