Deploy large language models with bnb-Int8 for Hugging Face
In this tutorial we will deploy BigScience’s BLOOM model, one of the most impressive large language models (LLMs), to an Amazon SageMaker endpoint. To do so, we will leverage the bitsandbytes (bnb) Int8 integration for models from the Hugging Face (HF) Hub. With these Int8 weights we can run large models that previously wouldn’t fit into GPU memory.
The code for this tutorial can be found in this GitHub repo. Please note that the bnb-Int8 integration for HF is currently in a public beta, whose purpose is to collect bugs that occur across different models and setups. Keep this in mind before using it in a production environment. You can find more information about the beta program here.
Disclaimer: The purpose of this tutorial is to walk through the steps of setting up bnb-Int8 + HF. Because of that we will only deploy the 3B version of BLOOM, which could easily be hosted without the bnb-Int8 integration. The steps for deploying larger models will be the same but it will take significantly longer because of the time it takes to download them.
Ever since OpenAI released GPT-3 in 2020, LLMs have taken the world by storm. Within weeks plenty of impressive demos were created, which can be found on the Awesome GPT-3 website. Very quickly it became clear that these LLMs are one of the most important discoveries in Natural Language Processing (NLP). This is because these models have impressive zero-shot performance, i.e. they are ready to go without any further model training. They are also very versatile (from writing legal text to code) and multilingual.
As a result, more and more LLMs were created by different organisations in an attempt to improve on their performance and/or to make them open-source and more transparent (GPT-3 is a proprietary model). One of the latest and perhaps most important of these models is BigScience’s BLOOM. From their website:
Large language models (LLMs) have made a significant impact on AI research. These powerful, general models can take on a wide variety of new language tasks from a user’s instructions. However, academia, nonprofits and smaller companies’ research labs find it difficult to create, study, or even use LLMs as only a few industrial labs with the necessary resources and exclusive rights can fully access them. Today, we release BLOOM, the first multilingual LLM trained in complete transparency, to change this status quo — the result of the largest collaboration of AI researchers ever involved in a single research project.
But the fact that these new LLMs are open-source doesn’t mean that we can just download them and use them on our laptops. These models require a significant amount of disk space, RAM, and GPU power to run. This is why initiatives like bitsandbytes are important: they use nifty techniques to reduce the hardware requirements for LLMs and make it possible to run them with decent latency and at reasonable cost:
There are several ways to reduce the size and memory footprint of large AI models — some of the most important ones are knowledge distillation, weight pruning, and quantization. Note that these are not mutually exclusive and can be combined with each other. For a great example on how these techniques can be used, see chapter 8 of the NLP with Transformers book.
That being said, the bnb team focuses on the last technique, quantization. They have introduced a particular technique called block-wise quantization which you can read more about here (the paper goes slightly beyond the scope of this tutorial 😉). Their results are impressive:
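To build some intuition for quantization in general, here is a toy absmax Int8 scheme in NumPy. This is a deliberate simplification of what bnb actually does (their method adds finer-grained scaling and mixed-precision handling of outliers), and the function names are made up for illustration:

```python
import numpy as np

def absmax_quantize(x: np.ndarray):
    """Map floats to Int8 by scaling with 127 / max|x| and rounding."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float values from the Int8 representation."""
    return q.astype(np.float32) / scale

weights = np.array([0.1, -0.5, 0.25], dtype=np.float32)
q, scale = absmax_quantize(weights)
approx = dequantize(q, scale)  # close to the original values
```

Each float32 weight now occupies a single byte, a 4x storage saving, at the cost of a small rounding error.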
So, let’s get started with deploying the BLOOM-3B model to an Amazon SageMaker (SM) endpoint. Our game plan is as follows:
- Downloading the model from the HF Model Hub
- Writing a custom inference script
- Packaging everything in a model.tar.gz file and uploading it to S3
- Deploying the model to an endpoint
- Testing the model
Downloading the model
We can download all model files using Git LFS like so:
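A sketch of this step in Python, wrapping the Git LFS clone (the repo id bigscience/bloom-3b follows the Hub’s URL scheme, and git-lfs must be installed locally):

```python
import subprocess

def clone_command(repo_id: str) -> list[str]:
    """Build the git clone command for a Hugging Face Hub repository."""
    # Git LFS fetches the large weight files during the clone
    return ["git", "clone", f"https://huggingface.co/{repo_id}"]

# One-time Git LFS setup, then clone (uncomment to run; the weights are large):
# subprocess.run(["git", "lfs", "install"], check=True)
# subprocess.run(clone_command("bigscience/bloom-3b"), check=True)
```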
This will download the files to our local machine.
Writing a custom inference script
AWS & Hugging Face have developed the SageMaker Hugging Face Inference Toolkit, which makes it easy to deploy HF models on SM for inference. All we need to do is write an inference script (plus a requirements.txt file) that defines how we want to load the model and generate predictions:
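A minimal sketch of such an inference.py. The model_fn/predict_fn names are the toolkit’s standard overrides; the exact handling of the generation parameters is an assumption on my part:

```python
# inference.py -- sketch of a custom inference script for the
# SageMaker Hugging Face Inference Toolkit
from transformers import AutoModelForCausalLM, AutoTokenizer

def model_fn(model_dir):
    """Load the model and tokenizer from the unpacked model.tar.gz."""
    # load_in_8bit=True activates the bnb-Int8 integration;
    # device_map="auto" places the weights on the available GPU(s)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, device_map="auto", load_in_8bit=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    """Generate text; every key besides 'inputs' is passed on to generate()."""
    model, tokenizer = model_and_tokenizer
    text = data.pop("inputs")
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids, **data)
    return {"generated_text": tokenizer.decode(output[0], skip_special_tokens=True)}
```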
Note that we pass the parameter load_in_8bit=True to enable the bnb-Int8 integration.
We also need a requirements.txt file that ensures that the required modules will be installed in the inference container image:
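One plausible layout for that requirements.txt (the exact dependency set is an assumption; accelerate and bitsandbytes are what the Int8 integration relies on):

```
accelerate
bitsandbytes
git+https://github.com/huggingface/transformers.git
```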
The last line ensures that the latest version of the transformers library will be installed.
Uploading to S3
Now we package everything in a model.tar.gz file and upload that file to S3:
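A sketch of the packaging step (the directory names are assumptions; the inference toolkit expects the script and requirements.txt in a code/ folder inside the archive):

```python
import tarfile
from pathlib import Path

def package_model(model_dir: str, code_dir: str, out_path: str = "model.tar.gz") -> str:
    """Bundle the model files and the inference code into model.tar.gz."""
    with tarfile.open(out_path, "w:gz") as tar:
        for f in Path(model_dir).iterdir():
            tar.add(f, arcname=f.name)     # model weights/config at the archive root
        tar.add(code_dir, arcname="code")  # inference.py + requirements.txt
    return out_path

# Upload with the SageMaker SDK, for example:
# import sagemaker
# model_uri = sagemaker.Session().upload_data("model.tar.gz", key_prefix="bloom-3b")
```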
Deploying the model
Now we create a representation of the model that points to the S3 location and deploy the model with one command:
This will take a few minutes.
Testing the model
After the model has been deployed successfully, we can test it. We wrote our inference script in such a way that we can pass any and all parameters for text generation; a detailed list of these parameters can be found here. Let’s use sampling and a temperature of 0.5:
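The request could look like this (the prompt and the token limit are made-up example values; predictor is the object returned by the deployment step):

```python
# Every key besides "inputs" is forwarded to generate() by the inference script
payload = {
    "inputs": "SageMaker is",  # hypothetical example prompt
    "do_sample": True,         # sample instead of greedy decoding
    "temperature": 0.5,        # lower values make the output less random
    "max_new_tokens": 50,      # cap the length of the generated text
}
# result = predictor.predict(payload)
```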
It seems to work, but we might want to adjust the parameters a bit more. Not sure if bnb-Int8 really works best on my Android phone! 😂
In this tutorial we deployed BigScience’s BLOOM-3B model to a SageMaker endpoint, leveraging the bnb-Int8 integration. From here the next step could be to deploy even larger models; do let me know if you try it out and what your experience is. And if you encounter any unexpected behaviour with the bnb-Int8 integration, remember that it’s still in beta, so please let the team know of any bugs.