But it works on my machine?
This is a classic meme in the tech community, especially for Data Scientists who want to ship their amazing machine-learning model, only to learn that the production machine has a different operating system. Far from ideal.
There is a solution thanks to these wonderful things called containers and tools to control them such as Docker.
In this post, we will dive into what containers are and how you can build and run them using Docker. Containers and Docker have become an industry standard and common practice for data products, so as a Data Scientist, learning these tools is an invaluable addition to your arsenal.
Docker is a service that helps you build, run, and manage code and applications in containers.
Now you may be wondering, what is a container?
Conceptually, a container is very similar to a virtual machine (VM). It is a small isolated environment where everything is self-‘contained’ and can be run on any machine. The primary selling point of containers and VMs is their portability, allowing your application or model to run seamlessly on any on-premise server, local machine, or cloud platform such as AWS.
The main difference between containers and VMs is how they use their host machine’s resources. Containers are a lot more lightweight as they do not actively partition the hardware resources of the host machine. I will not delve into the full technical details here; however, if you want to understand a bit more, I have linked a great article explaining their differences here.
Docker is then simply a tool we use to create, manage and run these containers with ease. It is one of the main reasons why containers have become very popular, as it enables developers to easily deploy applications and models that run anywhere.
There are three main elements we need to run a container using Docker:
- Dockerfile: A text file that contains the instructions for how to build a Docker image.
- Docker Image: A blueprint or template to create a Docker container.
- Docker Container: An isolated environment that provides everything an application or machine learning model needs to run. Includes things such as dependencies and OS versions.
There are also a few other key points to note:
- Docker Daemon: A background process (daemon) that handles incoming requests to Docker.
- Docker Client: A shell interface that enables the user to communicate with Docker through its daemon.
- DockerHub: Similar to GitHub, a place where developers can share their Docker images.
The first thing you should install is Homebrew (link here). This is dubbed the ‘missing package manager for MacOS’ and is very useful for anyone coding on their Mac.
To install Homebrew, simply run the command given on their website:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Verify Homebrew is installed by running brew --version.
Now, with Homebrew installed, you can install Docker by running brew install docker. Verify Docker is installed by running which docker; this should print the path to the docker binary without raising any errors.
The final part is to install Colima. Simply run brew install colima and verify it is installed with which colima. Again, the output should be the path to the colima binary.
Now you might be wondering, what on earth is Colima?
Colima is a software package that enables container runtimes on MacOS. In more layman’s terms, Colima creates the environment for containers to work on our system. To achieve this, it runs a Linux virtual machine with a daemon that Docker can communicate with using the client-server model.
Alternatively, you can install Docker Desktop instead of Colima. However, I prefer Colima for a few reasons: it’s free, more lightweight, and I like working in the terminal!
See this blog post here for more arguments in favour of Colima.
Below is an example of how Data Scientists and Machine Learning Engineers can deploy their model using Docker:
The first step is obviously to build the model. Then, you capture everything needed to run it, such as the Python version and package dependencies, in a requirements file. The final step is to reference that requirements file inside the Dockerfile.
If this all seems a bit abstract at the moment, don’t worry, we will go over this process step by step!
Let’s start by building a basic model. The provided code snippet displays a simple implementation of the Random Forest classification model on the famous Iris dataset:
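Here is a minimal sketch of such a script (the train/test split ratio, random seed, and hyperparameters below are illustrative choices, not a prescribed setup):

```python
# basic_rf_model.py
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# Hold out a test set so we can report an accuracy figure
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a Random Forest classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out data
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2f}")
```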
This file is called basic_rf_model.py for reference.
Create Requirements File
Now that we have our model ready, we need to create a requirements.txt file to house all the dependencies that underpin the running of our model. In this simple example, we luckily only rely on the scikit-learn package. Therefore, our requirements.txt will simply look like this:
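(The pinned version below is illustrative; use whichever version you have installed.)

```
scikit-learn==1.2.2
```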
You can check the version installed on your computer by running the pip show scikit-learn command.
Now we can finally create our Dockerfile!
So, in the same directory as basic_rf_model.py, create a file named Dockerfile with the following contents:
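Putting together the instructions described in the line-by-line walkthrough below, the Dockerfile reads:

```dockerfile
FROM python:3.9
MAINTAINER email@example.com
WORKDIR /src
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "basic_rf_model.py"]
```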
Let’s go over it line by line to see what it all means:
- FROM python:3.9: This is the base image for our image
- MAINTAINER email@example.com: This indicates who maintains this image
- WORKDIR /src: Sets the working directory of the image to src
- COPY . .: Copies the current directory’s files into the Docker image
- RUN pip install -r requirements.txt: Installs the dependencies from the requirements.txt file into the Docker environment
- CMD ["python", "basic_rf_model.py"]: Tells the container to execute the command python basic_rf_model.py and run the model
Initiate Colima & Docker
The next step is to set up the Docker environment. First, we need to boot up Colima:
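The following command starts the Colima virtual machine, which provides the Linux environment and daemon that Docker talks to:

```
colima start
```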
After Colima has started up, check that the Docker commands are working by running:
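A simple sanity check is to list the running containers (an empty list is expected at this point):

```
docker ps
```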
It should return something like this:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
This is good and means both Colima and Docker are working as expected!
The docker ps command lists all currently running containers.
Now it is time to build our first Docker image from the Dockerfile that we created above:
docker build . -t docker_medium_example
The -t flag specifies the name of the image, and the . tells Docker to build from the current directory.
If we now run docker images, we should see something like this:
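The exact values will differ on your machine (the creation time and size here are illustrative), but the output has this shape:

```
REPOSITORY              TAG       IMAGE ID       CREATED          SIZE
docker_medium_example   latest    bb59f770eb07   1 minute ago     1.02GB
```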
Congrats, the image has been built!
After the image has been created, we can run it as a container using the IMAGE ID listed above:
docker run bb59f770eb07
The container starts, runs the basic_rf_model.py script as instructed by the CMD line, prints the script’s output, and then exits.
This tutorial is just scratching the surface of what Docker can do and be used for. There are many more features and commands worth learning. A great detailed tutorial is available on the Docker website, which you can find here.
One cool feature is that you can run the container in interactive mode and go into its shell. For example, if we run:
docker run -it bb59f770eb07 /bin/bash
You will enter the Docker container and it should look something like this:
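The hostname in the prompt will vary (it is the container ID), but a session might look like:

```
root@bb3f7aceb34c:/src# ls
Dockerfile  basic_rf_model.py  requirements.txt
root@bb3f7aceb34c:/src# exit
```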
We also used the ls command to show all the files in the Docker working directory.
Docker and containers are fantastic tools for ensuring that Data Scientists’ models can run anywhere and anytime without issues. They do this by creating small, isolated compute environments, called containers, that hold everything the model needs to run effectively. Docker is easy to use and lightweight, which is why it has become common industry practice. In this article, we went over a basic example of how you can package your model into a container using Docker. The process was simple and seamless, so it is something Data Scientists can pick up quickly.
Full code used in this article can be found at my GitHub here:
(All emojis designed by OpenMoji — the open-source emoji and icon project. License: CC BY-SA 4.0)