A step-by-step guide to building and deploying a Flask app
Ever since Stability.ai released Stable Diffusion (their open-sourced text-to-image model) just a few short weeks ago, the ML community has been crazed about the doors that it opens. As an open-sourced alternative to OpenAI’s gated DALL·E 2 with comparable quality, Stable Diffusion offers something to everyone: end-users can generate images virtually for free, developers can embed the model into their service, ML engineers can investigate and modify the code, and researchers have full leeway to push the state of the art even further.
Despite the avalanche of tutorials on how to leverage Stable Diffusion, I couldn’t find a verified recipe for hosting the model myself. My goal is to issue HTTP requests to my own service from the comfort of my browser. No credit limits, no login hassle, nobody spying on my images. So I embarked on a day-long quest to build and deploy a Stable Diffusion webserver on Google Cloud.
This article includes all the painful little details I had to figure out, hoping that it saves you time. Here are the high-level steps (we will dive deeper into each one of them below):
- Make sure you have enough GPU quota
- Create a virtual machine with a GPU attached
- Download Stable Diffusion and test inference
- Bundle Stable Diffusion into a Flask app
- Deploy and make your webserver publicly accessible
Since GPUs are still not cheap, Google is running a tight ship when it comes to its GPU fleet, provisioning its limited supply to those who need it the most, and those who are willing to pay. By default, free trial accounts do not have GPU quota. To check your GPU quota:
Navigation (hamburger menu) > IAM & Admin > Quotas, then CTRL+F for “GPUs (all regions)”. If your limit is 0 or your current usage percentage is 100%, you will need to request additional quota. Otherwise, you can skip to step 2 below.
To increase your quota, select the “GPUs (all regions)” row, then click on the EDIT QUOTAS button (top-right of the console). For this tutorial, you will need a single GPU, so increase your quota by 1. Note that you will have to include a justification in your request — make sure you provide an explanation for why a CPU cannot satisfy your need. My initial request, which only included a wishy-washy note, was rejected. In my second (and successful) attempt, I explicitly communicated that I’m working with a big ML model that requires a GPU. Note that, if your request is reviewed by a human, it might take 2–3 business days; if you follow up on the ticket and explain your urgency, they might respond faster.
Once you have GPU quota, you can now create a virtual machine (VM) instance with a GPU attached.
From the navigation (hamburger menu):
Compute Engine > VM instances, then click CREATE INSTANCE (top left of the console). For general instructions on how to fill in this form, you can follow this official guide; here I will focus on the settings that are particularly relevant for running Stable Diffusion:
- Series: Select N1.
- Machine type: Select n1-standard-4. This is the cheapest option with enough memory (15GB) to load Stable Diffusion. Unfortunately, the next cheapest option (7.5GB) is not enough, and you will run out of memory when loading the model and transferring it to the GPU.
- GPU type: Expand CPU PLATFORM AND GPU and click the ADD GPU button. Select NVIDIA Tesla T4 — this is the cheapest GPU and it does the job (it has 16GB of VRAM, which meets Stable Diffusion’s requirement of 10GB). If curious, take a look at the comparison chart and the pricing chart. Note that you could make the GPU preemptible to get a better price (i.e. Google will reclaim it whenever it needs it for higher-priority jobs), but I personally find that frustrating even when just playing around.
- Image: Scroll down to Boot disk and click on SWITCH IMAGE. For the operating system, select Deep Learning on Linux; for the version, select Debian 10 based Deep Learning VM with CUDA 11.0 M95.
- Access: Assuming that you’ll want to make your server publicly available: (a) under Identity and API access, select Allow default access, and (b) under Firewall, check Allow HTTP traffic and Allow HTTPS traffic.
Finally, click the CREATE button. Note that this configuration can get quite pricey (the monthly estimate is ~$281 at the time of writing).
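For reference, the same VM can also be created from the command line with gcloud. This is only a sketch of the settings above: the instance name and zone are placeholders, and the image family is an assumption that you should double-check against `gcloud compute images list --project deeplearning-platform-release`:

```shell
# Create an n1-standard-4 VM with a single T4 GPU and a Deep Learning image.
gcloud compute instances create stable-diffusion-vm \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --image-family=common-cu110 \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE \
  --metadata=install-nvidia-driver=True
```

The --maintenance-policy=TERMINATE flag is required for GPU instances, since VMs with attached GPUs cannot be live-migrated during host maintenance.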
Once the VM instance is created, access it via SSH:
gcloud compute ssh --zone <your-zone> <your-instance-name>
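Once you are on the VM, you can sanity-check that the attached GPU meets Stable Diffusion’s 10GB VRAM requirement. A minimal sketch (the helper name is my own; the torch import is guarded so the helper also works on machines without torch installed):

```python
def has_enough_vram(total_bytes, required_gb=10.0):
    # Return True if the device's total memory covers the VRAM requirement.
    return total_bytes >= required_gb * 1024**3

try:
    import torch  # Pre-installed on the Deep Learning VM image.
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(props.name, has_enough_vram(props.total_memory))
except ImportError:
    pass  # torch not available; the helper is still usable on its own.
```

On the T4 from the previous step, this should report roughly 16GB of total memory, comfortably above the 10GB requirement.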
Next, let’s verify that you can run Stable Diffusion inference locally. First, download the necessary artifacts:
# Clone the public Github repository.
git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion
# Create a Python virtual environment.
conda env create -f environment.yaml
conda activate ldm
We will use HuggingFace’s diffusers library to test inference. Create a new file called inference.py with the following contents:

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

assert torch.cuda.is_available()

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    image = pipe(prompt)["sample"][0]
image.save("astronaut_rides_horse.png")
Next, log into HuggingFace via the console, then run the inference script:

huggingface-cli login
# Enter the access token from your HuggingFace account.
python inference.py
This invocation might fail and direct you to a HuggingFace link, where you are expected to accept the terms and conditions of using Stable Diffusion (they just want you to acknowledge you’re not evil). Once you check that box, re-run the inference code (which should take about 15 seconds) and make sure you can find the generated image under astronaut_rides_horse.png. To download it onto your machine to view it, you can use gcloud compute scp.
Now that you have verified that inference works correctly, we will build a webserver as a Flask app. On each query, the server will read the prompt parameter, run inference using the Stable Diffusion model, and return the generated image. To get started, install Flask and create a directory for the app:
pip install Flask
cd ~; mkdir flask_app
Paste this simple Flask app in a file called app.py:
import io
import torch
from flask import Flask, request, send_file
from torch import autocast
from diffusers import StableDiffusionPipeline

app = Flask(__name__)

assert torch.cuda.is_available()
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,
).to("cuda")

def run_inference(prompt):
    with autocast("cuda"):
        image = pipe(prompt)["sample"][0]
    img_data = io.BytesIO()
    image.save(img_data, "PNG")
    img_data.seek(0)
    return img_data

@app.route("/")
def myapp():
    if "prompt" not in request.args:
        return "Please specify a prompt parameter", 400
    prompt = request.args["prompt"]
    img_data = run_inference(prompt)
    return send_file(img_data, mimetype="image/png")
Note that this app is very barebones and it simply returns the raw image. A more practical app would return an HTML form with an input field for the prompt, and potentially other knobs (like the desired image dimensions). GradIO and Streamlit are great libraries to build more elaborate apps.
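You can also sanity-check the request-handling logic without a GPU by stubbing out the model. A minimal sketch using Flask’s built-in test client (the run_inference stub below is a stand-in for the real pipeline, not the actual Stable Diffusion call):

```python
import io
from flask import Flask, request, send_file

app = Flask(__name__)

def run_inference(prompt):
    # Stand-in for the real model: returns dummy bytes instead of an image.
    return io.BytesIO(b"not-a-real-png: " + prompt.encode())

@app.route("/")
def myapp():
    if "prompt" not in request.args:
        return "Please specify a prompt parameter", 400
    img_data = run_inference(request.args["prompt"])
    return send_file(img_data, mimetype="image/png")

# Exercise the routing logic without starting a server.
client = app.test_client()
assert client.get("/").status_code == 400            # missing prompt
assert client.get("/?prompt=test").status_code == 200
```

This catches routing and parameter-handling bugs in seconds, before you pay for 15-second GPU round trips.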
Now verify that the Flask app runs with no errors:

flask run

This should start the server on localhost at port 5000. You won’t yet be able to access this server from a browser, since port 5000 is not accessible by default.
While Flask’s default server is fine for development, it is standard practice to deploy a Flask app in production using gunicorn. I won’t cover the reasons here, but you can read this great explanation for why gunicorn is preferred. To install it, simply run pip install gunicorn. To bring the webserver up, run the following command:
gunicorn -b :5000 --timeout=20 app:app
The -b parameter sets the desired port. You can change this to any other port that is not in use. The --timeout parameter sets the number of seconds before gunicorn resets its workers, assuming something went wrong. Since running a forward pass through the Stable Diffusion model takes about 15 seconds on average, set the timeout to at least 20 seconds.
If you want the server to survive after you log out of the VM instance, you can use the nohup Linux utility (short for “no hangup”):
nohup gunicorn -b :5000 --timeout=20 app:app &
The final ampersand sends the process to run in the background (so you regain control of the command line). Logs will be exported to a file called nohup.out, usually placed in the directory where you ran the command.
Creating a Firewall rule to make the port accessible
The final step is to make requests to this server from a browser. To do that, we need to make your port accessible.
From the navigation (hamburger menu):
VPC Network > Firewall. From the top menu, click CREATE FIREWALL RULE. In the form, set the following:
- Name: allow-stable-diffusion-access (or your preferred name)
- Logs: on
- Direction of traffic: Ingress
- Action on match: Allow
- Targets: Specified target tags
- Target tags: deeplearning-vm (This tag is automatically added to your VM when you choose the “Deep Learning on Linux” image. You could manually add another tag to your VM and reference it here.)
- Protocols and ports: TCP — 5000, or your chosen port.
Once the form is complete, click CREATE.
Sending queries to the webserver from a browser
Finally, find the IP address of your VM: from the navigation menu, go to COMPUTE ENGINE > VM INSTANCES and look at the “External IP” column of your VM. If the IP address is, say, 12.34.56.78, then your webserver is accessible at http://12.34.56.78:5000.
Remember that the server expects a parameter called prompt, which we can send as an HTTP query parameter. For the prompt “robot dancing”, here is what the URL looks like: http://12.34.56.78:5000/?prompt=robot%20dancing (note the percent-encoded space).
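If you are scripting queries rather than typing them into a browser, you can let Python’s standard library handle the URL encoding. A quick sketch (the IP address is a placeholder for your VM’s external IP):

```python
from urllib.parse import urlencode

base = "http://12.34.56.78:5000/"
query = urlencode({"prompt": "robot dancing"})  # handles spaces and special characters
url = f"{base}?{query}"
print(url)  # http://12.34.56.78:5000/?prompt=robot+dancing
```

urlencode encodes the space as "+", which Flask decodes back to a space on the server side, equivalently to "%20".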
Make sure that the browser doesn’t automatically default to https instead of http, since we don’t have an SSL certificate set up.
There are many reasons why this webserver is not ready for production use, but the biggest bottleneck is its single GPU device. Given that running inference requires 10GB of VRAM (and our GPU has a mere 16GB of memory), gunicorn cannot afford to bring up more than one worker. In other words, the server can only handle one query at a time (and each query takes about 15 seconds to resolve).
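To put the single-worker bottleneck in numbers, a back-of-the-envelope sketch (the 15-second latency is the average measured earlier; the rest is arithmetic):

```python
latency_s = 15  # average seconds per image on the T4
workers = 1     # capped by VRAM: only one model copy fits on the GPU

# Maximum sustainable throughput of the server.
throughput_per_min = workers * 60 / latency_s
print(throughput_per_min)  # 4.0 images per minute

# With strictly sequential handling, the k-th queued request waits ~k * latency_s.
queue_len = 10
print(queue_len * latency_s)  # 150 seconds before the last request even starts
```

Four images per minute is fine for personal use, but any concurrent traffic quickly piles up in the queue.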
For less computationally intensive tasks, the standard solution is a platform for “serverless containerized micro-services” like Google Cloud Run (GCR); AWS and Azure have analogous offerings. Developers bundle their web apps into containers (standalone computational environments that contain all the necessary dependencies to run the application, e.g. Docker images) and hand them over to the cloud. GCR deploys these containers on actual machines and scales the deployment depending on demand (the number of requests per second); if necessary, GCR can allocate tens of thousands of CPUs for your service and thus make it highly available. You don’t need to worry about spinning up servers yourself, or restarting them when they die. The billing model is also convenient for the user, who ends up paying per usage (instead of having to permanently keep up a fixed number of machines).
However, as of September 2022, Google Cloud Run does not support GPUs. Given that the acquisition and operational cost of a GPU is still quite high, it is not very surprising that Google remains protective of GPU usage. One can only assume that GCR’s algorithms for autoscaling cannot prevent devices from being idle for good portions of time; while an idle CPU is not a big loss, leaving a GPU unused is a bigger opportunity cost. Also, they probably want to prevent situations in which people blindly over-scale and are faced with monstrous bills at the end of the month.
As a side note, Google Cloud Run for Anthos is starting to offer GPUs — but this is a service meant for power users and high-end customers that require interoperability between multiple clouds and on-premise environments. It is definitely not for the ML enthusiast who wants to bring up their own Stable Diffusion webserver.
While it was fun investigating the best way to serve Stable Diffusion via Google Cloud, this is not necessarily the most effective way of generating AI images. Depending on your needs, the following workflows might be more appropriate:
- For non-tech users: head over to Dreamstudio, Stability.ai’s official service, where you get some free credits.
- For ML enthusiasts who just want to play around: use Google Colab. With the free tier, you get GPU access whenever available. For $10/month, you can upgrade to Colab Pro, which promises “faster GPUs and more memory”.
- For developers seeking to embed Stable Diffusion into their service: Call the API from Replicate, for $0.0023/second. They guarantee that 80% of the calls finish in 15 seconds, so the 80th cost percentile for an image is around $0.0345. Replicate is similar to the better-known HuggingFace, but focuses more on computer vision instead of natural language processing. For now, HuggingFace doesn’t offer their standard “Accelerated Inference” API for Stable Diffusion, but it’s most likely in the works.
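The per-image cost estimate above is simple arithmetic on Replicate’s advertised numbers:

```python
price_per_second = 0.0023  # Replicate's advertised rate, in dollars
p80_latency_s = 15         # 80% of calls finish within 15 seconds

# Cost at the 80th latency percentile.
p80_cost = price_per_second * p80_latency_s
print(round(p80_cost, 4))  # 0.0345 dollars per image
```

At roughly 3.5 cents per image, a few hundred generations a month comes out well under the ~$281/month estimate for the dedicated VM above.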