Fine-tune Stable Diffusion on your images using Textual Inversion

In this article, we will see how to fine-tune the text-to-image AI model Stable Diffusion on our own images.

Fine-tuning with textual inversion can be achieved with as few as 3-5 example images, because the technique learns only a new token embedding for your concept while the model's weights stay frozen.

We will cover two ways to do this in this article.

  1. Using Google Colab notebooks to fine-tune Stable Diffusion
  2. Fine-tuning Stable Diffusion using textual inversion locally

Let’s cover them one by one.

Using Google Colab notebooks to fine-tune Stable Diffusion

The easiest way, of course, is Google Colab notebooks. They run in your browser, and you don't need any special hardware like a GPU.

  1. Open this Google Colab notebook and follow the instructions in it to run the fine-tuning.
  2. Next, open the inference notebook and run all the cells.

Fine-tuning Stable Diffusion using textual inversion locally

Another way to fine-tune Stable Diffusion on your images is to use your own hardware.

For this, you either need a GPU-enabled machine locally or a GPU-enabled VM in the cloud.

You will also need to have Python 3 installed and should know your way around the command line.

Below are the steps:

Install the Python dependencies by running this command:

pip install "diffusers[training]" accelerate transformers
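
If you want a quick sanity check that everything installed correctly, this one-liner prints the installed versions:

python -c "import diffusers, transformers, accelerate; print(diffusers.__version__, transformers.__version__, accelerate.__version__)"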

Next, configure the Hugging Face Accelerate environment by running the command below and answering the interactive prompts:

accelerate config
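
For a single local GPU, the default answers are generally fine. You can review your environment and the saved configuration at any time with:

accelerate env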

Download the Stable Diffusion weights

First, visit the Stable Diffusion page on Hugging Face and accept the model license.

For the next part, you need a Hugging Face access token, which you can create from your Hugging Face account settings.

Next, authenticate with your token by running the command below and pasting the token when prompted:

huggingface-cli login
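
If you prefer to authenticate from Python instead of the shell, the huggingface_hub library exposes the same login flow. A minimal sketch, with a placeholder token string:

from huggingface_hub import login

# Paste your Hugging Face access token here (placeholder value).
login(token="hf_your_token_here")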

The training script, textual_inversion.py, lives in the textual inversion example folder of the Hugging Face diffusers repository; download it into your working directory. Fine-tuning can then be started using the commands below:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export DATA_DIR="path-to-dir-containing-images"

accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path=$MODEL_NAME --use_auth_token \
  --train_data_dir=$DATA_DIR \
  --learnable_property="object" \
  --placeholder_token="<cat-toy>" --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=5.0e-04 --scale_lr \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --output_dir="textual_inversion_cat"
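
A few notes on the important flags: --placeholder_token is the new pseudo-word whose embedding gets learned, --initializer_token is an existing word used to initialize that embedding, and --train_data_dir should point at a flat folder containing just your 3-5 example images, for instance (hypothetical filenames):

path-to-dir-containing-images/
    cat-toy-01.jpg
    cat-toy-02.jpg
    cat-toy-03.jpg
    cat-toy-04.jpg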

To generate images with your newly fine-tuned model, run the Python code below:

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

# Path to the output directory of the training run, e.g. "textual_inversion_cat"
model_id = "path-to-your-trained-model"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A <cat-toy> backpack"

# Run inference in half precision on the GPU
with autocast("cuda"):
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]

image.save("cat-backpack.png")
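
On newer versions of diffusers you can alternatively keep the unmodified base model and load only the learned embedding, which the training script saves as learned_embeds.bin in the output directory. A minimal sketch, assuming the training run above:

import torch
from diffusers import StableDiffusionPipeline

# Load the original base model, then attach the learned <cat-toy> embedding.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.load_textual_inversion("textual_inversion_cat")

image = pipe("A <cat-toy> backpack", num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("cat-backpack-alt.png")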