Diffusion Transformer
Vision model based on the Diffusion Transformer architecture
Model overview
This directory contains the implementation of the Diffusion Transformer (DiT). The Diffusion Transformer [1], as the name suggests, belongs to the class of diffusion models; the key difference is that it replaces the UNet backbone typically used in previous diffusion models with a (slightly modified) Transformer backbone. This model outperforms previous diffusion models on the FID-50K evaluation metric.
A DiT model consists of N layers of DiT blocks. We support several variants of the DiT block; more details can be found in Section 3.2 of the Diffusion Transformer paper [1] and in Step 4: Training the model on CS system or GPU using run.py.
In addition, we support patch sizes of 2 (default), 4, and 8. The Patchify block in Figure 1 takes the noised latent tensor from the dataloader as input and converts it into patches of size patch_size. The smaller the patch size, the larger the number of patches (i.e., the maximum sequence length (MSL)) and hence the larger the number of FLOPs. To change the patch size, for example to 4 x 4, set model.patch_size: [4, 4] in the provided YAML configs.
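For example, with the 32 x 32 latent used in the provided configs, a 2 x 2 patch size yields (32 / 2)^2 = 256 patches per image (the MSL), whereas a 4 x 4 patch size yields only (32 / 4)^2 = 64.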
During training, an image from the dataset is passed through a frozen VAE (Variational Autoencoder) Encoder to convert it into a lower-dimensional latent. Random Gaussian noise is then added to the latent tensor (Algorithm 1 of Denoising Diffusion Implicit Models), and the result is passed as input to the DiT. Since the VAE Encoder is frozen and not updated during training, we prefetch the latents for all images in the dataset using the script create_imagenet_latents.py. This saves computation and memory during training. Refer to Step 3.
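For reference, the standard forward-noising step used by DDPM-style diffusion models (standard notation, not taken from this codebase) is:

$$
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad t \sim \mathrm{Uniform}\{0,\dots,T-1\}
$$

where x_0 is the VAE latent, the bar-alpha term is the cumulative product of the noise schedule, and the DiT is trained to predict the noise given (x_t, t) and the class label.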
Structure of the code
- configs/: YAML configuration files.
- modeling_dit.py: Defines the core model DiT.
- model.py: The entry point to the model. Defines DiTModel.
- utils.py: Miscellaneous scripts to parse the params dictionary from the YAML files.
- data/vision/diffusion/: Folder containing the dataloader and preprocessing scripts.
- samplers/: Folder containing samplers used in diffusion models to sample images from checkpoints.
- layers/vae/: Defines the VAE (Variational Autoencoder) model layers.
- layers/*: Defines the building-block layers of the DiT model.
- display_images.py: Utility script to display the images in a folder in a grid format, to look at all images at once.
- pipeline.py: Defines a DiffusionPipeline object that takes a random Gaussian input and performs sampling.
- sample_generator.py: Defines an abstract base class SampleGenerator used to define sample generators for diffusion models.
- sample_generator_dit.py: Defines a DiTSampleGenerator that inherits from the SampleGenerator class and is used to generate the required number of samples from a DiT model using a given trained checkpoint.
Available Configurations
| Configuration | Description |
| --- | --- |
| params_dit_large_patchsize_2x2.yaml | DiT-L/2 model with ~458M parameters. Patch size 2×2, latent size 32×32×4. |
| params_dit_xlarge_patchsize_2x2.yaml | DiT-XL/2 model with ~675M parameters. Patch size 2×2, latent size 32×32×4. |
| params_dit_2B_patchsize_2x2.yaml | DiT-2B/2 model with ~2B parameters. Patch size 2×2, latent size 32×32×4. |
Sequence of the steps to perform
The high-level steps for training a model are relatively simple: data processing, then model training and evaluation.
- Step 1: ImageNet dataset download and preparation
- Step 2: Checkpoint conversion of the pre-trained VAE
- Step 3: Preprocessing and saving latent tensors from images and the VAE Encoder on GPU
- Step 4: Training the model on CS system or GPU using run.py
- Step 5: Generating 50K samples from the trained checkpoint on GPUs
- Step 6: Using the OpenAI FID evaluation repository to compute the FID score
The steps are elaborated below:
Step 1: ImageNet dataset download and preparation
To download the ImageNet dataset, register on the ImageNet website [4]. The dataset can only be downloaded after the ImageNet website confirms the registration and sends a confirmation email. Please follow up with ImageNet support if a confirmation email is not received within a couple of days.
Download the tar files ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar, and ILSVRC2012_devkit_t12.tar.gz for the ImageNet dataset.
Once all three tar files are available, extract and preprocess the archives into the appropriate directory structure, as described below.
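A sketch of the expected layout (the standard torchvision ImageNet layout is assumed here; meta.bin is created further below):

```
/path/to/imagenet/
├── train/
│   ├── n01440764/
│   │   ├── n01440764_10026.JPEG
│   │   └── ...
│   └── ...            # one sub-folder per class (synset)
├── val/
│   └── ...            # same per-class sub-folder structure
└── meta.bin           # created below
```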
To arrange the ImageNet dataset in the above format, the PyTorch repository provides an easy-to-use script: https://github.com/pytorch/examples/blob/main/imagenet/extract_ILSVRC.sh. Download this script and invoke it as follows to preprocess the ImageNet dataset.
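A hedged sketch of the invocation, assuming the two image tarballs have been placed in the dataset root (/path/to/imagenet is a placeholder):

```bash
# Place the downloaded tarballs in the dataset root and run the PyTorch helper script there.
mkdir -p /path/to/imagenet
mv ILSVRC2012_img_train.tar ILSVRC2012_img_val.tar /path/to/imagenet/
cd /path/to/imagenet
wget https://raw.githubusercontent.com/pytorch/examples/main/imagenet/extract_ILSVRC.sh
bash extract_ILSVRC.sh
```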
We also need a meta.bin file. The simplest way to create it is to initialize torchvision.datasets.ImageNet once.
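A minimal way to do this, assuming the devkit archive ILSVRC2012_devkit_t12.tar.gz sits in the dataset root:

```python
import torchvision

# Instantiating the dataset once parses the devkit and writes meta.bin into the dataset root.
torchvision.datasets.ImageNet(root="/path/to/imagenet", split="val")
```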
Once the ImageNet dataset and folder are in the expected format, proceed to Step 2.
Step 2: Checkpoint Conversion of Pre-trained VAE
The next step is to convert the pretrained checkpoint provided by StabilityAI and hosted on Hugging Face to the CS namespace format. This can be done using the script vae_hf_cs.py, which downloads the pretrained VAE checkpoint from StabilityAI on Hugging Face and converts it to the CS namespace based on the model layers defined in dit/layers/vae. This script only uses the params defined under model.vae_params, and no changes are needed there except for setting model.vae_params.latent_size correctly.
Command to run:
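An illustrative invocation is sketched below; the flag names are assumptions rather than the script's documented interface, so consult the script's help output for the actual arguments:

```bash
# Hypothetical flags shown for illustration only.
python vae_hf_cs.py \
    --params /path/to/yaml \
    --dest_ckpt_path /path/to/converted_vae_checkpoint.bin
```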
Step 3: Preprocessing and saving Latent tensors from images and VAE Encoder on GPU
For training the DiT model, we prefetch the latent tensor outputs from a pretrained VAE Encoder using the script create_imagenet_latents.py. To preprocess the ImageNet dataset and create the latent tensors, one or more GPUs are required.
A sample command for a single node with 4 GPUs is sketched below. It uses an image size of 256 x 256 and saves the latents to the folder specified by --dest_dir. It also logs every 10 steps and uses a batch size of 16, i.e., 16 images are batched together and passed to the VAE Encoder on each GPU.
a. Create ImageNet latent tensors from the VAE for the train split of the dataset.

b. Create ImageNet latent tensors from the VAE for the val split of the dataset (a sketch of both commands follows).
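A hedged sketch for (a) and (b); apart from --dest_dir (and --horizontal_flip, used further below), the flag names and the torchrun-style launch are assumptions, so consult the script's help output for the real interface:

```bash
# (a) train split -- flag names other than --dest_dir are hypothetical.
torchrun --nproc_per_node=4 create_imagenet_latents.py \
    --src_dir /path/to/imagenet --dest_dir /path/to/imagenet_latents \
    --split train --image_size 256 --batch_size 16 --log_steps 10

# (b) val split
torchrun --nproc_per_node=4 create_imagenet_latents.py \
    --src_dir /path/to/imagenet --dest_dir /path/to/imagenet_latents \
    --split val --image_size 256 --batch_size 16 --log_steps 10
```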
The output folder will have the same format as the dataset folder shown in Step 1. DiT models use horizontal flipping of images as data augmentation; the script also supports saving latent tensors from horizontally flipped images when passed the flag --horizontal_flip.
c. Create ImageNet latent tensors with horizontal flip from the VAE for the train split of the dataset.

d. Create ImageNet latent tensors with horizontal flip from the VAE for the val split of the dataset (see the sketch below).
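Under the same assumptions as above, (c) and (d) simply add the --horizontal_flip flag:

```bash
# (c) train split and (d) val split, with horizontal flip.
torchrun --nproc_per_node=4 create_imagenet_latents.py \
    --src_dir /path/to/imagenet --dest_dir /path/to/imagenet_latents_flipped \
    --split train --image_size 256 --batch_size 16 --log_steps 10 --horizontal_flip

torchrun --nproc_per_node=4 create_imagenet_latents.py \
    --src_dir /path/to/imagenet --dest_dir /path/to/imagenet_latents_flipped \
    --split val --image_size 256 --batch_size 16 --log_steps 10 --horizontal_flip
```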
Step 4: Training the model on CS system or GPU using run.py
IMPORTANT: See the following notes before proceeding further.
Parameter settings in the YAML config file: The config YAML files are located in the configs directory. Before starting a training run, make sure that the YAML config file being used has the following set correctly:
- The train_input.data_dir parameter points to the correct dataset.
- The train_input.image_size parameter corresponds to the image size of the dataset.
- The model.vae.latent_size parameter corresponds to the size of the latent tensors:
  - Set to [32, 32] for an image size of 256 x 256.
  - Set to [64, 64] for an image size of 512 x 512.
  - In general, set to [floor(H / 8), floor(W / 8)] for an image size of H x W.
- The model.patch_size parameter is set to the desired patch size.
To use an image size of 512 x 512, please make the following changes (sketched as YAML below):

- train_input.image_size: [512, 512]
- model.vae.latent_size: [64, 64]
- train_input.transforms (if any): change the size params under the various transforms to [512, 512]
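Sketched as a YAML fragment (the nesting mirrors the parameter names used in this README and may differ slightly from the provided config files):

```yaml
train_input:
  image_size: [512, 512]
  # If any transforms are defined, update their size entries to [512, 512] as well.
model:
  vae:
    latent_size: [64, 64]
```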
YAML config files: Details on the configs for this model can be found in the Available Configurations section above.
In the following example run commands, we use /path/to/yaml, /path/to/model_dir, and train as placeholders for user-supplied inputs.

- /path/to/yaml is a path to the YAML config file with model parameters, such as one of the configurations described in Available Configurations.
- /path/to/model_dir is a path to the directory where the logs and other artifacts of the run will be stored.
- --mode specifies the desired mode to run the model in. Change to --mode eval to run in eval mode.
To compile/validate, run train and eval on Cerebras System
Please follow the instructions on our quickstart in the Developer Docs.
To run train and eval on GPU/CPU
If running on a CPU or GPU, activate the environment from the Python GPU Environment setup, and simply run:
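A representative command using the placeholders above (this assumes the standard Model Zoo run.py interface; consult the Developer Docs linked above if the arguments differ):

```bash
python run.py GPU --params /path/to/yaml --mode train --model_dir /path/to/model_dir
```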
Step 5: Generating 50K samples from trained checkpoint on GPUs for FID score computation
Diffusion models report the Fréchet Inception Distance (FID) [7] metric on 50K samples generated from the trained checkpoint. To generate samples, we use a DDPM sampler [2] without guidance (model.reverse_process.guidance_scale: 1.0). Using a model.reverse_process.guidance_scale > 1.0 enables classifier-free guidance, which trades off diversity for sample quality.
The sample generation settings can be found under model.reverse_params in the config YAML. We currently support two samplers: the DDPM sampler [2] and the DDIM sampler [3]. All arguments of a sampler's __init__ can be set in the model.reverse_params.sampler section of the YAML config.
To generate samples from a trained DiT checkpoint, we use GPUs and sample_generator_dit.py.
A sample command for a single node with 4 GPUs to generate 50,000 samples using a trained DiT-XL/2 checkpoint is sketched below. Each GPU uses a batch size of 64 and generates 64 samples at a time. --num_fid_samples controls the number of samples to generate. The script only uses the model.reverse_params section of the config YAML; make sure those settings are appropriate.
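A hedged sketch of such a run, assuming a torchrun-style multi-GPU entry point; apart from --num_fid_samples, the flag names are assumptions:

```bash
# Flag names other than --num_fid_samples are hypothetical; check the script's help output.
torchrun --nproc_per_node=4 sample_generator_dit.py \
    --params /path/to/yaml \
    --checkpoint_path /path/to/trained_dit_xl2_checkpoint \
    --batch_size 64 \
    --num_fid_samples 50000 \
    --sample_dir /path/to/generated_samples
```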
More information can be found by running:
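Assuming the script exposes a standard argparse interface:

```bash
python sample_generator_dit.py --help
```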
The script generates a .npz file that should be passed as input to the FID score computation in Step 6.
To generate samples belonging to specific ImageNet label classes, set model.reverse_params.pipeline.custom_labels to a list of integer ImageNet labels; only samples belonging to those classes will then be generated. For example, if model.reverse_params.pipeline.custom_labels: [207, 360], then only samples belonging to the golden_retriever (label_id=207) and otter (label_id=360) classes are generated.
Step 6: Using OpenAI FID evaluation repository to compute FID score
Now that we have the 50K samples and the .npz file generated in Step 5, we can compute the FID score using the OpenAI ADM script evaluator.py.
To compute the FID score, follow these steps (a consolidated sketch follows the list):

a. Set up a conda environment for the OpenAI evaluation script.

b. Clone the OpenAI guided-diffusion GitHub repository.

c. Download the npz file corresponding to the ImageNet reference batch.

d. Modify evaluator.py to account for NumPy deprecations, i.e., replace instances of np.bool with bool.

e. Launch the FID eval script.
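A consolidated, hedged sketch of steps (a) through (e); the reference-batch URL and package list should be cross-checked against guided-diffusion/evaluations/README.md and requirements.txt:

```bash
# a. Conda environment for the OpenAI evaluation script (package versions are illustrative).
conda create -n fid_eval python=3.9 -y
conda activate fid_eval
pip install tensorflow scipy requests tqdm

# b. Clone the OpenAI guided-diffusion repository.
git clone https://github.com/openai/guided-diffusion.git
cd guided-diffusion/evaluations

# c. Download the ImageNet 256x256 reference batch (verify the link in evaluations/README.md).
wget https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/imagenet/256/VIRTUAL_imagenet256_labeled.npz

# d. Replace deprecated np.bool with bool in evaluator.py.
sed -i 's/np\.bool\b/bool/g' evaluator.py

# e. Compute FID between the reference batch and the generated samples from Step 5.
python evaluator.py VIRTUAL_imagenet256_labeled.npz /path/to/generated_samples.npz
```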
The following changes can be made to use other settings of the DiT model:

- The model.vae.latent_size parameter corresponds to the size of the latent tensors. This is the only param under model.vae_params that needs to be changed.
  - Set to [32, 32] for an image size of 256 x 256.
  - Set to [64, 64] for an image size of 512 x 512.
  - In general, set to [floor(H / 8), floor(W / 8)] for an image size of H x W.
- The model.patch_size parameter can be set to use different patch sizes.
DataLoader Features Dictionary
DiffusionLatentImageNet1KProcessor outputs a features dictionary with the following keys and values:
- input: Noised latent tensor.
  - Shape: (batch_size, model.vae.latent_channels, model.vae.latent_size[0], model.vae.latent_size[1])
  - Type: torch.bfloat16
- label: Scalar ImageNet labels.
  - Shape: (batch_size, )
  - Type: torch.int32
- diffusion_noise: Gaussian noise that the model should predict. Also used in creating the value of key noised_latent.
  - Shape: (batch_size, model.vae.latent_channels, model.vae.latent_size[0], model.vae.latent_size[1])
  - Type: torch.bfloat16
- timestep: Timestep sampled from ~Uniform(0, train_input.num_diffusion_steps).
  - Shape: (batch_size, )
  - Type: torch.int32
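For concreteness, a dummy batch with the shapes and dtypes above could be constructed as follows (a sketch assuming a batch size of 16, 4 latent channels, a [32, 32] latent size, and 1000 diffusion steps):

```python
import torch

batch_size, latent_channels, latent_h, latent_w = 16, 4, 32, 32
num_diffusion_steps = 1000  # corresponds to train_input.num_diffusion_steps

features = {
    # Noised latent tensor fed to the DiT backbone.
    "input": torch.randn(batch_size, latent_channels, latent_h, latent_w, dtype=torch.bfloat16),
    # Scalar ImageNet class labels.
    "label": torch.randint(0, 1000, (batch_size,), dtype=torch.int32),
    # Gaussian noise the model is trained to predict.
    "diffusion_noise": torch.randn(batch_size, latent_channels, latent_h, latent_w, dtype=torch.bfloat16),
    # Uniformly sampled diffusion timesteps.
    "timestep": torch.randint(0, num_diffusion_steps, (batch_size,), dtype=torch.int32),
}
```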
Implementation notes
There are a few modifications to the DiT model made in this implementation:

- We use ConvTranspose2D instead of a Linear layer to un-patchify the outputs (see the sketch after this list).
- While we support gelu with approximation tanh, we use gelu with no approximation for better performance.
- To use the exact StabilityAI pretrained VAE model, no changes to the params under model.vae_params are needed. The only modification we make in our implementation of the VAE model is that we use the Attention layer defined in the Model Zoo.
- We currently do not support the Kullback-Leibler (KL) loss to optimize Σ; hence the output from the DiT model includes only the predicted noise.
- We currently support the AdaLN-Zero variant of the DiT block. Support for the In-Context and Cross-Attention variants is planned for future releases.
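To illustrate the first note, the sketch below shows how a ConvTranspose2D with kernel size and stride equal to the patch size can un-patchify transformer outputs back into a latent-shaped tensor (the dimensions are illustrative, not the actual model code):

```python
import torch
import torch.nn as nn

hidden_size, patch = 1152, 2          # DiT-XL/2-like hidden size, 2x2 patches
latent_channels, latent_hw = 4, 32    # 32 x 32 x 4 latent
grid = latent_hw // patch             # patches per side
num_patches = grid * grid             # sequence length (MSL)

# Transformer output: one hidden vector per patch.
x = torch.randn(8, num_patches, hidden_size)

# Reshape the sequence back onto the 2D patch grid, then let ConvTranspose2d
# scatter each hidden vector into a patch x patch block of output pixels.
x = x.transpose(1, 2).reshape(8, hidden_size, grid, grid)
unpatchify = nn.ConvTranspose2d(hidden_size, latent_channels, kernel_size=patch, stride=patch)
out = unpatchify(x)                   # shape: (8, 4, 32, 32)
```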
Citations
[1] Scalable Diffusion Models with Transformers
[2] Denoising Diffusion Probabilistic Models
[3] Denoising Diffusion Implicit Models
[4] ImageNet Large Scale Visual Recognition Challenge
[5] Diffusion Models Beat GANs on Image Synthesis
[7] GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium