DiT is a vision model based on the Diffusion Transformer architecture. The model consists of N layers of DiT blocks. We support multiple variants of the DiT block; more details can be found in Section 3.2 of the Diffusion Transformer paper and in Step 4: Training the model on CS system or GPU using run.py.
In addition, we support patch sizes of 2 (default), 4, and 8. The Patchify block in Figure 1 takes the noised latent tensor from the dataloader as input and converts it into patches of size `patch_size`. The smaller the patch size, the larger the number of patches (i.e., the maximum sequence length (MSL)) and hence the larger the number of FLOPs. To change the patch size, for example to 4 × 4, set `model.patch_size: [4, 4]` in the provided YAML configs.
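For intuition, here is a minimal sketch (illustrative names, not the repository's API) of how the patch size drives the number of patches and therefore the sequence length:

```python
# Illustrative only: relationship between latent size, patch size, and sequence length (MSL).
def num_patches(latent_size, patch_size):
    """Number of patches produced by Patchify for a given latent and patch size."""
    h, w = latent_size
    ph, pw = patch_size
    return (h // ph) * (w // pw)

print(num_patches([32, 32], [2, 2]))  # 256 patches for a 32x32 latent with 2x2 patches
print(num_patches([32, 32], [4, 4]))  # 64 patches: larger patches -> shorter sequence, fewer FLOPs
```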
During training, an image from the dataset is passed through a frozen VAE (Variational Auto-Encoder) Encoder to convert it into a lower-dimensional latent. Random Gaussian noise is then added to the latent tensor (Algorithm 1 of Denoising Diffusion Implicit Models), and the result is passed as input to the DiT. Since the VAE Encoder is frozen and not updated during training, we prefetch the latents for all images in the dataset using the script create_imagenet_latents.py. This saves computation and memory during training. Refer to Step 3.
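The noising step follows the usual forward diffusion form; a minimal sketch (tensor and schedule names here are illustrative, not the repository's API):

```python
# Illustrative sketch of the forward noising step applied to the VAE latent:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
import torch

def add_noise(latent, noise, timestep, alphas_cumprod):
    """Return the noised latent for a batch of per-sample timesteps."""
    alpha_bar = alphas_cumprod[timestep].view(-1, 1, 1, 1)  # (batch, 1, 1, 1)
    return alpha_bar.sqrt() * latent + (1.0 - alpha_bar).sqrt() * noise
```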
- `configs/`: YAML configuration files.
- `modeling_dit.py`: Defines the core model `DiT`.
- `model.py`: The entry point to the model. Defines `DiTModel`.
- `utils.py`: Miscellaneous scripts to parse the `params` dictionary from the YAML files.
- `data/vision/diffusion/`: Folder containing dataloader and preprocessing scripts.
- `samplers/`: Folder containing samplers used in diffusion models to sample images from checkpoints.
- `layers/vae/`: Defines VAE (Variational Auto-Encoder) model layers.
- `layers/*`: Defines building-block layers of the DiT model.
- `display_images.py`: Utility script to display the images in a folder in a grid format, to look at all images at once.
- `pipeline.py`: Defines a `DiffusionPipeline` object that takes a random Gaussian input and performs sampling.
- `sample_generator.py`: Defines an abstract base class `SampleGenerator` for defining sample generators for diffusion models.
- `sample_generator_dit.py`: Defines a `DiTSampleGenerator` that inherits from the `SampleGenerator` class and is used to generate the required number of samples from the DiT model using a given trained checkpoint.

| Configuration | Description |
| --- | --- |
| params_dit_large_patchsize_2x2.yaml | DiT-L/2 model with ~458M parameters. Patch size 2×2, latent size 32×32×4. |
| params_dit_xlarge_patchsize_2x2.yaml | DiT-XL/2 model with ~675M parameters. Patch size 2×2, latent size 32×32×4. |
| params_dit_2B_patchsize_2x2.yaml | DiT-2B/2 model with ~2B parameters. Patch size 2×2, latent size 32×32×4. |
Training is launched with `run.py`. First, download `ILSVRC2012_img_train.tar`, `ILSVRC2012_img_val.tar`, and `ILSVRC2012_devkit_t12.tar.gz` for the ImageNet dataset.
Once we have all three tar files, we need to extract and preprocess the archives into the appropriate directory structure as described below. The preprocessing also requires a `meta.bin` file. The simplest way to create it is to initialize `torchvision.datasets.ImageNet` once.
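A minimal sketch of that one-time initialization, assuming the three tar files have been placed in the dataset root (the path below is a placeholder):

```python
# One-time initialization: torchvision parses the devkit archive and creates meta.bin
# (and extracts the train/val archives) when the tar files are present in the root directory.
import torchvision

torchvision.datasets.ImageNet(root="/path/to/imagenet", split="train")
torchvision.datasets.ImageNet(root="/path/to/imagenet", split="val")
```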
The VAE model configuration is defined under `model.vae_params`, and no changes are needed except for setting `model.vae_params.latent_size` correctly.
The script `create_imagenet_latents.py` writes the precomputed latents to the location passed via `--dest_dir`. The sample command also logs every 10 steps and uses a batch size of 16, i.e., 16 images are batched together and passed through the VAE Encoder on each GPU.
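Conceptually, latent prefetching amounts to running the frozen encoder once over the dataset and saving its outputs; a rough sketch (not the repository script; storage format and names are illustrative):

```python
# Conceptual sketch of latent prefetching: encode each batch of images once with the
# frozen VAE encoder and save the latents so training never has to run the encoder.
import torch

@torch.no_grad()
def prefetch_latents(vae_encoder, dataloader, dest_dir):
    vae_encoder.eval()
    for step, (images, labels) in enumerate(dataloader):
        latents = vae_encoder(images)  # frozen encoder, no gradient tracking
        torch.save({"latent": latents.cpu(), "label": labels}, f"{dest_dir}/batch_{step}.pt")
        if step % 10 == 0:
            print(f"processed batch {step}")
```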
Latents need to be generated for both the `train` split and the `val` split of the dataset. Pass `--horizontal_flip` to additionally save latents of horizontally flipped images for the `train` split and the `val` split of the dataset.

Before launching training with `run.py`, make sure that:

- the `train_input.data_dir` parameter points to the correct dataset,
- the `train_input.image_size` parameter corresponds to the image size of the dataset,
- the `model.vae.latent_size` parameter corresponds to the size of the latent tensors (see the sketch after this list):
  - `[32, 32]` for an image size of 256 × 256
  - `[64, 64]` for an image size of 512 × 512
  - `[floor(H / 8), floor(W / 8)]` for an image size of H × W
- the `model.patch_size` parameter is set if you want to use a different patch size.
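The latent size follows from the VAE's 8× spatial downsampling; a small illustrative helper (not part of the repository):

```python
# Illustrative: the VAE encoder downsamples the image by 8x in each spatial dimension,
# so the latent size is [floor(H / 8), floor(W / 8)] for an H x W image.
def vae_latent_size(image_size, downsample_factor=8):
    h, w = image_size
    return [h // downsample_factor, w // downsample_factor]

print(vae_latent_size([256, 256]))  # [32, 32]
print(vae_latent_size([512, 512]))  # [64, 64]
```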
For example, to train the model with an image size of 512 × 512, please make the following changes:
- `train_input.image_size: [512, 512]`
- `model.vae.latent_size: [64, 64]`
- `train_input.transforms` (if any): change the `size` params under the various transforms to `[512, 512]`

In the run command, we use `/path/to/yaml`, `/path/to/model_dir`, and `train` as placeholders for user-supplied inputs.
- `/path/to/yaml` is a path to the YAML config file with model parameters, such as one of the configurations described in Configs included for this model.
- `/path/to/model_dir` is a path to the directory where we would like to store the logs and other artifacts of the run.
- `--mode` specifies the desired mode to run the model in. Change to `--mode eval` to run in eval mode.

By default, samples are generated without classifier-free guidance (i.e., `model.reverse_process.guidance_scale=1.0`). Using a `model.reverse_process.guidance_scale` greater than 1.0 enables classifier-free guidance, which trades off diversity for sample quality.
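For reference, classifier-free guidance combines the conditional and unconditional noise predictions at sampling time; a minimal sketch (function name is illustrative, and `guidance_scale` plays the role of `s`):

```python
# Classifier-free guidance: eps = eps_uncond + s * (eps_cond - eps_uncond).
# s = 1.0 recovers the purely conditional prediction (no guidance effect);
# s > 1.0 pushes samples toward the class-conditional mode, trading diversity for quality.
def apply_guidance(eps_cond, eps_uncond, guidance_scale):
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```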
The sample generation settings can be found under `model.reverse_params` in the config YAML. We currently support two samplers, the DDPM sampler [2] and the DDIM sampler [3]. All arguments in the `__init__` of the samplers can be set in the `model.reverse_params.sampler` section of the YAML config.
To generate samples from a trained DiT checkpoint, we use GPUs and sample_generator_dit.py. As an example, running on a single node with 4 GPUs to generate 50,000 samples from a trained DiT-XL/2, each GPU uses a batch size of 64 and generates 64 samples at once. `--num_fid_samples` controls the number of samples to generate. The script reads the `model.reverse_params` section of the config YAML, so make sure those settings are appropriate.
The script produces a `.npz` file that should be passed as input to the FID score computation. Sample output looks as below:
To generate samples of specific classes only, set `model.reverse_params.pipeline.custom_labels` to a list of integer ImageNet labels. This will generate samples belonging to only these classes. For example, if `model.reverse_params.pipeline.custom_labels: [207, 360]`, then we will only generate samples belonging to the `golden_retriever` (label_id=207) and `otter` (label_id=360) classes respectively.
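A small sketch of the effect (illustrative code, not the pipeline's implementation): the class labels used for sampling are drawn only from the provided list instead of uniformly from all 1000 ImageNet classes.

```python
# Illustrative: restrict sampled class labels to custom_labels instead of all 1000 classes.
import torch

custom_labels = torch.tensor([207, 360])  # golden_retriever, otter
batch_size = 64
labels = custom_labels[torch.randint(len(custom_labels), (batch_size,))]
```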
Using the `.npz` file generated from Step 5, we can compute the FID score using the OpenAI ADM script evaluator.py. In order to compute the FID score, you also need the `npz` files corresponding to the reference batch of ImageNet and the evaluator.py script. Note that recent NumPy versions may require replacing `np.bool` with `bool` in evaluator.py.
The `model.vae.latent_size` parameter corresponds to the size of the latent tensors. This is the only param under `model.vae_params` that needs to be changed:

- `[32, 32]` for an image size of 256 × 256
- `[64, 64]` for an image size of 512 × 512
- `[floor(H / 8), floor(W / 8)]` for an image size of H × W

Use the `model.patch_size` parameter to use different patch sizes.

The dataloader produces the following features:

| Feature | Description | Shape | Dtype |
| --- | --- | --- | --- |
| `input` | Noised latent tensor. | `(batch_size, model.vae.latent_channels, model.vae.latent_size[0], model.vae.latent_size[1])` | `torch.bfloat16` |
| `label` | Scalar ImageNet labels. | `(batch_size, )` | `torch.int32` |
| `diffusion_noise` | Gaussian noise that the model should predict. Also used in creating the value of the key `noised_latent`. | `(batch_size, model.vae.latent_channels, model.vae.latent_size[0], model.vae.latent_size[1])` | `torch.bfloat16` |
| `timestep` | Timestep sampled from ~Uniform(0, `train_input.num_diffusion_steps`). | `(batch_size, )` | `torch.int32` |
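For concreteness, a dummy batch matching these features might look like the following sketch (assuming `latent_channels=4`, `latent_size=[32, 32]`, and 1000 diffusion steps; the values are illustrative):

```python
# Illustrative dummy batch with the shapes and dtypes listed in the table above.
import torch

batch_size = 2
batch = {
    "input": torch.randn(batch_size, 4, 32, 32, dtype=torch.bfloat16),
    "label": torch.randint(0, 1000, (batch_size,), dtype=torch.int32),
    "diffusion_noise": torch.randn(batch_size, 4, 32, 32, dtype=torch.bfloat16),
    "timestep": torch.randint(0, 1000, (batch_size,), dtype=torch.int32),
}
```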
Our implementation differs from the reference DiT implementation in the following ways:

- We use a `ConvTranspose2D` layer instead of a `Linear` layer to un-patchify the outputs.
- Instead of `gelu` with approximation `tanh`, we use `gelu` with no approximation for better performance.
- The VAE configuration is defined under `model.vae_params`. The only modification we make in our implementation of the VAE model is that we use the Attention layer defined in modelzoo.
- We support the `AdaLN-Zero` variant of the DiT model. Support for the `In-Context` and `Cross-Attention` variants is planned for future releases.
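To illustrate the first point, a minimal sketch of un-patchifying with `ConvTranspose2d` (dimensions assume a DiT-XL/2-style hidden size of 1152, a 32×32 latent, and 2×2 patches; this is not the repository code):

```python
# Illustrative: mapping per-patch token embeddings back to the latent grid with ConvTranspose2d.
import torch
import torch.nn as nn

hidden_size, out_channels, patch = 1152, 4, 2
grid_h = grid_w = 32 // patch  # 32x32 latent with 2x2 patches -> 16x16 patch grid

unpatchify = nn.ConvTranspose2d(hidden_size, out_channels, kernel_size=patch, stride=patch)

tokens = torch.randn(1, grid_h * grid_w, hidden_size)             # (batch, MSL, hidden)
grid = tokens.transpose(1, 2).reshape(1, hidden_size, grid_h, grid_w)
latent_out = unpatchify(grid)                                      # (batch, 4, 32, 32)
```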