Multimodal Simple
Cerebras’ model library for implementing multimodal models
Model Overview
Multimodal Simple is our multimodal library, which can be used to instantiate many of the current state-of-the-art models such as LLaVA, CogVLM, and MM1 among others. Our implementation supports multiple images interleaved with text as input, and can generate text as output. The building blocks for this implementation are as follows:
- Vision Encoder: Processes images through one or more image encoders to produce embeddings.
- Image Embedding Projector: Projects the embeddings from the vision encoder into a shared latent space with the LLM using MLPs.
- Language Model: Accepts the vision and language embeddings as input and produces text as output.
Structure of the code
- configs/: YAML configuration files.
- modeling_mmsimple.py: Defines the core multimodal model.
- model.py: The entry point to the model.
- run.py: Training script. Performs training and validation.
Configuration files included for this model
We provide the following config files for LLaVA located under the configs directory.
| Config File | Dataset | Notes |
|---|---|---|
| params_mm_llava_llama2_7b_phase1.yaml | LLaVA Visual Instruct Pretrain LCS-558K Dataset | LLaVA-7B Phase-1 with CLIP ViT image encoder, Vicuna-7B text encoder, and mlp2x-gelu feedforward network for the projector. Freezes image_model and text_model during training. |
| params_mm_llava_llama2_7b_phase2.yaml | LLaVA Visual Instruct 150K Dataset | LLaVA-7B Phase-2 with CLIP ViT image encoder, Vicuna-7B text encoder, and mlp2x-gelu feedforward network for the projector. Freezes image_model during training. |
Model Training Approach
A common approach to building high-quality multimodal models with limited data is to initialize the vision encoder and language model from pretrained checkpoints (for instance, CLIP ViT-L/14-336 for vision and LLaMA/Mistral/Zephyr models for language). While there are many possible recipes for training the model, the high-level goals are as follows:
- Pre-training for feature alignment: This involves training the randomly initialized projector weights to align the image features with the LLM embeddings. Optionally, this could also involve training all the blocks (vision encoder, LLM, and projector) together for further alignment of modalities.
- Instruction fine-tuning: In this stage, the model is trained to handle multimodal question-answering and dialogue.
Steps to train a model
The high-level steps for training this model are consistent with those for other models such as LLMs:
- Dataset preparation: Download datasets of interest and process them using our data pre-processing scripts to generate H5 files
- Checkpoint preparation: Download pretrained checkpoints for vision and language models to prepare the initial checkpoint
- Training: Train the model using run.py
- Export to HF: Convert the checkpoint to HF checkpoint format
- Evaluation: Evaluate using standard multimodal benchmarks such as lmms-eval or the LLaVA source repository
Step 1: Dataset Prep
Please follow instructions for data preprocessing in our documentation.
Step 2: Checkpoint Prep
The checkpoint converter script for converting vision encoder and LLM checkpoints to CS format requires the following directory structure.
Below, we describe how to set up CLIP + LLaMA3 8B as a LLaVA model. The same approach can be followed to set up other vision and LLM models, as well as other multimodal architecture variants.
a. The openai/clip-vit-large-patch14-336 checkpoints, config.json, and preprocessor_config.json should be downloaded to a subdirectory image_model.
b. The LLaMA3 checkpoints and tokenizer files should be downloaded to a subdirectory text_model.
c. Rename config.json in the text_model subdirectory to config_lmsys.json:
mv /path/to/pretrained/checkpoints/text_model/config.json /path/to/pretrained/checkpoints/text_model/config_lmsys.json
d. Download the LLaVA-8B config.json from HuggingFace.
We perform steps (c) and (d) because additional information about the LLaVA model, such as mm_projector_type, is needed to build the appropriate CS config YAML and checkpoint. A shell sketch of steps (a) through (d) is shown below.
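The following is a minimal sketch of steps (a) through (d), assuming the huggingface-cli tool is available; the LLaMA3 repository id, the LLaVA-8B repository placeholder, and the destination of the LLaVA config.json are assumptions, so adapt them to your setup.

```bash
# Hedged sketch of steps (a)-(d); repository ids and destinations are assumptions.
CKPT_DIR=/path/to/pretrained/checkpoints

# (a) CLIP vision encoder files into image_model/
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir $CKPT_DIR/image_model

# (b) LLaMA3 checkpoints and tokenizer files into text_model/ (repo id assumed)
huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir $CKPT_DIR/text_model

# (c) Rename the original LLM config
mv $CKPT_DIR/text_model/config.json $CKPT_DIR/text_model/config_lmsys.json

# (d) LLaVA-8B config.json; <llava-8b-repo> is a placeholder for the HF repository you use,
#     and placing it in text_model/ is an assumption
huggingface-cli download <llava-8b-repo> config.json --local-dir $CKPT_DIR/text_model
```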
e. Convert the checkpoints to CS Model Zoo format using the checkpoint converter:
- Checkpoint conversion script: modelzoo/tools/convert_checkpoint.py
- LLaVA Model checkpoint converter: modelzoo/tools/checkpoint_converters/mm_simple.py
- Command: see the hedged example below
More information about the available checkpoint converters can be obtained by running:
python modelzoo/tools/convert_checkpoint.py list
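Below is a minimal sketch of the conversion invocation. The subcommand and flag values (--model, --src-fmt, --tgt-fmt, --output-dir) are assumptions based on other Model Zoo converters; confirm the exact names and supported formats with the list command above.

```bash
# Hedged sketch; flag names and format identifiers are assumptions, verify them with
# the `list` subcommand before running.
python modelzoo/tools/convert_checkpoint.py convert \
    --model mm_simple \
    --src-fmt hf \
    --tgt-fmt cs-current \
    --output-dir /path/to/converted_checkpoint \
    --config /path/to/pretrained/checkpoints \
    /path/to/pretrained/checkpoints
```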
Step 3: Training the model on a CS system using run.py
IMPORTANT: See the following notes before proceeding further.
Parameter settings in YAML config file: The config YAML files are located in the configs directory. Before starting a training run, make sure that the YAML configs used have the following set correctly:
- The train_input.data_dir parameter points to the correct dataset.
- The train_input.img_data_dir parameter points to the correct parent directory containing all images needed by the dataset.
- The train_input.image_size parameter corresponds to the image size of the dataset.
- Also change the sizes in train_input.transforms appropriately if train_input.image_size is updated.
- The image_model.image_size parameter is the image size passed to each ViTModel.
- The image_model.patch_size parameter sets the patch size used within each ViTModel.
- The model.freeze parameter contains the regex patterns used to freeze the appropriate layers in the model.
- The image_model.image_layer_idx parameter specifies the image_model encoder layer from which features are extracted for the input image.
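As a quick, optional sanity check before launching a run, you can grep the chosen config for these fields; the config path below is just one of the provided LLaVA configs.

```bash
# Optional sanity check: list the relevant fields in the chosen config before training.
# The config path is illustrative; substitute your own YAML file.
grep -nE "data_dir|img_data_dir|image_size|patch_size|freeze|image_layer_idx|transforms" \
    configs/params_mm_llava_llama2_7b_phase1.yaml
```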
YAML config files: Details on the configs for this model can be found in Configuration files included for this model.
In the following example run commands, we use /path/to/yaml, /path/to/model_dir, and train as placeholders for user-supplied inputs.
- /path/to/yaml is a path to the YAML config file with model parameters, such as one of the configurations described in Configuration files included for this model.
- /path/to/model_dir is a path to the directory where we would like to store the logs and other artifacts of the run.
- --mode specifies the desired mode to run the model in. Change to --mode eval to run in eval mode.
To compile/validate, run train and eval on Cerebras System
Please follow the instructions on our quickstart in the Developer Docs.
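For reference, a launch on a Cerebras system typically looks like the sketch below; the cluster-specific flags (--num_csx, --mount_dirs, --python_paths) are assumptions about your setup, so follow the quickstart for the authoritative command.

```bash
# Hedged sketch of a CS-system launch; adjust cluster-specific flags per the quickstart.
python run.py CSX \
    --params /path/to/yaml \
    --model_dir /path/to/model_dir \
    --mode train \
    --num_csx 1 \
    --mount_dirs /path/to/data /path/to/modelzoo \
    --python_paths /path/to/modelzoo
```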
To run train and eval on GPU/CPU
If running on a CPU or GPU, activate the environment from Python GPU Environment setup, and simply run:
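A sketch of the GPU/CPU invocation, assuming the same run.py entry point and placeholders as above; only the target device argument changes.

```bash
# Hedged sketch of a GPU/CPU run; same placeholders as above.
python run.py GPU \
    --params /path/to/yaml \
    --model_dir /path/to/model_dir \
    --mode train
```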
Based on your training recipe, you might choose to train the multimodal model in multiple phases. Please specify the params file and model directory for each run; this is consistent with any multi-stage training approach on Cerebras.
Step 4: Convert checkpoint to source code repository format to run eval
We perform evaluations on multimodal benchmarks using the LLaVA source code repository. For this, we need to convert the checkpoints generated by the Phase-2 training run to the LLaVA source code repository format, using the checkpoint converter as sketched below.
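The sketch below again uses modelzoo/tools/convert_checkpoint.py; the --model name, format identifiers, checkpoint filename, and --output-dir flag are assumptions, so confirm them with the converter's list subcommand.

```bash
# Hedged sketch: converting the Phase-2 CS checkpoint to the source-repository format.
# Flag names, format identifiers, and the checkpoint filename are assumptions; verify
# them with `python modelzoo/tools/convert_checkpoint.py list`.
python modelzoo/tools/convert_checkpoint.py convert \
    --model mm_simple \
    --src-fmt cs-current \
    --tgt-fmt hf \
    --output-dir /path/to/output-dir \
    --config /path/to/yaml \
    /path/to/model_dir/checkpoint_to_convert.mdl
```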
The above command generates two folders, image_model and text_model, under output-dir, as shown below:
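An illustrative layout of output-dir after conversion (only the two folder names are guaranteed; file contents depend on the converter and model):

```
output-dir/
├── image_model/   # weights for the source repository's vision_tower
└── text_model/    # weights for the language model and projectors
```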
- The folder image_model consists of the weights for vision_tower in the source repository.
- The folder text_model consists of the weights to be loaded for the language model and projectors.
- The LLaVA source code repository expects tokenizer files to be present along with the language model weights (code pointer). For this, please copy the tokenizer files into the text_model folder.
- Also, please make sure text_model.mm_vision_tower points to the image_model path to ensure the weights from the image_model folder are loaded into the source code vision_tower. This path is automatically added during checkpoint conversion.
- Rename the folder text_model to text_model_llava. This is because the source code repository expects the path to include the llava keyword in order to correctly load the checkpoints (code pointers: builder.py, mm_utils.py).
- After the relevant tokenizer files are copied, the output-dir should look like below:
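An illustrative final layout (file names are indicative only and depend on the model and tokenizer used):

```
output-dir/
├── image_model/          # vision_tower weights and configs
└── text_model_llava/     # language model + projector weights, configs, and copied tokenizer files
    ├── config.json
    ├── tokenizer files   # e.g. tokenizer.json / tokenizer.model, copied in the step above
    └── ...
```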
Step 5: Set up source code repository for benchmark evaluation and run evaluation benchmarks
- Set up the LLaVA source code repository for multimodal benchmark evaluation by following the instructions in the Evaluation Docs.
- Instructions for creating the conda environment and setting up the repository are in the Installation section.
- Scripts to run various benchmarks are provided here.
- Pass the text_model_llava folder path to --model-path in the eval scripts in the LLaVA source code repository, as in the sketch after this list.
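As an illustration only (the actual entry points and remaining flags come from the LLaVA repository's eval scripts and may differ), passing the converted checkpoint might look like:

```bash
# Illustrative only: the module name and the other flags are assumptions taken from
# typical LLaVA eval scripts; the key point is that --model-path receives text_model_llava.
python -m llava.eval.model_vqa_loader \
    --model-path /path/to/output-dir/text_model_llava \
    --question-file /path/to/questions.jsonl \
    --image-folder /path/to/images \
    --answers-file /path/to/answers.jsonl
```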
Implementation notes
The following modifications and assumptions are made in this implementation:
- This implementation brings in support for multiple images per sample, interleaving of images and text, and placement of images at arbitrary positions within the sample. However, we currently do not have a checkpoint converter for these new features, since there is no single HF model that supports them.
- This implementation assumes that the H5 files for the dataset are created with the release 2.3 data preprocessor, and is not backward compatible with the H5 datasets produced with the previous release (2.2).
- We currently expect all images to be under a single parent folder, with the relative paths of images from the different datasets written under image_key in the generated H5 files. For example:
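An illustrative layout (the dataset names and file names are examples only):

```
# Single parent folder pointed to by train_input.img_data_dir
/path/to/images/
├── coco/train2017/000000123456.jpg
├── gqa/images/2354786.jpg
└── vg/VG_100K/2316149.jpg

# The image_key stored in the H5 file is the path relative to that parent folder,
# e.g. "coco/train2017/000000123456.jpg"
```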