Model Overview
Multimodal Simple is our multimodal library, which can be used to instantiate many current state-of-the-art models such as LLaVA, CogVLM, and MM1, among others. Our implementation supports multiple images interleaved with text as input and generates text as output. The building blocks of this implementation are as follows:
- Vision Encoder: Processes images through one or more image encoders to produce embeddings.
- Image Embedding Projector: Projects the embeddings from the vision encoder into a shared latent space with the LLM using MLPs.
- Language Model: Accepts the vision and language embeddings as input and produces text as output.
Structure of the code
- configs/: YAML configuration files.
- modeling_mmsimple.py: Defines the core multimodal model.
- model.py: The entry point to the model.
- run.py: Training script. Performs training and validation.
Configuration files included for this model
We provide the following config files for LLaVA, located under the configs directory.
| Config File | Dataset | Notes |
|---|---|---|
| params_mm_llava_llama2_7b_phase1.yaml | LLaVA Visual Instruct Pretrain LCS-558K Dataset | LLaVA-7B Phase-1 with CLIP ViT image encoder, Vicuna-7B text encoder and mlp2x-gelu Feedforward network for Projector. Freeze image_model and text_model during training |
| params_mm_llava_llama2_7b_phase2.yaml | LLaVA Visual Instruct 150K Dataset | LLaVA-7B Phase-2 with CLIP ViT image encoder, Vicuna-7B text encoder and mlp2x-gelu Feedforward network for Projector. Freeze image_model during training |
Model Training Approach
A common approach to building high-quality multimodal models with limited data is to initialize the vision encoder and language model from pretrained checkpoints (for instance, CLIP-ViT-L-336/14 for vision and LLaMA/Mistral/Zephyr models for language). While there are many possible recipes for training the model, the high-level goals are as follows:
- Pre-training for Feature Alignment: This involves training the randomly initialized projector weights to align the image features with the LLM embeddings. Optionally, this could also involve training all the blocks (vision encoder, LLM, and projector) together for further alignment of modalities.
- Instruction Fine-tuning: In this stage, the model is trained to handle multimodal question-answering and dialogue.
Steps to train a model
The high-level steps for training this model are consistent with those for other models such as LLMs:
- Dataset preparation: Download datasets of interest and process them using our data pre-processing scripts to generate H5 files
- Checkpoint preparation: Download pretrained checkpoints for vision and language models to prepare the initial checkpoint
- Training: Train the model using run.py
- Export to HF: Convert checkpoint to HF checkpoint format
- Evaluation: Use standard multimodal benchmarks such as lmms-eval or the LLaVA source repo
Step 1: Dataset Prep
Please follow the instructions for data preprocessing in our documentation.
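As an illustration only, a typical preprocessing invocation might look like the sketch below; the script path and the data config name are assumptions that depend on your ModelZoo release, so defer to the documentation above for the exact workflow.

```bash
# Sketch only: script location and config layout vary by ModelZoo release.
# Generates H5 files from the raw dataset according to the data config.
python modelzoo/data_preparation/data_preprocessing/preprocess_data.py \
    --config /path/to/llava_phase1_data_config.yaml
```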
Step 2: Checkpoint Prep
The checkpoint converter script for converting vision encoder and LLM checkpoints to CS format requires the following directory structure:
a. Vision encoder (e.g., CLIP ViT) checkpoints should be downloaded to a subdirectory image_model
b. LLAMA3 checkpoints and tokenizer files should be downloaded to a subdirectory text_model
c. Rename config.json to config_lmsys.json
mv /path/to/pretrained/checkpoints/text_model/config.json /path/to/pretrained/checkpoints/text_model/config_lmsys.json
d. Download LLaVA-8B config.json from HuggingFace
We do steps (c) and (d) above because we need additional information about the LLaVA model, such as mm_projector_type, to build the appropriate CS config YAML and checkpoint.
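As a rough sketch of steps (a) and (b), the downloads could be done as follows; the HuggingFace repo IDs shown here are assumptions and should be replaced with the exact vision and language checkpoints you intend to use.

```bash
# Expected layout: image_model/ and text_model/ under one checkpoints directory.
mkdir -p /path/to/pretrained/checkpoints/image_model
mkdir -p /path/to/pretrained/checkpoints/text_model

# (a) Vision encoder weights (repo ID is an assumption).
huggingface-cli download openai/clip-vit-large-patch14-336 \
    --local-dir /path/to/pretrained/checkpoints/image_model

# (b) Language model weights and tokenizer files (repo ID is an assumption).
huggingface-cli download meta-llama/Meta-Llama-3-8B \
    --local-dir /path/to/pretrained/checkpoints/text_model
```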
e. Convert checkpoints to CS Model Zoo format using the checkpoint converter
- Checkpoint conversion script: modelzoo/tools/convert_checkpoint.py
- LLaVA Model checkpoint converter: modelzoo/tools/checkpoint_converters/mm_simple.py
- Command:
python modelzoo/tools/convert_checkpoint.py list
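Beyond the list command, a conversion into CS format might look like the sketch below; the model name and format tags used here are assumptions, so use the list command above to confirm the identifiers supported by your ModelZoo release.

```bash
# Sketch only: model/format identifiers and required flags may differ by release.
python modelzoo/tools/convert_checkpoint.py convert \
    /path/to/pretrained/checkpoints \
    --model llava \
    --src-fmt hf \
    --tgt-fmt cs-2.3 \
    --output-dir /path/to/converted_checkpoint
```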
Step 3: Training the model on a CS system using run.py
IMPORTANT: See the following notes before proceeding further.
Parameter settings in YAML config file: The config YAML files are located in the configs directory. Before starting a training run, make sure that the YAML configs used have the following set correctly:
- The `train_input.data_dir` parameter points to the correct dataset.
- The `train_input.img_data_dir` parameter points to the correct parent directory containing all images needed by the dataset.
- The `train_input.image_size` parameter corresponds to the image size of the dataset.
- Also change the sizes in `train_input.transforms` appropriately if `train_input.image_size` is updated.
- The `image_model.image_size` parameter points to the image size passed to each ViTModel.
- The `image_model.patch_size` parameter is used to set different patch sizes within each ViTModel.
- The `model.freeze` parameter contains the regex patterns used to freeze the appropriate layers in the model.
- The `image_model.image_layer_idx` parameter specifies the image_model encoder layer from which features are extracted for the input image.
In the commands below, `/path/to/yaml`, `/path/to/model_dir`, and `train` are used as placeholders for user-supplied inputs:
- `/path/to/yaml` is a path to the YAML config file with model parameters, such as one of the configurations described in Configs included for this model.
- `/path/to/model_dir` is a path to the directory where we would like to store the logs and other artifacts of the run.
- `--mode` specifies the desired mode to run the model in. Change to `--mode eval` to run in eval mode.
To compile/validate, run train and eval on Cerebras System
Please follow the instructions on our quickstart in the Developer Docs.
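As a sketch, and assuming the standard ModelZoo run.py interface (additional cluster-specific flags may be required), a training launch on a Cerebras system could look like:

```bash
# Sketch only: cluster-specific flags (mount dirs, python paths, etc.) omitted.
python run.py CSX \
    --params /path/to/yaml \
    --mode train \
    --model_dir /path/to/model_dir
```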
To run train and eval on GPU/CPU
If running on a CPU or GPU, activate the environment from the Python GPU Environment setup and run run.py directly; a sketch of the command is shown below.
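A corresponding sketch for a GPU/CPU run, under the same assumptions about the run.py interface:

```bash
# Sketch only: replace GPU with CPU to run on CPU.
python run.py GPU \
    --params /path/to/yaml \
    --mode train \
    --model_dir /path/to/model_dir
```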
Step 4: Convert checkpoint to source code repository format to run eval
We perform evaluations on multimodal benchmarks using the LLaVA source code repository. For this, we need to convert the checkpoints generated by the Phase-2 training run to the LLaVA source code repository format. This can be done with the checkpoint converter; the conversion creates two folders, `image_model` and `text_model`, under `output-dir`, as described below.
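A sketch of this conversion is shown next; the checkpoint filename, model name, and format tags are assumptions, so confirm them with the converter's list command.

```bash
# Sketch only: converts the Phase-2 CS checkpoint to the HF/LLaVA layout,
# producing image_model/ and text_model/ under --output-dir.
python modelzoo/tools/convert_checkpoint.py convert \
    /path/to/model_dir/checkpoint_final.mdl \
    --model llava \
    --src-fmt cs-2.3 \
    --tgt-fmt hf \
    --config /path/to/yaml \
    --output-dir /path/to/output-dir
```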
- Folder `image_model` consists of weights for `vision_tower` in the source repository.
- Folder `text_model` consists of weights to be loaded for the language model and projectors.
- The LLaVA source code repository expects tokenizer files to be present along with the language model weights (code pointer). For this, please copy the tokenizer files into the `text_model` folder.
- Also, please make sure `text_model.mm_vision_tower` points to the `image_model` path so that the weights from the `image_model` folder are loaded into the source code `vision_tower`. This path is automatically added during checkpoint conversion.
- Rename folder `text_model` to `text_model_llava`, since the source code repository expects the path to include the `llava` keyword in order to correctly load the checkpoints (code pointers: builder.py, mm_utils.py).
- After the relevant tokenizer files are copied, `output-dir` should contain the `image_model` folder and the renamed `text_model_llava` folder (with the tokenizer files inside).
Step 5: Set up source code repository for benchmark evaluation and run evaluation benchmarks
- Setup LLaVA source code repository for multimodal benchmark evaluation by following instructions mentioned in Evaluation Docs.
- Instructions for creating a conda environment and setting up the repository are mentioned in the Installation Section
- Scripts to run various benchmarks are provided here
- Pass the `text_model_llava` folder path to `--model-path` in the eval scripts in the LLaVA source code repository
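For illustration, and assuming the LLaVA repository's standard eval entry points, passing the converted checkpoint might look like the following; the question/image/answer paths are placeholders.

```bash
# Sketch only: the eval scripts in the LLaVA repo wrap commands like this one.
python -m llava.eval.model_vqa_loader \
    --model-path /path/to/output-dir/text_model_llava \
    --question-file /path/to/questions.jsonl \
    --image-folder /path/to/images \
    --answers-file /path/to/answers.jsonl
```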
Implementation notes
The following modifications and assumptions are made in this implementation:
- This implementation adds support for multiple images per sample, interleaving of image and text, and placement of images at arbitrary positions within a sample. However, we currently do not have a checkpoint converter for these new features, since there is no single HF model that supports them.
- This implementation assumes that the H5 files for the dataset are created with the release 2.3 data preprocessor, and is not backward compatible with the H5 datasets produced with the previous release (2.2).
- We currently expect all images to be under a single parent folder, with the relative paths of images from different datasets written under `image_key` in the generated H5 files. For example, if `train_input.img_data_dir` is `/data/images` and a sample uses the image `/data/images/coco/train2017/0001.jpg`, then `image_key` stores the relative path `coco/train2017/0001.jpg` (paths here are illustrative).