LLaVA
Multimodal model that connects a vision encoder to a language model through instruction tuning on GPT-4-generated image-text data.
Model Description
LLaVA (Large Language and Vision Assistant) is a multimodal model that integrates a vision encoder with a language model via a lightweight projector module, enabling end-to-end visual and language understanding. It accepts both image and text inputs and generates text-based outputs, making it suitable for instruction-following, question answering, and general-purpose visual dialogue tasks.
The architecture consists of three components:
- A vision encoder initialized from pretrained OpenAI CLIP-ViT-L/14-336px.
- A language model initialized from Vicuna weights.
- A projector module, implemented as a multi-layer perceptron (MLP), which maps image embeddings into the language model’s token embedding space (a minimal sketch follows this list).
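For intuition, here is a minimal PyTorch sketch of such a projector. The dimensions are placeholders chosen to match CLIP ViT-L and a 7B language model, and the class name is illustrative; the actual implementation lives in modeling_llava.py:

```python
import torch
import torch.nn as nn

class ProjectorMLP(nn.Module):
    """Minimal sketch: map vision-encoder patch embeddings into the LLM embedding space."""

    def __init__(self, vision_dim: int = 1024, text_dim: int = 4096):  # placeholder dims
        super().__init__()
        # LLaVA-1.5 style two-layer MLP with a GELU non-linearity in between.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision encoder.
        # Output: (batch, num_patches, text_dim), spliced into the LLM input embeddings.
        return self.proj(image_features)


# Example: 576 patches from a 336x336 image with 14x14 patches.
embeddings = ProjectorMLP()(torch.randn(2, 576, 1024))  # -> (2, 576, 4096)
```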
Training occurs in two distinct phases:
- Feature Alignment Pretraining: Only the projector module is trained in this phase; its weights are updated to align the image features from the vision encoder with the word embeddings of the language model.
- Instruction Finetuning: The model is trained on instruction-following data to enable chatbot capabilities. During this phase, the language model and projector are typically finetuned, while the vision encoder remains frozen (see the sketch below).
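A rough sketch of what this phased freezing amounts to in PyTorch terms. The module names image_model, projector, and text_model are illustrative; in the ModelZoo the same effect is achieved through the model.freeze regex patterns in the YAML configs:

```python
import torch.nn as nn

def configure_trainable_params(model: nn.Module, phase: int) -> None:
    """Illustrative only: choose which submodules receive gradients in each phase."""
    # The vision encoder stays frozen in both phases.
    model.image_model.requires_grad_(False)
    # The projector is trained in both phases.
    model.projector.requires_grad_(True)
    # The language model is frozen in Phase 1 and finetuned in Phase 2.
    model.text_model.requires_grad_(phase == 2)
```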
Code Structure
The code for this model is located in the llava directory within the ModelZoo. Here’s how it’s organized:
- configs/: Contains YAML configuration files used for training, evaluation, and instruction fine-tuning.
- model.py: Defines the top-level LLaVA model class, encapsulating the vision encoder, projector, and language model modules, and orchestrating forward passes for training and inference.
- modeling_llava.py: Implements core building blocks, including the projector MLP, model loading utilities, and integration logic to bridge image features into the LLM token embedding space.
Available Configurations
Configuration | Description |
---|---|
params_llava_v1p5_pretrain_13b_phase1_MSL2K.yaml | Phase 1: Feature alignment pretraining for 13B model (MSL=2048). |
params_llava_v1p5_pretrain_13b_phase2_MSL2K.yaml | Phase 2: Instruction fine-tuning for 13B model (MSL=2048). |
params_llava_v1p5_pretrain_7b_phase1_MSL2K.yaml | Phase 1: Feature alignment pretraining for 7B model (MSL=2048). |
params_llava_v1p5_pretrain_7b_phase2_MSL2K.yaml | Phase 2: Instruction fine-tuning for 7B model (MSL=2048). |
Dataset Download and Preprocessing
Please follow the instructions here to download the required datasets from HuggingFace Datasets.
Since all datasets on the HuggingFace Hub are Git repositories, they can be downloaded locally by running git clone:
We provide the script preprocess_dataset.py to further preprocess some of the Phase-1 and Phase-2 datasets into the correct LLaVA jsonl format. Please see the script's help message for which datasets are covered, and refer to the sections below for details on each individual dataset. We also provide a utility option convert_json2jsonl to convert a folder of json files to jsonl files; jsonl is the input format that the subsequent HDF5 processing scripts act on.
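For orientation, one line of a LLaVA-style jsonl file is a single JSON object describing an image and its conversation turns. The values below are hypothetical and the field names follow the public LLaVA data format; treat this as an illustration rather than the exact schema expected by the preprocessing scripts:

```python
import json

# Hypothetical example of one jsonl line in the public LLaVA conversation format.
record = {
    "id": "chartqa_000001",
    "image": "ChartQA_Dataset/train/png/chart_0001.png",  # path relative to the image parent folder
    "conversations": [
        {"from": "human", "value": "<image>\nWhat does the tallest bar represent?"},
        {"from": "gpt", "value": "It represents sales in March."},
    ],
}
print(json.dumps(record))  # each example is serialized as one line of the .jsonl file
```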
Note: We currently expect all images to reside under a single parent folder; the relative path of each image (from any of the datasets) is written under image_key in the generated H5 files.
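For example, here is a hypothetical illustration (directory and file names are made up) of how images from several datasets might sit under one parent folder, with image_key holding the dataset-relative path:

```python
import os

# Assumed parent folder for all images (dataset.image_dir / train_input.img_data_dir).
image_dir = "/data/llava_images"

# Relative paths stored under image_key in the H5 files (hypothetical values).
image_keys = [
    "LLaVA-Pretrain/images/00123/001234567.jpg",
    "ChartQA_Dataset/train/png/chart_0001.png",
    "DVQA/images/bar_train_00042.png",
]

for key in image_keys:
    # Preprocessing joins the parent folder with the relative path, checks that the
    # file exists, and drops examples whose image is missing.
    print(os.path.join(image_dir, key))
```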
Phase-1: Pre-training for Feature alignment datasets:
LLaVA Visual Instruct Pretrain LCS-558K Dataset
- Download from https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain
- Dataset and images can be directly downloaded from HuggingFace Hub Datasets using
git clone git@hf.co:datasets/liuhaotian/LLaVA-Pretrain
- No further preprocessing is required
ShareGPT4V-PT Dataset
- Download the dataset share-captioner_coco_lcs_sam_1246k_1107.json from HuggingFace here.
- This dataset consists of 100K high-quality captions collected from advanced GPT4-Vision and has been expanded to 1.2 million captions with a caption model trained on this subset.
- Images for this dataset can be downloaded by following the instructions here
- No further preprocessing is required
Synthdog-EN Dataset
- Download the dataset from HuggingFace: https://huggingface.co/datasets/naver-clova-ix/synthdog-en
- The images for this dataset are present in the parquet files
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/synthdog-en
  - Use preprocess_dataset.py to process the data into LLaVA jsonl format. Please see the required input arguments.
  - Example command:
Phase-2: Instruction Finetuning datasets
LLaVA Visual Instruct 150K Dataset
- Download the dataset llava_v1_5_mix665k.json from https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json
- Images corresponding to this dataset can be downloaded by following the instructions here
- No further preprocessing is required
ShareGPT4V-SFT Dataset:
- Download the dataset sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json from HuggingFace here.
- This dataset is built by replacing 23K image-text pairs related to the image captioning task in LLaVA-mix-665K with an equivalent subset of collected GPT4V-generated high-quality image-text pairs.
- Images for this dataset can be downloaded by following the instructions here
- No further preprocessing is required
ChartQA Dataset
- Download the dataset from HuggingFace Hub Datasets: https://huggingface.co/datasets/ahmed-masry/ChartQA
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/ChartQA_Dataset
  - Use preprocess_dataset.py to process the data into LLaVA jsonl format. Please see the required input arguments.
  - Example command:
DVQA Dataset
- The dataset can be downloaded by following instructions mentioned in https://github.com/kushalkafle/DVQA_dataset?tab=readme-ov-file#download-links
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/DVQA
  - Use preprocess_dataset.py to process the data into LLaVA jsonl format. Please see the required input arguments.
  - Example command:
AI2 Diagram Dataset
- Download the dataset and images using
- Steps for preprocessing the dataset:
  - Unzip the downloaded zip file to /<path>/ai2d
  - Use preprocess_dataset.py to process the data into LLaVA jsonl format. Please see the required input arguments.
  - Example command:
ArxivQA Dataset
- The dataset can be downloaded from https://huggingface.co/datasets/MMInstruction/ArxivQA
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/ArxivQA
  - Use preprocess_dataset.py to process the data into LLaVA jsonl format. Please see the required input arguments.
  - Example command:
ArxivCap Dataset
- The dataset can be downloaded from https://huggingface.co/datasets/MMInstruction/ArxivCap.
- We process and use only figures with captions. Any subfigures are not included.
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/ArxivCAP
  - Use preprocess_dataset.py to process the data into LLaVA jsonl format. Please see the required input arguments.
  - Example command:
DocVQA Dataset
- Download the dataset following instructions here
- Registration is required to download the dataset under Single Document Visual Question Answering
- Steps for preprocessing the dataset:
  - Place the downloaded files at /<path>/DocVQA
  - Use preprocess_dataset.py to process each subset into LLaVA jsonl format. Please see the required input arguments.
  - Example commands:
Sequence of the steps to perform
The high-level steps for training a model are relatively simple: data processing, followed by model training and evaluation.
- Step 1: Dataset download and preprocessing
- Step 2: Generate H5 files for training
  - Generate files for the Phase-1 (Pre-training for Feature alignment) stage
  - Generate files for the Phase-2 (Instruction Finetuning) stage
- Step 3: Download pretrained checkpoints for Phase 1
- Step 4: Convert checkpoints to CS Model Zoo format using the checkpoint converter
- Step 5: Train the model on a CS system or GPU using run.py
  - To compile/validate, run train and eval on Cerebras System
  - To run train and eval on GPU/CPU
  - Phase-1 training
  - Phase-2 training
- Step 6: Convert the checkpoint to source code repository format to run eval
- Step 7: Set up the source code repository for benchmark evaluation and run evaluation.
The steps are elaborated below:
Step 1: Dataset download and preparation
Please follow the instructions in Dataset Download and Preprocessing to set up the datasets for the appropriate phase before H5 file generation.
Step 2: Generate H5 files for training
The next step is to generate the H5 files consumed during training by LlavaHDF5MapDataProcessor. We use create_hdf5_dataset.py to create the preprocessed dataset files. Further details on usage and instructions can be found here.
Generate files for Phase-1 (Pre-training for Feature alignment) stage
Refer to LlavaPhaseOnePreprocessor and the config file for Phase-1 H5 file generation: llava_phase_1_preproc.yaml.
Please update the following fields in llava_phase_1_preproc.yaml appropriately:
- setup.input_dir: Input data directory containing jsonl files.
- setup.output_dir: Output directory to save the generated H5 files.
- setup.processes: Adjust based on the cores available for parallel processing.
- processor.tokenizer_type: Tokenizer to use.
- processor.max_sequence_length: Maximum sequence length that the model is trained on. This includes the token positions used for image data features, which means the number of positions available for text tokens is processor.max_sequence_length - dataset.num_patches - 1 (BOS token).
- dataset.num_patches: Number of patches obtained after the image is patchified. This is computed from the image size and patch size (see the sketch after this list).
- dataset.image_dir: Parent directory where all the images are present. Used along with the relative path under the image_key field in the jsonl to check that images exist and to discard examples with no image.
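As a worked example, here is a small sketch of the patch and text-token budget, assuming the standard ViT patchification of the CLIP-ViT-L/14-336px encoder (the 336 and 14 values are assumptions taken from the checkpoint name):

```python
# Token budget for an MSL=2048 config with a 336px image and 14px patches (assumed values).
image_size = 336
patch_size = 14
max_sequence_length = 2048

num_patches = (image_size // patch_size) ** 2              # 24 * 24 = 576
text_token_budget = max_sequence_length - num_patches - 1  # minus 1 BOS token -> 1471

print(num_patches, text_token_budget)
```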
Command to generate H5 files for Phase 1 and shuffle
Shuffle samples across H5 files
More information about the shuffling script can be found here.
Generate files for Phase-2 (Instruction Finetuning) stage
Refer to LlavaPhaseTwoPreprocessor and the config file for Phase-2 H5 file generation: llava_phase_2_preproc.yaml.
The fields to be updated include:
- All fields mentioned for Phase-1 above.
- The field dataset.system_prompt_style: vicuna_v1. This is used to transform the instruction finetuning dataset into the vicuna_v1 format, with the appropriate system message and USER and ASSISTANT values (see the illustration after this list). Note that we currently support vicuna_v1 only.
- Support for other formats such as llama and zephyr is planned for future releases.
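For illustration, here is a hedged sketch of what a vicuna_v1-formatted sample roughly looks like. The exact system message and separators are applied by the preprocessor, and the user/assistant strings below are made up:

```python
# Approximate vicuna_v1 conversation template (illustrative only).
system = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
user_turn = "<image>\nWhat is shown in this chart?"      # hypothetical user message
assistant_turn = "A bar chart comparing monthly sales."  # hypothetical answer

sample = f"{system} USER: {user_turn} ASSISTANT: {assistant_turn}</s>"
print(sample)
```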
Command to generate H5 files for Phase 2
Shuffle samples across H5 files
More information about the shuffling script can be found here.
Step 3: Download pretrained checkpoints for Phase-1
The checkpoint converter script for converting CLIP-ViT and Vicuna checkpoints to CS format requires the following directory structure:
a. openai/clip-vit-large-patch14-336 checkpoints, config.json, and preprocessor_config.json should be downloaded to a subdirectory image_model.
b. lmsys/vicuna-7b-v1.5 checkpoints and tokenizer files should be downloaded to a subdirectory text_model.
c. Rename config.json to config_lmsys.json.
d. Download the LLaVA-7B config.json from HuggingFace.
We perform steps (c) and (d) above because we need additional information about the LLaVA model, such as mm_projector_type, to build the appropriate CS config yaml and checkpoint.
Note: In the case of LLaVA-13B:
- The image model remains the same, so you follow the same step (a) as above and download from openai/clip-vit-large-patch14-336.
- Download text_model from lmsys/vicuna-13b-v1.5 in step (b).
- Rename config.json to config_lmsys.json, same as step (c).
- Download the LLaVA-13B config.json in step (d).
Step 4: Convert checkpoints to CS Model Zoo format using checkpoint converter
- Checkpoint conversion script: modelzoo/tools/convert_checkpoint.py
- LLaVA Model checkpoint converter: modelzoo/tools/checkpoint_converters/llava.py
- Command:
More information about checkpoint converters can be obtained by running:
python modelzoo/tools/convert_checkpoint.py list
Step 5: Training the model on CS system or GPU using run.py
IMPORTANT: See the following notes before proceeding further.
Parameter settings in YAML config file: The config YAML files are located in the configs directory. Before starting a training run, make sure that the YAML configs used have the following set correctly:
- The train_input.data_dir parameter points to the correct dataset.
- The train_input.img_data_dir parameter points to the correct parent directory containing all images needed by the dataset.
- The train_input.image_size parameter corresponds to the image size of the dataset.
- Also change the sizes in train_input.transforms appropriately if train_input.image_size is updated.
- The model.image_model.image_size parameter is the image size passed to the ViTModel.
- The model.image_model.patch_size parameter sets the patch size.
- model.freeze contains the regex patterns used to freeze the appropriate layers in the model.
- The model.image_feature_select_layer_idx parameter specifies the image_model encoder layer from which features are extracted for the input image.
- The model.image_start_idx parameter should be set based on the data_params.json file that is saved when H5 files are generated using create_hdf5_dataset.py. In general:
  - Phase-1: model.image_start_idx: 1
  - Phase-2 with dataset.system_prompt_style: vicuna_v1: model.image_start_idx: 35
YAML config files: Details on the configs for this model can be found in the configs included for this model
In the following example run commands, we use /path/to/yaml, /path/to/model_dir, and train as placeholders for user-supplied inputs.
- /path/to/yaml is a path to the YAML config file with model parameters, such as one of the configurations described in the configs included for this model.
- /path/to/model_dir is a path to the directory where we would like to store the logs and other artifacts of the run.
- --mode specifies the desired mode to run the model in. Change to --mode eval to run in eval mode.
To run train and eval on GPU/CPU
If running on a CPU or GPU, activate the environment from Python GPU Environment setup, and simply run:
In the case of LLaVA, training is a two-stage process, as below:
- Phase-1 (Pre-training for Feature alignment) stage
  - To launch this phase, we initialize the model using the converted checkpoint from Step 4.
  - Command:
- Phase-2 (Instruction Finetuning) stage
  - When instruction finetuning, the model is initialized from the Phase-1 checkpoint.
  - Command:
  - Note: The Phase-2 yaml should only load model states from the Phase-1 checkpoint, by setting the yaml flag runconfig.load_checkpoint_states: "model"
Step 6: Convert checkpoint to source code repository format to run eval
We perform evaluations on multimodal benchmarks using the LLaVA source code repository. For this, we need to convert the checkpoints generated by the Phase-2 training run to the LLaVA source code repository format. This can be done using the command:
The above command generates two folders, image_model and text_model, under output-dir, as shown below:
- The folder image_model contains the weights for the vision_tower in the source repository.
- The folder text_model contains the weights to be loaded for the language model and projectors.
- The LLaVA source code repository expects the tokenizer files to be present alongside the language model weights (code pointer). For this, please copy the tokenizer files into the text_model folder.
- Also, please make sure text_model.mm_vision_tower points to the image_model path, so that the weights from the image_model folder are loaded into the source code vision_tower. This path is added automatically during checkpoint conversion.
- Rename the folder text_model to text_model_llava. This is because the source code repository expects the path to include the llava keyword in order to correctly load the checkpoints (code pointers: builder.py, mm_utils.py).
- After the relevant tokenizer files are copied, the output-dir should look like below:
Step 7: Set up source code repository for benchmark evaluation and run evaluation benchmarks
- Set up the LLaVA source code repository for multimodal benchmark evaluation by following the instructions in the Evaluation Docs.
- Instructions for creating a conda environment and setting up the repository are in the Installation section.
- Scripts to run various benchmarks are provided here.
- Pass the text_model_llava folder path to --model-path in the eval scripts in the LLaVA source code repository.
DataLoader Features Dictionary
LlavaHDF5MapDataProcessor outputs a features dictionary with the following keys and values (a dummy-batch sketch follows this list):
- image_data: Image tensor
  - Shape: (batch_size, model.num_channels, model.image_model.image_size[0], model.image_model.image_size[1])
  - Type: torch.float16
- labels: Text tokens to be predicted by the model
  - Shape: (batch_size, model.text_model.max_sequence_length)
  - Type: torch.int32
- key_padding_mask: Mask indicating the positions of image tokens. Used in conjunction with the causal attention mask (generated on the fly).
  - Shape: (batch_size, model.text_model.max_sequence_length)
  - Type: torch.int32
  - 1 at positions we do NOT want to attend to, 0 otherwise
- text_input_ids: Tensor of input text tokens. These include <pad> tokens inserted at positions [model.image_start_idx : model.image_start_idx + num_patches]
  - Shape: (batch_size, model.text_model.max_sequence_length)
  - Type: torch.int32
- loss_mask: Mask indicating the positions to consider when computing loss
  - Shape: (batch_size, model.text_model.max_sequence_length)
  - Type: torch.int32
  - 1 at positions where we want to compute the loss, 0 otherwise
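The sketch below assembles a dummy batch with these shapes and dtypes. The batch size, image size, and maximum sequence length are assumptions consistent with the 336px, MSL=2048 configs, not values read from the data processor:

```python
import torch

batch_size, num_channels = 2, 3
image_size = (336, 336)   # model.image_model.image_size (assumed)
msl = 2048                # model.text_model.max_sequence_length (assumed)

# Placeholder batch with the shapes and dtypes listed above.
batch = {
    "image_data": torch.zeros(batch_size, num_channels, *image_size, dtype=torch.float16),
    "labels": torch.zeros(batch_size, msl, dtype=torch.int32),
    "key_padding_mask": torch.zeros(batch_size, msl, dtype=torch.int32),  # 1 = do NOT attend
    "text_input_ids": torch.zeros(batch_size, msl, dtype=torch.int32),
    "loss_mask": torch.zeros(batch_size, msl, dtype=torch.int32),         # 1 = include in loss
}

for name, tensor in batch.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```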
Implementation notes
The following modifications and assumptions are made in this implementation:
- Phase-2 instruction finetuning data includes samples that contain text-only data. For these cases, we pass a dummy image and make sure we do not attend to the dummy image features, using src_key_padding_mask from the dataloader.
- Our preprocessing scripts and model definitions all assume that the image occurs at a fixed location within the context length. This is specified in the model definition using the model.image_start_idx parameter in the yaml.
- We currently support datasets that contain a single image per sample.
- We currently do not support interleaving of multiple images with text.
- We currently expect all images under a single parent folder, with the relative paths of images from different datasets written under image_key in the generated H5 files (see the example under Dataset Download and Preprocessing).
References
- LLaVA-v1: Visual Instruction Tuning
- LLaVA-v1.5: Improved Baselines with Visual Instruction Tuning
- LLaVA source code repository
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
- SynthDog-EN: OCR-Free Document Understanding Transformer
- ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
- DVQA: Understanding Data Visualizations via Question Answering
- AI2D: A Diagram Is Worth A Dozen Images
- ArxivQA & ArxivCap: Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models
- DocVQA: A Dataset for VQA on Document Images