Overview
This tutorial teaches you about Cerebras essentials like data preprocessing and training scripts, config files, and checkpoint conversion tools. To understand these concepts, you’ll fine-tune Meta’s Llama 3 8B on a small dataset consisting of documents and their summaries.
In this quickstart guide, you will:
- Set up your environment
- Preprocess a small dataset
- Port a trained model from Hugging Face
- Fine-tune and evaluate a model
- Test your model on downstream tasks
- Port your model to Hugging Face
In this tutorial, you will train your model for a short while on a small dataset. A high-quality model requires a longer training run and a much larger dataset.
Prerequisites
To begin this guide, you must have:
Step 1: Setup
Set Environment Variables
Start by saving common paths in environment variables for easy access, including:
- The parent directory above Model Zoo
- The location of training scripts (in this case, Llama 3)
- The location of data preprocessing scripts
- The location of scripts for converting checkpoints to and from Hugging Face
export MODELZOO_PARENT=$(pwd)
export MODELZOO_DATA=$MODELZOO_PARENT/modelzoo/src/cerebras/modelzoo/data_preparation/data_preprocessing
export MODELZOO_MODEL=$MODELZOO_PARENT/modelzoo/src/cerebras/modelzoo/models/nlp/llama
export MODELZOO_TOOLS=$MODELZOO_PARENT/modelzoo/src/cerebras/modelzoo/tools
export MODELZOO_COMMON=$MODELZOO_PARENT/modelzoo/src/cerebras/modelzoo/common
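Before going further, it can help to confirm that these variables point at real directories. The sketch below rebuilds the same five paths from a parent directory and reports any that do not exist. The helper names modelzoo_paths and missing_paths are hypothetical and are not part of Model Zoo.

```python
import os
from pathlib import Path


def modelzoo_paths(parent):
    """Rebuild the five tutorial paths from the Model Zoo parent directory."""
    parent = Path(parent)
    src = parent / "modelzoo" / "src" / "cerebras" / "modelzoo"
    return {
        "MODELZOO_PARENT": parent,
        "MODELZOO_DATA": src / "data_preparation" / "data_preprocessing",
        "MODELZOO_MODEL": src / "models" / "nlp" / "llama",
        "MODELZOO_TOOLS": src / "tools",
        "MODELZOO_COMMON": src / "common",
    }


def missing_paths(paths):
    """Return the names of variables whose directory does not exist."""
    return [name for name, path in paths.items() if not path.is_dir()]


if __name__ == "__main__":
    # Assumes this is run from the directory that contains modelzoo/.
    paths = modelzoo_paths(os.getcwd())
    for name in missing_paths(paths):
        print(f"{name} does not exist: {paths[name]}")
```

If the script prints nothing, all five directories were found.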
Create Model Directory
Create a dedicated folder for assets (data/model configs) and generated files (processed data files, checkpoints, logs, etc.):
mkdir finetuning_tutorial
Copy Training and Eval Configs
Copy sample configs into your folder. You will use these to control Model Zoo scripts for efficient training and evaluation of large models.
cp modelzoo/src/cerebras/modelzoo/tutorials/finetuning/* finetuning_tutorial
Create Data Config
- Copy the code block below.
- Create a YAML file with it. Name this file train_data_config.yaml.
- Place the file in the finetuning_tutorial directory you created earlier.
############################################
## Fine-Tuning Tutorial Train Data Config ##
############################################
setup:
    data:
        type: "huggingface"
        source: "autoevaluate/xsum-sample"
        split: "train"
    mode: "finetuning"
    output_dir: "finetuning_tutorial/train_data"
    processes: 1
processing:
    custom_tokenizer: cerebras.modelzoo.data_preparation.data_preprocessing.custom_tokenizer_example.CustomLlama3Tokenizer:CustomLlama3Tokenizer
    tokenizer_params:
        pretrained_model_name_or_path: "baseten/Meta-Llama-3-tokenizer"
    write_in_batch: True
    read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:prompt_completion_text_read_hook"
    read_hook_kwargs:
        prompt_key: "document"
        completion_key: "summary"
    use_ftfy: True
When using the Hugging Face CLI to download a dataset, you may encounter the following error:
KeyError: 'tags'
This issue occurs due to an outdated version of the huggingface_hub package. To resolve it, update the package to version 0.26.1 by running:
pip install --upgrade huggingface_hub==0.26.1
{
    "document": "In Wales, councils are responsible for funding..",
    "summary": "As Chancellor George Osborne announced...",
    "id": "35821725"
}
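With prompt_key set to "document" and completion_key set to "summary", the data config treats the document field of a record like the one above as the prompt and the summary field as the completion. The function below is only a conceptual sketch of that field selection. It is not the Model Zoo read hook, its return format is an assumption, and split_prompt_completion is a hypothetical name.

```python
def split_prompt_completion(record, prompt_key="document", completion_key="summary"):
    """Pick the prompt and completion fields out of one dataset record."""
    missing = [k for k in (prompt_key, completion_key) if k not in record]
    if missing:
        raise KeyError(f"record is missing keys: {missing}")
    # Fields other than the two keys (such as "id") are ignored in this sketch.
    return {"prompt": record[prompt_key], "completion": record[completion_key]}


sample = {
    "document": "In Wales, councils are responsible for funding..",
    "summary": "As Chancellor George Osborne announced...",
    "id": "35821725",
}
pair = split_prompt_completion(sample)
print(pair["prompt"])
print(pair["completion"])
```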
Inspect Model Config (optional)
Take a look at your model config:
cat finetuning_tutorial/model_config.yaml
#######################################
## Fine-Tuning Tutorial Model Config ##
#######################################
trainer:
  init:
    model_dir: finetuning_tutorial/model
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 1
    callbacks:
    - ComputeNorm: {}
    checkpoint:
      steps: 18
    logging:
      log_steps: 1
    loop:
      eval_steps: 5
      max_steps: 18
    model:
      attention_dropout_rate: 0.0
      attention_module: multiquery_attention
      attention_type: scaled_dot_product
      dropout_rate: 0.0
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false
      extra_attention_params:
        num_kv_groups: 8
      filter_size: 14336
      fp16_type: cbfloat16
      hidden_size: 4096
      initializer_range: 0.02
      layer_norm_epsilon: 1.0e-05
      loss_scaling: num_tokens
      loss_weight: 1.0
      max_position_embeddings: 8192
      mixed_precision: true
      nonlinearity: swiglu
      norm_type: rmsnorm
      num_heads: 32
      num_hidden_layers: 32
      pos_scaling_factor: 1.0
      position_embedding_type: rotary
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      use_bias_in_output: false
      use_ffn_bias: false
      use_ffn_bias_in_attention: false
      use_projection_bias_in_attention: false
      vocab_size: 128256
    optimizer:
      AdamW:
        betas:
        - 0.9
        - 0.95
        correct_bias: true
        weight_decay: 0.01
    precision:
      enabled: true
      fp16_type: cbfloat16
      log_loss_scale: true
      loss_scaling_factor: dynamic
      max_gradient_norm: 1.0
    schedulers:
    - CosineDecayLR:
        end_learning_rate: 1.0e-05
        initial_learning_rate: 5.0e-05
        total_iters: 18
    seed: 1
  fit:
    train_dataloader:
       batch_size: 8
       data_dir: train_data
       data_processor: GptHDF5MapDataProcessor
       num_workers: 8
       persistent_workers: true
       prefetch_factor: 10
       shuffle: true
       shuffle_seed: 1337
    val_dataloader: &id001
       batch_size: 1
       data_dir: valid_data
       data_processor: GptHDF5MapDataProcessor
       num_workers: 8
       shuffle: false
    ckpt_path: finetuning_tutorial/from_hf/pytorch_model_to_cs-2.3.mdl
  validate:
    val_dataloader: *id001
  validate_all:
    val_dataloaders: *id001
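The scheduler block asks for a cosine decay from 5.0e-05 to 1.0e-05 over 18 iterations, matching max_steps. As a rough illustration, the sketch below evaluates the standard cosine annealing formula with those values. The exact per-step values produced by the CosineDecayLR scheduler may differ, and cosine_decay_lr is a hypothetical helper.

```python
import math


def cosine_decay_lr(step, initial_lr=5.0e-05, end_lr=1.0e-05, total_iters=18):
    """Standard cosine annealing from initial_lr down to end_lr."""
    # Clamp so that steps outside [0, total_iters] stay at the endpoints.
    progress = min(max(step, 0), total_iters) / total_iters
    return end_lr + 0.5 * (initial_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 9, 18):
    print(step, f"{cosine_decay_lr(step):.2e}")
```

Under this formula the learning rate starts at 5.0e-05, passes through the midpoint 3.0e-05 at step 9, and reaches 1.0e-05 at step 18.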
Inspect Evaluation Config (optional)
Take a look at your evaluation config:
cat finetuning_tutorial/eeh_config.yaml
#############################################################
## Fine-Tuning Tutorial Eleuther Evaluation Harness Config ##
#############################################################
trainer:
  init:
    backend:
      backend_type: CSX
      cluster_config:
        num_csx: 1
    model:
      model_name: llama
      attention_dropout_rate: 0.0
      attention_module: multiquery_attention
      attention_type: scaled_dot_product
      dropout_rate: 0.0
      embedding_dropout_rate: 0.0
      embedding_layer_norm: false
      extra_attention_params:
        num_kv_groups: 8
      filter_size: 14336
      fp16_type: cbfloat16
      hidden_size: 4096
      initializer_range: 0.02
      layer_norm_epsilon: 1.0e-05
      loss_scaling: num_tokens
      loss_weight: 1.0
      max_position_embeddings: 8192
      mixed_precision: true
      nonlinearity: swiglu
      norm_type: rmsnorm
      num_heads: 32
      num_hidden_layers: 32
      pos_scaling_factor: 1.0
      position_embedding_type: rotary
      rope_theta: 500000.0
      rotary_dim: 128
      share_embedding_weights: false
      use_bias_in_output: false
      use_ffn_bias: false
      use_ffn_bias_in_attention: false
      use_projection_bias_in_attention: false
      vocab_size: 128256
    callbacks:
    - EleutherEvalHarness:
        eeh_args:
          tasks: winogrande
          num_fewshot: 0
        keep_data_dir: false
        batch_size: 4
        shuffle: false
        max_sequence_length: 8192
        num_workers: 1
        data_dir: finetuning_tutorial/eeh
        eos_id: 128001
        pretrained_model_name_or_path: baseten/Meta-Llama-3-tokenizer
        flags:
          csx.performance.micro_batch_size: null
This config evaluates your model on winogrande on a single CSX system.
If you are interested, you can learn more about validating models using the Eleuther or BigCode Evaluation Harness in our documentation.
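The training and evaluation configs share the same model block. One quick way to read it is to derive the per-head sizes it implies. The sketch below does that arithmetic with values copied from the configs. Reading num_kv_groups as the number of key/value groups shared across query heads is an assumption based on how grouped-query attention is usually described, and attention_shapes is a hypothetical helper.

```python
def attention_shapes(hidden_size=4096, num_heads=32, num_kv_groups=8):
    """Derive per-head sizes implied by the model block."""
    if hidden_size % num_heads or num_heads % num_kv_groups:
        raise ValueError("sizes must divide evenly")
    return {
        "head_dim": hidden_size // num_heads,
        "query_heads_per_kv_group": num_heads // num_kv_groups,
    }


shapes = attention_shapes()
print(shapes)
# head_dim works out to 128, the same value as rotary_dim in the configs.
```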
Step 2: Preprocess Data
Preprocess Training and Validation Data
Use your data configs to preprocess your “train” and “validation” datasets:
python $MODELZOO_DATA/preprocess_data.py \
  --config finetuning_tutorial/train_data_config.yaml
python $MODELZOO_DATA/preprocess_data.py \
  --config finetuning_tutorial/valid_data_config.yaml
The preprocessed data is written to finetuning_tutorial/train_data/ and finetuning_tutorial/valid_data/ (see the output_dir parameter in your data configs).
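To confirm that preprocessing wrote something, you can summarize what landed in each output directory. The sketch below only counts files by extension. It makes no assumption about the exact file names the preprocessing script produces, and summarize_dir is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path


def summarize_dir(directory):
    """Count files under a directory, grouped by file extension."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    return Counter(p.suffix or "(none)" for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    for name in ("finetuning_tutorial/train_data", "finetuning_tutorial/valid_data"):
        try:
            print(name, dict(summarize_dir(name)))
        except FileNotFoundError as err:
            print(err)
```

An empty summary or a missing directory suggests that the corresponding preprocessing command did not finish.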
Inspect Preprocessed Data (optional)
Once you’ve preprocessed your data, you can visualize the outcome:
python $MODELZOO_DATA/tokenflow/launch_tokenflow.py \
  --output_dir finetuning_tutorial/train_data
The terminal prints a URL such as http://172.31.48.239:5000. Copy and paste it into your browser to launch TokenFlow, a tool for interactively checking whether loss and attention masks were applied correctly.
Step 3: Port Model from Hugging Face
Download Checkpoint and Configs
Create a dedicated folder for the checkpoint and configuration files you’ll be downloading from Hugging Face.
mkdir finetuning_tutorial/from_hf
wget -P finetuning_tutorial/from_hf https://huggingface.co/McGill-NLP/Llama-3-8B-Web/resolve/main/pytorch_model.bin
wget -P finetuning_tutorial/from_hf https://huggingface.co/McGill-NLP/Llama-3-8B-Web/resolve/main/config.json
The finetuning_tutorial/from_hf directory now contains:
- config.json: The model’s configuration file.
- pytorch_model.bin: The model’s weights.
Convert Checkpoint and Configs
You can now convert the files to a format compatible with Model Zoo:
python $MODELZOO_TOOLS/convert_checkpoint.py \
  convert \
  --model llama \
  --src-fmt hf \
  --tgt-fmt cs-current \
  --config finetuning_tutorial/from_hf/config.json \
  finetuning_tutorial/from_hf/pytorch_model.bin
Your finetuning_tutorial/from_hf folder should now contain:
- pytorch_model_to_cs-2.3.mdl: The converted model checkpoint.
- config_to_cs-2.3.yaml: The converted configuration file.
You do not need to do this in this quickstart because it has already been done for you, but as a final step you would usually point ckpt_path in your finetuning_tutorial/model_config.yaml to the location of this converted checkpoint.
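To sanity-check the conversion, you can compare the downloaded config.json with the model block in model_config.yaml. The sketch below reads standard Hugging Face Llama config keys and reports them under the Model Zoo names they appear to correspond to. That name mapping is inferred from the two files in this tutorial rather than taken from the converter, so treat it as an assumption.

```python
import json
from pathlib import Path

# Hugging Face key -> Model Zoo key (assumed correspondence, inferred by name).
HF_TO_CS = {
    "hidden_size": "hidden_size",
    "intermediate_size": "filter_size",
    "num_attention_heads": "num_heads",
    "num_hidden_layers": "num_hidden_layers",
    "num_key_value_heads": "num_kv_groups",
    "vocab_size": "vocab_size",
    "rope_theta": "rope_theta",
    "max_position_embeddings": "max_position_embeddings",
}


def summarize_hf_config(hf_config):
    """Return {Model Zoo key: value} for the keys present in an HF config dict."""
    return {cs: hf_config[hf] for hf, cs in HF_TO_CS.items() if hf in hf_config}


if __name__ == "__main__":
    path = Path("finetuning_tutorial/from_hf/config.json")
    if path.exists():
        for key, value in summarize_hf_config(json.loads(path.read_text())).items():
            print(f"{key}: {value}")
```

Any value that differs from the model block in model_config.yaml is worth investigating before you start training.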
Step 4: Train and Evaluate Model
Modify Configs
Set train_dataloader.data_dir and val_dataloader.data_dir in your model config to the absolute paths of your preprocessed data:
sed -i "s|data_dir: train_data|data_dir: ${MODELZOO_PARENT}/finetuning_tutorial/train_data|" \
  finetuning_tutorial/model_config.yaml
sed -i "s|data_dir: valid_data|data_dir: ${MODELZOO_PARENT}/finetuning_tutorial/valid_data|" \
  finetuning_tutorial/model_config.yaml
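The two sed commands rewrite the relative data_dir entries in place. If sed -i behaves differently on your platform, the same substitution can be sketched in Python. The helper absolutize_data_dirs is hypothetical, and it assumes the config still contains the original "data_dir: train_data" and "data_dir: valid_data" lines.

```python
import os
from pathlib import Path


def absolutize_data_dirs(config_text, parent):
    """Point the train and validation data_dir entries at absolute paths."""
    for name in ("train_data", "valid_data"):
        # Match up to the newline so an already-rewritten line is left alone.
        old = f"data_dir: {name}\n"
        new = f"data_dir: {parent}/finetuning_tutorial/{name}\n"
        config_text = config_text.replace(old, new)
    return config_text


if __name__ == "__main__":
    config = Path("finetuning_tutorial/model_config.yaml")
    parent = os.environ.get("MODELZOO_PARENT", os.getcwd())
    if config.exists():
        config.write_text(absolutize_data_dirs(config.read_text(), parent))
```

Running the script a second time leaves the file unchanged, because the rewritten lines no longer match the original pattern.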
Submit Training Job
Train your model by passing your updated model config, the directories to mount, and the Python package paths to the run script.
python $MODELZOO_MODEL/run.py CSX \
  --mode train_and_eval \
  --params finetuning_tutorial/model_config.yaml \
  --mount_dirs $MODELZOO_PARENT $MODELZOO_PARENT/modelzoo \
  --python_paths $MODELZOO_PARENT/modelzoo/src
Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
INFO:   Finished sending initial weights
INFO:   | Train Device=CSX, Step=50, Loss=8.31250, Rate=69.37 samples/sec, GlobalRate=69.37 samples/sec
INFO:   | Train Device=CSX, Step=100, Loss=7.25000, Rate=68.41 samples/sec, GlobalRate=68.56 samples/sec
...
Training outputs are saved in the finetuning_tutorial/model folder (see the model_dir parameter in your model config). These include:
- Checkpoints
- TensorBoard event files
- Run logs
- A copy of the model config
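The training lines in the log follow a regular pattern, so you can pull the loss values out of a saved run log without TensorBoard. The sketch below parses lines in the format shown in the sample output above. The regular expression is written against those sample lines only, so other log formats may need adjustments.

```python
import re

TRAIN_LINE = re.compile(
    r"Train Device=(?P<device>\w+), Step=(?P<step>\d+), Loss=(?P<loss>[\d.]+)"
)


def parse_train_log(lines):
    """Return (step, loss) pairs from training log lines."""
    points = []
    for line in lines:
        match = TRAIN_LINE.search(line)
        if match:
            points.append((int(match["step"]), float(match["loss"])))
    return points


log = [
    "INFO:   Finished sending initial weights",
    "INFO:   | Train Device=CSX, Step=50, Loss=8.31250, Rate=69.37 samples/sec, GlobalRate=69.37 samples/sec",
    "INFO:   | Train Device=CSX, Step=100, Loss=7.25000, Rate=68.41 samples/sec, GlobalRate=68.56 samples/sec",
]
print(parse_train_log(log))
```

Lines that do not report a training step, such as the weight transfer messages, are skipped.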
Inspect Training Logs (optional)
Monitor your training during the run or visualize TensorBoard event files afterwards:
tensorboard --bind_all --logdir="finetuning_tutorial/model"
Run Evaluation Tasks
After training, you can test your model on downstream tasks:
python $MODELZOO_COMMON/run_eleuther_eval_harness.py CSX \
  --params finetuning_tutorial/eeh_config.yaml \
  --checkpoint_path finetuning_tutorial/model/checkpoint_0.mdl \
  --mount_dirs $MODELZOO_PARENT/finetuning_tutorial/eeh \
  --python_paths $MODELZOO_PARENT/modelzoo/src
...
2024-06-24 11:06:49,596 INFO:   | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=632, Rate=34.38 samples/sec, GlobalRate=34.34 samples/sec
2024-06-24 11:06:49,625 INFO:   | EleutherAI Eval Device=CSX, GlobalStep=1, Batch=633, Rate=34.39 samples/sec, GlobalRate=34.35 samples/sec
2024-06-24 11:07:04,333 INFO:   <cerebras.modelzoo.trainer.extensions.eleuther.lm_eval_harness.EleutherLM object at 0x7ff89ae3ceb0> (None), gen_kwargs: ({'temperature': None, 'top_k': None, 'top_p': None, 'max_tokens': None}), limit: None, num_fewshot: None, batch_size: None
2024-06-24 11:07:04,696 INFO:
|  Tasks   |Version|Filter|n-shot|Metric|Value |   |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande|      1|none  |     0|acc   |0.7284|±  |0.0141|
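If you want to track evaluation results across runs, the results table can be parsed into a dictionary. The sketch below reads the pipe-delimited table format shown above. The helper parse_results_table is hypothetical and is not part of the evaluation harness.

```python
def parse_results_table(table_text):
    """Parse a pipe-delimited results table into a list of row dicts."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table_text.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator line
    return [dict(zip(header, row)) for row in body]


table = """
|  Tasks   |Version|Filter|n-shot|Metric|Value |   |Stderr|
|----------|------:|------|-----:|------|-----:|---|-----:|
|winogrande|      1|none  |     0|acc   |0.7284|±  |0.0141|
"""
result = parse_results_table(table)[0]
print(result["Tasks"], result["Metric"], float(result["Value"]), float(result["Stderr"]))
```

All values are returned as strings, so convert the numeric columns before comparing runs.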
Step 5: Port Model to Hugging Face
Convert Checkpoint and Configs
Once you train (and evaluate) your model, you can port it to Hugging Face to generate outputs:
python $MODELZOO_TOOLS/convert_checkpoint.py \
  convert \
  --model llama \
  --src-fmt cs-auto \
  --tgt-fmt hf \
  --config finetuning_tutorial/model_config.yaml \
  --output-dir finetuning_tutorial/to_hf \
  finetuning_tutorial/model/checkpoint_0.mdl
This will create both Hugging Face config files and a converted checkpoint under `finetuning_tutorial/to_hf`.
Validate Checkpoint and Configs (optional)
You can now generate outputs using Hugging Face:
pip install 'transformers[torch]'
python
Python 3.8.16 (default, Mar 18 2024, 18:27:40)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
>>> from transformers import pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("baseten/Meta-Llama-3-tokenizer")
>>> config = AutoConfig.from_pretrained("finetuning_tutorial/to_hf/model_config_to_hf.json")
>>> model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="finetuning_tutorial/to_hf/checkpoint_0_to_hf.bin", config = config)
>>> text = "Generative AI is "
>>> pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> generated_text = pipe(text, max_length=50, do_sample=False, no_repeat_ngram_size=2, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.eos_token_id)[0]
>>> print(generated_text['generated_text'])
>>> exit()
As a reminder, in this quickstart, you did not train your model for very long. A high-quality model requires a longer training run and a much larger dataset.
Conclusion
Congratulations! In this tutorial, you followed an end-to-end workflow to fine-tune a model on a Cerebras system and learned about essential tools and scripts.
As part of this, you learned how to:
- Set up your environment
- Preprocess a small dataset
- Port a trained model from Hugging Face
- Fine-tune and evaluate a model
- Test your model on downstream tasks
- Port your model to Hugging Face
What’s Next?