Overview
One of the advantages of the CS-X is the ability to train a Large Language Model (LLM) with a large context window (also known as sequence length). Larger context windows enable LLMs to handle larger inputs. Sometimes it is advantageous to train with a shorter context window (e.g., in instruction tuning) for efficiency. Follow the steps in this guide to train a language model with a large (> 2048) or a small (< 2048) context window.
Procedure
For a large context window
1. Data processing: when creating your dataset with the script create_hdf5_dataset.py, change the value of the --max_seq_length argument to the desired value.
For example, for a sequence length of 4096 tokens:
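A minimal invocation might look like the following sketch. The LMData subcommand, the --input_dir and --output_dir arguments, and the paths are illustrative assumptions; only --max_seq_length is the argument this step prescribes:

python create_hdf5_dataset.py LMData \
    --input_dir ./raw_text_data \
    --output_dir ./hdf5_dataset_msl4096 \
    --max_seq_length 4096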
2. Model configuration: change the max_sequence_length and the max_position_embeddings parameters to the desired value in the model's configuration YAML file.
For example:
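A sketch of the relevant YAML entries, assuming the common layout in which max_sequence_length sits under a train_input section and max_position_embeddings under a model section (the section names are assumptions; adjust them to your configuration's structure):

train_input:
    max_sequence_length: 4096    # sequence length used for training
model:
    max_position_embeddings: 4096    # positional embeddings must cover the new length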
For a small context window
For example, a model that was pretrained on a sequence length of 2048 tokens may be further instruction fine-tuned on a dataset with a sequence length of 256 tokens. In this case, assume the longest sequence in the dataset has a length of 256 tokens. Since the sequences of this dataset will be padded, not packed, training with a shorter sequence length is more efficient than padding every sample in the dataset all the way to 2048 tokens.

1. Data processing: when calling the script create_hdf5_dataset.py, change the value of the --max_seq_length argument to the desired value.
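For example, for a sequence length of 256 tokens, a minimal invocation might look like the following sketch (as above, the LMData subcommand and the paths are illustrative assumptions):

python create_hdf5_dataset.py LMData \
    --input_dir ./instruction_tuning_data \
    --output_dir ./hdf5_dataset_msl256 \
    --max_seq_length 256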
2. Model configuration: change the max_sequence_length parameter to the desired value in the model's configuration YAML file.
For example:
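A sketch of the corresponding YAML change, assuming the same layout as above and a model pretrained with a 2048-token context window; only max_sequence_length is updated, while max_position_embeddings keeps its pretrained value:

train_input:
    max_sequence_length: 256    # shorter sequences for instruction fine-tuning
model:
    max_position_embeddings: 2048    # unchanged from pretraining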
Implementation notes
Note that when training a pretrained model with a smaller context window, the max_position_embeddings parameter of the pretrained model remains the same; only the max_sequence_length parameter needs to be changed.
On the other hand, when training with a large context window, both the max_position_embeddings and the max_sequence_length parameters need to be changed.