Token Generators

Token generators convert raw data into tokenized formats suitable for machine learning models, ensuring efficient and effective data processing. This guide covers the configuration of pre-built and custom token generators, along with examples and use cases.

Pre-Built Token Generators

Cerebras Model Zoo provides a suite of pre-built token generators that support the various stages and tasks in the development of LLMs. Which token generator is initialized depends on the mode parameter specified in the config file (refer to Modes).

Flags Supported by Pre-Built Token Generators

Pretraining Parameters

This section lists parameters that can be used for PretrainingTokenGenerator.

Flag	Default Value	Description
pack_sequences	True	Concatenate a document shorter than the maximum sequence length with other documents, instead of filling it with padding tokens.
inverted_mask	False	If False, 0 represents masked positions. If True, 1 represents masked positions.
seed	0	Random seed used for generating short sequences
short_seq_prob	0.0	Probability of creating sequences which are shorter than the maximum sequence length.
split_text_to_tokenize	False	Whether to split the text into smaller chunks before tokenization. This is helpful for very long documents with tokenizers such as the Llama tokenizer, whose running time grows quadratically with the text length.
chunk_len_to_split	2000	Length of the chunks that the text is split into before tokenization, for slower tokenizers. Only takes effect when split_text_to_tokenize is set to True; otherwise this argument is ignored.
remove_bos_in_chunks	False	Whether to remove the BOS token from the beginning of the chunks. Set this to True when using split_text_to_tokenize and chunk_len_to_split to avoid having multiple BOS tokens in the middle of the text. Not applicable to all tokenizers.
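
To make the effect of pack_sequences and inverted_mask concrete, here is a minimal sketch. It is not the Model Zoo implementation: the function name pack_or_pad, its arguments, and the choice to treat padding as the masked positions are assumptions made for illustration, and separator tokens between documents are left out for brevity.

```python
from typing import List, Tuple


def pack_or_pad(
    docs: List[List[int]],
    max_seq_length: int,
    pad_id: int,
    pack_sequences: bool = True,
    inverted_mask: bool = False,
) -> List[Tuple[List[int], List[int]]]:
    """Return (input_ids, mask) pairs, each of length max_seq_length.

    Hypothetical helper: padding positions are treated as "masked".
    """
    if pack_sequences:
        # Concatenate all documents, then cut the stream into fixed-size chunks.
        stream = [tok for doc in docs for tok in doc]
        chunks = [
            stream[i : i + max_seq_length]
            for i in range(0, len(stream), max_seq_length)
        ]
    else:
        # One document per sequence; anything too long is truncated.
        chunks = [doc[:max_seq_length] for doc in docs]

    # inverted_mask=False: 0 marks masked positions; True: 1 marks them.
    masked, unmasked = (1, 0) if inverted_mask else (0, 1)

    out = []
    for chunk in chunks:
        n_pad = max_seq_length - len(chunk)
        input_ids = chunk + [pad_id] * n_pad
        mask = [unmasked] * len(chunk) + [masked] * n_pad
        out.append((input_ids, mask))
    return out
```

With three short documents and a maximum sequence length of 4, packing yields two sequences, whereas padding each document separately yields three sequences that are mostly padding.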

When training_objective is set to mlm, PretrainingTokenGenerator uses all the config parameters listed above, in addition to the ones specified below.

Flag	Default Value	Description
mlm_fraction	0.15	Fraction of tokens to be masked in MLM tasks.
mlm_with_gather	False	MLM processing mode. When set to True, the length of the returned labels is equal to mlm_fraction * msl, where msl is the maximum sequence length; otherwise it is equal to msl.
ignore_index	-100	Required when mlm_with_gather is set to False. Presence of ignore_index value at a position in the labels indicates that this position will not be used for loss calculation.
excluded_tokens	['<cls>', '<pad>', '<eos>', '<unk>', '<null_1>', '<mask>']	Tokens to be excluded when masking. Provided only through YAML config.
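
The difference between the two mlm_with_gather modes is easiest to see in the shape of the labels. The sketch below is illustrative only and is not the Model Zoo code: build_mlm_labels, excluded_ids, and the returned positions list are hypothetical names, and the sketch assumes there are at least mlm_fraction * msl maskable positions.

```python
import random
from typing import List, Set, Tuple


def build_mlm_labels(
    input_ids: List[int],
    excluded_ids: Set[int],
    mlm_fraction: float = 0.15,
    mlm_with_gather: bool = False,
    ignore_index: int = -100,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (labels, masked_positions) for one sequence (hypothetical helper)."""
    msl = len(input_ids)
    num_to_mask = int(mlm_fraction * msl)

    # Tokens such as <cls> or <pad> are never selected for masking.
    candidates = [i for i, tok in enumerate(input_ids) if tok not in excluded_ids]
    positions = sorted(random.Random(seed).sample(candidates, num_to_mask))

    if mlm_with_gather:
        # Gathered labels: one entry per masked position.
        labels = [input_ids[i] for i in positions]
    else:
        # Full-length labels: ignore_index everywhere except masked positions.
        labels = [ignore_index] * msl
        for i in positions:
            labels[i] = input_ids[i]
    return labels, positions
```

For a 20-token sequence with mlm_fraction set to 0.25, the gathered labels have 5 entries, while the full-length labels have 20 entries of which 15 hold ignore_index.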

VSL Finetuning Token Generator Parameters

This section lists the parameters that can be used for VSLFinetuningTokenGenerator.

VSLFinetuningTokenGenerator also uses the config parameters that are used by FinetuningTokenGenerator, in addition to the ones specified below.

Flag	Default Value	Description
use_vsl	True	Generate examples with multiple sequences packed together
position_ids_dtype	int32	dtype of token position ids.

Note: Increasing the read chunk size increases the packing factor of VSL, but also increases processing time. Choose the chunk size that gives the best tradeoff between higher packing and processing time for your dataset’s packing factor.
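
As a rough illustration of what "multiple sequences packed together" means, the sketch below packs short sequences into fixed-length examples with a greedy first-fit strategy and restarts the position ids at 0 for each packed sequence. The function pack_vsl, the first-fit strategy, and the restart of position ids are all assumptions made for this example, not a description of the Model Zoo implementation.

```python
from typing import Dict, List


def pack_vsl(
    seqs: List[List[int]], max_seq_length: int, pad_id: int
) -> List[Dict[str, List[int]]]:
    """Greedy first-fit packing (hypothetical helper).

    Sequences longer than max_seq_length are skipped in this sketch.
    """
    bins: List[List[List[int]]] = []
    for seq in seqs:
        if len(seq) > max_seq_length:
            continue
        for b in bins:
            # Place the sequence in the first example that still has room.
            if sum(len(s) for s in b) + len(seq) <= max_seq_length:
                b.append(seq)
                break
        else:
            bins.append([seq])

    examples = []
    for b in bins:
        input_ids = [tok for s in b for tok in s]
        # Position ids restart at 0 for every packed sequence (assumption).
        position_ids = [i for s in b for i in range(len(s))]
        n_pad = max_seq_length - len(input_ids)
        examples.append(
            {
                "input_ids": input_ids + [pad_id] * n_pad,
                "position_ids": position_ids + [0] * n_pad,
            }
        )
    return examples
```

The packer can only combine sequences it sees together, which is one way to understand why reading larger chunks tends to raise the packing factor at the cost of more processing per chunk.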

VSL Pretraining Token Generator Parameters

This section lists the parameters that can be used for VSLPretrainingTokenGenerator. use_vsl needs to be set to True in the train_input or eval_input section of the model config.

VSLPretrainingTokenGenerator also uses the config parameters that are used by PretrainingTokenGenerator, in addition to the ones specified below.

Flag	Default Value	Description
use_vsl	True	Generate examples with multiple sequences packed together
fold_long_doc	True	Fold documents larger than max_seq_length into multiple sequences, instead of dropping them.
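
The behaviour described for fold_long_doc can be sketched in a few lines. The helper name split_document is hypothetical and the sketch ignores any special tokens the real generator may add at fold boundaries.

```python
from typing import List


def split_document(
    tokens: List[int], max_seq_length: int, fold_long_doc: bool = True
) -> List[List[int]]:
    """Return the sequences produced from one tokenized document (sketch)."""
    if len(tokens) <= max_seq_length:
        return [tokens]
    if not fold_long_doc:
        # Documents that are too long are dropped entirely.
        return []
    # Fold the long document into consecutive max_seq_length-sized pieces.
    return [
        tokens[i : i + max_seq_length]
        for i in range(0, len(tokens), max_seq_length)
    ]
```

A 10-token document with a maximum sequence length of 4 is folded into pieces of 4, 4, and 2 tokens, or dropped when fold_long_doc is False.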

DPO Token Generator Parameters

This section lists the parameters that can be used for DPOTokenGenerator.

Flag	Default Value	Description
max_prompt_length	512	If the sequence exceeds `max_seq_length`, this parameter caps the prompt length at the specified limit.
response_delimiter	`<response>`	Sets the separator between `prompt` and `response`. Users do not need to set this value for general use cases.
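
One plausible reading of max_prompt_length is sketched below. The source does not say which end of the prompt is kept or how the response is trimmed, so both choices here (keep the most recent prompt tokens, trim the response from the end) are assumptions, as is the helper name truncate_pair.

```python
from typing import List, Tuple


def truncate_pair(
    prompt_ids: List[int],
    response_ids: List[int],
    max_seq_length: int,
    max_prompt_length: int = 512,
) -> Tuple[List[int], List[int]]:
    """Fit a prompt/response pair into max_seq_length tokens (sketch)."""
    if not 0 < max_prompt_length <= max_seq_length:
        raise ValueError("expected 0 < max_prompt_length <= max_seq_length")

    if len(prompt_ids) + len(response_ids) > max_seq_length:
        # Assumption: keep the last max_prompt_length prompt tokens.
        prompt_ids = prompt_ids[-max_prompt_length:]

    # Assumption: any remaining overflow is trimmed from the response end.
    room = max_seq_length - len(prompt_ids)
    response_ids = response_ids[:room]
    return prompt_ids, response_ids
```

Pairs that already fit within max_seq_length pass through unchanged, so the cap only matters for long examples.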

Multimodal Pretraining Token Generator Parameters

This section lists the parameters that can be used for MultiModalPretrainingTokenGenerator.

Flag	Default Value	Description
max_num_img	1	Maximum number of images allowed in one preprocessed sequence. Sequences with more than max_num_img images will be discarded
num_patches	None	Number of patches to represent an image. This is determined by the patch-size (in pixels) of the image-encoder, and the pixel count of the input images.
is_multimodal	False	Whether the dataset is multimodal (text plus images) or text only. Set it to True for multimodal tasks.
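
For a ViT-style image encoder that splits each image into non-overlapping square patches, num_patches can be derived from the image size and the patch size. The sketch below assumes exactly that setup, with no extra class token counted; the function name and the example sizes are illustrative and not taken from Model Zoo.

```python
def patches_per_image(height: int, width: int, patch_size: int) -> int:
    """Number of non-overlapping patch_size x patch_size patches (sketch)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


# Example: a 336x336 image with 14-pixel patches gives a 24x24 patch grid.
num_patches = patches_per_image(336, 336, 14)
```

If your encoder prepends a class token or resizes images before patching, the value to pass as num_patches may differ from this count.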

Multimodal Token Generator Parameters

This section lists the parameters that can be used for multimodal token generators.

Flag	Default Value	Description
max_num_img	1	Maximum number of images allowed in one preprocessed sequence. Sequences with more than max_num_img images will be discarded
num_patches	1	Number of patches to represent an image. This is determined by the patch-size (in pixels) of the image-encoder, and the pixel count of the input images.
is_multimodal	False	Whether the dataset is multimodal (text plus images) or text only. Set it to True for multimodal tasks.

Supported Token Generators - Pretraining Mode

PretrainingTokenGenerator: General-purpose pretraining on large text corpora. When training_objective is set to mlm, it performs MLM task processing. For multimodal pretraining, set is_multimodal to True.
FIMTokenGenerator: Designed for fill-in-the-middle tasks. Initialized when training_objective is set to fim in the config file.
VSLPretrainingTokenGenerator: For variable sequence length (VSL) pretraining, where multiple sequences are packed together into one example. Initialized when use_vsl is set to True in the config file.

Supported Token Generators - Finetuning Mode

FinetuningTokenGenerator: General-purpose fine-tuning. For multimodal fine-tuning, set is_multimodal to True.
VSLFinetuningTokenGenerator: Fine-tuning with variable sequence length (VSL), where multiple sequences are packed together into one example. Initialized when use_vsl is set to True in the config file.

Other Supported Token Generators

DPOTokenGenerator: Prepares data for direct preference optimization (DPO). Initialized when mode is set to dpo.
NLGTokenGenerator: Optimized for natural language generation tasks. Initialized when mode is set to nlg.
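
The selection rules above can be summarized as a small dispatch function. This is a sketch of the documented behaviour, not the Model Zoo source: the flat config layout, the mode strings "pretraining" and "finetuning", and the precedence of use_vsl over training_objective are assumptions.

```python
from typing import Any, Dict


def select_token_generator(config: Dict[str, Any]) -> str:
    """Return the name of the token generator a config would select (sketch)."""
    mode = config.get("mode")
    use_vsl = config.get("use_vsl", False)
    objective = config.get("training_objective")

    if mode == "pretraining":
        if use_vsl:
            return "VSLPretrainingTokenGenerator"
        if objective == "fim":
            return "FIMTokenGenerator"
        # Covers plain pretraining as well as training_objective == "mlm".
        return "PretrainingTokenGenerator"
    if mode == "finetuning":
        return "VSLFinetuningTokenGenerator" if use_vsl else "FinetuningTokenGenerator"
    if mode == "dpo":
        return "DPOTokenGenerator"
    if mode == "nlg":
        return "NLGTokenGenerator"
    if mode == "custom":
        # Custom mode reads the user-supplied class path instead.
        return config["token_generator"]
    raise ValueError(f"unsupported mode: {mode!r}")
```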

Custom Token Generators

In addition to pre-built token generators, the Model Zoo allows users to implement custom token generators. This enables arbitrary transformations of the input data before tokenization. To use a custom token generator, ensure the configuration file is properly set up. Follow these steps:

1. Set the mode param to custom so that you can specify your own token generator.
2. Specify the path to the custom token generator class in the token_generator param, within the setup section of the config file. This would look like:

mode: "custom"
token_generator: "<path/to/custom-generator-class>"

In the token_generator path, separate the class name from the module name with a colon (:) so that the custom token generator can be instantiated correctly.
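
A "module:ClassName" string of this kind is typically resolved with importlib. The sketch below shows the general technique and is not the Model Zoo loader; it assumes the part before the colon is an importable dotted module name, and the helper name load_class is hypothetical.

```python
import importlib
from typing import Any


def load_class(spec: str) -> Any:
    """Resolve a "module:ClassName" string to the class object (sketch)."""
    module_name, sep, class_name = spec.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected 'module:ClassName', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class rather than a real generator.
ordered_dict_cls = load_class("collections:OrderedDict")
```

A missing colon is reported up front, which is usually easier to debug than an import error raised later.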

Class Implementation Guidelines

The custom token generator must adhere to the following guidelines:

1. The constructor’s signature must be as follows:

def __init__(
    self, params: Dict[str, Any], tokenizer: Any, eos_id: int, pad_id: int
):
    """
    Args:
        params (Dict[str, Any]): Parameters for the dataset and processing.
        tokenizer (Any): Tokenizer to use for tokenization.
        eos_id (int): End-of-sequence token ID.
        pad_id (int): Padding token ID.
    """

2. The custom token generator must implement an encode method, which tokenizes and encodes the data according to the user’s definition. For more examples of what the encode method looks like, refer to the code of the pre-built token generators in Model Zoo.

3. The encode method takes a semantic_data_array and has the following signature:

def encode(
    self, semantic_data_array: List[Dict[str, Any]]
) -> Tuple[Dict[str, Any], Dict[str, int]]:
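
Putting the two signatures together, a custom token generator might look like the sketch below. Only the constructor and encode signatures come from the guidelines above. Everything else is an assumption made for illustration: the "text" key in each semantic_data_array entry, the max_seq_length key in params, the output and stats keys, and the tokenizer exposing an encode(str) method. ToyTokenizer is a stand-in so the example runs without external dependencies.

```python
from typing import Any, Dict, List, Tuple


class UppercaseTokenGenerator:
    """Hypothetical generator that upper-cases text before tokenizing."""

    def __init__(
        self, params: Dict[str, Any], tokenizer: Any, eos_id: int, pad_id: int
    ):
        self.tokenizer = tokenizer
        self.eos_id = eos_id
        self.pad_id = pad_id
        # Assumed flat params layout; the real structure may be nested.
        self.max_seq_length = params.get("max_seq_length", 8)

    def encode(
        self, semantic_data_array: List[Dict[str, Any]]
    ) -> Tuple[Dict[str, Any], Dict[str, int]]:
        # Custom transformation applied before tokenization.
        text = " ".join(item["text"].upper() for item in semantic_data_array)
        # Leave room for EOS, then pad up to max_seq_length.
        token_ids = self.tokenizer.encode(text)[: self.max_seq_length - 1]
        token_ids = token_ids + [self.eos_id]
        n_pad = self.max_seq_length - len(token_ids)
        data = {
            "input_ids": token_ids + [self.pad_id] * n_pad,
            "attention_mask": [1] * len(token_ids) + [0] * n_pad,
        }
        stats = {"num_tokens": len(token_ids), "num_pad_tokens": n_pad}
        return data, stats


class ToyTokenizer:
    """Stand-in tokenizer: one token per character (illustration only)."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


generator = UppercaseTokenGenerator(
    {"max_seq_length": 6}, ToyTokenizer(), eos_id=1, pad_id=0
)
data, stats = generator.encode([{"text": "ab"}])
```

Check the pre-built token generators for the exact keys your data pipeline expects in the returned dictionaries before adapting this pattern.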

Conclusion

Configuring token generators is an important step in the preprocessing pipeline for machine learning tasks on Cerebras Systems. The pre-built token generators provided by Cerebras Model Zoo cover the various stages and tasks in the development of large language models, and custom token generators allow tailored transformations of input data to meet specific project requirements. On-the-fly data processing further improves the preprocessing workflow by reducing storage needs and increasing adaptability during training and evaluation. The examples provided for pretraining and fine-tuning configurations illustrate how to set up these processes, and the TokenFlow utility helps you visualize and debug preprocessed data so that errors are detected early. By following the guidelines and using the tools outlined in this guide, you can optimize your preprocessing pipeline, leading to more efficient training and improved performance of your machine learning models on Cerebras Systems.
