The preprocessing block under the train_input and eval_input sections of the YAML configuration file enables on-the-fly data preprocessing during training and/or evaluation. This reduces turnaround time and storage requirements when running experiments on smaller datasets.

On-the-fly preprocessing takes the same parameters as offline data preprocessing and applies the same algorithms and techniques.

For multibox runs, sharding is based on the number of input files in the input directory, which should be greater than or equal to the product of the number of systems and workers per system.
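For example, a run on two systems with four workers per system would need at least 2 × 4 = 8 input files so that each worker can be assigned at least one file.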

We currently support on-the-fly (OTF) preprocessing for pretraining and fine-tuning on text-only and multimodal datasets; example configurations are shown below.

Example Text-Only Configurations

The following example shows a text-only pretraining configuration:

train_input:
    preprocessing:
        data_processor: "RawDatasetProcessor"
        processing:
            custom_tokenizer: gpt2tokenizer
            tokenizer_params:
                encoder_file: /path/to/gpt2-encoder.json
                vocab_file: /path/to/gpt2-vocab.bpe
            sep_token: null
            batch_size: 4
            max_seq_length: 256
            seed: 0
            # Read hook that maps each raw record to the fields the preprocessor expects
            read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:text_read_hook"
            read_hook_kwargs:
                text_key: "text"       # field in each raw record that holds the text
        dataset:
            pack_sequences: True       # pack multiple documents into each max_seq_length sequence
            use_vsl: False             # variable sequence length (VSL) disabled
        setup:
            data:
                source: /path/to/text_data
                type: local
            mode: pretraining
            processes: 1               # number of preprocessing processes
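
The same preprocessing block can also be placed under eval_input to preprocess evaluation data on the fly. The sketch below is not part of the reference configuration; it simply mirrors the train_input example above, with a hypothetical path for the evaluation data:

eval_input:
    preprocessing:
        data_processor: "RawDatasetProcessor"
        processing:
            custom_tokenizer: gpt2tokenizer
            tokenizer_params:
                encoder_file: /path/to/gpt2-encoder.json
                vocab_file: /path/to/gpt2-vocab.bpe
            sep_token: null
            batch_size: 4
            max_seq_length: 256
            seed: 0
            read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:text_read_hook"
            read_hook_kwargs:
                text_key: "text"
        dataset:
            pack_sequences: True
            use_vsl: False
        setup:
            data:
                source: /path/to/eval_text_data   # hypothetical evaluation data path
                type: local
            mode: pretraining
            processes: 1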

Example Multimodal Configurations

The following example shows a multimodal pretraining configuration:

train_input:
    preprocessing:
        data_processor: "MultimodalRawDatasetProcessor"
        processing:
            custom_tokenizer: gpt2tokenizer
            tokenizer_params:
                encoder_file: /path/to/gpt2-encoder.json
                vocab_file: /path/to/gpt2-vocab.bpe
            batch_size: 4
            max_seq_length: 256
            seed: 0
            # Read hook that maps each raw record to an image/caption pair
            read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:pretraining_image_captions_hook"
            read_hook_kwargs:
                image_key: "image"       # field in each raw record that references the image
                caption_key: "caption"   # field in each raw record that holds the caption
        dataset:
            image_dir: /path/to/image_files   # directory containing the referenced image files
            is_multimodal: True
        setup:
            data:
                source: /path/to/text_data
                type: local
            mode: pretraining
            processes: 1
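
For a fine-tuning run the structure stays the same; only the mode and the read hook change. The fragment below is a sketch, not a verified configuration: the hook name shown is an assumption for illustration, so substitute the fine-tuning read hook shipped with your Model Zoo version.

        processing:
            # Assumed hook path for illustration only; check the hooks module in your
            # Model Zoo release for the exact fine-tuning read hook.
            read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:finetuning_image_captions_hook"
            read_hook_kwargs:
                image_key: "image"
                caption_key: "caption"
        setup:
            mode: finetuning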

Global shuffling is not supported with on-the-fly preprocessing; support will be added in a future release.