The preprocessing block under the train_input and eval_input sections of the YAML configuration file enables on-the-fly data preprocessing during training and/or evaluation. This reduces turnaround time and storage requirements when running experiments on smaller datasets.

On-the-fly preprocessing takes the same parameters as offline data preprocessing and applies the same algorithms and techniques.

For multibox runs, sharding is based on the number of input files in the input directory, which should be greater than or equal to the product of the number of systems and workers per system.
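For example, a run on two systems with four workers per system would need at least 2 × 4 = 8 input files so that each worker can be assigned at least one file.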

We currently support on-the-fly (OTF) preprocessing for pretraining and fine-tuning on text-only and multimodal datasets; example configurations are shown below.

Example Text-Only Configurations

The following example shows a text-only pretraining configuration:

train_input:
    preprocessing:
        data_processor: "RawDatasetProcessor"
        processing:
            custom_tokenizer: gpt2tokenizer
            tokenizer_params:
                encoder_file: /path/to/gpt2-encoder.json
                vocab_file: /path/to/gpt2-vocab.bpe
            sep_token: null
            batch_size: 4
            max_seq_length: 256
            seed: 0
            # Read hook that maps each raw record to the fields the preprocessor expects
            read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:text_read_hook"
            read_hook_kwargs:
                text_key: "text"       # field in each raw record that holds the text
        dataset:
            pack_sequences: True       # pack multiple documents into each max_seq_length sequence
            use_vsl: False             # variable sequence length (VSL) disabled
        setup:
            data:
                source: /path/to/text_data
                type: local
            mode: pretraining
            processes: 1               # number of preprocessing processes
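
The same preprocessing block can also be placed under eval_input to preprocess evaluation data on the fly. The sketch below is not part of the reference configuration; it simply mirrors the train_input example above, with a hypothetical path for the evaluation data:

eval_input:
    preprocessing:
        data_processor: "RawDatasetProcessor"
        processing:
            custom_tokenizer: gpt2tokenizer
            tokenizer_params:
                encoder_file: /path/to/gpt2-encoder.json
                vocab_file: /path/to/gpt2-vocab.bpe
            sep_token: null
            batch_size: 4
            max_seq_length: 256
            seed: 0
            read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:text_read_hook"
            read_hook_kwargs:
                text_key: "text"
        dataset:
            pack_sequences: True
            use_vsl: False
        setup:
            data:
                source: /path/to/eval_text_data   # hypothetical evaluation data path
                type: local
            mode: pretraining
            processes: 1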

Example Multimodal Configurations

The following example shows a multimodal pretraining configuration:

train_input:
    preprocessing:
        data_processor: "MultimodalRawDatasetProcessor"
        processing:
            custom_tokenizer: gpt2tokenizer
            tokenizer_params:
                encoder_file: /path/to/gpt2-encoder.json
                vocab_file: /path/to/gpt2-vocab.bpe
            batch_size: 4
            max_seq_length: 256
            seed: 0
            # Read hook that maps each raw record to an image/caption pair
            read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:pretraining_image_captions_hook"
            read_hook_kwargs:
                image_key: "image"       # field in each raw record that references the image
                caption_key: "caption"   # field in each raw record that holds the caption
        dataset:
            image_dir: /path/to/image_files   # directory containing the referenced image files
            is_multimodal: True
        setup:
            data:
                source: /path/to/text_data
                type: local
            mode: pretraining
            processes: 1
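
For a fine-tuning run the structure stays the same; only the mode and the read hook change. The fragment below is a sketch, not a verified configuration: the hook name shown is an assumption for illustration, so substitute the fine-tuning read hook shipped with your Model Zoo version.

        processing:
            # Assumed hook path for illustration only; check the hooks module in your
            # Model Zoo release for the exact fine-tuning read hook.
            read_hook: "cerebras.modelzoo.data_preparation.data_preprocessing.hooks:finetuning_image_captions_hook"
            read_hook_kwargs:
                image_key: "image"
                caption_key: "caption"
        setup:
            mode: finetuning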

Global shuffling is not supported with on-the-fly preprocessing; support will be added in a future release.