Local Data
- Set `type` to `local`.
- Use `source` to provide the path to the input directory.
Preprocess Subdirectories
You can optionally preprocess subdirectories within your input directory as separate datasets. This enables more flexible data management for large-scale pretraining tasks. There are two supported options:
- Use `top_level_as_subsets: True` to automatically treat each top-level folder in your input directory as a separate dataset. Each subset gets its own output folder under `output_dir` containing its preprocessed HDF5 files. Defaults to `False` if not specified.
- Use `subsets: [list]` to manually specify which subfolders to preprocess. Only the folders listed in `subsets` are preprocessed, and each subset has its own output folder under `output_dir`.
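The two options differ only in how the list of subfolders is chosen. The sketch below illustrates that selection logic and the resulting input/output folder pairs. `resolve_subsets` is a hypothetical helper written for this illustration, not part of the preprocessing tool, and giving an explicit `subsets` list precedence over `top_level_as_subsets` is an assumption.

```python
import tempfile
from pathlib import Path


def resolve_subsets(input_dir, output_dir, top_level_as_subsets=False, subsets=None):
    """Return (input_path, output_path) pairs, one per dataset to preprocess.

    Hypothetical helper: it mirrors the behavior described above, not the
    tool's actual implementation.
    """
    src, out = Path(input_dir), Path(output_dir)
    if subsets:
        # Manual selection: only the listed subfolders are preprocessed.
        # Assumption: an explicit list wins if both options are set.
        names = list(subsets)
    elif top_level_as_subsets:
        # Automatic selection: every top-level folder becomes a subset.
        names = sorted(p.name for p in src.iterdir() if p.is_dir())
    else:
        # Default: the whole input directory is treated as one dataset.
        return [(src, out)]
    # Each subset gets its own output folder under output_dir.
    return [(src / name, out / name) for name in names]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "input"
        for name in ("code", "web", "books"):
            (root / name).mkdir(parents=True)
        pairs = resolve_subsets(root, Path(tmp) / "out", top_level_as_subsets=True)
        for src, dst in pairs:
            print(src.name, "->", dst)
```

Passing `subsets=["web"]` instead would yield a single pair for the `web` folder and leave the other folders untouched.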
Hugging Face Data
- Set `type` to `huggingface`.
- Use `source` to specify the dataset name from the Hugging Face hub.
- Use `split` to specify the dataset split.
Note on the `load_dataset` API: parameters are passed to it as keyword arguments and must conform to the specifications outlined by Hugging Face. Refer to the `load_dataset` documentation for details.
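As a rough illustration of how these settings might reach `load_dataset`, the sketch below splits a data config into the dataset name and the remaining keyword arguments. `build_load_dataset_call` is a hypothetical helper, the dataset name is a placeholder, and the idea that every key other than `type` and `source` is forwarded as a keyword argument is an assumption rather than documented behavior.

```python
def build_load_dataset_call(data_cfg):
    """Split a data config into a dataset name and keyword arguments.

    Hypothetical helper for illustration; the preprocessing tool may map
    its config to load_dataset differently.
    """
    if data_cfg.get("type") != "huggingface":
        raise ValueError("expected type: huggingface")
    # Assumption: everything except `type` and `source` is forwarded as-is.
    kwargs = {k: v for k, v in data_cfg.items() if k not in ("type", "source")}
    return data_cfg["source"], kwargs


if __name__ == "__main__":
    # "your-org/your-dataset" is a placeholder, not a real dataset name.
    cfg = {"type": "huggingface", "source": "your-org/your-dataset", "split": "train"}
    path, kwargs = build_load_dataset_call(cfg)
    print(path, kwargs)
    # With the `datasets` package installed, the call would then be:
    #   from datasets import load_dataset
    #   dataset = load_dataset(path, **kwargs)
```

Because the arguments are forwarded by keyword, a misspelled or unsupported key would surface as an error from `load_dataset` itself, which is why the keys need to match its documented parameters.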
View example configs for various use cases here.

