Local Data
- Set `type` to `local`.
- Use `source` to provide the path to the input directory.
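As a minimal sketch, the two settings above could be written as follows in a YAML config. Only the `type` and `source` keys come from this page. The path is a placeholder, and the enclosing sections of the config file are deliberately omitted because they are not described here.

```yaml
# Sketch only: the surrounding config sections are omitted.
type: local                  # read input from the local filesystem
source: /path/to/input_dir   # hypothetical path to the input directory
```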
Preprocess Subdirectories
You can optionally preprocess subdirectories within your input directory as separate datasets. This enables more flexible data management for large-scale pretraining tasks. There are two supported options:
- Use `top_level_as_subsets: True` to automatically treat each top-level folder in your input directory as a separate dataset. Each top-level directory is treated as a subset, and a separate output folder is created under `output_dir` with its respective preprocessed HDF5 files. Defaults to `False` if not specified.
- Use `subsets: [list]` to manually specify which subfolders to preprocess. Only the folders listed in `subsets` are preprocessed, and each subset has its own output folder under `output_dir`.
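The two options might look like this in a config, as a sketch under stated assumptions. The paths and folder names (`subset_a`, `subset_b`) are hypothetical, the enclosing config sections are omitted, and the second option is shown commented out as an alternative to the first.

```yaml
# Sketch only: the surrounding config sections are omitted.
type: local
source: /path/to/input_dir     # hypothetical input directory

# Option 1: treat every top-level folder under `source` as its own subset.
top_level_as_subsets: True

# Option 2: preprocess only the listed subfolders instead.
# subsets:
#   - subset_a                 # hypothetical folder names
#   - subset_b
```

In both cases, each subset's preprocessed HDF5 files are written to its own folder under `output_dir`.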
Hugging Face Data
- Set `type` to `huggingface`.
- Use `source` to specify the dataset name from the Hugging Face Hub.
- Use `split` to specify the dataset split.
The dataset is loaded through the Hugging Face `load_dataset` API. Parameters are passed to the API as keyword arguments, and they must conform to the specifications defined by Hugging Face. Refer to the `load_dataset` documentation for details.
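A Hugging Face data source could be configured along these lines. This is a sketch, not a definitive config: the dataset name is a placeholder, the enclosing config sections are omitted, and the available split names depend on the dataset you choose.

```yaml
# Sketch only: the surrounding config sections are omitted.
type: huggingface
source: your-org/your-dataset   # hypothetical dataset name on the Hugging Face Hub
split: train                    # must be a split that the dataset actually provides
```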
View example configs for various use cases here.