Configure Input Data
Learn how to configure your input data for preprocessing—whether you’re working with a single directory of data or organizing large datasets into subsets.
You can configure local or Hugging Face data as input for preprocessing. In this guide you’ll learn how to define your data source, specify optional parameters like subsets or splits, and structure your config file to support flexible, scalable preprocessing workflows.
Local Data
- Set `type` to `local`.
- Use `source` to provide the path to the input directory.

For example:
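A minimal config sketch for a local data source (the exact nesting of the `data` key and the example path are assumptions; only the `type` and `source` fields come from the description above):

```yaml
data:
  type: "local"
  source: "/path/to/input_dir"   # directory containing your raw input files
```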
Preprocess Subdirectories
You can optionally preprocess subdirectories within your input directory as separate datasets. This enables more flexible data management for large-scale pretraining tasks.
There are two supported options:
- Use `top_level_as_subsets: True` to automatically treat each top-level folder in your input directory as a separate dataset. Each top-level directory is treated as a subset, and a separate output folder will be created under `output_dir` with its respective preprocessed HDF5 files. Defaults to `False` if not specified.
- Use `subsets: [list]` to manually specify which subfolders to preprocess. Only the folders listed in `subsets` will be preprocessed, and each subset will have its own output folder under `output_dir`.
For example:
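Sketches of both options (the subfolder names and paths are hypothetical; the `top_level_as_subsets` and `subsets` fields come from the options above):

```yaml
# Option 1: treat every top-level folder under `source` as its own subset;
# each gets its own output folder under output_dir.
data:
  type: "local"
  source: "/path/to/input_dir"
  top_level_as_subsets: True
```

```yaml
# Option 2: preprocess only the listed subfolders (names are examples).
data:
  type: "local"
  source: "/path/to/input_dir"
  subsets:
    - "books"
    - "web_crawl"
```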
Hugging Face Data
- Set `type` to `huggingface`.
- Use `source` to specify the dataset name from the Hugging Face Hub.
- Use `split` to specify the dataset split.
The preprocessing pipeline passes these parameters to the Hugging Face `load_dataset` API. When calling the API, parameters are passed as keyword arguments, so they must conform to the specifications outlined by Hugging Face. Refer to the `load_dataset` documentation for details.
For example:
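A sketch of a Hugging Face data source (the dataset name is a placeholder, and the surrounding `data` key is an assumption; `type`, `source`, and `split` come from the fields described above):

```yaml
data:
  type: "huggingface"
  source: "dataset_name"   # placeholder: any dataset name from the Hugging Face Hub
  split: "train"           # passed through to load_dataset as a keyword argument
```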