torch.utils.data.DataLoader). The key argument for this DataLoader is the Dataset, which specifies the source of data. PyTorch supports two primary types of datasets:
- Map-style datasets (Dataset) are a map from indices or keys to data samples. So, if dataset[idx] is accessed, it reads the idx-th sample, for example from a directory on disk.
- Iterable-style datasets (IterableDataset) represent an iterable over data samples. This is well suited to cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data. So, if iter(dataset) is called, it returns a stream of data from a database, a remote server, or even logs generated in real time.
In the Cerebras Model Zoo, dataloaders extend these base types to implement additional functionalities. For instance, BertCSVDynamicMaskDataProcessor (code) extends IterableDataset and BertClassifierDataProcessor (code) extends Dataset.
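
The difference between the two dataset types can be sketched as follows. This is a minimal illustration rather than Model Zoo code: the class names SquaresMapDataset and CountingIterableDataset are hypothetical, and the in-memory values stand in for samples that would normally come from disk or a remote stream.

```python
import torch
from torch.utils.data import DataLoader, Dataset, IterableDataset


class SquaresMapDataset(Dataset):
    """Map-style: supports len() and random access via dataset[idx]."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, idx):
        # A real dataset would read the idx-th sample from storage here.
        return torch.tensor([idx, idx * idx])


class CountingIterableDataset(IterableDataset):
    """Iterable-style: only supports sequential access via iter(dataset)."""

    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        # A real dataset would stream samples from a database or server here.
        for value in range(self.limit):
            yield torch.tensor([value, value * value])


map_loader = DataLoader(SquaresMapDataset(6), batch_size=2, shuffle=False)
iter_loader = DataLoader(CountingIterableDataset(6), batch_size=2)

# Both loaders collate individual samples into batches of shape (2, 2).
map_batches = [batch.tolist() for batch in map_loader]
iter_batches = [batch.tolist() for batch in iter_loader]
```

Note that shuffle=True is only available for map-style datasets, because shuffling needs random access by index. An iterable-style dataset has to do any shuffling inside its own __iter__ method.
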
Properties of PyTorch Dataloader
For comprehensive details on the properties of PyTorch dataloaders, refer to this page.
Cerebras Model Zoo Dataloaders
The Cerebras Model Zoo includes several example dataloaders that extend IterableDataset and add functionalities like input encoding and tokenization. Notable examples are:
- BertCSVDataProcessor - Reads CSV files containing the input text tokens and the MLM and NSP features
- GptHDF5MapDataProcessor - An HDF5 map-style dataset processor that reads from HDF5 format for GPT pre-training
- T5DynamicDataProcessor - Reads text files containing the input text tokens and adds extra ids for the language modelling task on the fly
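
The pattern these processors share, reading raw records and encoding them on the fly, can be sketched as below. This is an illustrative toy and not the Model Zoo implementation: the class ToyTokenizingDataset, the whitespace tokenizer, and the vocabulary are all hypothetical.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset

# Hypothetical toy vocabulary; a real processor would load a vocabulary file.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3, "cerebras": 4}


class ToyTokenizingDataset(IterableDataset):
    """Streams text lines and encodes each one into fixed-length tensors."""

    def __init__(self, lines, max_length):
        self.lines = lines
        self.max_length = max_length

    def __iter__(self):
        for line in self.lines:
            # Tokenize on whitespace and truncate to the maximum length.
            tokens = line.lower().split()[: self.max_length]
            ids = [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]
            # Pad to a fixed length so samples can be stacked into a batch.
            padding = [VOCAB["[PAD]"]] * (self.max_length - len(ids))
            mask = [1] * len(ids) + [0] * len(padding)
            yield {
                "input_ids": torch.tensor(ids + padding, dtype=torch.int32),
                "attention_mask": torch.tensor(mask, dtype=torch.int32),
            }


lines = ["Hello world", "hello Cerebras model zoo"]
loader = DataLoader(ToyTokenizingDataset(lines, max_length=4), batch_size=2)
batch = next(iter(loader))
```

Because every sample is padded to the same length, the default collate function can stack the per-sample dictionaries into a single dictionary of batched tensors.
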
Creating a Custom Dataloader with PyTorch
To create your own dataloader, keep these tips in mind:
- Ensure coherence between the dataloader output and the neural network model input: If you are using a model from the Cerebras Model Zoo, refer to the README file of the model to understand the required data format. For example, if using GPT-2, ensure your input function produces the features dictionary.
- Utilize Cerebras-supported file types: Create your dataset by extending one of the native dataset types. The Cerebras ecosystem supports files of types HDF5, CSV, and TXT. Other file types are not tested and may not be supported.
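
Both tips can be combined in a small sketch: a map-style dataset that reads a CSV file and returns a features dictionary. Everything specific here is an assumption made for illustration: the class name CsvTokenDataset, the token_ids column, and the dictionary keys input_ids, attention_mask, and labels. Check the README of the model you are using for the keys, shapes, and dtypes it actually expects.

```python
import csv
import os
import tempfile

import torch
from torch.utils.data import DataLoader, Dataset


class CsvTokenDataset(Dataset):
    """Map-style dataset over a CSV file with one space-separated
    sequence of token ids per row (hypothetical file layout)."""

    def __init__(self, csv_path, column="token_ids"):
        # Load all rows up front; fine for small files.
        with open(csv_path, newline="") as handle:
            self.rows = [row[column] for row in csv.DictReader(handle)]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        ids = [int(token) for token in self.rows[idx].split()]
        # Next-token prediction: inputs are ids[:-1], labels are ids[1:].
        # The key names below are assumptions; check the model README.
        return {
            "input_ids": torch.tensor(ids[:-1], dtype=torch.int32),
            "attention_mask": torch.ones(len(ids) - 1, dtype=torch.int32),
            "labels": torch.tensor(ids[1:], dtype=torch.int32),
        }


# Write a tiny CSV file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.csv")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["token_ids"])
        writer.writerow(["5 6 7 8 9"])
        writer.writerow(["1 2 3 4 5"])
    dataset = CsvTokenDataset(path)

# drop_last=True keeps every batch the same size.
loader = DataLoader(dataset, batch_size=2, shuffle=False, drop_last=True)
features = next(iter(loader))
```

Printing the keys, shapes, and dtypes of one batch like this and comparing them against the model README is a quick way to catch mismatches before training.
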