Read Hooks
This guide details various read hooks you can use to convert different types of raw input data into HDF5 format for machine learning tasks on Cerebras Systems.
What is a Read Hook?
We use read hooks to convert from different input formats to HDF5 format. Pre-built hooks are provided for standard input formats and masking schemes.
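Conceptually, a read hook is a function that receives one raw input record plus keyword arguments and returns a semantic data array: an ordered list of typed regions. The sketch below is illustrative only; the function name, parameter names, and exact region schema are assumptions, not the Model Zoo's actual definitions:

```python
# Illustrative sketch of the read-hook idea; the real hooks ship with the
# package, and the exact semantic-region schema may vary across versions.
def example_read_hook(example, **read_hook_kwargs):
    # Pull raw fields out of one input record using configurable data keys.
    prompt = example[read_hook_kwargs["prompt_key"]]
    completion = example[read_hook_kwargs["completion_key"]]
    # Return a semantic data array: an ordered list of typed regions.
    return [
        {"type": "prompt", "content": [{"text": prompt}]},
        {"type": "completion", "content": [{"text": completion}]},
    ]
```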
Our end-to-end data processing pipeline consists of three stages:
Read
Data is sourced from either a local file store or the Hugging Face Hub and processed by the Reader module, which applies a read_hook before converting the data into a SemanticDataArray.
Tokenize
The data is then transformed using one of three token generators—pre-training, fine-tuning, or custom—depending on the use case.
Write
The tokens are then processed by the Writer module and stored in HDF5 format, forming the TokenFlow. This structured approach ensures efficient handling and tokenization of input data for downstream tasks.
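Concretely, the read hook and its arguments are selected in the preprocessing configuration. The fragment below, written as a Python dict, is a hypothetical illustration; the hook path and parameter names are placeholders, not the exact configuration schema:

```python
# Hypothetical preprocessing-config fragment; the hook path and key names
# are placeholders chosen for illustration.
processing = {
    "read_hook": "my_package.hooks:prompt_completion_text_hook",  # Read stage
    "read_hook_kwargs": {
        "prompt_key": "question",    # data keys carry the _key suffix
        "completion_key": "answer",  # (see Important Considerations below)
    },
}
```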
This guide explores how to configure and utilize pre-built read hooks to streamline data preprocessing across different data types and platforms, enabling more robust and adaptable machine learning pipelines.
By mastering read hooks, you’ll gain the ability to:
- Seamlessly integrate local and HuggingFace data sources
- Customize data loading for specific tasks
- Optimize preprocessing efficiency
- Enhance overall model performance
See Write a Custom Read Hook to learn how to create your own.
Fine-tuning LLaVA Hook
This read hook processes conversation data to format it for fine-tuning LLaVA models. It looks for conversation turns, optional system prompts, and images. It requires keys for conversation data and image paths.
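For instance, a LLaVA-style record and matching arguments might look like the following; the parameter names multi_turn_key and image_key are assumed, following the _key convention described under Important Considerations:

```python
# Hypothetical LLaVA-style input record.
example = {
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown here?"},
        {"from": "gpt", "value": "A cat sitting on a windowsill."},
    ],
    "image": "images/0001.jpg",
}
# Assumed kwargs pointing the hook at the conversation and image fields.
read_hook_kwargs = {"multi_turn_key": "conversations", "image_key": "image"}
```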
Pretraining Text Hook
This read hook extracts and processes plain text data for pretraining tasks. It requires a key to extract text from input data.
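A minimal, hypothetical setup for a dataset whose records carry raw text in a text field (the parameter name text_key is an assumption):

```python
# A record like {"text": "The quick brown fox ..."} becomes one text region.
read_hook_kwargs = {"text_key": "text"}  # assumed parameter name
```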
Pretraining Image Captions Hook
This read hook prepares data for image captioning pretraining tasks by extracting image paths and captions.
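A hypothetical configuration for such a dataset (parameter names assumed):

```python
# Assumed kwargs for records like {"img": "images/42.png", "caption": "..."}.
read_hook_kwargs = {"image_key": "img", "caption_key": "caption"}
```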
NLG Hook
This read hook processes natural language generation (NLG) data, organizing context and completion information into a structured format. It requires context and completion keys.
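A hypothetical configuration (parameter names assumed):

```python
# Assumed kwargs for a record such as
# {"context": "Translate to French: Hello", "completion": "Bonjour"}.
read_hook_kwargs = {"context_key": "context", "completion_key": "completion"}
```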
Prompt Completion Text Hook
This read hook formats prompt and completion text into a structured list. It requires prompt and completion keys.
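For example (parameter names assumed):

```python
# A record like {"prompt": "2+2=", "completion": "4"} would yield one
# prompt region followed by one completion region.
read_hook_kwargs = {"prompt_key": "prompt", "completion_key": "completion"}
```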
Chat Hook
This read hook transforms chat data into a semantic data array, distinguishing between user and assistant roles. It assumes the data is in a conversation format and requires a key for multi-turn content if the data is not in OpenAI ChatML format.
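For example, an OpenAI ChatML-style record needs no extra key, while other layouts name their multi-turn field (the parameter name multi_turn_key is an assumption):

```python
# OpenAI ChatML-style record, consumable as-is.
example = {
    "messages": [
        {"role": "user", "content": "What is HDF5?"},
        {"role": "assistant", "content": "A hierarchical binary data format."},
    ]
}
# Assumed kwarg naming the multi-turn field for non-ChatML datasets.
read_hook_kwargs = {"multi_turn_key": "messages"}
```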
DPO Hook
This read hook structures data for Direct Preference Optimization (DPO) tasks, organizing prompts, chosen responses, and rejected responses into a semantic data array. It requires keys for prompt, chosen, and rejected data. The implementation can be found here.
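A hypothetical configuration (parameter names assumed):

```python
# Assumed kwargs for a preference record such as
# {"prompt": "...", "chosen": "...", "rejected": "..."}.
read_hook_kwargs = {
    "prompt_key": "prompt",
    "chosen_key": "chosen",
    "rejected_key": "rejected",
}
```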
Prompt Completion Chat Hook
This read hook processes prompt and completion data as a single-turn chat and converts it into the semantic data array format. The implementation can be found here.
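Sketched with assumed parameter names and an assumed output shape:

```python
example = {"prompt": "Summarize: ...", "completion": "The article says ..."}
read_hook_kwargs = {"prompt_key": "prompt", "completion_key": "completion"}
# Conceptually yields a single-turn chat, e.g. a user region followed by
# an assistant region (exact region types may differ).
```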
Fine-Tuning Image Captions Hook
This read hook processes image-caption data for fine-tuning into the semantic data array format. It requires keys for image and caption data. The hook implementation can be found here.
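A hypothetical configuration (parameter names assumed):

```python
# Assumed kwargs for records like {"image": "img/7.jpg", "caption": "A dog."}.
read_hook_kwargs = {"image_key": "image", "caption_key": "caption"}
```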
Fine-Tuning LLaVA Hook Prompt Completion
This read hook transforms conversation data for fine-tuning LLaVA, alternating between prompt and completion roles. Requires keys for conversation data and image paths. The hook implementation can be found here.
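The record shape matches the fine-tuning LLaVA hook above; only the role mapping differs (parameter names assumed):

```python
# Alternating turns map to prompt/completion roles instead of user/assistant.
read_hook_kwargs = {"multi_turn_key": "conversations", "image_key": "image"}
```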
Important Considerations
- Handling keys to read data: The read_hook_kwargs property must have data keys with the suffix _key to segregate them from the other parameters in read_hook_kwargs. The _key entries are used exclusively to read data from the input, while the remaining parameters are used when the read hooks build the semantic data array (see the sketch after this list).
- Space Handling: When combining custom regions, the data processor does not add any spaces or separators between the regions. Space handling must be managed within the read hooks: when creating custom semantic regions, ensure there is a leading space at the start of each region (except the first) to prevent the merging of words from neighboring regions.
- Multimodal Datasets: When working with multimodal datasets, if images are provided as URLs, the hooks should download the images and generate image paths to be used by the multimodal models.
- Separator Handling With Prompt Completion Read Hook: The token generator adds a separator token between the prompt and completion semantic regions. The tokenizer’s sep_token attribute is used as the separator token if present; otherwise we use <|sep|>.
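Putting the first two considerations together, a hypothetical read_hook_kwargs and a custom-region sketch might look like this (all names are illustrative):

```python
# Data keys end in _key; the other entries are ordinary hook parameters.
read_hook_kwargs = {
    "prompt_key": "question",    # data key: read from the input record
    "completion_key": "answer",  # data key: read from the input record
    "system_prompt": "You are a helpful assistant.",  # not a data key
}

# Space handling inside a hypothetical custom hook: every region after the
# first gets a leading space so words from neighboring regions don't merge.
regions = ["First region", "second region", "third region"]
contents = [r if i == 0 else " " + r for i, r in enumerate(regions)]
```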
What’s Next?
Now that you’ve configured your input data and understand how to process it using various read hooks, the next step is to set up your token generators. Token generators play a crucial role in the preprocessing pipeline, as they convert raw data into tokenized formats suitable for machine learning models.