Model Description

T5 (Text-To-Text Transfer Transformer) is a sequence-to-sequence model that frames all NLP tasks as text-to-text problems. Originally introduced by Raffel et al. in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, this model enables a single architecture to be applied across translation, summarization, classification, question answering, and more.
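To make the unified format concrete, the snippet below renders a few tasks from the paper as plain input/output string pairs. The task prefixes and examples follow Figure 1 of the paper; the dictionary itself is purely illustrative and not part of this repository.

```python
# Illustrative examples of T5's text-to-text format: every task is reduced to
# mapping an input string to an output string. The prefixes and examples follow
# Figure 1 of the T5 paper; this dictionary is for demonstration only.
text_to_text_examples = {
    "translation": (
        "translate English to German: That is good.",
        "Das ist gut.",
    ),
    "classification (CoLA acceptability)": (
        "cola sentence: The course is jumping well.",
        "not acceptable",
    ),
    "regression (STS-B similarity, rendered as a string)": (
        "stsb sentence1: The rhino grazed on the grass. "
        "sentence2: A rhino is grazing in a field.",
        "3.8",
    ),
    "summarization": (
        "summarize: state authorities dispatched emergency crews tuesday to "
        "survey the damage after an onslaught of severe weather in mississippi...",
        "six people hospitalized after a storm in attala county.",
    ),
}

for task, (source, target) in text_to_text_examples.items():
    print(f"{task}\n  input : {source}\n  target: {target}\n")
```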

This implementation follows the T5.1.1 variant, which is pretrained purely self-supervised on the C4 dataset, with no supervised tasks mixed into pretraining. T5 also modifies the standard Transformer block by reordering the normalization and residual connections: each sub-layer's input is normalized, while the residual path bypasses the normalization (a pre-norm arrangement), as sketched below.
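A minimal PyTorch sketch of that reordering, using a standard nn.MultiheadAttention sub-layer rather than this repository's own attention implementation:

```python
import torch
import torch.nn as nn

class PreNormSelfAttentionBlock(nn.Module):
    """Simplified T5-style (pre-norm) self-attention sub-layer.

    Post-norm (original Transformer):  x = LayerNorm(x + SubLayer(x))
    Pre-norm  (T5):                    x = x + SubLayer(LayerNorm(x))
    """

    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize the sub-layer *input*; the residual path stays un-normalized.
        y = self.norm(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        return x + self.dropout(y)

# Quick shape check on random data.
block = PreNormSelfAttentionBlock(d_model=512, num_heads=8)
out = block(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```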

T5’s key contributions include:

  • Proposing a unified text-to-text format for all NLP tasks (Section 2.4)
  • Comparing encoder-decoder vs. decoder-only variants (Section 3.2)
  • Evaluating different unsupervised training objectives, including denoising (span corruption) and language modeling (Section 3.3); a span-corruption example is sketched after this list
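To illustrate the denoising objective used for pretraining, the snippet below reproduces the span-corruption example from the paper. The corrupt_spans helper is a simplified stand-in for illustration only; it is not this repository's data pipeline, which samples span positions and lengths randomly.

```python
# Span corruption, illustrated with the example from the T5 paper.
# Original text:  "Thank you for inviting me to your party last week ."
# Inputs:         "Thank you <X> me to your party <Y> week ."
# Targets:        "<X> for inviting <Y> last <Z>"

def corrupt_spans(tokens, spans):
    """Simplified span corruption (illustrative only).

    `spans` is a list of (start, end) token index pairs to drop; each dropped
    span is replaced by a unique sentinel in the input, and the target lists
    the sentinels followed by the tokens they replaced.
    """
    sentinels = [f"<extra_id_{i}>" for i in range(len(spans) + 1)]
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinels[i])
        targets.append(sentinels[i])
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(sentinels[len(spans)])  # final sentinel marks end of targets
    return " ".join(inputs), " ".join(targets)

tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = corrupt_spans(tokens, spans=[(2, 4), (8, 9)])
print(inputs)   # Thank you <extra_id_0> me to your party <extra_id_1> week .
print(targets)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```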

Code Structure

The code for this model is located in the t5 directory and reuses generic components for interfacing with training scripts and configuration systems.

  • configs/: YAML configuration files specifying training and model hyperparameters.
  • model.py: Wrapper for initializing and interfacing with the T5 model.
  • t5_model.py: Main model implementation including encoder-decoder structure and forward logic.
  • utils.py: Utility functions for config parsing and data handling.
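A rough sketch of how such a wrapper/implementation split often looks, in generic PyTorch. All class names, config keys, and batch fields below are hypothetical and do not mirror the actual code in model.py or t5_model.py.

```python
import torch
import torch.nn as nn

class T5WrapperSketch(nn.Module):
    """Hypothetical illustration of the wrapper pattern (not the real model.py).

    The wrapper builds the network from a parsed configuration dictionary and
    exposes a loss-returning forward pass for the training scripts; the core
    encoder-decoder (t5_model.py in this repo) is stubbed here with
    torch.nn.Transformer plus an output projection.
    """

    def __init__(self, params: dict):
        super().__init__()
        m = params["model"]  # assumed config layout and keys, for illustration
        self.embed = nn.Embedding(m["vocab_size"], m["d_model"])
        self.core = nn.Transformer(
            d_model=m["d_model"],
            nhead=m["num_heads"],
            num_encoder_layers=m["encoder_num_hidden_layers"],
            num_decoder_layers=m["decoder_num_hidden_layers"],
            batch_first=True,
        )
        self.lm_head = nn.Linear(m["d_model"], m["vocab_size"], bias=False)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, batch: dict) -> torch.Tensor:
        src = self.embed(batch["encoder_input_ids"])
        tgt = self.embed(batch["decoder_input_ids"])
        logits = self.lm_head(self.core(src, tgt))
        return self.loss_fn(logits.flatten(0, 1), batch["labels"].flatten())
```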

Available Configurations

Configuration    Description
t5_small.yaml    T5-Small: d_kv=64, num_heads=8, encoder_num_hidden_layers=6.
t5_base.yaml     T5-Base: d_kv=64, num_heads=12, encoder_num_hidden_layers=12.
t5_3B.yaml       T5-3B: d_kv=128, num_heads=32, encoder_num_hidden_layers=24.
t5_11B.yaml      T5-11B: d_kv=128, num_heads=128, encoder_num_hidden_layers=24.
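The hyperparameters in the table correspond to keys in the YAML files. Below is a minimal sketch of reading one of them with PyYAML; the nesting under a top-level model section is an assumption, so treat the files in configs/ as the authoritative reference.

```python
import yaml

# Load one of the configuration files listed above. The nesting under a
# top-level "model" section is assumed for illustration; the real files in
# configs/ are authoritative.
with open("configs/t5_base.yaml") as f:
    config = yaml.safe_load(f)

model_cfg = config["model"]
print(model_cfg["d_kv"])                       # e.g. 64 for T5-Base
print(model_cfg["num_heads"])                  # e.g. 12 for T5-Base
print(model_cfg["encoder_num_hidden_layers"])  # e.g. 12 for T5-Base
```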

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.

For a complete list of Cerebras Model Zoo CLI commands, see the command reference.

Implementation Notes

This implementation includes the following deviations from the original T5.1.1 spec:

  1. Optimizer: Adafactor is not currently supported. We use AdamW, which may lead to slightly higher final loss.
  2. Normalization: We use LayerNorm instead of the originally proposed RMSNorm due to hardware support constraints; the two are compared in the sketch after this list.
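The difference between the two normalizations is small but precise: RMSNorm (original T5) rescales by the root mean square of the activations with a learned gain and no mean subtraction or bias, whereas LayerNorm also re-centers the activations and adds a learned bias. A side-by-side sketch in PyTorch, using a generic RMSNorm reference implementation rather than code from this repository:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Reference RMSNorm as used by the original T5: scale by the root mean
    square of the activations with a learned gain, and no mean subtraction
    or bias term."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

x = torch.randn(2, 16, 512)

# LayerNorm (used in this implementation): subtracts the mean, divides by the
# standard deviation, then applies a learned gain and bias.
layer_norm = nn.LayerNorm(512)

# RMSNorm (original T5.1.1): no centering, no bias.
rms_norm = RMSNorm(512)

print(layer_norm(x).mean(dim=-1).abs().max())  # ~0: outputs are re-centered
print(rms_norm(x).mean(dim=-1).abs().max())    # generally non-zero: no centering
```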

References

  • Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1-67, 2020. https://arxiv.org/abs/1910.10683