GPT-3
A decoder-only transformer language model, scaled to billions of parameters and trained with an autoregressive next-token prediction objective, with support for µP scaling and Cerebras-optimized workflows.
Model Description
GPT-3 is a decoder-only transformer language model architecture designed for large-scale autoregressive pretraining. It extends GPT-2 with significantly more parameters (ranging from 1.3B to 175B) and introduces architectural refinements such as sparse attention layers, used in alternating blocks to reduce compute costs during training. This implementation, however, uses GPT-2-style dense attention in all layers.
The model is trained on next-token prediction over large text corpora such as the Pile, with inputs represented as token sequences padded and masked to a fixed maximum sequence length.
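As an illustration of this objective, the sketch below computes a masked next-token cross-entropy loss in plain PyTorch. It is not the ModelZoo implementation; the tensor names follow the input specification later in this document, and the shift-by-one handling shown here is a common convention that the actual data pipeline may instead perform during preprocessing.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, labels, attention_mask):
    """Masked autoregressive LM loss (illustrative sketch, not ModelZoo code).

    logits:         (batch, seq_len, vocab_size) float
    labels:         (batch, seq_len) int, same token IDs as the inputs
    attention_mask: (batch, seq_len) int, 1 for real tokens, 0 for padding
    """
    # Position t predicts token t+1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous().long()
    shift_mask = attention_mask[:, 1:].contiguous().float()

    per_token = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
    )
    # Average only over non-padded target positions.
    return (per_token * shift_mask.view(-1)).sum() / shift_mask.sum()
```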
Code Structure
The code for this model is located in the gpt3 directory within ModelZoo. Here’s how it’s organized:
- configs/: Contains YAML configuration files for various GPT-3-sized models.
- run.py: Training and evaluation entry point. Accepts CLI arguments for mode, config path, checkpointing, and output directories.
Our implementation of GPT-3 is built on top of our GPT-2 backbone. For more details, see gpt2_model.py.
Available Configurations
The 1.3b (xl), 2.7b, 6.7b, and 13b configs in configs/ show an example of setting the micro batch size explicitly in the train_input section of the config. Without this setting, a search for the best micro batch size is performed automatically during compilation, which can take a long time for larger models.
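For reference, here is a minimal sketch of that train_input fragment, written as the equivalent Python mapping (the shipped configuration files are YAML; the key names follow the description above, and the values are placeholders rather than tuned settings):

```python
# Illustrative only -- mirrors the structure of the YAML train_input section.
# Pinning micro_batch_size skips the automatic micro batch size search at
# compile time; both values below are placeholders, not recommended settings.
train_input = {
    "batch_size": 256,        # global batch size (placeholder)
    "micro_batch_size": 32,   # explicit micro batch size (placeholder)
}
```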
Model Input Tensor Specifications
Input Name | Shape | Data Type | Description
---|---|---|---
input_ids | (batch_size, max_sequence_length) | torch.int32 | Token IDs, padded to full sequence length.
attention_mask | (batch_size, max_sequence_length) | torch.int32 | 1s for valid tokens, 0s for padding.
labels | (batch_size, max_sequence_length) | torch.int32 | Targets for language modeling (same as input_ids).
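To make the layout concrete, the snippet below builds a toy batch with these shapes and dtypes. The sizes (and the GPT-2 BPE vocabulary size) are illustrative choices, not values read from a shipped config.

```python
import torch

batch_size, max_sequence_length, vocab_size = 2, 2048, 50257  # illustrative sizes
valid_lengths = torch.tensor([2048, 1500])  # second sequence is padded

positions = torch.arange(max_sequence_length)[None, :]
attention_mask = (positions < valid_lengths[:, None]).to(torch.int32)

input_ids = torch.randint(0, vocab_size, (batch_size, max_sequence_length), dtype=torch.int32)
input_ids = input_ids * attention_mask          # zero out the padded tail
labels = input_ids.clone()                      # targets are the same token IDs

batch = {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```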
These tensors are produced by GptHDF5DataProcessor.py, which consumes the .h5 files generated from PILE-formatted datasets by preprocess_data.py.
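If you want to sanity-check the preprocessed output, the short sketch below lists every dataset inside one of the generated .h5 files without assuming a particular internal layout; the file path is hypothetical and depends on where your preprocessing run wrote its output.

```python
import h5py

# Hypothetical path -- substitute a file produced by your preprocess_data.py run.
with h5py.File("preprocessed_pile/data-00000.h5", "r") as f:
    def show(name, obj):
        # Groups have no shape/dtype; datasets report both.
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    f.visititems(show)
```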
Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.
For a complete list of Cerebras ModelZoo CLI commands, see the command reference.
Advanced Features
This implementation supports:
- µP (Maximal Update Parametrization): For hyperparameter transfer from small proxy models to large target models. See the µP Tutorial. A minimal sketch of the width-scaling idea appears after this list.
- Cerebras-GPT Recipes: Prebuilt configs under configs/Cerebras_GPT/ to reproduce results from the Cerebras-GPT blog.
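The sketch below illustrates the width-scaling idea behind µP transfer. It is not the ModelZoo implementation: it only shows the commonly used Adam-style rule in which learning rates for hidden (matrix-like) parameters shrink in proportion to the width ratio, while vector-like parameters such as embeddings keep the base learning rate; the function and group names are hypothetical, and details such as output-logit multipliers are omitted.

```python
def mup_lr_for_group(base_lr, base_width, target_width, group):
    """Illustrative µP-style learning-rate transfer (hypothetical helper).

    base_lr is tuned on a small proxy model with hidden size base_width and
    reused for a target model with hidden size target_width.
    """
    width_mult = target_width / base_width
    if group == "hidden":   # matrix-like weights (attention projections, MLP)
        return base_lr / width_mult
    return base_lr          # vector-like params (embeddings, biases, norms)

# Example: a proxy tuned at hidden size 256, transferred to hidden size 2048.
proxy_lr = 6e-3
print(mup_lr_for_group(proxy_lr, 256, 2048, "hidden"))     # 7.5e-04
print(mup_lr_for_group(proxy_lr, 256, 2048, "embedding"))  # 6e-03
```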