Model Description
GPT-3 is a decoder-only transformer language model designed for large-scale autoregressive pretraining. It extends GPT-2 with significantly more parameters (ranging from 1.3B to 175B) and introduces architectural refinements such as sparse attention layers, used in alternating blocks to reduce compute costs during training. This implementation, however, uses GPT-2-style dense attention in all layers. Training uses a next-token prediction objective on large text corpora such as The Pile, with inputs represented as token sequences padded and masked to a fixed maximum sequence length.
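The objective can be summarized as standard causal language modeling in which padded positions are excluded from the loss. The sketch below is illustrative only and is not the ModelZoo implementation; the function name and the use of -100 as an ignore index are assumptions.

```python
# Minimal sketch of next-token prediction over fixed-length, padded sequences.
# Illustrative only; not the ModelZoo code.
import torch
import torch.nn.functional as F

def next_token_loss(logits, input_ids, attention_mask):
    """Cross-entropy on shifted targets, ignoring padded positions.

    logits:         (batch, seq_len, vocab_size) model outputs
    input_ids:      (batch, seq_len) token IDs, padded to seq_len
    attention_mask: (batch, seq_len) 1 for real tokens, 0 for padding
    """
    # Position t predicts token t+1: drop the last logit and the first target.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone().long()
    # Exclude padded targets from the loss.
    shift_labels[attention_mask[:, 1:] == 0] = -100
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```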
Code Structure
The code for this model is located in the gpt3 directory within ModelZoo. Here's how it's organized:
- configs/: Contains YAML configuration files for various GPT-3-sized models.
- run.py: Training and evaluation entry point. Accepts CLI arguments for mode, config path, checkpointing, and output directories.
Our implementation of GPT-3 is built on top of our GPT-2 backbone. For more details, see gpt2_model.py.
Available Configurations
Cerebras-GPT
Configuration | Description |
---|---|
111m.yaml | 111M parameter model using standard parametrization. |
111m_mup.yaml | 111M parameter model with Maximal Update Parametrization (µP). |
256m.yaml | 256M parameter model using standard parametrization. |
256m_mup.yaml | 256M parameter model with µP. |
590m.yaml | 590M parameter model using standard parametrization. |
590m_mup.yaml | 590M parameter model with µP. |
1p3b.yaml | 1.3B parameter model (GPT-3 XL equivalent). |
1p3b_mup.yaml | 1.3B parameter model with µP. |
2p7b.yaml | 2.7B parameter model. |
2p7b_mup.yaml | 2.7B parameter model with µP. |
6p7b.yaml | 6.7B parameter model. |
13b_bs720.yaml | 13B parameter model, batch size 720. |
13b_bs1080.yaml | 13B parameter model, batch size 1080. |
Sparsity
Configuration | Description |
---|---|
params_gpt3_125m_rigl75.yaml | 125M parameter model with 75% sparsity using RigL pruning. |
params_gpt3_125m_set75.yaml | 125M parameter model with 75% sparsity using SET pruning. |
params_gpt3_125m_static75.yaml | 125M parameter model with 75% fixed sparse weights. |
params_gpt3_125m_sparsewide-ift_dense.yaml | 125M dense model for sparsewide-IFT comparison. |
params_gpt3_125m_sparsewide-ift_rigl75.yaml | 125M model with 75% RigL sparsity in sparsewide-IFT setup. |
params_gpt3_125m_sparsewide-ift_static50.yaml | 125M model with 50% static sparsity in sparsewide-IFT setup. |
params_gpt3_6p7b_vspdf_phase1.yaml | 6.7B sparse model for VSPDF Phase 1 training. |
params_gpt3_6p7b_vspdf_phase2.yaml | 6.7B sparse model for VSPDF Phase 2 training. |
params_gpt3_6p7b_vspdf_dart.yaml | 6.7B model with DART sparsity applied for VSPDF fine-tuning. |
The 1.3b (XL), 2.7b, 6.7b, and 13b configs above show an example of setting the micro batch size explicitly in the train_input section of the config. Without this setting, a search for the best micro batch size is performed automatically during compilation, which can take a long time for larger models.
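As a concrete illustration, the snippet below writes an explicit micro batch size into the train_input section of a config before launching run.py. It is a minimal sketch assuming PyYAML; the config path, the key name micro_batch_size, and the value are examples only and may differ between releases, so check the YAML files shipped in configs/.

```python
# Illustrative only: set an explicit micro batch size in the train_input
# section of a config to avoid the automatic micro batch size search at
# compile time. Path, key name, and value are examples.
import yaml

with open("configs/Cerebras_GPT/1p3b.yaml") as f:  # hypothetical path
    params = yaml.safe_load(f)

params["train_input"]["micro_batch_size"] = 64  # example value only

with open("configs/custom_1p3b.yaml", "w") as f:
    yaml.safe_dump(params, f)
```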
Model Input Tensor Specifications
Input Name | Shape | Data Type | Description |
---|---|---|---|
input_ids | (batch_size, max_sequence_length) | torch.int32 | Token IDs, padded to full sequence length. |
attention_mask | (batch_size, max_sequence_length) | torch.int32 | 1s for valid tokens, 0s for padding. |
labels | (batch_size, max_sequence_length) | torch.int32 | Targets for language modeling (same as inputs). |
Model inputs are produced by GptHDF5DataProcessor.py, which consumes PILE-formatted datasets converted to .h5 files by preprocess_data.py.
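For reference, here is a minimal sketch, using random token IDs and made-up sizes, of a batch matching the tensor specifications above; the vocabulary size and sequence length are placeholders.

```python
# Minimal sketch of a batch matching the input tensor specifications above.
# Sizes and vocabulary are placeholders, and the token IDs are random.
import torch

batch_size, max_sequence_length, vocab_size = 4, 2048, 50257

input_ids = torch.randint(
    0, vocab_size, (batch_size, max_sequence_length), dtype=torch.int32
)
# 1 for valid tokens, 0 for padding (here all positions are valid).
attention_mask = torch.ones(
    (batch_size, max_sequence_length), dtype=torch.int32
)
# Labels are the same token IDs; shifting to next-token targets is handled
# by the model/loss.
labels = input_ids.clone()

batch = {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
for name, tensor in batch.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```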
Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning. For a complete list of Cerebras ModelZoo CLI commands, see the command reference.
Advanced Features
This implementation supports:
- µP (Maximal Update Parametrization): For hyperparameter transfer from small proxy models to large target models. See the µP Tutorial.
- Cerebras-GPT Recipes: Prebuilt configs under configs/Cerebras_GPT/ to reproduce results from the Cerebras-GPT blog.