GPT-J & GPT-NeoX
Decoder-only language models by EleutherAI, designed for high-throughput training and strong zero-shot performance across a range of natural language tasks.
Model Description
GPT-J and GPT-NeoX are families of decoder-only language models developed by EleutherAI and trained on the Pile dataset, a curated mixture of diverse text sources. Both models are designed to be efficient, flexible, and performant in zero-shot settings without the need for task-specific fine-tuning.
GPT-J
GPT-J is a 6B parameter autoregressive transformer with architectural similarities to GPT-3. It introduces a parallel decoder block in which the attention and feed-forward layers are computed in parallel and their outputs added together, improving throughput by approximately 15% compared to traditional sequential transformer blocks. This design is especially beneficial in distributed training, where it reduces cross-device communication, and it also helps on a single device by allowing the two sublayers' computations to be fused.
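The sketch below illustrates this parallel formulation in PyTorch. It is a minimal illustration, not the ModelZoo implementation; the class name ParallelDecoderBlock and the ffn_mult parameter are assumptions made for the example.

```python
# Illustrative sketch of a parallel decoder block (not the ModelZoo implementation).
# The attention and feed-forward sublayers read the same normalized input, and
# their outputs are summed into a single residual update.
import torch
import torch.nn as nn


class ParallelDecoderBlock(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, ffn_mult: int = 4):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_mult * hidden_size),
            nn.GELU(),
            nn.Linear(ffn_mult * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        # Parallel formulation: x + Attn(LN(x)) + FFN(LN(x)),
        # versus the sequential x + FFN(LN(x + Attn(LN(x)))).
        return x + attn_out + self.ffn(h)
```

Note that a single LayerNorm feeds both sublayers here, which matches GPT-J; GPT-NeoX unties these norms, as described below.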
GPT-J also adopts Rotary Position Embeddings (RoPE) — applying them to 25% of the features while using sinusoidal embeddings for the remainder. This hybrid approach balances convergence speed with long-context modeling capabilities. Additionally, GPT-J employs dense attention, prioritizing simplicity and training stability at this scale over sparse alternatives.
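The snippet below sketches partial rotary embeddings in PyTorch using the rotate-half formulation: only the first rotary_pct of each head's dimensions is rotated, and the rest passes through unchanged. The function names and the (batch, seq_len, num_heads, head_dim) layout are assumptions for illustration, not the ModelZoo API.

```python
# Illustrative sketch of partial rotary position embeddings (RoPE): a fraction of
# each attention head's dimensions (here 25%) is rotated and the remaining
# dimensions are passed through unchanged.
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_partial_rope(x: torch.Tensor, rotary_pct: float = 0.25, base: float = 10000.0):
    # x: (batch, seq_len, num_heads, head_dim)
    seq_len, head_dim = x.shape[1], x.shape[-1]
    rotary_dim = int(head_dim * rotary_pct)

    # Frequencies and per-position angles for the rotated slice only.
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    angles = torch.cat((angles, angles), dim=-1)   # (seq_len, rotary_dim)
    cos = angles.cos()[None, :, None, :]           # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]

    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    x_rot = x_rot * cos + rotate_half(x_rot) * sin
    return torch.cat((x_rot, x_pass), dim=-1)
```

For example, with a head dimension of 256 and rotary_pct=0.25, only the first 64 dimensions of each head are rotated.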
GPT-Neox
GPT-NeoX shares the same architecture as GPT-J, with a few refinements:
- Untied LayerNorm: Each transformer block uses two independent layer normalization layers, one for the attention path and one for the feed-forward path, instead of a single shared one (see the sketch after this list).
- Enhanced Tokenizer: The tokenizer was retrained on the Pile and optimized for whitespace handling, repeated tokens, and programming languages, making GPT-NeoX more robust for structured text such as code.
These design choices allow GPT-NeoX to generalize well across a broad range of domains and sequence lengths, including natural language and code generation tasks.
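To make the untied-LayerNorm refinement concrete, here is a variant of the earlier parallel-block sketch with separate normalization layers for the attention and feed-forward paths. As before, the class and parameter names are illustrative assumptions, not the ModelZoo implementation.

```python
# Illustrative sketch of a GPT-NeoX-style parallel block with untied LayerNorm:
# the attention and feed-forward paths each get their own normalization layer
# instead of sharing one, as the GPT-J sketch above does.
import torch
import torch.nn as nn


class UntiedLayerNormBlock(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, ffn_mult: int = 4):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)  # normalizes the attention input
        self.ln_ffn = nn.LayerNorm(hidden_size)   # independent norm for the FFN path
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_mult * hidden_size),
            nn.GELU(),
            nn.Linear(ffn_mult * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        h_attn = self.ln_attn(x)
        attn_out, _ = self.attn(h_attn, h_attn, h_attn, attn_mask=attn_mask, need_weights=False)
        return x + attn_out + self.ffn(self.ln_ffn(x))
```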
Code Structure
The code for these models is located in the /gptj directory within ModelZoo. Here’s how it’s organized:
- /configs: Contains YAML configuration files for GPT-J.
- /continuous_pretraining/configs: Contains configs for continuous pretraining of GPT-J.
- model.py: The implementation of the GPT-J model.
- gptneox/model.py: The implementation of the GPT-NeoX model.
Our implementations of GPT-J and GPT-NeoX are built on top of our GPT-2 backbone. For more details, see gpt2_model.py.
Available Configurations
| Configuration | Description |
|---|---|
| params_gptj_6B.yaml | Standard 6B parameter GPT-J model. |
| params_gptj_6B_muP.yaml | GPT-J model configured with μ-parameterization (μP) for scaling. |
| params_gptj_6B_TRC2.yaml | GPT-J 6B model for continued pretraining on the TRC2 dataset. |
| params_gpt_neox_20B.yaml | GPT-NeoX model with 20B parameters. |
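If you want to inspect one of these configurations before launching a run, a short sketch like the one below lists its top-level sections. The relative path assumes you are inside the /gptj model directory, and the section names vary from config to config, so the script makes no assumptions about specific keys.

```python
# Minimal sketch for inspecting a Model Zoo YAML config before launching a run.
import yaml  # requires PyYAML

with open("configs/params_gptj_6B.yaml") as f:
    params = yaml.safe_load(f)

# Print each top-level section and how many settings it contains.
for section, value in params.items():
    num_settings = len(value) if isinstance(value, dict) else 1
    print(f"{section}: {num_settings} setting(s)")
```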
Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.
For a complete list of Cerebras ModelZoo CLI commands, see the command reference.
References
- Wang, B. and Komatsuzaki, A. (2021). Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX
- Shoeybi, M. et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding
- Brown, T. et al. (2020). Language Models are Few-Shot Learners
- EleutherAI (2021). Rotary Embeddings: A Relative Revolution
- Black, S. et al. (2022). GPT-NeoX-20B: An Open-Source Autoregressive Language Model