GPT-J & GPT-NeoX
Decoder-only language models by EleutherAI, designed for high-throughput training and strong zero-shot performance across a range of natural language tasks.
Model Description
GPT-J and GPT-NeoX are families of decoder-only language models developed by EleutherAI and trained on the Pile dataset, a curated mixture of diverse text sources. Both models are designed to be efficient, flexible, and performant in zero-shot settings without the need for task-specific fine-tuning.
GPT-J
GPT-J is a 6B parameter autoregressive transformer with architectural similarities to GPT-3. It introduces a parallel decoder block in which the attention and feed-forward layers are computed in parallel and their outputs added together, improving throughput by approximately 15% compared to traditional sequential transformer blocks. This design is especially beneficial in distributed training, where it reduces cross-device communication, and it also helps on a single device by allowing the two sublayers' computations to be fused.
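The sketch below illustrates this parallel formulation in PyTorch. It is a minimal illustration, not the ModelZoo implementation; the class name ParallelDecoderBlock and the ffn_mult parameter are assumptions made for the example.

```python
# Illustrative sketch of a parallel decoder block (not the ModelZoo implementation).
# The attention and feed-forward sublayers read the same normalized input, and
# their outputs are summed into a single residual update.
import torch
import torch.nn as nn


class ParallelDecoderBlock(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, ffn_mult: int = 4):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_mult * hidden_size),
            nn.GELU(),
            nn.Linear(ffn_mult * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        # Parallel formulation: x + Attn(LN(x)) + FFN(LN(x)),
        # versus the sequential x + FFN(LN(x + Attn(LN(x)))).
        return x + attn_out + self.ffn(h)
```

Note that a single LayerNorm feeds both sublayers here, which matches GPT-J; GPT-NeoX unties these norms, as described below.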
GPT-J also adopts Rotary Position Embeddings (RoPE) — applying them to 25% of the features while using sinusoidal embeddings for the remainder. This hybrid approach balances convergence speed with long-context modeling capabilities. Additionally, GPT-J employs dense attention, prioritizing simplicity and training stability at this scale over sparse alternatives.
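The snippet below sketches partial rotary embeddings in PyTorch using the rotate-half formulation: only the first rotary_pct of each head's dimensions is rotated, and the rest passes through unchanged. The function names and the (batch, seq_len, num_heads, head_dim) layout are assumptions for illustration, not the ModelZoo API.

```python
# Illustrative sketch of partial rotary position embeddings (RoPE): a fraction of
# each attention head's dimensions (here 25%) is rotated and the remaining
# dimensions are passed through unchanged.
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_partial_rope(x: torch.Tensor, rotary_pct: float = 0.25, base: float = 10000.0):
    # x: (batch, seq_len, num_heads, head_dim)
    seq_len, head_dim = x.shape[1], x.shape[-1]
    rotary_dim = int(head_dim * rotary_pct)

    # Frequencies and per-position angles for the rotated slice only.
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    angles = torch.cat((angles, angles), dim=-1)   # (seq_len, rotary_dim)
    cos = angles.cos()[None, :, None, :]           # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]

    x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
    x_rot = x_rot * cos + rotate_half(x_rot) * sin
    return torch.cat((x_rot, x_pass), dim=-1)
```

For example, with a head dimension of 256 and rotary_pct=0.25, only the first 64 dimensions of each head are rotated.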
GPT-Neox
GPT-NeoX shares the same architecture as GPT-J, with a few refinements:
- Untied LayerNorm: Each transformer block uses two independent layer normalization layers, one for the attention path and one for the feed-forward path, instead of a single shared one (see the sketch after this list).
- Enhanced Tokenizer: The tokenizer was retrained on the Pile and optimized for whitespace handling, repeated tokens, and programming languages, making GPT-NeoX more robust for structured text such as code.
These design choices allow GPT-NeoX to generalize well across a broad range of domains and sequence lengths, including natural language and code generation tasks.
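To make the untied-LayerNorm refinement concrete, here is a variant of the earlier parallel-block sketch with separate normalization layers for the attention and feed-forward paths. As before, the class and parameter names are illustrative assumptions, not the ModelZoo implementation.

```python
# Illustrative sketch of a GPT-NeoX-style parallel block with untied LayerNorm:
# the attention and feed-forward paths each get their own normalization layer
# instead of sharing one, as the GPT-J sketch above does.
import torch
import torch.nn as nn


class UntiedLayerNormBlock(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, ffn_mult: int = 4):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden_size)  # normalizes the attention input
        self.ln_ffn = nn.LayerNorm(hidden_size)   # independent norm for the FFN path
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ffn_mult * hidden_size),
            nn.GELU(),
            nn.Linear(ffn_mult * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        h_attn = self.ln_attn(x)
        attn_out, _ = self.attn(h_attn, h_attn, h_attn, attn_mask=attn_mask, need_weights=False)
        return x + attn_out + self.ffn(self.ln_ffn(x))
```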
Code Structure
The code for these models is located in the /gptj directory within ModelZoo. Here’s how it’s organized:
- /configs: Contains YAML configuration files for GPT-J.
- /continuous_pretraining/configs: Contains configs for continuous pretraining of GPT-J.
- model.py: The implementation of the GPT-J model.
- gptneox/model.py: The implementation of the GPT-NeoX model.
Our implementations of GPT-J and GPT-NeoX are built on top of our GPT-2 backbone. For more details, see gpt2_model.py.
Available Configurations
| Configuration | Description |
|---|---|
| params_gptj_6B.yaml | Standard 6B parameter GPT-J model. |
| params_gptj_6B_muP.yaml | GPT-J model configured with μ-parameterization (μP) for scaling. |
| params_gptj_6B_TRC2.yaml | GPT-J 6B model for continued pretraining on the TRC2 dataset. |
| params_gpt_neox_20B.yaml | GPT-NeoX model with 20B parameters. |
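If you want to inspect one of these configurations before launching a run, a short sketch like the one below lists its top-level sections. The relative path assumes you are inside the /gptj model directory, and the section names vary from config to config, so the script makes no assumptions about specific keys.

```python
# Minimal sketch for inspecting a Model Zoo YAML config before launching a run.
import yaml  # requires PyYAML

with open("configs/params_gptj_6B.yaml") as f:
    params = yaml.safe_load(f)

# Print each top-level section and how many settings it contains.
for section, value in params.items():
    num_settings = len(value) if isinstance(value, dict) else 1
    print(f"{section}: {num_settings} setting(s)")
```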
Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.
For a complete list of Cerebras ModelZoo CLI commands, see the command reference.
References
- Wang, B. and Komatsuzaki, A. (2021). Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX
- Shoeybi, M. et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding
- Brown, T. et al. (2020). Language Models are Few-Shot Learners
- EleutherAI (2021). Rotary Embeddings: A Relative Revolution
- Black, S. et al. (2022). GPT-NeoX-20B: An Open-Source Autoregressive Language Model