Model Description

GPT-J and GPT-NeoX are decoder-only language models developed by EleutherAI and trained on the Pile dataset, a curated mixture of diverse text sources. Both models are designed to be efficient, flexible, and to perform well in zero-shot settings without task-specific fine-tuning.

GPT-J

GPT-J is a 6B parameter auto-regressive transformer with architectural similarities to GPT-3. It introduces a parallel decoder block where attention and feed-forward layers are computed in parallel and added together, improving throughput by approximately 15% compared to traditional sequential transformer blocks. This design is especially beneficial for distributed training and single-device setups where minimizing cross-device communication is critical.
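
The sketch below illustrates this parallel residual formulation in PyTorch. It is a simplified stand-in for the ModelZoo code: the ParallelBlock class and its module names are hypothetical, and causal masking is omitted for brevity.

```python
# Minimal sketch of a GPT-J-style parallel decoder block. The ParallelBlock class
# and its module names are illustrative, not the ModelZoo implementation, and
# causal masking is omitted for brevity.
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both branches consume the same normalized input, so they can be computed
        # in parallel; their outputs are summed into a single residual update,
        # rather than applying attention and feed-forward sequentially.
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)
```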

GPT-J also adopts Rotary Position Embeddings (RoPE) — applying them to 25% of the features while using sinusoidal embeddings for the remainder. This hybrid approach balances convergence speed with long-context modeling capabilities. Additionally, GPT-J employs dense attention, prioritizing simplicity and training stability at this scale over sparse alternatives.
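
As a rough illustration of this partial rotary scheme, the sketch below rotates only the first fraction of each head's dimensions and leaves the rest untouched. The function name, pairing convention, and defaults are assumptions for illustration, not the ModelZoo implementation.

```python
# Minimal sketch of partial rotary position embeddings: only the first
# `rotary_pct` fraction of each head's dimensions is rotated (25% in GPT-J);
# the remaining dimensions pass through unchanged. Names and the pairing
# convention are illustrative, not the ModelZoo implementation.
import torch

def apply_partial_rope(x: torch.Tensor, rotary_pct: float = 0.25, base: int = 10000):
    # x: (batch, seq_len, num_heads, head_dim); assumes the rotary slice is even-sized.
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rotary_pct)
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    # Standard RoPE frequencies, computed only for the rotated slice.
    inv_freq = 1.0 / (base ** (torch.arange(0, rot_dim, 2, dtype=torch.float32) / rot_dim))
    pos = torch.arange(x.shape[1], dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)            # (seq_len, rot_dim // 2)
    cos = freqs.cos()[None, :, None, :]           # broadcast over batch and heads
    sin = freqs.sin()[None, :, None, :]

    # Rotate interleaved (even, odd) pairs; for simplicity the rotated halves are
    # concatenated rather than re-interleaved.
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)
```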

GPT-NeoX

GPT-NeoX shares the same architecture as GPT-J, with a few refinements:

  • Untied LayerNorm: Each transformer block uses two independent layer normalization layers instead of a single shared one (see the sketch after this list).
  • Enhanced Tokenizer: The tokenizer was retrained on the Pile and optimized for whitespace handling, repeated tokens, and programming languages, making GPT-NeoX more robust on structured text such as code.
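
The sketch below shows what untying the normalization looks like in a parallel block. The class and attribute names are hypothetical and the block is simplified (no causal mask), so read it as an illustration rather than the ModelZoo code.

```python
# Minimal sketch contrasting GPT-NeoX-style untied LayerNorm with GPT-J's single
# shared LayerNorm in a parallel block. Class and attribute names are hypothetical,
# and causal masking is omitted for brevity.
import torch
import torch.nn as nn

class UntiedParallelBlock(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        # GPT-J normalizes once and feeds both branches; GPT-NeoX gives each
        # branch its own independent normalization parameters.
        self.ln_attn = nn.LayerNorm(hidden_size)
        self.ln_mlp = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h_attn = self.ln_attn(x)   # normalization for the attention branch
        h_mlp = self.ln_mlp(x)     # separate normalization for the feed-forward branch
        attn_out, _ = self.attn(h_attn, h_attn, h_attn, need_weights=False)
        return x + attn_out + self.mlp(h_mlp)
```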

These design choices help GPT-NeoX generalize across a broad range of domains and sequence lengths, including natural language and code generation tasks.

Code Structure

The code for these models is located in the /gptj directory within ModelZoo.

Our implementations of GPT-J and GPT-NeoX are built on top of our GPT-2 backbone. For more details, see gpt2_model.py.

Available Configurations

Configuration            | Description
-------------------------|-----------------------------------------------------------------
params_gptj_6B.yaml      | Standard 6B parameter GPT-J model.
params_gptj_6B_muP.yaml  | GPT-J model configured with μ-parameterization (μP) for scaling.
params_gptj_6B_TRC2.yaml | GPT-J 6B model for continued pretraining on the TRC2 dataset.
params_gpt_neox_20B.yaml | GPT-NeoX model with 20B parameters.
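
If you want to see what a given configuration sets before launching a run, a minimal way to peek at one of these YAML files is sketched below. It assumes nothing about the file's contents beyond it being a YAML mapping, and the path is hypothetical; adjust it to wherever the configuration files live in your checkout.

```python
# Print the top-level sections of a ModelZoo configuration file. The exact keys
# depend on the ModelZoo release, so this only reports whatever the file contains.
import yaml

with open("params_gptj_6B.yaml") as f:  # adjust the path to your checkout
    params = yaml.safe_load(f)

for section, value in params.items():
    print(section, "->", sorted(value) if isinstance(value, dict) else value)
```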

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.

For a complete list of Cerebras ModelZoo CLI commands, see the command reference.
