Implementation of the original Transformer architecture introduced in "Attention Is All You Need". The code lives in the `transformer` directory and reuses shared infrastructure where possible, especially components from the T5 implementation.
- `configs/`: YAML configuration files for various Transformer model sizes and training setups.
- `data_preparation/nlp/transformer/`: Scripts for preprocessing the WMT16 English–German dataset.

| Configuration | Description |
| --- | --- |
| `transformer_base.yaml` | Base Transformer model with `d_kv=64`, `num_heads=8`, and `encoder_num_hidden_layers=6`. |
| `transformer_large.yaml` | Large Transformer model with `d_kv=64`, `num_heads=16`, and `encoder_num_hidden_layers=6`. |
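For orientation, a minimal sketch of what the model section of `transformer_base.yaml` might look like is shown below. Only the three parameters from the table above come from this document; the surrounding structure and key names are assumptions, not verbatim contents of the repository's config files.

```yaml
# Hypothetical sketch of transformer_base.yaml.
# Only d_kv, num_heads, and encoder_num_hidden_layers are taken from
# the table above; the enclosing layout is an illustrative assumption.
model:
  d_kv: 64                      # Dimension of each attention head's key/value projections
  num_heads: 8                  # Attention heads per layer (16 in transformer_large.yaml)
  encoder_num_hidden_layers: 6  # Encoder depth, matching the original paper
```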