ESM-2
Protein language model trained on UniRef50, using a masked language modeling objective to learn evolutionary and structural properties of proteins.
Model Description
ESM-2 (Evolutionary Scale Modeling) is a family of transformer-based protein language models developed by Meta AI’s Fundamental AI Research (FAIR) Protein Team. Trained on large-scale protein sequence datasets such as UniRef50, ESM-2 learns representations that encode structural and functional information about proteins without requiring evolutionary alignments.
This implementation covers a range of ESM-2 model sizes and includes variable sequence length (VSL) support for improved efficiency on shorter protein sequences. Models are pretrained using a masked language modeling objective similar to BERT.
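To make the pretraining objective concrete, the sketch below shows BERT-style masked language modeling applied to a protein sequence, using the conventional 15% masking rate with an 80/10/10 mask/random/keep split. The alphabet, special tokens, and masking ratios are illustrative assumptions for demonstration only; they do not reproduce the exact ESM-2 tokenizer or data pipeline.

```python
# Illustrative sketch of BERT-style masked language modeling on a protein
# sequence. The alphabet, special tokens, and masking ratios are assumptions
# chosen for demonstration; they do not reproduce the ESM-2 data pipeline.
import torch

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
VOCAB = ["<mask>"] + AMINO_ACIDS
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
MASK_ID = TOKEN_TO_ID["<mask>"]

def mask_sequence(seq: str, mask_prob: float = 0.15):
    """Return (input_ids, labels); labels are -100 everywhere except masked positions."""
    ids = torch.tensor([TOKEN_TO_ID[aa] for aa in seq])
    labels = torch.full_like(ids, -100)  # -100 is ignored by PyTorch cross-entropy
    selected = torch.rand(len(ids)) < mask_prob  # positions the model must predict
    labels[selected] = ids[selected]

    # Of the selected positions: 80% become <mask>, 10% a random residue, 10% stay unchanged.
    rand = torch.rand(len(ids))
    ids[selected & (rand < 0.8)] = MASK_ID
    random_residues = torch.randint(1, len(VOCAB), (len(ids),))  # skip the <mask> token
    replace = selected & (rand >= 0.8) & (rand < 0.9)
    ids[replace] = random_residues[replace]
    return ids, labels

input_ids, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

The model is then trained to recover the original residues at the masked positions, which is how it learns the evolutionary and structural regularities described above.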
Code Structure
The code for this model is located in the esm2 directory within the ModelZoo. It reuses shared training infrastructure and custom data processors optimized for protein sequence modeling.
- configs/: YAML configuration files for training various ESM-2 model sizes (see the loading sketch below).
- model.py: Top-level wrapper for initializing ESM-2 model instances and integrating with training.
- esm2_pretrain_models.py: Core model architecture implementation.
- utils.py: Helper utilities for config parsing and data formatting.
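For orientation, the snippet below shows one way to load and inspect one of these YAML files before launching a run. It only assumes the configs are plain YAML readable with PyYAML; the file name is taken from the table in the next section, and no specific keys in the schema are assumed.

```python
# Minimal sketch: load one of the ESM-2 YAML configs and list its top-level
# sections. Only standard YAML structure is assumed here.
import yaml

with open("configs/params_esm2_t33_650M_UR50D.yaml") as f:
    params = yaml.safe_load(f)

# Print the top-level sections to see what the config controls
# before launching training.
for section, value in params.items():
    print(section, "->", type(value).__name__)
```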
Available Configurations
| Configuration | Description |
|---|---|
| params_esm2_t12_35M_UR50D.yaml | ESM-2 model with 12 layers and ~35M parameters. |
| params_esm2_t33_650M_UR50D.yaml | ESM-2 model with 33 layers and ~650M parameters. |
| params_esm2_t33_650M_UR50D_vsl.yaml | ESM-2 650M model with variable sequence length (VSL) enabled for efficient training. |
| params_esm2_t36_3B_UR50D.yaml | ESM-2 model with 36 layers and ~3B parameters. |
| params_esm2_t48_15B_UR50D.yaml | ESM-2 model with 48 layers and ~15B parameters. |
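As a sanity check on the parameter counts in the table, the short sketch below applies the standard ~12·d² per-layer transformer estimate. The hidden sizes used are the published ESM-2 embedding dimensions, which are an assumption here since they do not appear in the table itself, and the estimate ignores embeddings, biases, and layer norms.

```python
# Back-of-the-envelope parameter counts for the configurations above.
# Hidden sizes are the published ESM-2 embedding dimensions (an assumption,
# since the table does not list them); the 12*d^2 per-layer estimate counts
# only attention and feed-forward weights.
CONFIGS = {
    "t12_35M":  {"layers": 12, "hidden": 480},
    "t33_650M": {"layers": 33, "hidden": 1280},
    "t36_3B":   {"layers": 36, "hidden": 2560},
    "t48_15B":  {"layers": 48, "hidden": 5120},
}

for name, cfg in CONFIGS.items():
    d, n = cfg["hidden"], cfg["layers"]
    per_layer = 12 * d * d  # 4*d^2 for QKV + output projections, 8*d^2 for the 4x-wide FFN
    print(f"{name}: ~{n * per_layer / 1e6:.0f}M parameters")
# Prints roughly 33M, 649M, 2831M, and 15099M, in line with the table above.
```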
Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.
For a complete list of Cerebras ModelZoo CLI commands, see the command reference.