Model Description

ESM-2 (Evolutionary Scale Modeling) is a family of transformer-based protein language models developed by Meta AI’s Fundamental AI Research (FAIR) Protein Team. Trained on large-scale protein sequence datasets such as UniRef50, ESM-2 learns representations that encode structural and functional information about proteins without requiring multiple sequence alignments.

This implementation supports a range of ESM-2 model sizes and includes variable sequence length (VSL) support for improved training efficiency on shorter protein sequences. Models are pretrained with a masked language modeling (MLM) objective similar to BERT’s.
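
As a rough sketch of that objective (not the ModelZoo's actual data pipeline), the PyTorch snippet below masks a fraction of amino-acid token IDs and computes the cross-entropy loss over the masked positions. The token IDs, vocabulary size, and mask probability here are illustrative assumptions; in practice these settings typically come from the YAML configuration files described below.

```python
import torch

# Toy constants; the real ESM-2 vocabulary, mask token, and padding ID differ.
PAD_ID, MASK_ID, VOCAB_SIZE = 0, 32, 33
MASK_PROB = 0.15  # BERT-style masking rate

def mask_tokens(input_ids: torch.Tensor):
    """Mask ~15% of non-padding tokens (always with <mask>, omitting BERT's 80/10/10 split)."""
    labels = input_ids.clone()
    candidates = (input_ids != PAD_ID) & (torch.rand_like(input_ids, dtype=torch.float) < MASK_PROB)
    labels[~candidates] = -100          # unmasked positions are ignored by the loss
    masked = input_ids.clone()
    masked[candidates] = MASK_ID
    return masked, labels

# Batch of already-tokenized protein sequences (batch=4, seq_len=64), IDs chosen arbitrarily.
batch = torch.randint(4, 24, (4, 64))
masked_ids, labels = mask_tokens(batch)

# With model logits of shape (batch, seq_len, vocab), the pretraining loss is
# cross-entropy over the masked positions only.
logits = torch.randn(4, 64, VOCAB_SIZE)
loss = torch.nn.functional.cross_entropy(
    logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100
)
print(loss.item())
```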

Code Structure

The code for this model is located in the esm2 directory within the ModelZoo. It reuses the ModelZoo's shared training infrastructure and adds data processors tailored to protein sequence modeling; a simplified sketch of such a processor follows the file list below.

  • configs/: YAML configuration files for training various ESM-2 model sizes.
  • model.py: Top-level wrapper for initializing ESM-2 model instances and integrating with training.
  • esm2_pretrain_models.py: Core model architecture implementation.
  • utils.py: Helper utilities for config parsing and data formatting.
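
As a rough illustration of what a protein-sequence data processor does, the sketch below maps raw amino-acid strings to token IDs and pads them to a fixed length. The vocabulary, special tokens, and maximum length shown here are assumptions for illustration, not the ModelZoo's actual implementation.

```python
# Illustrative amino-acid tokenizer; the real ESM-2 vocabulary and special tokens differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
VOCAB = {"<pad>": 0, "<cls>": 1, "<eos>": 2, "<unk>": 3}
VOCAB.update({aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)})

def encode(sequence: str, max_len: int = 128) -> list[int]:
    """Map a protein sequence to token IDs: <cls> SEQ <eos>, padded to max_len."""
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB.get(aa, VOCAB["<unk>"]) for aa in sequence.upper()[: max_len - 2]]
    ids.append(VOCAB["<eos>"])
    ids += [VOCAB["<pad>"]] * (max_len - len(ids))
    return ids

print(encode("MKTAYIAKQR", max_len=16))
```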

Available Configurations

  • params_esm2_t12_35M_UR50D.yaml: ESM-2 model with 12 layers and ~35M parameters.
  • params_esm2_t33_650M_UR50D.yaml: ESM-2 model with 33 layers and ~650M parameters.
  • params_esm2_t33_650M_UR50D_vsl.yaml: ESM-2 650M model with Variable Sequence Length (VSL) enabled for efficient training.
  • params_esm2_t36_3B_UR50D.yaml: ESM-2 model with 36 layers and ~3B parameters.
  • params_esm2_t48_15B_UR50D.yaml: ESM-2 model with 48 layers and ~15B parameters.
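
A run is typically launched by pointing the training entry point at one of these YAML files. The snippet below is a minimal sketch of inspecting such a file with PyYAML before launching; the top-level section names mentioned in the comment are assumptions about the schema, not guaranteed field names.

```python
import yaml  # PyYAML

# Path assumes you are inside the esm2 model directory of a ModelZoo checkout.
config_path = "configs/params_esm2_t33_650M_UR50D.yaml"

with open(config_path) as f:
    params = yaml.safe_load(f)

# Print the top-level sections; the exact schema (e.g. trainer/model/data blocks)
# varies between ModelZoo releases, so this only inspects whatever is present.
for section, value in params.items():
    print(f"{section}: {type(value).__name__}")
```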

Workflow

For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning.

For a complete list of Cerebras ModelZoo CLI commands, see the command reference.
