Failing To Automatically Load Checkpoints

Explanation

By default, Cerebras ModelZoo PyTorch runs auto-load the latest available checkpoint in model_dir when --checkpoint_path is not explicitly provided. Only one checkpoint naming scheme is recognized when searching for the latest checkpoint: files in model_dir named checkpoint_<step>.mdl. If one or more such files are found, the file with the highest value of <step> is chosen and the model weights are initialized from that checkpoint. A checkpoint saved under any other name is not picked up by this search. To turn the feature off, set runconfig.autoload_last_checkpoint to False in the params YAML file.
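The selection rule described above can be illustrated with a short Python sketch. This is not the ModelZoo implementation; the function name find_latest_checkpoint is hypothetical, and the sketch assumes that <step> is a non-negative integer and that steps are compared numerically rather than as strings. It can be useful for checking ahead of a run which file, if any, would match the naming scheme.

```python
import re
from pathlib import Path
from typing import Optional

# Naming scheme from the text above: checkpoint_<step>.mdl
# Assumption: <step> is a non-negative integer.
_CKPT_PATTERN = re.compile(r"^checkpoint_(\d+)\.mdl$")


def find_latest_checkpoint(model_dir: str) -> Optional[Path]:
    """Return the checkpoint_<step>.mdl file with the highest step, or None.

    Hypothetical helper for illustration only; it mirrors the rule
    described in the text, not the actual ModelZoo code.
    """
    directory = Path(model_dir)
    if not directory.is_dir():
        return None

    best_step = -1
    best_path: Optional[Path] = None
    for entry in directory.iterdir():
        match = _CKPT_PATTERN.match(entry.name)
        if match is None or not entry.is_file():
            # Files with any other name are ignored by the search.
            continue
        step = int(match.group(1))  # numeric comparison, so 100 > 20
        if step > best_step:
            best_step, best_path = step, entry
    return best_path
```

If this returns None for your model_dir, no file there matches the naming scheme, which is consistent with the auto-load finding nothing.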

Workaround

You can either:

Provide a checkpoint inside model_dir with the naming format checkpoint_<step>.mdl, or
Specify the checkpoint path explicitly with the --checkpoint_path flag.
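As a minimal sketch of the first option, the Python helper below copies an existing checkpoint file into model_dir under the expected name. The function name stage_checkpoint and its parameters are hypothetical, and the sketch assumes the source file is already a checkpoint that the run can load; it only changes the file's name and location, not its contents.

```python
import shutil
from pathlib import Path


def stage_checkpoint(src: str, model_dir: str, step: int) -> Path:
    """Copy src into model_dir as checkpoint_<step>.mdl and return the new path.

    Hypothetical helper for illustration; assumes src is already a
    loadable checkpoint file.
    """
    if step < 0:
        raise ValueError("step must be a non-negative integer")

    target_dir = Path(model_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / f"checkpoint_{step}.mdl"
    if target.exists():
        # Refuse to overwrite an existing checkpoint with the same step.
        raise FileExistsError(f"{target} already exists")

    shutil.copy2(src, target)  # copy contents and file metadata
    return target
```

Because the file with the highest <step> is the one chosen, pick a step value that is larger than that of any other checkpoint_<step>.mdl file already in model_dir if you want the staged file to be selected.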
