Learn how to configure automatic job restart in your Trainer config.
To enable automatic restarts, add the autorestart parameter to your Trainer config file.
Autorestart takes effect when max_num_restarts is specified in the autorestart config. Learn more about launching jobs here.
max_num_restarts: The maximum number of automatic restarts allowed without any progress (new checkpoints) before the run is considered failed. This parameter is required.
min_num_csx: The minimum number of CSX systems with which to perform a restart. Defaults to 1 if not specified. If a system in the cluster is faulty, autorestart removes it from the usable pool and, unless a replacement system is found, restarts with the remaining systems. This parameter sets a lower bound on the number of systems from the usable pool with which a restart is performed.
The run log is written to /model_dir/<timestamped_dir>_restartable/run.log.
For multiphase trainer configs, each phase will have a separate timestamped restartable directory.
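The parameter rules above can be sketched in a few lines of Python. This is an illustrative model, not the Trainer's implementation: the AutorestartConfig class, the parse_autorestart and run_outcome helpers, and the attempt labels are all hypothetical names, and the exact counting rule (the counter resets whenever a failed attempt saved a new checkpoint) is an assumption drawn from the description of max_num_restarts.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping


@dataclass(frozen=True)
class AutorestartConfig:
    """Hypothetical in-memory view of the autorestart settings."""
    max_num_restarts: int
    min_num_csx: int = 1  # documented default when not specified


def parse_autorestart(raw: Mapping[str, int]) -> AutorestartConfig:
    """Validate a raw mapping; max_num_restarts is required."""
    if "max_num_restarts" not in raw:
        raise ValueError("autorestart requires max_num_restarts")
    return AutorestartConfig(
        max_num_restarts=raw["max_num_restarts"],
        min_num_csx=raw.get("min_num_csx", 1),
    )


def run_outcome(cfg: AutorestartConfig, attempts: Iterable[str]) -> str:
    """Replay a sequence of attempt results.

    Each attempt is one of (hypothetical labels):
      "done"               - the run finished
      "failed_progress"    - failed after saving a new checkpoint
      "failed_no_progress" - failed without a new checkpoint
    """
    restarts_without_progress = 0
    for result in attempts:
        if result == "done":
            return "succeeded"
        if result == "failed_progress":
            # Assumption: a new checkpoint resets the restart budget.
            restarts_without_progress = 0
        if restarts_without_progress >= cfg.max_num_restarts:
            return "failed"
        restarts_without_progress += 1  # perform one automatic restart
    return "running"
```

For example, with max_num_restarts set to 2, three consecutive failures without a checkpoint end the run, while a failure that saved a checkpoint in between lets the run continue past two total restarts.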
Autorestart prefetches compiles for the num_csx - 1 and num_csx - 2 system configurations. As a result, you may see log messages like this:
[restartable_trainer.py:319] Prefetching compile for num_csx=1 completed successfully.
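To make the system-count arithmetic concrete, the sketch below shows how the min_num_csx lower bound and the num_csx - 1 and num_csx - 2 configurations might fit together. Both function names are hypothetical, and two behaviors are assumptions rather than documented facts: that no restart is performed when fewer than min_num_csx systems remain, and that configurations below one system are skipped.

```python
from typing import List, Optional


def restart_num_csx(
    num_csx: int,
    num_faulty: int,
    num_replacements: int = 0,
    min_num_csx: int = 1,
) -> Optional[int]:
    """Return the number of systems to restart with, or None.

    Faulty systems leave the usable pool; replacements, if found,
    fill the gap up to the originally requested num_csx.
    """
    healthy = num_csx - num_faulty
    available = min(num_csx, healthy + num_replacements)
    if available < min_num_csx:
        # Assumption: below the lower bound, no restart is attempted.
        return None
    return available


def prefetch_candidates(num_csx: int) -> List[int]:
    """System counts whose compiles would be prefetched.

    Covers num_csx - 1 and num_csx - 2; values below one system
    are dropped (assumption).
    """
    return [n for n in (num_csx - 1, num_csx - 2) if n >= 1]
```

Under these assumptions, a run launched on 2 systems would yield a single candidate, num_csx=1, which matches the shape of the log message shown above.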
For errors that a restart cannot resolve, for example a problem with vocab_size, the system will exhaust max_num_restarts before failing.
max_num_restarts you’ve specified: