Restart Behavior
When Automatic Job Restart is enabled:
- The system resets the failure count when a run makes progress (i.e., captures a new checkpoint).
- For multiphase trainer configurations, each trainer config runs sequentially with autorestart. Subsequent configs only run if the previous one succeeds.
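The counting rule above can be illustrated with a minimal simulation. This is a sketch, not the actual implementation: the function names, the outcome tuples, and the exact point at which the budget is considered exhausted are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def simulate_restarts(outcomes: Iterable[Tuple[bool, bool]], max_num_restarts: int) -> str:
    """Simulate one trainer config under autorestart.

    Each outcome is a hypothetical (succeeded, made_progress) pair for one
    launch attempt. made_progress means a new checkpoint was captured.
    """
    failures_without_progress = 0
    for succeeded, made_progress in outcomes:
        if succeeded:
            return "succeeded"
        if made_progress:
            # A new checkpoint was captured before the failure: reset the count.
            failures_without_progress = 0
        else:
            failures_without_progress += 1
        # Assumption: the run fails once more than max_num_restarts
        # consecutive attempts have failed without progress.
        if failures_without_progress > max_num_restarts:
            return "failed"
    return "incomplete"


def run_phases(phases: List[List[Tuple[bool, bool]]], max_num_restarts: int) -> List[str]:
    """Run multiphase configs sequentially; stop at the first phase that does not succeed."""
    results = []
    for outcomes in phases:
        result = simulate_restarts(outcomes, max_num_restarts)
        results.append(result)
        if result != "succeeded":
            break
    return results
```

In this sketch, a failure that follows a new checkpoint does not count against the budget, so a long run that keeps advancing can be restarted more than max_num_restarts times in total.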
Prerequisites
Before using Automatic Job Restart, ensure that Checkpointing is configured. Restarts will commence from the last saved checkpoint.
Configuration
Add the autorestart parameter to your Trainer config file. The max_num_restarts value is specified in the autorestart config. For details on launching jobs, see the job launch documentation.
Parameters
- max_num_restarts: The maximum number of automatic restarts allowed without any progress (new checkpoints) before the run is considered failed. This parameter is required.
- (Optional) min_num_csx: The minimum number of CSX systems with which to perform a restart. Defaults to 1 if not specified. In the event of a faulty system in the cluster, this feature will automatically remove that system from the usable pool and, unless a replacement system is found, restart with the remaining number of systems. This config establishes a lower bound on the number of systems from the usable pool with which to perform a restart.
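The two parameters can be modeled as follows. This is a hedged sketch: the AutorestartConfig class and the systems_for_restart helper are hypothetical names, not part of any real API, and the behavior when fewer than min_num_csx systems remain (no restart is attempted) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AutorestartConfig:
    """Hypothetical model of the autorestart parameters described above."""
    max_num_restarts: int  # required
    min_num_csx: int = 1   # optional, defaults to 1


def systems_for_restart(
    config: AutorestartConfig, healthy_systems: int, requested_num_csx: int
) -> Optional[int]:
    """Return how many systems a restart would use, or None if below the lower bound.

    healthy_systems is the size of the usable pool after faulty systems are
    removed (including any replacement that was found).
    """
    usable = min(healthy_systems, requested_num_csx)
    if usable < config.min_num_csx:
        # Assumption: with fewer systems than min_num_csx, no restart is attempted.
        return None
    return usable
```

For example, a run requested on 4 systems that loses one without a replacement would restart on 3 systems under the default lower bound, but would not restart if min_num_csx were set to 4.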
Log Files and Monitoring
Restart logs are stored in /model_dir/<timestamped_dir>_restartable/run.log.
For multiphase trainer configs, each phase will have a separate timestamped restartable directory.
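Because each phase writes to its own restartable directory, it can be convenient to collect all restart logs at once. A minimal sketch, assuming only the directory layout described above; find_restart_logs is a hypothetical helper name, and the sort order is by directory name rather than by any parsed timestamp.

```python
from pathlib import Path
from typing import List


def find_restart_logs(model_dir: str) -> List[Path]:
    """Return run.log from every *_restartable directory directly under model_dir."""
    # Sorting by path groups the logs by directory name; whether that matches
    # chronological order depends on the timestamp format, which is not assumed here.
    return sorted(Path(model_dir).glob("*_restartable/run.log"))
```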
When automatic restart is enabled, the system will prefetch extra compile jobs in the background for num_csx - 1 and num_csx - 2 system configurations. As a result, you may see log messages like this:
[restartable_trainer.py:319] Prefetching compile for num_csx=1 completed successfully.
This helps speed up restarts after system failures, and no action is required.
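To check which system counts already have a prefetched compile, the log can be scanned for messages of this shape. This is a sketch that assumes the message wording matches the sample line above exactly; the pattern and function name are illustrative, and the real log format may vary between releases.

```python
import re
from typing import List

# Pattern derived from the sample log line; treat it as an assumption.
PREFETCH_PATTERN = re.compile(
    r"Prefetching compile for num_csx=(\d+) completed successfully"
)


def prefetched_num_csx(log_text: str) -> List[int]:
    """Return the num_csx values whose prefetch compile is reported as completed."""
    return [int(match.group(1)) for match in PREFETCH_PATTERN.finditer(log_text)]
```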
Limitations
- Validation runs will only restart from scratch in the event of failures.
- During training and eval runs, if a failure occurs during validation, restart will resume from the next training loop.
- If a dataloader loads tokens out of bounds of the model’s vocab_size, the system will exhaust max_num_restarts before failing.
- The system cannot automatically restart runs that are deadlocked at the system level.
Non-Recoverable Failures
The following errors will not trigger automatic restarts, regardless of the max_num_restarts value you’ve specified:
- Compilation or lowering failures
- Invalid user configurations
- Failed assertions in Model Zoo