Launch a Job
Learn how to launch a job on a Cerebras cluster.
Prerequisites
Before launching a job:
- Complete the instructions in our setup and installation guide (including activating your virtual environment).
- Preprocess your data. For general guidance, see the Data Preprocessing documentation or follow our Data Preprocessing Quickstart guide.
CLI Arguments and Flags
Use the Model Zoo CLI to launch a training job, a validation job, or a combined upstream and downstream validation job:
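A minimal sketch of an invocation, assuming the `cszoo` entry point referenced later in this guide together with the arguments in the tables below; the exact command form may vary by release, so verify with `cszoo --help`:

```shell
# Sketch: launch training on a Cerebras cluster (paths are illustrative)
cszoo fit CSX --params path/to/params.yaml --model_dir ./model_dir
```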
Required Arguments
| Flag | Description |
| --- | --- |
| `CSX` | Specifies that the target device for execution is a Cerebras cluster. |
| `--params <path/to/params.yaml>` | Path to a YAML file containing model/run configuration options. |
Optional Arguments
| Flag | Description | Default |
| --- | --- | --- |
| `--compile_only` | Compiles the model by matching it to Cerebras kernels and mapping it to hardware. No execution occurs. Compile artifacts are stored in the specified `--compile_dir`. To train with a pre-compiled model, use the same `--compile_dir`. Cannot be used with `--validate_only`. | None |
| `--validate_only` | Performs a lightweight compilation to validate model compatibility with Cerebras kernels. Does not map to hardware or execute. Cannot be used with `--compile_only`. | None |
| `--model_dir <path/to/model>` | Directory for storing model checkpoints and TensorBoard event files. | `$CWD/model_dir` |
| `--compile_dir <path/to/dir>` | Directory for storing compile artifacts on the Cerebras cluster. | None |
| `--num_csx <1,2,4,8,16>` | Number of CS-X systems to use for training. | 1 |
| `--job_priority <p1\|p2\|p3>` | Sets the job priority. Valid values are `p1`, `p2`, and `p3`; see the job prioritization documentation for details. | `p2` |
Note: `--validate_only` performs a lightweight compilation to check model compatibility with Cerebras kernels, while `cszoo validate` verifies that a model's configuration meets expected requirements.
For a more comprehensive list, use the help command:
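For example (assuming the `cszoo` entry point; the exact help subcommand may differ by release):

```shell
# Show all available flags and subcommands
cszoo --help
```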
Validate the Job (optional)
To verify your model’s compatibility, use the --validate_only
flag. This performs a quick compatibility check without executing a full run:
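A sketch of the validation step, assuming the same `cszoo` invocation form used for training; verify the exact syntax with `cszoo --help`:

```shell
# Quick kernel-compatibility check; no hardware mapping or execution
cszoo fit CSX --params path/to/params.yaml --validate_only
```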
Compile the Model
Generate executable files for your model using the --compile_only
flag. This step typically takes 15-60 minutes:
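A sketch of the compile step under the same assumed `cszoo` invocation form; paths are illustrative:

```shell
# Full compile without execution; artifacts are cached in --compile_dir
cszoo fit CSX --params path/to/params.yaml --compile_only --compile_dir ./compile_dir
```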
Speed up subsequent runs by reusing compiled artifacts: use the same `--compile_dir` path for both compilation and execution.
Execute the Job
To run the job:
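A sketch of a full training launch, assuming the `cszoo` invocation form used above; the compile-cache path is only meaningful if you compiled to it earlier, and all paths are illustrative:

```shell
# Launch training on 4 CS-X systems, reusing a previously populated compile cache
cszoo fit CSX --params path/to/params.yaml \
    --num_csx 4 \
    --compile_dir ./compile_dir \
    --model_dir ./model_dir
```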
Here is an example of a typical output log for a training job:
- Validation jobs are run on a single CS-X system. For multi-CS-X training, use the `--num_csx` flag.
- Monitor jobs using the `csctl` tool or the Grafana dashboard.
Time Estimation Metrics
There are two time estimation metrics to help you track training and eval progress:
- `LoopTimeRemaining` displays the estimated time remaining in your current operation loop, where a loop is a single training iteration, a single validation dataloader execution, or an eval harness run.
- `TimeRemaining` shows the estimated total time remaining for your entire run, whether it's a complete training session (`fit`) or a validation run (`validate` or `validate_all`).
Understanding Time Estimates
When your run begins, the system needs to observe all different stages (training, checkpointing, validation, etc.) before it can provide a complete time estimate.
During this initial period, you’ll see + ??
appended to the TimeRemaining
metric. The initial estimate is optimistic since it doesn’t account for stages that haven’t been measured yet.
Once all stages have been observed at least once, the + ??
indicator will disappear, and you’ll receive more accurate time estimates.
These metrics are displayed consistently across CSX, CPU, and GPU hardware.
Output Files and Artifacts
The `<model_dir>` directory contains all run results and artifacts, including:
- Checkpoints for model training progress.
- TensorBoard event files, which can be viewed with TensorBoard.
- Configuration files in `<model_dir>/train` or `<model_dir>/eval`.
- Run logs in `<model_dir>/cerebras_logs/latest/run.log` or `<model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log`.
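The TensorBoard event files can be viewed by pointing TensorBoard at the model directory; the path below is illustrative, so substitute your own `--model_dir`:

```shell
# Serve training curves for a run whose artifacts live in ./model_dir
tensorboard --logdir ./model_dir
```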
Cancel a Job
To cancel a job:
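A sketch of cancelling a job with `csctl`, the cluster management tool mentioned above; the job ID is illustrative and the exact subcommands may differ on your release, so check `csctl --help`:

```shell
# List running jobs, then cancel one by its ID (ID shown is a placeholder)
csctl get jobs
csctl cancel job wsjob-000000000000
```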