Launch a Job

Prerequisites

Before launching a job:

Complete the instructions in our setup and installation guide (including activating your virtual environment).
Preprocess your data. For general guidance, see the Data Preprocessing documentation or follow our Data Preprocessing Quickstart guide.

CLI Arguments and Flags

Use the Model Zoo CLI to launch a training, validation, or upstream and downstream validation job:

cszoo fit params_model.yaml [additional-args]

Required Arguments

Flag	Description
`CSX`	Specifies that the target device for execution is a Cerebras Cluster.
`--params <path/to/params.yaml>`	Path to a YAML file containing model/run configuration options.

Optional Arguments

Flag	Description	Default
`--compile_only`	Compiles the model by matching to Cerebras kernels and mapping to hardware. No execution occurs. Compile artifacts are stored in the specified `--compile_dir`. For training with a pre-compiled model, use the same `--compile_dir`. Cannot be used with `--validate_only`.	`None`
`--validate_only`	Performs lightweight compilation to validate model compatibility with Cerebras kernels. Does not map to hardware or execute. Cannot be used with `--compile_only`.	`None`
`--model_dir <path/to/model>`	Directory for storing model checkpoints and TensorBoard events files.	`$CWD/model_dir`
`--compile_dir <path/to/dir>`	Directory for storing compile artifacts in the Cerebras cluster.	`None`
`--num_csx <1,2,4,8,16>`	Number of CS-X systems to use for training.	`1`
`--job_priority`	Sets the job priority. Valid inputs are p1, p2, and p3. Learn more about how jobs are prioritized here.	`p2`

--validate_only performs a lightweight compilation to check model compatibility with Cerebras kernels, while cszoo validate verifies that a model’s configuration meets expected requirements.

For a more comprehensive list, use the help command:

cszoo fit --help
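The optional-argument table states that --compile_only and --validate_only cannot be used together. A launcher script could reject that combination before anything is submitted. The following is a minimal sketch; check_mode_flags is a hypothetical helper written for illustration and is not part of the Model Zoo CLI.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of cszoo): returns non-zero when an
# argument list contains both --compile_only and --validate_only.
check_mode_flags() {
    local compile=0 validate=0 arg
    for arg in "$@"; do
        case "$arg" in
            --compile_only)  compile=1 ;;
            --validate_only) validate=1 ;;
        esac
    done
    if [ "$compile" -eq 1 ] && [ "$validate" -eq 1 ]; then
        echo "error: --compile_only and --validate_only cannot be combined" >&2
        return 1
    fi
    return 0
}
```

In a wrapper script, one possible use is `check_mode_flags "$@" && cszoo fit "$@"`, so the job is only launched when the flags are consistent.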

Validate the Job (optional)

To verify your model’s compatibility, use the --validate_only flag. This performs a quick compatibility check without executing a full run:

cszoo fit params_model.yaml \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --validate_only

Compile the Model

Generate executable files for your model using the --compile_only flag. This step typically takes 15-60 minutes:

cszoo fit params_model.yaml \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --model_dir <model_dir> \
      --compile_only

Speed up subsequent runs by reusing compiled artifacts: pass the same --compile_dir path for both compilation and execution.
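One way to keep the compile and execution steps pointed at the same artifacts is to build both commands from a shared argument string. The sketch below is a dry run: it only prints the two commands rather than launching them. The directory names my_compile_dir and model_dir are placeholders chosen for illustration, and the command shape follows the `cszoo fit params_model.yaml [additional-args]` form shown earlier.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the compile and run commands instead of executing them.
# COMPILE_DIR and MODEL_DIR are placeholder values, not defaults of the CLI.
COMPILE_DIR="my_compile_dir"
MODEL_DIR="model_dir"

# Arguments shared by both steps, so --compile_dir cannot drift between them.
COMMON_ARGS="params_model.yaml --num_csx=1 --model_dir $MODEL_DIR --compile_dir $COMPILE_DIR"

run_cmd="cszoo fit $COMMON_ARGS"
compile_cmd="$run_cmd --compile_only"

echo "$compile_cmd"   # step 1: compile only
echo "$run_cmd"       # step 2: execute, reusing the same compile directory
```

Once the printed commands look right, they can be run as-is or the echo lines can be replaced with the commands themselves.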

Execute the Job

To run the job:

cszoo fit params_model.yaml \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --model_dir <model_dir>

Here is an example of a typical output log for a training job:

Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
INFO:   Finished sending initial weights
INFO:   | Train Device=CSX, Step=50, Loss=8.31250, Rate=69.37 samples/sec, GlobalRate=69.37 samples/sec, LoopTimeRemaining=0:13:21, TimeRemaining=0:13:21
INFO:   | Train Device=CSX, Step=100, Loss=7.25000, Rate=68.41 samples/sec, GlobalRate=68.56 samples/sec, LoopTimeRemaining=0:11:53, TimeRemaining=0:11:53
INFO:   | Train Device=CSX, Step=150, Loss=6.53125, Rate=68.31 samples/sec, GlobalRate=68.46 samples/sec, LoopTimeRemaining=0:10:24, TimeRemaining=0:10:24
INFO:   | Train Device=CSX, Step=200, Loss=6.53125, Rate=68.54 samples/sec, GlobalRate=68.51 samples/sec, LoopTimeRemaining=0:08:55, TimeRemaining=0:08:55
INFO:   | Train Device=CSX, Step=250, Loss=6.12500, Rate=68.84 samples/sec, GlobalRate=68.62 samples/sec, LoopTimeRemaining=0:07:24, TimeRemaining=0:07:24
INFO:   | Train Device=CSX, Step=300, Loss=5.53125, Rate=68.74 samples/sec, GlobalRate=68.63 samples/sec, LoopTimeRemaining=0:06:00, TimeRemaining=0:06:00
INFO:   | Train Device=CSX, Step=350, Loss=4.81250, Rate=68.01 samples/sec, GlobalRate=68.47 samples/sec, LoopTimeRemaining=0:04:29, TimeRemaining=0:04:29
INFO:   | Train Device=CSX, Step=400, Loss=5.37500, Rate=68.44 samples/sec, GlobalRate=68.50 samples/sec, LoopTimeRemaining=0:02:59, TimeRemaining=0:02:59
INFO:   | Train Device=CSX, Step=450, Loss=6.43750, Rate=68.43 samples/sec, GlobalRate=68.49 samples/sec, LoopTimeRemaining=0:01:28, TimeRemaining=0:01:28
INFO:   | Train Device=CSX, Step=500, Loss=5.09375, Rate=66.71 samples/sec, GlobalRate=68.19 samples/sec, LoopTimeRemaining=0:00:00, TimeRemaining=0:00:00
INFO:   Training completed successfully!
INFO:   Processed 60500 sample(s) in 887.2672743797302 seconds.
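The log above is regular enough to post-process with standard tools. The sketch below assumes the exact line format shown in this sample output and defines two hypothetical helpers: extract_loss prints "step loss" pairs, and summary_rate recomputes the average throughput from the final "Processed" line (samples divided by seconds). For the sample log, 60500 / 887.267 gives about 68.19 samples/sec, which agrees with the last reported GlobalRate.

```shell
#!/usr/bin/env bash
# Both helpers read a run log on stdin. They assume the line format shown in
# the sample output above; other log formats would need different patterns.

# Print "step loss" for every Train line.
extract_loss() {
    sed -n 's/.*Train Device=[A-Za-z]*, Step=\([0-9]*\), Loss=\([0-9.]*\),.*/\1 \2/p'
}

# Print average samples/sec computed from the "Processed N sample(s) in S seconds." line.
summary_rate() {
    awk '/Processed [0-9]+ sample/ {
        for (i = 1; i <= NF; i++)
            if ($i == "Processed") printf "%.2f\n", $(i + 1) / $(i + 4)
    }'
}

# Demo on two lines copied from the sample log.
printf '%s\n' \
    'INFO:   | Train Device=CSX, Step=500, Loss=5.09375, Rate=66.71 samples/sec, GlobalRate=68.19 samples/sec, LoopTimeRemaining=0:00:00, TimeRemaining=0:00:00' \
    | extract_loss      # prints: 500 5.09375
printf '%s\n' 'INFO:   Processed 60500 sample(s) in 887.2672743797302 seconds.' \
    | summary_rate      # prints: 68.19
```

Against a real run, the same helpers could be fed a log file, for example `extract_loss < <model_dir>/cerebras_logs/latest/run.log`.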

Validation jobs are run on a single CS-X system. For multi-CS-X training, use the --num_csx flag.
Monitor jobs using the csctl tool or the Grafana dashboard.

Time Estimation Metrics

There are two time estimation metrics to help you track training and eval progress:

LoopTimeRemaining displays the estimated time remaining in your current operation loop, where a loop is a single training iteration, a single validation dataloader execution, or an eval harness run.
TimeRemaining shows the estimated total time remaining for your entire run, whether it’s a complete training session (fit) or a validation run (validate or validate_all).

Understanding Time Estimates

When your run begins, the system needs to observe all different stages (training, checkpointing, validation, etc.) before it can provide a complete time estimate. During this initial period, you’ll see + ?? appended to the TimeRemaining metric. The initial estimate is optimistic since it doesn’t account for stages that haven’t been measured yet. Once all stages have been observed at least once, the + ?? indicator will disappear, and you’ll receive more accurate time estimates. These metrics are displayed consistently across CSX, CPU, and GPU hardware.
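For scripts that watch a run, it can help to turn TimeRemaining into seconds and to tell partial estimates from complete ones. The sketch below makes two assumptions: that the value uses the H:MM:SS form from the sample log, and that the "+ ??" marker appears on the same line after TimeRemaining. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the TimeRemaining value of one log line as whole seconds.
# Assumes H:MM:SS; the "[ ,]" prefix avoids matching LoopTimeRemaining.
time_remaining_seconds() {
    printf '%s\n' "$1" |
        sed -n 's/.*[ ,]TimeRemaining=\([0-9]*\):\([0-9]*\):\([0-9]*\).*/\1 \2 \3/p' |
        awk '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Succeed when the line still carries the "+ ??" marker, i.e. not every stage
# has been observed yet. The marker position is an assumption.
estimate_is_partial() {
    case "$1" in
        *"TimeRemaining="*"+ ??"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A monitoring script might skip alerting on the estimate while estimate_is_partial succeeds, since the text above notes that early estimates are optimistic.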

Output Files and Artifacts

The <model_dir> directory contains all run results and artifacts, including:

Checkpoints for model training progress.
TensorBoard events, which can be viewed using:

tensorboard --logdir <model_dir> --bind_all

Configuration files in <model_dir>/train or <model_dir>/eval.
Run logs in <model_dir>/cerebras_logs/latest/run.log or <model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log.
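To inspect the most recent run quickly, the latest log path listed above can be wrapped in a small helper. This is a sketch only: show_latest_log is a hypothetical function, and it relies on the <model_dir>/cerebras_logs/latest/run.log location described in this section.

```shell
#!/usr/bin/env bash
# Print the last N lines (default 20) of the latest run log under a model
# directory. Hypothetical helper; the path layout is taken from this section.
show_latest_log() {
    local log="${1%/}/cerebras_logs/latest/run.log"
    if [ ! -f "$log" ]; then
        echo "no run log found at $log" >&2
        return 1
    fi
    tail -n "${2:-20}" "$log"
}
```

For example, `show_latest_log model_dir 50` would print the last 50 lines of the latest run log in model_dir.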

Cancel a Job

To cancel a job:

csctl cancel job <jobid>
