Prerequisites

Before launching a job:

CLI Arguments and Flags

Use the Model Zoo CLI to launch a training, validation, or upstream and downstream validation job:

cszoo fit params_model.yaml [additional-args]

Required Arguments

Flag | Description
--- | ---
CSX | Specifies that the target device for execution is a Cerebras Cluster.
--params <path/to/params.yaml> | Path to a YAML file containing model/run configuration options.

Optional Arguments

Flag | Description | Default
--- | --- | ---
--compile_only | Compiles the model by matching to Cerebras kernels and mapping to hardware. No execution occurs. Compile artifacts are stored in the specified --compile_dir. For training with a pre-compiled model, use the same --compile_dir. Cannot be used with --validate_only. | None
--validate_only | Performs lightweight compilation to validate model compatibility with Cerebras kernels. Does not map to hardware or execute. Cannot be used with --compile_only. | None
--model_dir <path/to/model> | Directory for storing model checkpoints and TensorBoard events files. | $CWD/model_dir
--compile_dir <path/to/dir> | Directory for storing compile artifacts in the Cerebras cluster. | None
--num_csx <1,2,4,8,16> | Number of CS-X systems to use for training. | 1
--job_priority | Sets the job priority. Valid inputs are p1, p2, and p3. Learn more about how jobs are prioritized here. | p2

--validate_only performs a lightweight compilation to check model compatibility with Cerebras kernels, while cszoo validate verifies that a model’s configuration meets expected requirements.

For a more comprehensive list, use the help command:

cszoo fit --help
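The constraints in the table above can be checked before a job is submitted. The following is a minimal sketch, not part of the Model Zoo: the helper name `build_optional_args` is hypothetical, and passing the `--job_priority` and directory values as separate tokens is an assumption about how the CLI parses them. It only assembles the optional arguments documented above and rejects combinations the table rules out.

```python
import shlex

VALID_NUM_CSX = {1, 2, 4, 8, 16}        # values listed for --num_csx
VALID_PRIORITIES = {"p1", "p2", "p3"}   # values listed for --job_priority


def build_optional_args(num_csx=1, model_dir=None, compile_dir=None,
                        job_priority=None, compile_only=False,
                        validate_only=False):
    """Return the optional `cszoo fit` arguments as a list of tokens.

    Hypothetical helper: it only mirrors the flag table and does not
    call the CLI.
    """
    if compile_only and validate_only:
        raise ValueError("--compile_only cannot be used with --validate_only")
    if num_csx not in VALID_NUM_CSX:
        raise ValueError(f"--num_csx must be one of {sorted(VALID_NUM_CSX)}")
    if job_priority is not None and job_priority not in VALID_PRIORITIES:
        raise ValueError(f"--job_priority must be one of {sorted(VALID_PRIORITIES)}")

    args = [f"--num_csx={num_csx}"]
    if model_dir is not None:
        args += ["--model_dir", model_dir]
    if compile_dir is not None:
        args += ["--compile_dir", compile_dir]
    if job_priority is not None:
        # Assumption: the value is passed as a separate token.
        args += ["--job_priority", job_priority]
    if compile_only:
        args.append("--compile_only")
    if validate_only:
        args.append("--validate_only")
    return args


if __name__ == "__main__":
    extra = build_optional_args(compile_dir="compile_cache", compile_only=True)
    # Print the command for review rather than executing it.
    print(shlex.join(["cszoo", "fit", "params_model.yaml", *extra]))
```

Printing the command instead of running it keeps the sketch safe to try on a machine without cluster access.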

Launch a Job

1. Validate the Job (optional)

To verify your model’s compatibility, use the --validate_only flag. This performs a quick compatibility check without executing a full run:

cszoo fit params_model.yaml \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --validate_only

2. Compile the Model

Generate executable files for your model using the --compile_only flag. This step typically takes 15-60 minutes:

cszoo fit params_model.yaml \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --model_dir <model_dir> \
      --compile_only

Speed up subsequent runs by reusing compiled artifacts: pass the same --compile_dir path for both compilation and execution.

3. Execute the Job

To run the job:

cszoo fit params_model.yaml \
      CSX \
      --params params.yaml \
      --num_csx=1 \
      --model_dir <model_dir>

Here is an example of a typical output log for a training job:

Transferring weights to server: 100%|██| 1165/1165 [01:00<00:00, 19.33tensors/s]
INFO:   Finished sending initial weights
INFO:   | Train Device=CSX, Step=50, Loss=8.31250, Rate=69.37 samples/sec, GlobalRate=69.37 samples/sec, LoopTimeRemaining=0:13:21, TimeRemaining=0:13:21
INFO:   | Train Device=CSX, Step=100, Loss=7.25000, Rate=68.41 samples/sec, GlobalRate=68.56 samples/sec, LoopTimeRemaining=0:11:53, TimeRemaining=0:11:53
INFO:   | Train Device=CSX, Step=150, Loss=6.53125, Rate=68.31 samples/sec, GlobalRate=68.46 samples/sec, LoopTimeRemaining=0:10:24, TimeRemaining=0:10:24
INFO:   | Train Device=CSX, Step=200, Loss=6.53125, Rate=68.54 samples/sec, GlobalRate=68.51 samples/sec, LoopTimeRemaining=0:08:55, TimeRemaining=0:08:55
INFO:   | Train Device=CSX, Step=250, Loss=6.12500, Rate=68.84 samples/sec, GlobalRate=68.62 samples/sec, LoopTimeRemaining=0:07:24, TimeRemaining=0:07:24
INFO:   | Train Device=CSX, Step=300, Loss=5.53125, Rate=68.74 samples/sec, GlobalRate=68.63 samples/sec, LoopTimeRemaining=0:06:00, TimeRemaining=0:06:00
INFO:   | Train Device=CSX, Step=350, Loss=4.81250, Rate=68.01 samples/sec, GlobalRate=68.47 samples/sec, LoopTimeRemaining=0:04:29, TimeRemaining=0:04:29
INFO:   | Train Device=CSX, Step=400, Loss=5.37500, Rate=68.44 samples/sec, GlobalRate=68.50 samples/sec, LoopTimeRemaining=0:02:59, TimeRemaining=0:02:59
INFO:   | Train Device=CSX, Step=450, Loss=6.43750, Rate=68.43 samples/sec, GlobalRate=68.49 samples/sec, LoopTimeRemaining=0:01:28, TimeRemaining=0:01:28
INFO:   | Train Device=CSX, Step=500, Loss=5.09375, Rate=66.71 samples/sec, GlobalRate=68.19 samples/sec, LoopTimeRemaining=0:00:00, TimeRemaining=0:00:00
INFO:   Training completed successfully!
INFO:   Processed 60500 sample(s) in 887.2672743797302 seconds.
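Progress lines like the ones above are regular enough to parse for plotting or alerting. The sketch below is illustrative only: the regular expression is inferred from this sample output, not from a documented log format, and `parse_progress` is a hypothetical helper name. It also recomputes the average throughput from the final summary line as a consistency check against the last reported GlobalRate.

```python
import re

# Pattern inferred from the sample log above (an assumption, not a spec).
PROGRESS_RE = re.compile(
    r"Step=(?P<step>\d+), Loss=(?P<loss>[\d.]+), "
    r"Rate=(?P<rate>[\d.]+) samples/sec, "
    r"GlobalRate=(?P<global_rate>[\d.]+) samples/sec"
)


def parse_progress(line):
    """Return step, loss, and rates from one progress line, or None."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "loss": float(match["loss"]),
        "rate": float(match["rate"]),
        "global_rate": float(match["global_rate"]),
    }


if __name__ == "__main__":
    last_line = (
        "INFO:   | Train Device=CSX, Step=500, Loss=5.09375, "
        "Rate=66.71 samples/sec, GlobalRate=68.19 samples/sec, "
        "LoopTimeRemaining=0:00:00, TimeRemaining=0:00:00"
    )
    print(parse_progress(last_line))

    # Samples and seconds taken from the "Processed ..." summary line.
    average_rate = 60500 / 887.2672743797302
    print(f"average throughput: {average_rate:.2f} samples/sec")
```

Dividing the 60500 processed samples by the reported wall time gives roughly 68.19 samples/sec, which agrees with the GlobalRate on the final step.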

  • Validation jobs are run on a single CS-X system. For multi-CS-X training, use the --num_csx flag.

  • Monitor jobs using the csctl tool or the Grafana dashboard.

Time Estimation Metrics

There are two time estimation metrics to help you track training and eval progress:

  1. LoopTimeRemaining displays the estimated time remaining in your current operation loop, where a loop is a single training iteration, a single validation dataloader execution, or an eval harness run.

  2. TimeRemaining shows the estimated total time remaining for your entire run, whether it’s a complete training session (fit) or a validation run (validate or validate_all).

Understanding Time Estimates

When your run begins, the system needs to observe all different stages (training, checkpointing, validation, etc.) before it can provide a complete time estimate.

During this initial period, you’ll see + ?? appended to the TimeRemaining metric. The initial estimate is optimistic since it doesn’t account for stages that haven’t been measured yet.

Once all stages have been observed at least once, the + ?? indicator will disappear, and you’ll receive more accurate time estimates.
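A script that consumes these estimates may want to treat the early, optimistic values differently from the complete ones. The sketch below assumes the metric is rendered as `TimeRemaining=H:MM:SS` with ` + ??` appended directly after it during the initial period. That exact rendering is an assumption based on the description above, and `parse_time_remaining` is a hypothetical helper name.

```python
import re
from datetime import timedelta

# \b keeps the pattern from matching inside "LoopTimeRemaining".
# The optional "+ ??" suffix follows the assumed rendering described above.
TIME_RE = re.compile(r"\bTimeRemaining=(\d+):(\d{2}):(\d{2})(\s*\+\s*\?\?)?")


def parse_time_remaining(line):
    """Return (estimate, is_complete) for a progress line, or None.

    is_complete is False while the "+ ??" marker is still present,
    meaning some stages have not been measured yet.
    """
    match = TIME_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) for g in match.groups()[:3])
    estimate = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return estimate, match.group(4) is None


if __name__ == "__main__":
    early = "Step=50, LoopTimeRemaining=0:02:00, TimeRemaining=0:13:21 + ??"
    later = "Step=400, LoopTimeRemaining=0:02:59, TimeRemaining=0:02:59"
    print(parse_time_remaining(early))
    print(parse_time_remaining(later))
```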

These metrics are displayed consistently across CSX, CPU, and GPU hardware.

Output Files and Artifacts

The <model_dir> directory contains all run results and artifacts, including:

  • Checkpoints for model training progress.
  • TensorBoard events, which can be viewed using:
      tensorboard --logdir <model_dir> --bind_all
  • Configuration files in <model_dir>/train or <model_dir>/eval.
  • Run logs in <model_dir>/cerebras_logs/latest/run.log or <model_dir>/cerebras_logs/<train|eval>/<timestamp>/run.log.
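The two run-log locations listed above can be resolved programmatically. This is a minimal sketch under stated assumptions: `find_run_log` is a hypothetical helper, and choosing the most recently modified file among the timestamped directories is this sketch's own fallback rule, not documented behavior.

```python
from pathlib import Path


def find_run_log(model_dir):
    """Return the path to a run.log under model_dir, or None if absent.

    Prefers cerebras_logs/latest/run.log and falls back to the most
    recently modified cerebras_logs/<train|eval>/<timestamp>/run.log.
    """
    root = Path(model_dir) / "cerebras_logs"
    if not root.is_dir():
        return None

    latest = root / "latest" / "run.log"
    if latest.is_file():
        return latest

    candidates = []
    for mode in ("train", "eval"):
        candidates.extend(root.glob(f"{mode}/*/run.log"))
    if not candidates:
        return None
    # Assumption: modification time is a reasonable proxy for recency.
    return max(candidates, key=lambda path: path.stat().st_mtime)


if __name__ == "__main__":
    log_path = find_run_log("model_dir")
    print(log_path if log_path is not None else "no run.log found")
```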

Cancel a Job

To cancel a job:

csctl cancel job <jobid>