Learn how to use the csctl CLI tool to manage and monitor jobs.
The `csctl` command-line interface (CLI) tool allows you to manage jobs on the cluster directly from the terminal of your user node.
To learn more about the available commands, run:
csctl --help
Each job is identified by a `jobID`, which is used in `csctl` commands. You can view the job ID in your terminal as each job is run:
The job ID is also recorded in the `<model_dir>/cerebras_logs/run_meta.json` file, which contains two sections: `compile_jobs` and `execute_jobs`.
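The two sections can also be inspected programmatically. The following is a minimal sketch, assuming each section holds a list of job entries; the helper names `load_run_meta` and `list_job_sections` are hypothetical, and no assumption is made about the fields inside each entry.

```python
import json
from pathlib import Path

# The two sections named in the text above.
SECTIONS = ("compile_jobs", "execute_jobs")


def load_run_meta(model_dir: str) -> dict:
    """Read <model_dir>/cerebras_logs/run_meta.json from disk."""
    path = Path(model_dir) / "cerebras_logs" / "run_meta.json"
    with path.open() as f:
        return json.load(f)


def list_job_sections(run_meta: dict) -> dict:
    """Return the entries under each section (assumed to be lists).

    A missing section is reported as an empty list.
    """
    return {section: run_meta.get(section, []) for section in SECTIONS}
```

Calling `list_job_sections(load_run_meta("<model_dir>"))` with your own model directory returns a dictionary whose two keys map to whatever entries the file records for compile and execute jobs.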
For example, the compile job will show under `compile_jobs`, while the training job and some additional log information will show under `execute_jobs`:
`execute_jobs`.
To match a compilation job to its training job, compare the available time of the compilation job with the start time of the training job. For this example, you will see
Using the `jobID`, query information about the status of a job in the system:
Flag | Default | Description |
---|---|---|
-o | table | Output Format: table, json, yaml |
-d, --debug | 0 | Debug level. A higher debug level prints more fields in the output objects. Only applicable to the json or yaml output format. |
`jobID`:
`label` command:
Field | Description |
---|---|
Name | Job identifier (jobID) |
Age | Time since job submission |
Duration | How long the job ran |
Phase | One of QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED |
Systems | CS-X systems used in this job |
User | User who started this job |
Labels | Custom labels set by the user |
Dashboard | Grafana dashboard link for this job |
Use `-l` to return jobs that match the given set of labels:
`-m`:
`-a`:
Use `grep` to view which jobs are queued versus running and how many systems are occupied.
`grep 'RUNNING'` will show a list of jobs that are currently running on the cluster.
For example:
`grep 'QUEUED'` will show a list of jobs that are currently queued.
For example:
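As a complement to the `grep` filters above, the following minimal sketch tallies how many lines of a tabular job listing are in each phase. It assumes the listing is plain text in which the phase appears as a standalone, whitespace-separated token on each job's line; the function name `count_phases` is hypothetical and is not part of `csctl`.

```python
from collections import Counter

# Only the two phases discussed above are tallied.
PHASES = ("QUEUED", "RUNNING")


def count_phases(table_text: str) -> Counter:
    """Count lines whose whitespace-separated tokens include a known phase."""
    counts = Counter()
    for line in table_text.splitlines():
        tokens = line.split()
        for phase in PHASES:
            if phase in tokens:
                counts[phase] += 1
                break  # count each line at most once
    return counts
```

Matching on whole tokens rather than substrings avoids miscounting a job whose name or label happens to contain the word `RUNNING` inside a longer string.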
`priority_value` is p1, p2, or p3:
Flag | Default Value | Description |
---|---|---|
-b, --binaries | False | Include binary debugging artifacts |
-h, --help | | Informative message for log-export |
The `CPU` and `MEM` columns are only relevant for nodes, and `system-in-use` is only relevant for CS-X systems. The `CPU` percentage is scaled so that 100% indicates that all CPU cores are fully utilized.
For example:
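To make the scaling concrete, here is a small arithmetic sketch. It assumes the scaled value is the average of the per-core utilization percentages, which is one reading of the description above rather than a confirmed implementation detail; the function name `scaled_cpu_percent` is hypothetical.

```python
def scaled_cpu_percent(per_core_percent: list[float]) -> float:
    """Average per-core utilization, so 100% means every core is fully busy."""
    if not per_core_percent:
        raise ValueError("at least one core is required")
    return sum(per_core_percent) / len(per_core_percent)


# Two of four cores fully busy: half of the node's CPU capacity is in use.
print(scaled_cpu_percent([100.0, 100.0, 0.0, 0.0]))  # 50.0
```

Under this reading, a node reaches 100% only when all of its cores are saturated, unlike per-process tools where 100% can mean a single busy core.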
Flag | Description |
---|---|
-e, --error-only | Only show nodes/systems in an error state |
-n, --node-only | Only show nodes, omit the system list |
-s, --system-only | Only show CS-X systems, omit the node list |