Job Monitoring CLI

The csctl command-line interface (CLI) tool lets you manage jobs on the cluster directly from the terminal of your user node. To list the available commands, run:

csctl --help
Cerebras cluster command line tool.

Usage:
  csctl [command]

Available Commands:
  cancel             Cancel job
  check-volumes      Check volume validity on this usernode
  clear-worker-cache Clear the worker cache
  config             View csctl config files
  get                Get resources
  job                Job management commands
  label              Label resources
  log-export         Gather and download logs.
  types              Display resource types

Flags:
      --csconfig string    config file /opt/cerebras/config_v2 (default "/opt/cerebras/config_v2")
  -d, --debug int          higher debug values will display more fields in output objects
  -h, --help               help for csctl
  -n, --namespace string   configure csctl to talk to different user namespaces
      --version            version for csctl

Use "csctl [command] --help" for more information about a command.

Job IDs

Each training job submitted to the cluster launches two sequential jobs, each with their own jobID:

a compilation job (this runs first)
an execution job (this runs once the compilation job is done)

Job IDs are a required argument for most csctl commands. You can view the job ID in your terminal as each job is run:

Extracting the model from framework. This might take a few minutes.
WARNING:root:The following model params are unused: precision_opt_level, loss_scaling
2023-02-05 02:00:00,450 INFO:   Compiling the model. This may take a few minutes.
2023-02-05 02:00:00,635 INFO:   Initiating a new compile wsjob against the cluster server.
2023-02-05 02:00:00,761 INFO:   Compile job initiated
...
2023-02-05 02:02:00,899 INFO:   Ingress is ready.

2023-02-05 02:02:00,899 INFO:   Cluster mgmt job handle: {'job_id': 'wsjob-aaaaaaaaaa000000000', 'service_url': 'cluster-server.cerebras.local:443', 'service_authority': 'wsjob-aaaaaaaaaa000000000-coordinator-0.cluster-server.cerebras.local', 'compile_dir_absolute_path': '/cerebras/cached_compile/cs_0000000000111111'}

2023-02-05 02:02:00,901 INFO:   Creating a framework GRPC client: cluster-server.cerebras.local:443
2023-02-05 02:07:00,112 INFO:   Compile successfully written to cache directory: cs_000000000011111
2023-02-05 02:07:30,118 INFO:   Compile for training completed successfully!
2023-02-05 02:07:30,120 INFO:   Initiating a new execute wsjob against the cluster server.
2023-02-05 02:07:30,248 INFO:   Execute job initiated
...
2023-02-05 02:08:00,321 INFO:   Ingress is ready.

2023-02-05 02:08:00,321 INFO:   Cluster mgmt job handle: {'job_id': 'wsjob-bbbbbbbbbbb11111111', 'service_url': 'cluster-server.cerebras.local:443', 'service_authority': 'wsjob-bbbbbbbbbbb11111111-coordinator-0.cluster-server.cerebras.local', 'compile_artifact_dir': '/cerebras/cached_compile/cs_0000000000111111'}

...
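If you capture the client output to a file, the job IDs can be pulled out of the "Cluster mgmt job handle" lines with a regular expression. The following is a minimal sketch, not part of csctl: the LOG sample is an abbreviated copy of the two handle lines shown above, and the function name extract_job_ids is hypothetical.

```python
import re

# Abbreviated copies of the two "Cluster mgmt job handle" lines from the log above.
LOG = """\
2023-02-05 02:02:00,899 INFO:   Cluster mgmt job handle: {'job_id': 'wsjob-aaaaaaaaaa000000000', 'service_url': 'cluster-server.cerebras.local:443'}
2023-02-05 02:08:00,321 INFO:   Cluster mgmt job handle: {'job_id': 'wsjob-bbbbbbbbbbb11111111', 'service_url': 'cluster-server.cerebras.local:443'}
"""

# Assumes the handle is always logged with 'job_id' as its first key, as in the sample.
JOB_ID_RE = re.compile(r"Cluster mgmt job handle: \{'job_id': '([^']+)'")


def extract_job_ids(log_text):
    """Return job IDs in the order they appear (compile job first, then execute job)."""
    return JOB_ID_RE.findall(log_text)


job_ids = extract_job_ids(LOG)
print(job_ids)
```

The first ID belongs to the compilation job and the second to the execution job, since the two jobs run in that order.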

Job IDs are also recorded in the <model_dir>/cerebras_logs/run_meta.json file, which contains two sections: compile_jobs and execute_jobs. For example, the compile job will show under compile_jobs while the training job and some additional log information will show under execute_jobs:

{
    "compile_jobs": [
        {
            "id": "wsjob-aaaaaaaaaa000000000",
            "log_path": "/cerebras/workdir/wsjob-aaaaaaaaaa000000000",
            "start_time": "2023-02-05T02:00:00Z"
        }
    ]
}

After the training job is scheduled, its job ID and additional log information appear under execute_jobs. To match a compilation job with its training job, compare the available_time of the compilation job with the start_time of the training job. For this example, you will see:

{
    "compile_jobs": [
        {
            "id": "wsjob-aaaaaaaaaa000000000",
            "log_path": "/cerebras/workdir/wsjob-aaaaaaaaaa000000000",
            "start_time": "2023-02-05T02:00:00Z",
            "cache_compile": {
                "location": "/cerebras/cached_compile/cs_0000000000111111",
                "available_time": "2023-02-05T02:02:00Z"
            }
        }
    ],
    "execute_jobs": [
        {
            "id": "wsjob-bbbbbbbbbbb11111111",
            "log_path": "/cerebras/workdir/wsjob-bbbbbbbbbbb11111111",
            "start_time": "2023-02-05T02:02:00Z"
        }
    ]
}
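The timestamp comparison can also be scripted. The sketch below pairs each execute job with the compile job whose cache_compile.available_time equals the execute job's start_time. It assumes the two timestamps match exactly, as they do in this example; on a real cluster they may differ slightly, so treat this as a starting point. The function name pair_jobs is hypothetical, and RUN_META is an inline copy of the example rather than a file read from <model_dir>.

```python
import json

# Inline copy of the run_meta.json example (log_path fields omitted for brevity).
RUN_META = """
{
    "compile_jobs": [
        {
            "id": "wsjob-aaaaaaaaaa000000000",
            "start_time": "2023-02-05T02:00:00Z",
            "cache_compile": {
                "location": "/cerebras/cached_compile/cs_0000000000111111",
                "available_time": "2023-02-05T02:02:00Z"
            }
        }
    ],
    "execute_jobs": [
        {
            "id": "wsjob-bbbbbbbbbbb11111111",
            "start_time": "2023-02-05T02:02:00Z"
        }
    ]
}
"""


def pair_jobs(run_meta):
    """Return (compile_id, execute_id) pairs; compile_id is None when no match is found."""
    by_available = {}
    for job in run_meta.get("compile_jobs", []):
        available = job.get("cache_compile", {}).get("available_time")
        if available is not None:
            by_available[available] = job["id"]
    return [
        (by_available.get(job["start_time"]), job["id"])
        for job in run_meta.get("execute_jobs", [])
    ]


pairs = pair_jobs(json.loads(RUN_META))
for compile_id, execute_id in pairs:
    print(f"{compile_id} -> {execute_id}")
```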

Using the jobID, query information about the status of a job in the system:

csctl [-d int] get job <jobID> [-o json|yaml]

where:

Flag	Default	Description
-o	table	Output format: table, json, or yaml
-d, --debug	0	Debug level. A higher level prints more fields in the output objects. Only applies to the json and yaml output formats.

csctl -d0 get job wsjob-000000000000 -oyaml
meta:
  createTime: "2022-12-07T05:10:16Z"
  labels:
    label: customed_label
    user: user1
  name: wsjob-000000000000
  type: job
spec:
  user:
    gid: "1001"
    uid: "1000"
  volumeMounts:
  - mountPath: /data
    name: data-volume-000000
    subPath: ""
  - mountPath: /dev/shm
    name: dev-shm
    subPath: ""
status:
  phase: SUCCEEDED
  systems:
  - systemCS2_1

Compilation jobs do not require CS-X resources, but they do require resources on the server nodes. Only one compilation job can run in the cluster at a time. Execution jobs require CS-X resources and are queued until those resources become available. Compilation and execution jobs have different job IDs.

Cancel Jobs

You can cancel any compilation or execution job with the jobID:

csctl cancel job <jobID>

Cancelling a job releases all of its resources and sets the job to a cancelled state. In version 1.8, this command might cause the client logs to print the following error:

cerebras.appliance.errors.ApplianceUnknownError: Received unexpected gRPC error (StatusCode.UNKNOWN) : 'Stream removed' while monitoring Coordinator for Runtime server errors

This is expected.

Label Jobs

Label a job with the label command:

csctl label job wsjob-000000000000 framework=pytorch

Run the same command again to remove a label.

Track Queue

Obtain a full list of running and queued jobs on the cluster:

csctl get jobs

By default, this command produces a table including:

Field	Description
Name	Job ID
Age	Time since job submission
Duration	How long the job ran
Phase	One of QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELLED
Systems	CS-X systems used by the job
User	User who started the job
Labels	Custom labels set by the user
Dashboard	Grafana dashboard link for this job

For example:

csctl get jobs
NAME                AGE  DURATION  PHASE      SYSTEMS                   USER  LABELS             DASHBOARD
wsjob-000000000001  18h  20s       RUNNING    systemCS2_1, systemCS2_2  user2 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002   1h  6m25s     QUEUED                               user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
wsjob-000000000003  10m  2m01s     QUEUED                               user1 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003

Running the command without filters can print a long list of jobs. Use -l to return only the jobs that match the given set of labels:

csctl get jobs -l model=neox,team=ml
NAME                AGE  DURATION  PHASE      SYSTEMS                   USER  LABELS             DASHBOARD
wsjob-000000000002   1h  6m25s     QUEUED                               user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002

If you only want to see your own jobs, use -m:

csctl get jobs -m
NAME                AGE  DURATION  PHASE      SYSTEMS                   USER  LABELS             DASHBOARD
wsjob-000000000003  10m  2m01s     QUEUED                               user1 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003

To also include completed and failed jobs, use -a:

csctl get jobs -a
NAME                AGE  DURATION  PHASE      SYSTEMS                   USER  LABELS             DASHBOARD
wsjob-000000000000  43h  6m27s     SUCCEEDED  systemCS2_1               user1 model=gpt3xl       https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000000
wsjob-000000000001  18h  20s       RUNNING    systemCS2_1, systemCS2_2  user2 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002   1h  6m25s     QUEUED                               user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
wsjob-000000000003  10m  2m01s     QUEUED                               user1 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003

These filter options can be combined. For example, to see your complete job history:

csctl get jobs -a -m
NAME                AGE  DURATION  PHASE      SYSTEMS                   USER  LABELS             DASHBOARD
wsjob-000000000000  43h  6m27s     SUCCEEDED  systemCS2_1               user1 model=gpt3xl       https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000000
wsjob-000000000003  10m  2m01s     QUEUED                               user1 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003

You can also use grep to see which jobs are queued versus running and how many systems are occupied. grep 'RUNNING' lists the jobs that are currently running on the cluster. For example:

csctl get jobs | grep 'RUNNING'
wsjob-000000000001  18h  20s       RUNNING    systemCS2_1, systemCS2_2  user2 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001

grep 'QUEUED' will show a list of jobs that are currently queued. For example:

csctl get jobs | grep 'QUEUED'
wsjob-000000000002   1h  6m25s     QUEUED                               user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
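For a summary rather than a filtered list, you can parse the table output in a short script. The sketch below counts jobs per phase and collects the systems used by running jobs. It assumes the columns line up with the header, as in the sample output above, and it reads from an inline copy of that output rather than calling csctl. The function names parse_table and systems_in_use are hypothetical.

```python
import re
from collections import Counter

# Inline copy of the `csctl get jobs` example output shown earlier.
SAMPLE = """\
NAME                AGE  DURATION  PHASE      SYSTEMS                   USER  LABELS             DASHBOARD
wsjob-000000000001  18h  20s       RUNNING    systemCS2_1, systemCS2_2  user2 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002   1h  6m25s     QUEUED                               user2 model=neox,team=ml https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
wsjob-000000000003  10m  2m01s     QUEUED                               user1 model=gpt3-tiny    https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003
"""


def parse_table(text):
    """Split each row at the character offsets where the header names start."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = [(m.group(), m.start()) for m in re.finditer(r"\S+", lines[0])]
    rows = []
    for line in lines[1:]:
        row = {}
        for i, (name, start) in enumerate(columns):
            end = columns[i + 1][1] if i + 1 < len(columns) else None
            row[name] = line[start:end].strip()
        rows.append(row)
    return rows


def systems_in_use(rows):
    """Collect the CS-X systems listed on RUNNING rows."""
    used = set()
    for row in rows:
        if row["PHASE"] == "RUNNING":
            used.update(s.strip() for s in row["SYSTEMS"].split(",") if s.strip())
    return used


rows = parse_table(SAMPLE)
phase_counts = Counter(row["PHASE"] for row in rows)
print(dict(phase_counts))
print(sorted(systems_in_use(rows)))
```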

Get Configured Volumes

Get a list of mounted volumes on the cluster:

csctl get volume

NAME                  TYPE  CONTAINERPATH  SERVER       SERVERPATH  READONLY
training-data-volume  nfs   /ml            10.10.10.10  /ml         false

Update Job Priority

Update the priority for a given job, where priority_value is p1, p2, or p3:

csctl job set-priority <wsjob_id> -n <namespace> <priority_value>

For example:

csctl job set-priority wsjob-xxxxxx -n mynamespace p2

This updates the job’s priority to P2.

Redundancy Pools

If you’re running a larger job with Auto Job Restart enabled, you can create a redundancy pool to ensure job continuity and system reliability. The system uses the redundancy pool only when:

Local session systems or nodes become unhealthy
Local session lacks sufficient resources to complete the job

In those cases, resources from the redundancy pool are used to complete the job.

Limitations and Compatibility Requirements

Jobs must fit within the session’s total system capacity. For example, a session with 2 total systems cannot submit a 3-system job, even if redundancy systems are available.
You cannot submit jobs directly to the redundancy pool. If you need a redundancy system for debugging purposes, system administrators can manually move redundancy systems to a debug session.
The redundancy pool session operates in a restricted mode and is only accessible by jobs from permitted sessions.

Redundancy pool systems are filtered based on system version matching:

Systems must align with the local session’s cluster version.
Cluster administrators must maintain version consistency across all session systems.
If a local session contains a system with version 2.4.1, for example, the redundancy pool will select systems with the same version.
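The capacity and version rules above can be illustrated with a small model. This is only a sketch of the rules as described in this section, not the scheduler's actual implementation. The System class, the function names, and the system names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    version: str


def job_fits_session(requested_systems, session_systems):
    """A job must fit within the session's own capacity; redundancy systems do not count."""
    return requested_systems <= len(session_systems)


def matching_redundancy_systems(session_systems, pool_systems):
    """Keep only the pool systems whose version matches a version in the local session."""
    versions = {s.version for s in session_systems}
    return [s for s in pool_systems if s.version in versions]


session = [System("sys-a", "2.4.1"), System("sys-b", "2.4.1")]
pool = [System("spare-1", "2.4.1"), System("spare-2", "2.3.0")]

print(job_fits_session(3, session))  # False: a 2-system session cannot run a 3-system job
print([s.name for s in matching_redundancy_systems(session, pool)])  # ['spare-1']
```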

Create Redundancy Pool

To create a redundancy pool, run the following command:

csctl session create --redundant

Then enable the pool:

# enable pool-a and pool-b as redundancy pools for <session-name>
csctl session update <session-name> --redundancy-sessions=<pool-a,pool-b...>

# clear the redundancy pools for <session-name>
csctl session update <session-name> --redundancy-sessions=""

To list all sessions in redundant mode:

# list all sessions in redundant mode only 
csctl session list --redundant
NAME                     SYSTEMS  NODEGROUPS  CRD-NODES  REDUNDANCY     LAST-ACTIVE
test1                    0        0           0          *redundant      >7 days
test2                    0        0           0          *redundant      >7 days

Export Logs

To download logs for a specific job:

csctl log-export <jobID> [-b]

with optional flags:

Flag	Default Value	Description
-b, --binaries	False	Include binary debugging artifacts
-h, --help		Show help for log-export

For example:

csctl log-export wsjob-example-0
Gathering log data within cluster...
Starting a fresh download of log archive.
Downloaded 0.55 MB.
Logs archive: ./wsjob-example-0.zip

These logs are useful when debugging a job failure with Cerebras support.

Worker SSD Cache

To speed up the processing of large amounts of input data, you can stage your data in the worker nodes’ local SSD cache. This cache is shared among all users.

Get Worker Cache Usage

Use this command to obtain the current worker cache usage on each worker node:

csctl get worker-cache
NODE       DISK USAGE
worker-01  57.86%
worker-02  50.84%
worker-03  49.47%
worker-04  63.56%
worker-05  63.56%
worker-06  63.71%
worker-07  63.22%
worker-09  65.80%
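To decide whether the cache needs attention, you can parse this output and flag the nodes above a usage threshold. The sketch below works on an inline copy of the sample output. The 63.5% threshold is an arbitrary value chosen for illustration, and the function names parse_usage and nodes_above are hypothetical.

```python
# Inline copy of the `csctl get worker-cache` example output.
SAMPLE = """\
NODE       DISK USAGE
worker-01  57.86%
worker-02  50.84%
worker-03  49.47%
worker-04  63.56%
worker-05  63.56%
worker-06  63.71%
worker-07  63.22%
worker-09  65.80%
"""


def parse_usage(text):
    """Map node name to disk usage percentage, skipping the header line."""
    usage = {}
    for line in text.splitlines()[1:]:
        parts = line.split()
        if len(parts) == 2 and parts[1].endswith("%"):
            usage[parts[0]] = float(parts[1].rstrip("%"))
    return usage


def nodes_above(usage, threshold):
    """Return the nodes whose usage is strictly above the threshold, sorted by name."""
    return sorted(node for node, pct in usage.items() if pct > threshold)


usage = parse_usage(SAMPLE)
busy = nodes_above(usage, 63.5)  # illustrative threshold, not a recommended value
print(busy)
```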

Clear Worker Cache

If the cache is full, use the clear-worker-cache command to delete the contents of all caches on all nodes:

csctl clear-worker-cache
Worker caches cleared successfully

Cluster Status

Check the status and system load of all CS-X systems and all cluster nodes:

csctl get cluster

In this table, note that the CPU and MEM columns are only relevant for nodes, and system-in-use is only relevant for CS-X systems. The CPU percentage is scaled so that 100% indicates that all CPU cores are fully utilized. For example:

csctl get cluster
NAME               TYPE             CPU     MEM     SYSTEM-IN-USE  JOBID                         JOBLABELS     STATE  NOTES
systemf103         system           n/a     n/a     InUse          wsjob-jcvs23zpsxopvu9ymd2e5u  wsjob-label=  ok
systemf116         system           n/a     n/a     InUse          wsjob-jcvs23zpsxopvu9ymd2e5u  wsjob-label=  ok
cs-swx001-sx-sr18  broadcastreduce  22.17%  14.20%  n/a            n/a                                         ok
cs-wse002-mg-sr01  management       3.23%   9.45%   n/a            n/a                                         ok
cs-wse005-mx-sr04  memory           13.00%  12.93%  n/a            n/a                                         ok

You can filter the output with the following options:

Flag	Description
-e, --error-only	Only show nodes/systems in an error state
-n, --node-only	Only show nodes, omit the system list
-s, --system-only	Only show CS-X systems, omit the node list
