Cerebras provides multiple Grafana dashboards to help you monitor your cluster.

Depending on your role, you may see some or all of the dashboards described here.

Cluster Admin Dashboards

These dashboards are intended for administrators and cluster operators.

You’ll see a cluster-wide view of available resources (CSx systems, nodes, switches, links) and their health. Additionally, these dashboards highlight cluster problems at a glance, categorizing them as errors (requiring immediate attention) or warnings (potential issues).

Click a resource link to view that resource’s dashboard and its associated metrics.

Session Admin Dashboards

These dashboards are intended for session owners who need to monitor the utilization and health of a session’s resources and job statuses.

You’ll see a session-specific view of the job queue, available resources (CSx systems, nodes, switches, links), and their health.

Click resource links to open the corresponding Resource Details or Job Admin dashboards, which show detailed metrics and job status information.

Job Admin Dashboards

These dashboards are intended for users who want to monitor the status of a job and troubleshoot, if needed.

You’ll see a single job with metrics on system utilization and the resources in use. The Job Debug dashboard highlights potential causes of failures, and the Job Network Debug dashboard drills down into the network resources the job is using.

Click on resource links to navigate to Resource Details dashboards and their associated metrics.

Resource Details Dashboards

These dashboards are intended for administrators and cluster operators who want to view specific cluster resources and the ports/links that connect them.

You’ll see detailed metrics for cluster resources including CSx systems, nodes, servers, and connecting ports. The metrics cover CPU utilization, memory availability, thermal readings, network performance, link states, and device error counters.

Access Grafana

Ensure you have access to the user node in the Cerebras Wafer-Scale cluster. If you encounter system configuration issues, contact Cerebras Support.

Set Up Port Forwarding

Run the following command to start a port-forwarding SSH session through the user node from your machine:

ssh -L 8443:grafana.<cluster-name>.<domain_name>:443 myUser@usernode

This command forwards traffic through local port 8443. You can use any unoccupied port on your machine.
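To use a different local port, only the first number in the -L argument changes. The sketch below builds the command from its three parts; the function name tunnel_command and the port 9443 are illustrative choices, and the host and user values are the same placeholders used above.

```shell
# tunnel_command LOCAL_PORT GRAFANA_HOST SSH_TARGET
# Prints an ssh command of the same shape as the one above:
#   -L <local port>:<Grafana host>:443 <user>@<user node>
tunnel_command() {
  printf 'ssh -L %s:%s:443 %s\n' "$1" "$2" "$3"
}

# Example: forward through local port 9443 instead of 8443.
# The command is only printed here, not executed.
tunnel_command 9443 'grafana.<cluster-name>.<domain_name>' 'myUser@usernode'
```

Copy the printed command into your terminal once the placeholders are replaced with your cluster's values.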

Get Access Credentials

  1. Ask your system administrator to set up the Grafana database and confirm the Grafana URL for your cluster. URLs follow this format: grafana.<cluster-name>.<domain_name>. For example: grafana.mb-systemf102.cerebras.com.
  2. Obtain authentication credentials (username and password) from your system administrator.

Add the Grafana TLS Certificate

The Grafana TLS certificate is located at /opt/cerebras/certs/grafana_tls.crt on the user node. This certificate is copied during the user node installation.

Download it to your local machine and add it to your browser’s keychain.
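One way to download and inspect the certificate from a terminal is sketched below. It assumes you reach the user node as myUser@usernode (the placeholder from the port-forwarding command) and that scp and openssl are installed on your machine. The function names are invented for this example.

```shell
# Certificate location on the user node, as stated above.
CERT_PATH=/opt/cerebras/certs/grafana_tls.crt

# fetch_grafana_cert SSH_TARGET [DEST_DIR]
# Copies the certificate from the user node into DEST_DIR
# (default: the current directory).
fetch_grafana_cert() {
  scp "$1:$CERT_PATH" "${2:-.}/"
}

# show_cert_info CERT_FILE
# Prints the subject and validity dates so you can confirm
# you downloaded the certificate you expected.
show_cert_info() {
  openssl x509 -in "$1" -noout -subject -dates
}

# Usage (not run automatically):
#   fetch_grafana_cert myUser@usernode
#   show_cert_info grafana_tls.crt
```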

On Chrome with macOS

  1. Open Preferences > Privacy and Security > Security > Manage Certificates.
  2. Add grafana_tls.crt to the System keychain and set it to Always Trust.

Update the Hosts File

Edit your local machine’s /etc/hosts file to map the Grafana hostname to the user node’s IP address:

<USERNODE_IP> grafana.<cluster-name>.<domain_name>
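If you prefer to script this step, the following sketch appends the entry only when the hostname is not already listed. The function name add_hosts_entry is hypothetical, and the demonstration writes to a temporary file with a documentation-range IP address (192.0.2.10) and a made-up hostname rather than touching the real /etc/hosts.

```shell
# add_hosts_entry FILE IP HOSTNAME
# Appends "IP HOSTNAME" to FILE unless FILE already mentions HOSTNAME.
add_hosts_entry() {
  hosts_file="$1"; ip="$2"; name="$3"
  if grep -qF -- "$name" "$hosts_file" 2>/dev/null; then
    echo "entry for $name already present in $hosts_file"
  else
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
  fi
}

# Demonstration against a scratch file (placeholder values only).
tmp="$(mktemp)"
add_hosts_entry "$tmp" 192.0.2.10 grafana.example.test
add_hosts_entry "$tmp" 192.0.2.10 grafana.example.test   # second call adds nothing
cat "$tmp"
```

To apply it to the real file, pass /etc/hosts as the first argument and run the script with root privileges. Trying it on a copy first is a reasonable precaution.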

Open Grafana

Navigate to the Grafana dashboards using the following URL:

https://grafana.<cluster-name>.<domain_name>
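Before opening the browser, you can check from a terminal that the hostname resolves and the certificate is accepted. This is a minimal sketch, assuming curl is installed, that the downloaded grafana_tls.crt can serve as the trust anchor for the connection, and that Grafana's standard /api/health endpoint is reachable in your deployment. The function names are invented for this example.

```shell
# grafana_health_url HOSTNAME
# Builds the URL of Grafana's health endpoint for the given hostname.
grafana_health_url() {
  printf 'https://%s/api/health\n' "$1"
}

# check_grafana HOSTNAME CERT_FILE
# Requests the health endpoint, trusting only the given certificate file.
# A non-zero exit status indicates a DNS, TLS, or HTTP failure.
check_grafana() {
  curl --silent --show-error --fail --cacert "$2" "$(grafana_health_url "$1")"
}

# Usage (not run automatically):
#   check_grafana 'grafana.<cluster-name>.<domain_name>' grafana_tls.crt
```

If the check fails with a certificate error, revisit the certificate and hosts file steps before trying the browser.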