Cluster Monitoring With Grafana

Grafana dashboards let you visualize, query, and explore cluster metrics through graphs and charts, with integrated access to system logs and traces. Cerebras provides two Grafana dashboards tailored to the Cerebras Wafer-Scale cluster:

ML Admin dashboard: Helps users and administrators visualize the overall state of the Cerebras Wafer-Scale cluster.
ML User dashboard: Specialized for monitoring and managing individual jobs running on the Cerebras Wafer-Scale cluster.

ML Admin Dashboard

The ML Admin dashboard shows the overall state of the cluster. The following figure displays Cerebras’s Wafer-scale ML Admin dashboard:

It includes the following:

CS-X Status: Overall CS-X system status and the jobs running on the cluster.
Node Summary: Overall CPU, memory, and network bandwidth information for the different types of nodes.
Individual NodeGroup Summary: Overall CPU, memory, and network bandwidth information for the nodes inside a nodegroup.

ML User Dashboard

The ML User Dashboard provides job-level metrics, logs, and traces, allowing users to closely monitor the progress and resource utilization of their specific jobs. The following figure displays Cerebras’s ML User dashboard:

The following list describes the various panes in the dashboard:

Overview: Displays memory, CPU, and network bandwidth numbers for all replicas of the selected job.
Server summary by replica type, all nodegroups: Displays summary CPU, memory, and network bandwidth numbers for each replica type across all nodegroups.
Server summary by replica type, individual nodegroup: Displays summary CPU, memory, and network bandwidth numbers for each replica type in a single nodegroup.
Replica view: Displays memory, CPU, and network bandwidth numbers for each replica_id of the selected replica_type in each chart. replica_type represents a type of service process for a given job and is one of: weight, command, activation, broadcastreduce, chief, worker, or coordinator. replica_id identifies a specific replica for a given job and replica type.
Assigned nodes: Displays the status of the physical nodes assigned to the chosen replica_type and replica_id.
MemX performance: Shows iteration-based performance, including iteration time, cross-iteration time, and backward-iteration time.

Users can select from the following filters:

wsjob: The ID of the weight-streaming run, used to select between different runs on a particular system.
replica_type: Selects between activation, weight, and all server metrics.
nodegroup: Selects the nodegroup whose server summaries are shown.

Other useful fields are model, job_type, and replica_id.

Prerequisites

You must have access to the user node in the Cerebras Wafer-Scale cluster. Contact Cerebras Support for any system configuration issues. From your machine, you can open a port-forwarding SSH session through the user node with this command:

$ ssh -L 8443:grafana.<cluster-name>.<domain>.com:443 myUser@usernode

This command uses the local port 8443 to forward the traffic. You can choose any unoccupied port on your machine.
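As a minimal sketch of choosing a different local port, the helper below prints the same forwarding command for whatever port and cluster you pass in. The function names `grafana_host` and `forward_cmd` are hypothetical and not part of any Cerebras tooling. The `-N` flag is a standard OpenSSH option that keeps the tunnel open without starting a remote shell; it is an optional addition and does not appear in the command above.

```shell
#!/bin/sh
# Hypothetical helpers: they only print a command, they do not run ssh.

# grafana_host CLUSTER DOMAIN -> grafana.CLUSTER.DOMAIN.com
grafana_host() {
    printf 'grafana.%s.%s.com\n' "$1" "$2"
}

# forward_cmd LOCAL_PORT CLUSTER DOMAIN USER USERNODE
# Prints an ssh command that forwards LOCAL_PORT to port 443 of the Grafana host.
forward_cmd() {
    printf 'ssh -N -L %s:%s:443 %s@%s\n' \
        "$1" "$(grafana_host "$2" "$3")" "$4" "$5"
}
```

For example, `forward_cmd 9443 mb-systemf102 cerebras myUser usernode` prints a command that uses local port 9443 instead of 8443; copy and run the printed line to open the tunnel.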

Getting Access

Links are accessible from the General/Cerebras tab. The following figure displays a Cerebras dashboard:

Ask your system administrator to set up the Grafana database. URLs have the format grafana.CLUSTER-NAME.DOMAIN.com, for example grafana.mb-systemf102.cerebras.com.
Get authentication credentials for Grafana (username and password) from your system administrator.
Add the Grafana TLS certificate to your browser keychain. The certificate is located at /opt/cerebras/certs/grafana_tls.crt on the user node, where it is copied during the user node installation process. Download it to your local machine and add it to your browser keychain.

On Chrome with macOS

Go to Preferences > Privacy and Security > Security > Manage Certificates.
Add grafana_tls.crt to the System keychain certificates. Make sure to set Always Trust for this certificate.
Next, edit your local machine’s /etc/hosts file so that the Grafana hostname resolves to the IP of the user node: <USERNODE_IP> grafana.<cluster-name>.<domain>.com
Finally, navigate to the Grafana dashboards with the following URL: https://grafana.<cluster-name>.<domain>.com
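The access steps above can be sketched as a few small shell helpers. This is an illustration under stated assumptions, not a Cerebras-provided script: the function names are hypothetical, the user and host names reuse the placeholders from the SSH example, and the helpers only print the command or line to use so that nothing is changed on your machine until you run it yourself.

```shell
#!/bin/sh
# Hypothetical helpers for the access steps; each one only prints text.

# Certificate location on the user node, as given in the text.
CERT_PATH=/opt/cerebras/certs/grafana_tls.crt

# fetch_cert_cmd USER USERNODE
# Prints an scp command that copies the certificate to the current directory.
fetch_cert_cmd() {
    printf 'scp %s@%s:%s .\n' "$1" "$2" "$CERT_PATH"
}

# hosts_entry USERNODE_IP CLUSTER DOMAIN
# Prints the line to add to /etc/hosts.
hosts_entry() {
    printf '%s grafana.%s.%s.com\n' "$1" "$2" "$3"
}

# grafana_url CLUSTER DOMAIN
# Prints the dashboard URL to open in the browser.
grafana_url() {
    printf 'https://grafana.%s.%s.com\n' "$1" "$2"
}
```

For instance, `hosts_entry 192.0.2.10 mb-systemf102 cerebras | sudo tee -a /etc/hosts` would append the entry, where 192.0.2.10 is a placeholder address standing in for your real user node IP.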

Viewing Performance Metrics with the ML User Dashboard

You can view cluster iteration-performance metrics by tracking update times across the weight servers. The current dashboard shows iteration time, forward-iteration time, backward-iteration time, cross-iteration time, and input starvation.

Iteration time: The time from the end of iteration “i-1” on the weight servers to the end of iteration “i” on the weight servers.
Forward-iteration time: The time spent in iteration “i” during the forward pass.
Backward-iteration time: The time spent in iteration “i” during the backward pass.
Cross-iteration time: The time from the last gradient received in an iteration to the first weight sent. A high value indicates an optimizer performance bottleneck.
Input starvation: The time spent waiting on the framework to receive activations.
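To make the definitions of iteration time and cross-iteration time concrete, here is a small arithmetic sketch. The timestamps in the usage note are made-up millisecond values, not measurements from a cluster, and the function names are hypothetical. Expressing cross-iteration time as a share of iteration time is only one rough way to judge whether a value is "high"; the dashboard itself does not define a threshold.

```shell
#!/bin/sh
# Illustrative arithmetic only; all inputs are integer milliseconds.

# iteration_time PREV_END CUR_END
# Time from the end of iteration i-1 to the end of iteration i.
iteration_time() {
    echo $(( $2 - $1 ))
}

# cross_iteration_time LAST_GRADIENT_RECEIVED FIRST_WEIGHT_SENT
# Time from the last gradient received to the first weight sent.
cross_iteration_time() {
    echo $(( $2 - $1 ))
}

# cross_share CROSS_MS ITERATION_MS
# Cross-iteration time as an integer percentage of iteration time.
cross_share() {
    echo $(( 100 * $1 / $2 ))
}
```

With hypothetical values, `iteration_time 1000 1250` gives 250, `cross_iteration_time 1180 1230` gives 50, and `cross_share 50 250` gives 20, meaning a fifth of the iteration would be spent between gradients arriving and weights going out.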

These statistics are shown in the following image and can be used to identify performance bottlenecks in the training process:

Viewing Utilization Metrics with the ML User Dashboard

The following figure shows the overview status for a job, including the list of CS-X systems, the start and end times, and the memory, CPU, and network usage for the different replicas in the job:

The "Overview", "Server summary by replica type, all nodegroups", and "Server summary by replica type, individual nodegroup" panes display memory, CPU, and network bandwidth numbers for a job at different levels of granularity. The Overview pane shows the metrics for all replicas in a job, along with the systems used by the job. The two server summary panes show the metrics across all nodegroups or within an individual nodegroup. The Replica view pane displays memory, CPU, and network bandwidth numbers for each replica_id of the selected replica_type in each chart. replica_type represents a type of service process for a given job and is one of: weight, command, activation, broadcastreduce, chief, worker, or coordinator.

Egress bandwidth indicates the maximum and mean network egress speeds of each supporting server. This can help you determine whether a job is network-bound, for example by checking the transmission speed of a lagging node.

The following figure shows that weight servers achieve a maximum network transmit speed of ~33 MB/s:

Ingress bandwidth denotes the ingress speed of each supporting server. In this example, the weight servers have an average ingress speed of around 20 MB/s.
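When judging whether a job is network-bound, it can help to compare the reported MB/s figures against the capacity of the link, which is usually quoted in Mbit/s or Gbit/s. The sketch below does that conversion with awk. The function names are hypothetical, and the 100 Gbit/s link capacity used in the usage note is an assumed value for illustration only, not a statement about Cerebras cluster hardware.

```shell
#!/bin/sh
# Unit-conversion sketch; link capacity is supplied by the caller.

# mbytes_to_mbits MB_PER_S
# Converts megabytes per second to megabits per second (1 byte = 8 bits).
mbytes_to_mbits() {
    awk -v mb="$1" 'BEGIN { printf "%d\n", mb * 8 }'
}

# link_utilization_pct MB_PER_S LINK_MBIT_PER_S
# Prints the share of the link in use, as a percentage with two decimals.
link_utilization_pct() {
    awk -v mb="$1" -v link="$2" 'BEGIN { printf "%.2f\n", mb * 8 / link * 100 }'
}
```

For the ~33 MB/s egress figure mentioned above, `mbytes_to_mbits 33` gives 264 Mbit/s, and `link_utilization_pct 33 100000` gives 0.26 against an assumed 100 Gbit/s link, which would suggest the network is far from saturated in that scenario.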

The following figure shows the ingress bandwidth metric:

CPU usage shows the CPU utilization percentage for each weight server. In this case, the CPUs are only 5-7% utilized.

The following figure shows the CPU usage metric:

Memory usage indicates the maximum and mean amounts of memory each weight server uses over time. This can be useful in debugging whether the weight servers are memory-bound. For more information on memory requirements, visit resource_parallel_compile.

The following figure shows the memory usage metric:

You can use the Grafana interface to show individual metrics for a particular node that runs a replica. For example, these are the views of CPU and memory usage for the node that runs the weight-2 replica:

The following figure shows the CPU usage per node metric:

The following figure shows the memory usage per node metric:
