- CLI for job monitoring (csctl): The csctl tool provides job monitoring capabilities. With csctl, you can:
  - Inspect submitted jobs to see their status and configuration.
  - Assign labels to jobs to organize and categorize them.
  - Retrieve details about mounted volumes, which jobs rely on for data access.
  - Export logs to investigate job execution and troubleshoot issues.
- Job priority: Users can prioritize jobs in the Cerebras Wafer-Scale cluster by assigning priority buckets and values, so scheduling is no longer strictly FIFO. Priorities can be set at submission or adjusted afterward, subject to administrative controls over priority modifications.
- Cluster monitoring with Grafana: Cerebras provides a Grafana dashboard that visualizes job resource usage and the software and hardware errors tied to specific jobs. The dashboard gives you a single interface for tracking and analyzing cluster performance and job metrics.
- Slurm integration: Cerebras has implemented a lightweight integration with Slurm, a job scheduler and resource manager widely used in high-performance computing environments. The integration lets you submit and manage jobs in the Cerebras Wafer-Scale cluster through Slurm.
- Resource requirements for parallel jobs: You can define resource requirements for parallel training and compilation jobs running inside the Cerebras Wafer-Scale cluster. This includes setting limits on memory and CPU usage so that concurrent jobs do not contend for the same resources.
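
To make the job priority item above more concrete, here is a minimal Python sketch of how scheduling by priority bucket and value differs from plain FIFO. It is not the cluster's scheduler or any Cerebras API: the `JobQueue` class, the bucket names `p1` to `p3`, and the rule that a lower value runs first within a bucket are all assumptions chosen for illustration. Jobs with equal priority fall back to submission order, and a job's priority can be changed after submission.

```python
import heapq
import itertools

# Hypothetical bucket names; a lower rank is scheduled first (assumption).
BUCKET_RANK = {"p1": 0, "p2": 1, "p3": 2}


class JobQueue:
    """Toy priority queue: orders by (bucket, value, submission order)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # submission sequence numbers
        self._entries = {}                 # job name -> live heap entry

    def submit(self, name, bucket="p2", value=0):
        seq = next(self._counter)
        self._push(name, bucket, value, seq)

    def set_priority(self, name, bucket, value=0):
        """Re-prioritize a queued job, keeping its original submission order."""
        old = self._entries[name]
        old[4] = False  # mark the stale entry so pop_next() skips it
        self._push(name, bucket, value, old[2])

    def pop_next(self):
        """Return the next job to run, or None if the queue is empty."""
        while self._heap:
            _rank, _value, _seq, name, live = heapq.heappop(self._heap)
            if live:
                del self._entries[name]
                return name
        return None

    def _push(self, name, bucket, value, seq):
        entry = [BUCKET_RANK[bucket], value, seq, name, True]
        self._entries[name] = entry
        heapq.heappush(self._heap, entry)


def drain(queue):
    """Pop every job and return the names in scheduling order."""
    order = []
    while (job := queue.pop_next()) is not None:
        order.append(job)
    return order


if __name__ == "__main__":
    q = JobQueue()
    q.submit("train-a", bucket="p2")
    q.submit("compile-b", bucket="p1")
    q.submit("train-c", bucket="p2")
    q.set_priority("train-c", bucket="p1")  # adjusted after submission
    print(drain(q))
```

Under plain FIFO the three jobs would run in submission order. In this sketch `compile-b` and the re-prioritized `train-c` move ahead of `train-a` because they sit in a higher bucket.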