The cluster enforces limits on memory and CPU requests so that compile and training jobs can run in parallel. These limits can be adjusted to match your requirements.
Identify Queued Jobs
When the cluster’s resources are fully utilized, newly submitted jobs are queued until capacity becomes available. Your Python client will log messages like the following:
INFO: Poll ingress status: Waiting for job running, current job status: Queueing, msg: job queueing, waiting for lock grant. Cluster status: 3 execute job(s) queued before current job, systems in use: 1
INFO: Poll ingress status: Waiting for job running, current job status: Queueing, msg: job queueing, waiting for lock grant. Cluster status: 2 execute job(s) queued before current job, systems in use: 1
You can obtain a full list of running and queued jobs on the cluster with the csctl tool:
csctl get jobs
NAME                AGE  DURATION  PHASE    SYSTEMS                   USER   LABELS              DASHBOARD
wsjob-000000000001  18h  20s       RUNNING  systemCS2_1, systemCS2_2  user2  model=gpt3-tiny     https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000001
wsjob-000000000002  1h   6m25s     QUEUED                             user2  model=neox,team=ml  https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000002
wsjob-000000000003  10m  2m01s     QUEUED                             user1  model=gpt3-tiny     https://grafana.cerebras.local/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-000000000003
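If the queue is long, you can narrow the listing to jobs that are still waiting by filtering the output with standard shell tools. The one-liner below is a minimal sketch that relies only on grep; it is not a csctl option:

# Show only queued jobs, keeping the header row (matched by NAME) for context.
csctl get jobs | grep -E 'NAME|QUEUED'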
Detect OOM Failures
When a job fails due to an out-of-memory (OOM) error, your client logs will contain messages like:
reason: "OOMKilled"
message: "Pod: job-operator.wsjob-kqsejfzmjxefkf9vyruztv-coordinator-0 exited with code 137 The pod was killed due to an out of memory (OOM) condition where the current memory limit is 32Gi."
You can also view OOM events in the wsjob dashboard.
Fig. 15 OOM software error in the wsjob dashboard
Identify Resource Capacity Failures
Jobs requesting resources beyond the cluster’s capacity will fail immediately with scheduling errors like:
reason=SchedulingFailed object=wsjob-cd2ghxfqh7ksoev79rxpvs message='cluster lacks requested capacity: requested 1 node[role:management]{cpu:32, mem:200Gi} but 1 exists with insufficient capacity {cpu:64, mem:128Gi}
Troubleshoot OOM Errors
If your job fails with an OOM error, particularly in the coordinator component, you can increase the memory allocation in the runconfig section of your YAML configuration file:
runconfig:
    compile_crd_memory_gi: 100
    execute_crd_memory_gi: 120
    wrk_memory_gi: 120
For diagnostic purposes, you can temporarily remove memory limits by setting the value to -1 and observe the maximum memory usage in Grafana:
runconfig:
    compile_crd_memory_gi: -1
Use unlimited memory settings with caution, as this can impact other users’ jobs running on the same node. A job without limits can potentially consume all available system memory.