Cannot Load Checkpoints on GPUs

When trying to load a model trained on a Cerebras cluster onto a GPU, there’s an incompatibility between formats.

Models trained on a Cerebras cluster save checkpoints in HDF5 format, but when you attempt to load the model on a GPU, the checkpoint is expected to be in pickle format.

Workaround

Learn how to convert between checkpoint file formats in our Convert Cerebras Checkpoints for GPUs guide.

Custom PyTorch Script Causes Infinite Loop or Multiple Compilation Jobs

When using a custom PyTorch training/eval script, the script may get stuck in an infinite loop, or multiple compilation jobs may be launched.

Workaround

This issue occurs because the script lacks an if __name__ == "__main__" guard. During execution, subprocesses may be created (e.g., for weight transfer or surrogate jobs), which can cause the entire module to run unintentionally.

To prevent this, wrap your script’s main logic inside an if __name__ == "__main__" block.
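
A minimal sketch of the guard (the main() function here is a placeholder for whatever your script currently runs at module level):

def main():
    # Build the model, dataloaders, and start training/eval here.
    ...

if __name__ == "__main__":
    # Runs only when the script is executed directly, not when the module
    # is re-imported by a subprocess.
    main()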

Error Parsing Metadata

When compiling or running models, you may see this error message intermittently:

Error parsing metadata: error=invalid value key=content-type value=text/html

This error is a bug in GRPC.

Workaround

The error itself does not affect the outcome of a run, but you can disable the error message by setting this environment variable:

$ export GRPC_VERBOSITY=NONE

Note that this suppresses all log messages coming from GRPC, not just this one, and this workaround has not been thoroughly validated.

Error Receiving Activation

When trying to run your own model, you may encounter this error:

cerebras.appliance.errors.ApplianceUnknownError: Ran into error while receiving activation tensor <custom-call ...> for runtime iteration ...

This error has many possible causes, but one common issue relates to how the dataloader is structured in your run script.

When running custom models, the dataloader must be in a separate file within the same directory as the main execution or model script. If the dataloader is defined within the run script, the input workers may fail to pickle the input function from the __main__ module, leading to this error.

Workaround

Place the dataloader in a separate script rather than defining it within the main training script. Below is an example of an appropriate directory structure:

$ ls user_directory
--> run_model.py (entry script that is used to start the run using "python run_model.py ...")
--> dataloader.py (script containing the dataloaders and input functions)
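
As a minimal, hypothetical sketch of this split (the file and function names are placeholders, not part of the Cerebras API), dataloader.py defines the input function and run_model.py only imports it:

# dataloader.py
import torch
from torch.utils.data import DataLoader, TensorDataset

def get_train_dataloader(batch_size=16):
    # Toy in-memory dataset; replace with your real dataset and preprocessing.
    features = torch.randn(1024, 128)
    labels = torch.randint(0, 2, (1024,))
    return DataLoader(TensorDataset(features, labels), batch_size=batch_size)

# run_model.py
# from dataloader import get_train_dataloader
# train_loader = get_train_dataloader(batch_size=16)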

Failed Mount Directory During Execution

When running a training job, it fails with the following error:

ERROR:   Uncaught exception:
Traceback (most recent call last):
 [...]
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
       status = StatusCode.INTERNAL
       details = "Please contact the Cerebras Support Team. Poll ingress failed during wsjob initialization: job-operator/wsjob-00000 has failed because total 1 replica(s) failed: [name:"wsjob-000000-worker-0" lastTimestamp:"2023-01-01 00:00:00 +0000 UTC" reason:"FailedMount" message:"Unable to attach or mount volumes: unmounted volumes=[home-volume-111111111 training-data-volume-22222 workdir-volume], unattached volumes=[kube-api-access home-volume-11111111111 training-data-volume-22222 workdir-volume worker-cache-partition-ro-volume worker-cache-dir-volume worker-dev-shm-volume cfg]: timed out waiting for the condition"]"

Workaround

This error is under investigation. In some cases, rerunning the command solves the issue. If you are still encountering this error, contact Cerebras for assistance.

Automatic Checkpoint Loading Failure

When you add custom checkpoints to model_dir, they may not be automatically loaded during runs. This happens when the checkpoint naming doesn’t match the expected format.

The auto-load feature searches for files named checkpoint_<step>.mdl in your model_dir, loading the one with the highest <step> value. This feature is enabled by default but can be disabled by setting runconfig.autoload_last_checkpoint to False in your params YAML.

Workaround

Either:

  • Rename your checkpoint to follow the checkpoint_<step>.mdl format
  • Explicitly specify the checkpoint path using the --checkpoint_path flag
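
For the first option, a hypothetical sketch of renaming an existing checkpoint so the auto-load feature can discover it (the source filename and step value are placeholders):

import shutil

step = 10000  # the training step this checkpoint corresponds to
shutil.copy("model_dir/my_finetuned_checkpoint.mdl",
            f"model_dir/checkpoint_{step}.mdl")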

Functionalization Error

When tracing a model for Cerebras hardware, you might encounter the following error:

RuntimeError: false INTERNAL ASSERT FAILED at "aten/src/ATen/RegisterFunctionalization_1.cpp":11608, please report a bug to PyTorch. mutating a non-functional tensor with a functional tensor is not allowed. Please ensure that all of your inputs are wrapped inside of a functionalize() call.

This happens because:

  • In-place operations aren’t allowed in the compute graph
  • Cerebras uses “functionalization” to convert in-place operations to non-in-place alternatives
  • For this to work, all tensors must be on the same device - specifically the device associated with the Cerebras backend

Workaround

To fix this error, ensure all tensors are on the same backend device by creating new tensors directly on the backend device:

import torch
import cerebras.pytorch as cstorch

backend = cstorch.backend("CSX")
...
@cstorch.trace
def training_step(inputs, targets):
    ...
    # Create new tensors directly on the Cerebras backend device.
    new_tensor_1 = torch.tensor([1, 2, 3], device=backend.torch_device)
    new_tensor_2 = torch.tensor([1, 2, 3]).to(backend.torch_device)
    ...

Or by creating new tensors on the same device as an existing tensor, such as an input:

@cstorch.trace
def training_step(inputs, targets):
    ...
    # Reuse the device of a tensor that is already on the Cerebras backend.
    new_tensor_1 = torch.tensor([1, 2, 3], device=inputs.device)
    new_tensor_2 = torch.tensor([1, 2, 3]).to(inputs.device)
    ...

Input Starvation

If your dataloader isn’t keeping up with your model during a run, you’ll see the following warning:

WARNING: Input starvation detected
Please check dataloader throughput

If the issue persists, you’ll encounter an additional error:

ERROR:   Declaring stall due to input starvation, no change in status for 630 secs

Workaround

To fix this issue, you’ll need to speed up your data pipeline. See Creating Custom Dataloaders to learn about improving the performance of your dataloader and view examples.
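
Common remedies include standard PyTorch DataLoader settings; a minimal sketch (the dataset and values are placeholders, and the best settings depend on your workload):

from torch.utils.data import DataLoader

def make_dataloader(dataset, batch_size=16):
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=4,            # load samples in parallel worker processes
        prefetch_factor=2,        # batches prefetched per worker
        persistent_workers=True,  # keep workers alive between epochs
    )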

Module Not Found

There are two ModuleNotFound errors you may encounter:

Core Python Module Errors

When trying to use certain built-in Python modules like bz2, users may receive errors about missing core modules (bz2, sqlite3). For example:

Traceback (most recent call last):
import bz2
File "/usr/local/lib/python3.8/bz2.py", line 19, in <module>
    from _bz2 import BZ2Compressor, BZ2Decompressor
ModuleNotFoundError: No module named '_bz2'

This happens because the Python installation was compiled from source without all necessary system dependencies.

User-Installed Package Errors

You may encounter a ModuleNotFoundError for Python packages that are installed on your local machine but unavailable in the Cerebras environment.

Our Custom Worker Container Workflow attempts to import your dependencies into Cerebras appliances, with a fallback that mounts packages from your virtual environment.

Workaround

For core Python module errors, install the missing system packages (bzip2-devel, sqlite-devel) and rebuild Python, or use a pre-built Python binary instead.

For user-installed package errors:

  1. Disable the Custom Worker Container Workflow (see instructions here).
  2. Install packages in your virtual environment with pip.
  3. Copy the custom package directory from venv/lib/python3.8/site-packages/<package_name> to an NFS-mountable location (see the sketch after this list). Only copy the custom packages, not the entire virtual environment.
  4. Add this location to --mount_dirs and its parent to --python_paths when running jobs.
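
A hypothetical sketch of step 3 (the package name and destination path below are placeholders):

import shutil

src = "venv/lib/python3.8/site-packages/my_custom_package"  # your custom package
dst = "/nfs/shared/python_packages/my_custom_package"       # NFS-mountable location
shutil.copytree(src, dst, dirs_exist_ok=True)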

Loss Compilation Issues with AutoGen

When creating custom losses, you might encounter compilation failures.

Workaround

Wrap your custom loss class with the @autogen_loss decorator, which enables AutoGen to handle the compilation of these custom losses efficiently.

import torch.nn as nn
from cerebras.pytorch.nn.modules import autogen_loss

@autogen_loss
class CustomLoss(nn.Module):
    def __init__(self, ...):
        ...

Model is Too Large to Fit in Memory

If you encounter the following error, this means the memory requirements are too large to fit on the device:

Model is too large to fit in memory. This can happen because of a large batch size, large input tensor dimensions, or other network parameters. Please refer to the Troubleshooting section in the documentation for potential workarounds

Workaround

  • On transformer models, compile again with the batch size set to 1 using one CS-2 system to determine if the specified maximum sequence length is feasible.

  • You can try a smaller batch size per device or enable batch tiling (only on transformer models) by setting the micro_batch_size parameter in the train_input or eval_input section of your model’s yaml file (see working_with_microbatches).

  • If you ran with batch tiling with a specific micro_batch_size value, you can try compiling with a decreased micro_batch_size. The Using “explore” to Search for a Near-Optimal Microbatch Size flow can recommend performant micro batch sizes that will fit in memory.

  • On CNN models where batch tiling isn’t supported, try manually decreasing the batch size and/or the image/volume size.

  • For more information on working with batch tiling and selecting performant micro_batch_size values, see our tutorial on automatic microbatching.

  • The batch_size parameter set on the yaml configuration is the global batch size. This means that the batch size per CS-2 system is computed as the global batch size divided by the number of CS-2s used.

Numerical Issues

During low-precision training (POL=1), particularly with large output vocabularies (30,000-60,000 words), the final projection layer, which converts internal representations to words, frequently exhibits accuracy issues.

During the backward pass, the final projection layer accumulates a large number of values (equal to the vocabulary size) for each output, using low-precision 16-bit arithmetic. This extensive accumulation can introduce inaccuracies, hindering convergence. Additionally, the inputs to this layer typically originate from a softmax cross-entropy layer, whose non-normal distribution deviates significantly from the typical normal distributions observed in most layers, further contributing to inaccuracy on the backward pass.

Workaround

To mitigate potential convergence issues arising from numerical instability in the final projection layer during low-precision training (POL=1), a per-layer setting of POL=0 should be applied to this specific layer.

This ensures the highest numerical precision for the final projection while maintaining the performance advantages of POL=1 throughout the rest of the model. This modification has already been incorporated into the Model Zoo variants of Cerebras large language models.

Throughput Spike After Saving Checkpoints

If you notice a throughput spike after saving checkpoints, this is a known artifact of reporting throughput on the user node, caused by the asynchronous nature of execution on Wafer-Scale Cluster. For more details and understanding of this behavior, please refer to Measure throughput of your model.

Out of Memory Errors and System Resources

View our guide on troubleshooting memory errors and system resource issues for more information.

Vocabulary Size

If you encounter the following error, your vocabulary size may be too large:

RuntimeError: [enforce fail at alloc_cpu.cpp:66] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 9120000000000 bytes

Vocabularies of up to one million tokens are supported, but compilation may take up to 90 minutes. Large vocabulary sizes have not been fully tested on models with 2.7 billion parameters or more.

When using extremely small vocabularies (fewer than 4 tokens), compilation errors may occur.