- Loss Scaling: The loss value is multiplied by a scaling factor before backpropagation. This artificially inflates the gradients, preventing them from shrinking to zero in FP16.
- Backpropagation: Gradients are computed using the scaled loss value, ensuring their magnitude remains high during backpropagation.
- Unscaling: After backpropagation, the gradients are divided by the same scaling factor used in step 1 before the optimizer applies them. This reverses the artificial inflation, ensuring accurate updates to the network weights (a minimal sketch follows the bullet points below).
- Prevents gradient vanishing: Maintains gradient information during backpropagation, leading to improved training progress.
- Improves training stability: Reduces divergence and stalling, leading to smoother convergence.
- Simplifies mixed-precision training: Eliminates the need for manual loss scale tuning.
- Boosts performance: Can achieve faster training times with less memory usage compared to full FP32 training.
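To make the three steps above concrete, here is a minimal sketch of the scale → backpropagate → unscale pattern in plain PyTorch. This is a framework-agnostic illustration, not the Cerebras API; the model, tensors, and fixed scale factor are made up for this example:

```python
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

scale = 2.0 ** 15  # illustrative fixed scaling factor

inputs, targets = torch.randn(8, 16), torch.randn(8, 1)
loss = loss_fn(model(inputs), targets)

# Step 1: multiply the loss by the scaling factor before backpropagation.
# Step 2 happens inside backward(): gradients are computed from the scaled
# loss, so small values stay representable in FP16.
(loss * scale).backward()

# Step 3: divide the gradients by the same factor before the optimizer step.
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(scale)

optimizer.step()
optimizer.zero_grad()
```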
Supported Precision
Dynamic Loss Scaling should be used when the `fp16_type` is either `float16` or `cbfloat16`. It is not needed for `bfloat16`. For supported precision formats on the Cerebras Wafer-Scale cluster, see Control numerical precision level.
Enable Dynamic Loss Scaling
Dynamic Loss Scaling is available for training models with `cbfloat16` precision. This can improve training speed and stability. To activate the Dynamic Loss Scaling functionality, set the value of `loss_scaling_factor` in the Trainer YAML configuration under the precision settings:
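As a sketch only, the snippet below assumes a Trainer YAML layout where the precision options live under `trainer.init.precision`; other settings are omitted and the exact nesting in your configuration may differ:

```yaml
trainer:
  init:
    precision:
      # Use a 16-bit format that benefits from loss scaling.
      fp16_type: cbfloat16
      # "dynamic" activates Dynamic Loss Scaling.
      loss_scaling_factor: dynamic
```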
If you are loading a checkpoint from a model previously trained with `bfloat16` precision without loss scaling, you need to include the `--load_checkpoint_states` flag (or its equivalent in your run configuration) to make sure the parameters are loaded correctly from the `params.yaml` file.
Once you’ve loaded your model and trained it with the new dynamic loss scaling, any checkpoints you save afterwards will automatically include this feature and won’t require the special flag anymore.
Enable Dynamic Loss Scaling with the GradScaler Module
Dynamic Loss Scaling offers flexible configuration through the `cstorch.amp.GradScaler` module. Supported parameters include:
- `loss_scale`: Set to `"dynamic"` to activate dynamic scaling.
- `initial_loss_scale`: Defines the starting scaling factor. Default value: `2e15`.
- `steps_per_increase`: Controls the frequency of scaling factor increments. Default value: `2000`.
- `min_loss_scale`: Sets the lower bound for the scaling factor. Default value: `2e-14`.
- `max_loss_scale`: Sets the upper bound for the scaling factor. Default value: `2e15`.
- `max_gradient_norm`: For dynamic loss scaling with global gradient clipping.
Configure these options by passing the appropriate arguments to the `amp.GradScaler` constructor during initialization:
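For illustration, here is a minimal sketch of such an initialization. It assumes `cerebras.pytorch` is imported as `cstorch` and that the constructor accepts the keyword arguments listed above; the values shown are simply the documented defaults, passed explicitly:

```python
import cerebras.pytorch as cstorch

# Sketch: construct a GradScaler with dynamic loss scaling enabled.
# Keyword names follow the parameter list above; values are the documented
# defaults, spelled out here for clarity.
grad_scaler = cstorch.amp.GradScaler(
    loss_scale="dynamic",
    initial_loss_scale=2e15,
    steps_per_increase=2000,
    min_loss_scale=2e-14,
    max_loss_scale=2e15,
)
```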
During the training loop, use `amp.GradScaler` to automatically adjust the loss value (scaling it up or down) before feeding it to the optimizer. This helps maintain numerical stability and can improve training speed. See the code below:
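The following is a sketch only, assuming `cstorch.amp.GradScaler` follows the familiar scale/step/update pattern of PyTorch gradient scalers; the model, optimizer, loss function, and data names are placeholders defined elsewhere in your training script:

```python
# Inside the training step (model, optimizer, loss_fn, inputs, and targets
# are assumed to be defined elsewhere).
loss = loss_fn(model(inputs), targets)

optimizer.zero_grad()

# Scale the loss before backpropagation so small gradients stay representable.
grad_scaler.scale(loss).backward()

# step() unscales the gradients and applies the optimizer update, skipping
# the update if an overflow (inf/NaN) is detected in the gradients.
grad_scaler.step(optimizer)

# update() adjusts the loss scale for the next iteration.
grad_scaler.update()
```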