cerebras.pytorch.optim#
Contains all Cerebras compliant Optimizer classes.
class cerebras.pytorch.optim.Optimizer(params, defaults, enable_global_step=False)[source]#
Bases: torch.optim.Optimizer, abc.ABC
The abstract Cerebras base optimizer class.
Enforces that the preinitialize method is implemented, wherein the optimizer state is initialized ahead of time.
Parameters:
- params (Union[Iterable[torch.Tensor], Iterable[Dict[str, Any]]]) – Specifies what Tensors should be optimized.
- defaults (Dict[str, Any]) – a dict containing default values of optimization options (used when a parameter group doesn’t specify them).
- enable_global_step (bool) – If True, the optimizer will keep track of the global step for each parameter.
increment_global_step(p)
[source]#
Increases the global step by 1 and returns the current value of the global step tensor in torch.float32 format.
state_dict(*args, **kwargs)
[source]#
load_state_dict(state_dict)
[source]#
register_zero_grad_pre_hook(hook)
[source]#
Register an optimizer zero_grad pre hook which will be called before optimizer zero_grad. It should have the following signature:
hook(optimizer, args, kwargs) -> None or modified args and kwargs
The optimizer argument is the optimizer instance being used. If args and kwargs are modified by the pre-hook, then the transformed values are returned as a tuple containing the new_args and new_kwargs.
Parameters:
hook (Callable) – The user defined hook to be registered.
Returns: a handle that can be used to remove the added hook by calling handle.remove()
Return type: torch.utils.hooks.RemovableHandle
register_zero_grad_post_hook(hook)
[source]#
Register an optimizer zero_grad post hook which will be called after optimizer zero_grad. It should have the following signature:
hook(optimizer, args, kwargs) -> None
The optimizer argument is the optimizer instance being used.
Parameters:
hook (Callable) – The user defined hook to be registered.
Returns: a handle that can be used to remove the added hook by calling handle.remove()
Return type: torch.utils.hooks.RemovableHandle
zero_grad(*args, **kwargs)
[source]#
Runs the optimizer zero_grad method and calls any pre and post hooks.
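A sketch of registering and removing a zero_grad pre hook. The three-argument hook signature is an assumption modeled on the PyTorch optimizer hook convention; only the method names and the returned RemovableHandle are documented above:

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(4, 2)
optimizer = cstorch.optim.SGD(model.parameters(), lr=0.1)

def log_zero_grad(optimizer, args, kwargs):
    # Returning None leaves args/kwargs unchanged; a pre hook may instead
    # return a (new_args, new_kwargs) tuple to replace them.
    print(f"zero_grad called on {type(optimizer).__name__}")

handle = optimizer.register_zero_grad_pre_hook(log_zero_grad)
optimizer.zero_grad()  # the pre hook fires before gradients are cleared
handle.remove()        # detach the hook when it is no longer needed
```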
apply(f)[source]#
Calls the function on self.
visit_state(fn)[source]#
Applies a lambda to each stateful value.
abstract preinitialize()[source]#
The optimizer state must be initialized ahead of time in order to capture the full compute graph in the first iteration. This method must be overridden to perform the state preinitialization.
abstract step(closure=None)#
[source]#
Performs the optimizer step itself. Note that no new state should be created in this function. All state must be created ahead of time in preinitialize and only updated in this method.
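To make the contract concrete, here is a minimal, hypothetical subclass sketch (not part of the package): all state tensors are allocated in preinitialize and only updated in place inside step. Plain torch.zeros_like is used for illustration; the shipped optimizers may rely on additional Cerebras-specific helpers.

```python
import torch
import cerebras.pytorch as cstorch


class SimpleMomentumSGD(cstorch.optim.Optimizer):
    """Hypothetical example optimizer, shown only to illustrate the contract."""

    def __init__(self, params, lr=0.01, momentum=0.9):
        super().__init__(params, defaults=dict(lr=lr, momentum=momentum))

    def preinitialize(self):
        # Allocate all state tensors up front so the first traced step
        # already sees the complete compute graph.
        for group in self.param_groups:
            for p in group["params"]:
                self.state[p]["momentum_buffer"] = torch.zeros_like(p)

    def step(self, closure=None):
        loss = closure() if closure is not None else None
        with torch.no_grad():
            for group in self.param_groups:
                lr, momentum = group["lr"], group["momentum"]
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    buf = self.state[p]["momentum_buffer"]
                    buf.mul_(momentum).add_(p.grad)  # update existing state in place
                    p.add_(buf, alpha=-lr)
        return loss
```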
class cerebras.pytorch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0, maximize=False)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
Adadelta optimizer implemented to perform the required pre-initialization of the optimizer state.
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (Optional[Callable]) – A closure that reevaluates the model and returns the loss.
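Construction and use mirror the torch.optim equivalent. A small illustrative snippet with a placeholder model and data:

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(4, 2)                   # placeholder model
optimizer = cstorch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.9)

loss = model(torch.randn(8, 4)).pow(2).mean()   # placeholder loss
loss.backward()
optimizer.step()       # optimizer state was already allocated by preinitialize
optimizer.zero_grad()
```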
class cerebras.pytorch.optim.Adafactor(params, lr, eps=(1e-30, 0.001), clip_threshold=1.0, decay_rate=-0.8, beta1=None, weight_decay=0.0, scale_parameter=True, relative_step=False, warmup_init=False)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
Adafactor optimizer implemented to conform to execution within the constraints of the Cerebras WSE.
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (Callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-06, maximize=False)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
Adagrad optimizer implemented to conform to execution within the constraints of the Cerebras WSE.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-2)
- lr_decay (float, optional) – learning rate decay (default: 0)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
- eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
- maximize (bool, optional) – maximize the params based on the objective, instead of minimizing (default: False)
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.Adamax(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0, maximize=False)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
Adamax optimizer implemented to perform the required pre-initialization of the optimizer state.
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (Optional[Callable]) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0, amsgrad=False)[source]#
Bases: cerebras.pytorch.optim.AdamBase.AdamBase
Adam specific overrides to AdamBase
handle_weight_decay(param_groups)[source]#
load_state_dict(state_dict)[source]#
Loads the optimizer state.
Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict.
Adds checkpoint compatibility with the Adam from PyTorch
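For example, state produced by torch.optim.Adam can be loaded directly; a sketch:

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(4, 2)

# State produced by the vanilla PyTorch optimizer...
torch_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
model(torch.randn(8, 4)).sum().backward()
torch_opt.step()

# ...can be loaded into the Cerebras Adam thanks to the compatibility layer.
cs_opt = cstorch.optim.Adam(model.parameters(), lr=1e-3)
cs_opt.load_state_dict(torch_opt.state_dict())
```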
class cerebras.pytorch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0, correct_bias=True, amsgrad=False)[source]#
Bases: cerebras.pytorch.optim.AdamBase.AdamBase
AdamW specific overrides to AdamBase
load_state_dict(state_dict)[source]#
Loads the optimizer state.
Parameters: state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict.
Adds checkpoint compatibility with the AdamW from HuggingFace
class cerebras.pytorch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0, maximize=False)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
ASGD optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state.
For more details, see https://dl.acm.org/citation.cfm?id=131098
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (Callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.Lamb(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0, adam=False)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
Implements Lamb algorithm. It has been proposed in Large Batch Optimization for Deep Learning: Training BERT in 76 minutes.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
- adam (bool, optional) – always use trust ratio = 1, which turns this into Adam. Useful for comparison purposes.
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.Lion(params, lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
Implements Lion algorithm. As proposed in Symbolic Discovery of Optimization Algorithms.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-4)
- betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.99))
- weight_decay (float, optional) – weight decay coefficient (default: 0)
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.NAdam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
Implements NAdam algorithm to execute within the constraints of the Cerebras WSE, including pre-initializing optimizer state.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 2e-3)
- betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
- momentum_decay (float, optional) – momentum decay (default: 4e-3)
- foreach (bool, optional) – whether a foreach implementation of the optimizer is used (default: None)
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-06, weight_decay=0.0)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
RAdam optimizer implemented to conform to execution within the constraints of the Cerebras WSE.
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-3)
- betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
- weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
RMSprop optimizer implemented to perform the required pre-initialization of the optimizer state.
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.Rprop(params, lr=0.001, etas=(0.5, 1.2), step_sizes=(1e-06, 50.0))[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
Rprop optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state
Parameters:
- params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional) – learning rate (default: 1e-3)
- etas (Tuple[float, float], optional) – step size multipliers
- step_sizes (Tuple[float, float], optional) – Tuple of min, max step size values. Step size is clamped to be between these values.
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
class cerebras.pytorch.optim.SGD(params, lr, momentum=0, dampening=0, weight_decay=0, nesterov=False, maximize=False)[source]#
Bases: cerebras.pytorch.optim.optimizer.Optimizer
SGD optimizer implemented to conform to execution within the constraints of the Cerebras WSE, including pre-initializing optimizer state
Parameters:
- params (Iterable[torch.nn.Parameter]) – Model parameters
- lr (float) – The learning rate to use
- momentum (float) – momentum factor
- dampening (float) – dampening for momentum
- weight_decay (float) – weight decay (L2 penalty)
- nesterov (bool) – enables Nesterov momentum
preinitialize()[source]#
Allocates tensors for the optimizer state to allow direct compilation of the model before the first step.
step(closure=None)#
Performs a single optimization step.
Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.
optim helpers#
Contains all Cerebras compliant Optimizer classes.
cerebras.pytorch.optim.configure_optimizer(optimizer_type, params, **kwargs)
[source]#
Configures and returns an Optimizer specified using the provided optimizer type.
The optimizer class's signature is inspected and relevant parameters are extracted from the keyword arguments.
Parameters:
- optimizer_type (str) – The name of the optimizer to configure
- params – The model parameters passed to the optimizer
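A sketch of how this helper might be called. The keyword arguments shown are ones documented for the Adam class above; per the description, only parameters relevant to the chosen optimizer are extracted:

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(4, 2)

optimizer = cstorch.optim.configure_optimizer(
    optimizer_type="Adam",          # name of an optimizer class on this page
    params=model.parameters(),
    lr=1e-3,
    weight_decay=0.01,
)
```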
Deprecated since version 2.3: Use configure_scheduler instead.
cerebras.pytorch.optim.configure_lr_scheduler(optimizer, learning_rate, adjust_learning_rate=None)
[source]#
Configures a learning rate scheduler specified using the provided lr_scheduler type.
The learning rate scheduler class's signature is inspected and relevant parameters are extracted from the keyword arguments.
Parameters:
- optimizer – The optimizer passed to the lr_scheduler
- learning_rate – learning rate schedule
- adjust_learning_rate (dict) – key: layer types, val: lr scaling factor
learning_rate parameter formats:
1. learning_rate is a Python scalar (int or float): configure_lr_scheduler returns an instance of ConstantLR with the provided value as the constant learning rate.
2. learning_rate is a dictionary: the dictionary is expected to contain the key scheduler, which names the scheduler to configure. The rest of the entries in the dictionary are passed as keyword arguments to the specified scheduler's init method.
3. learning_rate is a list of dictionaries: a SequentialLR is configured unless any one of the dictionaries contains the key main_scheduler with the corresponding value ChainedLR. In either case, each element of the list is expected to be a dictionary that follows the format outlined in case 2. If what is being configured is indeed a SequentialLR, each dictionary entry is also expected to contain the key total_iters specifying the total number of iterations each scheduler should be applied for.
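An illustrative sketch of the three formats. The scheduler names used in the dictionaries are assumed to match the class names listed later on this page; some releases may also accept shortened names:

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(4, 2)
optimizer = cstorch.optim.SGD(model.parameters(), lr=0.1)

# 1. Scalar: held constant via ConstantLR.
scheduler = cstorch.optim.configure_lr_scheduler(optimizer, learning_rate=0.01)

# 2. Dictionary: the "scheduler" key names the scheduler, the rest are kwargs.
scheduler = cstorch.optim.configure_lr_scheduler(
    optimizer,
    learning_rate={
        "scheduler": "ExponentialLR",
        "initial_learning_rate": 0.01,
        "decay_rate": 0.8,
        "total_iters": 1000,
    },
)

# 3. List of dictionaries: configured as a SequentialLR; each entry also
#    carries total_iters to delimit how long it is applied.
scheduler = cstorch.optim.configure_lr_scheduler(
    optimizer,
    learning_rate=[
        {"scheduler": "LinearLR",
         "initial_learning_rate": 0.0,
         "end_learning_rate": 0.01,
         "total_iters": 100},
        {"scheduler": "CosineDecayLR",
         "initial_learning_rate": 0.01,
         "end_learning_rate": 0.0,
         "total_iters": 900},
    ],
)
```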
configure_optimizer_params(optimizer_type, kwargs)[source]#
Configures and returns an Optimizer specified using the provided optimizer type.
The optimizer class’s signature is inspected and relevant parameters are extracted from the keyword arguments.
Parameters:
- optimizer_type (str) – The name of the optimizer to configure
- kwargs (dict) – Flattened optimizer params
configure_scheduler_params(learning_rate)[source]#
Get the kwargs and LR class from params
Parameters: learning_rate (dict) – learning rate config
Returns: LR class and args
Return type: cls, kw_args
cerebras.pytorch.optim.configure_scheduler(optimizer, schedulers_params)[source]#
Configures a generic scheduler from scheduler params. The scheduler class’ signature is inspected and relevant parameters are extracted from the keyword arguments.
Parameters:
- optimizer – The optimizer passed to each scheduler.
- schedulers_params (dict) – A dict of scheduler params.
schedulers_params is expected to be a dictionary with a single key corresponding to the name of a Scheduler. The value at this key is a sub-dictionary containing key-value pairs matching the arguments of the scheduler (except optimizer).
Note that for SequentialLR and SequentialWD, milestones is calculated by the function and can be ignored.
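An illustrative sketch of the expected schedulers_params layout, using the documented CosineDecayLR arguments; the exact spelling of the scheduler name key is an assumption based on the description above:

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(4, 2)
optimizer = cstorch.optim.SGD(model.parameters(), lr=0.1)

# Single key = scheduler name; sub-dictionary = that scheduler's arguments
# (everything except `optimizer`).
scheduler = cstorch.optim.configure_scheduler(
    optimizer,
    {"CosineDecayLR": {
        "initial_learning_rate": 0.01,
        "end_learning_rate": 0.001,
        "total_iters": 1000,
    }},
)
```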
Generic Scheduler class in cerebras.pytorch#
optim.scheduler.Scheduler#
class cerebras.pytorch.optim.scheduler.Scheduler(optimizer, total_iters, last_epoch=-1, param_group_tags=None)[source]#
Generic scheduler class for various optimizer params.
Parameters:
- optimizer – The optimizer to schedule
- total_iters – Number of steps to perform the decay
- last_epoch – the initial step to start at
- param_group_tags – param group tags to target update for
_get_closed_form()[source]#
abstract property param_group_key#
Key of the param group value to modify. For example, 'lr' or 'weight_decay'.
get()[source]#
state_dict()[source]#
load_state_dict(state_dict)[source]#
increment_last_epoch()[source]#
Increments the last epoch by 1
step(*args, **kwargs)[source]#
Steps the scheduler and computes the latest value.
Only sets the last_epoch if running on CS.
update_last_value()[source]#
update_groups(values)[source]#
Update the optimizer groups with the latest values.
get_last_value()[source]#
Return last computed value by current scheduler.
Learning Rate Schedulers in cerebras.pytorch#
Available learning rate schedulers in the cerebras.pytorch package.
optim.lr_scheduler.LRScheduler#
class cerebras.pytorch.optim.lr_scheduler.LRScheduler(*args, **kwargs)[source]#
property param_group_key#
get_last_lr()[source]#
Return last computed learning rate by current scheduler.
get_lr()[source]#
optim.lr_scheduler.ConstantLR#
class cerebras.pytorch.optim.lr_scheduler.ConstantLR(*args, **kwargs)[source]#
Maintains a constant learning rate for each parameter group (no decaying).
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- val – The learning_rate value to maintain
- total_iters (int) – The number of steps to decay for
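A sketch of constructing and stepping the scheduler; the keyword names follow the parameter list above and should be verified against your release:

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(4, 2)
optimizer = cstorch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = cstorch.optim.lr_scheduler.ConstantLR(optimizer, val=0.05, total_iters=1000)

model(torch.randn(8, 4)).sum().backward()
optimizer.step()
scheduler.step()                   # advance the schedule after the optimizer step
last_lr = scheduler.get_last_lr()  # last learning rate computed by the scheduler
```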
optim.lr_scheduler.PolynomialLR#
class cerebras.pytorch.optim.lr_scheduler.PolynomialLR(*args, **kwargs)[source]#
Decays the learning rate of each parameter group using a polynomial function in the given total_iters.
This class is similar to the Pytorch PolynomialLR LRS.
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
- end_learning_rate (float) – The final learning rate
- total_iters (int) – Number of steps to perform the decay
- power (float) – Exponent to apply to "x" (as in y = mx + b, where x is the ratio of step completion; 1 for linear). Default: 1.0 (only linear is supported at the moment)
- cycle (bool) – Whether to cycle
property initial_val#
property end_val#
optim.lr_scheduler.LinearLR#
class cerebras.pytorch.optim.lr_scheduler.LinearLR(*args, **kwargs)[source]#
Alias for Polynomial LR scheduler with a power of 1
property initial_val#
property end_val#
optim.lr_scheduler.ExponentialLR#
class cerebras.pytorch.optim.lr_scheduler.ExponentialLR(*args, **kwargs)[source]#
Decays the learning rate of each parameter group by decay_rate every step.
This class is similar to the Pytorch ExponentialLR LRS.
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
- total_iters (int) – Number of steps to perform the decay
- decay_rate (float) – The decay rate
- staircase (bool) – If True decay the learning rate at discrete intervals
property initial_val#
optim.lr_scheduler.InverseExponentialTimeDecayLR#
class cerebras.pytorch.optim.lr_scheduler.InverseExponentialTimeDecayLR(*args, **kwargs)[source]#
Decays the learning rate inverse-exponentially over time, as described in the Keras InverseTimeDecay class.
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
- step_exponent (int) – Exponential value.
- total_iters (int) – Number of steps to perform the decay.
- decay_rate (float) – The decay rate.
- staircase (bool) – If True decay the learning rate at discrete intervals.
property initial_val#
optim.lr_scheduler.InverseSquareRootDecayLR#
class cerebras.pytorch.optim.lr_scheduler.InverseSquareRootDecayLR(*args, **kwargs)[source]#
Decays the learning rate inverse-squareroot over time.
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
- scale (float) – Multiplicative factor to scale the result.
- warmup_steps (int) – use initial_learning_rate for the first warmup_steps.
property initial_val#
optim.lr_scheduler.CosineDecayLR#
class cerebras.pytorch.optim.lr_scheduler.CosineDecayLR(*args, **kwargs)[source]#
Applies the cosine decay schedule as described in the Keras CosineDecay class.
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
- end_learning_rate (float) – The final learning rate
- total_iters (int) – Number of steps to perform the decay
optim.lr_scheduler.SequentialLR#
class cerebras.pytorch.optim.lr_scheduler.SequentialLR(*args, **kwargs)[source]#
Receives a list of schedulers that are expected to be called sequentially during the optimization process, and milestone points that provide the exact intervals reflecting which scheduler is supposed to be called at a given step.
This class is a wrapper around the Pytorch SequentialLR LRS.
Parameters:
- optimizer (torch.optim.Optimizer) – Wrapped optimizer
- schedulers (list) – List of chained schedulers.
- milestones (list) – List of integers that reflects milestone points.
- last_epoch (int) – The index of last epoch. Default: -1.
optim.lr_scheduler.PiecewiseConstantLR#
class cerebras.pytorch.optim.lr_scheduler.PiecewiseConstantLR(*args, **kwargs)[source]#
Adjusts the learning rate to a predefined constant at each milestone and holds this value until the next milestone. Notice that such adjustment can happen simultaneously with other changes to the learning rate from outside this scheduler.
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- learning_rates (List[float]) – List of learning rates to maintain before/during each milestone.
- milestones (List[int]) – List of step indices. Must be increasing.
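For instance, a schedule that holds 0.01 until step 1000, 0.001 until step 2000, and 0.0001 afterwards might be configured as follows (keyword names taken from the parameter list above):

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(4, 2)
optimizer = cstorch.optim.SGD(model.parameters(), lr=0.01)

scheduler = cstorch.optim.lr_scheduler.PiecewiseConstantLR(
    optimizer,
    learning_rates=[0.01, 0.001, 0.0001],  # one more rate than milestones
    milestones=[1000, 2000],               # step indices, strictly increasing
)
```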
optim.lr_scheduler.MultiStepLR#
class cerebras.pytorch.optim.lr_scheduler.MultiStepLR(*args, **kwargs)[source]#
Decays the learning rate of each parameter group by gamma once the number of steps reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.
This class is similar to the Pytorch MultiStepLR LRS.
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
- gamma (float) – Multiplicative factor of learning rate decay.
- milestones (List[int]) – List of step indices. Must be increasing.
property initial_val#
optim.lr_scheduler.StepLR#
class cerebras.pytorch.optim.lr_scheduler.StepLR(*args, **kwargs)[source]#
Decays the learning rate of each parameter group by gamma every step_size. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler.
This class is similar to the Pytorch StepLR LRS.
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
- step_size (int) – Period of decay.
- gamma (float) – Multiplicative factor of decay.
property initial_val#
optim.lr_scheduler.CosineAnnealingLR#
class cerebras.pytorch.optim.lr_scheduler.CosineAnnealingLR(*args, **kwargs)[source]#
Set the learning rate of each parameter group using a cosine annealing schedule, where $$\eta_{\text{max}}$$ is set to the initial lr and $$T_{\text{cur}}$$ is the number of steps since the last restart in SGDR:
$$\eta_t = \eta_{\text{min}} + \frac{1}{2}\left(\eta_{\text{max}} - \eta_{\text{min}}\right)\left(1 + \cos\left(\frac{T_{\text{cur}}}{T_{\text{max}}}\pi\right)\right)$$
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
- T_max (int) – Maximum number of iterations.
- eta_min (float) – Minimum learning rate.
property initial_val#
optim.lr_scheduler.LambdaLR#
class cerebras.pytorch.optim.lr_scheduler.LambdaLR(*args, **kwargs)[source]#
Sets the learning rate of each parameter group to the initial lr times a given function (which is specified by overriding set_value_lambda).
Parameters:
- optimizer (torch.optim.Optimizer) – The optimizer to schedule
- initial_learning_rate (float) – The initial learning rate.
property initial_val#
optim.lr_scheduler.CosineAnnealingWarmRestarts#
class cerebras.pytorch.optim.lr_scheduler.CosineAnnealingWarmRestarts(*args, **kwargs)[source]#
Set the learning rate of each parameter group using a cosine annealing schedule, where $$\eta_{\text{max}}$$ is set to the initial lr, $$T_{\text{cur}}$$ is the number of steps since the last restart, and $$T_i$$ is the number of steps between two warm restarts in SGDR:
$$\eta_t = \eta_{\text{min}} + \frac{1}{2}\left(\eta_{\text{max}} - \eta_{\text{min}}\right)\left(1 + \cos\left(\frac{T_{\text{cur}}}{T_i}\pi\right)\right)$$