SparsityAlgorithm is the abstract base class that all sparsity algorithms should derive from.
class cerebras.pytorch.sparse.SparsityAlgorithm(sparsity, init_method='random')[source]#
Parameters:
sparsity – The level of sparsity to apply to each parameter. This may be given as a HyperParameterSchedule. If a dictionary is passed in, then it is automatically converted to a HyperParameterSchedule.
init_method – The method to use to initialize the sparsity mask. See make_init_method for more details.
property num_sparse_params: int#
Return the number of parameters that have been sparsified by this algorithm.
get_sparse_params
(obj)[source]#
Get all sparse parameters that were sparsified by this algorithm.
Parameters:
obj (Union[torch.Tensor, torch.nn.Module, torch.optim.Optimizer]) – The object to get sparse parameters from.
Returns:
If obj is a Tensor, returns the sparse parameter associated with that tensor (if any). If obj is a Module, returns an iterator over all sparse parameters of the module and its submodules recursively. If obj is an Optimizer, returns an iterator over all sparse parameters associated with the optimizer's param groups.
Return type:
Union[cerebras.pytorch.sparse.base.SparseParameter, Generator[cerebras.pytorch.sparse.base.SparseParameter, None, None]]
initialize
()[source]#
Initialize the sparsity pattern for all parameters sparsified by this algorithm.
csx_annotate_sparsity
(param)[source]#
Annotate the parameter with hints about the sparsity pattern.
These hints are used as performance hints for the Cerebras compiler.
Parameters: param (cerebras.pytorch.sparse.base.SparseParameter) – The sparse parameter to annotate with hints.
property sparsity: Dict[torch.Tensor, cerebras.pytorch.sparse.utils.HyperParameterSchedule]#
Return the mapping between a parameter and its sparsity schedule.
sparsify_parameter
(module, name, param)[source]#
Initialize the mask for a parameter in the given module.
Parameters:
module (torch.nn.Module) – The module that owns the parameter
name (str) – The name of the parameter within the module
param (torch.nn.Parameter) – The parameter to sparsify
apply
(obj)[source]#
Sparsify the passed in object.
Note: this is what gets called when invoking module.apply(sparsity) or optimizer.apply(sparsity).
Parameters:
obj (Union[torch.nn.Module, cstorch.optim.Optimizer]) – the torch.nn.Module or cstorch.optim.Optimizer object to sparsify.
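For example, a minimal usage sketch (the cstorch.sparse.Static import path and the toy model are assumptions for illustration):

```python
import torch
import cerebras.pytorch as cstorch

model = torch.nn.Linear(128, 64)  # stand-in module for illustration

# Static sparsity at a fixed 50% level; the mask is initialized via init_method.
sparsity = cstorch.sparse.Static(sparsity=0.5)

# Equivalent to sparsity.apply(model): prunes eligible parameters and registers
# hooks so that pruned connections stay pruned during training.
model.apply(sparsity)

# If a cstorch.optim.Optimizer is in use, it should be sparsified as well so
# that its state tensors follow the same masks:
# optimizer.apply(sparsity)
```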
sparsify_module
(module)[source]#
Sparsify the torch.nn.Module object.
Parameters: module (torch.nn.Module) – the torch.nn.Module object to sparsify
prune_weight
(sparse_param)#
Prune the dense weight and register a hook to prune the gradients.
_grad_hook
(p, grad)[source]#
Hook to prune the gradients after backward().
sparsify_optimizer
(optimizer)[source]#
Sparsify the torch.optim.Optimizer object.
Parameters: optimizer (torch.optim.Optimizer) – the torch.optim.Optimizer object to sparsify
update
(optimizer=None)[source]#
Update the sparsity patterns of all parameters sparsified by this algorithm.
register_target_sparsity_hook
(hook)[source]#
Register a hook which will be called when a new target sparsity is computed. It should have the following signature:
hook(sparsity, name, target)
The sparsity argument is the sparsity instance being used. The name argument is the name of the group of parameters that the target sparsity is being computed for. The target argument is the computed target sparsity value.
Parameters: hook (Callable) – The user defined hook to be registered.
Returns: a handle that can be used to remove the added hook by calling handle.remove()
Return type: torch.utils.hooks.RemovableHandle
register_computed_sparsity_hook
(hook)[source]#
Register a hook which will be called when a new sparsity mask is computed. It should have the following signature:
hook(sparsity, name, computed)
The sparsity argument is the sparsity instance being used. The name argument is the name of the parameter that the mask belongs to. The computed argument is the calculated sparsity level of the newly computed mask.
Parameters:
hook (Callable) – The user defined hook to be registered.
Returns: a handle that can be used to remove the added hook by calling handle.remove()
Return type: torch.utils.hooks.RemovableHandle
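For example, a minimal sketch of registering a hook that logs mask statistics (the cstorch.sparse.SET path and the logging behaviour are illustrative assumptions):

```python
import cerebras.pytorch as cstorch
from cerebras.pytorch.sparse.utils import FreqSchedule

# SET keeps the sparsity level constant but drops/regrows connections every
# 100 steps, so the computed-sparsity hook fires at each update step.
sparsity = cstorch.sparse.SET(
    sparsity=0.9,
    update=FreqSchedule(freq=100),
    drop_fraction=0.3,
)

def log_computed_sparsity(algo, name, computed):
    # `computed` is the measured sparsity level of the newly computed mask.
    print(f"{name}: computed sparsity = {float(computed):.3f}")

handle = sparsity.register_computed_sparsity_hook(log_computed_sparsity)
# ... apply to the model/optimizer and train ...
handle.remove()  # detach the hook once it is no longer needed
```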
visit_state
(f)[source]#
Apply a callable to the stateful tensors.
state_dict
()[source]#
Return a dictionary of all stateful tensors.
load_state_dict
(state_dict)[source]#
Load the state of all stateful tensors.
Static
(sparsity=None, **kwargs)[source]#
Bases: cerebras.pytorch.sparse.base.SparsityAlgorithm
Constructs a Static sparsity instance.
Parameters: sparsity (Optional[float]) – A float specifying the level of sparsity to apply to each parameter
DynamicSparsityAlgorithm
(sparsity=None, update=None, **kwargs)[source]#
Bases: cerebras.pytorch.sparse.base.SparsityAlgorithm, abc.ABC
Constructs a DynamicSparsityAlgorithm instance.
Parameters:
sparsity – The level of sparsity (or sparsity schedule) to apply to each parameter, as in SparsityAlgorithm.
update – The schedule to use for updating the sparsity pattern, e.g. a FreqSchedule or a ListSchedule. If not provided, the sparsity pattern will be updated every step.
property is_update_step: torch.BoolTensor#
Returns a boolean tensor indicating whether the current step is an update step according to the update schedule.
update_mask
(p, mask, sparsity)[source]#
Compute an updated sparsity mask for the given parameter.
GMP
(**kwargs)[source]#
Bases: cerebras.pytorch.sparse.dynamic.DynamicSparsityAlgorithm
Implements Gradual Magnitude Pruning
Sparsity increases monotonically based on weight magnitude.
See: https://arxiv.org/abs/1710.01878
Parameters: **kwargs – All arguments are passed to the DynamicSparsityAlgorithm's constructor.
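For example, a minimal sketch (the import paths, the linearly increasing sparsity schedule, and the update frequency are illustrative assumptions):

```python
import cerebras.pytorch as cstorch
from cerebras.pytorch.sparse.utils import FreqSchedule, Linear

# A linearly increasing target sparsity (y(step) = init + step * slope)
# combined with an update schedule that recomputes the masks every 1000 steps.
sparsity = cstorch.sparse.GMP(
    sparsity=Linear(init=0.0, slope=5e-5),
    update=FreqSchedule(freq=1000),
)

model.apply(sparsity)      # `model` is assumed to be a torch.nn.Module
optimizer.apply(sparsity)  # `optimizer` is assumed to be a cstorch.optim.Optimizer
```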
SET
(drop_fraction=0.3, **kwargs)[source]#
Bases: cerebras.pytorch.sparse.dynamic.DynamicSparsityAlgorithm
Implements Sparse Evolutionary Training (SET)
Sparsity levels stay constant throughout training, but the lowest magnitude weights are pruned and then regrown randomly.
See: https://arxiv.org/abs/1707.04780
Parameters:
drop_fraction (float) – The fraction of currently present connections to drop (and regrow) at each update step.
**kwargs – All other arguments are passed to the DynamicSparsityAlgorithm's constructor.
RigL
(drop_fraction=0.3, balance_in_groups=None, balance_out_groups=None, **kwargs)[source]#
Bases: cerebras.pytorch.sparse.dynamic.DynamicSparsityAlgorithm
Implements Rigging the Lottery (RigL)
Sparsity levels stay constant throughout training, but the lowest magnitude weights are pruned and then regrown using a proxy measure of where a pruned connection would have had the most impact by finding the highest magnitude (dense) gradients of pruned weights.
See: https://arxiv.org/abs/1911.11134
Parameters:
drop_fraction (float) – The fraction of currently present connections to drop (and regrow) at each update step.
balance_in_groups – If given, the number of groups used to balance sparsity across input groups; see InputGroupScoreShaper.
balance_out_groups – If given, the number of groups used to balance sparsity across output groups; see OutputGroupScoreShaper.
**kwargs – All other arguments are passed to the DynamicSparsityAlgorithm's constructor.
Group
[source]#
Bases: cerebras.pytorch.sparse.base.SparsityAlgorithm
Group sparsity algorithm. This algorithm allows for multiple sparsity algorithms to be applied to different groups of parameters.
For example, several algorithms can be added to a group and each applied to a different subset of parameters; see the sketch below and see add for more details.
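A minimal sketch (the Group constructor, filter forms, and parameter names are assumptions for illustration):

```python
import cerebras.pytorch as cstorch

group = cstorch.sparse.Group()

# Keep attention projection weights at a fixed 50% sparsity...
group.add(
    lambda name, param: "attn" in name,          # filter over (name, param)
    cstorch.sparse.Static(sparsity=0.5),
)
# ...while feed-forward weights use SET at 90% sparsity.
group.add(
    lambda name, param: "ffn" in name,
    cstorch.sparse.SET(sparsity=0.9, drop_fraction=0.3),
)

model.apply(group)  # `model` is assumed to be a torch.nn.Module
```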
add(filter, algorithm)[source]#
Add a sparsity algorithm to the group.
Parameters:
filter – A filter (e.g. a parameter-name pattern or a callable taking (name, param)) that selects which parameters the algorithm applies to.
algorithm (SparsityAlgorithm) – The sparsity algorithm to apply to the selected parameters.
extend
(group)[source]#
Extend the group with the filters and algorithms from another group.
Parameters:
group (cerebras.pytorch.sparse.group.Group) – An instance of Group
The main entry point is configure, which will configure a sparsity algorithm and return it. The config dictionary follows the same form as given in Sparsity via YAML.
configure
(config)[source]#
If param_filter is not provided in the config, the following default param filter gets applied.
cerebras.pytorch.sparse.configure.default_sparse_param_filter(name, param)[source]#
Return True if the given parameter should be sparse.
Only returns True if the parameter is more than 1D and is not an embedding, norm, lm_head, or pe_helper parameter.
Parameters:
name (str) – The name of the parameter
param (torch.nn.Parameter) – The parameter itself
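Putting the above together, a hedged sketch of building an algorithm from a config dictionary (the key names here are assumptions; the authoritative form is given in Sparsity via YAML):

```python
import cerebras.pytorch as cstorch

# Hypothetical config dict; see "Sparsity via YAML" for the supported keys.
sparsity = cstorch.sparse.configure(
    {
        "algorithm": "rigl",
        "sparsity": 0.9,
        "update": {"freq": 1000},
        "init_method": "topk",
    }
)

model.apply(sparsity)      # `model` is assumed to be a torch.nn.Module
optimizer.apply(sparsity)  # `optimizer` is assumed to be a cstorch.optim.Optimizer
```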
cerebras.pytorch.sparse.init#
Initialization methods used to construct the initial sparsity mask for a SparsityAlgorithm.
cerebras.pytorch.sparse.init.random
(p, sparsity, score_shaper=None, device=None)[source]#
Uniformly random sparsity pattern.
A score tensor with the same shape as the parameter is randomly generated with values between 0.0 and 1.0. The mask is then created by taking the top-k
of the score tensor, where k is determined by the sparsity level.
cerebras.pytorch.sparse.init.topk
(p, sparsity, score_shaper=None, device=None)[source]#
Prune lowest magnitude weights.
cerebras.pytorch.sparse.init.from_zeros
(p, sparsity, score_shaper=None, device=None)[source]#
Any zeros currently in the weights represent pruned connections. NOTE: Doesn't actually honor the configured sparsity.
cerebras.pytorch.sparse.init.checkerboard
(p, sparsity, score_shaper=None, device=None)[source]#
Mostly for stress and performance testing, creates a sparsity mask that is maximally distributed in a checkerboard across the weight.
cerebras.pytorch.sparse.init.make_init_method
(init_method)[source]#
Returns the corresponding init method callable for the given init_method.
Parameters:
init_method (Union[str, Callable[[torch.nn.Parameter, torch.FloatTensor, Optional[cerebras.pytorch.sparse.utils.ScoreShaper], Optional[torch.device]], torch.BoolTensor]]) –
The method to use to initialize the sparsity mask. This can be a string or a callable.
If a string, it must be one of:
- "random": Randomly initialize the mask
- "topk": Prune the lowest magnitude weights
- "from_zeros": Any zeros in the weights represent pruned connections
- "checkerboard": Creates a sparsity mask that is maximally distributed across the weight
If a callable, it must have the signature init_method(param, sparsity, score_shaper=None, device=None), where:
- param is the original dense parameter
- sparsity is the sparsity level
- score_shaper is an optional callable that can be used to reshape the mask
- device is optionally the device to use to initialize the mask
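As a sketch of the callable form (a hypothetical initializer, assuming the signature described above; it scores weights by magnitude plus a small random perturbation):

```python
import torch
from cerebras.pytorch.sparse.utils import make_mask_topk_sparsity

def magnitude_plus_noise(param, sparsity, score_shaper=None, device=None):
    # Hypothetical custom init method: keep the top (1 - sparsity) fraction of
    # weights ranked by magnitude plus a small random tie-breaking perturbation.
    score = param.detach().abs()
    if device is not None:
        score = score.to(device)
    score = score + 0.01 * torch.rand_like(score)
    return make_mask_topk_sparsity(score, sparsity, score_shaper)

# The callable can then be passed as init_method=magnitude_plus_noise, or
# normalized through make_init_method(magnitude_plus_noise).
```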
cerebras.pytorch.sparse.utils#
class cerebras.pytorch.sparse.utils.HyperParameterSchedule
[source]#
Base class for step-aware hyperparameters used in Sparsity Optimizers.
abstract compute
(step)[source]#
Return a torch.Tensor with the value of the hyperparameter at the given step.
Parameters:
step (torch.Tensor) – int64 tensor holding current step
Returns: torch.Tensor on the device of step with the value of the hyperparameter
Return type: torch.Tensor
update
(is_update_step)[source]#
Given a boolean tensor indicating if this is an update step, update the internal state of this hyperparameter.
Parameters: is_update_step (torch.Tensor) – A boolean tensor indicating if this is an update step.
visit_state(fn)[source]#
Applies a lambda to each stateful value.
get_min_max_end
(begin, end)[source]#
Given a beginning and ending step, compute the statistics of this step-aware hyperparameter. Used for estimating memory requirements for dynamic sparsity.
Returns [min, max, ending].
class cerebras.pytorch.sparse.utils.Constant
(value)[source]#
Bases: cerebras.pytorch.sparse.utils.HyperParameterSchedule
Constant at every step.
Parameters:
value (float) – The constant value of the hyperparameter
class cerebras.pytorch.sparse.utils.Linear
(init, slope)[source]#
Bases: cerebras.pytorch.sparse.utils.HyperParameterSchedule
Linear change from an initial value.
\(y(step) = init + step \cdot slope\)
Parameters:
init (float) – The initial value (at step zero)
slope (float) – The change per step
class cerebras.pytorch.sparse.utils.Exp
(init, gamma, final=1)[source]#
Bases: cerebras.pytorch.sparse.utils.HyperParameterSchedule
Exponential, approaching an asymptotic final value.
Parameters:
init (float) – The initial value
gamma (float) – The exponential rate of approach
final (float) – The asymptotic final value (default 1)
class cerebras.pytorch.sparse.utils.Power
(init, beta)[source]#
Bases: cerebras.pytorch.sparse.utils.HyperParameterSchedule
Power law.
\(y(step) = init \cdot beta^{step}\)
Parameters:
init (float) – The initial value (at step zero)
beta (float) – The base of the power law, raised to the step
class cerebras.pytorch.sparse.utils.Cosine
(init, half_period, minimum=0.0)[source]#
Bases: cerebras.pytorch.sparse.utils.HyperParameterSchedule
Cosine function for oscillating between an initial (maximum) value down to a minimum and back to the maximum every period.
\(y(step) = o + a \cdot \cos(step \cdot \pi / half\_period)\), where \(o = (init + minimum)/2\) and \(a = init - o\).
Parameters:
init (float) – The initial (maximum) value
half_period (int) – The number of steps from the maximum down to the minimum
minimum (float) – The minimum value (default 0.0)
class cerebras.pytorch.sparse.utils.Cycling
(values)[source]#
Bases: cerebras.pytorch.sparse.utils.HyperParameterSchedule
Hyperparameter cycling between discrete values at update steps.
Parameters:
values (List[float]) – A list of discrete values to cycle through
class cerebras.pytorch.sparse.utils.Lambda
(fn)[source]#
Bases: cerebras.pytorch.sparse.utils.HyperParameterSchedule
Invoke a user’s lambda function of step to obtain the hyperparameter.
Parameters:
fn (Callable[[torch.Tensor], torch.Tensor]) – A lambda function that takes a step and returns a hyperparameter
cerebras.pytorch.sparse.utils.make_hyperparam_schedule
(schedule)[source]#
Given some user specified configuration, construct a HyperParameterSchedule object that is step aware.
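A short sketch of constructing and querying schedules (values are illustrative; that a bare float is converted to a Constant by make_hyperparam_schedule is an assumption based on the description above):

```python
import torch
from cerebras.pytorch.sparse.utils import Cosine, Linear, make_hyperparam_schedule

step = torch.tensor(100, dtype=torch.int64)

# Direct construction from the schedule classes documented above.
linear = Linear(init=0.0, slope=1e-3)        # y(step) = 0.0 + step * 1e-3
cosine = Cosine(init=0.3, half_period=1000)  # oscillates between 0.3 and 0.0
print(linear.compute(step), cosine.compute(step))

# make_hyperparam_schedule normalizes a user-specified configuration into a
# HyperParameterSchedule instance.
constant = make_hyperparam_schedule(0.5)
print(constant.compute(step))
```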
class cerebras.pytorch.sparse.utils.FreqSchedule
(freq=1, start=0, stop=None)[source]#
Bases: cerebras.pytorch.sparse.utils.UpdateSchedule
When scheduling sparsity update steps on a regular interval, this class allows configuring the start and stop step in addition to the update frequency.
Parameters:
freq (int) – Update the sparsity pattern every freq steps (default 1)
start (int) – The step at which to start updating the sparsity pattern (default 0)
stop (Optional[int]) – The step at which to stop updating the sparsity pattern (default None)
class cerebras.pytorch.sparse.utils.ListSchedule
(steps)[source]#
Bases: cerebras.pytorch.sparse.utils.UpdateSchedule
When scheduling requires an irregular update cadence, explicit steps can be provided as a list.
Parameters:
steps (Union[List[int], torch.Tensor]) – A list of steps at which to update the sparsity pattern
cerebras.pytorch.sparse.utils.make_update_schedule
(update)[source]#
Instantiate a supported schedule type.
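A brief sketch (values are illustrative; that make_update_schedule accepts a dict of FreqSchedule arguments is an assumption):

```python
from cerebras.pytorch.sparse.utils import FreqSchedule, ListSchedule, make_update_schedule

# Update every 100 steps, starting at step 1000 and stopping at step 10000.
regular = FreqSchedule(freq=100, start=1000, stop=10000)

# Update only at an explicit, irregular list of steps.
irregular = ListSchedule(steps=[100, 250, 700, 5000])

# make_update_schedule instantiates a supported schedule type from a user
# configuration.
from_config = make_update_schedule({"freq": 100, "start": 1000})
```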
class cerebras.pytorch.sparse.utils.ScoreFlattener
[source]#
Bases: cerebras.pytorch.sparse.utils.ScoreShaper
Default ScoreShaper in which everything is flattened, providing a global competition for magnitude. If only sub-portions of the weight should compete for magnitude, provide an alternative shaper object.
class cerebras.pytorch.sparse.utils.OutputGroupScoreShaper
(num_groups)[source]#
Bases: cerebras.pytorch.sparse.utils.ScoreShaper
A ScoreShaper interface when weights are logically shaped as [num_groups*out_per_group, insize], but need to be scored in a “balanced” fashion as [num_groups, out_per_group*insize]
Examples
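A hedged sketch (shapes and values are illustrative; passing sparsity as a tensor is an assumption):

```python
import torch
from cerebras.pytorch.sparse.utils import OutputGroupScoreShaper, make_mask_topk_sparsity

# Logically [num_groups * out_per_group, insize] = [4 * 2, 8].
score = torch.rand(8, 8)
mask = make_mask_topk_sparsity(
    score,
    sparsity=torch.tensor(0.75),
    score_shaper=OutputGroupScoreShaper(num_groups=4),
)
# Each group of 2 output rows keeps exactly 25% of its entries, so sparsity is
# balanced across the 4 output groups instead of competing globally.
print(mask.reshape(4, -1).float().mean(dim=1))  # tensor([0.2500, 0.2500, 0.2500, 0.2500])
```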
class cerebras.pytorch.sparse.utils.InputGroupScoreShaper
(num_groups)[source]#
Bases: cerebras.pytorch.sparse.utils.ScoreShaper
A ScoreShaper interface when weights are logically shaped as [outsize, num_groups*in_per_group], but need to be scored in a “balanced” fashion as [num_groups, outsize*in_per_group]
Examples
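Similarly, a brief sketch for the input-grouped case (illustrative shapes):

```python
import torch
from cerebras.pytorch.sparse.utils import InputGroupScoreShaper, make_mask_topk_sparsity

# Logically [outsize, num_groups * in_per_group] = [8, 4 * 2]; sparsity is
# balanced across the 4 input (column) groups rather than globally.
score = torch.rand(8, 8)
mask = make_mask_topk_sparsity(
    score,
    sparsity=torch.tensor(0.5),
    score_shaper=InputGroupScoreShaper(num_groups=4),
)
```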
make_mask_drop_minimum
(score, mask, drop_fraction, score_shaper=None)[source]#
Given a sparse score (with mask), return a new torch.BoolTensor the same shape as mask where a drop_fraction portion of the currently present (mask==True) connections are dropped (mask==False).
The connections are dropped at positions corresponding to the lowest values of score.
Equivalently, a subset of mask is returned corresponding to the highest magnitude elements of score.
Parameters: * score (torch.FloatTensor) – Values used to evaluate which positions to drop
* mask (torch.BoolTensor) – The current mask over score indicating which connections are present
* drop_fraction – The fraction of currently present connections to drop
* score_shaper – If given, score (and mask) will be interpreted as multiple independent subtensors. This can be used to ensure sparsity distribution is “balanced” or to produce blockwise sparsity. By default, score and mask are reinterpreted as 1D tensors, yielding completely unstructured sparsity.
make_mask_grow_maximum
(score, mask, sparsity, mask_nonzero=None, score_shaper=None)[source]#
Given a sparse score (with mask), return a new torch.BoolTensor the same shape as mask where some currently pruned connections are regrown (from those positions with the highest score) such that the returned mask has the given target sparsity.
If mask is already less sparse (has more connections) than the target, none are regrown and the original mask is returned as-is. That is, the given mask should be more sparse than the target sparsity.
Parameters: * score (torch.FloatTensor) – Values used to evaluate which positions to regrow
* mask (torch.BoolTensor) – The current mask over score; should be more sparse than the target sparsity
* sparsity – The target sparsity of the returned mask
* mask_nonzero – If given, the precomputed result of mask.nonzero().int(). Since make_mask_grow_maximum is often used in conjunction with make_mask_drop_minimum, this value is commonly available.
* score_shaper – If given, score (and mask) will be interpreted as multiple independent subtensors. This can be used to ensure sparsity distribution is “balanced” or to produce blockwise sparsity. By default, score and mask are reinterpreted as 1D tensors, yielding completely unstructured sparsity.
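A hedged sketch of regrowing connections with make_mask_grow_maximum (shapes, values, and passing sparsity as a tensor are assumptions):

```python
import torch
from cerebras.pytorch.sparse.utils import make_mask_grow_maximum, make_mask_topk_sparsity

weight = torch.randn(64, 64)

# An over-sparse mask: 95% of connections pruned, keeping the largest magnitudes.
mask = make_mask_topk_sparsity(weight.abs(), torch.tensor(0.95))

# Regrow pruned connections at the positions with the highest regrow score
# (random scores here, SET-style; RigL would instead use the magnitude of the
# dense gradients) until the mask reaches the 90% target sparsity.
regrow_score = torch.rand_like(weight)
new_mask = make_mask_grow_maximum(regrow_score, mask, torch.tensor(0.90))
```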
make_mask_topk_sparsity
(score, sparsity, score_shaper=None)[source]#
Given a dense score, return a torch.BoolTensor which is True at positions corresponding to values in the top k = (1-sparsity)*score.numel() of score.
Parameters:
* score (torch.FloatTensor) – Values used to evaluate which positions to keep
* sparsity – The target sparsity of the returned mask
* score_shaper – If given, score will be interpreted as multiple independent subtensors. This can be used to ensure sparsity distribution is “balanced” or to produce blockwise sparsity. By default, score is reinterpreted as a 1D tensor, yielding completely unstructured sparsity.
Returns: a mask with given sparsity, keeping only the highest values from score.
Return type: torch.BoolTensor
Examples
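A small sketch of unstructured top-k masking (passing sparsity as a tensor is an assumption):

```python
import torch
from cerebras.pytorch.sparse.utils import make_mask_topk_sparsity

score = torch.tensor([1.0, 2.0, 3.0, 4.0])
mask = make_mask_topk_sparsity(score, torch.tensor(0.5))
# Keeps the top k = (1 - 0.5) * 4 = 2 values, i.e. the positions of 3.0 and 4.0.
print(mask)  # tensor([False, False,  True,  True])
```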