torch.nn.Module
Alibi Position Embedding Layer, symmetric case with bidirectional attention supported.
Implements the Alibi bias as in the paper: https://arxiv.org/abs/2108.12409
Parameters
slopes – Alibi slopes. Defaults to None.
slopes_initializer (str) – Initializer for the slopes. Defaults to xavier_uniform.
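As a rough, plain-PyTorch illustration of the bias this layer adds (a sketch of the paper's formulation, not the Cerebras implementation; the helper names below are made up for the example), each head gets a fixed slope from a geometric sequence and the attention logits are penalized in proportion to query-key distance:

import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric sequence of slopes from the ALiBi paper: 2^(-8*i/num_heads) for i = 1..num_heads
    # (the closed form the paper uses for power-of-two head counts).
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]            # [seq_len, seq_len], entry (i, j) = j - i
    slopes = alibi_slopes(num_heads)                  # [num_heads]
    # Symmetric / bidirectional case: penalize by absolute distance.
    return -slopes[:, None, None] * distance.abs()    # [num_heads, seq_len, seq_len]

The resulting tensor has the [num_heads, query_length, key_length] shape that the attention layers below accept as position_bias.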
class cerebras.modelzoo.layers.MultiheadAttention(*args, **kwargs)[source]#
Bases: torch.nn.Module
Multi-head attention layer. Adapted from:
https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention
Parameters
embed_dim (int) – Number of input units in each projection output.
num_heads (int) – Number of attention heads.
inner_dim (int) – Number of output units in attention query/key/value projection. Defaults to embed_dim.
dropout (float) – Dropout rate for key-query weights. Defaults to 0.0.
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature), otherwise the format will be (seq, batch, feature). Default: True (batch, seq, feature).
add_bias_kv (bool) – If specified, adds bias to the key and value sequences at dim=0. Default: False.
add_zero_attn (bool) – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False.
kdim (int) – Number of input units in the key projection.
vdim (int) – Number of input units in the value projection.
use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
use_ffn_bias (bool) – Whether to use bias in the output projection.
attention_initializer (str) – Projection kernel initializer. Defaults to xavier_uniform.
attention_q_initializer – Query projection kernel initializer. If not specified, falls back to attention_initializer.
output_layer_initializer (str | initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
bias_initializer (str) – Bias initializer. Defaults to zeros.
attention_type (str) – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden_size / num_heads) instead of sqrt(d); see the sketch after this parameter list.
attention_logits_alpha (float) – Scales the QK^T dot product. Used to stabilize logits in muP training.
softmax_dtype_fp32 (bool) – Use an FP32 softmax implementation.
attention_kernel (str | None) – Kernel to use. Uses default if None. See accepted values below.
None - Default implementation.
fast_attention - Experimental optimized implementation.
device (optional) – Device to create the model parameters on; can be a CUDA device or CS device.
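To make the attention_type and scale_qk_dot_by_d options concrete, here is a minimal sketch of the score computation in plain PyTorch (an illustration of the math, not the layer's internal kernel; the flag names mirror the parameter list above):

import torch

def attention_scores(q, k, attention_type="scaled_dot_product", scale_qk_dot_by_d=False):
    # q, k: [batch, num_heads, seq_length, d_head]
    d_head = q.shape[-1]
    logits = torch.matmul(q, k.transpose(-1, -2))          # QK^T
    if attention_type == "scaled_dot_product":
        # scale_qk_dot_by_d=True divides by d_head itself (muP-style) instead of sqrt(d_head);
        # attention_type="dot_product" leaves the logits unscaled.
        logits = logits / (d_head if scale_qk_dot_by_d else d_head ** 0.5)
    return torch.softmax(logits.float(), dim=-1)           # softmax in FP32 for stability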
MultiheadAttention.forward(q, k, v, attn_mask=None, key_padding_mask=None, need_weights=False, average_attn_weights=True, past_kv=None, cache_present_kv=False, past_kv_self_attn=True, position_bias=None, rotary_position_embedding_helper=None, layer_idx=None, **extra_args)
Applies the attention mechanism to queries q, keys k, and values v.
Parameters
q (Tensor) – Queries, shape [batch_size, seq_length, embed_dim].
k (Tensor) – Keys, shape [batch_size, seq_length, embed_dim].
v (Tensor) – Values, shape [batch_size, seq_length, embed_dim].
attn_mask (Tensor) – Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch, query_length, seq_length].
key_padding_mask (Tensor) – If specified, a mask of shape (N, S) indicating which elements within key to ignore for the purpose of attention (i.e. treat as “padding”). Defaults to None.
need_weights (bool) – If specified, returns attn_output_weights in addition to attn_outputs. Default: False.
average_attn_weights (bool) – If true, indicates that the returned attn_weights should be averaged across heads. Otherwise, attn_weights are provided separately per head. Note that this flag only has an effect when need_weights=True. Default: True (i.e., average weights across heads).
past_kv (tuple(tensor, tensor)) – Past keys and values. Tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads]. The 0th and 1st tensors contain the past keys and values, respectively. Defaults to None.
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
past_kv_self_attn (bool) – Specifies whether the past keys & values should be used for self-attention (true) or cross-attention (false). Ignored if past_kv is not provided. Default: True.
position_bias (Tensor) – Tensor containing position bias to apply in attention with shape [num_heads, query_length, key_length].
rotary_position_embedding_helper (Optional[RotaryPositionEmbeddingHelper]) – A helper class to apply rotary embedding on the input tensor.
Returns
Attention output tensor with shape [batch_size, seq_length, embed_dim].
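A usage sketch for self-attention (the constructor keywords follow the parameter list above and are assumptions rather than a verified signature; shapes follow the forward documentation):

import torch
from cerebras.modelzoo.layers import MultiheadAttention

# Hypothetical construction; keyword names follow the parameter list above.
attn = MultiheadAttention(embed_dim=512, num_heads=8, dropout=0.1, batch_first=True)

batch_size, seq_length, embed_dim = 2, 16, 512
x = torch.rand(batch_size, seq_length, embed_dim)

# Self-attention: queries, keys, and values are the same tensor.
out = attn(x, x, x)   # expected shape: [batch_size, seq_length, embed_dim]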
class cerebras.modelzoo.layers.BatchChannelNorm2D(*args, **kwargs)[source]#
Bases: torch.nn.Module
Implements Batch Channel Normalization proposed in “Micro-Batch Training with Batch-Channel Normalization and Weight Standardization” (https://arxiv.org/abs/1903.10520).
Parameters
torch.nn.Module
Creates token and, optionally, position and segment embeddings.
Parameters
The token, position, and segment inputs each have shape [batch_size, seq_length]; the returned embeddings have shape [batch_size, seq_length, embedding_size].
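The shape contract above can be pictured with a plain-PyTorch sketch of the token + position + segment summation (an illustration, not the Cerebras class; all names here are made up for the example):

import torch
import torch.nn as nn

class SimpleEmbeddingLayer(nn.Module):
    """Token + (optional) learned position + segment embeddings, summed."""

    def __init__(self, vocab_size, embedding_size, max_position_embeddings=512, num_segments=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, embedding_size)
        self.position = nn.Embedding(max_position_embeddings, embedding_size)
        self.segment = nn.Embedding(num_segments, embedding_size)

    def forward(self, input_ids, segment_ids=None, position_ids=None):
        # input_ids, segment_ids, position_ids: [batch_size, seq_length]
        if position_ids is None:
            position_ids = torch.arange(input_ids.shape[1], device=input_ids.device)
            position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
        embeddings = self.word(input_ids) + self.position(position_ids)
        if segment_ids is not None:
            embeddings = embeddings + self.segment(segment_ids)
        return embeddings  # [batch_size, seq_length, embedding_size]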
class cerebras.modelzoo.layers.FeedForwardNetwork(*args, **kwargs)[source]#
Bases: torch.nn.Module
A feed forward network consisting of a stack of fully connected layers arranged as [LinearLayer -> Activation -> Dropout] blocks, repeated len(layers_units) times.
Parameters
config (FeedForwardNetworkConfig) – Feed forward network config.
Initialize the FFN object instance.
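The [LinearLayer -> Activation -> Dropout] stacking reads roughly as follows in plain PyTorch (a sketch; the real layer is configured through FeedForwardNetworkConfig, whose exact fields are not shown here):

import torch.nn as nn

def make_ffn(input_unit, layers_units, activation=nn.GELU, dropout=0.1):
    """Stack of [Linear -> Activation -> Dropout] blocks, one per entry in layers_units."""
    blocks, in_features = [], input_unit
    for out_features in layers_units:
        blocks += [nn.Linear(in_features, out_features), activation(), nn.Dropout(dropout)]
        in_features = out_features
    return nn.Sequential(*blocks)

ffn = make_ffn(input_unit=512, layers_units=[2048, 512])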
class cerebras.modelzoo.layers.GPTJDecoderLayer(*args, **kwargs)[source]#
Bases: cerebras.modelzoo.layers.TransformerDecoderLayer.TransformerDecoderLayer
GPTJDecoderLayer inherits from TransformerDecoderLayer, with 2 modifications:
Past keys and values tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads] (optional).
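For orientation, GPT-J's reference architecture runs the attention and feedforward branches in parallel off a single LayerNorm and sums both into the residual; the sketch below shows that general pattern in plain PyTorch and is not a description of this class's exact internals:

import torch.nn as nn

class ParallelDecoderBlock(nn.Module):
    """GPT-J-style block: attention and MLP share one LayerNorm and are added in parallel."""

    def __init__(self, hidden_size, num_heads, ffn_size):
        super().__init__()
        self.ln = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, ffn_size), nn.GELU(), nn.Linear(ffn_size, hidden_size)
        )

    def forward(self, x, attn_mask=None):
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        return x + attn_out + self.mlp(h)   # parallel residual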
torch.nn.Module
Uses torch.nn.GroupNorm to emulate InstanceNorm by setting the number of groups equal to the number of channels.
Parameters
num_channels (int) – Number of channels, C, from an expected input of size (N, C, H, W).
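This equivalence is easy to check in plain PyTorch: with num_groups equal to the number of channels, GroupNorm normalizes each channel of each sample independently, matching InstanceNorm2d (up to affine parameter handling):

import torch
import torch.nn as nn

N, C, H, W = 4, 8, 16, 16
x = torch.randn(N, C, H, W)

group_as_instance = nn.GroupNorm(num_groups=C, num_channels=C, affine=False)
instance = nn.InstanceNorm2d(C, affine=False)

print(torch.allclose(group_as_instance(x), instance(x), atol=1e-5))  # True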
class cerebras.modelzoo.layers.MultiQueryAttention(*args, **kwargs)[source]#
Bases: cerebras.modelzoo.layers.AttentionLayer.MultiheadAttention
Implements the Multi-Query Attention layer from “Fast Transformer Decoding: One Write-Head is All You Need” (https://arxiv.org/abs/1911.02150); a sketch of the idea follows the parameter list.
Parameters
The parameters mirror MultiheadAttention above: the inner projection dimension defaults to embed_dim, the projection kernel initializer defaults to xavier_uniform, the query projection initializer falls back to attention_initializer, the bias initializer defaults to zeros, attention_type accepts dot_product and scaled_dot_product (defaulting to scaled_dot_product), and the attention kernel uses default if None (accepted values: None - default implementation; fast_attention - experimental optimized implementation).
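The core idea from the paper, a single key/value head shared by all query heads, can be sketched as follows (an illustration of the technique in plain PyTorch, with the input projections omitted; it is not this class's implementation):

import torch

def multi_query_attention(q, k, v, num_heads):
    # q: [batch, seq, embed_dim]; k, v: [batch, seq, d_head] -- a single shared key/value head.
    batch, seq, embed_dim = q.shape
    d_head = embed_dim // num_heads
    q = q.view(batch, seq, num_heads, d_head).transpose(1, 2)    # [batch, heads, seq, d_head]
    scores = torch.matmul(q, k.transpose(-1, -2).unsqueeze(1))   # K broadcast across heads
    probs = torch.softmax(scores / d_head ** 0.5, dim=-1)
    out = torch.matmul(probs, v.unsqueeze(1))                    # V broadcast across heads
    return out.transpose(1, 2).reshape(batch, seq, embed_dim)

Compared with multi-head attention, only the queries keep per-head projections, which shrinks the key/value cache during autoregressive decoding.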
torch.nn.Module
Relative Position Embedding Layer
Parameters
The relative position embeddings initializer defaults to xavier_uniform.
The computed position bias has shape [num_heads, query_length, key_length]. Relative position is defined as memory_position - query_position, i.e., the distance in tokens from the attending position to the attended-to position.
If bidirectional_relative_attention = False, then positive relative positions are invalid. We use smaller buckets for small absolute relative positions and larger buckets for larger absolute relative positions. All relative positions >= max_distance map to the same bucket, and all relative positions <= -max_distance map to the same bucket. This should allow for more graceful generalization to longer sequences than the model has been trained on; see the sketch below.
The bucketing helper takes the relative_position Tensor, a bool selecting bidirectional relative attention (defaulting to False), and int arguments for the number of relative attention buckets and the maximum distance. It returns a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_relative_attention_buckets).
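The bucketing described above follows the familiar T5-style scheme; a plain-PyTorch sketch adapted from that formulation (argument names chosen to echo the text above, not the exact method signature) looks like:

import math
import torch

def relative_position_bucket(relative_position, bidirectional_relative_attention=False,
                             num_relative_attention_buckets=32, max_distance=128):
    # relative_position: integer tensor of memory_position - query_position values.
    num_buckets = num_relative_attention_buckets
    buckets = torch.zeros_like(relative_position)
    if bidirectional_relative_attention:
        # Half the buckets are reserved for positive (future) offsets.
        num_buckets //= 2
        buckets = buckets + (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
    else:
        # Positive relative positions are invalid: clamp them to zero.
        relative_position = -torch.min(relative_position, torch.zeros_like(relative_position))
    # Small absolute distances get their own buckets; larger ones are binned logarithmically,
    # so everything at or beyond max_distance falls into the last bucket.
    max_exact = num_buckets // 2
    is_small = relative_position < max_exact
    if_large = max_exact + (
        torch.log(relative_position.float() / max_exact)
        / math.log(max_distance / max_exact)
        * (num_buckets - max_exact)
    ).to(torch.long)
    if_large = torch.min(if_large, torch.full_like(if_large, num_buckets - 1))
    return buckets + torch.where(is_small, relative_position, if_large)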
class cerebras.modelzoo.layers.Transformer(*args, **kwargs)[source]#
Bases: torch.nn.Module
A transformer model. Users are able to modify the attributes as needed. The architecture is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users can build the BERT (https://arxiv.org/abs/1810.04805) model with corresponding parameters.
Parameters
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
norm_first (bool) – If True, encoder and decoder layers will perform LayerNorms before other attention and feedforward operations, otherwise after. Default: False (after).
Examples:
>>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
>>> src = torch.rand((10, 32, 512))
>>> tgt = torch.rand((20, 32, 512))
>>> out = transformer_model(src, tgt)
Note: A full example to apply nn.Transformer module for the word language model is available in https://github.com/pytorch/examples/tree/master/word_language_model
forward(src, tgt, src_mask=None, tgt_mask=None, memory_mask=None, src_key_padding_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None)[source]#
Take in and process masked source/target sequences.
Parameters
[src/tgt/memory]_mask restricts which positions may be attended: if a BoolTensor is provided, positions with the value of True are not allowed to attend while False values will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. [src/tgt/memory]_key_padding_mask provides specified elements in the key to be ignored by the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, the positions with the value of True will be ignored while the positions with the value of False will be unchanged.
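For example, a boolean causal mask (True means the position may not be attended) and a key padding mask marking the padded tail of each sequence can be built like this:

import torch

seq_len = 5

# Causal (subsequent-position) mask: True above the diagonal means "may not attend".
tgt_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Key padding mask of shape (N, S): True marks padding positions the attention should ignore.
lengths = torch.tensor([5, 3])   # actual lengths for a batch of 2 sequences
tgt_key_padding_mask = torch.arange(seq_len)[None, :] >= lengths[:, None]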
Examples:
>>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
class cerebras.modelzoo.layers.TransformerDecoder(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerDecoder is a stack of N decoder layers
Parameters
Examples:
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
>>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
>>> memory = torch.rand(10, 32, 512)
>>> tgt = torch.rand(20, 32, 512)
>>> out = transformer_decoder(tgt, memory)
forward(tgt, memory=None, tgt_mask=None, sparse_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, self_attn_position_bias=None, cross_attn_position_bias=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, extract_layer_idx=None, expert_hash_idx=None, **extra_args)[source]#
Pass the inputs (and mask) through the decoder layer in turn.
Parameters
class cerebras.modelzoo.layers.TransformerDecoderLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerDecoderLayer is made up of self-attn, multihead-attn and feedforward network. This standard decoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
Parameters
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
norm_first (bool) – If True, layer norm is done prior to self attention, multihead attention and feedforward operations, respectively. Otherwise it is done after. Default: False (after).
scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden_size / num_heads) instead of sqrt(d).
add_cross_attention (bool) – If True, adds a cross-attention layer between encoder/decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
attention_q_initializer – Query projection kernel initializer. If not specified, falls back to attention_initializer.
The attention output transform initializer likewise falls back to attention_initializer when not specified.
use_ff_layer1_dropout (bool) – If True, dropout will be enabled after the first feed forward layer. Default: True.
Examples:
>>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
>>> memory = torch.rand(32, 10, 512)
>>> tgt = torch.rand(32, 20, 512)
>>> out = decoder_layer(tgt, memory)
forward(tgt, memory=None, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, rotary_position_embedding_helper=None, past_kv=None, cache_present_kv=False, self_attn_position_bias=None, cross_attn_position_bias=None, layer_idx=None, expert_hash_idx=None, **extra_args)[source]#
Pass the inputs (and mask) through the decoder layer.
Parameters
Past keys and values tensors have shape [batch_size, num_heads, seq_length, embed_dim / num_heads] (optional).
torch.nn.Module
TransformerEncoder is a stack of N encoder layers
Parameters
Default: False (disabled).
Examples:
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
>>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
>>> src = torch.rand(10, 32, 512)
>>> out = transformer_encoder(src)
forward(src, mask=None, src_key_padding_mask=None, rotary_position_embedding_helper=None, self_attn_position_bias=None, extract_layer_idx=None, **extra_args)[source]#
Pass the input through the encoder layers in turn.
Parameters
class cerebras.modelzoo.layers.TransformerEncoderLayer(*args, **kwargs)[source]#
Bases: torch.nn.Module
TransformerEncoderLayer is made up of self-attn and feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000-6010. Users may modify or implement in a different way during application.
Parameters
batch_first (bool) – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature).
norm_first (bool) – If True, layer norm is done prior to attention and feedforward operations, respectively. Otherwise it is done after. Default: False (after).
scale_qk_dot_by_d (bool) – If True, scales the QK^T dot product by d (= hidden_size / num_heads) instead of sqrt(d).
add_cross_attention (bool) – If True, adds a cross-attention layer between encoder/decoder; otherwise, only self-attention is used in the decoder (GPT-style models should set this to False).
attention_q_initializer – Query projection kernel initializer. If not specified, falls back to attention_initializer.
The attention output transform initializer likewise falls back to attention_initializer when not specified.
use_ff_layer1_dropout (bool) – If True, dropout will be enabled after the first feed forward layer. Default: True.
Alternatively, when batch_first is True:
>>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
>>> src = torch.rand(32, 10, 512)
>>> out = encoder_layer(src)
forward(src, src_mask=None, src_key_padding_mask=None, rotary_position_embedding_helper=None, self_attn_position_bias=None, **extra_args)[source]#
Pass the input through the encoder layer.
Parameters