Model Description
Mistral is a family of decoder-only transformer models optimized for efficiency and throughput while preserving strong general performance. Architecturally, Mistral builds on the transformer decoder backbone with several key enhancements: it adopts grouped-query attention (GQA) for faster inference, replaces absolute positional encodings with rotary position embeddings (RoPE), employs sliding window attention (SWA) for efficient long-sequence processing, and utilizes SwiGLU activation functions. These models are well-suited for instruction following, reasoning, summarization, and coding tasks. Mistral's architecture is very similar to LLaMA's, except that:
- It uses grouped-query attention (GQA), which reduces the number of attention heads for keys and values.
- It applies sliding window attention (SWA) with a 4K window, enabling local attention over long sequences (see the sketch after this list).
- It supports a higher default maximum sequence length (MSL) of 32K, rather than LLaMA's 4K.
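The two attention changes above can be illustrated in a few lines of code. The following is a minimal sketch, not the Model Zoo implementation: it shows grouped-query attention, where several query heads share one key/value head, combined with a sliding-window causal mask. All function names, tensor shapes, and the tiny usage sizes are assumptions chosen for clarity.

```python
# Minimal, illustrative sketch of GQA + sliding-window causal masking.
# Shapes and names here are assumptions, not the Model Zoo's API.
import math
import torch


def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where attention is allowed (causal, within the local window)."""
    pos = torch.arange(seq_len)
    rel = pos[:, None] - pos[None, :]      # distance from each query to each key
    return (rel >= 0) & (rel < window)     # attend only to the most recent `window` tokens


def gqa_sliding_window_attention(q, k, v, window: int) -> torch.Tensor:
    """q: [batch, n_q_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim]."""
    batch, n_q_heads, seq_len, head_dim = q.shape
    n_kv_heads = k.shape[1]
    group_size = n_q_heads // n_kv_heads   # query heads that share one KV head

    # GQA: replicate each KV head so every group of query heads reads the same K/V.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    mask = sliding_window_causal_mask(seq_len, window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


# Tiny usage example with made-up sizes (Mistral 7B uses 32 query heads,
# 8 KV heads, and a 4K window; these are scaled down to run quickly).
q = torch.randn(1, 8, 16, 32)
k = torch.randn(1, 2, 16, 32)
v = torch.randn(1, 2, 16, 32)
out = gqa_sliding_window_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([1, 8, 16, 32])
```

Because each KV head serves a whole group of query heads, the KV cache shrinks by the grouping factor, which is where GQA's inference speedup comes from; the window mask bounds per-token attention cost regardless of total sequence length.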
Code Structure
The code for this model is located in the /mistral directory within ModelZoo. Our implementation of Mistral is built on top of our GPT-2 implementation. For more details, see gpt2_model.py.

Available Configurations
| Configuration | Description |
|---|---|
| params_mistral_7B.yaml | 7B parameter Mistral model. |
| params_mistral_7B_msl128k.yaml | 7B parameter Mistral model with 128K MSL. |
| params_mistral_12b.yaml | 12B parameter Mistral model. |
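To see exactly what a given configuration sets before launching a run, you can load the YAML directly. The snippet below is a small illustrative helper, not part of the Model Zoo, and the directory path is an assumption about where the config files sit in your checkout.

```python
# Inspect one of the configurations listed above (path is an assumed location).
import yaml

config_path = "modelzoo/models/nlp/mistral/configs/params_mistral_7B.yaml"  # assumed path
with open(config_path) as f:
    params = yaml.safe_load(f)

# Print the top-level sections so you can see which fields to override for your own run.
for section, values in params.items():
    print(section, "->", sorted(values) if isinstance(values, dict) else values)
```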
Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning. For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

References
- Jiang, Albert, et al. (2023). Mistral 7B
- Ainslie, Joshua, et al. (2023). GQA: Training Multi-Query Transformer Models from Multi-Head Checkpoints
- Child, Rewon, et al. (2019). Generating Long Sequences with Sparse Transformers