Model Description
Mistral is a family of decoder-only transformer models optimized for efficiency and throughput while preserving strong general performance. Architecturally, Mistral builds on the transformer decoder backbone with several key enhancements: it adopts grouped-query attention (GQA) for faster inference, replaces absolute positional encodings with rotary position embeddings (RoPE), employs sliding window attention (SWA) for efficient long-sequence processing, and utilizes SwiGLU activation functions. These models are well-suited for instruction following, reasoning, summarization, and coding tasks. Mistral's architecture is very similar to LLaMA's, except that:
- It uses grouped-query attention (GQA), which reduces the number of attention heads for keys and values.
- It applies sliding window attention (SWA) with a 4K window, enabling local attention over long sequences (see the sketch after this list).
- It supports a higher default maximum sequence length (MSL) of 32K, rather than LLaMA's 4K.
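The two attention changes above can be illustrated in a few lines of code. The following is a minimal sketch, not the Model Zoo implementation: it shows grouped-query attention, where several query heads share one key/value head, combined with a sliding-window causal mask. All function names, tensor shapes, and the tiny usage sizes are assumptions chosen for clarity.

```python
# Minimal, illustrative sketch of GQA + sliding-window causal masking.
# Shapes and names here are assumptions, not the Model Zoo's API.
import math
import torch


def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: True where attention is allowed (causal, within the local window)."""
    pos = torch.arange(seq_len)
    rel = pos[:, None] - pos[None, :]      # distance from each query to each key
    return (rel >= 0) & (rel < window)     # attend only to the most recent `window` tokens


def gqa_sliding_window_attention(q, k, v, window: int) -> torch.Tensor:
    """q: [batch, n_q_heads, seq, head_dim]; k, v: [batch, n_kv_heads, seq, head_dim]."""
    batch, n_q_heads, seq_len, head_dim = q.shape
    n_kv_heads = k.shape[1]
    group_size = n_q_heads // n_kv_heads   # query heads that share one KV head

    # GQA: replicate each KV head so every group of query heads reads the same K/V.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)

    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
    mask = sliding_window_causal_mask(seq_len, window).to(q.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


# Tiny usage example with made-up sizes (Mistral 7B uses 32 query heads,
# 8 KV heads, and a 4K window; these are scaled down to run quickly).
q = torch.randn(1, 8, 16, 32)
k = torch.randn(1, 2, 16, 32)
v = torch.randn(1, 2, 16, 32)
out = gqa_sliding_window_attention(q, k, v, window=4)
print(out.shape)  # torch.Size([1, 8, 16, 32])
```

Because each KV head serves a whole group of query heads, the KV cache shrinks by the grouping factor, which is where GQA's inference speedup comes from; the window mask bounds per-token attention cost regardless of total sequence length.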
Code Structure
The code for this model is located in the /mistral directory within ModelZoo. Our implementation of Mistral is built on top of our GPT-2 implementation. For more details, see gpt2_model.py.

Available Configurations
| Configuration | Description |
|---|---|
| params_mistral_7B.yaml | 7B parameter Mistral model. |
| params_mistral_7B_msl128k.yaml | 7B parameter Mistral model with 128K MSL. |
| params_mistral_12b.yaml | 12B parameter Mistral model. |
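To see exactly what a given configuration sets before launching a run, you can load the YAML directly. The snippet below is a small illustrative helper, not part of the Model Zoo, and the directory path is an assumption about where the config files sit in your checkout.

```python
# Inspect one of the configurations listed above (path is an assumed location).
import yaml

config_path = "modelzoo/models/nlp/mistral/configs/params_mistral_7B.yaml"  # assumed path
with open(config_path) as f:
    params = yaml.safe_load(f)

# Print the top-level sections so you can see which fields to override for your own run.
for section, values in params.items():
    print(section, "->", sorted(values) if isinstance(values, dict) else values)
```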
Workflow
For example workflows using language models from the Cerebras Model Zoo, see our tutorials on pretraining and fine-tuning. For a complete list of Cerebras ModelZoo CLI commands, see the command reference.

References
- Jiang, Albert, et al. (2023). Mistral 7B
- Ainslie, Joshua, et al. (2023). GQA: Training Multi-Query Transformer Models from Multi-Head Checkpoints
- Child, Rewon, et al. (2019). Generating Long Sequences with Sparse Transformers