M8
Open SourceText Generationby Mistral AI

Mistral 8x22B

Mixtral 8x22B is Mistral AI's open sparse mixture-of-experts model. With 141B total but only ~39B active parameters per token, it pairs strong quality and a 64K context with efficient inference, under a permissive Apache 2.0 licence.

apache-2function-callingllmmistral-aimixture-of-expertsopen-source-ai
Quick facts
LicenseApache 2.0
Params141B (39B active)
TypeSparse MoE
Context64K
No ratings yet — be the first
Params
141B / 39B
total / active
Experts
8, 2 active
sparse MoE
Context
64K
tokens
License
Apache 2.0
permissive

What is Mixtral 8x22B?

Mixtral 8x22B is an open sparse mixture-of-experts (SMoE) large language model from Mistral AI. It contains 141 billion total parameters spread across eight expert sub-networks, but for each token it routes to only two experts, activating just about 39 billion parameters. This gives it the knowledge capacity of a very large model with the inference cost of a far smaller one. Released under the permissive Apache 2.0 licence with a long 64K-token context, it was one of the strongest fully open models available at release.

The architecture

Mixtral uses a mixture-of-experts design: within each layer there are eight expert feed-forward networks, and a lightweight router selects the two best experts for each token. Because only a fraction of the network runs per token, Mixtral delivers high quality while keeping compute efficient. It builds on the smaller Mixtral 8x7B, scaling the experts up to 22B each. It is natively multilingual, strong at code and mathematics, and supports function calling, with the 64K context handling long documents.

What it is good at

Mixtral 8x22B is a strong general-purpose model: chat and reasoning, multilingual text (English, French, German, Spanish, Italian), code and maths, summarisation and function calling for tool use. Its efficiency-to-quality ratio makes it attractive for self-hosted assistants, RAG and agentic applications where you want high capability without the full cost of a dense model of equivalent quality. The Instruct variant is tuned for assistant and chat behaviour.

Licensing & access

Mixtral 8x22B is released under Apache 2.0 — fully permissive for research and commercial use — with weights on Hugging Face, easy local running via Ollama, and availability through Mistral's API and many inference providers. Despite activating only ~39B parameters per token, the full 141B must fit in memory, so self-hosting needs substantial multi-GPU hardware or quantisation; hosted endpoints are a simpler route for many.

Practical considerations

Use the Instruct variant for chat and the base for fine-tuning. The main practical hurdle is memory: all experts must be resident even though few run per token, so plan for multi-GPU or quantised deployment. MoE models are also a little more involved to serve efficiently than dense ones. Mistral has since released newer models, so for the latest quality compare options — but Mixtral 8x22B remains an excellent, permissively licensed open MoE.

How it compares

DBRX is another open MoE with finer-grained routing; Llama 2 and Falcon are dense open models. Mixtral's edges are its Apache 2.0 licence, 64K context, strong multilingual and code ability, and efficient MoE inference. Against dense models of similar quality it is cheaper to run per token; against DBRX it offers a different routing design and a longer context. For a permissive, efficient, capable open model, Mixtral 8x22B is a leading choice.

Getting started

The quickest path is a hosted endpoint (Mistral's API or a provider) with Mixtral 8x22B Instruct, prompting it like any chat model; or run locally via Ollama or Transformers with quantisation to fit your GPUs. Use the Instruct variant for assistants, the base for fine-tuning, exploit the 64K context for long inputs, and lean on its code/maths strength — validating quality on your own workloads before rollout.

Model variants

Mixtral 8x22B

141B / 39B active
Base

For fine-tuning

MOST POPULAR

Mixtral 8x22B Instruct

141B / 39B active
InstructChat

Tuned for chat

Capabilities

🧩
Sparse MoE
Eight experts with two active per token give strong quality per active parameter.
Efficient inference
Only ~39B of 141B parameters run per token, cutting compute cost.
💻
Code and maths
Strong on programming, mathematics and reasoning.
📏
64K context
Handles long prompts and documents within a 64K-token window.

Pros & Cons

Pros6
  • Strong open quality with MoE efficiency
  • 141B total but only ~39B active per token
  • Permissive Apache 2.0 licence
  • Long 64K context window
  • Strong multilingual, code and maths
  • Function calling for tool use
Cons4
  • All 141B must fit in memory to self-host
  • Needs multi-GPU or quantisation
  • MoE serving more involved than dense
  • Use the Instruct variant for chat

Inspiration

Mistral 8x22B use cases & project ideas

Chat assistant

Capable conversational AI.

Code & math

Programming and reasoning.

Multilingual

Several European languages.

Tool use

Function-calling agents.

FAQ

Frequently asked questions

An open sparse mixture-of-experts LLM from Mistral AI: 141B total parameters but only ~39B active per token, with a 64K context.