License: apache-2.0
Key Features
Voxtral builds upon Ministral-3B with powerful audio understanding capabilities.
- Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
- Long-form context: With a 32k token context length, Voxtral handles audio of up to 30 minutes for transcription, or 40 minutes for understanding
- Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
- Natively multilingual: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
- Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
- Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B
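The 30-minute transcription limit listed above means a longer recording has to be split before it is sent to the model. A minimal sketch of that planning step, assuming fixed-length, non-overlapping segments; the helper name `plan_segments` and the splitting strategy are illustrative assumptions, not part of Voxtral or its tooling:

```python
from typing import List, Tuple

# Limits quoted in the feature list above, in seconds.
MAX_TRANSCRIPTION_S = 30 * 60
MAX_UNDERSTANDING_S = 40 * 60


def plan_segments(
    duration_s: float, max_segment_s: float = MAX_TRANSCRIPTION_S
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets covering a recording in chunks of at most max_segment_s.

    Hypothetical helper: a naive fixed-length split with no overlap and no
    silence detection, so a word may be cut at a segment boundary.
    """
    if duration_s < 0 or max_segment_s <= 0:
        raise ValueError("duration_s must be >= 0 and max_segment_s must be > 0")
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments
```

For example, `plan_segments(70 * 60)` plans three requests for a 70-minute recording. Splitting on pauses rather than fixed offsets would likely give cleaner transcripts, but that needs an audio library and is left out here.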
Benchmark Results
Audio
Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
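For reference, WER counts the word-level substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of reference words. A minimal sketch of the metric, assuming whitespace tokenisation and no text normalisation (benchmark evaluations usually normalise casing and punctuation first, which this sketch does not do):

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Lower is better; a value of 0.0 means the transcript matches the reference word for word.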

Text

Usage
The model can be used with the following frameworks: vLLM (recommended) and Transformers.
Notes:
- Use temperature=0.2 and top_p=0.95 for chat completion (e.g. Audio Understanding) and temperature=0.0 for transcription
- Multiple audios per message and multiple user turns with audio are supported
- System prompts are not yet supported
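The notes above can be captured in a small client-side helper so that every request uses the recommended settings. This is a sketch only: the names `RECOMMENDED_SAMPLING`, `sampling_for` and `reject_system_prompts` are hypothetical, and messages are assumed to be plain OpenAI-style dicts with a `role` key.

```python
from typing import Any, Dict, List

# Values recommended in the notes above.
RECOMMENDED_SAMPLING = {
    "chat": {"temperature": 0.2, "top_p": 0.95},
    "transcription": {"temperature": 0.0},
}


def sampling_for(task: str) -> Dict[str, float]:
    """Return a copy of the recommended sampling parameters for a task."""
    if task not in RECOMMENDED_SAMPLING:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(RECOMMENDED_SAMPLING)}"
        )
    return dict(RECOMMENDED_SAMPLING[task])


def reject_system_prompts(messages: List[Dict[str, Any]]) -> None:
    """Raise early if a conversation contains a system message."""
    for index, message in enumerate(messages):
        if message.get("role") == "system":
            raise ValueError(
                f"message {index} is a system prompt, which is not yet supported"
            )
```

The returned dict can be unpacked into a request call, for example `client.chat.completions.create(model=model, messages=messages, **sampling_for("chat"))`.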
vLLM (recommended)
We recommend using this model with vLLM.
Installation
Make sure to install vllm >= 0.10.0. We recommend using uv:
sh
uv pip install -U "vllm[audio]" --system
Doing so should automatically install mistral_common >= 1.8.1.
To check:
sh
python -c "import mistral_common; print(mistral_common.__version__)"
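The same check can be scripted so that an outdated install fails fast. A minimal sketch using only the standard library, assuming the 1.8.1 minimum stated in the Transformers section; `parse_release`, `meets_minimum` and `installed_version` are hypothetical helper names, and the parser is deliberately simplistic (it ignores pre-release suffixes, so `packaging.version` is the more robust choice if it is available):

```python
from importlib import metadata
from typing import Optional, Tuple

MINIMUM_MISTRAL_COMMON = "1.8.1"  # minimum stated in the Transformers section


def parse_release(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components, e.g. '1.8.1' -> (1, 8, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break  # stop at the first non-numeric component
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MINIMUM_MISTRAL_COMMON) -> bool:
    """Compare two version strings by their numeric release components."""
    return parse_release(installed) >= parse_release(minimum)


def installed_version(package: str = "mistral_common") -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

A start-up script could call `installed_version()` and stop with a clear message when the result is `None` or `meets_minimum(...)` is false.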
Offline
You can test that your vLLM setup works as expected by cloning the vLLM repo:
sh
git clone https://github.com/vllm-project/vllm && cd vllm
and then running:
sh
python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
Serve
We recommend that you use Voxtral-Mini-3B-2507 in a server/client setting.
- Spin up a server:
sh
vllm serve mistralai/Voxtral-Mini-3B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral
Note: Running Voxtral-Mini-3B-2507 on GPU requires ~9.5 GB of GPU RAM in bf16 or fp16.
- To query the server, you can use a simple Python client. See the following examples.
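Before sending audio, it can help to confirm that the server is reachable and to see which model id it reports. A minimal sketch using only the standard library, assuming the server exposes the OpenAI-style model listing under the `/v1` base URL; the function names are hypothetical:

```python
import json
from typing import Any, Dict, List
from urllib.request import urlopen


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL ending in /v1."""
    return base_url.rstrip("/") + "/models"


def model_ids(payload: Dict[str, Any]) -> List[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str, timeout_s: float = 5.0) -> List[str]:
    """Query a running server; raises URLError if it cannot be reached."""
    with urlopen(models_url(base_url), timeout=timeout_s) as response:
        return model_ids(json.load(response))
```

With a server running, `fetch_model_ids("http://<your-server-host>:8000/v1")` should return a list containing the served model id; a connection error at this point usually means the server is still loading or the host and port are wrong.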
Audio Instruct
Leverage the audio capabilities of Voxtral-Mini-3B-2507 to chat.
Make sure that your client has mistral-common with audio installed:
sh
pip install --upgrade mistral_common\[audio\]
py
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other?")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The speaker who is more inspiring is the one who delivered the farewell address, as they express
# gratitude, optimism, and a strong commitment to the nation and its citizens. They emphasize the importance of
# self-government and active citizenship, encouraging everyone to participate in the democratic process. In contrast,
# the other speaker provides a factual update on the weather in Barcelona, which is less inspiring as it
# lacks the emotional and motivational content of the farewell address.
# **Differences:**
# - The farewell address speaker focuses on the values and responsibilities of citizenship, encouraging active participation in democracy.
# - The weather update speaker provides factual information about the temperature in Barcelona, without any emotional or motivational content.

messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]

print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
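The second request in the example above resends the whole conversation, because each chat completion request is independent and the client has to carry the history itself. A minimal sketch of that bookkeeping, assuming plain OpenAI-style message dicts rather than the mistral_common message classes; `next_turn` is a hypothetical helper name:

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def next_turn(history: List[Message], assistant_reply: str, user_text: str) -> List[Message]:
    """Return a new message list with the model's reply and the next user question appended.

    The input list is left unchanged, so earlier turns can be reused or retried.
    """
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": user_text},
    ]
```

The audio from the first turn stays in the history, so follow-up questions can refer back to it without uploading the file again.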
Transcription
Voxtral-Mini-3B-2507 has powerful transcription capabilities!
Make sure that your client has mistral-common with audio installed:
sh
pip install --upgrade mistral_common\[audio\]
python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)
Transformers 🤗
Starting with transformers >= 4.54.0, you can run Voxtral natively!
Install Transformers:
bash
pip install -U transformers
Make sure to have mistral-common >= 1.8.1 installed with audio dependencies:
bash
pip install --upgrade "mistral-common[audio]"
Audio Instruct
python
# Multi-audio + text instruction
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Keep only the newly generated tokens by slicing off the prompt.
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
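The slice `outputs[:, inputs.input_ids.shape[1]:]` in the example above is there because the generated sequences start with the prompt tokens; decoding them unsliced would print the prompt back along with the answer. The same idea on plain Python lists, as a sketch with made-up token ids and a hypothetical helper name, assuming all prompts in the batch share one padded length:

```python
from typing import List


def strip_prompt(sequences: List[List[int]], prompt_length: int) -> List[List[int]]:
    """Drop the first prompt_length token ids from every generated sequence."""
    return [sequence[prompt_length:] for sequence in sequences]


# Two sequences that share a 3-token prompt (token ids are invented).
generated = [[1, 2, 3, 10, 11], [1, 2, 3, 12, 13]]
new_tokens = strip_prompt(generated, prompt_length=3)
```

Here `new_tokens` holds only the continuation of each sequence, which is the part that gets passed to `batch_decode` in the examples.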
python
# Multi-turn conversation
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
python
# Text only
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
python
# Audio only
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
python
# Batched inference
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
Transcription
python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

inputs = processor.apply_transcription_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
Model provider: mistralai
Modalities
- Input: Audio, Text
- Output: Text