License: apache-2.0
Key Features
Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.
- Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, Voxtral automatically predicts the source audio language and transcribes the text accordingly
- Long-form context: With a 32k token context length, Voxtral handles audio of up to 30 minutes for transcription, or 40 minutes for understanding
- Built-in Q&A and summarization: Supports asking questions directly through audio. Analyze audio and generate structured summaries without the need for separate ASR and language models
- Natively multilingual: Automatic language detection and state-of-the-art performance in the world’s most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian)
- Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents
- Highly capable at text: Retains the text understanding capabilities of its language model backbone, Mistral Small 3.1
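The 30-minute transcription limit in the list above suggests splitting longer recordings before sending them to the model. The following is a minimal sketch, assuming the recording duration is already known in seconds. `plan_chunks` and its fixed, non-overlapping windows are hypothetical choices for illustration and are not part of Voxtral or mistral-common.

```python
from typing import List, Tuple

# Limits quoted in the feature list above, converted from minutes to seconds.
MAX_TRANSCRIPTION_S = 30 * 60
MAX_UNDERSTANDING_S = 40 * 60


def plan_chunks(duration_s: float, max_chunk_s: float = MAX_TRANSCRIPTION_S) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that cover the whole recording.

    Hypothetical helper: windows are contiguous, non-overlapping, and never
    longer than max_chunk_s.
    """
    if max_chunk_s <= 0:
        raise ValueError("max_chunk_s must be positive")
    duration_s = float(duration_s)
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


# A 70-minute recording is split into three transcription windows.
print(plan_chunks(70 * 60))
```

Cutting at fixed offsets can split a word in half; a real pipeline would more likely cut at silences or overlap the windows slightly.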
Benchmark Results
Audio
Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
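For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below uses that standard definition. It does not apply any text normalization (casing, punctuation), so it is not necessarily how the benchmark numbers were computed.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```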

Text

Usage
The model can be used with the following frameworks: vLLM (recommended) and Transformers.
Notes:
- Use temperature=0.2 and top_p=0.95 for chat completion (e.g. Audio Understanding) and temperature=0.0 for transcription
- Multiple audios per message and multiple user turns with audio are supported
- Function calling is supported
- System prompts are not yet supported
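One way to keep the recommended sampling settings from the notes above in a single place is a small lookup table. This is only a sketch: the task labels `"chat"` and `"transcription"` and the helper name `sampling_params` are hypothetical, and only the numeric values come from the notes.

```python
from typing import Dict

# Values taken from the notes above; the task names are hypothetical labels.
RECOMMENDED_SAMPLING: Dict[str, Dict[str, float]] = {
    "chat": {"temperature": 0.2, "top_p": 0.95},
    "transcription": {"temperature": 0.0},
}


def sampling_params(task: str) -> Dict[str, float]:
    """Return a copy of the recommended sampling settings for a task."""
    if task not in RECOMMENDED_SAMPLING:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(RECOMMENDED_SAMPLING)}")
    return dict(RECOMMENDED_SAMPLING[task])


print(sampling_params("chat"))
print(sampling_params("transcription"))
```

The returned dictionary can be unpacked into a request, for example `client.chat.completions.create(model=model, messages=messages, **sampling_params("chat"))`.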
vLLM (recommended)
We recommend using this model with vLLM.
Installation
Make sure to install vllm >= 0.10.0. We recommend using uv:
sh
uv pip install -U "vllm[audio]" --system
Doing so should automatically install mistral_common >= 1.8.1.
To check:
sh
python -c "import mistral_common; print(mistral_common.__version__)"
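The same check can be scripted for both packages. The sketch below is an assumption-laden convenience, not an official tool: the helper names are hypothetical, the version parser is deliberately simplified (the `packaging.version` module is more robust), and the 1.8.1 minimum for mistral-common is the one stated in the Transformers section of this README.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.8.1' into (1, 8, 1); parsing stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(package: str, minimum: str) -> bool:
    found = installed_version(package)
    return found is not None and parse_version(found) >= parse_version(minimum)


# Distribution names as published on PyPI; minimums as stated in this README.
for name, minimum in [("vllm", "0.10.0"), ("mistral-common", "1.8.1")]:
    status = "ok" if meets_minimum(name, minimum) else "missing or too old"
    print(f"{name}: installed={installed_version(name)} required>={minimum} -> {status}")
```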
Offline
You can test that your vLLM setup works as expected by cloning the vLLM repo:
sh
git clone https://github.com/vllm-project/vllm && cd vllm
and then running:
sh
python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
Serve
We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.
- Spin up a server:
sh
vllm serve mistralai/Voxtral-Small-24B-2507 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tensor-parallel-size 2 \
  --tool-call-parser mistral \
  --enable-auto-tool-choice
Note: Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
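As a rough sanity check on that figure, the weights alone can be estimated from the parameter count. The sketch below assumes roughly 24e9 parameters (read off the "24B" in the model name) and 2 bytes per parameter for bf16 or fp16. Attributing the remainder to the audio encoder, activations, and KV cache is an assumption, and the even split across two GPUs is an approximation of what `--tensor-parallel-size 2` does.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: about 24e9 parameters, going by the "24B" in the model name.
weights = weight_memory_gb(24e9)  # bf16 / fp16 use 2 bytes per parameter
total = 55.0                      # approximate figure quoted in the note above
remainder = total - weights       # presumably encoder, activations, KV cache
per_gpu = total / 2               # rough share per GPU with tensor parallel size 2

print(f"weights ~{weights:.0f} GB, remainder ~{remainder:.0f} GB, per GPU ~{per_gpu:.1f} GB")
```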
- To query the server from a client, you can use a simple Python snippet. See the following examples.
Audio Instruct
Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.
Make sure that your client has mistral-common with audio installed:
sh
pip install --upgrade mistral_common\[audio\]
py
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()

print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content

print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The model could give the following answer:
# ```L'orateur le plus inspirant est le président.
# Il est plus inspirant parce qu'il parle de ses expériences personnelles
# et de son optimisme pour l'avenir du pays.
# Il est différent de l'autre orateur car il ne parle pas de la météo,
# mais plutôt de ses interactions avec les gens et de son rôle en tant que président.```

messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")

response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
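The second request in the example works because the full history (first user message, assistant reply, follow-up question) is sent again. For longer conversations that pattern can be wrapped in a helper. The sketch below uses plain OpenAI-style dictionaries instead of the mistral-common message classes, and `add_turn` is a hypothetical name.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def add_turn(history: List[Message], assistant_reply: str, next_user_text: str) -> List[Message]:
    """Return a new history with the assistant reply and a follow-up user turn appended.

    The input list is not modified, so earlier histories stay reusable.
    """
    return history + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_text},
    ]


# Placeholder first turn; in the example above this would be the audio message user_msg.
history: List[Message] = [{"role": "user", "content": "Which speaker is more inspiring?"}]
history = add_turn(history, "The first speaker.", "Ok, now please summarize the content of the first audio.")
print([m["role"] for m in history])
```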
Transcription
Voxtral-Small-24B-2507 has powerful transcription capabilities!
Make sure that your client has mistral-common with audio installed:
sh
pip install --upgrade mistral_common\[audio\]
python
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)

audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))

response = client.audio.transcriptions.create(**req)
print(response)
Function Calling
Voxtral has experimental function calling support. You can try it as shown below.
Make sure that your client has mistral-common with audio installed:
sh
pip install --upgrade mistral_common\[audio\]
python
from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage, TextChunk
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.tool_calls import Function, Tool

from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

tool = Tool(
    function=Function(
        name="get_current_weather",
        description="Get the current weather",
        parameters={
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use. Infer this from the user's location.",
                },
            },
            "required": ["location", "format"],
        },
    )
)
tools = [tool.to_openai()]

weather_like = hf_hub_download("patrickvonplaten/audio_samples", "fn_calling.wav", repo_type="dataset")

def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)

audio_chunk = file_to_chunk(weather_like)

print(30 * "=" + "Transcription" + 30 * "=")
req = TranscriptionRequest(model=model, audio=audio_chunk.input_audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
response = client.audio.transcriptions.create(**req)
print(response.text)  # How is the weather in Madrid at the moment?
print("\n")

print(30 * "=" + "Function calling" + 30 * "=")
audio_chunk = file_to_chunk(weather_like)
user_msg = UserMessage(content=[audio_chunk]).to_openai()
response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
    tools=tools,
)
print(30 * "=" + "BOT 1" + 30 * "=")
print(response.choices[0].message.tool_calls)
print("\n\n")
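The example stops at printing the tool calls. A possible next step is to route each call to a local function. The sketch below assumes the OpenAI client shape, where each tool call exposes `function.name` and `function.arguments` (a JSON string). The weather function is a hypothetical stub and the tool call is built by hand, so nothing here contacts a server.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Optional


def get_current_weather(location: str, format: str) -> str:
    """Hypothetical backend function; a real one would query a weather service."""
    return f"(stub) weather for {location} in {format}"


# Maps tool names declared to the model onto local callables.
REGISTRY: Dict[str, Callable[..., Any]] = {"get_current_weather": get_current_weather}


def dispatch_tool_calls(tool_calls: Optional[List[Any]]) -> List[Any]:
    """Run each requested tool in order and collect the results."""
    results = []
    for call in tool_calls or []:
        name = call.function.name
        if name not in REGISTRY:
            raise KeyError(f"model requested unknown tool {name!r}")
        kwargs = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        results.append(REGISTRY[name](**kwargs))
    return results


# Hand-built stand-in for response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    function=SimpleNamespace(
        name="get_current_weather",
        arguments='{"location": "Madrid", "format": "celsius"}',
    )
)
print(dispatch_tool_calls([fake_call]))
```

Rejecting unknown tool names, rather than looking functions up dynamically, keeps the model from triggering anything that was not explicitly registered.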
Transformers 🤗
Starting with transformers >= 4.54.0, you can run Voxtral natively!
Install Transformers:
bash
pip install -U transformers
Make sure to have mistral-common >= 1.8.1 installed with audio dependencies:
bash
pip install --upgrade "mistral-common[audio]"
Audio Instruct
python
# Multi-audio + text instruction
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
python
# Multi-turn conversation
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
python
# Text only
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
python
# Audio only
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
python
# Batched inference
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
Transcription
python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

inputs = processor.apply_transcription_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
Model provider: mistralai
Model tree:
- Base: mistralai/Mistral-Small-24B-Base-2501
- Fine-tuned: this model
Modalities:
- Input: Audio, Text
- Output: Text