License: apache-2.0
Key Features
Ministral 3 14B consists of two main architectural components:
- 13.5B Language Model
- 0.4B Vision Encoder
The Ministral 3 14B Reasoning model offers the following capabilities:
- Vision: Enables the model to analyze images and provide insights based on visual content, in addition to text.
- Multilingual: Supports dozens of languages, including English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic.
- System Prompt: Maintains strong adherence and support for system prompts.
- Agentic: Offers best-in-class agentic capabilities with native function calling and JSON outputting.
- Reasoning: Excels at complex, multi-step reasoning and dynamic problem-solving.
- Edge-Optimized: Delivers best-in-class performance at a small scale, deployable anywhere.
- Apache 2.0 License: Open-source license allowing usage and modification for both commercial and non-commercial purposes.
- Large Context Window: Supports a 256k context window.
Use Cases
Private AI deployments where advanced capabilities meet practical hardware constraints:
- Private/custom chat and AI assistant deployments in constrained environments
- Advanced local agentic use cases
- Fine-tuning and specialization
- And more...
Bringing advanced AI capabilities to most environments.
Recommended Settings
We recommend deploying with the following best practices:
- System Prompt: Use our provided system prompt, and append it to your custom system prompt to define a clear environment and use case, including guidance on how to effectively leverage tools in agentic systems.
- Multi-turn Traces: We highly recommend keeping the reasoning traces in context.
- Sampling Parameters: Use a temperature of 1 for most environments. Other temperatures may suit other use cases, and developers are encouraged to experiment with alternative settings.
- Tools: Keep the set of tools well-defined and limit their number to the minimum required for the use case. Avoid overloading the model with an excessive number of tools.
- Vision: When deploying with vision capabilities, we recommend keeping image aspect ratios close to 1:1 (width-to-height). Avoid overly thin or wide images, and crop them as needed to ensure optimal performance.
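The aspect-ratio recommendation above can be applied with a small pre-processing step. The sketch below computes a centered crop box for an image that is too thin or too wide. The function name `center_crop_box` and the `max_ratio` threshold of 1.5 are illustrative assumptions, not values from the model card.

```python
def center_crop_box(
    width: int, height: int, max_ratio: float = 1.5
) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a centered crop.

    The crop keeps the longer side at most `max_ratio` times the shorter
    side. `max_ratio` is a hypothetical threshold chosen for illustration.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if max_ratio < 1:
        raise ValueError("max_ratio must be at least 1")

    new_width, new_height = width, height
    if width > height * max_ratio:
        # Image is too wide: trim the left and right edges.
        new_width = int(height * max_ratio)
    elif height > width * max_ratio:
        # Image is too tall: trim the top and bottom edges.
        new_height = int(width * max_ratio)

    left = (width - new_width) // 2
    top = (height - new_height) // 2
    return (left, top, left + new_width, top + new_height)


if __name__ == "__main__":
    # A 4000x1000 panorama is reduced to a centered 1500x1000 region.
    print(center_crop_box(4000, 1000))
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it can be passed to that method directly if Pillow is the image library in use.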
Ministral 3 Family
| Model Name | Type | Precision | Link |
|---|---|---|---|
| Ministral 3 3B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 3B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 3B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
| Ministral 3 8B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 8B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 8B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
| Ministral 3 14B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 14B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 14B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
Other formats available here.
Benchmark Results
We compare Ministral 3 to similarly sized models.
Reasoning
| Model | AIME25 | AIME24 | GPQA Diamond | LiveCodeBench |
|---|---|---|---|---|
| Ministral 3 14B | 0.850 | 0.898 | 0.712 | 0.646 |
| Qwen3-14B (Thinking) | 0.737 | 0.837 | 0.663 | 0.593 |
| Ministral 3 8B | 0.787 | 0.860 | 0.668 | 0.616 |
| Qwen3-VL-8B-Thinking | 0.798 | 0.860 | 0.671 | 0.580 |
| Ministral 3 3B | 0.721 | 0.775 | 0.534 | 0.548 |
| Qwen3-VL-4B-Thinking | 0.697 | 0.729 | 0.601 | 0.513 |
Instruct
| Model | Arena Hard | WildBench | MATH Maj@1 | MM MTBench |
|---|---|---|---|---|
| Ministral 3 14B | 0.551 | 68.5 | 0.904 | 8.49 |
| Qwen3 14B (Non-Thinking) | 0.427 | 65.1 | 0.870 | NOT MULTIMODAL |
| Gemma3-12B-Instruct | 0.436 | 63.2 | 0.854 | 6.70 |
| Ministral 3 8B | 0.509 | 66.8 | 0.876 | 8.08 |
| Qwen3-VL-8B-Instruct | 0.528 | 66.3 | 0.946 | 8.00 |
| Ministral 3 3B | 0.305 | 56.8 | 0.830 | 7.83 |
| Qwen3-VL-4B-Instruct | 0.438 | 56.8 | 0.900 | 8.01 |
| Qwen3-VL-2B-Instruct | 0.163 | 42.2 | 0.786 | 6.36 |
| Gemma3-4B-Instruct | 0.318 | 49.1 | 0.759 | 5.23 |
Base
| Model | Multilingual MMLU | MATH CoT 2-Shot | AGIEval 5-shot | MMLU Redux 5-shot | MMLU 5-shot | TriviaQA 5-shot |
|---|---|---|---|---|---|---|
| Ministral 3 14B | 0.742 | 0.676 | 0.648 | 0.820 | 0.794 | 0.749 |
| Qwen3 14B Base | 0.754 | 0.620 | 0.661 | 0.837 | 0.804 | 0.703 |
| Gemma 3 12B Base | 0.690 | 0.487 | 0.587 | 0.766 | 0.745 | 0.788 |
| Ministral 3 8B | 0.706 | 0.626 | 0.591 | 0.793 | 0.761 | 0.681 |
| Qwen 3 8B Base | 0.700 | 0.576 | 0.596 | 0.794 | 0.760 | 0.639 |
| Ministral 3 3B | 0.652 | 0.601 | 0.511 | 0.735 | 0.707 | 0.592 |
| Qwen 3 4B Base | 0.677 | 0.405 | 0.570 | 0.759 | 0.713 | 0.530 |
| Gemma 3 4B Base | 0.516 | 0.294 | 0.430 | 0.626 | 0.589 | 0.640 |
Usage
The model can be used with the following frameworks:
- vllm: See here
- transformers: See here
vLLM
We recommend using this model with vLLM.
Installation
Make sure to install vllm >= 0.12.0:
```bash
pip install vllm --upgrade
```
Doing so should automatically install mistral_common >= 1.8.6.
To check:
```bash
python -c "import mistral_common; print(mistral_common.__version__)"
```
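Both minimum versions can also be checked in one script. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the comparison is deliberately simplified: it looks only at the leading numeric components, so a pre-release such as `0.12.0rc1` is treated as `0.12.0`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '1.8.6rc1' -> (1, 8, 6)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numerically, padding with zeros so '0.12' equals '0.12.0'."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(package: str, minimum: str) -> str:
    """Report whether an installed distribution satisfies the minimum."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package}: not installed (need >= {minimum})"
    status = "ok" if meets_minimum(installed, minimum) else "too old"
    return f"{package} {installed}: {status} (need >= {minimum})"


if __name__ == "__main__":
    # Minimum versions are the ones stated in this README.
    print(check("vllm", "0.12.0"))
    print(check("mistral-common", "1.8.6"))
```

Comparing parsed integers rather than raw strings matters here: as strings, `"0.9.2"` would sort after `"0.12.0"`.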
You can also use a ready-to-go Docker image, which is also available on Docker Hub.
Serve
To fully exploit Ministral-3-14B-Reasoning-2512, we recommend deploying on 2xH200 GPUs because of its large context window. If you don't need a large context, you can fall back to a single GPU.
A simple launch command is:
```bash
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
  --tensor-parallel-size 2 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --enable-auto-tool-choice --tool-call-parser mistral \
  --reasoning-parser mistral
```
Key parameter notes:
- enable-auto-tool-choice: Required when enabling tool usage.
- tool-call-parser mistral: Required when enabling tool usage.
- reasoning-parser mistral: Required when enabling reasoning.
Additional flags:
- You can set `--max-model-len` to preserve memory. By default it is set to `262144`, which is quite large but not necessary for most scenarios.
- You can set `--max-num-batched-tokens` to balance throughput and latency; higher means higher throughput but higher latency.
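To show how these flags combine with the launch command, here is a small sketch that assembles the argument list in Python. It uses only flags mentioned in this README. The helper `build_serve_command` is hypothetical, and the context length of 32768 in the example is an arbitrary illustrative value, not a recommendation from the model card.

```python
import shlex
from typing import Optional

MODEL_ID = "mistralai/Ministral-3-14B-Reasoning-2512"


def build_serve_command(
    tensor_parallel_size: int = 2,
    max_model_len: Optional[int] = None,
    max_num_batched_tokens: Optional[int] = None,
) -> list[str]:
    """Assemble the `vllm serve` command, adding optional flags on request."""
    command = [
        "vllm", "serve", MODEL_ID,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--tokenizer_mode", "mistral",
        "--config_format", "mistral",
        "--load_format", "mistral",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "mistral",
        "--reasoning-parser", "mistral",
    ]
    if max_model_len is not None:
        # A smaller context window reduces memory use.
        command += ["--max-model-len", str(max_model_len)]
    if max_num_batched_tokens is not None:
        # Higher values favor throughput over latency.
        command += ["--max-num-batched-tokens", str(max_num_batched_tokens)]
    return command


if __name__ == "__main__":
    # Single-GPU fallback with a reduced context (illustrative value).
    print(shlex.join(build_serve_command(tensor_parallel_size=1, max_model_len=32768)))
```

The resulting list can be passed to `subprocess.run` as-is, or printed with `shlex.join` and pasted into a shell.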
Usage of the model
Here we assume that the model mistralai/Ministral-3-14B-Reasoning-2512 is being served and that you can reach it at localhost on port 8000, the vLLM default.
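Before running the examples, it can help to confirm that the server is reachable and see which model it reports. The sketch below queries the OpenAI-compatible model listing with the standard library only. The helper names are hypothetical, and the base URL assumes the default host and port stated above.

```python
import json
from urllib.request import urlopen


def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model listing payload."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(
    base_url: str = "http://localhost:8000/v1", timeout: float = 5.0
) -> list[str]:
    """Query the server's model listing and return the served model ids."""
    with urlopen(f"{base_url}/models", timeout=timeout) as response:
        return served_model_ids(json.load(response))


if __name__ == "__main__":
    try:
        print("Served models:", fetch_model_ids())
    except OSError as error:
        # Connection refused, timeout, and similar network errors land here.
        print("Server not reachable:", error)
```

This mirrors what the examples do with `client.models.list()` and `models.data[0].id`, without requiring the `openai` package.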
Let's see if the Ministral 3 model knows when to pick a fight!
```python
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    elif content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )
```
Now we'll make it do some maths!
```python
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://i.ytimg.com/vi/5Y3xLHeyKZU/hqdefault.jpg"

messages = [
    SYSTEM_PROMPT,
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Solve the equations. If they contain only numbers, use your calculator, else only think. Answer in the language of the image.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )
```
Let's do more maths and leave it up to the model to figure out how to achieve a result.
```python
from typing import Any

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

TEMP = 0.7
TOP_P = 0.95
MAX_TOK = 262144

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> dict[str, Any]:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()

    index_begin_think = system_prompt.find("[THINK]")
    index_end_think = system_prompt.find("[/THINK]")

    return {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt[:index_begin_think]},
            {
                "type": "thinking",
                "thinking": system_prompt[
                    index_begin_think + len("[THINK]") : index_end_think
                ],
                "closed": True,
            },
            {
                "type": "text",
                "text": system_prompt[index_end_think + len("[/THINK]") :],
            },
        ],
    }


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

query = "Use each number in 2,5,6,3 exactly once, along with any combination of +, -, ×, ÷ (and parentheses for grouping), to make the number 24."

messages = [
    SYSTEM_PROMPT,
    {"role": "user", "content": query},
]

stream = client.chat.completions.create(
    model=model,
    messages=messages,
    stream=True,
    temperature=TEMP,
    top_p=TOP_P,
    max_tokens=MAX_TOK,
)

print("client: Start streaming chat completions...:\n")
printed_reasoning_content = False
answer = []

for chunk in stream:
    reasoning_content = None
    content = None
    # Check the content is reasoning_content or content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        reasoning_content = chunk.choices[0].delta.reasoning_content
    if hasattr(chunk.choices[0].delta, "content"):
        content = chunk.choices[0].delta.content

    if reasoning_content is not None:
        if not printed_reasoning_content:
            printed_reasoning_content = True
            print("Start reasoning:\n", end="", flush=True)
        print(reasoning_content, end="", flush=True)
    if content is not None:
        # Extract and print the content
        if not reasoning_content and printed_reasoning_content:
            answer.extend(content)
        print(content, end="", flush=True)

if answer:
    print("\n\n=============\nAnswer\n=============\n")
    print("".join(answer))
else:
    print("\n\n=============\nNo Answer\n=============\n")
    print(
        "No answer was generated by the model, probably because the maximum number of tokens was reached."
    )
```
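The recommended settings advise keeping reasoning traces in context across turns. One possible way to do that is sketched below: the assistant's previous turn is stored with its reasoning as a `thinking` chunk, mirroring the chunk layout that `load_system_prompt` builds for the system message. Whether the server accepts this layout for assistant messages is an assumption here, and the helper names `assistant_turn` and `extend_history` are hypothetical.

```python
from typing import Any


def assistant_turn(reasoning: str, answer: str) -> dict[str, Any]:
    """Build an assistant message that keeps the reasoning trace.

    Assumption: the same "thinking" chunk layout used for the system
    prompt in the examples is also accepted for assistant messages.
    """
    return {
        "role": "assistant",
        "content": [
            {"type": "thinking", "thinking": reasoning, "closed": True},
            {"type": "text", "text": answer},
        ],
    }


def extend_history(
    messages: list[dict[str, Any]],
    reasoning: str,
    answer: str,
    next_user_text: str,
) -> list[dict[str, Any]]:
    """Return a new history with the last assistant turn and the next user turn."""
    return messages + [
        assistant_turn(reasoning, answer),
        {"role": "user", "content": next_user_text},
    ]


if __name__ == "__main__":
    history = [{"role": "user", "content": "What is 2 + 2?"}]
    history = extend_history(history, "2 + 2 = 4", "4", "And multiplied by 3?")
    for message in history:
        print(message["role"], "->", message["content"])
```

To use this with the streaming loops above, the `reasoning_content` deltas would need to be accumulated into a string in the same way the `content` deltas are collected in `answer`.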
Transformers
You can also use Ministral 3 14B Reasoning 2512 with Transformers!
Make sure to install Transformers from its first v5 release candidate or from "main":
```bash
pip install transformers==5.0.0rc0
```
To make the best use of our model with Transformers make sure to have installed mistral-common >= 1.8.6 to use our tokenizer.
```bash
pip install mistral-common --upgrade
```
Then load our tokenizer along with the model and generate:
```python
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend

model_id = "mistralai/Ministral-3-14B-Reasoning-2512"

tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)

tokenized["input_ids"] = tokenized["input_ids"].to(device="cuda")
tokenized["pixel_values"] = tokenized["pixel_values"].to(dtype=torch.bfloat16, device="cuda")
image_sizes = [tokenized["pixel_values"].shape[-2:]]

output = model.generate(
    **tokenized,
    image_sizes=image_sizes,
    max_new_tokens=8092,
)[0]

decoded_output = tokenizer.decode(output[len(tokenized["input_ids"][0]):])
print(decoded_output)
```
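Unlike the vLLM examples, the Transformers example prints the decoded output as a single string. If that string contains the `[THINK]` and `[/THINK]` markers that appear in the system prompt file, the reasoning can be separated from the final answer with plain string handling. The sketch below assumes the markers survive decoding as literal text, which may not hold for every tokenizer configuration. The helper `split_reasoning` is hypothetical.

```python
def split_reasoning(
    decoded: str, open_tag: str = "[THINK]", close_tag: str = "[/THINK]"
) -> tuple[str, str]:
    """Split decoded text into (reasoning, answer).

    Assumes the markers appear literally in the decoded text. If no opening
    marker is found, the whole text is treated as the answer.
    """
    start = decoded.find(open_tag)
    if start == -1:
        return "", decoded.strip()

    end = decoded.find(close_tag, start)
    if end == -1:
        # Generation stopped mid-reasoning, so there is no answer yet.
        return decoded[start + len(open_tag):].strip(), ""

    reasoning = decoded[start + len(open_tag):end].strip()
    answer = (decoded[:start] + decoded[end + len(close_tag):]).strip()
    return reasoning, answer


if __name__ == "__main__":
    sample = "[THINK]6 * (5 - 3) * 2 = 24[/THINK]One solution is 6 × (5 − 3) × 2."
    reasoning, answer = split_reasoning(sample)
    print("Reasoning:", reasoning)
    print("Answer:", answer)
```

An empty answer is a useful signal that `max_new_tokens` was reached before the model finished reasoning, matching the "No Answer" case handled in the vLLM examples.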
License
This model is licensed under the Apache 2.0 License.
You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party’s rights, including intellectual property rights.
Model provider: mistralai
Model tree: this model is fine-tuned from the base model mistralai/Ministral-3-14B-Base-2512.
Modalities: text input, text output.