Dedicated Endpoints
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Container
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Stock proof
bash
docker run --rm -it \--gpus all \--ipc=host \-p 8001:8000 \-v ~/.cache/huggingface:/root/.cache/huggingface \vllm/vllm-openai:latest \mistralai/Ministral-3-3B-Instruct-2512 \--served-model-name Ministral-3-3B-Instruct-2512-stock \--dtype bfloat16 \--max-model-len 8192 \--gpu-memory-utilization 0.7
Serve the packaged artifact
bash
docker run --rm -it \--gpus all \--ipc=host \-p 8002:8000 \-v /path/to/Ministral-3-3B-Instruct-2512-W4A16-BF16Vision:/model \-v ~/.cache/huggingface:/root/.cache/huggingface \vllm/vllm-openai:latest \/model \--served-model-name Ministral-3-3B-Instruct-2512-W4A16-BF16Vision \--dtype bfloat16 \--quantization compressed-tensors \--max-model-len 8192 \--gpu-memory-utilization 0.7
Smoke test
bash
python verify.py --url http://localhost:8002/v1/chat/completions
Notes
- Best fit: RTX 30xx/40xx Ampere cards.
- The Pixtral vision tower and multimodal projector remain in BF16; only the language-model decoder is quantized.
Model provider
useful-quants
Model tree
Base
mistralai/Ministral-3-3B-Instruct-2512
Quantized
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information