License: apache-2.0
News
[!NOTE] [2026.02.06] 🥳 🥳 🥳 MiniCPM-o 4.5 Local & Ready-to-Run! Experience low-latency full-duplex communication directly on your own Mac using our new official Docker image. Try it now!
MiniCPM-o 4.5
MiniCPM-o 4.5 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip2, Whisper-medium, CosyVoice2, and Qwen3-8B with a total of 9B parameters. It exhibits a significant performance improvement, and introduces new features for full-duplex multimodal live streaming. Notable features of MiniCPM-o 4.5 include:
- 🔥 Leading Visual Capability. MiniCPM-o 4.5 achieves an average score of 77.6 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 9B parameters, it surpasses widely used proprietary models such as GPT-4o and Gemini 2.0 Pro, and approaches Gemini 2.5 Flash in vision-language capability. It supports instruct and thinking modes in a single model, better covering efficiency and performance trade-offs in different user scenarios.
- 🎙 Strong Speech Capability. MiniCPM-o 4.5 supports bilingual real-time speech conversation with configurable voices in English and Chinese. It features more natural, expressive and stable speech conversation. The model also allows for fun features such as voice cloning and role play via a simple reference audio clip, where the cloning performance surpasses strong TTS tools such as CosyVoice2.
- 🎬 New Full-Duplex and Proactive Multimodal Live Streaming Capability. As a new feature, MiniCPM-o 4.5 can process real-time, continuous video and audio input streams simultaneously while generating concurrent text and speech output streams in an end-to-end fashion, without mutual blocking. This allows MiniCPM-o 4.5 to see, listen, and speak simultaneously, creating a fluid, real-time omnimodal conversation experience. Beyond reactive responses, the model can also perform proactive interaction, such as initiating reminders or comments based on its continuous understanding of the live scene.
- 💪 Strong OCR Capability, Efficiency and Others. Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 4.5 can efficiently process high-resolution images (up to 1.8 million pixels) and high-FPS videos (up to 10fps) in any aspect ratio. It achieves state-of-the-art performance for end-to-end English document parsing on OmniDocBench, outperforming proprietary models such as Gemini-3 Flash and GPT-5, and specialized tools such as DeepSeek-OCR 2. It also features trustworthy behaviors, matching Gemini 2.5 Flash on MMHal-Bench, and supports more than 30 languages.
- 💫 Easy Usage. MiniCPM-o 4.5 can be easily used in various ways: (1) llama.cpp and Ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM and SGLang support for high-throughput and memory-efficient inference, (4) FlagOS support for the unified multi-chip backend plugin, (5) fine-tuning on new domains and tasks with LLaMA-Factory, and (6) an online web demo on a server. We also roll out a high-performing llama.cpp-omni inference framework together with a WebRTC Demo, which enables the full-duplex multimodal live streaming experience on local devices such as PCs (e.g., on a MacBook).
Model Architecture.
Evaluation
Image Understanding (Instruct)
Image Understanding (Thinking)
Video Understanding
OmniDocBench
Text Capability
Omni Simplex
Vision Duplex
Audio Understanding
Speech Generation
Long Speech Generation
Emotion Control
Inference Efficiency
Examples
Examples: 🎙️ Speech
Simplex speech conversation with custom reference audio and character prompts.
Examples: Vision-Language
Usage
Note: This GPTQ model is pre-quantized to W4A16, reducing GPU memory usage from ~19GB (BF16) to ~11GB (INT4). For loading, use torch_dtype=torch.bfloat16 with device_map="auto"; the weight format of the quantized layers is handled automatically by the GPTQ kernel.
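As a rough check on the memory figures in the note above, weight memory can be estimated as parameter count times bits per weight. The sketch below is a back-of-envelope estimate, not a measurement. It assumes the 9B total parameter count stated earlier and counts weights only (no activations, KV cache, or quantization scales). The `quantized_fraction` value is a hypothetical input, since this card does not say which layers are quantized.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


def mixed_memory_gb(num_params: float, quantized_fraction: float,
                    quant_bits: float = 4, full_bits: float = 16) -> float:
    """Weight storage when only part of the model is quantized.

    quantized_fraction is a hypothetical knob for illustration; the real
    split between 4-bit and 16-bit layers is not given in this card.
    """
    quantized = weight_memory_gb(num_params * quantized_fraction, quant_bits)
    full = weight_memory_gb(num_params * (1 - quantized_fraction), full_bits)
    return quantized + full


TOTAL_PARAMS = 9e9  # "a total of 9B parameters", as stated in this card

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)   # every weight in BF16
int4_gb = weight_memory_gb(TOTAL_PARAMS, 4)    # every weight in 4-bit
half_gb = mixed_memory_gb(TOTAL_PARAMS, 0.5)   # hypothetical 50% quantized

print(f"BF16 weights:          {bf16_gb:.2f} GB")
print(f"All weights at 4-bit:  {int4_gb:.2f} GB")
print(f"Half quantized (hyp.): {half_gb:.2f} GB")
```

Under these assumptions, BF16 weights come to about 18 GB, in line with the ~19GB quoted. A fully 4-bit model would need about 4.5 GB for weights, so the ~11GB figure suggests that a substantial share of the weights stays in higher precision, though the exact split is not stated here.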
bash
pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils[all]>=1.0.2" auto-gptq
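Before loading the model, it can help to confirm that the installed packages fall inside the version pins from the install command above. The following is a minimal sketch using only the standard library. The helper names (`parse_release`, `in_range`, `check_environment`) are hypothetical, and the comparison looks only at the numeric release part of a version, so pre-release and local suffixes such as `rc1` or `+cu121` are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above: (minimum, maximum), None = open.
CONSTRAINTS = {
    "transformers": ("4.51.0", "4.51.0"),
    "torch": ("2.3.0", "2.8.0"),
    "torchaudio": (None, "2.8.0"),
    "minicpmo-utils": ("1.0.2", None),
}


def parse_release(v: str) -> tuple:
    """Turn '2.5.1+cu121' into (2, 5, 1), padding short versions with zeros."""
    match = re.match(r"\d+(?:\.\d+)*", v)
    if match is None:
        raise ValueError(f"unrecognized version string: {v!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def in_range(v: str, low=None, high=None) -> bool:
    """Return True if low <= v <= high, treating None as an open bound."""
    release = parse_release(v)
    if low is not None and release < parse_release(low):
        return False
    if high is not None and release > parse_release(high):
        return False
    return True


def check_environment(constraints=CONSTRAINTS) -> dict:
    """Map each package name to 'ok', 'out of range', or 'missing'."""
    report = {}
    for name, (low, high) in constraints.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if in_range(installed, low, high) else "out of range"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```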
python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-4_5-gptq",
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    init_vision=True,
    init_audio=True,
    init_tts=True,
)
model.eval()
For omni-modal inference (vision + audio), ensure init_vision=True, init_audio=True, init_tts=True. For vision-only inference, set init_audio=False and init_tts=False.
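The flag combinations described above can be sketched as a small helper. This is an illustration only: the mode names `"omni"` and `"vision"` and the functions `init_flags` and `load_model` are hypothetical conveniences, not part of the model's API. The `from_pretrained` arguments mirror the loading snippet in this README, and the heavy imports are deferred so that the flag selection can be inspected without a GPU.

```python
def init_flags(mode: str) -> dict:
    """Return the init_* flags for a given inference mode (hypothetical names)."""
    if mode == "omni":
        # Vision + audio input with speech output: enable every component.
        return {"init_vision": True, "init_audio": True, "init_tts": True}
    if mode == "vision":
        # Vision-only inference: skip the audio encoder and the TTS module.
        return {"init_vision": True, "init_audio": False, "init_tts": False}
    raise ValueError(f"unknown mode: {mode!r} (expected 'omni' or 'vision')")


def load_model(mode: str = "omni", model_id: str = "openbmb/MiniCPM-o-4_5-gptq"):
    """Load the model with the flags for `mode`; downloads weights when called."""
    # Imported lazily so that init_flags() works without torch installed.
    import torch
    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        model_id,
        trust_remote_code=True,
        attn_implementation="sdpa",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        **init_flags(mode),
    )
    return model.eval()


print(init_flags("vision"))
```

Calling `load_model("vision")` would then load the model without the audio and TTS components, following the note above.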
For detailed usage (chat, streaming, full-duplex, TTS, visual understanding, etc.), see the base model README and the Cookbook.
License
Model License
- The MiniCPM-o/V model weights and code are open-sourced under the Apache-2.0 license.
Statement
- As an LMM, MiniCPM-o 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers.
- We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.
Key Techniques and Other Multimodal Projects
👏 Welcome to explore key techniques of MiniCPM-o/V and other multimodal projects of our team:
VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V
Citation
If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!
bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
Model provider: openbmb
Model tree: base model openbmb/MiniCPM-o-4_5; this model is a quantized variant
Modalities: input Video, Audio, Text, Image; output Text