openbmb/MiniCPM-o-4_5-GPTQ

News

[!NOTE] [2026.02.06] 🥳 🥳 🥳 MiniCPM-o 4.5 Local & Ready-to-Run! Experience low-latency full-duplex communication directly on your own Mac using our new official Docker image. Try it now!

MiniCPM-o 4.5

MiniCPM-o 4.5 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip2, Whisper-medium, CosyVoice2, and Qwen3-8B with a total of 9B parameters. It exhibits a significant performance improvement, and introduces new features for full-duplex multimodal live streaming. Notable features of MiniCPM-o 4.5 include:

🔥 Leading Visual Capability. MiniCPM-o 4.5 achieves an average score of 77.6 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 9B parameters, it surpasses widely used proprietary models such as GPT-4o and Gemini 2.0 Pro, and approaches Gemini 2.5 Flash in vision-language capability. It supports instruct and thinking modes in a single model, covering the trade-off between efficiency and performance across different user scenarios.
🎙 Strong Speech Capability. MiniCPM-o 4.5 supports bilingual real-time speech conversation with configurable voices in English and Chinese. It features more natural, expressive and stable speech conversation. The model also allows for fun features such as voice cloning and role play via a simple reference audio clip, where the cloning performance surpasses strong TTS tools such as CosyVoice2.
🎬 New Full-Duplex and Proactive Multimodal Live Streaming Capability. As a new feature, MiniCPM-o 4.5 can process real-time, continuous video and audio input streams simultaneously while generating concurrent text and speech output streams in an end-to-end fashion, without mutual blocking. This allows MiniCPM-o 4.5 to see, listen, and speak simultaneously, creating a fluid, real-time omnimodal conversation experience. Beyond reactive responses, the model can also perform proactive interaction, such as initiating reminders or comments based on its continuous understanding of the live scene.
💪 Strong OCR Capability, Efficiency and Others. Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 4.5 can efficiently process high-resolution images (up to 1.8 million pixels) and high-FPS videos (up to 10fps) in any aspect ratio. It achieves state-of-the-art performance for end-to-end English document parsing on OmniDocBench, outperforming proprietary models such as Gemini-3 Flash and GPT-5, and specialized tools such as DeepSeek-OCR 2. It also features trustworthy behaviors, matching Gemini 2.5 Flash on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.
💫 Easy Usage. MiniCPM-o 4.5 can be used in several ways: (1) llama.cpp and Ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM and SGLang support for high-throughput and memory-efficient inference, (4) FlagOS support for the unified multi-chip backend plugin, (5) fine-tuning on new domains and tasks with LLaMA-Factory, and (6) an online web demo on a server. We have also rolled out a high-performing llama.cpp-omni inference framework together with a WebRTC Demo, which enables the full-duplex multimodal live streaming experience on local devices such as PCs (e.g., on a MacBook).
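
The image and video limits quoted in the list above (1.8 million pixels, 10fps) can be checked before inputs are handed to the model. The following is a minimal sketch and not part of the MiniCPM-o codebase: `fit_within_pixel_budget` and `sample_frame_indices` are hypothetical helper names, and the model's own preprocessing may already resize or subsample inputs, so this only illustrates the arithmetic.

```python
import math

MAX_PIXELS = 1_800_000  # image budget quoted in the feature list
MAX_FPS = 10            # video frame-rate ceiling quoted in the feature list


def fit_within_pixel_budget(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height) scaled down to fit max_pixels, keeping aspect ratio."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    # Floor both sides so the rounded result never exceeds the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


def sample_frame_indices(n_frames, src_fps, max_fps=MAX_FPS):
    """Pick evenly spaced frame indices so the effective rate is at most max_fps."""
    if src_fps <= max_fps:
        return list(range(n_frames))
    step = src_fps / max_fps          # e.g. 30 fps source -> keep every 3rd frame
    n_keep = int(n_frames / step)
    return [int(i * step) for i in range(n_keep)]
```

For a 4000x3000 photo, `fit_within_pixel_budget(4000, 3000)` returns a size with the same aspect ratio whose pixel count is at or below the budget; for a 30fps clip, `sample_frame_indices` keeps every third frame.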

Model Architecture.

Evaluation

Image Understanding (Instruct)

Image Understanding (Thinking)

Video Understanding

OmniDocBench

Text Capability

Omni Simplex

Vision Duplex

Audio Understanding

Speech Generation

Long Speech Generation

Emotion Control

Inference Efficiency

Examples

Examples: 🎙️ Speech

Simplex speech conversation with custom reference audio and character prompts.

Examples: Vision-Language

Usage

Note: This GPTQ model is pre-quantized to W4A16 (4-bit weights, 16-bit activations), reducing GPU memory usage from ~19GB (BF16) to ~11GB (INT4). To load it, pass torch_dtype=torch.bfloat16 and device_map="auto"; the GPTQ kernel handles the weight format of the quantized layers automatically.
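
As a rough sanity check on these figures, weight memory scales with parameter count times bits per weight. The sketch below is back-of-the-envelope arithmetic only: `quantized_fraction` is a hypothetical knob (this README does not say which layers are quantized), and real usage also includes quantization scales, activations, and the KV cache.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def mixed_precision_gb(n_params, quantized_fraction, low_bits=4, high_bits=16):
    """Weight storage when only a fraction of parameters is kept at low_bits."""
    quantized = weight_memory_gb(n_params * quantized_fraction, low_bits)
    full = weight_memory_gb(n_params * (1 - quantized_fraction), high_bits)
    return quantized + full


N_PARAMS = 9e9  # total parameter count stated in the model description

print(weight_memory_gb(N_PARAMS, 16))     # every weight in BF16
print(weight_memory_gb(N_PARAMS, 4))      # every weight in INT4
print(mixed_precision_gb(N_PARAMS, 0.7))  # hypothetical: 70% of weights quantized
```

The all-BF16 estimate (about 18 GB) is close to the ~19GB quoted. The all-INT4 estimate (about 4.5 GB) is well below the quoted ~11GB, which would be consistent with part of the model staying in 16-bit precision plus runtime overhead, though the README does not break this down.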

```bash
pip install "transformers==4.51.0" accelerate "torch>=2.3.0,<=2.8.0" "torchaudio<=2.8.0" "minicpmo-utils[all]>=1.0.2" auto-gptq
```
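
Because the install command pins specific version ranges, it can help to verify the environment before loading the model. The following is a minimal sketch using only the standard library; `version_tuple` and `check_pins` are hypothetical helpers, and the simple parser ignores pre-release and local suffixes such as `+cu121`.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Parse the leading numeric release, e.g. '2.8.0+cu121' -> (2, 8, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad to three components so '2.8' compares equal to '2.8.0'.
    return parts + (0,) * max(0, 3 - len(parts))


# (package, minimum or None, maximum or None), mirroring the pip command.
PINS = [
    ("transformers", (4, 51, 0), (4, 51, 0)),
    ("torch", (2, 3, 0), (2, 8, 0)),
    ("torchaudio", None, (2, 8, 0)),
    ("minicpmo-utils", (1, 0, 2), None),
]


def check_pins(pins=PINS, get_version=metadata.version):
    """Return a list of human-readable problems; empty means all pins hold."""
    problems = []
    for name, low, high in pins:
        try:
            installed = version_tuple(get_version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if low is not None and installed < low:
            problems.append(f"{name}: {installed} is below {low}")
        if high is not None and installed > high:
            problems.append(f"{name}: {installed} is above {high}")
    return problems


if __name__ == "__main__":
    for problem in check_pins():
        print(problem)
```

Running the script prints one line per violated pin and nothing when the environment matches.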

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-o-4_5-gptq",
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    init_vision=True,
    init_audio=True,
    init_tts=True,
)
model.eval()
```

For omni-modal inference (vision + audio), ensure init_vision=True, init_audio=True, init_tts=True. For vision-only inference, set init_audio=False and init_tts=False.
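
The two flag combinations described here can be captured in a small helper. This is a sketch under assumptions: `modality_kwargs` and the mode names `"omni"` and `"vision"` are hypothetical, and only the two combinations stated in this README are encoded.

```python
def modality_kwargs(mode="omni"):
    """Return the init_* flags to pass to from_pretrained for a usage mode."""
    presets = {
        # Omni-modal inference: all three flags enabled.
        "omni": {"init_vision": True, "init_audio": True, "init_tts": True},
        # Vision-only inference: audio and TTS flags disabled.
        "vision": {"init_vision": True, "init_audio": False, "init_tts": False},
    }
    if mode not in presets:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(presets)}")
    # Return a copy so callers can modify the result safely.
    return dict(presets[mode])
```

The returned dictionary can be unpacked into the loading call in place of the three explicit flags, for example by passing `**modality_kwargs("vision")` alongside the other `from_pretrained` arguments.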

For detailed usage (chat, streaming, full-duplex, TTS, visual understanding, etc.), see the base model README and the Cookbook.

License

Model License

The MiniCPM-o/V model weights and code are open-sourced under the Apache-2.0 license.

Statement

As an LMM, MiniCPM-o 4.5 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgements. Anything generated by MiniCPM-o 4.5 does not represent the views and positions of the model developers.
We will not be liable for any problems arising from the use of the MiniCPM-o models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.

Key Techniques and Other Multimodal Projects

👏 Welcome to explore key techniques of MiniCPM-o/V and other multimodal projects of our team:

VisCPM | RLPR | RLHF-V | LLaVA-UHD | RLAIF-V

Citation

If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!

```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
```
