88plug
MiniCPM-o-4.5-W8A16-NeuralMax
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0At a Glance
| Property | Value |
|---|---|
| Base model | openbmb/MiniCPM-o-4.5 |
| Architecture | Qwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS |
| Quant format | compressed-tensors (native vLLM) |
| Quant method | AutoRound W8A16 (RTN, datafree) |
| Quantized | model.llm transformer layers |
| Kept BF16 | vision encoder, audio encoder, TTS components |
| Disk size | ~9 GB |
| Min GPU | 1× RTX 3090 24GB |
Memory Requirements
| Configuration | BF16 | W8A16 |
|---|---|---|
| Weights | ~18 GB | ~9 GB |
| Min GPU | 1× A100 40GB | 1× RTX 3090 24GB |
Quick Start
vLLM — text output
bash
docker run --gpus device=0 -p 8080:8080 \vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \88plug/MiniCPM-o-4.5-W8A16 \--kv-cache-dtype fp8 \--max-model-len 32768 \--gpu-memory-utilization 0.90
Weights are in compressed-tensors format — no --quantization flag needed. Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.
llama.cpp — audio/vision in, text out
Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert from BF16 base.
bash
python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \--outfile MiniCPM-o-4.5-BF16.ggufllama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q8_0.gguf Q8_0llama-quantize --imatrix calibration_datav3.txt \MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XSllama-server \--model MiniCPM-o-4.5-Q8_0.gguf \--n-gpu-layers 999 \--ctx-size 32768 \--port 8081
Benchmarks
Results pending.
| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W8A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 8 | 32k | — | — | — | — |
| llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | — | — | — | — |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
What's Quantized, What's Not
| Component | Precision | Reason |
|---|---|---|
model.llm.* transformer layers | W8A16 INT8 | Quantized |
| Vision encoder (SigLIP2) | BF16 | Excluded |
| Audio encoder (Whisper) | BF16 | Excluded |
| CosyVoice2 TTS | BF16 | Excluded |
| Embeddings, LM head, norms | BF16 | Standard practice |
Quality Targets
| Metric | Target |
|---|---|
| KL divergence vs BF16 | < 0.005 |
| MMLU recovery | ≥ 99.7% |
Citation
bibtex
@misc{minicpmo,title = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},author = {MiniCPM Team, OpenBMB},year = {2025},url = {https://huggingface.co/openbmb/MiniCPM-o-4.5}}
About
Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.
Model provider
88plug
Model tree
Base
this model
Modalities
Input
Video, Audio, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information