88plug

MiniCPM-o-4.5-W8A16-NeuralMax

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

At a Glance

Table
PropertyValue
Base modelopenbmb/MiniCPM-o-4.5
ArchitectureQwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS
Quant formatcompressed-tensors (native vLLM)
Quant methodAutoRound W8A16 (RTN, datafree)
Quantizedmodel.llm transformer layers
Kept BF16vision encoder, audio encoder, TTS components
Disk size~9 GB
Min GPU1× RTX 3090 24GB

Memory Requirements

Table
ConfigurationBF16W8A16
Weights~18 GB~9 GB
Min GPU1× A100 40GB1× RTX 3090 24GB

Quick Start

vLLM — text output

bash

docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/MiniCPM-o-4.5-W8A16 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.

llama.cpp — audio/vision in, text out

Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert from BF16 base.

bash

python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
--outfile MiniCPM-o-4.5-BF16.gguf
llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS
llama-server \
--model MiniCPM-o-4.5-Q8_0.gguf \
--n-gpu-layers 999 \
--ctx-size 32768 \
--port 8081

Benchmarks

Results pending.

Table
EngineFormatBatchctxtok/sTTFT p50TTFT p99VRAM
vLLM v0.21.0W8A16132k
vLLM v0.21.0W8A16832k
llama.cpp b9297Q8_0 GGUF132k
llama.cpp b9297IQ4_XS GGUF132k

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


What's Quantized, What's Not

Table
ComponentPrecisionReason
model.llm.* transformer layersW8A16 INT8Quantized
Vision encoder (SigLIP2)BF16Excluded
Audio encoder (Whisper)BF16Excluded
CosyVoice2 TTSBF16Excluded
Embeddings, LM head, normsBF16Standard practice

Quality Targets

Table
MetricTarget
KL divergence vs BF16< 0.005
MMLU recovery≥ 99.7%

Citation

bibtex

@misc{minicpmo,
title = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
author = {MiniCPM Team, OpenBMB},
year = {2025},
url = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

About

Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.

Model provider

88plug

Model tree

Base

this model

Modalities

Input

Video, Audio, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today