MiniCPM-o-4.5-W8A16-NeuralMax API & Inference Endpoint

At a Glance

Table with columns: Property, Value
Property	Value
Base model	`openbmb/MiniCPM-o-4.5`
Architecture	Qwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS
Quant format	compressed-tensors (native vLLM)
Quant method	AutoRound W8A16 (RTN, datafree)
Quantized	`model.llm` transformer layers
Kept BF16	vision encoder, audio encoder, TTS components
Disk size	~9 GB
Min GPU	1× RTX 3090 24GB

Memory Requirements

Table with columns: Configuration, BF16, W8A16
Configuration	BF16	W8A16
Weights	~18 GB	~9 GB
Min GPU	1× A100 40GB	1× RTX 3090 24GB

Quick Start

vLLM — text output

bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/MiniCPM-o-4.5-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.

llama.cpp — audio/vision in, text out

Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert from BF16 base.

bash
python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
  --outfile MiniCPM-o-4.5-BF16.gguf

llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS

llama-server \
  --model MiniCPM-o-4.5-Q8_0.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Results pending.

Table with columns: Engine, Format, Batch, ctx, tok/s, TTFT p50, TTFT p99, VRAM
Engine	Format	Batch	ctx	tok/s	TTFT p50	TTFT p99	VRAM
vLLM v0.21.0	W8A16	1	32k	—	—	—	—
vLLM v0.21.0	W8A16	8	32k	—

Hardware: A6000 48 GB, CUDA 12.9, driver 570.

What's Quantized, What's Not

Table with columns: Component, Precision, Reason
Component	Precision	Reason
`model.llm.*` transformer layers	W8A16 INT8	Quantized
Vision encoder (SigLIP2)	BF16	Excluded
Audio encoder (Whisper)	BF16	Excluded
CosyVoice2 TTS	BF16	Excluded
Embeddings, LM head, norms	BF16	Standard practice

Quality Targets

Table with columns: Metric, Target
Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%

Citation

bibtex
@misc{minicpmo,
  title  = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
  author = {MiniCPM Team, OpenBMB},
  year   = {2025},
  url    = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

About

Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.

Property

Value

Base model

openbmb/MiniCPM-o-4.5

Architecture

Qwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS

Quant format

compressed-tensors (native vLLM)

Quant method

AutoRound W8A16 (RTN, datafree)

Quantized

model.llm transformer layers

Kept BF16

vision encoder, audio encoder, TTS components

Disk size

~9 GB

Min GPU

1× RTX 3090 24GB

Configuration

BF16

W8A16

Weights

~18 GB

~9 GB

Min GPU

1× A100 40GB

1× RTX 3090 24GB

bash

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/MiniCPM-o-4.5-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

bash

python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
  --outfile MiniCPM-o-4.5-BF16.gguf

llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS

llama-server \
  --model MiniCPM-o-4.5-Q8_0.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Engine

Format

Batch

ctx

tok/s

TTFT p50

TTFT p99

VRAM

vLLM v0.21.0

W8A16

32k

—

vLLM v0.21.0

W8A16

32k

—

Component

Precision

Reason

model.llm.* transformer layers

W8A16 INT8

Quantized

Vision encoder (SigLIP2)

BF16

Excluded

Audio encoder (Whisper)

BF16

Excluded

CosyVoice2 TTS

BF16

Excluded

Embeddings, LM head, norms

BF16

Standard practice

Metric

Target

KL divergence vs BF16

< 0.005

MMLU recovery

≥ 99.7%

bibtex

@misc{minicpmo,
  title  = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
  author = {MiniCPM Team, OpenBMB},
  year   = {2025},
  url    = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

MiniCPM-o-4.5-W8A16-NeuralMax

README

At a Glance

Memory Requirements

Quick Start

vLLM — text output

llama.cpp — audio/vision in, text out

Benchmarks

What's Quantized, What's Not

Quality Targets

Citation

About

Explore FriendliAI today

README

At a Glance

Memory Requirements

Quick Start

vLLM — text output

llama.cpp — audio/vision in, text out

Benchmarks

What's Quantized, What's Not

Quality Targets

Citation

About