88plug

MiniCPM-o-4.5-W8A16

README

License: apache-2.0

At a Glance

Table with columns: Property, Value
Property	Value
Base model	`openbmb/MiniCPM-o-4.5`
Release tier	Provisional (datafree RTN — re-quant scheduled)
Quant method	datafree RTN W8A16 (weight-only INT8)
FLAC status	Not measured (T+7d milestone)
Architecture	Qwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS
Quant format	compressed-tensors (native vLLM)
Quantized	`model.llm` transformer layers
Kept BF16	vision encoder, audio encoder, TTS components
Disk size	~9 GB
Min GPU	1× RTX 3090 24GB

Memory Requirements

Table with columns: Configuration, BF16, W8A16
Configuration	BF16	W8A16
Weights	~18 GB	~9 GB
Min GPU	1× A100 40GB	1× RTX 3090 24GB

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM — text output

bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/MiniCPM-o-4.5-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0. Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.

llama.cpp — audio/vision in, text out

Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert from BF16 base.

bash
python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
  --outfile MiniCPM-o-4.5-BF16.gguf

llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS

llama-server \
  --model MiniCPM-o-4.5-Q8_0.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Table with columns: Metric, Status
Metric	Status
Throughput (tok/s)	In progress — T+7d milestone
MMLU delta vs BF16	In progress — T+7d milestone
RULER@128k	In progress — T+30d milestone

No fabricated numbers. Results will be published to this card when measured.

What's Quantized, What's Not

Table with columns: Component, Precision, Reason
Component	Precision	Reason
`model.llm.*` transformer layers	W8A16 INT8	Quantized
Vision encoder (SigLIP2)	BF16	Excluded
Audio encoder (Whisper)	BF16	Excluded
CosyVoice2 TTS	BF16	Excluded
Embeddings, LM head, norms	BF16	Standard practice

Quality Targets

Table with columns: Metric, Target
Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%

vs. Other MiniCPM-o-4.5 Quants

This is the first compressed-tensors W8A16 checkpoint for MiniCPM-o-4.5. It halves VRAM usage while retaining native vLLM serving with audio and vision input.

Table with columns: Quant, Method, Size, GPU Compatibility, Notes
Quant	Method	Size	GPU Compatibility	Notes
88plug W8A16 (this)	compressed-tensors RTN W8A16	~9 GB	Any Ampere+ ≥16 GB	First W8A16; native vLLM; LLM backbone quantized
Community GGUF Q4_K_M	llama.cpp GGUF	~5 GB	CPU / any GPU	Vision via mmproj; no CosyVoice2 in mainline
Community GGUF Q8_0	llama.cpp GGUF	~9 GB	Any GPU ≥10 GB

Limitations

LLM backbone only: Only model.llm transformer layers are quantized. Vision encoder (SigLIP2), audio encoder (Whisper), and CosyVoice2 TTS components stay BF16.
No CosyVoice2 in mainline vLLM: Speech output is not supported by mainline vLLM. Use the tc-mb/llama.cpp-omni fork for speech synthesis.
RTN (data-free) quantization: No calibration corpus used for the LLM backbone. Near-lossless at W8A16 but not AutoRound-calibrated.
Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

bibtex
@misc{minicpmo,
  title  = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
  author = {MiniCPM Team, OpenBMB},
  year   = {2025},
  url    = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

About

88plug AI Lab ships compressed-tensors quantizations for native vLLM v0.21.0+ deployment.

This release: Provisional tier — datafree RTN (weight-only rounding, no calibration corpus). A gold AutoRound re-quant is scheduled; 88plug architecture forbids new provisional W4A16 uploads.

Browse all releases → huggingface.co/88plug

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

88plug

Model Tree

Base

this model

Input Modalities

TextAudioImageVideo

Output Modalities

Text

Supported Functionality