88plug

MiniCPM-o-4.5-W8A16

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

At a Glance

Table
PropertyValue
Base modelopenbmb/MiniCPM-o-4.5
ArchitectureQwen3-8B LLM + SigLIP2 vision + Whisper audio + CosyVoice2 TTS
Quant formatcompressed-tensors (native vLLM)
Quant methodAutoRound W8A16 (RTN, datafree)
Quantizedmodel.llm transformer layers
Kept BF16vision encoder, audio encoder, TTS components
Disk size~9 GB
Min GPU1× RTX 3090 24GB

Memory Requirements

Table
ConfigurationBF16W8A16
Weights~18 GB~9 GB
Min GPU1× A100 40GB1× RTX 3090 24GB

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM — text output

bash

docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/MiniCPM-o-4.5-W8A16 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0. Mainline vLLM returns text only; CosyVoice2 TTS output is not supported.

llama.cpp — audio/vision in, text out

Mainline llama.cpp supports MiniCPM-V (vision + text). For full CosyVoice2 speech output, use the tc-mb/llama.cpp-omni fork. Convert from BF16 base.

bash

python convert_hf_to_gguf.py openbmb/MiniCPM-o-4.5 \
--outfile MiniCPM-o-4.5-BF16.gguf
llama-quantize MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
MiniCPM-o-4.5-BF16.gguf MiniCPM-o-4.5-IQ4_XS.gguf IQ4_XS
llama-server \
--model MiniCPM-o-4.5-Q8_0.gguf \
--n-gpu-layers 999 \
--ctx-size 32768 \
--port 8081

Benchmarks

Results pending.

Table
EngineFormatBatchctxtok/sTTFT p50TTFT p99VRAM
vLLM v0.21.0W8A16132k
vLLM v0.21.0W8A16832k
llama.cpp b9297Q8_0 GGUF132k
llama.cpp b9297IQ4_XS GGUF132k

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


What's Quantized, What's Not

Table
ComponentPrecisionReason
model.llm.* transformer layersW8A16 INT8Quantized
Vision encoder (SigLIP2)BF16Excluded
Audio encoder (Whisper)BF16Excluded
CosyVoice2 TTSBF16Excluded
Embeddings, LM head, normsBF16Standard practice

Quality Targets

Table
MetricTarget
KL divergence vs BF16< 0.005
MMLU recovery≥ 99.7%

vs. Other MiniCPM-o-4.5 Quants

This is the first compressed-tensors W8A16 checkpoint for MiniCPM-o-4.5. It halves VRAM usage while retaining native vLLM serving with audio and vision input.

Table
QuantMethodSizeGPU CompatibilityNotes
88plug W8A16 (this)compressed-tensors RTN W8A16~9 GBAny Ampere+ ≥16 GBFirst W8A16; native vLLM; LLM backbone quantized
Community GGUF Q4_K_Mllama.cpp GGUF~5 GBCPU / any GPUVision via mmproj; no CosyVoice2 in mainline
Community GGUF Q8_0llama.cpp GGUF~9 GBAny GPU ≥10 GBNear-lossless; same TTS limitation
BF16 baselineNone~18 GB1× A100 40GBReference; requires high-VRAM GPU

Limitations

  • LLM backbone only: Only model.llm transformer layers are quantized. Vision encoder (SigLIP2), audio encoder (Whisper), and CosyVoice2 TTS components stay BF16.
  • No CosyVoice2 in mainline vLLM: Speech output is not supported by mainline vLLM. Use the tc-mb/llama.cpp-omni fork for speech synthesis.
  • RTN (data-free) quantization: No calibration corpus used for the LLM backbone. Near-lossless at W8A16 but not AutoRound-calibrated.
  • Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

bibtex

@misc{minicpmo,
title = {MiniCPM-o: A GPT-4o Level Multimodal LLM on Your Phone},
author = {MiniCPM Team, OpenBMB},
year = {2025},
url = {https://huggingface.co/openbmb/MiniCPM-o-4.5}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: MiniCPM-o-4.5-W4A16 (INT4, ~4–5 GB) · MiniCPM-o-4.5-W8A16 (INT8, ~9 GB)

Browse all releases → huggingface.co/88plug

Model provider

88plug

Model tree

Base

this model

Modalities

Input

Video, Audio, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today