Rem520/PLUME-7B API & Inference Endpoint

Highlights

Universal: a single model for text / image / video / visual-document embeddings.
Latent reasoning: fewer than 10 latent steps replace hundreds of generated chain-of-thought (CoT) tokens, giving >30× faster inference than explicit-CoT universal multimodal embedding (UME) models at comparable or better quality.
Strong retrieval: evaluated on the 78-task MMEB-v2 benchmark, it outperforms strong explicit-CoT UME baselines, especially where evidence is dense and structurally complex (video and visual-document retrieval).

Model details

Backbone: zhibinlan/UME-R1-7B (Qwen2-VL-7B, Qwen2VLForConditionalGeneration)
Parameters: ~7B, weights in half precision (4 safetensors shards, ~17 GB)
License: Apache-2.0

Usage

The weights load as a standard Qwen2-VL checkpoint:

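```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# torch_dtype="auto" keeps the dtype stored in the checkpoint (half precision);
# device_map="auto" places the weights on available devices and needs `accelerate`.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Rem520/PLUME-7B", torch_dtype="auto", device_map="auto"
)
# The processor bundles the tokenizer and the image/video preprocessor.
processor = AutoProcessor.from_pretrained("Rem520/PLUME-7B")
```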

To use the full PLUME embedding pipeline (latent rollout + semantic-anchor-guided transition adapter), follow the official code: https://github.com/haoxiangzhao12138/PLUME
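
Whichever pipeline produces the embeddings, retrieval on top of them usually comes down to ranking candidates by cosine similarity to the query. The block below is a minimal sketch of that ranking step only, not the official PLUME code: `cosine` and `rank_candidates` are hypothetical helpers written for illustration, and the 3-dimensional vectors are made-up stand-ins for real embeddings. It uses only the standard library so it runs anywhere; with real embeddings you would more likely do the same computation as a batched matrix product in PyTorch or NumPy.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_candidates(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    top_k: Optional[int] = None,
) -> List[Tuple[int, float]]:
    """Return (candidate index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy vectors standing in for real embeddings (illustrative values only).
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_candidates(query, candidates, top_k=2)
print([index for index, _ in ranking])  # prints [1, 2]
```

Cosine similarity ignores vector length, which is why the second candidate ranks first even though its norm is twice that of the query.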

Citation

```bibtex
@article{he2026plume,
  title   = {PLUME: Latent Reasoning Based Universal Multimodal Embedding},
  author  = {He, Chenwei and Hao, Xiangzhao and Yang, Tianyu and Ma, Yuxiang and
             Jia, Yuheng and Wu, Lingxiang and Zhao, Chaoyang and Guo, Haiyun and Wang, Jinqiao},
  journal = {arXiv preprint arXiv:2604.02073},
  year    = {2026}
}
```

Paper: arXiv:2604.02073
Code: github.com/haoxiangzhao12138/PLUME
