README

License: apache-2.0

Model Highlights

Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities. Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data. This massive-scale training process dramatically boosted the model's representation power while deepening the semantic alignment between visual and language modalities, unlocking unprecedented capabilities in nuanced visual-textual reasoning.

The model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training, combined with dynamic difficulty sampling for exceptional learning efficiency. Responding to strong community demand, we've significantly strengthened the model's grounding performance with improved instruction-following capabilities, making visual grounding functions more accessible than ever. Additionally, our innovative "Thinking with Images" feature, when paired with tools like image zooming and image search, dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge.

Together, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what's possible in visual-language understanding.

[Figure: benchmark results]

Key Capabilities

As a lightweight model that activates only 3B parameters, ERNIE-4.5-VL-28B-A3B-Thinking closely matches the performance of the industry's top flagship models across various benchmarks.

  • Visual Reasoning: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks.
  • STEM Reasoning: Leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos, easily handling even complex questions.
  • Visual Grounding: Features more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios for a significant efficiency boost.
  • Thinking with Images: The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information.
  • Tool Utilization: Empowered by robust tool-calling capabilities, the model can instantly use functions like image search to easily identify long-tail knowledge and achieve comprehensive information retrieval.
  • Video Understanding: The model possesses outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient.

Quickstart

Hugging Face app

Using transformers Library

Requirement: transformers <= 4.57.6

Here is an example of how to use the transformers library for inference:

python

import torch
from transformers import AutoProcessor, AutoTokenizer, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'

# Load the model in bfloat16 and spread it across the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)

# A single user turn containing a text question and one image.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What color clothes is the girl in the picture wearing?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                }
            },
        ]
    },
]

# Render the chat template to a prompt string, then collect the vision inputs.
text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move the inputs to the device that holds the model's first parameters.
device = next(model.parameters()).device
inputs = inputs.to(device)

generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)

# Decode only the newly generated tokens, skipping the prompt.
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)

vLLM Inference

Install vLLM

bash

pip install decord
pip install uv
uv pip install vllm==0.11.2 --torch-backend=auto

Run vLLM

bash

# 80G*1 GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again.
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
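
Once the server is up, it can be queried through vLLM's OpenAI-compatible HTTP API. The following is a minimal sketch using only the Python standard library. It assumes the server listens on vLLM's default address `http://localhost:8000` and reuses the demo image URL from the transformers example. The helper names `build_chat_request`, `send_chat_request`, and `main` are illustrative and not part of any library.

```python
import json
import urllib.request

MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"
# Assumption: vLLM's default host and port; adjust if --host/--port were set.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(question, image_url, max_tokens=1024):
    """Build an OpenAI-style chat completion payload with one image."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def main():
    """Ask the demo question; requires a running server."""
    payload = build_chat_request(
        "What color clothes is the girl in the picture wearing?",
        "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg",
    )
    reply = send_chat_request(payload)
    print(reply["choices"][0]["message"]["content"])
```

Call `main()` after the `vllm serve` command above has finished loading the model.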

Run vLLM using reasoning-parser and tool-call-parser

bash

# 80G*1 GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again.
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
--reasoning-parser ernie45 \
--tool-call-parser ernie45 \
--enable-auto-tool-choice
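
With the parsers enabled, the server is expected to return the model's thinking separately from the final answer and to expose tool calls in the OpenAI `tool_calls` format. The sketch below shows one way a client might declare a tool and unpack such a response. Several things here are assumptions: the `image_zoom` tool and its schema are hypothetical (this README does not specify the tool definitions the model expects), and the reasoning field is read from `reasoning_content` with a fallback to `reasoning`, since the field name may differ between vLLM versions.

```python
import json

MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

# Hypothetical tool definition, for illustration only.
IMAGE_ZOOM_TOOL = {
    "type": "function",
    "function": {
        "name": "image_zoom",
        "description": "Crop and enlarge a region of the input image.",
        "parameters": {
            "type": "object",
            "properties": {
                "bbox": {
                    "type": "array",
                    "items": {"type": "number"},
                    "description": "Region as [x1, y1, x2, y2] in pixels.",
                }
            },
            "required": ["bbox"],
        },
    },
}


def build_tool_request(messages, tools, model=MODEL):
    """Build a chat completion payload that lets the model pick a tool."""
    return {
        "model": model,
        "messages": messages,
        "tools": tools,
        "tool_choice": "auto",
    }


def split_response(response):
    """Return (reasoning, answer, tool_calls) from a chat completion dict."""
    message = response["choices"][0]["message"]
    # Assumption: the field name depends on the vLLM version, so check both.
    reasoning = message.get("reasoning_content") or message.get("reasoning")
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # OpenAI-style responses carry arguments as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return reasoning, message.get("content"), calls
```

The client is responsible for executing each returned call and sending the result back as a `tool` role message in the next request.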

Run vLLM for video understanding (ensure your vLLM version includes PR#31274 for accurate timestamp rendering)

bash

vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
--reasoning-parser ernie45 \
--media-io-kwargs '{"video": {"num_frames": 180, "fps": 2}}'
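
A video can then be attached to a chat request as a `video_url` content part, which vLLM's OpenAI-compatible server accepts for multimodal models. The sketch below only builds the request body. The helper name `build_video_request` and the example URL are hypothetical, and the question text is an arbitrary illustration.

```python
MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"


def build_video_request(question, video_url, max_tokens=1024):
    """Build a chat completion payload with one video and one text question."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


# Hypothetical video location; replace with a URL the server can reach.
payload = build_video_request(
    "At what time does the scene change?",
    "https://example.com/demo.mp4",
)
```

The payload can be sent to the `/v1/chat/completions` endpoint like any other chat request.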

FastDeploy Inference

Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.

Note: For single-card deployment, at least 48GB of GPU memory is required.
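
As a rough sanity check on this figure, the weight footprint can be estimated from the parameter count and the storage width per parameter. The sketch below assumes 28B total parameters (taken from the model name) and that `wint8` stores weights at one byte each. It ignores the KV cache, activations, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Estimate weight storage in decimal gigabytes (weights only)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 28e9  # assumption: "28B" from the model name

print(weight_memory_gb(TOTAL_PARAMS, 2))  # bfloat16, 2 bytes per weight: 56.0
print(weight_memory_gb(TOTAL_PARAMS, 1))  # 8-bit weights, 1 byte each: 28.0
```

Under these assumptions, 8-bit weights leave headroom on a 48GB card for the cache and activations, while unquantized bfloat16 weights would not fit.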

bash

fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
--max-model-len 131072 \
--max-num-seqs 32 \
--port 8180 \
--quantization wint8 \
--reasoning-parser ernie-45-vl-thinking \
--tool-call-parser ernie-45-vl-thinking \
--mm-processor-kwargs '{"image_max_pixels": 12845056 }'
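
The `image_max_pixels` value above, 12845056, equals 3584 × 3584. Assuming it caps the total pixel count (width × height) of each input image, the sketch below shows how an oversized image could be scaled down to fit while preserving its aspect ratio. The function `fit_to_pixel_budget` is hypothetical and is not FastDeploy's actual preprocessing, which may also round dimensions to patch-size multiples.

```python
import math

IMAGE_MAX_PIXELS = 12845056  # value from the command above (3584 * 3584)


def fit_to_pixel_budget(width, height, max_pixels=IMAGE_MAX_PIXELS):
    """Return (width, height) scaled down, if needed, to fit max_pixels."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height  # already within budget
    # Scale both sides by the same factor so the area shrinks to the budget.
    scale = math.sqrt(max_pixels / (width * height))
    new_width = max(1, math.floor(width * scale))
    new_height = max(1, math.floor(height * scale))
    # Guard against floating-point rounding pushing the area over the budget.
    while new_width * new_height > max_pixels:
        if new_width >= new_height:
            new_width -= 1
        else:
            new_height -= 1
    return new_width, new_height


print(fit_to_pixel_budget(4000, 3000))  # within budget, returned unchanged
print(fit_to_pixel_budget(7168, 7168))  # four times the budget, halved per side
```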

Finetuning with ERNIEKit

ERNIEKit is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of open-source large models. It supports scenarios such as instruction fine-tuning (SFT, LoRA) and alignment training (DPO).

Usage Examples:

bash

# Download model
huggingface-cli download baidu/ERNIE-4.5-VL-28B-A3B-Thinking --local-dir baidu/ERNIE-4.5-VL-28B-A3B-Thinking
# SFT
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft/run_sft_lora_8k.yaml
# SFT (Function Call)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft_function_call/run_sft_8k.yaml

For more detailed examples, including SFT with LoRA, multi-GPU configurations, and advanced scripts, please refer to the examples folder within the ERNIEKit repository.

License

The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.

Citation

If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:

text

@misc{ernie2025technicalreport,
title={ERNIE 4.5 Technical Report},
author={Baidu-ERNIE-Team},
year={2025},
primaryClass={cs.CL},
howpublished={\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}}
}
