README
License: apache-2.0
Model Highlights
Built upon the powerful ERNIE-4.5-VL-28B-A3B architecture, the newly upgraded ERNIE-4.5-VL-28B-A3B-Thinking achieves a remarkable leap forward in multimodal reasoning capabilities. Through an extensive mid-training phase, the model absorbed a vast and highly diverse corpus of premium visual-language reasoning data. This massive-scale training process dramatically boosted the model's representation power while deepening the semantic alignment between visual and language modalities, unlocking unprecedented capabilities in nuanced visual-textual reasoning.
The model leverages cutting-edge multimodal reinforcement learning techniques on verifiable tasks, integrating GSPO and IcePop strategies to stabilize MoE training, combined with dynamic difficulty sampling for exceptional learning efficiency. Responding to strong community demand, we've significantly strengthened the model's grounding performance with improved instruction-following capabilities, making visual grounding functions more accessible than ever. Additionally, our innovative "Thinking with Images" feature, when paired with tools like image zooming and image search, dramatically elevates the model's ability to process fine-grained details and handle long-tail visual knowledge.
Together, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what's possible in visual-language understanding.

Key Capabilities
As a lightweight model that activates only 3B parameters, ERNIE-4.5-VL-28B-A3B-Thinking closely matches the performance of the industry's top flagship models across various benchmarks.
- Visual Reasoning: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks!
- STEM Reasoning: Leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos, easily handling even complex questions!
- Visual Grounding: Features more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios for a significant efficiency boost!
- Thinking with Images: The model thinks like a human, capable of freely zooming in and out of images to grasp every detail and uncover all information.
- Tool Utilization: Empowered by robust tool-calling capabilities, the model can instantly use functions like image search to easily identify long-tail knowledge and achieve comprehensive information retrieval!
- Video Understanding: The model possesses outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient!
Quickstart
Using transformers Library
Requirement: transformers <= 4.57.6
Here is an example of how to use the transformers library for inference:
python
import torch
from transformers import AutoProcessor, AutoTokenizer, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What color clothes is the girl in the picture wearing?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                }
            },
        ]
    },
]

text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

device = next(model.parameters()).device
inputs = inputs.to(device)

generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False
)
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
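The decoded text from a thinking model usually contains the reasoning trace followed by the final answer. The helper below is a minimal sketch for separating the two. It assumes the reasoning ends at a `</think>` marker and that the answer may be wrapped in `<response>` tags; those markers and the `split_reasoning` name are assumptions that this README does not document, so check them against real output before relying on them.

```python
def split_reasoning(output_text: str) -> tuple[str, str]:
    """Split decoded model output into (reasoning, answer).

    Assumption: reasoning ends at "</think>" and the answer may be wrapped
    in <response>...</response>. If no marker is found, the whole text is
    treated as the answer.
    """
    think_end = "</think>"
    if think_end not in output_text:
        return "", output_text.strip()

    reasoning, answer = output_text.split(think_end, 1)
    reasoning = reasoning.replace("<think>", "").strip()
    answer = answer.strip()

    # Drop the optional response wrapper tags (assumed format).
    if answer.startswith("<response>"):
        answer = answer[len("<response>"):]
    if answer.endswith("</response>"):
        answer = answer[: -len("</response>")]
    return reasoning, answer.strip()
```

Calling `split_reasoning(output_text)` on the string printed above would then give the trace and the answer as two separate strings, provided the assumed markers match.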
vLLM Inference
Install vLLM
bash
pip install decord
pip install uv
uv pip install vllm==0.11.2 --torch-backend=auto
Run vLLM
bash
# 80G*1 GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
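Once the server is running, it can be queried over HTTP. The sketch below builds a chat request with one image and posts it using only the standard library. It assumes vLLM's OpenAI-compatible endpoint at `http://localhost:8000/v1` (the default host and port when none is passed to `vllm serve`); the helper names `build_image_request` and `post_chat` are illustrative.

```python
import json
import urllib.request

# Assumption: default vLLM OpenAI-compatible endpoint on localhost:8000.
BASE_URL = "http://localhost:8000/v1"
MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"


def build_image_request(question: str, image_url: str, max_tokens: int = 1024) -> dict:
    """Build a chat-completions payload with one text part and one image part."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def post_chat(payload: dict, base_url: str = BASE_URL) -> dict:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server up, `post_chat(build_image_request(question, url))["choices"][0]["message"]["content"]` should hold the model's reply.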
Run vLLM using reasoning-parser and tool-call-parser
bash
# 80G*1 GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
  --reasoning-parser ernie45 \
  --tool-call-parser ernie45 \
  --enable-auto-tool-choice
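With both parsers enabled, the server is expected to return the reasoning trace and any tool calls as separate fields rather than inline text. The sketch below shows an OpenAI-style tool definition and a helper that reads those fields from one response message. The `image_search` schema is hypothetical, and the reasoning field name is an assumption: it is taken to be `reasoning_content`, with `reasoning` as a fallback for versions that name it differently.

```python
import json

# Hypothetical tool schema in the OpenAI "tools" format; the tool name and
# its parameters are illustrative, not defined by the model or by vLLM.
IMAGE_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "image_search",
        "description": "Search the web for images matching a text query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}


def parse_message(message: dict) -> dict:
    """Extract reasoning, answer, and tool calls from one response message."""
    # Assumption: the field is "reasoning_content" or "reasoning".
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append(
            {
                "name": function["name"],
                # Arguments arrive as a JSON-encoded string in this format.
                "arguments": json.loads(function["arguments"] or "{}"),
            }
        )
    return {
        "reasoning": reasoning,
        "answer": message.get("content") or "",
        "tool_calls": calls,
    }
```

To offer the tool to the model, pass `"tools": [IMAGE_SEARCH_TOOL]` in the request body, then run `parse_message` on `response["choices"][0]["message"]`.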
Run vLLM for video understanding (ensure your vLLM version includes PR#31274 for accurate timestamp rendering)
bash
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
  --reasoning-parser ernie45 \
  --media-io-kwargs '{"video": {"num_frames": 180, "fps": 2}}'
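A video request differs from an image request only in the content part. The sketch below builds such a payload, assuming the server accepts a `video_url` content part in chat messages; the question text and the `example.com` URL are placeholders, and `build_video_message` is an illustrative name.

```python
def build_video_message(question: str, video_url: str) -> dict:
    """Build one user message with a text part and a video part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # Assumption: videos are passed as a "video_url" content part.
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }


payload = {
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Thinking",
    "messages": [
        build_video_message(
            "At what time does the scene change?",
            "https://example.com/demo.mp4",  # placeholder URL
        )
    ],
}
```

The payload is sent to the same chat-completions endpoint as any other request; frame sampling is controlled on the server side by the `--media-io-kwargs` value shown above.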
FastDeploy Inference
Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.
Note: For single-card deployment, at least 48GB of GPU memory is required.
bash
fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
  --max-model-len 131072 \
  --max-num-seqs 32 \
  --port 8180 \
  --quantization wint8 \
  --reasoning-parser ernie-45-vl-thinking \
  --tool-call-parser ernie-45-vl-thinking \
  --mm-processor-kwargs '{"image_max_pixels": 12845056 }'
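Two numbers in this command are easier to interpret with a quick calculation. The sketch below is a back-of-envelope estimate, not a measurement: it takes the "28B" in the model name as the total parameter count, assumes `wint8` stores about one byte per weight, and ignores KV cache, activations, and runtime overhead.

```python
import math

TOTAL_PARAMS = 28e9  # "28B" in the model name, taken as total parameters


def weight_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB, excluding KV cache and activations."""
    return num_params * bytes_per_param / 2**30


bf16_gib = weight_gib(TOTAL_PARAMS, 2)   # 16-bit weights
wint8_gib = weight_gib(TOTAL_PARAMS, 1)  # assumption: ~1 byte per weight

IMAGE_MAX_PIXELS = 12845056
side = math.isqrt(IMAGE_MAX_PIXELS)  # side of a square image at the pixel cap

print(f"bf16 weights:  ~{bf16_gib:.1f} GiB")
print(f"wint8 weights: ~{wint8_gib:.1f} GiB")
print(f"pixel cap equals a {side} x {side} image")
```

Under these assumptions the weights alone take roughly 52 GiB in bf16 and 26 GiB in wint8, which is consistent with the 48GB single-card note for the quantized deployment. The pixel cap corresponds to a 3584 x 3584 image.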
Finetuning with ERNIEKit
ERNIEKit is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of open-source large models. It provides comprehensive support for scenarios such as instruction fine-tuning (SFT, LoRA) and alignment training (DPO), ensuring optimal performance.
Usage Examples:
bash
# Download model
huggingface-cli download baidu/ERNIE-4.5-VL-28B-A3B-Thinking --local-dir baidu/ERNIE-4.5-VL-28B-A3B-Thinking
# SFT
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft/run_sft_lora_8k.yaml
# SFT (Function Call)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft_function_call/run_sft_8k.yaml
For more detailed examples, including SFT with LoRA, multi-GPU configurations, and advanced scripts, please refer to the examples folder within the ERNIEKit repository.
License
The ERNIE 4.5 models are provided under the Apache License 2.0. This license permits commercial use, subject to its terms and conditions. Copyright (c) 2025 Baidu, Inc. All Rights Reserved.
Citation
If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:
text
@misc{ernie2025technicalreport,
      title={ERNIE 4.5 Technical Report},
      author={Baidu-ERNIE-Team},
      year={2025},
      primaryClass={cs.CL},
      howpublished={\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}}
}