baidu/ERNIE-4.5-VL-28B-A3B-Thinking API & Inference Endpoint

Model Highlights

Built on the ERNIE-4.5-VL-28B-A3B architecture, the upgraded ERNIE-4.5-VL-28B-A3B-Thinking makes a substantial advance in multimodal reasoning. 🧠✨ During an extensive mid-training phase, the model was trained on a large and highly diverse corpus of high-quality visual-language reasoning data. This large-scale training strengthened the model's representation power and deepened the semantic alignment between the visual and language modalities, improving nuanced visual-textual reasoning. 📊

The model applies multimodal reinforcement learning on verifiable tasks. It integrates the GSPO and IcePop strategies to stabilize MoE training, combined with dynamic difficulty sampling to improve learning efficiency. ⚡ In response to strong community demand, we have significantly strengthened the model's grounding performance and instruction following, making visual grounding functions more accessible. 🎯 Additionally, the "Thinking with Images" feature, when paired with tools such as image zooming and image search, substantially improves the model's ability to process fine-grained details and handle long-tail visual knowledge. 🔍🖼️

Together, these enhancements form a critical foundation for developing sophisticated multimodal agents, empowering developers and researchers to create next-generation AI applications that push the boundaries of what's possible in visual-language understanding. 🤖🌟
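
For readers unfamiliar with GSPO, its central idea as described in the GSPO paper is a sequence-level, length-normalized importance ratio that is clipped per response rather than per token. The sketch below illustrates only that ratio and a PPO-style clipped objective. It is not ERNIE's training code, and the function names and the clipping width `eps` are illustrative assumptions.

```python
import math

def gspo_sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level importance ratio.

    Both arguments are per-token log-probabilities of the same sampled
    response, under the current policy and the old policy respectively.
    """
    if len(new_logprobs) != len(old_logprobs) or not new_logprobs:
        raise ValueError("log-prob lists must be non-empty and equal in length")
    # The mean token log-ratio is the log of the geometric-mean probability ratio.
    total = sum(n - o for n, o in zip(new_logprobs, old_logprobs))
    return math.exp(total / len(new_logprobs))

def clipped_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate applied to one whole response (eps is illustrative)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

Because the ratio is computed once per response, a single noisy token cannot dominate the update, which is the property usually cited as helpful for MoE stability.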

[Figure: benchmark results]

Key Capabilities

As a lightweight model that activates only 3B parameters ⚡, ERNIE-4.5-VL-28B-A3B-Thinking closely matches the performance of the industry's top flagship models across various benchmarks. 🚀

Visual Reasoning 🧠👁️: Bolstered by large-scale reinforcement learning, the model demonstrates exceptional multi-step reasoning, chart analysis, and causal reasoning capabilities in complex visual tasks! 📊✨
STEM Reasoning 🔬📐: Leveraging its powerful visual abilities, the model achieves a leap in performance on STEM tasks like solving problems from photos, easily handling even complex questions! 🎯💡
Visual Grounding 📍🎨: Features more precise grounding and flexible instruction execution, easily triggering grounding functions in complex industrial scenarios for a significant efficiency boost! ⚙️💪
Thinking with Images 🤔🔍: Like a human viewer, the model can freely zoom in and out of images to capture fine details and uncover information that is easy to miss. 🖼️✨
Tool Utilization 🛠️⚡: Empowered by robust tool-calling capabilities, the model can instantly use functions like image search to easily identify long-tail knowledge and achieve comprehensive information retrieval! 🔎📚
Video Understanding 🎬🎥: The model possesses outstanding temporal awareness and event localization abilities, accurately identifying content changes across different time segments in a video, making video analysis smarter and more efficient! ⏱️🌟
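
As a rough illustration of what "28B total, 3B active" implies, the following back-of-the-envelope sketch estimates the active fraction per token and the weight storage at two precisions. It counts weights only (no KV cache, activations, or runtime overhead) and uses the nominal 28B and 3B figures from the model name, so the results are approximations.

```python
TOTAL_PARAMS = 28e9   # nominal total parameter count ("28B")
ACTIVE_PARAMS = 3e9   # nominal parameters activated per token ("A3B")

BYTES_PER_PARAM = {"bf16": 2, "int8": 1}

def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"active fraction per token: {active_fraction:.1%}")  # 10.7%
for dtype in BYTES_PER_PARAM:
    # bf16: ~56 GB, int8: ~28 GB
    print(f"{dtype}: ~{weight_memory_gb(TOTAL_PARAMS, dtype):.0f} GB of weights")
```

All expert weights still need to be resident in memory; only the per-token compute scales with the active parameters.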

Quickstart

Hugging Face 🤗 app

Using `transformers` Library

Requirement: transformers <= 4.57.6

Here is an example of how to use the transformers library for inference:

python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_path = 'baidu/ERNIE-4.5-VL-28B-A3B-Thinking'
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model.add_image_preprocess(processor)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What color clothes is the girl in the picture wearing?"
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg"
                }
            },
        ]
    },
]

text = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)

# Move inputs to the device that holds the model's first parameters.
device = next(model.parameters()).device
inputs = inputs.to(device)

generated_ids = model.generate(
    inputs=inputs['input_ids'].to(device),
    **inputs,
    max_new_tokens=1024,
    use_cache=False,
)
# Decode only the newly generated tokens, skipping the prompt.
output_text = processor.decode(generated_ids[0][len(inputs['input_ids'][0]):])
print(output_text)
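
Because this is a thinking model, the decoded text typically contains the reasoning trace followed by the final answer. The helper below is a minimal sketch for separating the two. It assumes the reasoning ends with a `</think>` marker, which should be checked against the model's actual output; `split_reasoning` is a hypothetical name, not part of the model's API.

```python
def split_reasoning(output_text, end_tag="</think>"):
    """Split decoded text into (reasoning, answer).

    If the end tag is absent, the whole text is treated as the answer.
    """
    head, sep, tail = output_text.partition(end_tag)
    if not sep:
        return "", output_text.strip()
    # Drop an opening tag if the model emitted one (assumed to be "<think>").
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()
```

For example, `reasoning, answer = split_reasoning(output_text)` could follow the `print` call above.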

vLLM Inference

Install vLLM

bash
pip install decord
pip install uv
uv pip install vllm==0.11.2 --torch-backend=auto

Run vLLM

bash
# Single 80 GB GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again.
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code
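
Once the server is running, it can be queried through vLLM's OpenAI-compatible API. The sketch below uses only the Python standard library. It assumes the default address `http://localhost:8000` and reuses the demo image URL from the `transformers` example; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

def build_chat_request(question, image_url, max_tokens=1024):
    """Build an OpenAI-style chat completion payload with one image."""
    return {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": max_tokens,
    }

def chat(payload, base_url=BASE_URL, timeout=300):
    """POST the payload to /chat/completions and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)

# Usage (requires a running server):
# payload = build_chat_request(
#     "What color clothes is the girl in the picture wearing?",
#     "https://paddlenlp.bj.bcebos.com/datasets/paddlemix/demo_images/example1.jpg",
# )
# print(chat(payload)["choices"][0]["message"]["content"])
```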

Run vLLM using reasoning-parser and tool-call-parser

bash
# Single 80 GB GPU. If an error occurs, add --gpu-memory-utilization 0.95 and try again.
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
 --reasoning-parser ernie45  \
 --tool-call-parser ernie45  \
 --enable-auto-tool-choice
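
With both parsers enabled, the server is expected to return the reasoning trace and any tool calls in separate fields of the assistant message. The sketch below shows an OpenAI-style tool definition and a parser for one response message. The `image_search` tool and its schema are hypothetical, and the reasoning field name (`reasoning_content`, or `reasoning` in some vLLM versions) is an assumption to verify against the installed version.

```python
import json

# Hypothetical tool schema, for illustration only. It would be sent as
# "tools": [IMAGE_SEARCH_TOOL] in the chat completion request body.
IMAGE_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "image_search",
        "description": "Search the web for images similar to the given image.",
        "parameters": {
            "type": "object",
            "properties": {"image_url": {"type": "string"}},
            "required": ["image_url"],
        },
    },
}

def parse_assistant_message(message):
    """Return (reasoning, answer, tool_calls) from one assistant message dict."""
    # Assumed field names; fall back to an empty string if neither is present.
    reasoning = message.get("reasoning_content") or message.get("reasoning") or ""
    answer = message.get("content") or ""
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # In the OpenAI format, arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return reasoning, answer, calls
```

The input to `parse_assistant_message` would be `response["choices"][0]["message"]` from a chat completion response.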

Run vLLM for video understanding (ensure your vLLM version includes PR#31274 for accurate timestamp rendering)

bash
vllm serve baidu/ERNIE-4.5-VL-28B-A3B-Thinking --trust-remote-code \
  --reasoning-parser ernie45 \
  --media-io-kwargs '{"video": {"num_frames": 180, "fps": 2}}'
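
A video request differs from an image request only in the content part. The sketch below builds one user message using the `video_url` content type, which is a vLLM extension to the OpenAI chat format; the helper name and the example URL are placeholders.

```python
def build_video_message(question, video_url):
    """Build one user message that pairs a text question with a video."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }

# Placeholder URL; replace with a video the server can reach.
message = build_video_message(
    "At what time does the scene change?",
    "https://example.com/demo.mp4",
)
```

The resulting dictionary goes into the `messages` list of a chat completion request.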

FastDeploy Inference

Quickly deploy services using FastDeploy as shown below. For more detailed usage, refer to the FastDeploy GitHub Repository.

Note: For single-card deployment, at least 48GB of GPU memory is required.

bash
fastdeploy serve --model baidu/ERNIE-4.5-VL-28B-A3B-Thinking \
  --max-model-len 131072 \
  --max-num-seqs 32 \
  --port 8180 \
  --quantization wint8 \
  --reasoning-parser ernie-45-vl-thinking \
  --tool-call-parser ernie-45-vl-thinking \
  --mm-processor-kwargs '{"image_max_pixels": 12845056 }'
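
To give a sense of scale for `image_max_pixels`, the value 12845056 equals 3584 × 3584. The sketch below shows the arithmetic of fitting an image into that pixel budget while preserving its aspect ratio. It is not FastDeploy's preprocessing code, which may additionally round dimensions to patch-size multiples; `fit_to_pixel_budget` is a hypothetical helper.

```python
import math

IMAGE_MAX_PIXELS = 12845056  # value passed via --mm-processor-kwargs above

def fit_to_pixel_budget(width, height, max_pixels=IMAGE_MAX_PIXELS):
    """Scale (width, height) down, preserving aspect ratio, to fit max_pixels."""
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))

print(math.isqrt(IMAGE_MAX_PIXELS))      # 3584, the side of the largest square image
print(fit_to_pixel_budget(8000, 6000))   # a 48-megapixel photo scaled to fit the budget
```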

Finetuning with ERNIEKit

ERNIEKit is a training toolkit based on PaddlePaddle, specifically designed for the ERNIE series of open-source large models. It provides comprehensive support for scenarios such as instruction fine-tuning (SFT, LoRA) and alignment training (DPO), ensuring optimal performance.

Usage Examples:

bash
# Download model
huggingface-cli download baidu/ERNIE-4.5-VL-28B-A3B-Thinking --local-dir baidu/ERNIE-4.5-VL-28B-A3B-Thinking
# SFT
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft/run_sft_lora_8k.yaml
# SFT (Function Call)
erniekit train examples/configs/ERNIE-4.5-VL-28B-A3B-Thinking/sft_function_call/run_sft_8k.yaml
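
The download step can also be done from Python. The sketch below uses `snapshot_download` from the `huggingface_hub` package as an equivalent of the CLI command; the wrapper name `download_model` is hypothetical, and the package must be installed separately.

```python
MODEL_ID = "baidu/ERNIE-4.5-VL-28B-A3B-Thinking"

def download_model(repo_id=MODEL_ID, local_dir=MODEL_ID):
    """Download all files of the model repo into local_dir and return the path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Usage (downloads the full checkpoint, which is large):
# path = download_model()
```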

For more detailed examples, including SFT with LoRA, multi-GPU configurations, and advanced scripts, please refer to the examples folder within the ERNIEKit repository.

License

Citation

If you find ERNIE 4.5 useful or wish to use it in your projects, please kindly cite our technical report:

text
@misc{ernie2025technicalreport,
      title={ERNIE 4.5 Technical Report},
      author={Baidu-ERNIE-Team},
      year={2025},
      primaryClass={cs.CL},
      howpublished={\url{https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf}}
}
