baidu/Qianfan-VL-3B API & Inference Endpoint

🔗 Quick Links

Repository: 💻 GitHub
Models: 🤗 Hugging Face | 🤖 ModelScope
Documentation: 📚 Cookbook | 📝 Technical Report
Blogs: 🇨🇳 Chinese Blog | 🇬🇧 English Blog

Model Description

Qianfan-VL is a series of general-purpose multimodal large language models enhanced for enterprise-level multimodal applications. The models offer deep optimization for high-frequency scenarios in industrial deployment while maintaining strong general capabilities.

Model Variants

Model	Parameters	Context Length	CoT Support	Best For
Qianfan-VL-3B	3B	32k	❌	Edge deployment, real-time OCR
Qianfan-VL-8B	8B	32k	✅	Server-side general scenarios, fine-tuning
Qianfan-VL-70B	70B	32k	✅	Complex reasoning, data synthesis
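
As a rough aid for choosing among the variants above, the sketch below estimates weight memory at 2 bytes per parameter (bfloat16, the dtype used in the Transformers example in this document) and filters by CoT support. The parameter counts are the nominal values from the table. The helper names `bf16_weight_gb` and `candidates` are hypothetical, and the estimate is a lower bound that ignores activations, the KV cache, and framework overhead.

```python
# Nominal values copied from the Model Variants table.
VARIANTS = {
    "Qianfan-VL-3B": {"params_b": 3, "cot": False},
    "Qianfan-VL-8B": {"params_b": 8, "cot": True},
    "Qianfan-VL-70B": {"params_b": 70, "cot": True},
}


def bf16_weight_gb(params_b: float) -> float:
    """Approximate weight memory in GB: 1e9 params * 2 bytes = 2 GB per billion."""
    return params_b * 2.0


def candidates(need_cot: bool, budget_gb: float) -> list:
    """Return variants whose bf16 weights fit the budget and meet the CoT need."""
    return [
        name
        for name, spec in VARIANTS.items()
        if (spec["cot"] or not need_cot)
        and bf16_weight_gb(spec["params_b"]) <= budget_gb
    ]


for name, spec in VARIANTS.items():
    print(f"{name}: ~{bf16_weight_gb(spec['params_b']):.0f} GB of bf16 weights")
```

For example, `candidates(need_cot=True, budget_gb=24)` keeps only the 8B variant under these assumptions.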

Architecture

Language Model:
- Qianfan-VL-3B: Based on Qwen2.5-3B
- Qianfan-VL-8B/70B: Based on Llama 3.1 architecture
- Enhanced with 3T multilingual corpus
Vision Encoder: InternViT-based, supports dynamic patching up to 4K resolution
Cross-modal Fusion: MLP adapter for efficient vision-language bridging

Key Capabilities

🔍 OCR & Document Understanding

Full-Scenario OCR: Handwriting, formulas, natural scenes, cards/documents
Document Intelligence: Layout analysis, table parsing, chart understanding, document Q&A
High Precision: Industry-leading performance on OCR benchmarks

🧮 Chain-of-Thought Reasoning (8B & 70B)

Complex chart analysis and reasoning
Mathematical problem-solving with step-by-step derivation
Visual reasoning and logical inference
Statistical computation and trend prediction

📊 Benchmark Performance

General Vision-Language Benchmarks

Benchmark	Qianfan-VL-3B	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
A-Bench_VAL	75.65	75.72	78.1	75.86	75.86	76.49	79.22
CCBench	66.86	70.39

OCR & Document Understanding

Benchmark	Qianfan-VL-3B	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-3B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
OCRBench	831	854	873	881	847	810	883	874
AI2D_TEST

Mathematical Reasoning

Benchmark	Qianfan-VL-8B	Qianfan-VL-70B	InternVL-3-8B	InternVL-3-78B	Qwen2.5-VL-7B	Qwen2.5-VL-72B
Mathvista-mini	69.19	78.6	69.5	70.1	67.2	73.9
Mathvision	32.82	50.29	29.61	34.8	25.95

Quick Start

Installation

bash
pip install transformers accelerate torch torchvision pillow einops

Using Transformers

python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate (columns, rows) tile grids containing min_num..max_num tiles
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # pick the grid whose aspect ratio is closest to the image's aspect ratio
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16)

# Inference
prompt = "<image>请识别图中所有文字"  # "Please recognize all the text in the image"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
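
To make the tiling in `dynamic_preprocess` easier to follow, the sketch below reproduces only its grid-selection logic in pure Python (no PIL or torch) and reports how many 448x448 tensors `load_image` would stack for a given image size. The names `pick_tile_grid` and `describe` are hypothetical helpers introduced here for illustration; the selection rules are copied from the functions above.

```python
def pick_tile_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (columns, rows) grid chosen by the selection logic above."""
    grids = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if min_num <= i * j <= max_num)
    grids = sorted(grids, key=lambda g: g[0] * g[1])

    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff:
            # On a tie, prefer the larger grid only if the image is big enough.
            if width * height > 0.5 * image_size * image_size * cols * rows:
                best = (cols, rows)
    return best


def describe(width, height, image_size=448):
    """Summarize the grid, the resize target, and the stacked tensor count."""
    cols, rows = pick_tile_grid(width, height, image_size=image_size)
    tiles = cols * rows
    # load_image passes use_thumbnail=True, which adds one tile when tiles > 1.
    total = tiles + (1 if tiles > 1 else 0)
    return {
        "grid": (cols, rows),
        "resized_to": (cols * image_size, rows * image_size),
        "tensors": total,
    }


for size in [(448, 448), (1920, 1080), (800, 600)]:
    print(size, describe(*size))
```

Note that the grid depends on aspect ratio first: an 800x600 image matches 4:3 exactly, so it is upscaled to 1792x1344 and split into 12 tiles plus a thumbnail, while a 1920x1080 image maps to a 4x2 grid. Lowering `max_num` in `load_image` reduces the tile count and therefore the number of visual inputs.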

Using vLLM

You can deploy Qianfan-VL using vLLM's official Docker image for high-performance inference with an OpenAI-compatible API:

Start vLLM Service

bash
docker run -d --name qianfan-vl \
  --gpus all \
  -v /path/to/Qianfan-VL-8B:/model \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model /model \
  --served-model-name qianfan-vl \
  --trust-remote-code \
  --hf-overrides '{"architectures":["InternVLChatModel"],"model_type":"internvl_chat"}'
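
Loading the weights can take a while after the container starts, so requests sent immediately may be refused. A minimal sketch of a readiness check follows, assuming the server exposes the OpenAI-compatible `/v1/models` route on the port mapped above. The function name `wait_until_ready` and its timeout defaults are hypothetical choices, not part of vLLM.

```python
import time
import urllib.request


def wait_until_ready(url="http://127.0.0.1:8000/v1/models",
                     timeout_s=300.0, interval_s=2.0,
                     opener=urllib.request.urlopen):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    `opener` is injectable so the loop can be exercised without a server.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # urllib.error.URLError and HTTPError subclass OSError:
            # the server is not accepting requests yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` before the first request returns `True` once the endpoint responds and `False` if the timeout elapses first.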

Call the API

bash
curl 'http://127.0.0.1:8000/v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "qianfan-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"
            }
          },
          {
            "type": "text",
            "text": "<image>请识别图中所有文字"
          }
        ]
      }
    ]
  }'

Or use the OpenAI Python SDK:

python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:8000/v1"
)

response = client.chat.completions.create(
    model="qianfan-vl",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianfan-public-demo.bj.bcebos.com/qianfan-vl/2509/images/scene_ocr.png"}
                },
                {
                    "type": "text",
                    "text": "<image>请描述这张图片"  # "Please describe this image"
                }
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
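
The requests above reference an image by public URL. For a local file, one common approach with OpenAI-compatible servers is to send the image inline as a base64 data URL in the same `image_url` field. The sketch below assumes the server accepts data URLs; `image_to_data_url` and `build_messages` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_messages(image_path: str, question: str) -> list:
    """Build a `messages` list in the same shape as the requests above."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_to_data_url(image_path)},
                },
                # The "<image>" prefix follows the prompts used in this document.
                {"type": "text", "text": "<image>" + question},
            ],
        }
    ]
```

The result can be passed as `messages=build_messages("./example/scene_ocr.png", "...")` to `client.chat.completions.create` in place of the URL-based message.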

Training Details

Four-Stage Progressive Training

Cross-modal Alignment (100B tokens): Establishes vision-language connections
General Knowledge Injection (3.5T tokens): Builds strong foundational capabilities
Domain Enhancement (300B tokens): Specialized OCR and reasoning capabilities
Post-training (1B tokens): Instruction following and preference alignment
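
To put the four token budgets in proportion, the short calculation below sums them and prints each stage's share. The figures are the ones listed above, expressed in billions of tokens; nothing else is assumed.

```python
# Token budgets from the four stages above, in billions of tokens.
STAGES_B = [
    ("Cross-modal Alignment", 100),
    ("General Knowledge Injection", 3500),
    ("Domain Enhancement", 300),
    ("Post-training", 1),
]

total_b = sum(tokens for _, tokens in STAGES_B)

for name, tokens in STAGES_B:
    share = 100 * tokens / total_b
    print(f"{name:<28} {tokens:>5}B  {share:6.2f}%")

print(f"Total: {total_b}B tokens ({total_b / 1000:.3f}T)")
```

The total comes to 3,901B tokens, with the general knowledge injection stage accounting for roughly nine tenths of it.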

Infrastructure

Trained on 5000+ Baidu Kunlun chips
Single-task parallel training across 5000 chips
90%+ scaling efficiency for large-scale distributed training
Innovative communication-computation fusion technology

Model Card

Developed by: Baidu AI Cloud Qianfan Team
Model type: Vision-Language Transformer
Languages: Multilingual support
License: [Please check model card for specific license]
Base Architecture: See the technical report

Citation

If you use Qianfan-VL in your research, please cite:

bibtex
@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}

Contact

For more information and API access, visit: Baidu Qianfan Platform

Acknowledgments

This model series represents a significant advancement in multimodal AI, combining general capabilities with domain-specific enhancements for real-world applications.

