README

License: other

Model Description

Qianfan-VL is a series of general-purpose multimodal large language models built for enterprise-level multimodal applications. The models are deeply optimized for the scenarios that occur most often in industrial deployments while retaining strong general capabilities.

Model Variants

| Model | Parameters | Context Length | CoT Support | Best For |
|---|---|---|---|---|
| Qianfan-VL-3B | 3B | 32k | No | Edge deployment, real-time OCR |
| Qianfan-VL-8B | 8B | 32k | Yes | Server-side general scenarios, fine-tuning |
| Qianfan-VL-70B | 70B | 32k | Yes | Complex reasoning, data synthesis |

Architecture

  • Language Model:
    • Qianfan-VL-3B: Based on Qwen2.5-3B
    • Qianfan-VL-8B/70B: Based on the Llama 3.1 architecture
    • Enhanced with a 3T multilingual corpus
  • Vision Encoder: InternViT-based, supports dynamic patching up to 4K resolution
  • Cross-modal Fusion: MLP adapter for efficient vision-language bridging

Key Capabilities

🔍 OCR & Document Understanding

  • Full-Scenario OCR: Handwriting, formulas, natural scenes, cards/documents
  • Document Intelligence: Layout analysis, table parsing, chart understanding, document Q&A
  • High Precision: Strong results on OCR benchmarks; Qianfan-VL-70B scores highest among the compared models on AI2D_TEST, OCRVQA_TEST, and ChartQA_TEST

🧮 Chain-of-Thought Reasoning (8B & 70B)

  • Complex chart analysis and reasoning
  • Mathematical problem-solving with step-by-step derivation
  • Visual reasoning and logical inference
  • Statistical computation and trend prediction

📊 Benchmark Performance

General Vision-Language Benchmarks

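| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|
| A-Bench_VAL | 75.65 | 75.72 | 78.1 | 75.86 | 75.86 | 76.49 | 79.22 |
| CCBench | 66.86 | 70.39 | 80.98 | 77.84 | 70.78 | 57.65 | 73.73 |
| SEEDBench_IMG | 76.55 | 78.02 | 79.13 | 77.0 | 77.52 | 76.98 | 78.34 |
| SEEDBench2_Plus | 67.59 | 70.97 | 73.17 | 69.52 | 68.47 | 70.93 | 73.25 |
| MMVet | 48.17 | 53.21 | 67.34 | 80.28 | 78.9 | 70.64 | 75.69 |
| MMMU_VAL | 46.44 | 47.11 | 58.33 | 56.11 | 60.78 | 51.0 | 65.78 |
| ScienceQA_TEST | 95.19 | 97.62 | 98.76 | 97.97 | 97.17 | 85.47 | 92.51 |
| ScienceQA_VAL | 93.85 | 97.62 | 98.81 | 97.81 | 95.14 | 83.59 | 91.32 |
| MMT-Bench_VAL | 62.23 | 63.22 | 71.06 | 65.17 | 63.67 | 61.4 | 69.49 |
| MTVQA_TEST | 26.5 | 30.14 | 32.18 | 30.3 | 27.62 | 29.08 | 31.48 |
| BLINK | 49.97 | 56.81 | 59.44 | 55.87 | 51.87 | 54.55 | 63.02 |
| MMStar | 57.93 | 64.07 | 69.47 | 68.4 | 66.07 | 61.53 | 66.0 |
| RealWorldQA | 65.75 | 70.59 | 71.63 | 71.11 | 74.25 | 69.28 | 73.86 |
| Q-Bench1_VAL | 73.51 | 75.25 | 77.46 | 75.99 | 77.99 | 78.1 | 79.93 |
| POPE | 85.08 | 86.06 | 88.97 | 90.59 | 88.87 | 85.97 | 83.35 |
| RefCOCO (Avg) | 85.94 | 89.37 | 91.01 | 89.65 | 91.40 | 86.56 | 90.25 |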

OCR & Document Understanding

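| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|---|
| OCRBench | 831 | 854 | 873 | 881 | 847 | 810 | 883 | 874 |
| AI2D_TEST | 81.38 | 85.07 | 87.23 | 85.07 | 83.55 | 77.07 | 80.472 | 83.84 |
| OCRVQA_TEST | 66.15 | 68.98 | 74.06 | 39.03 | 35.58 | 69.24 | 71.02 | 66.8 |
| TextVQA_VAL | 80.11 | 82.13 | 84.48 | 82.15 | 83.52 | 79.09 | 84.962 | 83.26 |
| DocVQA_VAL | 90.85 | 93.54 | 94.75 | 92.04 | 83.82 | 92.71 | 94.91 | 95.75 |
| ChartQA_TEST | 81.79 | 87.72 | 89.6 | 85.76 | 82.04 | 83.4 | 86.68 | 87.16 |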
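
For readers who want to check which model leads on each OCR benchmark, the scores above can be compared programmatically. The following is a minimal sketch: the numbers are copied from the OCR & Document Understanding results, while the names `MODELS`, `OCR_SCORES`, and `best_model` are hypothetical and introduced only for this illustration.

```python
# Column order of the OCR & Document Understanding results
MODELS = [
    "Qianfan-VL-3B", "Qianfan-VL-8B", "Qianfan-VL-70B",
    "InternVL-3-8B", "InternVL-3-78B",
    "Qwen2.5-VL-3B", "Qwen2.5-VL-7B", "Qwen2.5-VL-72B",
]

# Scores copied from the results above, one list per benchmark
OCR_SCORES = {
    "OCRBench": [831, 854, 873, 881, 847, 810, 883, 874],
    "AI2D_TEST": [81.38, 85.07, 87.23, 85.07, 83.55, 77.07, 80.472, 83.84],
    "OCRVQA_TEST": [66.15, 68.98, 74.06, 39.03, 35.58, 69.24, 71.02, 66.8],
    "TextVQA_VAL": [80.11, 82.13, 84.48, 82.15, 83.52, 79.09, 84.962, 83.26],
    "DocVQA_VAL": [90.85, 93.54, 94.75, 92.04, 83.82, 92.71, 94.91, 95.75],
    "ChartQA_TEST": [81.79, 87.72, 89.6, 85.76, 82.04, 83.4, 86.68, 87.16],
}


def best_model(scores):
    """Return (model name, score) for the highest score in one benchmark row."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    return MODELS[idx], scores[idx]


for benchmark, scores in OCR_SCORES.items():
    name, score = best_model(scores)
    print(f"{benchmark}: {name} ({score})")
```

On these numbers, Qianfan-VL-70B has the top score on AI2D_TEST, OCRVQA_TEST, and ChartQA_TEST, while Qwen2.5-VL models lead on the other three.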

Quick Start

Installation

bash

pip install transformers accelerate torch torchvision pillow einops
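
Before loading a model, it can help to confirm that the packages from the command above are actually present in the active environment. The following is a small sketch using only the standard library; `REQUIRED` and `installed_versions` are hypothetical names, and no minimum versions are checked because the source does not state any.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the pip command above
REQUIRED = ["transformers", "accelerate", "torch", "torchvision", "pillow", "einops"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```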

Using Transformers

python

import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16)

# Inference
prompt = "<image>请识别图中所有文字"  # "Please recognize all the text in the image"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
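
To get a feel for how `dynamic_preprocess` splits an image, the grid-selection step can be run on plain width and height values without loading an image or the model. The following is a minimal sketch that mirrors the logic of `find_closest_aspect_ratio` and the thumbnail rule above; `pick_tile_grid` and `count_tiles` are hypothetical helper names and are not part of the model's API.

```python
def pick_tile_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (columns, rows) tile grid chosen for an image of this size."""
    aspect_ratio = width / height
    # All (columns, rows) grids whose tile count lies in [min_num, max_num],
    # ordered from fewest to most tiles
    candidates = sorted(
        {(i, j)
         for n in range(min_num, max_num + 1)
         for i in range(1, n + 1)
         for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda grid: grid[0] * grid[1],
    )
    best, best_diff = (1, 1), float("inf")
    area = width * height
    for cols, rows in candidates:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # On a tie, move to the larger grid when the image has enough pixels
            best = (cols, rows)
    return best


def count_tiles(width, height, max_num=12, use_thumbnail=True):
    """Number of image_size x image_size tiles that would be passed to the model."""
    cols, rows = pick_tile_grid(width, height, max_num=max_num)
    tiles = cols * rows
    # A thumbnail is appended only when the image was split into several tiles
    if use_thumbnail and tiles != 1:
        tiles += 1
    return tiles


print(pick_tile_grid(1920, 1080))  # (4, 2)
print(count_tiles(1920, 1080))     # 9
```

A 1920x1080 image is closest to a 2:1 grid, and because it has enough pixels the 4x2 layout wins the tie over 2x1, giving eight tiles plus one thumbnail. Lowering `max_num` reduces the tile count and therefore the size of `pixel_values`.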

Training Details

Four-Stage Progressive Training

  1. Cross-modal Alignment (100B tokens): Establishes vision-language connections
  2. General Knowledge Injection (3.5T tokens): Builds strong foundational capabilities
  3. Domain Enhancement (300B tokens): Specialized OCR and reasoning capabilities
  4. Post-training (1B tokens): Instruction following and preference alignment
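
The relative size of the four stages is easier to see as shares of the total token budget. The following is a small arithmetic sketch using only the token counts listed above, with 1T taken as 1000B; `STAGE_TOKENS_B` and `stage_shares` are hypothetical names introduced for this illustration.

```python
# Token counts per stage in billions, from the list above (3.5T = 3500B)
STAGE_TOKENS_B = {
    "Cross-modal Alignment": 100,
    "General Knowledge Injection": 3500,
    "Domain Enhancement": 300,
    "Post-training": 1,
}


def stage_shares(stage_tokens):
    """Return the total token count and each stage's fraction of it."""
    total = sum(stage_tokens.values())
    shares = {name: tokens / total for name, tokens in stage_tokens.items()}
    return total, shares


total_b, shares = stage_shares(STAGE_TOKENS_B)
print(f"Total: {total_b}B tokens")  # Total: 3901B tokens
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

General Knowledge Injection accounts for roughly 90% of the roughly 3.9T training tokens, while Post-training is a very small fraction of the total.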

Infrastructure

  • Trained on a cluster of 5000+ Baidu Kunlun chips
  • A single training task parallelized across 5000 chips
  • 90%+ scaling efficiency for large-scale distributed training
  • Communication-computation fusion technology
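
To put the scaling-efficiency figure in concrete terms, the sketch below assumes the common definition of scaling efficiency as measured throughput divided by ideal linear throughput. The source does not say which definition it uses, so this is an assumption, and `scaling_efficiency` and `effective_chips` are hypothetical helper names.

```python
def scaling_efficiency(measured_throughput, single_chip_throughput, n_chips):
    """Measured throughput as a fraction of ideal linear scaling (assumed definition)."""
    ideal_throughput = single_chip_throughput * n_chips
    return measured_throughput / ideal_throughput


def effective_chips(n_chips, efficiency):
    """Number of ideally scaling chips that would deliver the same throughput."""
    return n_chips * efficiency


# Figures from the list above: 5000 chips at 90% scaling efficiency
print(effective_chips(5000, 0.90))  # 4500.0
```

Under this definition, 90% efficiency on 5000 chips delivers the throughput of 4500 chips scaling perfectly; the remaining 10% is lost to overheads such as communication.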

Citation

If you use Qianfan-VL or Qianfan-OCR in your research, please cite:

bibtex

@misc{dong2026qianfanocr,
  title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
  author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
  year={2026},
  eprint={2603.13398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.13398}
}

@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}

Contact

For more information and API access, visit: Baidu AI Cloud Qianfan Platform
