README
License: other
🔗 Quick Links
- Repository: 💻 GitHub
- Models: 🤗 Hugging Face | 🤖 ModelScope
- Documentation: 📚 Cookbook | 📝 Technical Report
- Blogs: 🇨🇳 Chinese Blog | 🇬🇧 English Blog
Model Description
Qianfan-VL is a series of general-purpose multimodal large language models enhanced for enterprise-level multimodal applications. The models offer deep optimization for high-frequency scenarios in industrial deployment while maintaining strong general capabilities.
Model Variants
| Model | Parameters | Context Length | CoT Support | Best For |
|---|---|---|---|---|
| Qianfan-VL-3B | 3B | 32k | ❌ | Edge deployment, real-time OCR |
| Qianfan-VL-8B | 8B | 32k | ✅ | Server-side general scenarios, fine-tuning |
| Qianfan-VL-70B | 70B | 32k | ✅ | Complex reasoning, data synthesis |
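As a rough sizing aid for the deployment targets in the table, the sketch below estimates weights-only memory from the nominal parameter counts. It assumes bfloat16 weights (2 bytes per parameter, the dtype used in the Quick Start) and ignores activations, the KV cache, and any parameters beyond the nominal count, so real usage will be higher. The function name `weight_memory_gb` is illustrative and not part of any Qianfan-VL API.

```python
# Nominal parameter counts taken from the Model Variants table.
NOMINAL_PARAMS = {
    "Qianfan-VL-3B": 3e9,
    "Qianfan-VL-8B": 8e9,
    "Qianfan-VL-70B": 70e9,
}

BYTES_PER_PARAM_BF16 = 2  # assumption: weights are stored in bfloat16


def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Return the weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, params in NOMINAL_PARAMS.items():
        print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in bf16")
```

Under these assumptions the 70B variant needs on the order of 140 GB for weights alone, which is why the Quick Start loads the model with `device_map="auto"` so it can be sharded across devices.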
Architecture
- Language Model:
  - Qianfan-VL-3B: Based on Qwen2.5-3B
  - Qianfan-VL-8B/70B: Based on Llama 3.1 architecture
  - Enhanced with 3T multilingual corpus
- Vision Encoder: InternViT-based, supports dynamic patching up to 4K resolution
- Cross-modal Fusion: MLP adapter for efficient vision-language bridging
Key Capabilities
🔍 OCR & Document Understanding
- Full-Scenario OCR: Handwriting, formulas, natural scenes, cards/documents
- Document Intelligence: Layout analysis, table parsing, chart understanding, document Q&A
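- High Precision: Among the models compared in the OCR & Document Understanding benchmarks, Qianfan-VL-70B scores highest on AI2D_TEST, OCRVQA_TEST, and ChartQA_TEST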
🧮 Chain-of-Thought Reasoning (8B & 70B)
- Complex chart analysis and reasoning
- Mathematical problem-solving with step-by-step derivation
- Visual reasoning and logical inference
- Statistical computation and trend prediction
📊 Benchmark Performance
General Vision-Language Benchmarks
| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|
| A-Bench_VAL | 75.65 | 75.72 | 78.1 | 75.86 | 75.86 | 76.49 | 79.22 |
| CCBench | 66.86 | 70.39 | 80.98 | 77.84 | 70.78 | 57.65 | 73.73 |
| SEEDBench_IMG | 76.55 | 78.02 | 79.13 | 77.0 | 77.52 | 76.98 | 78.34 |
| SEEDBench2_Plus | 67.59 | 70.97 | 73.17 | 69.52 | 68.47 | 70.93 | 73.25 |
| MMVet | 48.17 | 53.21 | 67.34 | 80.28 | 78.9 | 70.64 | 75.69 |
| MMMU_VAL | 46.44 | 47.11 | 58.33 | 56.11 | 60.78 | 51.0 | 65.78 |
| ScienceQA_TEST | 95.19 | 97.62 | 98.76 | 97.97 | 97.17 | 85.47 | 92.51 |
| ScienceQA_VAL | 93.85 | 97.62 | 98.81 | 97.81 | 95.14 | 83.59 | 91.32 |
| MMT-Bench_VAL | 62.23 | 63.22 | 71.06 | 65.17 | 63.67 | 61.4 | 69.49 |
| MTVQA_TEST | 26.5 | 30.14 | 32.18 | 30.3 | 27.62 | 29.08 | 31.48 |
| BLINK | 49.97 | 56.81 | 59.44 | 55.87 | 51.87 | 54.55 | 63.02 |
| MMStar | 57.93 | 64.07 | 69.47 | 68.4 | 66.07 | 61.53 | 66.0 |
| RealWorldQA | 65.75 | 70.59 | 71.63 | 71.11 | 74.25 | 69.28 | 73.86 |
| Q-Bench1_VAL | 73.51 | 75.25 | 77.46 | 75.99 | 77.99 | 78.1 | 79.93 |
| POPE | 85.08 | 86.06 | 88.97 | 90.59 | 88.87 | 85.97 | 83.35 |
| RefCOCO (Avg) | 85.94 | 89.37 | 91.01 | 89.65 | 91.40 | 86.56 | 90.25 |
OCR & Document Understanding
| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|---|
| OCRBench | 831 | 854 | 873 | 881 | 847 | 810 | 883 | 874 |
| AI2D_TEST | 81.38 | 85.07 | 87.23 | 85.07 | 83.55 | 77.07 | 80.472 | 83.84 |
| OCRVQA_TEST | 66.15 | 68.98 | 74.06 | 39.03 | 35.58 | 69.24 | 71.02 | 66.8 |
| TextVQA_VAL | 80.11 | 82.13 | 84.48 | 82.15 | 83.52 | 79.09 | 84.962 | 83.26 |
| DocVQA_VAL | 90.85 | 93.54 | 94.75 | 92.04 | 83.82 | 92.71 | 94.91 | 95.75 |
| ChartQA_TEST | 81.79 | 87.72 | 89.6 | 85.76 | 82.04 | 83.4 | 86.68 | 87.16 |
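With eight models per row, the leader on each benchmark is easy to misread. The sketch below parses the OCR & Document Understanding table as printed above and reports the top-scoring model per row. The helper names `parse_markdown_table` and `leaders` are illustrative, and the scores are copied verbatim from the table rather than re-measured.

```python
TABLE = """\
| Benchmark | Qianfan-VL-3B | Qianfan-VL-8B | Qianfan-VL-70B | InternVL-3-8B | InternVL-3-78B | Qwen2.5-VL-3B | Qwen2.5-VL-7B | Qwen2.5-VL-72B |
|---|---|---|---|---|---|---|---|---|
| OCRBench | 831 | 854 | 873 | 881 | 847 | 810 | 883 | 874 |
| AI2D_TEST | 81.38 | 85.07 | 87.23 | 85.07 | 83.55 | 77.07 | 80.472 | 83.84 |
| OCRVQA_TEST | 66.15 | 68.98 | 74.06 | 39.03 | 35.58 | 69.24 | 71.02 | 66.8 |
| TextVQA_VAL | 80.11 | 82.13 | 84.48 | 82.15 | 83.52 | 79.09 | 84.962 | 83.26 |
| DocVQA_VAL | 90.85 | 93.54 | 94.75 | 92.04 | 83.82 | 92.71 | 94.91 | 95.75 |
| ChartQA_TEST | 81.79 | 87.72 | 89.6 | 85.76 | 82.04 | 83.4 | 86.68 | 87.16 |
"""


def parse_markdown_table(text):
    """Split a pipe table into (model names, {benchmark: [scores]})."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.strip().splitlines()
    ]
    header, body = rows[0], rows[2:]  # rows[1] is the |---| separator
    return header[1:], {row[0]: [float(x) for x in row[1:]] for row in body}


def leaders(models, scores):
    """Map each benchmark to the model with the highest score."""
    return {
        bench: models[max(range(len(vals)), key=vals.__getitem__)]
        for bench, vals in scores.items()
    }


if __name__ == "__main__":
    models, scores = parse_markdown_table(TABLE)
    for bench, model in leaders(models, scores).items():
        print(f"{bench}: {model}")
```

All six benchmarks in this table are higher-is-better, so a plain maximum is enough; a table mixing error rates with accuracies would need a per-row direction flag.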
Quick Start
Installation
```bash
pip install transformers accelerate torch torchvision pillow einops
```
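Before loading a multi-gigabyte checkpoint, it can help to confirm that the packages above are importable. The following is a minimal sketch using only the standard library. Note that `pillow` is imported as `PIL`. The function name `missing_packages` is illustrative.

```python
import importlib.util

# Import names for the packages listed in the install command
# (the pillow distribution is imported as "PIL").
REQUIRED = ["transformers", "accelerate", "torch", "torchvision", "PIL", "einops"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that each package can be located; it does not verify versions or GPU support.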
Using Transformers
```python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # on a tie, prefer the larger grid if the image has enough pixels
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    # calculate the existing image aspect ratio
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate candidate (columns, rows) grids with min_num <= tiles <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1)
        for i in range(1, n + 1) for j in range(1, n + 1)
        if i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values


# Load model
MODEL_PATH = "baidu/Qianfan-VL-8B"  # or Qianfan-VL-3B, Qianfan-VL-70B
model = AutoModel.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Load and process image
pixel_values = load_image("./example/scene_ocr.png").to(torch.bfloat16)

# Inference
# The prompt is Chinese for "Please recognize all the text in the image".
prompt = "<image>请识别图中所有文字"
with torch.no_grad():
    response = model.chat(
        tokenizer,
        pixel_values=pixel_values,
        question=prompt,
        generation_config={"max_new_tokens": 512},
        verbose=False
    )
print(response)
```
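To see how the dynamic tiling in the Quick Start behaves without loading the model, the sketch below reimplements only the grid-selection step in plain Python. It mirrors the logic of `find_closest_aspect_ratio` and `dynamic_preprocess` under the same defaults (`image_size=448`, `max_num=12`). The names `candidate_grids`, `pick_grid`, and `tile_count` are illustrative, and the sort adds a deterministic tie-breaker that the original code does not have.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    }
    # Sort by tile count; the second key only makes the order deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def pick_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Choose the grid whose aspect ratio is closest to the image's."""
    aspect_ratio = width / height
    area = width * height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # Same ratio, more tiles: take it only if the image is large enough.
            best = (cols, rows)
    return best


def tile_count(width, height, use_thumbnail=True, **kwargs):
    """Number of image tiles passed to the vision encoder, thumbnail included."""
    cols, rows = pick_grid(width, height, **kwargs)
    tiles = cols * rows
    return tiles + 1 if use_thumbnail and tiles != 1 else tiles


if __name__ == "__main__":
    for size in [(448, 448), (1000, 1000), (1920, 1080)]:
        print(size, "->", pick_grid(*size), "grid,", tile_count(*size), "tiles")
```

With these defaults a 1920×1080 image maps to a 4×2 grid, giving 8 tiles plus one thumbnail, while a 448×448 image stays a single tile with no thumbnail. Raising `max_num` allows finer grids for large documents at the cost of more visual tokens.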
Training Details
Four-Stage Progressive Training
- Cross-modal Alignment (100B tokens): Establishes vision-language connections
- General Knowledge Injection (3.5T tokens): Builds strong foundational capabilities
- Domain Enhancement (300B tokens): Specialized OCR and reasoning capabilities
- Post-training (1B tokens): Instruction following and preference alignment
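To put the four stages in proportion, the short calculation below sums the token budgets listed above and prints each stage's share. The figures are copied from the list; the variable names are illustrative.

```python
# Token budgets per stage, as listed above (integers avoid rounding noise).
STAGES = [
    ("Cross-modal Alignment", 100_000_000_000),
    ("General Knowledge Injection", 3_500_000_000_000),
    ("Domain Enhancement", 300_000_000_000),
    ("Post-training", 1_000_000_000),
]

TOTAL_TOKENS = sum(tokens for _, tokens in STAGES)

# Percentage of the total budget consumed by each stage.
SHARES = {name: 100 * tokens / TOTAL_TOKENS for name, tokens in STAGES}

if __name__ == "__main__":
    print(f"Total: {TOTAL_TOKENS / 1e12:.3f}T tokens")
    for name, share in SHARES.items():
        print(f"{name}: {share:.2f}%")
```

The total comes to about 3.9T tokens, with General Knowledge Injection accounting for roughly 90% of it and Domain Enhancement for under 8%.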
Infrastructure
- Trained on 5000+ Baidu Kunlun chips
- A single training task runs in parallel across 5,000 chips
- 90%+ scaling efficiency for large-scale distributed training
- Innovative communication-computation fusion technology
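The source does not say how the 90%+ scaling efficiency was measured. A common definition is the measured cluster throughput divided by the ideal linear throughput (chip count times single-chip throughput). The sketch below uses that definition as an assumption. The function names and the example throughput numbers are hypothetical, not reported results.

```python
def scaling_efficiency(cluster_throughput, n_chips, single_chip_throughput):
    """Measured throughput as a fraction of ideal linear scaling."""
    return cluster_throughput / (n_chips * single_chip_throughput)


def effective_chips(n_chips, efficiency):
    """Number of ideally scaling chips that would deliver the same throughput."""
    return n_chips * efficiency


if __name__ == "__main__":
    # Hypothetical example: each chip alone processes 1.0 unit/s,
    # and 5,000 chips together process 4,500 units/s.
    eff = scaling_efficiency(4500.0, 5000, 1.0)
    print(f"efficiency = {eff:.0%}")
    print(f"effective chips = {effective_chips(5000, eff):.0f}")
```

Under this definition, 90% efficiency on 5,000 chips means the cluster delivers the work of about 4,500 perfectly scaling chips, with the remainder lost to communication and synchronization.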
Citation
If you use Qianfan-VL or Qianfan-OCR in your research, please cite:
```bibtex
@misc{dong2026qianfanocr,
  title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
  author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
  year={2026},
  eprint={2603.13398},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.13398}
}

@misc{qianfan-vl-2025,
  title={Qianfan-VL: Domain-Enhanced Universal Vision-Language Models},
  author={Qianfan Team},
  year={2025},
  publisher={Baidu}
}
```
Contact
For more information and API access, visit: Baidu AI Cloud Qianfan Platform