# Qianfan-OCR

A 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding in a single architecture.
## Introduction
Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by the Baidu Qianfan Team. It unifies document parsing, layout analysis, and document understanding within a single vision-language architecture.
Unlike traditional multi-stage OCR pipelines that chain separate layout detection, text recognition, and language comprehension modules, Qianfan-OCR performs direct image-to-Markdown conversion and supports a broad range of prompt-driven tasks — from structured document parsing and table extraction to chart understanding, document question answering, and key information extraction — all within one model.
## Key Highlights
- #1 End-to-End Model on OmniDocBench v1.5 — Achieves 93.12 overall score, surpassing DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33), and all other end-to-end models
- #1 End-to-End Model on OlmOCR Bench — Scores 79.8
- #1 on Key Information Extraction — Overall mean score of 87.9 across five public KIE benchmarks, surpassing Gemini-3.1-Pro, Gemini-3-Pro, Seed-2.0, and Qwen3-VL-235B-A22B
- Layout-as-Thought — An optional thinking phase that recovers explicit layout analysis within the end-to-end paradigm via ⟨think⟩ tokens
- 192 Languages — Multilingual OCR support across diverse scripts
- Efficient Deployment — Achieves 1.024 PPS (pages per second) with W8A8 quantization on a single A100 GPU
## Architecture
Qianfan-OCR adopts the multimodal bridging architecture from Qianfan-VL, consisting of three core components:
| Component | Details |
|---|---|
| Vision Encoder | Qianfan-ViT, 24 Transformer layers, AnyResolution design (up to 4K), 256 visual tokens per 448×448 tile, max 4,096 tokens per image |
| Language Model | Qwen3-4B (3.6B non-embedding), 36 layers, 2560 hidden dim, GQA (32 query / 8 KV heads), 32K context (extendable to 131K) |
| Cross-Modal Adapter | 2-layer MLP with GELU activation, projecting from 1024-dim to 2560-dim |
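The numbers in the table above imply a simple visual-token budget. A minimal sketch of that arithmetic (the tiling function here is an illustration, not Qianfan-ViT's actual preprocessing):

```python
# Illustrative token-budget math from the architecture table:
# 256 visual tokens per 448x448 tile, capped at 4,096 tokens per image.
import math

TILE = 448
TOKENS_PER_TILE = 256
MAX_TOKENS = 4096

def visual_tokens(width: int, height: int) -> int:
    """Upper-bound visual token count for an image tiled into 448x448 patches."""
    tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
    return min(tiles * TOKENS_PER_TILE, MAX_TOKENS)
```

Under this sketch, a single 448×448 tile costs 256 tokens, while a 4K page (3840×2160) would need 45 tiles uncapped and therefore hits the 4,096-token ceiling (16 tiles' worth).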
## Layout-as-Thought
A key innovation is Layout-as-Thought: an optional thinking phase triggered by ⟨think⟩ tokens, where the model generates structured layout representations (bounding boxes, element types, reading order) before producing final outputs.
This mechanism serves two purposes:
- Functional: Recovers layout analysis capability within the end-to-end paradigm — users obtain structured layout results directly
- Enhancement: Provides targeted accuracy improvements on documents with complex layouts, cluttered elements, or non-standard reading orders
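For downstream use, the thinking phase can be separated from the final output with simple string handling. A minimal sketch, shown with ASCII `<think>` tags for simplicity; the exact inner format of the layout representation is not specified here:

```python
import re

def split_layout_and_output(response: str) -> tuple[str, str]:
    """Split a model response into (layout_thought, final_output).

    Assumes the layout analysis appears between <think> and </think> tokens;
    returns an empty layout string when no thinking phase is present.
    """
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if not match:
        return "", response.strip()
    layout = match.group(1).strip()
    final = (response[:match.start()] + response[match.end():]).strip()
    return layout, final
```

This keeps the structured layout result available to the caller while passing only the final Markdown downstream.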
## Skill
This model provides a Qianfan OCR Document Intelligence skill for image and PDF understanding workflows.
It can be used with OpenClaw, Claude Code, Codex, and other assistants that support this skill format. The skill packages reusable instructions, scripts, and references so the agent can automatically apply Qianfan-powered document intelligence to tasks such as:
- document parsing to Markdown
- layout analysis
- element recognition
- general OCR
- key information extraction
- chart understanding
- document VQA
The skill is designed for visual understanding tasks over images and PDFs, and includes the execution flow needed to prepare inputs, choose the right analysis mode, and call the bundled CLI tools.
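As an illustration of how an agent might route a request to one of the modes listed above (the mode names and keyword mapping are hypothetical, not the skill's actual interface):

```python
# Hypothetical task-to-mode router for the document intelligence skill.
# Mode names are illustrative; the bundled CLI defines the real interface.
MODES = {
    "parse": "document parsing to Markdown",
    "layout": "layout analysis",
    "kie": "key information extraction",
    "chart": "chart understanding",
    "vqa": "document VQA",
    "ocr": "general OCR",
}

def choose_mode(task: str) -> str:
    """Pick an analysis mode from keywords in a free-form task description."""
    task = task.lower()
    keywords = {
        "parse": ("markdown", "convert", "parse"),
        "layout": ("layout", "reading order", "bounding"),
        "kie": ("extract", "field", "invoice", "key information"),
        "chart": ("chart", "plot", "graph"),
        "vqa": ("question", "answer"),
    }
    for mode, words in keywords.items():
        if any(w in task for w in words):
            return mode
    return "ocr"  # safe default: plain text recognition
```

A real execution flow would also handle input preparation (e.g. rasterizing PDF pages) before invoking the chosen mode.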
## License
This project is licensed under the Apache License 2.0. See LICENSE for the full license text.
Some bundled third-party source files are licensed under the MIT License. See NOTICE for the file list and corresponding attribution details.