README
License: MIT
Usage
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.12.9 + CUDA 11.8:
```
torch==2.6.0
transformers==4.46.3
tokenizers==0.20.3
einops
addict
easydict
pip install flash-attn==2.7.3 --no-build-isolation
```
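Before loading the model, it may help to confirm that the environment matches the pinned versions above. The following is a minimal sketch, not part of the original README: `PINNED`, `installed_version`, and `find_mismatches` are hypothetical names introduced here, and only the packages listed with an exact version are checked.

```python
from importlib import metadata

# Exact pins copied from the requirements above (hypothetical helper table).
PINNED = {
    "torch": "2.6.0",
    "transformers": "4.46.3",
    "tokenizers": "0.20.3",
    "flash-attn": "2.7.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        # A local build tag such as "2.6.0+cu118" still counts as a match.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means every pinned package is installed at the tested version. The unpinned packages (einops, addict, easydict) are deliberately left out of the check.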
```python
from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_name = 'deepseek-ai/DeepSeek-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

# prompt = "<image>\nFree OCR. "
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

# infer(self, tokenizer, prompt='', image_file='', output_path = ' ', base_size = 1024, image_size = 640, crop_mode = True, test_compress = False, save_results = False):

# Tiny: base_size = 512, image_size = 512, crop_mode = False
# Small: base_size = 640, image_size = 640, crop_mode = False
# Base: base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False
# Gundam: base_size = 1024, image_size = 640, crop_mode = True

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)
```
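The five resolution modes listed in the comments above can be kept in a small lookup table so that a mode is selected by name rather than by retyping three arguments. This is only a sketch: `RESOLUTION_MODES` and `infer_kwargs` are hypothetical names, and the values are transcribed from the comments in the snippet, not from any documented API.

```python
# Presets transcribed from the comments in the inference snippet above.
RESOLUTION_MODES = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}


def infer_kwargs(mode):
    """Return a copy of the preset for `mode` (case-insensitive)."""
    key = mode.lower()
    if key not in RESOLUTION_MODES:
        raise ValueError(
            f"unknown mode {mode!r}; choose from {sorted(RESOLUTION_MODES)}"
        )
    # Copy so callers can modify the result without touching the table.
    return dict(RESOLUTION_MODES[key])


# Usage with the model and tokenizer loaded above (not executed here):
# res = model.infer(tokenizer, prompt=prompt, image_file=image_file,
#                   output_path=output_path, save_results=True,
#                   **infer_kwargs("gundam"))
```

The `gundam` preset reproduces the arguments used in the call above; the other presets disable cropping and use a single resolution for both sizes.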
vLLM
Refer to GitHub for guidance on model inference acceleration, PDF processing, and more.
[2025/10/23] DeepSeek-OCR is now officially supported in upstream vLLM.
```shell
uv venv
source .venv/bin/activate
# Until v0.11.1 release, you need to install vLLM from nightly build
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```
```python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

# Create model instance
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

# Prepare batched input with your image file
image_1 = Image.open("path/to/your/image_1.png").convert("RGB")
image_2 = Image.open("path/to/your/image_2.png").convert("RGB")
prompt = "<image>\nFree OCR."

model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image_1},
    },
    {
        "prompt": prompt,
        "multi_modal_data": {"image": image_2},
    },
]

sampling_param = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    # ngram logit processor args
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # whitelist: <td>, </td>
    ),
    skip_special_tokens=False,
)

# Generate output
model_outputs = llm.generate(model_input, sampling_param)

# Print output
for output in model_outputs:
    print(output.outputs[0].text)
```
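The `ngram_size`, `window_size`, and `whitelist_token_ids` arguments above configure an n-gram logits processor. The sketch below illustrates the general no-repeat n-gram technique that these parameter names suggest. It is not the vLLM implementation: `banned_next_tokens` is a hypothetical function, and the exact way `NGramPerReqLogitsProcessor` interprets `window_size` is an assumption here.

```python
def banned_next_tokens(token_ids, ngram_size, window_size,
                       whitelist_token_ids=frozenset()):
    """Return tokens that would complete an n-gram already seen in the recent window.

    Illustrative only: assumes the window covers the last `window_size`
    generated tokens and that whitelisted tokens are never banned.
    """
    if len(token_ids) < ngram_size:
        return set()

    # The last (ngram_size - 1) tokens are the prefix of the n-gram
    # that the next token would complete.
    prefix = tuple(token_ids[len(token_ids) - (ngram_size - 1):])

    start = max(0, len(token_ids) - window_size)
    banned = set()
    for i in range(start, len(token_ids) - ngram_size + 1):
        ngram = tuple(token_ids[i:i + ngram_size])
        if ngram[:-1] == prefix:
            # Emitting ngram[-1] next would repeat this n-gram verbatim.
            banned.add(ngram[-1])

    return banned - set(whitelist_token_ids)


# Small demonstration with toy token IDs and a 3-gram window.
history = [1, 2, 3, 1, 2]
print(banned_next_tokens(history, ngram_size=3, window_size=90))
```

In a processor of this kind, the logits of the returned token IDs would typically be set to negative infinity before sampling, which stops the model from looping on the same passage. Whitelisting `<td>` and `</td>`, as the comment in the snippet above notes, keeps table cell tags exempt, since tables legitimately repeat them.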
Visualizations
Acknowledgement
We would like to thank Vary, GOT-OCR2.0, MinerU, PaddleOCR, OneChart, Slow Perception for their valuable models and ideas.
We also appreciate the benchmarks: Fox, OmniDocBench.
Citation
```bibtex
@article{wei2025deepseek,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}
```
Model provider: deepseek-ai
Model tree: Base (this model)
Modalities
Input: Text, Image
Output: Text