typhoon-ocr1.5-2b API & Inference Endpoint

Model Performance

BLEU Score (↑ Higher is better)

(Figure: BLEU Score)

ROUGE-L Score (↑ Higher is better)

(Figure: ROUGE-L Score)

Levenshtein Distance (↓ Lower is better)

(Figure: Levenshtein Distance)
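For readers unfamiliar with the last metric, here is a minimal sketch of character-level Levenshtein (edit) distance between a predicted text and a reference. Whether the benchmark reports raw or length-normalized distance is not stated here, so `normalized_levenshtein` is an assumption added for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Hypothetical helper: distance divided by the longer length (0.0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))
print(normalized_levenshtein("typhoon", "typhoon"))
```

A lower value means the OCR output is closer to the reference, which is why this metric is marked "lower is better".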

Prompting

python
prompt = """Extract all text from the image.

Instructions:
- Only return the clean Markdown.
- Do not include any explanation or extra text.
- You must include all information on the page.

Formatting Rules:
- Tables: Render tables using <table>...</table> in clean HTML format.
- Equations: Render equations using LaTeX syntax with inline ($...$) and block ($$...$$).
- Images/Charts/Diagrams: Wrap any clearly defined visual areas (e.g. charts, diagrams, pictures) in:

<figure>
Describe the image's main elements (people, objects, text), note any contextual clues (place, event, culture), mention visible text and its meaning, provide deeper analysis when relevant (especially for financial charts, graphs, or documents), comment on style or architecture if relevant, then give a concise overall summary. Describe in Thai.
</figure>

- Page Numbers: Wrap page numbers in <page_number>...</page_number> (e.g., <page_number>14</page_number>).
- Checkboxes: Use ☐ for unchecked and ☑ for checked boxes."""
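Because the prompt asks for fixed markers (`<figure>`, `<page_number>`, `<table>`, ☐/☑), the returned Markdown can be post-processed with simple pattern matching. The following is a minimal sketch, not part of the typhoon-ocr package: `summarize_ocr_output` and the sample string are hypothetical, and it assumes the model followed the formatting rules exactly.

```python
import re

# Patterns for the markers requested in the prompt's formatting rules.
PAGE_NUMBER_RE = re.compile(r"<page_number>(.*?)</page_number>", re.DOTALL)
FIGURE_RE = re.compile(r"<figure>(.*?)</figure>", re.DOTALL)


def summarize_ocr_output(markdown: str) -> dict:
    """Collect the structured markers found in one page of OCR output."""
    return {
        "page_numbers": [m.strip() for m in PAGE_NUMBER_RE.findall(markdown)],
        "figures": [m.strip() for m in FIGURE_RE.findall(markdown)],
        "tables": markdown.count("<table"),
        "checked": markdown.count("☑"),
        "unchecked": markdown.count("☐"),
    }


# Hand-written sample standing in for real model output.
sample = (
    "# Form\n"
    "☑ Agree\n☐ Disagree\n"
    "<figure>\nA bar chart.\n</figure>\n"
    "<table><tr><td>1</td></tr></table>\n"
    "<page_number>14</page_number>"
)
print(summarize_ocr_output(sample))
```

Since the model may not always emit well-formed markers, treat an empty result as "nothing detected" rather than as proof that the page contains none.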

Quickstart

Full inference code available on Colab

Using Typhoon-OCR Package

bash
pip install typhoon-ocr -U

python
from typhoon_ocr import ocr_document

# please set env TYPHOON_OCR_API_KEY or OPENAI_API_KEY to use this function
markdown = ocr_document("test.png", model = "typhoon-ocr", figure_language = "Thai", task_type = "v1.5")
print(markdown)

Local Model via vllm (GPU Required):

bash
pip install vllm
vllm serve scb10x/typhoon-ocr1.5-2b --max-model-len 49152 --served-model-name typhoon-ocr-1-5 # OpenAI Compatible at http://localhost:8000 (or other port)
# then pass base_url to ocr_document

python
from typhoon_ocr import ocr_document
markdown = ocr_document('image.png', model = "typhoon-ocr" , figure_language = "Thai" , task_type="v1.5", base_url='http://localhost:8000/v1', api_key='no-key')
print(markdown)

To read more about vllm, see the vllm documentation.

Local Model - Transformers (GPU Required):

python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

def resize_if_needed(img, max_size):
    """Scale img so its longer side equals max_size, preserving aspect ratio."""
    width, height = img.size
    # Images with both sides at or below 300 px are returned unchanged;
    # anything larger is scaled (up or down) to max_size on its longer side.
    if width <= 300 and height <= 300:
        return img
    if width >= height:
        scale = max_size / float(width)
        new_size = (max_size, int(height * scale))
    else:
        scale = max_size / float(height)
        new_size = (int(width * scale), max_size)
    img = img.resize(new_size, Image.Resampling.LANCZOS)
    print(f"{(width, height)} ==> {img.size}")
    return img


model = AutoModelForImageTextToText.from_pretrained(
    "scb10x/typhoon-ocr1.5-2b", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("scb10x/typhoon-ocr1.5-2b")

img = Image.open("image.png")


# Important: the model was trained with a fixed image dimension of 1800 px
img = resize_if_needed(img, 1800)

messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": img,
                },
                {
                    "type": "text",
                    "text": prompt  # the prompt string defined in the Prompting section
                }
            ],
        }
    ]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=10000)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
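To see what the resize step in the script above does to typical page sizes, the sketch below reproduces only its arithmetic, without loading an image. `target_size` and its `min_side` parameter are hypothetical names introduced for illustration; the 300 px and 1800 px values are taken from the script.

```python
def target_size(width: int, height: int, max_size: int = 1800, min_side: int = 300):
    """Return the (width, height) the resize step would produce."""
    # Small images (both sides <= min_side) are left untouched.
    if width <= min_side and height <= min_side:
        return (width, height)
    # Otherwise the longer side becomes max_size and the other side
    # is scaled by the same factor, truncated to an integer.
    if width >= height:
        scale = max_size / float(width)
        return (max_size, int(height * scale))
    scale = max_size / float(height)
    return (int(width * scale), max_size)


# An A4 scan at 300 dpi is downscaled; a small screenshot is upscaled.
for size in [(2480, 3508), (600, 400), (3600, 1800), (200, 100)]:
    print(size, "->", target_size(*size))
```

Note that images smaller than 1800 px are upscaled, not just large ones downscaled, which follows from the comment about the fixed training dimension.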

Hosting

We recommend running inference for typhoon-ocr with vllm rather than Hugging Face Transformers, and using the typhoon-ocr library to OCR documents. To read more about vllm, see the vllm documentation.

bash
pip install vllm
vllm serve scb10x/typhoon-ocr1.5-2b --max-model-len 49152 --served-model-name typhoon-ocr-1-5  # OpenAI Compatible at http://localhost:8000
# then pass base_url to ocr_document

python
from typhoon_ocr import ocr_document
markdown = ocr_document('image.png', model = "typhoon-ocr" , figure_language = "Thai", task_type="v1.5", base_url='http://localhost:8000/v1', api_key='no-key')
print(markdown)
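Because the vllm server is OpenAI compatible, it can also be called without the typhoon-ocr library. The sketch below only builds the JSON body for a chat completions request. It is an illustration under assumptions, not a description of what `ocr_document` sends internally: `build_chat_request` is a hypothetical helper, the image is assumed to be a PNG, and `max_tokens` simply mirrors the `max_new_tokens` value from the Transformers example. The model name is the `--served-model-name` from the command above.

```python
import base64
import json


def build_chat_request(image_bytes: bytes, prompt: str, model: str = "typhoon-ocr-1-5") -> dict:
    """Build an OpenAI-style chat completions body with one image and one text part."""
    # Images are passed inline as a base64 data URL (PNG assumed here).
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 10000,
    }


# Placeholder bytes stand in for a real (already resized) PNG file.
body = build_chat_request(b"\x89PNG placeholder", "Extract all text from the image.")
print(json.dumps(body)[:80])
# To send it, POST the JSON to http://localhost:8000/v1/chat/completions
# with any HTTP client; the request itself is not executed in this sketch.
```

The typhoon-ocr library remains the recommended path, since it also takes care of image preparation and prompt selection.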

Ollama & On-device inference

We recommend running Typhoon-OCR on-device using Ollama.

Intended Uses & Limitations

This is a task-specific model intended to be used only with the provided prompts. It does not include any guardrails or VQA capability. Due to the nature of large language models (LLMs), a certain level of hallucination may occur. We recommend that developers carefully assess these risks in the context of their specific use case.

Follow us

https://twitter.com/opentyphoon

Support

https://discord.gg/us5gAYmrxw

Citation

If you find Typhoon OCR useful for your work, please cite it using:

bibtex
@misc{nonesung2026typhoonocropenvisionlanguage,
      title={Typhoon OCR: Open Vision-Language Model For Thai Document Extraction}, 
      author={Surapon Nonesung and Natapong Nitarach and Teetouch Jaknamon and Pittawat Taveekitworachai and Kunat Pipatanakul},
      year={2026},
      eprint={2601.14722},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.14722}, 
}
