htmlgen-qwen3.5-4b-lora API & Inference Endpoint

Model Details

Item	Value
Base Model	Qwen/Qwen3.5-4B
Fine-tune Method	LoRA (PEFT)
LoRA Rank	32
LoRA Alpha	64
LoRA Dropout	0.05
Target Modules	all-linear (language_model)
Training Framework	ms-swift v4.3.0
Precision	bfloat16
Max Sequence Length	10240
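
As a rough illustration of what the rank and alpha values above imply, the sketch below computes the LoRA scaling factor and the number of extra weights an adapter adds to one linear layer. It assumes the standard LoRA formulation (update scaled by alpha / rank, not the rank-stabilized variant); `LoraSettings` and the 4096 x 4096 layer shape are hypothetical and not taken from the adapter or the base model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters copied from the Model Details table."""

    rank: int = 32
    alpha: int = 64
    dropout: float = 0.05

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.alpha / self.rank

    def added_params(self, in_features: int, out_features: int) -> int:
        # Matrix A holds rank * in_features weights, matrix B holds out_features * rank.
        return self.rank * (in_features + out_features)


settings = LoraSettings()
print(settings.scaling)  # 2.0
# 4096 x 4096 is an illustrative layer shape, not read from the base model config.
print(settings.added_params(4096, 4096))  # 262144
```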

Quick Start

Installation

bash
pip install torch transformers peft accelerate pillow qwen_vl_utils
# Or use ms-swift (recommended):
pip install "ms-swift[all]"  # quoted so shells such as zsh do not expand the brackets

Inference with ms-swift (Recommended)

python
from htmlgen_infer import HTMLGenModel

model = HTMLGenModel(
    adapter_path="Yesianrohn",  # or local path to this repo
    base_model="Qwen/Qwen3.5-4B",   # auto-detected from adapter_config.json
    merge_lora=True,
)

# Single image inference
html_output = model.predict("path/to/document_image.png")
print(html_output)

# Batch inference
results = model.predict_batch(["img1.png", "img2.png"])
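
To keep results on disk, one option is a small helper that runs a prediction function over a list of images and writes each result to its own file. This is a minimal sketch: `write_html_outputs` and its parameters are hypothetical names, not part of `htmlgen_infer`, and it only assumes a callable that maps an image path to an HTML string, as `model.predict` does above.

```python
from pathlib import Path
from typing import Callable, Iterable


def write_html_outputs(
    image_paths: Iterable[str],
    predict: Callable[[str], str],
    out_dir: str,
) -> list[Path]:
    """Run `predict` on each image and save the result as <image stem>.html."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in image_paths:
        html = predict(image_path)
        out_path = target / f"{Path(image_path).stem}.html"
        out_path.write_text(html, encoding="utf-8")
        written.append(out_path)
    return written
```

With the model loaded as above, a call might look like `write_html_outputs(["img1.png", "img2.png"], model.predict, "html_out")`. Images that share a file stem would overwrite each other, so a real pipeline may want a different naming scheme.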

Inference with Transformers + PEFT (Manual)

python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from peft import PeftModel
from PIL import Image

# Load base model
base_model_id = "Qwen/Qwen3.5-4B"
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "Yesianrohn")
model = model.merge_and_unload()  # Optional: merge for faster inference

# Prepare input
image = Image.open("document.png").convert("RGB")
system_prompt = (
    "You are an expert document parser. Given an image of a document page, "
    "reconstruct its source as a single complete, self-contained HTML5 "
    "document. Faithfully preserve the original layout, typography, tables, "
    "formulas, and visual hierarchy using inline CSS where appropriate. "
    "Output only the HTML source, with no explanations, no markdown fences, "
    "and no extra prose."
)
user_prompt = (
    "Convert this document page into a complete HTML document. "
    "Preserve the layout, headings, tables, and formulas exactly as shown. "
    "Return only the HTML source."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": user_prompt},
    ]},
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

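# Greedy decoding: do_sample=False is already deterministic, so no temperature is passed
# (temperature only applies when sampling, and 0.0 is rejected or warned about by some versions).
output_ids = model.generate(**inputs, max_new_tokens=10240, do_sample=False)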
output_text = processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(output_text)
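
Although the prompts ask for HTML with no markdown fences, a defensive post-processing step can be useful before saving `output_text`. The helper below is a sketch under that assumption; `strip_markdown_fence` is a hypothetical name and is not part of this repository, Transformers, or PEFT.

```python
import re

FENCE = "`" * 3  # a Markdown code-fence marker
# Matches text that is wrapped in exactly one fence, with an optional language tag.
_WRAPPED = re.compile(
    rf"^\s*{FENCE}[A-Za-z0-9]*[ \t]*\n(.*?)\n?{FENCE}\s*$", re.DOTALL
)


def strip_markdown_fence(text: str) -> str:
    """Return `text` without a single wrapping Markdown code fence, if present."""
    match = _WRAPPED.match(text)
    if match:
        return match.group(1).strip()
    # No wrapping fence found: only trim surrounding whitespace.
    return text.strip()
```

Calling `strip_markdown_fence(output_text)` leaves unfenced output unchanged apart from trimming leading and trailing whitespace.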

Inference with ms-swift CLI

bash
# Direct inference with swift
swift infer \
    --model Qwen/Qwen3.5-4B \
    --adapters Yesianrohn \
    --merge_lora true \
    --torch_dtype bfloat16 \
    --stream false

System Prompt

The model was trained with the following system prompt:

You are an expert document parser. Given an image of a document page, reconstruct its source as a single complete, self-contained HTML5 document. Faithfully preserve the original layout, typography, tables, formulas, and visual hierarchy using inline CSS where appropriate. Output only the HTML source, with no explanations, no markdown fences, and no extra prose.

Limitations

The model works best on clean, well-scanned document pages.
Very complex multi-column layouts or low-resolution images may produce imperfect HTML.
The maximum output length is 10240 tokens; very long documents may be truncated.
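
Because long pages can hit the output limit, it may help to flag results that look cut off. The checks below are heuristics only, sketched under the assumption that a complete result ends with a closing `</html>` tag; `looks_truncated` and `hit_token_limit` are hypothetical helper names.

```python
MAX_NEW_TOKENS = 10240  # output limit stated in the Limitations section


def looks_truncated(html: str) -> bool:
    """Heuristic: a complete HTML document should end with a closing </html> tag."""
    return not html.rstrip().lower().endswith("</html>")


def hit_token_limit(num_generated_tokens: int, limit: int = MAX_NEW_TOKENS) -> bool:
    """True when generation stopped because it used up the whole token budget."""
    return num_generated_tokens >= limit
```

In the Transformers example, the number of generated tokens would be `output_ids.shape[1] - inputs.input_ids.shape[1]`. A flagged page could be retried at a lower resolution or split into regions, although neither approach is tested here.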

License

This adapter is released under the Apache 2.0 license. The base model (Qwen3.5-4B) has its own license; refer to Qwen/Qwen3.5-4B for details.


