AbstractPhil

Qwen3.5-0.8B-json-captioner

What it is

Base: Qwen/Qwen3.5-0.8B — qwen3_5 architecture, ~873M params, image-text-to-text, Apache-2.0.
Adapter: AbstractPhil/qwen3.5-0.8b-task_1-lora-v2, folded in with peft's merge_and_unload().
Result: a single checkpoint with the base architecture that loads with AutoModelForImageTextToText + AutoProcessor; peft is not needed.

The merge was faithfulness-checked (base+LoRA logits vs. merged, in-memory and reloaded-from-disk) before upload.
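The check relies on the identity behind LoRA merging: for an adapted linear layer, the base output plus the scaled low-rank update should equal the output of the folded weight W + (alpha/r)·B·A. The sketch below illustrates that identity on a toy layer in pure Python. It is not the author's check (which compared full-model logits), and the matrix sizes and random values are made up for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Return a + scale * b elementwise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_in, d_out, r, alpha = 6, 4, 2, 4   # toy sizes; the real adapter uses r=32, alpha=64
scale = alpha / r

W = rand_matrix(d_out, d_in, rng)    # frozen base weight
A = rand_matrix(r, d_in, rng)        # LoRA down-projection
B = rand_matrix(d_out, r, rng)       # LoRA up-projection
x = rand_matrix(d_in, 1, rng)        # one input column vector

# Unmerged path: base output plus the scaled low-rank update.
y_unmerged = add(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: fold the update into the weight once, then do a single matmul.
W_merged = add(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

max_abs_diff = max(abs(u[0] - m[0]) for u, m in zip(y_unmerged, y_merged))
print(f"max |unmerged - merged| = {max_abs_diff:.2e}")
```

On the real model the same comparison would be made on logits, where a small nonzero difference is expected from bf16 rounding.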

Intended use

Turn an image into a fixed-schema caption JSON for downstream training pipelines (it was built to fill the structured-caption field of an image-caption super-dataset). It is a narrow extraction model, not a general chat or VQA model.

Training

Two-stage curriculum (`qwen_lora_train_v2.py`)

The v2 adapter was trained via a two-stage curriculum, warm-started from the v1 LoRA (AbstractPhil/qwen3.5-0.8b-task_1-lora, which was trained on the Claude gold set alone).

Stage 1: Bulk pretraining on ~50,000 grounded rows from AbstractPhil/cc-task1-json (Qwen-generated Conceptual Captions conversions, filtered to grounded==True). The set is high-volume and roughly 99% clean, but its quality is that of a 0.8B generator. 1 epoch.

Stage 2: Refinement on ~20,505 Claude Sonnet 4.6 gold extractions from AbstractPhil/json-coco-format, config task_1. These are higher-fidelity, more robust tool-call examples produced by the ClaudeProvider (strict prompt mode, forced emit_caption_schema tool choice, filtered to grounding_rate==1.0). 2 epochs.

The hypothesis: v1 may have been quality-capped by the small ~20K Claude set. Bulk CC data broadens coverage, and the gold refinement stage re-anchors the adapter on the higher-fidelity targets. Three checkpoints exist for comparison: v1 (Claude only) → v2-stage1 (+ 50K CC) → v2 (CC then Claude refine).
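Both stages keep only grounded rows (grounded==True, grounding_rate==1.0). The card does not show how grounding is computed, so the following is a guess at what such a filter could look like: a phrase counts as grounded when all of its words occur in the caption. The function names and the word-overlap rule are assumptions, not the pipeline's actual validator.

```python
import re

def _tokens(text):
    """Lowercased word tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_rate(caption, extraction):
    """Fraction of subject names and attributes whose words all occur in the caption."""
    caption_tokens = _tokens(caption)
    phrases = []
    for subject in extraction.get("subjects", []):
        phrases.append(subject["name"])
        phrases.extend(subject.get("attributes", []))
    if not phrases:
        return 1.0  # nothing extracted, so nothing is ungrounded
    grounded = sum(1 for p in phrases if _tokens(p) <= caption_tokens)
    return grounded / len(phrases)

caption = "A long restaurant table with rattan rounded back chairs."
good = {"subjects": [{"name": "restaurant table", "attributes": ["long"]},
                     {"name": "chairs", "attributes": ["rattan", "rounded back"]}]}
bad = {"subjects": [{"name": "restaurant table", "attributes": ["wooden"]}]}

print(grounding_rate(caption, good))  # every phrase appears in the caption
print(grounding_rate(caption, bad))   # "wooden" does not appear in the caption
```

A row would pass a grounding_rate==1.0 filter only in the first case.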

Data format

Source captions are MS-COCO (Karpathy split). The teacher is Claude Sonnet 4.6, run in strict mode with forced emit_caption_schema tool choice and filtered to grounding_rate==1.0 (every extracted entity must trace back to the input caption). Each example is in the Qwen3.5-native tool-call shape:

messages[0] — system prompt (caption-structuring assistant)
messages[1] — user turn: the raw caption text
messages[2] — assistant turn with tool_calls[0].function.arguments:

```json
// Input:  "A long restaurant table with rattan rounded back chairs."
// Output:
{
  "subjects": [
    {"name": "restaurant table", "attributes": ["long"]},
    {"name": "chairs", "attributes": ["rattan", "rounded back"]}
  ],
  "actions": [],
  "setting": "indoor"
}

// Input:  "a long table with a plant on top of it surrounded with wooden chairs"
// Output:
{
  "subjects": [
    {"name": "table", "attributes": ["long"]},
    {"name": "plant", "attributes": []},
    {"name": "chairs", "attributes": ["wooden"]}
  ],
  "actions": ["plant on top of table", "table surrounded with wooden chairs"],
  "setting": "indoor"
}
```

Note: style and mood are omitted — they are const: null in the schema (strict mode forced them null in all training examples). The meta column records model, mode, schema_valid, validator_passed, and token/cost accounting per row.
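To pull the caption and its target out of one row, something like the following should work. It is a minimal sketch that assumes the common OpenAI-style tool-call layout (`tool_calls[0]["function"]["arguments"]`) and accepts the arguments either as a dict or as a JSON string, since the card does not say which is stored. `extract_target` and the hand-built row are illustrative, not taken from the dataset.

```python
import json

def extract_target(row):
    """Return (caption_text, arguments_dict) from one messages-style training row."""
    messages = row["messages"]
    caption = messages[1]["content"]
    arguments = messages[2]["tool_calls"][0]["function"]["arguments"]
    # Some exporters store arguments as a JSON string, others as a dict; accept both.
    if isinstance(arguments, str):
        arguments = json.loads(arguments)
    return caption, arguments

# Hand-built row in the assumed layout (not copied from the dataset).
row = {
    "messages": [
        {"role": "system", "content": "You structure captions."},
        {"role": "user", "content": "A long restaurant table with rattan rounded back chairs."},
        {"role": "assistant", "tool_calls": [{
            "type": "function",
            "function": {
                "name": "emit_caption_schema",
                "arguments": json.dumps({
                    "subjects": [
                        {"name": "restaurant table", "attributes": ["long"]},
                        {"name": "chairs", "attributes": ["rattan", "rounded back"]},
                    ],
                    "actions": [],
                    "setting": "indoor",
                }),
            },
        }]},
    ]
}

caption_text, target = extract_target(row)
print(caption_text)
print(target["setting"], [s["name"] for s in target["subjects"]])
```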

At inference time, Qwen3.5 generates in its native text format (<tool_call><function=emit_caption_schema><parameter=subjects>…</parameter>…</tool_call>), which is parsed into the dict above by parse_tool_call.
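The card names parse_tool_call but does not include its source. The sketch below is a hypothetical stand-in, assuming the output has exactly the `<function=...><parameter=...>…</parameter>` shape described above and that each parameter body is either JSON or a bare string. The name `parse_tool_call_sketch` and the sample string are illustrative only.

```python
import json
import re

FUNCTION_RE = re.compile(r"<function=([^>\s]+)>(.*?)(?:</function>|\Z)", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)

def parse_tool_call_sketch(text):
    """Parse one Qwen-style tool call into {"name": ..., "arguments": {...}}.

    Returns None when no <function=...> block is found.
    """
    match = FUNCTION_RE.search(text)
    if match is None:
        return None
    name, body = match.group(1), match.group(2)
    arguments = {}
    for key, raw in PARAM_RE.findall(body):
        raw = raw.strip()
        try:
            arguments[key] = json.loads(raw)   # arrays, objects, null, numbers
        except json.JSONDecodeError:
            arguments[key] = raw               # bare strings such as: indoor
    return {"name": name, "arguments": arguments}

# Hand-written sample in the assumed format (not real model output).
sample = (
    "<tool_call>\n<function=emit_caption_schema>\n"
    '<parameter=subjects>\n[{"name": "table", "attributes": ["long"]}]\n</parameter>\n'
    "<parameter=actions>\n[]\n</parameter>\n"
    "<parameter=setting>\nindoor\n</parameter>\n"
    "</function>\n</tool_call>"
)
parsed = parse_tool_call_sketch(sample)
print(json.dumps(parsed["arguments"], indent=2))
```

Falling back to the raw string when JSON decoding fails keeps enum values such as `indoor` usable without requiring the model to quote them.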

Schema reference

```markdown
subjects  [SubjectValue]   max 8 items
            ├─ name        str (1–64 chars, required)
            └─ attributes  [str] (max 8, optional)
actions   [str]            max 8 items; relational phrases, not single verbs
setting   enum             "indoor" | "outdoor" | "unknown" (default: "unknown")
style     null             const null (strict mode)
mood      null             const null (strict mode)
```
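Parsed output can be checked against these limits before it enters a dataset. The following is a dependency-free sketch of such a check, written from the schema reference above. `validate_caption` is a hypothetical helper, not part of the author's tooling, and a JSON Schema library run against the published tools definition would be the stricter option.

```python
SETTINGS = {"indoor", "outdoor", "unknown"}

def _is_str_list(value, max_items=8):
    """True when value is a list of at most max_items strings."""
    return (isinstance(value, list) and len(value) <= max_items
            and all(isinstance(item, str) for item in value))

def validate_caption(caption):
    """Return a list of schema violations; an empty list means the caption is valid."""
    errors = []
    subjects = caption.get("subjects", [])
    if not isinstance(subjects, list) or len(subjects) > 8:
        errors.append("subjects must be a list of at most 8 items")
        subjects = []
    for i, subject in enumerate(subjects):
        if not isinstance(subject, dict):
            errors.append(f"subjects[{i}] must be an object")
            continue
        name = subject.get("name")
        if not isinstance(name, str) or not 1 <= len(name) <= 64:
            errors.append(f"subjects[{i}].name must be a string of 1-64 chars")
        if not _is_str_list(subject.get("attributes", [])):
            errors.append(f"subjects[{i}].attributes must be at most 8 strings")
    if not _is_str_list(caption.get("actions", [])):
        errors.append("actions must be a list of at most 8 strings")
    setting = caption.get("setting", "unknown")
    if not isinstance(setting, str) or setting not in SETTINGS:
        errors.append("setting must be one of indoor/outdoor/unknown")
    for key in ("style", "mood"):
        if caption.get(key) is not None:
            errors.append(f"{key} must be null in strict mode")
    return errors

ok = {"subjects": [{"name": "table", "attributes": ["long"]}],
      "actions": [], "setting": "indoor"}
broken = {"subjects": [{"name": ""}], "setting": "kitchen", "mood": "cozy"}

print(validate_caption(ok))
print(validate_caption(broken))
```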

LoRA config (from `adapter_config.json`)

| parameter | value |
| --- | --- |
| rank `r` | 32 |
| `lora_alpha` | 64 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| `bias` | none |
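As a worked reading of these values: with standard LoRA scaling the low-rank update is multiplied by `lora_alpha / r`, which is 2.0 here. The sketch below restates the config as a plain dictionary and computes that factor, plus the parameter count LoRA adds to a single linear layer. It assumes rank-stabilised scaling (`use_rslora`) is off, which the card does not state, and the layer size in the example is hypothetical.

```python
lora_settings = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "bias": "none",
}

# Standard LoRA scales the low-rank update B @ A by lora_alpha / r.
# Assumption: rank-stabilised scaling (use_rslora) is off; the card does not say.
scaling = lora_settings["lora_alpha"] / lora_settings["r"]
print(f"update scaling = {scaling}")

def lora_params_per_linear(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer (A plus B)."""
    return r * (d_in + d_out)

# Hypothetical layer size for illustration only; not the model's real dimensions.
example_params = lora_params_per_linear(d_in=1024, d_out=1024, r=lora_settings["r"])
print(example_params)
```

With peft installed, the same dictionary can be passed as `LoraConfig(**lora_settings)`; fields the card does not list (such as `task_type`) would still have to be chosen.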

Training hyperparameters (from `qwen_lora_train_v2.py`)

| parameter | value |
| --- | --- |
| trainer | `transformers.Trainer` |
| optimizer | AdamW (default) |
| LR (both stages) | `1e-4` (below v1's `2e-4`, because training continues from an already-trained adapter) |
| LR schedule | cosine with 3% warmup |
| batch size | 16 |
| gradient accumulation | 1 (effective batch = 16) |
| precision | bf16 |
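These settings imply rough step counts for each stage. The arithmetic below is a back-of-the-envelope sketch: it uses the approximate row counts from the curriculum description, and it assumes a single device, no dropped last batch, and a fresh warmup-plus-cosine schedule per stage. None of that is confirmed by the card, and the exact rounding inside `transformers.Trainer` may differ by version.

```python
import math

def steps_for(rows, epochs, batch_size=16, grad_accum=1):
    """Optimizer steps for a stage, assuming one device and no dropped last batch."""
    steps_per_epoch = math.ceil(math.ceil(rows / batch_size) / grad_accum)
    return steps_per_epoch * epochs

def cosine_lr(step, total_steps, peak_lr=1e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

stage1 = steps_for(rows=50_000, epochs=1)   # ~50,000 CC rows, 1 epoch
stage2 = steps_for(rows=20_505, epochs=2)   # ~20,505 gold rows, 2 epochs
print(stage1, stage2)

# Learning rate at the start, at the end of warmup, and halfway through stage 1.
warmup1 = math.ceil(stage1 * 0.03)
for step in (0, warmup1, stage1 // 2):
    print(step, f"{cosine_lr(step, stage1):.2e}")
```

Under these assumptions stage 1 runs for about 3,125 steps and stage 2 for about 2,564, so warmup covers under a hundred steps in each stage.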

The adapter modifies only the language-model projections; the base's vision encoder is untouched. That is why, although training was text-only, the merged model also does image → JSON at inference: feed an image and the vision-conditioned generation inherits the same tool-call structuring behavior.

Important: the task scaffold is not baked into the weights

The system prompt and the tools definition the LoRA was trained against live in the dataset AbstractPhil/json-coco-format (config task_1), not in the model. For the structured output this model is tuned for, apply that same system prompt + tools at inference (shown below). Without them the model still runs, but you lose the schema grounding.

```json
[
  {
    "type": "function",
    "function": {
      "name": "emit_caption_schema",
      "description": "Emit the structured caption representation. The parameters follow the qwen-test-runner slot registry.",
      "parameters": {
        "$defs": {
          "SubjectValue": {
            "description": "A single entity in the caption.",
            "properties": {
              "name": { "maxLength": 64, "minLength": 1, "title": "Name", "type": "string" },
              "attributes": { "items": { "type": "string" }, "maxItems": 8, "title": "Attributes", "type": "array" }
            },
            "required": ["name"],
            "title": "SubjectValue",
            "type": "object"
          }
        },
        "properties": {
          "subjects": { "items": { "$ref": "#/$defs/SubjectValue" }, "maxItems": 8, "title": "Subjects", "type": "array" },
          "actions": { "items": { "type": "string" }, "maxItems": 8, "title": "Actions", "type": "array" },
          "setting": { "default": "unknown", "enum": ["indoor", "outdoor", "unknown"], "title": "Setting", "type": "string" },
          "style": { "anyOf": [{ "maxLength": 64, "type": "string" }, { "type": "null" }], "default": null, "title": "Style", "const": null },
          "mood": { "anyOf": [{ "maxLength": 64, "type": "string" }, { "type": "null" }], "default": null, "title": "Mood", "const": null }
        },
        "title": "Caption",
        "type": "object"
      }
    }
  }
]
```

Usage

```python
import json, torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor, AutoModelForImageTextToText

REPO = "AbstractPhil/Qwen3.5-0.8B-json-captioner"

processor = AutoProcessor.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO, dtype=torch.bfloat16, device_map="cuda").eval()
processor.tokenizer.padding_side = "left"
if processor.tokenizer.pad_token_id is None:
    processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id

# Task scaffold (system prompt + tools). Read the JSONL directly: the dataset card
# declares a 'Json' feature type that datasets>=4.0 rejects, so load_dataset() fails
# ("Feature type 'Json' not found"); hf_hub_download + json.loads(first line) is robust.
_p = hf_hub_download("AbstractPhil/json-coco-format", "data/task_1.jsonl", repo_type="dataset")
with open(_p, encoding="utf-8") as f:
    scaffold = json.loads(f.readline())
SYSTEM_PROMPT = scaffold["messages"][0]["content"]
TOOLS         = scaffold["tools"]

image = Image.open("example.jpg").convert("RGB")
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text",  "text": "Extract the structured representation of what this image shows."},
    ]},
]
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", enable_thinking=False).to(model.device)

out = model.generate(
    **inputs, max_new_tokens=768, do_sample=False,
    pad_token_id=processor.tokenizer.pad_token_id,
    stop_strings=["</tool_call>"], tokenizer=processor.tokenizer)

text = processor.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(text)   # -> <tool_call><function=emit_caption_schema><parameter=...>...</tool_call>
```

The continuation is a Qwen tool call; parse the <function=...><parameter=...> block into a dict to get the caption JSON. Text-only input (an image-synthesis prompt instead of an image) works too — pass the prompt as the user text and drop the image content block.
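Switching between the two input modes only changes the user turn. A small helper along these lines can build either variant. `build_messages` is a hypothetical name, and passing the text as a typed content block (rather than a plain string) is an assumption carried over from the image example.

```python
def build_messages(system_prompt, text, image=None):
    """Build chat messages for image -> JSON (image given) or text -> JSON (image omitted)."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]

# Text-only mode: the caption or synthesis prompt is the whole user turn.
text_only = build_messages(
    "placeholder system prompt",   # in practice, the prompt loaded from the dataset scaffold
    "a long table with a plant on top of it surrounded with wooden chairs",
)
print([block["type"] for block in text_only[1]["content"]])
```

The resulting list would be passed to `processor.apply_chat_template` in place of `messages`, with the same `tools` argument.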

Notes

Precision: bfloat16 is recommended (the merge was done in bf16).
Attention backend: sdpa is correct on Blackwell (sm_120) and Turing (sm_75), where flash-attn kernels don't run. On Ampere/Ada/Hopper (sm_80/86/89/90) you can pass attn_implementation="flash_attention_2" if flash-attn is installed, for a faster prefill.
Decoding: deterministic (do_sample=False) with stop_strings=["</tool_call>"] to halt once the tool call closes.
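The backend rule in the attention note can be encoded as a small lookup. This is a sketch of that rule only, taking the compute-capability list from the note at face value. `pick_attn_implementation` is a hypothetical helper, and it checks for the `flash_attn` package by import name.

```python
import importlib.util

# Compute capabilities the note lists as flash-attn capable (Ampere/Ada/Hopper).
FLASH_CAPABLE = {(8, 0), (8, 6), (8, 9), (9, 0)}

def pick_attn_implementation(major, minor, flash_attn_installed=None):
    """Choose attn_implementation from a CUDA compute capability (major, minor)."""
    if flash_attn_installed is None:
        # Detect the package without importing it.
        flash_attn_installed = importlib.util.find_spec("flash_attn") is not None
    if (major, minor) in FLASH_CAPABLE and flash_attn_installed:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation(12, 0, flash_attn_installed=True))   # Blackwell
print(pick_attn_implementation(8, 9, flash_attn_installed=True))    # Ada
print(pick_attn_implementation(9, 0, flash_attn_installed=False))   # Hopper, no flash-attn
```

On a real machine the capability can be read with `torch.cuda.get_device_capability()`, and the result passed to `from_pretrained` as `attn_implementation=...`.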

Provenance

Produced by merging the LoRA into the base via merge_and_unload(safe_merge=True), then save_pretrained (weights + config) and processor.save_pretrained (image processor + tokenizer + chat template). Qwen/Qwen3.5-0.8B is a standard transformers architecture, so the repo is self-contained — no custom remote code.

License

Model weights: Apache-2.0, inherited from Qwen/Qwen3.5-0.8B.
Training data: AbstractPhil/json-coco-format is CC-BY-4.0. Source captions are MS-COCO (Karpathy split).

Limitations

Small (0.8B): extraction quality is bounded by the task_1 LoRA's training; it is not a general-purpose captioner or chat model.
Image → JSON is a transfer capability. The adapter was trained on text caption → JSON, so image grounding rides on the base VLM's vision encoder plus the LoRA's structuring behavior — it was not directly trained on image inputs. Expect text → JSON to be its strongest mode.
The output schema is fixed by the emit_caption_schema tool — subjects (structured {name, attributes} objects), actions, setting (3-way enum), with style/mood always null. Anything outside that schema is out of scope.
Tuned toward grounded, literal extraction; it is not designed for creative or interpretive captions.


