art-dsit/qwen3.5-4b-no-robots-lora

License: apache-2.0

Training

Base: Qwen/Qwen3.5-4B-Base loaded as Qwen3_5ForCausalLM (text-only).
Dataset: HuggingFaceH4/no_robots (~9.5k instruction examples).
LoRA: r=16, alpha=32, dropout=0.05, targeting q_proj / k_proj / v_proj / o_proj. The model has 32 transformer layers but only 8 are full-attention layers; the other 24 are linear-attention layers and are not adapted by this LoRA.
Schedule: 1 epoch, batch size 2 × grad accum 8 (effective batch size 16 per device), learning rate 2e-4 with cosine decay, bf16, gradient checkpointing.
Format: plain ChatML (<|im_start|>{role}\n{content}<|im_end|>), no <think> blocks.
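As a quick sanity check on the settings listed above, the following sketch works out the quantities they imply: the LoRA scaling factor, the number of adapted projection modules, the effective batch size, and the approximate number of optimizer steps. It is not the authors' training script. It assumes a single device, roughly 9,500 training examples, and a cosine schedule with no warmup (warmup is not stated above). The dictionary keys mirror the keyword names of `peft.LoraConfig`; the variable and function names are illustrative.

```python
import math

# LoRA settings from the list above; key names follow peft.LoraConfig keywords.
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

# Only the 8 full-attention layers carry these projections (per the LoRA line).
full_attention_layers = 8
adapted_modules = full_attention_layers * len(lora_kwargs["target_modules"])

# Schedule arithmetic. ASSUMPTIONS: one device, ~9,500 training examples.
per_device_batch = 2
grad_accum = 8
num_examples = 9_500
effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(num_examples / effective_batch)


def cosine_lr(step, total_steps, peak_lr=2e-4):
    """Cosine decay from peak_lr down to 0. ASSUMES no warmup phase."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print(lora_scale, adapted_modules, effective_batch, steps_per_epoch)
print(cosine_lr(steps_per_epoch // 2, steps_per_epoch))  # roughly half the peak
```

Under these assumptions the single epoch amounts to a little under 600 optimizer steps, which is a small run; the exact count depends on how the trainer rounds the last partial batch.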

IFEval results

Evaluated via inspect_evals/ifeval on all 541 samples with a plain ChatML chat template (greedy decoding, max_new_tokens=512):

Metric	Base	This LoRA	Δ
`prompt_strict_acc`	0.390	0.440	+5.0 pp
`prompt_loose_acc`	0.403	0.464	+6.1 pp
`inst_strict_acc`	0.513	0.561	+4.8 pp

Standard error ≈ 0.02 (about 2 pp) for each score.
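The quoted standard error is consistent with a simple binomial approximation for the prompt-level scores, sketched below. This is an assumption about where the figure comes from, not necessarily the exact estimator that inspect_evals uses, and the function name is illustrative. The instruction-level score is averaged over individual instructions rather than the 541 prompts, so it is left out here.

```python
import math


def binomial_stderr(accuracy, n_samples):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


N_SAMPLES = 541  # IFEval prompts evaluated

# Prompt-level scores from the table above: (base, LoRA).
scores = {
    "prompt_strict_acc": (0.390, 0.440),
    "prompt_loose_acc": (0.403, 0.464),
}

for metric, (base, lora) in scores.items():
    delta_pp = (lora - base) * 100
    print(
        f"{metric}: base ±{binomial_stderr(base, N_SAMPLES):.3f}, "
        f"LoRA ±{binomial_stderr(lora, N_SAMPLES):.3f}, Δ {delta_pp:+.1f} pp"
    )
```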

Usage

Install

bash
pip install -U transformers peft accelerate huggingface_hub torch
# Optional but ~5x faster on Qwen3.5's hybrid linear+full attention:
pip install flash-linear-attention causal-conv1d

Requires transformers >= 4.57 (for the Qwen3_5 model code).

Authenticate (this repo is gated)

bash
hf auth login

(In older huggingface_hub versions the CLI commands are huggingface-cli login and huggingface-cli download, with the same arguments.)

Run

A self-contained example is in example.py in this repo. Either download and run it:

bash
hf download art-dsit/qwen3.5-4b-no-robots-lora example.py --local-dir .
python example.py

or inspect it on the Files tab for the full code.

The example covers loading the base + adapter, the ChatML prompt format, multi-turn history, and decoding with stop-token trimming.
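For readers who want to see the shape of the loading and decoding steps without downloading the file, a minimal sketch follows. It is not the contents of example.py. The function names are hypothetical, `AutoModelForCausalLM` is assumed to resolve to the text-only Qwen3_5ForCausalLM class, and bf16 with `device_map="auto"` is an assumed default. Greedy decoding and the 512-token limit mirror the evaluation settings above.

```python
BASE_ID = "Qwen/Qwen3.5-4B-Base"
ADAPTER_ID = "art-dsit/qwen3.5-4b-no-robots-lora"


def load_base_with_adapter(base_id=BASE_ID, adapter_id=ADAPTER_ID):
    """Load the base model, attach the LoRA adapter, return (model, tokenizer)."""
    # Imported lazily so the heavy dependencies are only needed when called.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # The tokenizer comes from the adapter repo, not the base model.
    tokenizer = AutoTokenizer.from_pretrained(adapter_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return model, tokenizer


def generate_reply(model, tokenizer, messages, max_new_tokens=512):
    """Greedy-decode one assistant turn for a list of chat messages."""
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

A call such as `generate_reply(*load_base_with_adapter(), [{"role": "user", "content": "Hello"}])` would download both repos on first use, so the gated-repo login above has to be done first.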

The tokenizer that ships with this adapter has a plain-ChatML chat_template (no <think> blocks), so tokenizer.apply_chat_template(messages, add_generation_prompt=True) produces exactly the format this adapter was trained on. The literal string format in example.py is equivalent — use whichever you prefer.
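To make the format concrete, here is a small sketch of what a literal-string rendering might look like, together with trimming at the stop token. It is illustrative rather than a copy of example.py: the helper names are made up, and the newline placed between turns is an assumption based on the usual ChatML layout.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=True):
    """Render [{'role': ..., 'content': ...}, ...] as plain ChatML (no <think> blocks)."""
    text = "".join(
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages
    )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        text += f"{IM_START}assistant\n"
    return text


def trim_at_stop(generated, stop=IM_END):
    """Cut decoded text at the first stop token, if one is present."""
    return generated.split(stop, 1)[0].strip()


# Multi-turn history: earlier turns are simply rendered in order.
history = [
    {"role": "user", "content": "Name a prime number."},
    {"role": "assistant", "content": "7"},
    {"role": "user", "content": "And another?"},
]
prompt = render_chatml(history)
print(prompt)
print(trim_at_stop("11<|im_end|>\n<|im_start|>user"))  # prints "11"
```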

Merge into base weights (optional)

If you want a standalone ~8 GB model rather than base + adapter, use merge.py in this repo:

bash
hf download art-dsit/qwen3.5-4b-no-robots-lora merge.py --local-dir .
python merge.py                            # writes ./qwen3.5-4b-no-robots-merged
python merge.py --output-dir my-merged     # custom path
python merge.py --dtype float16            # float16 instead of bf16 (same size on disk)

The result is a Qwen3_5ForCausalLM checkpoint that any HF loader can consume directly without needing PEFT at inference time. The merged directory keeps this adapter's ChatML chat_template.
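The merge step itself is short. The sketch below shows one way a script with these flags could be written using PEFT's `merge_and_unload`; it is an illustration under stated assumptions, not the merge.py shipped in the repo. The function names, the bfloat16 default, and the set of accepted dtypes are hypothetical.

```python
import argparse

BASE_ID = "Qwen/Qwen3.5-4B-Base"
ADAPTER_ID = "art-dsit/qwen3.5-4b-no-robots-lora"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Merge the LoRA into the base weights.")
    parser.add_argument("--output-dir", default="./qwen3.5-4b-no-robots-merged")
    # ASSUMPTION: bfloat16 default, matching the bf16 training precision.
    parser.add_argument(
        "--dtype", default="bfloat16", choices=["bfloat16", "float16", "float32"]
    )
    return parser.parse_args(argv)


def merge(output_dir, dtype):
    """Fold the adapter into the base model and save a standalone checkpoint."""
    # Heavy imports are deferred so argument parsing works without them.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        BASE_ID, torch_dtype=getattr(torch, dtype)
    )
    # Add the low-rank updates to the base weights and drop the PEFT wrappers.
    merged = PeftModel.from_pretrained(base, ADAPTER_ID).merge_and_unload()
    merged.save_pretrained(output_dir)
    # Save the adapter's tokenizer so its ChatML chat_template travels along.
    AutoTokenizer.from_pretrained(ADAPTER_ID).save_pretrained(output_dir)
```

Running `args = parse_args()` followed by `merge(args.output_dir, args.dtype)` would perform the merge; the base weights need to fit in memory at the chosen dtype.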

Serve with vLLM

Merge first, then point vLLM at the merged directory:

bash
python merge.py --output-dir qwen3.5-4b-no-robots-merged
vllm serve ./qwen3.5-4b-no-robots-merged --served-model-name qwen3.5-4b-no-robots

Then call the OpenAI-compatible chat API — vLLM applies this repo's ChatML template automatically:

bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-4b-no-robots",
    "messages": [{"role": "user", "content": "Write a haiku about debugging."}]
  }'
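The same request can be made from Python. The sketch below uses only the standard library and assumes the server started above is listening on localhost:8000; the helper names are illustrative.

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt, model="qwen3.5-4b-no-robots", url=URL):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the vLLM server to be running):
# print(chat("Write a haiku about debugging."))
```

The `model` field has to match the name given to `--served-model-name`, which is why it is exposed as a parameter.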

If instead you serve base + LoRA via vLLM's --enable-lora, vLLM will use the base tokenizer's chat template (which injects <think> blocks and isn't what this adapter was trained on). In that case download chat_template.jinja from this repo and pass it explicitly:

bash
hf download art-dsit/qwen3.5-4b-no-robots-lora chat_template.jinja --local-dir .
vllm serve Qwen/Qwen3.5-4B-Base \
  --enable-lora \
  --lora-modules no-robots=art-dsit/qwen3.5-4b-no-robots-lora \
  --chat-template ./chat_template.jinja
