art-dsit

qwen3.5-4b-no-robots-lora


README

License: apache-2.0

Training

  • Base: Qwen/Qwen3.5-4B-Base loaded as Qwen3_5ForCausalLM (text-only).
  • Dataset: HuggingFaceH4/no_robots (~9.5k instruction examples).
  • LoRA: r=16, alpha=32, dropout=0.05, targeting q_proj / k_proj / v_proj / o_proj. The model has 32 transformer layers but only 8 are full-attention layers; the other 24 are linear-attention layers and are not adapted by this LoRA.
  • Schedule: 1 epoch, batch size 2 × gradient accumulation 8 (16 sequences per optimizer step per device), learning rate 2e-4 with a cosine schedule, bf16, gradient checkpointing.
  • Format: plain ChatML (<|im_start|>{role}\n{content}<|im_end|>), no <think> blocks.
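
The hyperparameters above imply a few derived quantities that are easy to check by hand. The sketch below is illustrative only, not the training script: it assumes 9,500 training examples (the list only says ~9.5k), a single device, no warmup, and the standard LoRA scaling of alpha / r. The names NUM_EXAMPLES and cosine_lr are hypothetical.

```python
import math

# Values taken from the Training list.
LORA_R = 16
LORA_ALPHA = 32
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
PEAK_LR = 2e-4

# Assumptions not stated in the README: exact dataset size,
# one device, and no warmup phase.
NUM_EXAMPLES = 9_500

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = LORA_ALPHA / LORA_R

# Sequences contributing to each optimizer step on one device.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM

# Optimizer steps in one epoch (the last step may be a partial batch).
total_steps = math.ceil(NUM_EXAMPLES / effective_batch)


def cosine_lr(step: int, total: int = total_steps, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` at step 0 down to 0 at step `total`."""
    progress = min(max(step / total, 0.0), 1.0)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"LoRA scale: {lora_scale}")
    print(f"effective batch: {effective_batch}, steps per epoch: {total_steps}")
    for step in (0, total_steps // 2, total_steps):
        print(f"step {step}: lr = {cosine_lr(step):.2e}")
```

Under these assumptions the run is short (a few hundred optimizer steps), and the learning rate falls to half its peak at the midpoint of the epoch.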

IFEval results

Evaluated via inspect_evals/ifeval on all 541 samples with a plain ChatML chat template (greedy decoding, max_new_tokens=512):

Metric              Base    This LoRA   Δ
prompt_strict_acc   0.390   0.440       +5.0 pp
prompt_loose_acc    0.403   0.464       +6.1 pp
inst_strict_acc     0.513   0.561       +4.8 pp
inst_loose_acc      0.528   0.583       +5.5 pp
final_acc           0.458   0.512       +5.4 pp

Standard error ≈ 0.02.
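
As a quick consistency check on the reported numbers, the sketch below recomputes the deltas in percentage points and the average of the four accuracy metrics. Treating final_acc as the unweighted mean of the four is an inference from the figures (it reproduces 0.458 and 0.512 to within rounding), not something the README states; the helper names are hypothetical.

```python
# Scores copied from the IFEval results table.
base = {
    "prompt_strict_acc": 0.390,
    "prompt_loose_acc": 0.403,
    "inst_strict_acc": 0.513,
    "inst_loose_acc": 0.528,
}
lora = {
    "prompt_strict_acc": 0.440,
    "prompt_loose_acc": 0.464,
    "inst_strict_acc": 0.561,
    "inst_loose_acc": 0.583,
}


def deltas_pp(before: dict, after: dict) -> dict:
    """Per-metric improvement in percentage points, rounded to one decimal."""
    return {name: round((after[name] - before[name]) * 100, 1) for name in before}


def mean_acc(scores: dict) -> float:
    """Unweighted mean of the metrics (assumed definition of final_acc)."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    for name, delta in deltas_pp(base, lora).items():
        print(f"{name}: +{delta} pp")
    print(f"mean base = {mean_acc(base):.4f}, mean LoRA = {mean_acc(lora):.4f}")
```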

Usage

Install

bash

pip install -U transformers peft accelerate huggingface_hub torch
# Optional but ~5x faster on Qwen3.5's hybrid linear+full attention:
pip install flash-linear-attention causal-conv1d

Requires transformers >= 4.57 (for the Qwen3_5 model code).

Authenticate (this repo is gated)

bash

hf auth login

(In older huggingface_hub versions the CLI is huggingface-cli login / huggingface-cli download — same arguments.)

Run

A self-contained example is in example.py in this repo. Either download and run it:

bash

hf download art-dsit/qwen3.5-4b-no-robots-lora example.py --local-dir .
python example.py

or inspect it on the Files tab for the full code.

The example covers loading the base + adapter, the ChatML prompt format, multi-turn history, and decoding with stop-token trimming.

The tokenizer that ships with this adapter has a plain-ChatML chat_template (no <think> blocks), so tokenizer.apply_chat_template(messages, add_generation_prompt=True) produces exactly the format this adapter was trained on. The literal string format in example.py is equivalent — use whichever you prefer.
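
For readers who want to see the literal format without opening example.py, here is a minimal pure-Python sketch of plain-ChatML prompt building and stop-token trimming. It is not a copy of example.py: the newline after each <|im_end|> and the trailing assistant header follow the usual ChatML convention and are assumptions here, and render_chatml and trim_at_stop are hypothetical helper names.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render a list of {'role', 'content'} dicts as plain ChatML text.

    Each turn becomes <|im_start|>{role}\\n{content}<|im_end|> followed by a
    newline (assumed separator). No <think> block is ever inserted.
    """
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def trim_at_stop(generated: str, stop: str = IM_END) -> str:
    """Keep only the text before the first stop token."""
    return generated.split(stop, 1)[0].strip()


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "7"},
        {"role": "user", "content": "Name another one."},
    ]
    print(render_chatml(history))
    print(trim_at_stop("11<|im_end|>\n<|im_start|>user"))
```

Multi-turn history is just more entries in the messages list; the model's reply is whatever precedes the first <|im_end|> in the decoded continuation.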

Merge into base weights (optional)

If you want a standalone ~8 GB model rather than base + adapter, use merge.py in this repo:

bash

hf download art-dsit/qwen3.5-4b-no-robots-lora merge.py --local-dir .
python merge.py # writes ./qwen3.5-4b-no-robots-merged
python merge.py --output-dir my-merged # custom path
python merge.py --dtype float16 # float16 weights (same on-disk size as bf16)

The result is a Qwen3_5ForCausalLM checkpoint that any HF loader can consume directly without needing PEFT at inference time. The merged directory keeps this adapter's ChatML chat_template.
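
The ~8 GB figure follows from simple arithmetic on parameter count and bytes per weight. The sketch below assumes a nominal 4 billion parameters (the "4B" in the model name, not an exact count) and ignores tokenizer and config files; the dtype keys are illustrative labels, not a statement about which values merge.py accepts.

```python
# Storage cost per weight for common checkpoint dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: nominal parameter count implied by the "4B" model name.
NOMINAL_PARAMS = 4_000_000_000


def checkpoint_size_gb(num_params: int, dtype: str) -> float:
    """Approximate weight size on disk in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{checkpoint_size_gb(NOMINAL_PARAMS, dtype):.0f} GB")
```

Both 16-bit formats come out the same size; only a 32-bit checkpoint would be larger.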

Serve with vLLM

Merge first, then point vLLM at the merged directory:

bash

python merge.py --output-dir qwen3.5-4b-no-robots-merged
vllm serve ./qwen3.5-4b-no-robots-merged --served-model-name qwen3.5-4b-no-robots

Then call the OpenAI-compatible chat API — vLLM applies this repo's ChatML template automatically:

bash

curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-4b-no-robots",
"messages": [{"role": "user", "content": "Write a haiku about debugging."}]
}'
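
The same request can be sent from Python using only the standard library. This is a sketch that assumes the vLLM server from the command above is listening on localhost:8000 and returns the usual OpenAI-style response shape; build_chat_request and chat are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "qwen3.5-4b-no-robots",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the server running:
#   print(chat("Write a haiku about debugging."))
```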

If instead you serve base + LoRA via vLLM's --enable-lora, vLLM will use the base tokenizer's chat template (which injects <think> blocks and isn't what this adapter was trained on). In that case download chat_template.jinja from this repo and pass it explicitly:

bash

hf download art-dsit/qwen3.5-4b-no-robots-lora chat_template.jinja --local-dir .
vllm serve Qwen/Qwen3.5-4B-Base \
--enable-lora \
--lora-modules no-robots=art-dsit/qwen3.5-4b-no-robots-lora \
--chat-template ./chat_template.jinja

