art-dsit
qwen3.5-4b-no-robots-lora
License: apache-2.0
Training
- Base: Qwen/Qwen3.5-4B-Base loaded as Qwen3_5ForCausalLM (text-only).
- Dataset: HuggingFaceH4/no_robots (~9.5k instruction examples).
- LoRA: r=16, alpha=32, dropout=0.05, targeting q_proj / k_proj / v_proj / o_proj. The model has 32 transformer layers but only 8 are full-attention layers; the other 24 are linear-attention layers and are not adapted by this LoRA.
- Schedule: 1 epoch, batch size 2 × grad accum 8, lr 2e-4 cosine, bf16, gradient checkpointing.
- Format: plain ChatML (<|im_start|>{role}\n{content}<|im_end|>), no <think> blocks.
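The schedule and layer counts above imply a few derived numbers that are easy to check by hand. The sketch below assumes "~9.5k" means 9,500 training examples, a single device, and that a final partial batch still counts as one optimizer step; the variable names are illustrative and not taken from the training script.

```python
import math

# Values stated in the training summary.
per_device_batch_size = 2
grad_accum_steps = 8
num_epochs = 1
total_layers = 32
full_attention_layers = 8  # the only layers carrying q/k/v/o projections adapted here

# Assumption: "~9.5k" is read as exactly 9,500 examples, all used for training.
num_examples = 9_500

# One optimizer step consumes batch_size * grad_accum examples.
effective_batch_size = per_device_batch_size * grad_accum_steps

# Assumption: the last partial batch still triggers a step, hence the ceiling.
optimizer_steps = math.ceil(num_examples / effective_batch_size) * num_epochs

# Share of transformer layers that receive LoRA updates.
adapted_fraction = full_attention_layers / total_layers

print(effective_batch_size)  # 16
print(optimizer_steps)       # 594
print(adapted_fraction)      # 0.25
```

Under these assumptions the adapter sees roughly 600 optimizer steps, and only a quarter of the layers are adapted.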
IFEval results
Evaluated via inspect_evals/ifeval on all 541 samples with a plain ChatML chat template (greedy decoding, max_new_tokens=512):
| Metric | Base | This LoRA | Δ |
|---|---|---|---|
| prompt_strict_acc | 0.390 | 0.440 | +5.0 pp |
| prompt_loose_acc | 0.403 | 0.464 | +6.1 pp |
| inst_strict_acc | 0.513 | 0.561 | +4.8 pp |
| inst_loose_acc | 0.528 | 0.583 | +5.5 pp |
| final_acc | 0.458 | 0.512 | +5.4 pp |
Stderr ≈ 0.02.
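The table's deltas and the quoted standard error can be re-derived from the reported accuracies. The sketch below is a sanity check, not the evaluation code: it assumes a simple binomial standard error over the 541 prompts, which applies to the prompt-level metrics only (instruction-level metrics are computed over a larger number of individual instructions). Function names are illustrative.

```python
import math

N_PROMPTS = 541  # IFEval samples evaluated

# (base, lora) accuracies copied from the table.
scores = {
    "prompt_strict_acc": (0.390, 0.440),
    "prompt_loose_acc": (0.403, 0.464),
    "inst_strict_acc": (0.513, 0.561),
    "inst_loose_acc": (0.528, 0.583),
}

def delta_pp(base, lora):
    """Improvement in percentage points, rounded to one decimal."""
    return round((lora - base) * 100, 1)

def binomial_stderr(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

deltas = {name: delta_pp(base, lora) for name, (base, lora) in scores.items()}

# The final_acc row agrees with the mean of the four metrics, to within rounding.
lora_mean = sum(lora for _, lora in scores.values()) / len(scores)

# Assumed binomial model for the prompt-level strict accuracy of the LoRA.
prompt_strict_se = binomial_stderr(0.440, N_PROMPTS)

for name, value in deltas.items():
    print(f"{name}: +{value} pp")
print(f"mean of four (LoRA): {lora_mean:.3f}")
print(f"binomial stderr at p=0.440: {prompt_strict_se:.3f}")
```

With a per-model standard error near 0.02, each gain of about 5 pp is in the range of two to three standard errors of a single estimate, so the individual rows are best read as suggestive rather than conclusive.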
Usage
Install
```bash
pip install -U transformers peft accelerate huggingface_hub torch
# Optional but ~5x faster on Qwen3.5's hybrid linear+full attention:
pip install flash-linear-attention causal-conv1d
```
Requires transformers >= 4.57 (for the Qwen3_5 model code).
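A quick way to confirm the installed version meets this requirement before loading the model is sketched below. It uses only the standard library; the helper names are hypothetical, and the parser deliberately looks only at the leading major.minor digits.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_TRANSFORMERS = (4, 57)  # minimum stated above

def parse_major_minor(version_string):
    """Extract (major, minor) from strings such as '4.57.1' or '5.0.0.dev0'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))

def transformers_is_new_enough():
    """Return True if transformers is installed and at least MIN_TRANSFORMERS."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return False
    return parse_major_minor(installed) >= MIN_TRANSFORMERS

if not transformers_is_new_enough():
    print("transformers is missing or older than 4.57; run: pip install -U transformers")
```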
Authenticate (this repo is gated)
```bash
hf auth login
```
(In older huggingface_hub versions the CLI is huggingface-cli login / huggingface-cli download — same arguments.)
Run
A self-contained example is in example.py in this repo. Either download and run it:
```bash
hf download art-dsit/qwen3.5-4b-no-robots-lora example.py --local-dir .
python example.py
```
or inspect it on the Files tab for the full code.
The example covers loading the base + adapter, the ChatML prompt format, multi-turn history, and decoding with stop-token trimming.
The tokenizer that ships with this adapter has a plain-ChatML chat_template (no <think> blocks), so tokenizer.apply_chat_template(messages, add_generation_prompt=True) produces exactly the format this adapter was trained on. The literal string format in example.py is equivalent — use whichever you prefer.
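For readers who want to see the literal prompt format without opening example.py, here is a minimal sketch of a plain-ChatML formatter and a stop-token trimmer. It is not the code from example.py: the helper names are hypothetical, and the newline placed between turns is an assumption based on common ChatML usage. The chat_template.jinja in this repo remains the authoritative definition.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"

def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts as plain ChatML (no <think> blocks)."""
    # Assumption: each turn is terminated by IM_END followed by a newline.
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

def trim_at_stop(generated_text, stop=IM_END):
    """Keep only the text before the first stop token."""
    return generated_text.split(stop, 1)[0].strip()

history = [
    {"role": "user", "content": "Name a prime number."},
    {"role": "assistant", "content": "7"},
    {"role": "user", "content": "Name another."},
]
print(format_chatml(history))
```

Multi-turn history is handled by appending each completed assistant reply to the message list before formatting the next prompt.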
Merge into base weights (optional)
If you want a standalone ~8 GB model rather than base + adapter, use merge.py in this repo:
```bash
hf download art-dsit/qwen3.5-4b-no-robots-lora merge.py --local-dir .
python merge.py                          # writes ./qwen3.5-4b-no-robots-merged
python merge.py --output-dir my-merged   # custom path
python merge.py --dtype float16          # float16 weights (same 2 bytes per parameter as bf16)
```
The result is a Qwen3_5ForCausalLM checkpoint that any HF loader can consume directly without needing PEFT at inference time. The merged directory keeps this adapter's ChatML chat_template.
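The ~8 GB figure is consistent with a simple parameter-count estimate. The sketch below assumes "4B" means roughly 4 billion parameters and ignores tokenizer files and any metadata overhead; the function name is illustrative.

```python
# Storage cost per parameter for common checkpoint dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def checkpoint_size_gb(num_params, dtype):
    """Approximate weight size in GB (10^9 bytes) for a given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Assumption: "4B" is read as 4e9 parameters; the exact count may differ.
approx_params = 4_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{checkpoint_size_gb(approx_params, dtype):.0f} GB")
```

Note that bfloat16 and float16 both use 2 bytes per parameter, so the choice between them affects numerical range and precision rather than disk footprint.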
Serve with vLLM
Merge first, then point vLLM at the merged directory:
```bash
python merge.py --output-dir qwen3.5-4b-no-robots-merged
vllm serve ./qwen3.5-4b-no-robots-merged --served-model-name qwen3.5-4b-no-robots
```
Then call the OpenAI-compatible chat API — vLLM applies this repo's ChatML template automatically:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-4b-no-robots",
    "messages": [{"role": "user", "content": "Write a haiku about debugging."}]
  }'
```
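The same request can be made from Python with only the standard library. This is a sketch that assumes the server from the previous step is listening on localhost:8000 under the served model name qwen3.5-4b-no-robots; the helper names are hypothetical.

```python
import json
import urllib.request

def build_chat_request(prompt, model="qwen3.5-4b-no-robots",
                       base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt):
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Requires the vLLM server to be running:
# print(chat("Write a haiku about debugging."))
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.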
If instead you serve base + LoRA via vLLM's --enable-lora, vLLM will use the base tokenizer's chat template (which injects <think> blocks and isn't what this adapter was trained on). In that case download chat_template.jinja from this repo and pass it explicitly:
```bash
hf download art-dsit/qwen3.5-4b-no-robots-lora chat_template.jinja --local-dir .
vllm serve Qwen/Qwen3.5-4B-Base \
  --enable-lora \
  --lora-modules no-robots=art-dsit/qwen3.5-4b-no-robots-lora \
  --chat-template ./chat_template.jinja
```