armand0e

Qwen3.5-9B-Fable-5-SDFT


What SDFT Does

SDFT uses one model in two roles:

  • Student: the trainable model, prompted only with the conversation so far.
  • Teacher: the same base model with training adapters disabled, prompted with the conversation plus an in-context expert reference response.

The student samples its own response first. The teacher then scores that same sampled response token by token, but from the stronger prompt that includes the expert demonstration. Training minimizes the reverse KL divergence from the student distribution to the demonstration-conditioned teacher distribution, evaluated on the student's own sampled tokens.

text

                       expert response c
                               |
                               v
conversation x ----> teacher prompt: x + c ----> frozen base model
      |                                                  |
      |                                                  v
      +---------> student prompt: x ----------> teacher logits over y
                          |
                          v
                  trainable student
                          |
                          v
                  sampled response y
                          |
                          v
    reverse KL(student logits || teacher logits)

In one update:

text

1. Sample y from the current student:
       y ~ pi_theta(. | conversation)
2. Score each sampled token with two distributions:
       student: pi_theta(. | conversation, y_<t)
       teacher: pi_0(. | conversation, expert_reference, y_<t)
3. Train the student toward the teacher on the sampled trajectory:
       loss = KL(pi_theta || pi_0) over the rollout tokens
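
As a rough illustration of step 3, the sketch below computes a per-position reverse KL from two sets of logits and averages it over the rollout. It uses plain Python lists for readability; a real training loop would use tensors. The function names, the toy numbers, and the mean reduction over positions are assumptions for illustration, not details taken from the training script.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position: sum_v p_s(v) * (log p_s(v) - log p_t(v))."""
    log_p_s = log_softmax(student_logits)
    log_p_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_p_s, log_p_t))

def rollout_loss(student_logits_per_token, teacher_logits_per_token):
    """Mean reverse KL over the sampled rollout positions (mean reduction is an assumption)."""
    kls = [
        reverse_kl(s, t)
        for s, t in zip(student_logits_per_token, teacher_logits_per_token)
    ]
    return sum(kls) / len(kls)

# Toy 3-token vocabulary and 2 rollout positions; the numbers are made up.
student = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
teacher = [[1.5, 1.0, -1.0], [0.0, 0.0, 0.0]]
loss = rollout_loss(student, teacher)
```

Because the expectation is taken under the student distribution, the loss is zero only where the student already matches the teacher, and it penalizes the student for placing mass where the teacher places little.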

SDFT vs. SFT

Supervised fine-tuning (SFT) trains on fixed expert-written tokens. That is off-policy: the gradient is computed on a sequence the current model may not have produced itself.

text

SFT:
  conversation x + expert tokens y*
                 |
                 v
  cross entropy: -log pi_theta(y* | x)
                 |
                 v
  off-policy learning on fixed demonstrations

SDFT trains on the model's own sampled tokens. That is on-policy: the update is attached to the current model's actual trajectory, while the teacher prompt uses the expert demonstration to shape the target distribution.

text

SDFT:
  conversation x ---> current model samples y
        |                       |
        |                       v
        +---- expert c ---> teacher scores y
                                |
                                v
        on-policy distillation on the student's own rollout

This run uses lambda_on_policy = 1.0, so all training examples are on-policy. There is no plain next-token cross-entropy SFT objective in this run.

Model Details

  • Base model: unsloth/Qwen3.5-9B
  • Final artifact: merged bf16 model, not a standalone PEFT adapter
  • Task shape: long-context assistant responses for coding-agent and tool-use traces
  • Training method: Self-Distillation Fine-Tuning with reverse KL
  • Context target: 65,536 tokens
  • Prompt cap: 57,344 tokens
  • Rollout cap: 8,192 new tokens
  • Training data: 2,693 filtered SDFT examples derived from armand0e/claude-fable-5-claude-code
  • Reasoning traces: private/internal reasoning fields are not included in the teacher reference
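
The three token limits above are consistent with each other: the prompt cap plus the rollout cap equals the context target. The sketch below checks that arithmetic and shows one way a data filter could test whether an example fits. `fits_budget` is a hypothetical helper; the card does not say whether over-length examples are truncated or dropped.

```python
CONTEXT_TARGET = 65_536  # context target from the list above
PROMPT_CAP = 57_344      # prompt cap from the list above
ROLLOUT_CAP = 8_192      # rollout cap from the list above

def fits_budget(prompt_tokens: int, new_tokens: int) -> bool:
    """True if a prompt and its rollout respect all three limits (hypothetical helper)."""
    return (
        prompt_tokens <= PROMPT_CAP
        and new_tokens <= ROLLOUT_CAP
        and prompt_tokens + new_tokens <= CONTEXT_TARGET
    )

# The two caps add up exactly to the context target.
assert PROMPT_CAP + ROLLOUT_CAP == CONTEXT_TARGET
```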

Training Data

The examples are per-assistant-turn records from agentic coding traces. Each record contains:

  • the conversation context before an assistant turn
  • the matching expert assistant turn
  • optional tool schemas used to render tool calls through the chat template

During SDFT, the expert turn is injected into the teacher prompt inside an <expert_reference> block. The student does not see that block when it samples its response.
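
A minimal sketch of how the two prompts could be built from one record follows. The record keys (`context`, `expert_turn`), the choice to carry the reference in an extra `user` message, and the function name are all assumptions for illustration; only the `<expert_reference>` tag name comes from this card, and the real template wording is not shown here.

```python
def build_prompts(record):
    """Return (student_messages, teacher_messages) for one per-turn record.

    `record` is a hypothetical dict with:
      "context":     list of chat messages before the assistant turn
      "expert_turn": the matching expert assistant turn as a string
    """
    # The student sees only the conversation so far.
    student_messages = list(record["context"])

    # The teacher sees the same conversation plus the expert reference block.
    reference = (
        "<expert_reference>\n"
        f"{record['expert_turn']}\n"
        "</expert_reference>"
    )
    teacher_messages = list(record["context"]) + [
        {"role": "user", "content": reference}  # placement is an assumption
    ]
    return student_messages, teacher_messages

record = {
    "context": [{"role": "user", "content": "Fix the failing test in utils.py."}],
    "expert_turn": "I will read utils.py first, then run the test suite.",
}
student_messages, teacher_messages = build_prompts(record)
```

Both message lists would then go through the same chat template, so the only difference between the two prompts is the reference block.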

Training Procedure

The Colab training profile used:

| Setting | Value |
| --- | --- |
| Base checkpoint | unsloth/Qwen3.5-9B |
| Max sequence length | 65536 |
| Max teacher prompt tokens | 57344 |
| Max rollout tokens | 8192 |
| Optimizer steps | 600 |
| Batch size | 1 |
| Learning rate | 1.0e-5 |
| Warmup steps | 20 |
| Weight decay | 0.0 |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.0 |
| Distillation loss | reverse KL |
| KL temperature | 1.0 |
| Rollout temperature | 0.8 |
| Rollout top-p | 0.95 |
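
To make the two rollout settings concrete, here is a small sketch of temperature scaling followed by nucleus (top-p) truncation, using the values 0.8 and 0.95 from the table. It is a generic textbook version written in plain Python, not the sampler the training script uses, and library implementations may differ in edge-case handling. The function names and the toy logits are assumptions.

```python
import math
import random

def top_p_distribution(logits, temperature=0.8, top_p=0.95):
    """Return {token_id: prob} after temperature scaling and nucleus truncation."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose total mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the kept tokens.
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, rng, temperature=0.8, top_p=0.95):
    """Draw one token id from the truncated distribution."""
    dist = top_p_distribution(logits, temperature, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_logits = [3.0, 2.0, 0.0, -4.0]  # made-up values
dist = top_p_distribution(toy_logits)
token = sample_token(toy_logits, rng)
```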

LoRA targets only language-trunk modules:

text

q_proj, k_proj, v_proj, o_proj,
gate_proj, up_proj, down_proj,
in_proj_qkv, in_proj_z, out_proj

Vision modules are not LoRA targets in the training script, so the visual tower is not adapted by this text-only run.
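
One way to restrict adapters to the language trunk is to filter module paths by leaf name and skip anything under the vision tower, as sketched below. The module paths and the `visual` path component are hypothetical examples, not names confirmed for this checkpoint, and `select_lora_targets` is an illustrative helper rather than part of the training script.

```python
# Leaf module names listed as LoRA targets in this card.
TARGET_LEAF_NAMES = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "in_proj_qkv", "in_proj_z", "out_proj",
}

def select_lora_targets(module_names, vision_marker="visual"):
    """Keep modules with a target leaf name that are not under the vision tower.

    `vision_marker` is an assumed path component for the visual tower.
    """
    selected = []
    for name in module_names:
        parts = name.split(".")
        if parts[-1] in TARGET_LEAF_NAMES and vision_marker not in parts:
            selected.append(name)
    return selected

# Hypothetical module paths, for illustration only.
names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.language_model.layers.1.linear_attn.in_proj_qkv",
    "model.visual.blocks.0.attn.proj",
    "model.visual.blocks.0.mlp.up_proj",
    "model.language_model.norm",
]
targets = select_lora_targets(names)
```

Filtering on full paths rather than leaf names alone matters here because a leaf name such as `up_proj` can also appear inside the vision tower.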

How to Use

python

import torch
from transformers import AutoTokenizer

# Prefer the multimodal auto class when this transformers version provides it.
try:
    from transformers import AutoModelForMultimodalLM as AutoModel
except ImportError:
    from transformers import AutoModelForCausalLM as AutoModel

model_id = "your-name/qwen35-9b-64k-sdft"  # placeholder: replace with the actual repo id

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a small Python function that validates an email address."}
]

# Render the chat template to text with thinking mode disabled.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Limitations

  • The model is trained from recorded traces, so it can inherit errors, assumptions, and style from those traces.
  • SDFT is on-policy per assistant turn, but the surrounding environment feedback is still the recorded expert trajectory. It does not replay tools or sandboxes during training.
  • Tool calls are generated as model outputs. Downstream systems should validate tool names, arguments, permissions, and side effects before execution.
  • The run is text/tool-call focused. Multimodal behavior should be validated separately before relying on it.
  • This is not a safety-tuned or policy-aligned model. Do not use it for high-stakes decisions without additional evaluation and safeguards.
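
As a hedged illustration of the validation step recommended above, the sketch below checks a tool call against an allow-list before anything is executed. It assumes the call has already been extracted from the model output into a JSON object with `name` and `arguments` keys; the tool names, the required arguments, and the function itself are hypothetical. Permission checks and side-effect controls are deployment-specific and are not covered.

```python
import json

# Hypothetical allow-list: tool name -> required argument names.
ALLOWED_TOOLS = {
    "read_file": {"path"},
    "run_tests": set(),
}

def validate_tool_call(raw: str):
    """Parse a JSON tool call and return (ok, reason). Nothing is executed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"

    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return False, f"tool {name!r} is not allowed"
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"

    missing = ALLOWED_TOOLS[name] - set(args)
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"

ok, reason = validate_tool_call('{"name": "read_file", "arguments": {"path": "utils.py"}}')
```

Rejecting by default and returning a reason keeps the failure visible to the calling system, which can then decide whether to retry or stop.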

Citation

If you use or discuss the training method, cite the SDFT paper:

bibtex

@misc{shenfeld2026selfdistillationenablescontinuallearning,
  title         = {Self-Distillation Enables Continual Learning},
  author        = {Shenfeld, Idan and Damani, Mehul and Hubotter, Jonas and Agrawal, Pulkit},
  year          = {2026},
  eprint        = {2601.19897},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2601.19897}
}

Modalities

  • Input: Video, Text, Image
  • Output: Text
