armand0e
Qwen3.5-9B-Fable-5-SDFT
What SDFT Does
SDFT uses one model in two roles:
- Student: the trainable model, prompted only with the conversation so far.
- Teacher: the same base model with training adapters disabled, prompted with the conversation plus an in-context expert reference response.
The student samples its own response first. The teacher then scores that same sampled response token by token, but from the stronger prompt that includes the expert demonstration. Training minimizes divergence between the student distribution and the demonstration-conditioned teacher distribution.
```text
                           expert response c
                                   |
                                   v
conversation x ----> teacher prompt: x + c ----> frozen base model
       |                                                 |
       |                                                 v
       +---------> student prompt: x ----------> teacher logits over y
                           |
                           v
                   trainable student
                           |
                           v
                   sampled response y
                           |
                           v
          reverse KL(student logits || teacher logits)
```
In one update:
```text
1. Sample y from the current student:
   y ~ pi_theta(. | conversation)

2. Score each sampled token with two distributions:
   student: pi_theta(. | conversation, y_<t)
   teacher: pi_0(. | conversation, expert_reference, y_<t)

3. Train the student toward the teacher on the sampled trajectory:
   loss = KL(pi_theta || pi_0) over the rollout tokens
```
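Step 3 can be illustrated with a small, dependency-free sketch. It assumes the loss is the full-vocabulary reverse KL at each rollout position, averaged over positions; the reduction (mean rather than sum) and the function names are illustrative assumptions, not taken from the training script. A real run would compute this with tensor operations on model logits.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position."""
    log_p = log_softmax(student_logits)  # student: pi_theta
    log_q = log_softmax(teacher_logits)  # teacher: pi_0 with expert reference
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def sdft_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over the rollout (mean is an assumption)."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy rollout: two positions, vocabulary of three tokens.
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 0.3]]
teacher = [[2.0, 0.5, 0.1], [1.5, 0.2, 0.1]]
loss = sdft_loss(student, teacher)
```

The first position contributes zero because both distributions agree; only the second position, where the teacher is more peaked than the student, produces a training signal.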
SDFT vs. SFT

Supervised fine-tuning (SFT) trains on fixed expert-written tokens. That is off-policy: the gradient is computed on a sequence the current model may not have produced itself.
```text
SFT:
conversation x + expert tokens y*
        |
        v
cross entropy: -log pi_theta(y* | x)
        |
        v
off-policy learning on fixed demonstrations
```
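For comparison, the SFT objective can be sketched in a few lines. The per-token probabilities below are made-up values for illustration; the point is that an expert token the current model finds unlikely dominates the loss, even though the model would rarely have produced that sequence itself.

```python
import math

def sft_loss(expert_token_log_probs):
    """-log pi_theta(y* | x), written as a sum over the expert tokens."""
    return -sum(expert_token_log_probs)

# Hypothetical probabilities the model assigns to three expert tokens.
probs = [0.9, 0.5, 0.05]
loss = sft_loss([math.log(p) for p in probs])
```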
SDFT trains on the model's own sampled tokens. That is on-policy: the update is attached to the current model's actual trajectory, while the teacher prompt uses the expert demonstration to shape the target distribution.
```text
SDFT:
conversation x ---> current model samples y
      |                       |
      |                       v
      +---- expert c ---> teacher scores y
                              |
                              v
        on-policy distillation on the student's own rollout
```
This run uses lambda_on_policy = 1.0, so all training examples are
on-policy. There is no plain next-token cross-entropy SFT objective in this
run.
Model Details
- Base model: unsloth/Qwen3.5-9B
- Final artifact: merged bf16 model, not a standalone PEFT adapter
- Task shape: long-context assistant responses for coding-agent and tool-use traces
- Training method: Self-Distillation Fine-Tuning with reverse KL
- Context target: 65,536 tokens
- Prompt cap: 57,344 tokens
- Rollout cap: 8,192 new tokens
- Training data: 2,693 filtered SDFT examples derived from armand0e/claude-fable-5-claude-code
- Reasoning traces: private/internal reasoning fields are not included in the teacher reference
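The prompt cap and rollout cap listed above add up exactly to the context target (57,344 + 8,192 = 65,536). A minimal sketch of that budget check follows; the helper name `fits_context` is hypothetical, and how the training script treats over-length prompts (dropping or truncating them) is not stated here.

```python
CONTEXT_TARGET = 65_536  # tokens
PROMPT_CAP = 57_344      # tokens
ROLLOUT_CAP = 8_192      # new tokens

# The two caps partition the context window exactly.
assert PROMPT_CAP + ROLLOUT_CAP == CONTEXT_TARGET

def fits_context(prompt_tokens: int, new_tokens: int) -> bool:
    """True if a prompt and its rollout both respect the listed caps."""
    return 0 <= prompt_tokens <= PROMPT_CAP and 0 <= new_tokens <= ROLLOUT_CAP
```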
Training Data
The examples are per-assistant-turn records from agentic coding traces. Each record contains:
- the conversation context before an assistant turn
- the matching expert assistant turn
- optional tool schemas used to render tool calls through the chat template
During SDFT, the expert turn is injected into the teacher prompt inside an
<expert_reference> block. The student does not see that block when it samples
its response.
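The difference between the two prompts can be sketched as follows. The record field names (`context`, `expert_turn`), the closing `</expert_reference>` tag, and the choice to inject the block as an extra system message are all assumptions made for illustration; the actual training script may place the block differently.

```python
def build_student_messages(record):
    """Student prompt: the conversation so far, nothing else."""
    return list(record["context"])

def build_teacher_messages(record):
    """Teacher prompt: same conversation plus the expert reference block."""
    reference = (
        "<expert_reference>\n"
        + record["expert_turn"]
        + "\n</expert_reference>"
    )
    # Assumption: the block is appended as its own message.
    return list(record["context"]) + [{"role": "system", "content": reference}]

# Hypothetical record for one assistant turn.
record = {
    "context": [{"role": "user", "content": "Fix the failing test in utils.py."}],
    "expert_turn": "I'll read utils.py first, then run the test suite.",
}
student_messages = build_student_messages(record)
teacher_messages = build_teacher_messages(record)
```

Both prompts share the same conversation prefix, so the sampled response can be scored by the teacher at the same token positions.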
Training Procedure
The Colab training profile used:
| Setting | Value |
|---|---|
| Base checkpoint | unsloth/Qwen3.5-9B |
| Max sequence length | 65536 |
| Max teacher prompt tokens | 57344 |
| Max rollout tokens | 8192 |
| Optimizer steps | 600 |
| Batch size | 1 |
| Learning rate | 1.0e-5 |
| Warmup steps | 20 |
| Weight decay | 0.0 |
| LoRA rank | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.0 |
| Distillation loss | reverse KL |
| KL temperature | 1.0 |
| Rollout temperature | 0.8 |
| Rollout top-p | 0.95 |
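To make the two rollout settings concrete, here is a dependency-free sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. It shows the general technique only; the generation backend used in training may order or implement these steps differently, and `rollout_distribution` is a hypothetical name.

```python
import math
import random

def rollout_distribution(logits, temperature=0.8, top_p=0.95):
    """Temperature-scale the logits, keep the top-p nucleus, renormalize."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break

    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    z = sum(filtered)
    return [p / z for p in filtered]

# Toy vocabulary of four tokens.
dist = rollout_distribution([4.0, 2.0, 0.0, -4.0])
token = random.choices(range(len(dist)), weights=dist, k=1)[0]
```

With these toy logits the two least likely tokens fall outside the nucleus and can never be sampled.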
LoRA targets only language-trunk modules:
```text
q_proj, k_proj, v_proj, o_proj,
gate_proj, up_proj, down_proj,
in_proj_qkv, in_proj_z, out_proj
```
Vision modules are not LoRA targets in the training script, so the visual tower is not adapted by this text-only run.
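One way to express this selection is a name-based filter, sketched below. The vision prefixes and the example module paths are hypothetical; the real parameter names depend on the model implementation and should be checked against the loaded model before reuse.

```python
LORA_TARGETS = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "in_proj_qkv", "in_proj_z", "out_proj",
}

def is_lora_target(module_name, vision_prefixes=("visual.", "model.visual.")):
    """True for language-trunk modules whose final name component is targeted."""
    if module_name.startswith(vision_prefixes):
        return False  # assumed naming for the visual tower
    return module_name.rsplit(".", 1)[-1] in LORA_TARGETS

# Hypothetical module paths.
names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.visual.blocks.0.mlp.up_proj",
    "model.language_model.embed_tokens",
]
selected = [n for n in names if is_lora_target(n)]
```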
How to Use
```python
import torch
from transformers import AutoTokenizer

# Prefer the multimodal auto class when available; fall back to the causal LM class.
try:
    from transformers import AutoModelForMultimodalLM as AutoModel
except ImportError:
    from transformers import AutoModelForCausalLM as AutoModel

model_id = "your-name/qwen35-9b-64k-sdft"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a small Python function that validates an email address."}
]

# Render the chat template to text, with thinking disabled.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
Limitations
- The model is trained from recorded traces, so it can inherit errors, assumptions, and style from those traces.
- SDFT is on-policy per assistant turn, but the surrounding environment feedback is still the recorded expert trajectory. It does not replay tools or sandboxes during training.
- Tool calls are generated as model outputs. Downstream systems should validate tool names, arguments, permissions, and side effects before execution.
- The run is text/tool-call focused. Multimodal behavior should be validated separately before relying on it.
- This is not a safety-tuned or policy-aligned model. Do not use it for high-stakes decisions without additional evaluation and safeguards.
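As a starting point for the tool-call validation mentioned above, the sketch below checks a generated call against an allow-list before anything is executed. The tool names, the JSON shape (`name` plus `arguments`), and the helper are hypothetical; permission and side-effect checks are application-specific and are not covered here.

```python
import json

# Hypothetical allow-list: tool name -> accepted argument names.
ALLOWED_TOOLS = {
    "read_file": {"required": {"path"}, "optional": set()},
    "run_tests": {"required": set(), "optional": {"pattern"}},
}

def validate_tool_call(raw_call: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(call, dict):
        return ["not a JSON object"]

    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return [f"unknown tool: {name!r}"]
    if not isinstance(args, dict):
        return ["arguments must be an object"]

    spec = ALLOWED_TOOLS[name]
    given = set(args)
    problems = []
    for key in sorted(spec["required"] - given):
        problems.append(f"missing argument: {key}")
    for key in sorted(given - spec["required"] - spec["optional"]):
        problems.append(f"unexpected argument: {key}")
    return problems

ok = validate_tool_call('{"name": "read_file", "arguments": {"path": "utils.py"}}')
bad = validate_tool_call('{"name": "rm_rf", "arguments": {}}')
```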
Citation
If you use or discuss the training method, cite the SDFT paper:
```bibtex
@misc{shenfeld2026selfdistillationenablescontinuallearning,
  title = {Self-Distillation Enables Continual Learning},
  author = {Shenfeld, Idan and Damani, Mehul and Hubotter, Jonas and Agrawal, Pulkit},
  year = {2026},
  eprint = {2601.19897},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.19897}
}
```
Modalities
- Input: video, text, image
- Output: text