armand0e

Qwen3.5-9B-Fable-5-SDFT

What SDFT Does

SDFT uses one model in two roles:

Student: the trainable model, prompted only with the conversation so far.
Teacher: the same base model with training adapters disabled, prompted with the conversation plus an in-context expert reference response.

The student samples its own response first. The teacher then scores that same sampled response token by token, but from the stronger prompt that includes the expert demonstration. Training minimizes divergence between the student distribution and the demonstration-conditioned teacher distribution.

text
                    expert response c
                            |
                            v
conversation x ----> teacher prompt: x + c ----> frozen base model
       |                                             |
       |                                             v
       +---------> student prompt: x ----------> teacher logits over y
                         |
                         v
                 trainable student
                         |
                         v
                sampled response y
                         |
                         v
        reverse KL(student logits || teacher logits)

In one update:

text
1. Sample y from the current student:
      y ~ pi_theta(. | conversation)

2. Score each sampled token with two distributions:
      student: pi_theta(. | conversation, y_<t)
      teacher: pi_0(. | conversation, expert_reference, y_<t)

3. Train the student toward the teacher on the sampled trajectory:
      loss = KL(pi_theta(. | conversation, y_<t) || pi_0(. | conversation, expert_reference, y_<t)) over the rollout tokens
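
The three steps above can be illustrated numerically. The following is a minimal, dependency-free sketch of a per-token reverse KL over a toy rollout. It is not the training script: the function names are hypothetical, and averaging over rollout positions (rather than summing) is an assumption made for this example. A real run would compute the same quantity with tensor operations on model logits.

```python
import math

def log_softmax(logits):
    """Convert raw logits for one position into log-probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def reverse_kl_per_token(student_logits, teacher_logits):
    """KL(student || teacher) at one rollout position, over the full vocabulary."""
    log_p = log_softmax(student_logits)  # pi_theta(. | conversation, y_<t)
    log_q = log_softmax(teacher_logits)  # pi_0(. | conversation, expert_reference, y_<t)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def sdft_loss(student_rollout_logits, teacher_rollout_logits):
    """Mean reverse KL over rollout positions (the mean is an assumption)."""
    kls = [
        reverse_kl_per_token(s, t)
        for s, t in zip(student_rollout_logits, teacher_rollout_logits)
    ]
    return sum(kls) / len(kls)

# Toy rollout: 2 sampled positions, vocabulary of 3 tokens.
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher = [[1.5, 1.0, -1.0], [0.1, 0.2, 0.3]]
loss = sdft_loss(student, teacher)
```

The loss is zero only where the student already matches the demonstration-conditioned teacher, so positions where the two distributions agree contribute no gradient.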

SDFT vs. SFT

Supervised fine-tuning (SFT) trains on fixed expert-written tokens. That is off-policy: the gradient is computed on a sequence the current model may not have produced itself.

text
SFT:
  conversation x + expert tokens y*
          |
          v
  cross entropy: -log pi_theta(y* | x)
          |
          v
  off-policy learning on fixed demonstrations

SDFT trains on the model's own sampled tokens. That is on-policy: the update is attached to the current model's actual trajectory, while the teacher prompt uses the expert demonstration to shape the target distribution.

text
SDFT:
  conversation x ---> current model samples y
          |                    |
          |                    v
          +---- expert c ---> teacher scores y
                               |
                               v
          on-policy distillation on the student's own rollout

This run uses lambda_on_policy = 1.0, so all training examples are on-policy. There is no plain next-token cross-entropy SFT objective in this run.
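
The effect of this setting can be sketched as follows. The code assumes that lambda_on_policy is the probability that a given training example uses the on-policy SDFT objective rather than cross-entropy SFT; the card does not spell out the exact semantics, and `pick_objective` is a hypothetical helper, not a function from the training script.

```python
import random

def pick_objective(lambda_on_policy, rng):
    """Choose the objective for one training example.

    Assumption: lambda_on_policy is the probability of training the example
    on-policy (SDFT) instead of with cross-entropy SFT.
    """
    return "sdft" if rng.random() < lambda_on_policy else "sft"

rng = random.Random(0)
# rng.random() is always below 1.0, so every example is on-policy here.
choices = [pick_objective(1.0, rng) for _ in range(1000)]
```

Under this reading, any value below 1.0 would mix some fixed-demonstration SFT examples into the run.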

Model Details

Base model: unsloth/Qwen3.5-9B
Final artifact: merged bf16 model, not a standalone PEFT adapter
Task shape: long-context assistant responses for coding-agent and tool-use traces
Training method: Self-Distillation Fine-Tuning with reverse KL
Context target: 65,536 tokens
Prompt cap: 57,344 tokens
Rollout cap: 8,192 new tokens
Training data: 2,693 filtered SDFT examples derived from armand0e/claude-fable-5-claude-code
Reasoning traces: private/internal reasoning fields are not included in the teacher reference
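
The prompt cap and rollout cap add up exactly to the context target. A small check makes the budget explicit; the constant and function names below are illustrative and are not taken from the training script.

```python
CONTEXT_TARGET = 65_536  # total context target, in tokens
PROMPT_CAP = 57_344      # maximum prompt tokens
ROLLOUT_CAP = 8_192      # maximum newly generated tokens

def fits_budget(prompt_tokens, new_tokens):
    """Return True if a prompt and its rollout respect all three limits."""
    return (
        prompt_tokens <= PROMPT_CAP
        and new_tokens <= ROLLOUT_CAP
        and prompt_tokens + new_tokens <= CONTEXT_TARGET
    )

# A full prompt plus a full rollout fills the context exactly.
full_budget = PROMPT_CAP + ROLLOUT_CAP
```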

Training Data

The examples are per-assistant-turn records from agentic coding traces. Each record contains:

the conversation context before an assistant turn
the matching expert assistant turn
optional tool schemas used to render tool calls through the chat template

During SDFT, the expert turn is injected into the teacher prompt inside an <expert_reference> block. The student does not see that block when it samples its response.
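
A minimal sketch of how the two prompts might be assembled from one record. Only the <expert_reference> tag comes from this card. The `build_prompts` helper, the message layout, and the choice to append the reference as an extra system message are assumptions for illustration.

```python
def build_prompts(context_messages, expert_turn):
    """Return (student_messages, teacher_messages) for one training record."""
    # The student sees only the conversation so far.
    student_messages = list(context_messages)

    # The teacher sees the same conversation plus the expert turn, wrapped in
    # an <expert_reference> block (placement here is an assumption).
    reference = "<expert_reference>\n" + expert_turn + "\n</expert_reference>"
    teacher_messages = list(context_messages) + [
        {"role": "system", "content": reference}
    ]
    return student_messages, teacher_messages

context = [{"role": "user", "content": "Fix the failing test in utils.py."}]
student_msgs, teacher_msgs = build_prompts(context, "I'll start by reading utils.py.")
```

Both lists would then be rendered through the same chat template, so the two prompts differ only by the reference block.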

Training Procedure

The Colab training profile used:

| Setting | Value |
| --- | --- |
| Base checkpoint | `unsloth/Qwen3.5-9B` |
| Max sequence length | `65536` |
| Max teacher prompt tokens | `57344` |
| Max rollout tokens | `8192` |
| Optimizer steps | `600` |
| Batch size | `1` |

LoRA targets only language-trunk modules:

text
q_proj, k_proj, v_proj, o_proj,
gate_proj, up_proj, down_proj,
in_proj_qkv, in_proj_z, out_proj

Vision modules are not LoRA targets in the training script, so the visual tower is not adapted by this text-only run.
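
One way such a restriction could be expressed is to match module names by their final component and skip anything under the vision tower. This is a sketch only: the module paths and the `model.visual` prefix below are hypothetical, and the training script may select modules differently.

```python
LORA_TARGETS = {
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "in_proj_qkv", "in_proj_z", "out_proj",
}

def select_lora_modules(module_names, exclude_prefixes=("model.visual",)):
    """Keep modules whose leaf name is a LoRA target and that sit outside the vision tower."""
    selected = []
    for name in module_names:
        if name.startswith(exclude_prefixes):
            continue  # never adapt vision modules
        if name.rsplit(".", 1)[-1] in LORA_TARGETS:
            selected.append(name)
    return selected

# Hypothetical module paths, for illustration only.
names = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.0.mlp.down_proj",
    "model.language_model.layers.0.input_layernorm",
    "model.visual.blocks.0.attn.out_proj",
]
selected = select_lora_modules(names)
```

The last entry shows why a prefix filter matters in this sketch: a vision module could share a leaf name such as out_proj with a language-trunk target.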

How to Use

python
import torch
from transformers import AutoTokenizer

try:
    from transformers import AutoModelForMultimodalLM as AutoModel
except ImportError:
    from transformers import AutoModelForCausalLM as AutoModel

# Placeholder repository id: replace it with the id this model is published under.
model_id = "your-name/qwen35-9b-64k-sdft"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Write a small Python function that validates an email address."}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.95,
        do_sample=True,
    )

# Decode only the newly generated tokens by slicing off the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Limitations

The model is trained from recorded traces, so it can inherit errors, assumptions, and style from those traces.
SDFT is on-policy per assistant turn, but the surrounding environment feedback is still the recorded expert trajectory. It does not replay tools or sandboxes during training.
Tool calls are generated as model outputs. Downstream systems should validate tool names, arguments, permissions, and side effects before execution.
The run is text/tool-call focused. Multimodal behavior should be validated separately before relying on it.
This is not a safety-tuned or policy-aligned model. Do not use it for high-stakes decisions without additional evaluation and safeguards.
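
The tool-call validation recommended above can be sketched as an allowlist check that runs before anything is executed. The tool names, the argument schema, and the assumed JSON shape ({"name": ..., "arguments": {...}}) are hypothetical; adapt them to the tool format your runtime actually parses.

```python
import json

# Hypothetical allowlist: tool name -> required and optional argument names.
ALLOWED_TOOLS = {
    "read_file": {"required": {"path"}, "optional": {"max_bytes"}},
    "run_tests": {"required": set(), "optional": {"pattern"}},
}

def validate_tool_call(raw_call):
    """Return (ok, reason) for a JSON tool call emitted by the model."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "tool call must be a JSON object"
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        return False, "tool is not on the allowlist"
    if not isinstance(args, dict):
        return False, "arguments must be a JSON object"
    spec = ALLOWED_TOOLS[name]
    missing = spec["required"] - set(args)
    unexpected = set(args) - spec["required"] - spec["optional"]
    if missing:
        return False, "missing arguments: " + ", ".join(sorted(missing))
    if unexpected:
        return False, "unexpected arguments: " + ", ".join(sorted(unexpected))
    return True, "ok"

ok_result = validate_tool_call('{"name": "read_file", "arguments": {"path": "README.md"}}')
bad_result = validate_tool_call('{"name": "delete_repo", "arguments": {}}')
```

This covers tool names and argument names only. Permission checks and side-effect controls, such as path sandboxing or confirmation prompts, need separate enforcement in the executing system.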

Citation

If you use or discuss the training method, cite the SDFT paper:

bibtex
@misc{shenfeld2026selfdistillationenablescontinuallearning,
  title = {Self-Distillation Enables Continual Learning},
  author = {Shenfeld, Idan and Damani, Mehul and H{\"u}botter, Jonas and Agrawal, Pulkit},
  year = {2026},
  eprint = {2601.19897},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2601.19897}
}
