distil-labs

distil-qwen3-1.7b-customer-support-deferral

License: apache-2.0

Results

Evaluated on a held-out set of airline customer-support turns, scored by an independent GLM-5 judge (score = fraction of responses rated correct).

| System | Quality | Frontier-model calls |
| --- | --- | --- |
| Frontier model alone (GLM-5) | 0.80 | 100% |
| This model + escalation (local) | ~0.75 | ~4% |
| Untrained Qwen3-1.7B | 0.42 | 0% |

Fine-tuning lifts the local 1.7B from 0.42 to ~0.75 (closing roughly 85% of the gap to its frontier-scale teacher), while running ~96% of turns locally and escalating only the hardest ~4% to the larger model.
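The gap-closure figure can be checked from the table with a few lines of arithmetic. This is a minimal sketch: the function name `gap_closed` is illustrative, and the fine-tuned score is taken as 0.75 even though the table reports it only approximately (`~0.75`), so the result is itself approximate.

```python
# Scores copied from the results table; the fine-tuned score is reported as "~0.75".
FRONTIER = 0.80
FINE_TUNED = 0.75
UNTRAINED = 0.42


def gap_closed(student: float, baseline: float, teacher: float) -> float:
    """Fraction of the baseline-to-teacher gap recovered by the student."""
    return (student - baseline) / (teacher - baseline)


closed = gap_closed(FINE_TUNED, UNTRAINED, FRONTIER)
print(f"gap closed: {closed:.0%}")  # gap closed: 87%
```

With these rounded inputs the ratio comes out near 0.87, in line with the "roughly 85%" quoted above.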

Score is a reference-free LLM-as-a-judge rating on held-out turns, not exact-match accuracy. The escalation (defer_to_larger_model) is a cost/safety mechanism that reserves the frontier model for the hard minority, not a quality boost over the small model alone.

What the model does

Given the airline policy (as the system prompt), the available tools, and the conversation so far, the model produces the next single tool call:

- Talk to the customer: `respond_to_user(message=...)` (terminal, ends the turn).
- Act / look up: `get_reservation_details`, `book_reservation`, `send_certificate`, and so on.
- Reason silently: `think(thought=...)`.
- Escalate to a larger model: `defer_to_larger_model(reason=...)` on turns whose correct action depends on non-obvious policy eligibility, combining several rules, a multi-step calculation, or a genuinely ambiguous judgement call.
- Hand off to a human: `transfer_to_human_agents(summary=...)` for out-of-scope requests or explicit human requests (distinct from deferral, which stays automated).

Deferral vs. human transfer

`defer_to_larger_model` is a capability escalation: a larger, more capable model takes over the same conversation with the same tools and policy, and the customer continues to be served automatically. `transfer_to_human_agents` is for requests outside the tools' scope or for users who ask for a person. The core skill this model is distilled for is judging when to defer based on the objective structure of the problem rather than on the model's own confidence.
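The distinction can be made concrete with a small routing sketch. It is not the demo app's implementation: `route_turn`, the two stub models, and the `reservation_id` argument name are hypothetical, and real inference calls would replace the stubs. Only the tool names `defer_to_larger_model` and `transfer_to_human_agents` come from this model card.

```python
from typing import Callable

ToolCall = dict  # e.g. {"name": "respond_to_user", "arguments": {"message": "..."}}
Model = Callable[[list], ToolCall]


def route_turn(conversation: list, local_model: Model, larger_model: Model) -> tuple[str, ToolCall]:
    """Return (handler, tool_call) for one step of the conversation."""
    call = local_model(conversation)
    if call["name"] == "defer_to_larger_model":
        # Capability escalation: same conversation, same tools, larger model.
        return "larger_model", larger_model(conversation)
    if call["name"] == "transfer_to_human_agents":
        # Human handoff: the automated loop stops and a person takes over.
        return "human", call
    return "local_model", call


# Stub models standing in for real inference calls (hypothetical).
def stub_local(conversation: list) -> ToolCall:
    return {"name": "defer_to_larger_model", "arguments": {"reason": "combines several refund rules"}}


def stub_larger(conversation: list) -> ToolCall:
    return {"name": "get_reservation_details", "arguments": {"reservation_id": "8JX2WO"}}


history = [{"role": "user", "content": "Can I get a refund for reservation 8JX2WO?"}]
handler, call = route_turn(history, stub_local, stub_larger)
print(handler, call["name"])  # larger_model get_reservation_details
```

Keeping the human handoff as a separate branch mirrors the text: deferral stays inside the automated loop, while a transfer leaves it.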

Quick Start

Using Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distil-labs/distil-qwen3-1.7b-customer-support-deferral"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The full airline policy (system prompt) and the 16 tool schemas ship with the demo
# app as `job_description.json`. Wrap the policy in the distil tool-calling preamble:
TASK_DESCRIPTION = "# Airline Agent Policy\n... (see job_description.json) ..."
SYSTEM = (
    "You are a tool-calling model working on:\n"
    f"<task_description>{TASK_DESCRIPTION}</task_description>\n\n"
    "Respond to the conversation history by generating an appropriate tool call that "
    "satisfies the user request. Generate only the tool call according to the provided "
    "tool schema, do not generate anything else. Always respond with a tool call."
)
TOOLS = [ ... ]  # 16 tools from job_description.json

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Can I get a refund for reservation 8JX2WO?"},
]
# Render the prompt with the tool schemas and with Qwen3 thinking mode disabled.
text = tokenizer.apply_chat_template(
    messages, tools=TOOLS, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt")
# Greedy decoding: use do_sample=False rather than temperature=0, which is rejected
# when sampling is enabled in the model's generation config.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
# <tool_call>
# {"name": "defer_to_larger_model", "arguments": {"reason": "refund eligibility depends on fare class + travel insurance"}}
# </tool_call>
```
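The generated text wraps a JSON object in `<tool_call>` tags, as in the sample output above. A minimal sketch of how an application might extract the tool name and arguments follows; `parse_tool_call` is a hypothetical helper, not part of Transformers or the demo app, and it assumes at most one tool call per generation.

```python
import json
import re
from typing import Optional

# Matches the JSON payload between <tool_call> and </tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_call(text: str) -> Optional[tuple[str, dict]]:
    """Return (name, arguments) for the first tool call in `text`, or None if absent or malformed."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or "name" not in call:
        return None
    return call["name"], call.get("arguments", {})


sample = (
    "<tool_call>\n"
    '{"name": "defer_to_larger_model", "arguments": '
    '{"reason": "refund eligibility depends on fare class + travel insurance"}}\n'
    "</tool_call>"
)
name, arguments = parse_tool_call(sample)
print(name)                 # defer_to_larger_model
print(arguments["reason"])  # refund eligibility depends on fare class + travel insurance
```

Returning `None` for malformed output lets the caller decide how to recover, for example by retrying or escalating.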

Using the Demo App

This model powers the Flexible Customer Support Bot demo, a terminal cascade where a local SLM handles most airline-support turns and defers hard turns to a larger, OpenAI-compatible model.

Using llama.cpp

For local serving, use the GGUF build at distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf:

```bash
llama-server --model distil-qwen3-1.7b-customer-support-deferral.gguf --port 8000 --jinja
```
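Once the server is running, it can be queried through llama-server's OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses only the Python standard library and assumes the server listens on `localhost:8000` as in the command above. The `DEFER_TOOL` schema is a hypothetical stand-in written in the OpenAI function-tool format; the real 16 schemas ship in `job_description.json`. The helper names are illustrative.

```python
import json
import urllib.request

# Hypothetical schema for illustration only; load the real ones from job_description.json.
DEFER_TOOL = {
    "type": "function",
    "function": {
        "name": "defer_to_larger_model",
        "description": "Escalate this turn to a larger model.",
        "parameters": {
            "type": "object",
            "properties": {"reason": {"type": "string"}},
            "required": ["reason"],
        },
    },
}


def build_chat_request(system: str, user: str, tools: list,
                       base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a chat-completions request carrying the tool schemas."""
    payload = {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "tools": tools,
        "temperature": 0,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the assistant message from the first choice."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]


request = build_chat_request("<system prompt here>", "Can I get a refund for reservation 8JX2WO?", [DEFER_TOOL])
# message = send(request)  # requires a running llama-server
```

Separating request construction from sending keeps the payload easy to inspect without a live server.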

Model Details

| Property | Value |
| --- | --- |
| Base Model | Qwen/Qwen3-1.7B |
| Parameters | 1.7 billion |
| Architecture | Qwen3ForCausalLM |
| Context Length | 40,960 tokens |
| Precision | bfloat16 (merged) |
| Teacher Model | GLM-5 (`zai.glm-5`) |
| Task | Multi-turn tool calling (closed book) with model deferral |

Training

The model is distilled with the Distil Labs platform:

- Traces: airline customer-support conversations (tau-bench airline tool set), processed and cleaned through the distil trace-processing pipeline.
- Deferral signal: a `defer_to_larger_model` tool and policy guidance, so the teacher marks genuinely hard turns for escalation while the student learns the rest.
- Synthetic expansion + fine-tuning: distilled onto Qwen3-1.7B with GLM-5 as teacher.

Supported Functions (16 tools)

| Function | Description |
| --- | --- |
| `book_reservation` | Book a new flight reservation |
| `cancel_reservation` | Cancel an existing reservation |
| `get_reservation_details` | Look up a reservation |
| `get_user_details` | Look up a user / profile |
| `list_all_airports` | List supported airports |
| `search_direct_flight` | |

Use Cases

- Cost-efficient customer-support assistants: a small local model handles the bulk of traffic, and a larger model is invoked only on the hard minority of turns.
- Any multi-turn tool-calling task with a bounded tool catalog and a difficulty signal worth routing on.

Limitations

- English airline customer support only; not a general-purpose tool caller.
- Deferral calibration depends on the policy and tool catalog it was trained with.

License

Released under the Apache 2.0 license. See STUDENT_LICENSE (base model) and TEACHER_LICENSE (teacher model) for upstream terms.

Citation

```bibtex
@misc{distil-qwen3-1.7b-customer-support-deferral,
  author = {Distil Labs},
  title  = {Distil-Qwen3-1.7B-Customer-Support-Deferral: A Fine-tuned SLM for Airline Support with Model Deferral},
  year   = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/distil-labs/distil-qwen3-1.7b-customer-support-deferral}
}
```
