License: apache-2.0

Results

Populated when training completes.

| Model | Parameters | Tool Call Accuracy | ROUGE | Deferral Precision | Deferral Recall |
|---|---|---|---|---|---|
| GLM-5 (teacher) | | | | | |
| This model (tuned) | 1.7B | | | | |
| Qwen3-1.7B (base) | 1.7B | | | | |
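
The table does not define how its deferral metrics are computed. As an illustration only, one plausible reading treats defer_to_larger_model as the positive class and compares the tool name predicted on each turn with a reference label. The function name and the sample data below are hypothetical and are not measured results.

```python
DEFER = "defer_to_larger_model"


def deferral_precision_recall(predicted, reference):
    """Return (precision, recall) with DEFER as the positive class.

    predicted / reference: equal-length lists of tool names, one per turn.
    This per-turn definition is an assumption, not the card's stated metric.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(p == DEFER and r == DEFER for p, r in pairs)  # deferred, and should have
    fp = sum(p == DEFER and r != DEFER for p, r in pairs)  # deferred unnecessarily
    fn = sum(p != DEFER and r == DEFER for p, r in pairs)  # missed a needed deferral
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for four turns, purely to exercise the function.
predicted = ["respond_to_user", DEFER, DEFER, "think"]
reference = ["respond_to_user", DEFER, "calculate", DEFER]
print(deferral_precision_recall(predicted, reference))  # (0.5, 0.5)
```

Under this reading, low precision means the larger model is invoked more often than needed (higher cost), while low recall means hard turns stay with the small model (higher risk of a wrong action).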

What the model does

Given the airline policy (as the system prompt), the available tools, and the conversation so far, the model produces the next single tool call:

  • Talk to the customer: respond_to_user(message=...) (terminal; ends the turn).
  • Act / look up: get_reservation_details, book_reservation, send_certificate, … .
  • Reason silently: think(thought=...).
  • Escalate to a larger model: defer_to_larger_model(reason=...) on turns whose correct action depends on non-obvious policy eligibility, combining several rules, a multi-step calculation, or a genuinely ambiguous judgement call.
  • Hand off to a human: transfer_to_human_agents(summary=...) for out-of-scope requests or explicit human requests (distinct from deferral, which stays automated).

Deferral vs. human transfer

defer_to_larger_model is a capability escalation: a larger, more capable model takes over the same conversation with the same tools and policy, so the customer continues to be served automatically. transfer_to_human_agents is for requests outside the tools' scope or for customers who ask for a person. The core skill this model is distilled for is judging when to defer, based on the intrinsic structure of the problem rather than on the model's own confidence.
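
The difference between the two escalation paths can be sketched as a small turn loop. This is a minimal illustration, not the demo's orchestrator: run_turn, the stub models, the execute_tool callback, and the way tool results are appended to the history are all assumptions made for the example.

```python
import json


def run_turn(messages, small_model, large_model, execute_tool, max_steps=8):
    """Drive one customer turn until a terminal tool call.

    small_model / large_model: callables mapping messages -> {"name", "arguments"}.
    execute_tool: callable (name, arguments) -> str result.
    Returns (outcome, payload); outcome is "reply", "human", or "step_limit".
    """
    active = small_model
    for _ in range(max_steps):
        call = active(messages)
        name = call["name"]
        args = call.get("arguments", {})
        if name == "respond_to_user":
            return "reply", args.get("message", "")  # terminal: ends the turn
        if name == "transfer_to_human_agents":
            return "human", args.get("summary", "")  # leaves the automated flow
        if name == "defer_to_larger_model":
            active = large_model  # same conversation, same tools, larger model
            continue
        # think has no side effects; every other tool is executed.
        result = "" if name == "think" else execute_tool(name, args)
        messages.append({"role": "assistant", "content": json.dumps(call)})
        messages.append({"role": "tool", "content": result})
    return "step_limit", ""


# Hypothetical stand-ins for the two models.
def small_stub(messages):
    return {"name": "defer_to_larger_model", "arguments": {"reason": "several rules interact"}}


def large_stub(messages):
    return {"name": "respond_to_user", "arguments": {"message": "Your refund is approved."}}


history = [{"role": "user", "content": "Can I get a refund for reservation 8JX2WO?"}]
outcome = run_turn(history, small_stub, large_stub, execute_tool=lambda name, args: "")
print(outcome)  # ('reply', 'Your refund is approved.')
```

Deferral only swaps which model produces the next call, whereas a human transfer returns control to the caller.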

Quick Start

Using Transformers

python

import json
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "distil-labs/distil-qwen3-1.7b-customer-support-deferral"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The full airline policy (system prompt) and the 16 tool schemas ship with the demo
# app as `job_description.json`. Wrap the policy in the distil tool-calling preamble:
TASK_DESCRIPTION = "# Airline Agent Policy\n... (see job_description.json) ..."
SYSTEM = (
    "You are a tool-calling model working on:\n"
    f"<task_description>{TASK_DESCRIPTION}</task_description>\n\n"
    "Respond to the conversation history by generating an appropriate tool call that "
    "satisfies the user request. Generate only the tool call according to the provided "
    "tool schema, do not generate anything else. Always respond with a tool call."
)
TOOLS = [ ... ] # 16 tools from job_description.json

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Can I get a refund for reservation 8JX2WO?"},
]
text = tokenizer.apply_chat_template(
    messages, tools=TOOLS, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt")

# Greedy decoding for deterministic tool calls.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
# <tool_call>
# {"name": "defer_to_larger_model", "arguments": {"reason": "refund eligibility depends on fare class + travel insurance"}}
# </tool_call>
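
The decoded text wraps a JSON object in <tool_call> tags, so a caller usually needs to turn it into a name and arguments before acting on it. The helper below is a minimal sketch that assumes exactly one tool call per generation, in the format shown above; parse_tool_call is a hypothetical name, not part of Transformers.

```python
import json
import re

# Matches the JSON object between the tags, tolerating surrounding whitespace.
_TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_call(text):
    """Return {"name": str, "arguments": dict}, or None if no valid call is found."""
    match = _TOOL_CALL_RE.search(text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return None
    arguments = call.get("arguments", {})
    if not isinstance(arguments, dict):
        return None
    return {"name": call["name"], "arguments": arguments}


sample = (
    "<tool_call>\n"
    '{"name": "defer_to_larger_model", "arguments": {"reason": '
    '"refund eligibility depends on fare class + travel insurance"}}\n'
    "</tool_call>"
)
call = parse_tool_call(sample)
print(call["name"])  # defer_to_larger_model
```

Returning None instead of raising lets the caller decide how to handle malformed output, for example by retrying or escalating.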

Using the Demo App

This model powers the Dual-size Customer-Support Bot demo — a terminal cascade where a local SLM handles most airline-support turns and defers hard turns to a larger, OpenAI-compatible model. See the demo repo for the orchestrator and serving setup.

Using llama.cpp

Serve a GGUF build with llama-server and query the OpenAI-compatible API at http://127.0.0.1:8000/v1:

bash

llama-server --model distil-qwen3-1.7b-customer-support-deferral.gguf --port 8000 --jinja
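
Once the server is running, a request can be sent with the Python standard library alone. This is a sketch under assumptions: it targets the standard OpenAI-style /chat/completions route under the base URL above, the "model" value is a placeholder label, and build_chat_request and send are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"  # matches --port 8000 in the command above


def build_chat_request(system, user, tools, model="distil-qwen3-1.7b-customer-support-deferral"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # placeholder label; assumed not to affect which model is served
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "tools": tools,  # the 16 tool schemas from job_description.json
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request("policy text", "Can I get a refund for reservation 8JX2WO?", tools=[])
print(request.full_url)
# With the server running: reply = send(request)
```

In the OpenAI response format, the tool call would be expected under choices[0].message.tool_calls; whether it appears there or as raw text in the message content depends on how the server parses the chat template.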

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-1.7B |
| Parameters | 1.7 billion |
| Architecture | Qwen3ForCausalLM |
| Context Length | 40,960 tokens |
| Precision | bfloat16 |
| Teacher Model | GLM-5 (zai.glm-5) |
| Task | Multi-turn tool calling (closed book) with model deferral |

Training

Details finalized when training completes. The model is distilled with the Distil Labs platform:

  1. Traces — airline customer-support conversations (tau-bench airline tool set), processed and cleaned through the distil trace-processing pipeline.
  2. Deferral signal — a defer_to_larger_model tool and policy guidance, so the teacher marks genuinely-hard turns for escalation while the student learns the rest.
  3. Synthetic expansion + fine-tuning — distilled onto Qwen3-1.7B with GLM-5 as teacher.

Supported Functions (16 tools)

| Function | Description |
|---|---|
| book_reservation | Book a new flight reservation |
| cancel_reservation | Cancel an existing reservation |
| get_reservation_details | Look up a reservation |
| get_user_details | Look up a user / profile |
| list_all_airports | List supported airports |
| search_direct_flight | Search direct flights |
| search_onestop_flight | Search one-stop flights |
| update_reservation_flights | Change flights on a reservation |
| update_reservation_baggages | Update baggage on a reservation |
| update_reservation_passengers | Update passengers on a reservation |
| send_certificate | Issue a travel certificate / compensation |
| calculate | Perform an arithmetic calculation |
| think | Private step-by-step reasoning (no side effects) |
| respond_to_user | Send a natural-language message to the customer (ends the turn) |
| transfer_to_human_agents | Hand off to a human agent (out-of-scope / explicit request) |
| defer_to_larger_model | Escalate this turn to a larger model (capability escalation) |
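
The actual schemas live in job_description.json and are not reproduced here. As a rough guide to their likely shape, the sketch below builds one entry in the JSON-schema function format that chat templates accept through the tools argument. The reason parameter name comes from this card; its type, its description text, and the helper names are assumptions.

```python
# The 16 tool names listed in the table above.
TOOL_NAMES = [
    "book_reservation", "cancel_reservation", "get_reservation_details",
    "get_user_details", "list_all_airports", "search_direct_flight",
    "search_onestop_flight", "update_reservation_flights",
    "update_reservation_baggages", "update_reservation_passengers",
    "send_certificate", "calculate", "think", "respond_to_user",
    "transfer_to_human_agents", "defer_to_larger_model",
]


def function_schema(name, description, properties, required):
    """Wrap a tool definition in the OpenAI-style function schema envelope."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Illustrative schema only; the shipped definition may differ.
DEFER_TOOL = function_schema(
    "defer_to_larger_model",
    "Escalate this turn to a larger model (capability escalation)",
    {"reason": {"type": "string", "description": "Why this turn needs the larger model"}},
    ["reason"],
)


def is_known_tool(name):
    """Reject tool calls whose name is outside the bounded catalog."""
    return name in TOOL_NAMES


print(len(TOOL_NAMES), is_known_tool(DEFER_TOOL["function"]["name"]))  # 16 True
```

Checking generated names against the catalog before execution is a cheap guard, since a small model can occasionally emit a tool name that does not exist.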

Use Cases

  • Cost-efficient customer-support assistants: a small local model handles the bulk of traffic, a larger model is invoked only on the hard minority of turns.
  • Any multi-turn tool-calling task with a bounded tool catalog and a difficulty signal worth routing on.

Limitations

  • English airline customer-support only; not a general-purpose tool caller.
  • Deferral calibration depends on the policy and tool catalog it was trained with.
  • Current weights are base Qwen3-1.7B placeholders — they do not yet follow the airline policy or defer reliably. Replace with the distilled weights for real performance.

License

Released under the Apache 2.0 license. See STUDENT_LICENSE (base model) and TEACHER_LICENSE (teacher model) for upstream terms.

Citation

bibtex

@misc{distil-qwen3-1.7b-customer-support-deferral,
  author = {Distil Labs},
  title = {Distil-Qwen3-1.7B-Customer-Support-Deferral: A Fine-tuned SLM for Airline Support with Model Deferral},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/distil-labs/distil-qwen3-1.7b-customer-support-deferral}
}
