Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
✨ Highlights
- 9B parameters, distilled from frontier teachers.
- FP8 quantized weights — ~13 GB on disk, fits comfortably on a single 24 GB GPU.
- ~200K context with KV-cache on a 24 GB GPU (tested on
vllm==0.20.2). - Optimized for agentic coding loops: long tool-call chains, file I/O, shell, and code-edit tools.
- Recommended GPU: single 24 GB card (RTX 4090, RTX 4000 BLACKWELL, RTX 4500 Ada, etc.).
📟 Serving with vLLM
bash
# install vllm >= 0.20.2, see: https://vllm.ai/vllm serve "ytgui/Qwen3.5-Sonnet-9B" \--port=8000 \--host=localhost \--max-model-len='128K' \--reasoning-parser=qwen3 \--enable-auto-tool-choice \--tool-call-parser=qwen3_coder \--gpu-memory-utilization=0.95
🗜️ GGUF Model
The GGUF model is available at: 👉 Qwen3.5-Sonnet-9B-GGUF
Multiple quantization levels are provided for use with llama.cpp and compatible runtimes.
🤖 System Prompt
- We've set
You are a helpful AI assistant.as default system prompt for general (non-coding) conversations. You may alter this behavior in your settings.
🧪 Distillation Recipe
Teacher mixture
The post-training corpus is a curated mixture from multiple frontier teachers, each chosen for what it does best:
| Teacher | Role in the mixture |
|---|---|
claude-opus-4.6 | General chain-of-thought reasoning |
deepseek-v4 | Tool-call traces (tool calls, LLM-as-judge) |
minimax-m2.7 | Tool-call traces (multi-tool orchestration) |
Training method
- Supervised Fine-Tuning (SFT) on the distilled trajectories.
- Offline Reinforcement Learning on preference and outcome-labeled rollouts (successful vs. failed tool calls, completed vs. aborted sessions).
What is trained, what is frozen
To preserve the base model's pretrained knowledge and tokenizer alignment:
- Frozen: vision encoder,
lm_head, and token embeddings. - Trained: transformer backbone parameters only.
Training framework
A custom training stack built on:
torchlightningtransformers
The framework supports mixed SFT + offline-RL objectives, gradient checkpointing, and FP8 weight casting at the end of post-training.
🛠️ Agentic Coding — Goals & Behavior
The distillation objective explicitly targets agent reliability, not just benchmark scores:
- Fewer malformed tool calls (schema, JSON, argument errors).
- Better recovery after a failed tool invocation.
- Longer stable trajectories without collapse, repetition, or premature termination.
Long-running session screenshots
The screenshots below show the model running continuously for up to 10 minutes inside
opencodeandclaude-codewithout interruption or tool call failure.
- claude-code session: ask for locate "multi-head attention implementation" in pytorch project

- claude-code session: ask for "understand project layout" in sqlite project

- opencode session: ask for "explain terminologies" in pgvector project

Multilingual behavior
As a result of post-training alignment, the model sometimes performs its internal reasoning in English and then produces the final response in the user’s language.
- The figure illustrates a Chinese query asking whether coffee is considered a type of soy milk.

Agentic Coding Benchmarks
Our findings show that the following benchmarks from BenchLocal are strongly correlated with agentic coding performance. Configuration: 3x runs, no retry. Notably, this model approaches or exceeds models 3x its size in agentic coding tasks:
| Bench | Score |
|---|---|
| ToolCall-15 | 97 |
| BugFind-15 | 86 |
| StructOutput-15 | 92 |
| HermesAgent-20 | 19 |
- We recommand armand0e/Qwen3.5-9B-Agent and Jackrong/Qwopus3.5-9B-Coder if you're looking for local LLMs for HermesAgent.
⚠️ Limitations
- Vision encoder is preserved but not the focus of this post-training; multimodal performance is inherited from the base model.
- Distilled behavior reflects the teacher mixture and may exhibit teacher-specific stylistic patterns.
- The model may not strictly adhere to
PLAN MODE. This happens because the training pipeline did not account for this specific scenario. Workaround: Add a strict instruction in CLAUDE.md or AGENTS.md to constrain the model’s behavior. For example:Important: When operating in PLAN MODE, you must not edit files or make any changes..
Infinite Reasoning Loops
- Like other ~9B models, complex questions beyond the model's capacity can trigger infinite reasoning loops where the model continuously doubts itself and never reaches a conclusion, running until the context limit is hit.
- A simple way to reproduce this is to ask the model to find a bug in a large, high-quality codebase such as the Linux kernel or SQLite.
- To mitigate this, try increasing temperature and/or repetition_penalty at inference time.
Model provider
ytgui
Model tree
Base
Qwen/Qwen3.5-9B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information