ytgui

Qwen3.5-Sonnet-9B

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

✨ Highlights

9B parameters, distilled from frontier teachers.
FP8 quantized weights — ~13 GB on disk, fits comfortably on a single 24 GB GPU.
~200K context with KV-cache on a 24 GB GPU (tested on vllm==0.20.2).
Optimized for agentic coding loops: long tool-call chains, file I/O, shell, and code-edit tools.
Recommended GPU: single 24 GB card (RTX 4090, RTX 4000 BLACKWELL, RTX 4500 Ada, etc.).

📟 Serving with vLLM

bash
# install vllm >= 0.20.2, see: https://vllm.ai/

vllm serve "ytgui/Qwen3.5-Sonnet-9B" \
	--port=8000  \
	--host=localhost   \
	--max-model-len='128K'  \
	--reasoning-parser=qwen3   \
	--enable-auto-tool-choice  \
	--tool-call-parser=qwen3_coder  \
	--gpu-memory-utilization=0.95

🗜️ GGUF Model

The GGUF model is available at: 👉 Qwen3.5-Sonnet-9B-GGUF

Multiple quantization levels are provided for use with llama.cpp and compatible runtimes.

🤖 System Prompt

We've set You are a helpful AI assistant. as default system prompt for general (non-coding) conversations. You may alter this behavior in your settings.

🧪 Distillation Recipe

Teacher mixture

The post-training corpus is a curated mixture from multiple frontier teachers, each chosen for what it does best:

Table with columns: Teacher, Role in the mixture
Teacher	Role in the mixture
`claude-opus-4.6`	General chain-of-thought reasoning
`deepseek-v4`	Tool-call traces (tool calls, LLM-as-judge)
`minimax-m2.7`	Tool-call traces (multi-tool orchestration)

Training method

Supervised Fine-Tuning (SFT) on the distilled trajectories.
Offline Reinforcement Learning on preference and outcome-labeled rollouts (successful vs. failed tool calls, completed vs. aborted sessions).

What is trained, what is frozen

To preserve the base model's pretrained knowledge and tokenizer alignment:

Frozen: vision encoder, lm_head, and token embeddings.
Trained: transformer backbone parameters only.

Training framework

A custom training stack built on:

torch
lightning
transformers

The framework supports mixed SFT + offline-RL objectives, gradient checkpointing, and FP8 weight casting at the end of post-training.

🛠️ Agentic Coding — Goals & Behavior

The distillation objective explicitly targets agent reliability, not just benchmark scores:

Fewer malformed tool calls (schema, JSON, argument errors).
Better recovery after a failed tool invocation.
Longer stable trajectories without collapse, repetition, or premature termination.

Long-running session screenshots

The screenshots below show the model running continuously for up to 10 minutes inside opencode and claude-code without interruption or tool call failure.

claude-code session: ask for locate "multi-head attention implementation" in pytorch project

Claude Code session — torch

claude-code session: ask for "understand project layout" in sqlite project

Claude Code session — sqlite

opencode session: ask for "explain terminologies" in pgvector project

OpenCode session + pgvector

Multilingual behavior

As a result of post-training alignment, the model sometimes performs its internal reasoning in English and then produces the final response in the user’s language.

The figure illustrates a Chinese query asking whether coffee is considered a type of soy milk.

Chinese

Agentic Coding Benchmarks

Our findings show that the following benchmarks from BenchLocal are strongly correlated with agentic coding performance. Configuration: 3x runs, no retry. Notably, this model approaches or exceeds models 3x its size in agentic coding tasks:

Table with columns: Bench, Score
Bench	Score
ToolCall-15	97
BugFind-15	86
StructOutput-15	92
HermesAgent-20	19

We recommand armand0e/Qwen3.5-9B-Agent and Jackrong/Qwopus3.5-9B-Coder if you're looking for local LLMs for HermesAgent.

⚠️ Limitations

Vision encoder is preserved but not the focus of this post-training; multimodal performance is inherited from the base model.
Distilled behavior reflects the teacher mixture and may exhibit teacher-specific stylistic patterns.
The model may not strictly adhere to PLAN MODE. This happens because the training pipeline did not account for this specific scenario. Workaround: Add a strict instruction in CLAUDE.md or AGENTS.md to constrain the model’s behavior. For example: Important: When operating in PLAN MODE, you must not edit files or make any changes..

Infinite Reasoning Loops

Like other ~9B models, complex questions beyond the model's capacity can trigger infinite reasoning loops where the model continuously doubts itself and never reaches a conclusion, running until the context limit is hit.
A simple way to reproduce this is to ask the model to find a bug in a large, high-quality codebase such as the Linux kernel or SQLite.
To mitigate this, try increasing temperature and/or repetition_penalty at inference time.

Model provider

ytgui

Model tree

Base

Qwen/Qwen3.5-9B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

✨ Highlights

9B parameters, distilled from frontier teachers.
FP8 quantized weights — ~13 GB on disk, fits comfortably on a single 24 GB GPU.
~200K context with KV-cache on a 24 GB GPU (tested on vllm==0.20.2).
Optimized for agentic coding loops: long tool-call chains, file I/O, shell, and code-edit tools.
Recommended GPU: single 24 GB card (RTX 4090, RTX 4000 BLACKWELL, RTX 4500 Ada, etc.).

📟 Serving with vLLM

bash
# install vllm >= 0.20.2, see: https://vllm.ai/

vllm serve "ytgui/Qwen3.5-Sonnet-9B" \
	--port=8000  \
	--host=localhost   \
	--max-model-len='128K'  \
	--reasoning-parser=qwen3   \
	--enable-auto-tool-choice  \
	--tool-call-parser=qwen3_coder  \
	--gpu-memory-utilization=0.95

🗜️ GGUF Model

The GGUF model is available at: 👉 Qwen3.5-Sonnet-9B-GGUF

Multiple quantization levels are provided for use with llama.cpp and compatible runtimes.

🤖 System Prompt

We've set You are a helpful AI assistant. as default system prompt for general (non-coding) conversations. You may alter this behavior in your settings.

🧪 Distillation Recipe

Teacher mixture

The post-training corpus is a curated mixture from multiple frontier teachers, each chosen for what it does best:

Table with columns: Teacher, Role in the mixture
Teacher	Role in the mixture
`claude-opus-4.6`	General chain-of-thought reasoning
`deepseek-v4`	Tool-call traces (tool calls, LLM-as-judge)
`minimax-m2.7`	Tool-call traces (multi-tool orchestration)

Training method

Supervised Fine-Tuning (SFT) on the distilled trajectories.
Offline Reinforcement Learning on preference and outcome-labeled rollouts (successful vs. failed tool calls, completed vs. aborted sessions).

What is trained, what is frozen

To preserve the base model's pretrained knowledge and tokenizer alignment:

Frozen: vision encoder, lm_head, and token embeddings.
Trained: transformer backbone parameters only.

Training framework

A custom training stack built on:

torch
lightning
transformers

The framework supports mixed SFT + offline-RL objectives, gradient checkpointing, and FP8 weight casting at the end of post-training.

🛠️ Agentic Coding — Goals & Behavior

The distillation objective explicitly targets agent reliability, not just benchmark scores:

Fewer malformed tool calls (schema, JSON, argument errors).
Better recovery after a failed tool invocation.
Longer stable trajectories without collapse, repetition, or premature termination.

Long-running session screenshots

The screenshots below show the model running continuously for up to 10 minutes inside opencode and claude-code without interruption or tool call failure.

claude-code session: ask for locate "multi-head attention implementation" in pytorch project

Claude Code session — torch

claude-code session: ask for "understand project layout" in sqlite project

Claude Code session — sqlite

opencode session: ask for "explain terminologies" in pgvector project

OpenCode session + pgvector

Multilingual behavior

As a result of post-training alignment, the model sometimes performs its internal reasoning in English and then produces the final response in the user’s language.

The figure illustrates a Chinese query asking whether coffee is considered a type of soy milk.

Chinese

Agentic Coding Benchmarks

Our findings show that the following benchmarks from BenchLocal are strongly correlated with agentic coding performance. Configuration: 3x runs, no retry. Notably, this model approaches or exceeds models 3x its size in agentic coding tasks:

Table with columns: Bench, Score
Bench	Score
ToolCall-15	97
BugFind-15	86
StructOutput-15	92
HermesAgent-20	19

We recommand armand0e/Qwen3.5-9B-Agent and Jackrong/Qwopus3.5-9B-Coder if you're looking for local LLMs for HermesAgent.

⚠️ Limitations

Vision encoder is preserved but not the focus of this post-training; multimodal performance is inherited from the base model.
Distilled behavior reflects the teacher mixture and may exhibit teacher-specific stylistic patterns.
The model may not strictly adhere to PLAN MODE. This happens because the training pipeline did not account for this specific scenario. Workaround: Add a strict instruction in CLAUDE.md or AGENTS.md to constrain the model’s behavior. For example: Important: When operating in PLAN MODE, you must not edit files or make any changes..

Infinite Reasoning Loops

Like other ~9B models, complex questions beyond the model's capacity can trigger infinite reasoning loops where the model continuously doubts itself and never reaches a conclusion, running until the context limit is hit.
A simple way to reproduce this is to ask the model to find a bug in a large, high-quality codebase such as the Linux kernel or SQLite.
To mitigate this, try increasing temperature and/or repetition_penalty at inference time.

Qwen3.5-Sonnet-9B

Get help setting up a custom Dedicated Endpoints.

README

✨ Highlights

📟 Serving with vLLM

🗜️ GGUF Model

🤖 System Prompt

🧪 Distillation Recipe

Teacher mixture

Training method

What is trained, what is frozen

Training framework

🛠️ Agentic Coding — Goals & Behavior

Long-running session screenshots

Multilingual behavior

Agentic Coding Benchmarks

⚠️ Limitations

Infinite Reasoning Loops

Explore FriendliAI today

README

✨ Highlights

📟 Serving with vLLM

🗜️ GGUF Model

🤖 System Prompt

🧪 Distillation Recipe

Teacher mixture

Training method

What is trained, what is frozen

Training framework

🛠️ Agentic Coding — Goals & Behavior

Long-running session screenshots

Multilingual behavior

Agentic Coding Benchmarks

⚠️ Limitations

Infinite Reasoning Loops