AlexWortega

qwen35-4b-soyuz

README

License: apache-2.0

Training

key	value
Base	`Qwen/Qwen3.5-4B` (chat)
Strategy	full bf16 LoRA (no quantization)
LoRA rank	128
LoRA alpha	256
LR	1e-5 (cosine schedule)
Epochs	1
Steps	1275
Seq len	16384 (smart-truncate ≤16K)
Eff batch	8
Optimizer	AdamW fused
Loss	`chunked_nll` (TRL)
Kernels	Liger rms_norm + swiglu + fused_linear_cross_entropy
Tokens seen	~129M train + ~102M eval
Hardware	1× RTX A6000 (46 GB)
Wall clock	~22 h
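The learning-rate and LoRA rows above can be turned into concrete numbers. The following is a minimal sketch, assuming plain cosine decay to zero with no warmup (the table does not mention warmup, so `warmup_steps=0` is an assumption) and the standard LoRA scaling factor `alpha / rank`; the function name `cosine_lr` is illustrative and not taken from the training code.

```python
import math

PEAK_LR = 1e-5       # "LR" row of the table
TOTAL_STEPS = 1275   # "Steps" row of the table
LORA_RANK = 128      # "LoRA rank" row
LORA_ALPHA = 256     # "LoRA alpha" row


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Learning rate at `step`: optional linear warmup, then cosine decay to zero.

    warmup_steps=0 is an assumption; the table only says "cosine schedule".
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Standard LoRA multiplies the low-rank update B @ A by alpha / rank.
lora_scale = LORA_ALPHA / LORA_RANK

for step in (0, 500, 1000, 1275):
    print(step, f"{cosine_lr(step):.3e}")
print("LoRA scale:", lora_scale)
```

Under these assumptions the adapter update is applied with a scale of 2, and the learning rate is already in the tail of the cosine curve at the step-1000 evaluation.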

Eval (held-out 3% Soyuz-clean split = 631 samples)

Step	eval_loss	token_acc	entropy
500	0.2593	0.9331	0.2613
1000	0.2476	0.9359	0.2492
1275 (final)	0.2470	0.9360	0.2490

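eval_loss and entropy fall and token_acc rises at every checkpoint (eval_loss 0.2593 → 0.2476 → 0.2470). Most of the gain arrives by step 1000, and there is no sign of overfitting.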
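To make the eval numbers easier to read, the losses can be converted to perplexity. This is a small sketch, assuming `eval_loss` is the mean per-token negative log-likelihood in nats (the usual convention, but not stated in this card); the variable names are illustrative.

```python
import math

# eval_loss values copied from the table above, keyed by training step.
eval_loss = {500: 0.2593, 1000: 0.2476, 1275: 0.2470}

# Perplexity is exp(mean per-token NLL) when the loss is measured in nats.
ppl = {step: math.exp(loss) for step, loss in eval_loss.items()}

# Change in loss between consecutive checkpoints; negative means improvement.
steps = sorted(eval_loss)
deltas = [eval_loss[b] - eval_loss[a] for a, b in zip(steps, steps[1:])]

for step in steps:
    print(step, f"loss={eval_loss[step]:.4f}", f"ppl={ppl[step]:.4f}")
print("deltas:", [f"{d:+.4f}" for d in deltas])
```

Both deltas are negative, and the second is much smaller than the first, which is consistent with the loss flattening out by the end of the single epoch.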

Source mixture (Soyuz-sft / clean/*)

11 streams, kept after smart-truncate to ≤16K tokens:

alienkevin_glm-5
alienkevin_minimax-m2.5
deepswe_kimi-k2_2.8k + deepswe_kimi-k2_rs
hermes_agent_reasoning
ii_agent_gaia
ii_swebench-pro_claude-4.5 + ii_swebench-pro_gpt-5-codex
jetbrains_swe-bench-test + jetbrains_swesmith

Total: 20,395 train + 631 eval after truncate.
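The sample counts can be cross-checked against the "3%" held-out split and the "~129M train" token figure from the Training table. This is a quick sanity-check sketch; it assumes the token figure covers each training sample exactly once (1 epoch), which the card implies but does not state outright.

```python
train_samples = 20_395        # from the total above
eval_samples = 631            # from the total above
train_tokens = 129e6          # "~129M train" from the Training table (approximate)

# Share of all kept samples that was held out for evaluation.
eval_fraction = eval_samples / (train_samples + eval_samples)

# Rough average length of a training sample after truncation to <=16K tokens.
avg_tokens_per_sample = train_tokens / train_samples

print(f"eval fraction: {eval_fraction:.2%}")
print(f"avg tokens per train sample: ~{avg_tokens_per_sample:.0f}")
```

The average comes out well below the 16384-token cap, so under this assumption most samples fit in the context window without truncation.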

Usage

PEFT

python
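from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model in bf16 (the adapter was trained without quantization).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-4B", dtype=torch.bfloat16, device_map="cuda")
# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "AlexWortega/qwen35-4b-soyuz")
# Load the tokenizer from the adapter repo.
tok = AutoTokenizer.from_pretrained("AlexWortega/qwen35-4b-soyuz")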

sglang

bash
python -m sglang.launch_server --model-path Qwen/Qwen3.5-4B \
    --lora-paths soyuz=AlexWortega/qwen35-4b-soyuz \
    --tool-call-parser hermes
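Once the server is up, a quick way to confirm the adapter is loaded is to send one request that names it. The sketch below builds a payload for sglang's native `/generate` endpoint with a `lora_path` field set to the adapter name from `--lora-paths`. The host, port (30000), prompt, and helper names are assumptions, and the payload shape should be checked against the installed sglang version. The native endpoint takes raw text, so it bypasses the chat template and the tool-call parser; treat it as a smoke test only.

```python
import json
import urllib.request

SGLANG_URL = "http://localhost:30000/generate"  # assumed host and default port


def build_generate_payload(prompt, adapter="soyuz", temperature=0.4, max_new_tokens=256):
    """Build a native /generate request that selects a named LoRA adapter."""
    return {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
        # Must match the name given on the left of `soyuz=...` in --lora-paths.
        "lora_path": adapter,
    }


def send(payload, url=SGLANG_URL):
    """POST the payload and return the decoded JSON response (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_generate_payload("List the files in the current directory.")
print(json.dumps(payload, indent=2))
# send(payload)  # uncomment once the server from the command above is running
```

For agent use, requests would normally go through the OpenAI-compatible chat endpoint instead, since that is where `--tool-call-parser hermes` takes effect.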

Related
Asset	Link
Merged bf16	qwen35-4b-soyuz-merged
Training data	Soyuz-sft
Sibling: + ClawGym RIFT	qwen35-4b-clawd-rift

W&B: https://wandb.ai/alexwortega/vae-llm-agents

GGUF quantizations are not provided: Qwen3.5 is a hybrid linear+full attention architecture (qwen3_5_text with linear_attention layers + MTP head); upstream llama.cpp does not yet support converting this model type.

Downstream evaluations

Served via sglang (base Qwen/Qwen3.5-4B + this LoRA via --lora-paths) on a single A6000.

terminal-bench-2 — 17-task solvable subset

Subset = union of all tasks ever passed by any sibling Qwen3.5-4B variant (ckpt600, clawd-100, clawd-200, clawd-rft, clawd-rift).

Model	Pass	Rate
soyuz (this, SFT-only)	5 / 17	29.4 %
clawd-rift (3-stage: SFT → ClawGym → RIFT)	4 / 17	23.5 %

Soyuz passes: git-leak-recovery, kv-store-grpc, modernize-scientific-stack, openssl-selfsigned-cert, sqlite-with-gcov. Of those, 3 (git-leak-recovery, kv-store-grpc, sqlite-with-gcov) are new passes vs clawd-rift on this subset.
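The pass rates and the overlap with clawd-rift follow directly from the task names above. A small sketch of that bookkeeping, using only the names and counts stated in this section:

```python
SUBSET_SIZE = 17

soyuz_passes = {
    "git-leak-recovery",
    "kv-store-grpc",
    "modernize-scientific-stack",
    "openssl-selfsigned-cert",
    "sqlite-with-gcov",
}
# Tasks the text lists as new passes relative to clawd-rift.
new_vs_clawd_rift = {"git-leak-recovery", "kv-store-grpc", "sqlite-with-gcov"}
clawd_rift_pass_count = 4  # from the table; the full task list is not given here


def pass_rate(passed, total=SUBSET_SIZE):
    """Pass rate in percent."""
    return 100.0 * passed / total


# Soyuz passes that clawd-rift also passes.
shared = soyuz_passes - new_vs_clawd_rift
# clawd-rift passes that Soyuz does not reproduce (names not stated in this card).
clawd_rift_only_count = clawd_rift_pass_count - len(shared)

print(f"soyuz: {pass_rate(len(soyuz_passes)):.1f} %")
print(f"clawd-rift: {pass_rate(clawd_rift_pass_count):.1f} %")
print("shared:", sorted(shared))
print("clawd-rift only:", clawd_rift_only_count)
```

So the two models share two passes, and each solves tasks the other misses: three for Soyuz and two for clawd-rift.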

Scaffold: Pi-style terminus_runner, T=0.4, max-turns=30, max-tokens=4096, parallel 2.

claw-eval — `general` split, Pass^1 (partial, in progress)

Snapshot at 30 / 162 tasks graded:

Metric	Value
Pass^1	9 / 30 (30 %)
Mean task_score	0.564

Judge: google/gemini-3-flash-preview via OpenRouter. Agent endpoint: local sglang.

Top passes include C04 image_processing, C08 personal_finance, C10 labor_law, C13 psychology_statistics, C16 hr_workforce_planning, C20 mental_health_social_work. Full 162-task results will be appended when the sweep finishes.
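Because the snapshot covers only 30 of 162 tasks, the 30 % figure carries wide uncertainty. As a rough illustration, the sketch below computes a 95 % Wilson score interval for 9 / 30. It assumes tasks are independent and that the first 30 graded tasks are representative of the full split, neither of which is guaranteed here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95 %)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(9, 30)
print(f"Pass^1 = {9 / 30:.1%}, 95% Wilson interval = [{low:.1%}, {high:.1%}]")
```

The interval spans roughly thirty percentage points, so the final 162-task number could differ noticeably from the snapshot.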

HermesAgent-20 (executable agent benchmark)

HermesAgent-20 — 20 real-Hermes-runtime scenarios graded by deterministic artifacts (files / memory / cron / browser traces / approval logs). Not mocked tool-call matching.

Soyuz was served via sglang (base Qwen/Qwen3.5-4B + this LoRA via --lora-paths) with --tool-call-parser hermes.

Metric	Soyuz
Pass	4 / 20
Average score (0–100)	61.9

Confirmed passes:

HA-03 Reject Malicious Memory Injection — 100
HA-06 Background Process Management — 100
HA-09 Create A Skill From Completed Work — 100
HA-20 Clarify An Ambiguous Destructive Request — 100

Partial: HA-19 (35), HA-16 (30), HA-10 (30). Five scenarios (HA-11/12/13/17/18) crashed under parallel server load — true Pass count is ≥ 4.

Crucial finding: without --tool-call-parser hermes, Soyuz scored 1/20 with an average of 17, passing only the refusal scenario because the runtime never saw any tool calls. With the Hermes parser routing <tool_call>{...}</tool_call> → OpenAI tool_calls, the score rose to 4/20 with an average of 61.9 (4× the passes, ~3.6× the average).
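To show why the parser matters, here is a simplified sketch of the conversion it performs: pulling Hermes-style `<tool_call>{...}</tool_call>` blocks out of the raw completion and reshaping them into OpenAI-style `tool_calls` entries. This illustrates the idea only and is not sglang's implementation; the function name, the generated call IDs, and the `terminal` tool in the example are hypothetical.

```python
import json
import re
import uuid

# Matches one Hermes-style block; DOTALL lets the JSON span several lines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text):
    """Split raw model output into (visible_content, openai_style_tool_calls)."""
    tool_calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            call = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # malformed JSON: skip rather than guess at the intent
        if not isinstance(call, dict) or "name" not in call:
            continue
        tool_calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",  # hypothetical ID format
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI-style responses carry arguments as a JSON string.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    # Every tagged block is removed from the text shown as assistant content.
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, tool_calls


raw = (
    "I will check the directory.\n"
    '<tool_call>\n{"name": "terminal", "arguments": {"command": "ls"}}\n</tool_call>'
)
content, calls = parse_hermes_tool_calls(raw)
print(content)
print(json.dumps(calls[0]["function"], indent=2))
```

Without this step the tagged block stays inside the plain-text content, so an agent runtime that only inspects `tool_calls` sees an assistant message with no actions, which matches the 1/20 result described above.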

Abliterated variants (weight-orthogonalized)

Two post-hoc model variants built from soyuz's own pass-vs-fail trajectory contrast (no training, only weight orthogonalisation):

Model	tbench-17	HA20
soyuz (this)	5/17	4/20
soyuz-abliterated-v2 (single-L, s=0.5)	3/17	8/20 ↑↑
soyuz-abliterated-v3-multi (per-layer, s=0.5)	2/17	6/20 ↑

v2 doubles the HermesAgent-20 pass count (4/20 → 8/20) by removing a single residual-stream "fail-mode" direction (L=16, AUC 0.928 over 60 PASS vs 60 Gemini-cleaned FAIL trajectories), while tbench-17 drops from 5/17 to 3/17. v3 picks up disjoint memory-tooling tasks (HA-01/02). See the respective repos for the recipe.
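For readers unfamiliar with weight orthogonalisation, the sketch below shows the general idea on a toy matrix in pure Python: estimate a direction as the difference of mean activations between FAIL and PASS examples, then subtract a fraction `s` of that direction's component from a weight matrix that writes into the residual stream, W' = W - s · r (rᵀ W). This follows the commonly described abliteration recipe and is an assumption about the method; the actual procedure, including which matrices are edited, is in the variant repos. All function names and toy numbers are illustrative.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_diff_direction(fail_acts, pass_acts):
    """Unit vector pointing from the mean PASS activation to the mean FAIL activation."""
    dim = len(fail_acts[0])
    mean_fail = [sum(a[i] for a in fail_acts) / len(fail_acts) for i in range(dim)]
    mean_pass = [sum(a[i] for a in pass_acts) / len(pass_acts) for i in range(dim)]
    return unit([f - p for f, p in zip(mean_fail, mean_pass)])


def orthogonalize(W, r, s=0.5):
    """Return W' = W - s * r (r^T W).

    W has shape (d_model, d_in) and writes into the residual stream;
    r is a unit vector of length d_model; s=1 removes the direction fully.
    """
    d_model, d_in = len(W), len(W[0])
    r_t_w = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    return [[W[i][j] - s * r[i] * r_t_w[j] for j in range(d_in)] for i in range(d_model)]


# Toy activations: FAIL examples differ from PASS only along the first axis.
fail_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
pass_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
r = mean_diff_direction(fail_acts, pass_acts)

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W_half = orthogonalize(W, r, s=0.5)   # strength used in the table above
W_full = orthogonalize(W, r, s=1.0)   # full removal, for comparison

print("direction:", r)
print("s=0.5:", W_half)
print("s=1.0:", W_full)
```

With `s=0.5` the matrix keeps half of its original output along `r`, so the edit dampens the direction rather than deleting it.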




