empero-ai

Qwable-9B-Claude-Fable-5

License: apache-2.0

Model details

Developed by: Empero
Base model: Qwen3.5-9B, a dense, natively multimodal model with a hybrid attention stack (Gated DeltaNet linear-attention layers and gated full-attention layers in a 3:1 ratio), a ~152k-token vocabulary, and a long native context.
Fine-tune type: full parameter (all text-backbone weights trained). The vision tower was frozen — training was text-only, so vision behavior is inherited from the base and was not tuned or tested.
Objective: supervised fine-tuning, assistant-only loss (the model is scored only on the assistant/completion tokens; prompts are masked out).
Languages: primarily English.
License: apache-2.0, inherited from the base weights — but see the data-provenance caveat below.

Training data

| Source | Role | Approx. examples (after holdout) |
| --- | --- | --- |
| `Glint-Research/Fable-5-traces` | Claude Fable 5 reasoning + coding traces (`context` → `completion`) | ~4,585 |
| `Roman1111111/gpt5.5-terminal` | GPT-5.5 terminal/agent task solutions (`system` + `prompt` → `solution`) | ~111 |

Both sources were normalized to a single chat format (user/assistant, with an optional system turn for the terminal tasks) and concatenated. The natural mix is heavily skewed toward Fable traces (~97%); no re-weighting was applied to the training set.
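The normalization described above can be sketched roughly as follows. The field names (`context`, `completion`, `system`, `prompt`, `solution`) come from the table; the function names are hypothetical, and treating each Fable `context` as a single user turn is an assumption, since the actual preprocessing script is not shown here.

```python
from typing import Dict, List

Message = Dict[str, str]


def fable_to_messages(record: Dict[str, str]) -> List[Message]:
    """Map a Fable trace (`context` -> `completion`) to user/assistant turns.

    Assumption: the whole `context` is treated as one user turn.
    """
    return [
        {"role": "user", "content": record["context"]},
        {"role": "assistant", "content": record["completion"]},
    ]


def terminal_to_messages(record: Dict[str, str]) -> List[Message]:
    """Map a terminal task (`system` + `prompt` -> `solution`) to chat turns."""
    messages: List[Message] = []
    if record.get("system"):  # the system turn is optional
        messages.append({"role": "system", "content": record["system"]})
    messages.append({"role": "user", "content": record["prompt"]})
    messages.append({"role": "assistant", "content": record["solution"]})
    return messages


def build_training_set(fable_rows, terminal_rows):
    """Concatenate both sources in one chat format, with no re-weighting."""
    rows = [{"messages": fable_to_messages(r)} for r in fable_rows]
    rows += [{"messages": terminal_to_messages(r)} for r in terminal_rows]
    return rows
```

Keeping every example under a single `messages` key is one common convention for chat-style SFT data; the real dataset layout may differ.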

Held-out eval split: 100 examples were withheld from training — deliberately composed 80% Fable / 20% terminal so the held-out loss carries signal on both task types rather than being dominated by Fable.
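A fixed-composition holdout like this one could be drawn along the following lines. This is a sketch, not the script used for this model: `split_holdout`, its parameters, and the seed are hypothetical, and the pre-holdout sizes in the usage note are derived from the approximate counts in the table.

```python
import random


def split_holdout(fable_rows, terminal_rows, n_eval=100, fable_share=0.8, seed=0):
    """Withhold n_eval examples with a fixed per-source composition."""
    rng = random.Random(seed)
    n_fable = round(n_eval * fable_share)
    n_terminal = n_eval - n_fable

    # Sample indices per source so the eval mix is set by design,
    # not by the (heavily skewed) natural mix.
    fable_idx = set(rng.sample(range(len(fable_rows)), n_fable))
    terminal_idx = set(rng.sample(range(len(terminal_rows)), n_terminal))

    eval_rows = [r for i, r in enumerate(fable_rows) if i in fable_idx]
    eval_rows += [r for i, r in enumerate(terminal_rows) if i in terminal_idx]
    train_rows = [r for i, r in enumerate(fable_rows) if i not in fable_idx]
    train_rows += [r for i, r in enumerate(terminal_rows) if i not in terminal_idx]
    return train_rows, eval_rows
```

With roughly 4,665 Fable and 131 terminal examples before the split, this leaves about 4,585 and 111 for training, matching the table.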

Training procedure

Full-parameter supervised fine-tuning with TRL, using:

Full-length traces, zero truncation (max_length = 76,800) — even the longest multi-turn traces (~74k tokens) are trained in full.
Assistant-only loss — the model is scored only on assistant/completion tokens; prompt tokens are masked.
Chunked cross-entropy for memory-efficient long-context training.
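To make the last two points concrete, here is a dependency-free sketch of assistant-only label masking combined with a chunked negative log-likelihood. It is illustrative only: the actual run used TRL's implementation, the function names are hypothetical, `-100` is the conventional ignore label, and `labels` are assumed to be already aligned with the hidden states (the next-token shift is omitted).

```python
import math

IGNORE_INDEX = -100  # conventional "skip this position" label


def mask_labels(token_ids, assistant_mask):
    """Keep labels only where the token belongs to an assistant turn."""
    return [t if keep else IGNORE_INDEX for t, keep in zip(token_ids, assistant_mask)]


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def chunked_nll(hidden, lm_head, labels, chunk_size=2):
    """Mean NLL over unmasked positions, projecting to the vocabulary
    one chunk at a time so only chunk_size x vocab logits exist at once."""
    total, count = 0.0, 0
    for start in range(0, len(labels), chunk_size):
        stop = start + chunk_size
        chunk_logits = [
            [sum(a * b for a, b in zip(h, w)) for w in lm_head]
            for h in hidden[start:stop]
        ]
        for row, label in zip(chunk_logits, labels[start:stop]):
            if label == IGNORE_INDEX:
                continue  # prompt token: contributes nothing to the loss
            total -= log_softmax(row)[label]
            count += 1
    return total / max(count, 1)
```

The chunk size changes peak memory but not the result: the loss is the same sum over unmasked positions however it is sliced, which matters when a single sequence can reach ~74k tokens against a ~152k vocabulary.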

| Hyperparameter | Value |
| --- | --- |
| Epochs | 2 |
| Effective batch size | 16 |
| Max sequence length | 76,800 (no truncation) |
| Learning rate | 1e-5 (cosine, 3% warmup) |
| Optimizer | AdamW (8-bit) |
| Precision | bf16 |
| Loss | chunked NLL, assistant-only |
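As a rough cross-check of these settings, the step count and learning-rate curve can be worked out from the numbers in this card. The sketch below assumes linear warmup followed by cosine decay to zero, one optimizer step per 16 examples, and the approximate example counts from the data table; the exact schedule and step count of the real run may differ, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-5, warmup_ratio=0.03):
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    warmup_steps = max(1, math.ceil(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


train_examples = 4585 + 111                        # approximate, from the data table
steps_per_epoch = math.ceil(train_examples / 16)   # effective batch size 16
total_steps = 2 * steps_per_epoch                  # 2 epochs

for step in (0, 18, steps_per_epoch, total_steps):
    print(step, lr_at(step, total_steps))
```

Under these assumptions an epoch is about 294 steps, which lines up with the evaluation table marking step 300 as roughly the end of epoch 1.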

Evaluation

Training quality was tracked via held-out validation loss and token accuracy on the 100-example split, supplemented by a qualitative generation review (below). A full suite of coding, agentic, and safety benchmarks is in progress and will be published here. Validation ran periodically during training:

| Step | eval loss | eval token-acc |
| --- | --- | --- |
| 100 | 0.743 | 0.784 |
| 200 | 0.722 | 0.789 |
| 300 (≈ epoch 1) | 0.714 | 0.791 |
| 400 | 0.7135 | 0.791 |
| 500 | 0.713 | 0.791 |

No overfitting observed. Held-out loss decreased monotonically and then plateaued (~0.71) through the second epoch — it never rose, even as train loss fell to ~0.64. Epoch-1 and final (epoch-2) checkpoints generalize equivalently on held-out data.
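The "never rose" claim can be checked directly against the reported numbers. The snippet below also converts each loss to a per-token perplexity, assuming the eval loss is a mean negative log-likelihood in nats (the usual convention, but not stated explicitly here); `never_rises` is a hypothetical helper.

```python
import math

# Held-out eval loss by training step, copied from the table above.
eval_loss = {100: 0.743, 200: 0.722, 300: 0.714, 400: 0.7135, 500: 0.713}


def never_rises(losses, tol=0.0):
    """True if each value is no higher than the one before it."""
    return all(later <= earlier + tol for earlier, later in zip(losses, losses[1:]))


ordered = [eval_loss[step] for step in sorted(eval_loss)]
perplexity = {step: math.exp(loss) for step, loss in eval_loss.items()}

print(never_rises(ordered))
print({step: round(ppl, 2) for step, ppl in perplexity.items()})
```

On these figures the perplexity moves from about 2.10 at step 100 to about 2.04 at step 500, with nearly all of the gain arriving in the first epoch.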

Note: token-accuracy is teacher-forced, per-token next-token accuracy over completion tokens only. It is not end-to-end correctness and tends to read high on consistent-style distillation data.
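For clarity on what this metric counts, a minimal sketch follows. It assumes `predicted_ids` are the argmax predictions under teacher forcing, already aligned with `labels`, and that masked (prompt) positions carry the conventional `-100` label; `token_accuracy` is a hypothetical helper, not the trainer's own function.

```python
IGNORE_INDEX = -100  # label used for masked (prompt) positions


def token_accuracy(predicted_ids, labels):
    """Fraction of unmasked positions where the predicted token matches."""
    scored = [(p, y) for p, y in zip(predicted_ids, labels) if y != IGNORE_INDEX]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)


# Two prompt positions are ignored; 3 of the 4 completion tokens match.
print(token_accuracy([9, 9, 1, 2, 3, 4], [-100, -100, 1, 2, 3, 5]))
```

The example shows why the number is not end-to-end correctness: a completion can match most tokens and still get the one token that matters (say, the final value) wrong.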

Qualitative generation review

34 prompts spanning coding, terminal/agentic tasks, reasoning, explanation, instruction-following, and honesty/calibration probes were run against the final checkpoint using Qwen3.5's recommended sampling settings. Full unedited transcripts are in sample_generations.md.

Strengths. Coding and terminal/agentic prompts were the strongest — correct, idiomatic solutions using current tooling (e.g. ss over netstat, git-filter-repo, Argon2id) with security-aware judgment (rotating a leaked key first, constant-time comparison, generic auth errors). Reasoning, instruction/format following, and calibration probes were handled well. Roughly 27 of 34 responses were clean and correct.

The model is a reasoning model: every answer begins with a <think> block followed by the final response — downstream consumers should parse out and strip the <think>...</think> span. See Limitations for usage tips.
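One way to separate the reasoning from the final answer is sketched below. `split_reasoning` is a hypothetical helper; it also handles two cases that are assumptions rather than documented behavior: output where the opening `<think>` tag is missing (some chat templates place it in the prompt), and output cut off before `</think>` by the token limit.

```python
def split_reasoning(text):
    """Return (reasoning, answer) from a decoded model response."""
    head, closing_tag, tail = text.partition("</think>")
    if closing_tag:
        # Normal case: everything before </think> is reasoning.
        return head.replace("<think>", "", 1).strip(), tail.strip()
    if "<think>" in text:
        # Reasoning never terminated (e.g. max_new_tokens was reached),
        # so there is no final answer to show.
        return text.split("<think>", 1)[1].strip(), ""
    # No reasoning markers at all: treat the whole text as the answer.
    return "", text.strip()


reasoning, answer = split_reasoning("<think>Merge with two pointers.</think>\nUse heapq.merge.")
print(answer)
```

Returning an empty answer for truncated reasoning lets the caller decide whether to retry with a larger `max_new_tokens` instead of showing a half-finished thought to the end user.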

How to use

The base is a multimodal (image-text-to-text) architecture; for text-only use load it with AutoModelForImageTextToText. Build the prompt with tokenize=False and then tokenize the string (the recommended path for this tokenizer):

```python
import torch
from transformers import AutoModelForImageTextToText, AutoTokenizer

model_id = "empero-ai/Qwable-9B-Claude-Fable-5"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that merges two sorted lists."}]
# Render the chat template to a string first, then tokenize that string.
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    temperature=0.7, top_p=0.95, top_k=20, repetition_penalty=1.05,
)
# Output begins with a <think>...</think> reasoning block, then the final answer.
# Slice off the prompt tokens so only the newly generated text is decoded.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

repetition_penalty=1.05 is a small deviation from Qwen's default (1.0) that prevents rare non-terminating reasoning loops; allow generous max_new_tokens since the model reasons before answering.

Requirements: a recent transformers (Qwen3.5 support) plus the Gated DeltaNet kernels (flash-linear-attention and a CUDA-matched causal_conv1d build) — without them the linear-attention layers fall back to slow, memory-hungry PyTorch ops.
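A quick pre-flight check for the optional kernels might look like the sketch below. The import names are assumptions (flash-linear-attention is assumed to import as `fla`, and the causal-conv1d build as `causal_conv1d`), and the check only confirms the packages are installed, not that the build matches your CUDA version.

```python
import importlib.util

# Assumed import names for the two kernel packages; adjust if yours differ.
FAST_PATH_MODULES = {
    "flash-linear-attention": "fla",
    "causal_conv1d": "causal_conv1d",
}


def missing_fast_path_packages(modules=FAST_PATH_MODULES):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_fast_path_packages()
if missing:
    print("Linear-attention layers will use the slow PyTorch fallback; missing:", missing)
else:
    print("Gated DeltaNet kernel packages found.")
```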

Limitations

Qwable-9B-Claude-Fable-5 is a focused 9B model that shines on the coding, agentic, and reasoning tasks it was trained for. A few characteristics are worth knowing to get the best out of it:

It's a reasoning model. Each response opens with a <think> block before the final answer, so parse and strip the <think>...</think> span for end users. On open-ended or creative prompts it may reason at length — allow generous max_new_tokens and use repetition_penalty≈1.05 (as in the snippet above) for consistently crisp completions.
Strongest within its domain. Capability is concentrated in coding and agentic/tool-use tasks. For general-knowledge or long-form factual questions, treat specifics as you would any 9B model's — verify before relying on them, and don't expect knowledge of events outside the base model's training.
Reflects its base and teachers. As a distillation fine-tune of Qwen3.5-9B on Claude Fable 5 and GPT-5.5 traces, it carries the style and limits of those sources and received no extra safety tuning beyond the base model's. Add your own review/safety layer for production use.
Text-only fine-tune. The base is multimodal, but only the text path was trained (vision left untouched and not evaluated here).

These are normal considerations for a compact, domain-focused model rather than blockers. Used within its domain and with the sampling settings above, it is a capable coding/agentic assistant.

Provenance & licensing

The model weights are released under Apache-2.0, inherited from the Qwen3.5-9B base. The fine-tuning data comes from generated traces of Claude Fable 5 and GPT-5.5 (via the linked public datasets). Because those traces originate from third-party assistants, the providers' terms may apply to downstream training and distillation — so if you plan to build on this model commercially, it's worth confirming your use aligns with those terms. Shared with the community for research and experimentation, as-is.

Support / Donate

If this model helped you, consider supporting the project:

BTC: bc1qx6zepu6sfkvshgdmc4ewu6pk6rpadvpgffpp7v
LTC: ltc1qv2mefzps2vtjcpwfx8xxdrpplrcvltswm68r7x
XMR: 42Dbm5xg5Nq26fdyzfEU7KBnAJfhi7Cvz5J2ex5CzHXkfKuNEJzYCcmJ1GTbgjFZ5MBx72sdG1G9239Cd6rsZfv4QeDkYJY

Acknowledgements

Developed and released by Empero
Base model: Qwen3.5-9B (Alibaba Qwen team)
Datasets: Glint-Research/Fable-5-traces, Roman1111111/gpt5.5-terminal
Training: TRL + Transformers
