slevinw/Harmonic-9B API & Inference Endpoint

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded — balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

Support on Ko-fi

Training Approach

Pipeline

799 curated rows. That's it. A small, precisely curated dataset instead of tens of thousands of unfiltered examples. The base model already has the knowledge from pretraining - the fine-tune teaches it a reasoning behavior pattern.

Every training row contains explicit self-correction ("wait, that's not right"), verification ("let me check by plugging back in"), and multi-path exploration ("alternatively, I could try..."). The data was generated from multiple frontier models and filtered through a custom structural quality pipeline that enforces reasoning depth, coherence, and flow patterns. 100% of rows pass all quality gates simultaneously.

Training Data Quality

Training Quality

The reasoning data was curated using a custom structural process supervision pipeline. Key metrics:

Metric	Value
Signal quality score	78.7 mean (61.5 min, 90.0 max)
Thinking trace depth	1,667 words average
Self-correction	100% of rows (17.2 per row avg)
Verification	100% of rows (10.3 per row avg)
Exploration	100% of rows (6.3 per row avg)
Quality gate pass rate	100%

Every row was scored across multiple structural dimensions and only rows passing all thresholds simultaneously were included. No rows were manually curated - the pipeline is fully automated and reproducible.

How It Compares

Competitor Comparison

We ran our structural quality analysis against every major public reasoning dataset used for Opus/Qwen distillation. The results:

Dataset	Rows	Think Words	Self-Correction	Verification	Exploration	Signal Score	Gate Pass
Harmonic (ours)	799	1,667	100%	100%	100%	78.7	100%
Crownelius/Opus-3300x	2,160	188	5.9%	22.6%	5.2%	28.0	0.1%
nohurry/Opus-Filtered	2,326	191	6.7%	24.1%	5.3%	28.5	0.1%
TeichAI/Opus-250x	250	323	17.2%	26.8%	6.8%	24.6	0.4%
Jackrong/Qwen-700x	633	6,653	97.5%	97.6%	69.8%	75.6	22.7%
Bespoke-Stratos-17k	16,710	1,322	88.2%	72.7%	59.7%	71.7	49.0%
glaiveai/reasoning-20m	22M+	799	64.1%	41.4%	37.3%	46.2	12.8%
KingNish/reasoning-20k	19,944	132	0.7%	4.2%	4.3%	27.4	0.0%

The popular Opus distillation datasets (Crownelius, nohurry, TeichAI) have less than 1% quality gate pass rate. Their thinking traces average under 200 words with near-zero self-correction. Models trained on this data learn to produce short, shallow chain-of-thought that looks like reasoning but lacks the structural behaviors that make reasoning reliable.

Jackrong and Stratos are closer competitors but still fall short on consistency. Jackrong has massive traces (6,653 words avg) but only 22.7% pass the quality gate - the thinking is verbose but wanders. Stratos has decent markers but 49% of rows still fail, meaning half the gradient updates during training push the model toward shallow patterns.

Harmonic's data is smaller by design. Every row passes. Every gradient update reinforces genuine reasoning behavior.

Reasoning Flow

Marker density measured across 20 equal segments of each thinking trace. The characteristic curve shows reasoning intensity building through the middle of the trace and peaking in the later segments as the model enters verification and self-correction before committing to an answer.

Training Configuration

markdown
base_model: Qwen/Qwen3.5-9B
dataset: 799 curated reasoning rows
epochs: 1
learning_rate: 1e-4
lr_scheduler: cosine
warmup_ratio: 0.1
max_seq_length: 8192
lora_rank: 32
lora_alpha: 32
dropout: 0.05
micro_batch_size: 1
gradient_accumulation_steps: 4
weight_decay: 0.01

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("DJLougen/Harmonic-9B")
tokenizer = AutoTokenizer.from_pretrained("DJLougen/Harmonic-9B")

Reasoning format

The model uses <think> blocks for reasoning:

markdown
<think>
The user is asking about X. Let me consider two approaches...

Approach 1: ...
Approach 2: ...

I'll go with Approach 1 because...

Wait, I need to be careful here - this assumes Y, which may not hold.
Let me verify by checking a special case...

Yes, that confirms the result.
</think>

[Final answer here]

Intended Use

Reasoning tasks requiring genuine multi-step thinking
Mathematical problem-solving with self-correction
Code analysis and generation with structured verification
General conversation (conversational ability preserved through training design)
Base model for Stage 2 agentic fine-tuning

Limitations

9B parameter model - not suitable for tasks requiring extensive world knowledge
Reasoning traces can be verbose for simple questions
Not optimized for tool calling - see Harmonic-Hermes-9B (coming soon) for agentic use
Benchmark evaluation is ongoing

Architecture

Base: Qwen 3.5 9B (9.65B parameters)
Training: LoRA fine-tuning, merged into base weights
Precision: BF16
Context: 8192 tokens

License

Apache 2.0 - same as the base model. All training data is from Apache 2.0 or MIT licensed sources. Fully commercial use permitted.

Harmonic-9B

Get help setting up a custom Dedicated Endpoints.

README