README

License: apache-2.0

Support This Work

I'm a PhD student in visual neuroscience at the University of Toronto who also happens to spend way too much time fine-tuning, merging, and quantizing open-weight models on rented H100s and a local DGX Spark. All training compute is self-funded, which means balancing GPU costs against a student budget. If my uploads have been useful to you, consider buying a PhD student a coffee. It goes a long way toward keeping these experiments running.

Support on Ko-fi


Training Approach

799 curated rows. That's it. A small, precisely curated dataset instead of tens of thousands of unfiltered examples. The base model already has the knowledge from pretraining - the fine-tune teaches it a reasoning behavior pattern.

Every training row contains explicit self-correction ("wait, that's not right"), verification ("let me check by plugging back in"), and multi-path exploration ("alternatively, I could try..."). The data was generated from multiple frontier models and filtered through a custom structural quality pipeline that enforces reasoning depth, coherence, and flow patterns. 100% of rows pass all quality gates simultaneously.
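
The three behaviors above can be detected with simple pattern matching. The following is a minimal sketch, not the actual pipeline: the phrase lists in `MARKERS` and the function `count_markers` are assumptions made for illustration, since the real pipeline's patterns are not described here.

```python
import re

# Hypothetical marker phrases, chosen from the examples quoted above.
# The real pipeline's patterns are not published in this card.
MARKERS = {
    "self_correction": [r"\bwait\b", r"that's not right", r"\bactually\b"],
    "verification": [r"let me (?:check|verify)", r"plugging back in"],
    "exploration": [r"\balternatively\b", r"another approach", r"i could try"],
}

def count_markers(trace: str) -> dict[str, int]:
    """Count case-insensitive pattern matches per marker category."""
    text = trace.lower()
    return {
        name: sum(len(re.findall(pattern, text)) for pattern in patterns)
        for name, patterns in MARKERS.items()
    }

trace = (
    "Wait, that's not right. Let me check by plugging back in. "
    "Alternatively, I could try factoring."
)
counts = count_markers(trace)
print(counts)  # {'self_correction': 2, 'verification': 2, 'exploration': 2}
```

A row would count as containing a behavior when its category count is at least one; per-row averages such as those reported below would then be means of these counts over the dataset.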

Training Data Quality

The reasoning data was curated using a custom structural process supervision pipeline. Key metrics:

| Metric | Value |
| --- | --- |
| Signal quality score | 78.7 mean (61.5 min, 90.0 max) |
| Thinking trace depth | 1,667 words average |
| Self-correction | 100% of rows (17.2 per row avg) |
| Verification | 100% of rows (10.3 per row avg) |
| Exploration | 100% of rows (6.3 per row avg) |
| Quality gate pass rate | 100% |

Every row was scored across multiple structural dimensions and only rows passing all thresholds simultaneously were included. No rows were manually curated - the pipeline is fully automated and reproducible.
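
An "all thresholds simultaneously" filter can be sketched as a conjunction over per-row scores. Everything named here is hypothetical: the `RowScores` fields, the threshold values in `GATES`, and the example rows are invented for illustration and are not the pipeline's real dimensions or cutoffs.

```python
from dataclasses import dataclass

@dataclass
class RowScores:
    signal_score: float
    think_words: int
    self_corrections: int
    verifications: int
    explorations: int

# Hypothetical minimums, for illustration only.
GATES = {
    "signal_score": 60.0,
    "think_words": 500,
    "self_corrections": 1,
    "verifications": 1,
    "explorations": 1,
}

def passes_all_gates(row: RowScores) -> bool:
    """A row is kept only if every score meets its minimum."""
    return all(getattr(row, name) >= minimum for name, minimum in GATES.items())

# Made-up example rows, not taken from any dataset.
rows = [
    RowScores(80.0, 1500, 15, 9, 6),  # passes every gate
    RowScores(75.0, 6000, 12, 9, 0),  # long trace, but no exploration
    RowScores(28.0, 190, 0, 1, 0),    # short and shallow
]
kept = [row for row in rows if passes_all_gates(row)]
pass_rate = len(kept) / len(rows)
print(f"kept {len(kept)} of {len(rows)} rows ({pass_rate:.1%})")
```

Because the gates are combined with `all`, a row that is strong on most dimensions is still dropped if it misses a single one, which is why a dataset can show high per-marker percentages and still have a low gate pass rate.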

How It Compares

We ran our structural quality analysis against every major public reasoning dataset used for Opus/Qwen distillation. The results:

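| Dataset | Rows | Think Words | Self-Correction | Verification | Exploration | Signal Score | Gate Pass |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Harmonic (ours) | 799 | 1,667 | 100% | 100% | 100% | 78.7 | 100% |
| Crownelius/Opus-3300x | 2,160 | 188 | 5.9% | 22.6% | 5.2% | 28.0 | 0.1% |
| nohurry/Opus-Filtered | 2,326 | 191 | 6.7% | 24.1% | 5.3% | 28.5 | 0.1% |
| TeichAI/Opus-250x | 250 | 323 | 17.2% | 26.8% | 6.8% | 24.6 | 0.4% |
| Jackrong/Qwen-700x | 633 | 6,653 | 97.5% | 97.6% | 69.8% | 75.6 | 22.7% |
| Bespoke-Stratos-17k | 16,710 | 1,322 | 88.2% | 72.7% | 59.7% | 71.7 | 49.0% |
| glaiveai/reasoning-20m | 22M+ | 799 | 64.1% | 41.4% | 37.3% | 46.2 | 12.8% |
| KingNish/reasoning-20k | 19,944 | 132 | 0.7% | 4.2% | 4.3% | 27.4 | 0.0% |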

The popular Opus distillation datasets (Crownelius, nohurry, TeichAI) have less than 1% quality gate pass rate. Their thinking traces average roughly 190 to 320 words, and self-correction appears in fewer than 18% of rows. Models trained on this data learn to produce short, shallow chain-of-thought that looks like reasoning but lacks the structural behaviors that make reasoning reliable.

Jackrong and Stratos are closer competitors but still fall short on consistency. Jackrong has massive traces (6,653 words avg) but only 22.7% pass the quality gate - the thinking is verbose but wanders. Stratos has decent markers but only 49% of rows pass the gate, so roughly half the gradient updates during training push the model toward shallow patterns.

Harmonic's data is smaller by design. Every row passes. Every gradient update reinforces genuine reasoning behavior.
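
To make the pass rates concrete, they can be converted into approximate row counts. This is a back-of-the-envelope sketch using the rounded percentages from the table, so the counts are estimates rather than exact figures; `split_by_gate` is a helper name introduced here.

```python
# (rows, gate pass rate in percent), copied from the comparison table.
datasets = {
    "Harmonic (ours)": (799, 100.0),
    "Jackrong/Qwen-700x": (633, 22.7),
    "Bespoke-Stratos-17k": (16_710, 49.0),
}

def split_by_gate(rows: int, pass_pct: float) -> tuple[int, int]:
    """Return (estimated passing rows, estimated failing rows)."""
    passing = round(rows * pass_pct / 100)
    return passing, rows - passing

for name, (rows, pass_pct) in datasets.items():
    passing, failing = split_by_gate(rows, pass_pct)
    print(f"{name}: ~{passing} pass, ~{failing} fail")
# Harmonic (ours): ~799 pass, ~0 fail
# Jackrong/Qwen-700x: ~144 pass, ~489 fail
# Bespoke-Stratos-17k: ~8188 pass, ~8522 fail
```

The point of the comparison is the failing share rather than the absolute passing count: in an unfiltered dataset, failing rows still contribute gradient updates.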

Reasoning Flow

Marker density is measured across 20 equal segments of each thinking trace. The characteristic curve shows reasoning intensity building through the middle of the trace and peaking in the later segments as the model enters verification and self-correction before committing to an answer.

Training Configuration

```yaml
base_model: Qwen/Qwen3.5-9B
dataset: 799 curated reasoning rows
epochs: 1
learning_rate: 1e-4
lr_scheduler: cosine
warmup_ratio: 0.1
max_seq_length: 8192
lora_rank: 32
lora_alpha: 32
dropout: 0.05
micro_batch_size: 1
gradient_accumulation_steps: 4
weight_decay: 0.01
```
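
A few derived quantities follow from this configuration. The sketch below assumes a single training device, no sequence packing, and the standard LoRA scaling factor `lora_alpha / lora_rank`; none of these is stated in the card, and trainers differ in how they round step counts, so treat the numbers as estimates.

```python
import math

# Values copied from the configuration above.
rows = 799
epochs = 1
micro_batch_size = 1
gradient_accumulation_steps = 4
warmup_ratio = 0.1
lora_rank = 32
lora_alpha = 32

num_devices = 1  # assumption: the card does not state the device count

effective_batch = micro_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(rows / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)
lora_scaling = lora_alpha / lora_rank  # assumes standard (non-rank-stabilized) LoRA

print(effective_batch)  # 4
print(total_steps)      # 200
print(warmup_steps)     # 20
print(lora_scaling)     # 1.0
```

Under these assumptions the whole fine-tune is about 200 optimizer steps, of which roughly the first 20 are learning-rate warmup before the cosine decay begins.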

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged fine-tuned weights and the matching tokenizer from the Hub.
model = AutoModelForCausalLM.from_pretrained("DJLougen/Harmonic-9B")
tokenizer = AutoTokenizer.from_pretrained("DJLougen/Harmonic-9B")
```

Reasoning format

The model uses <think> blocks for reasoning:

```text
<think>
The user is asking about X. Let me consider two approaches...
Approach 1: ...
Approach 2: ...
I'll go with Approach 1 because...
Wait, I need to be careful here - this assumes Y, which may not hold.
Let me verify by checking a special case...
Yes, that confirms the result.
</think>
[Final answer here]
```
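
When consuming output in this format, the reasoning usually needs to be separated from the final answer. The helper below is a minimal sketch; `split_reasoning` is a name introduced here, and it assumes the decoded text contains at most one complete `<think>...</think>` pair.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(output: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no think block is found."""
    match = THINK_RE.search(output)
    if match is None:
        return "", output.strip()
    reasoning = match.group(1).strip()
    answer = output[match.end():].strip()
    return reasoning, answer

output = "<think>\nLet me verify by checking a special case...\n</think>\nThe answer is 42."
reasoning, answer = split_reasoning(output)
print(answer)  # The answer is 42.
```

If the output is truncated before `</think>` is emitted, the function returns the whole text as the answer, so callers that care about that case should check for an unclosed `<think>` tag separately.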

Intended Use

  • Reasoning tasks requiring genuine multi-step thinking
  • Mathematical problem-solving with self-correction
  • Code analysis and generation with structured verification
  • General conversation (conversational ability preserved through training design)
  • Base model for Stage 2 agentic fine-tuning

Limitations

  • 9B parameter model - not suitable for tasks requiring extensive world knowledge
  • Reasoning traces can be verbose for simple questions
  • Not optimized for tool calling - see Harmonic-Hermes-9B (coming soon) for agentic use
  • Benchmark evaluation is ongoing

Architecture

  • Base: Qwen 3.5 9B (9.65B parameters)
  • Training: LoRA fine-tuning, merged into base weights
  • Precision: BF16
  • Context: 8192 tokens
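
The parameter count and precision above give a rough lower bound on memory for the weights alone. This is simple arithmetic, assuming all 9.65B parameters are stored in BF16 (2 bytes each); it excludes the KV cache, activations, and framework overhead.

```python
params = 9.65e9        # parameter count from the list above
bytes_per_param = 2    # BF16 uses 16 bits per value

weight_bytes = params * bytes_per_param
weight_gb = weight_bytes / 1e9       # decimal gigabytes
weight_gib = weight_bytes / 2**30    # binary gibibytes

print(f"{weight_gb:.1f} GB")    # 19.3 GB
print(f"{weight_gib:.1f} GiB")  # 18.0 GiB
```

In practice a GPU needs headroom beyond this figure, and the headroom grows with context length because the KV cache scales with the number of tokens held in context.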

License

Apache 2.0 - same as the base model. All training data is from Apache 2.0 or MIT licensed sources. Commercial use is fully permitted.

Model provider: slevinw

Model tree: base Qwen/Qwen3.5-9B; fine-tuned: this model

Modalities: input Video, Text, Image; output Text
