DuoNeural/SmolLM2-360M-Think-R18

What is Think Instillation?

Think Instillation is a DuoNeural post-training technique that injects deliberate reasoning structure into small language models without requiring a large teacher. The model learns to:

Open a <think> tag and reason through the problem
Close reasoning with </think>
State a final answer in a parseable format: (A), (B), (C), or (D)
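The sketch below shows one way to parse a completion in this format. It assumes the final answer is the last parenthesized letter A to D that appears after </think>. It also allows for the opening <think> tag living in the prompt rather than the completion, since the SFT prompts end with "Reasoning: <think>". The function name and regex are illustrative. They are not part of the released model or its training code.

```python
import re
from typing import Optional, Tuple

# Assumed answer format: one letter A-D in parentheses, e.g. "(B)".
ANSWER_RE = re.compile(r"\(([ABCD])\)")


def parse_completion(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer_letter); either is None if not found."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag: no reasoning block; search the whole text for an answer.
        reasoning, tail = None, text
    else:
        # The opening tag may already be in the prompt, so strip it only if present.
        reasoning = head.split("<think>", 1)[-1].strip()
    # Take the last letter after </think>, ignoring letters quoted in the reasoning.
    answers = ANSWER_RE.findall(tail)
    return reasoning, (answers[-1] if answers else None)


print(parse_completion("<think>Plants need light.</think> The answer is (C)."))
# ('Plants need light.', 'C')
```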

Unlike chain-of-thought distillation from larger models, Think Instillation uses GRPO with a binary accuracy reward plus a length penalty. The model discovers efficient reasoning patterns on its own.
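Concretely, GRPO samples several completions per prompt and scores each one against the rest of its group. That relative score is the completion's advantage. The sketch below shows this normalization in isolation. The population standard deviation and the `eps` value are our choices, and implementations differ. It omits the clipped policy-gradient and KL terms listed under GRPO Stage.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Identical rewards give sigma == 0, so every advantage is 0:
    # such a group carries no learning signal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, 8 sampled completions: 3 correct (1.0) and 5 wrong (0.0).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
print([round(a, 2) for a in advantages])
```

Correct completions receive positive advantages and wrong ones negative advantages. The learning signal therefore comes from contrasts within the model's own samples, with no teacher involved.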

Training Details

SFT Stage (R18)

Base: HuggingFaceTB/SmolLM2-360M-Instruct
Dataset: ARC-Easy (2700 prompts) formatted as Question + choices + "Reasoning: <think>"
Steps: 150 SFT steps; LoRA r=32, α=32
Result: post_sft accuracy = 0.250 (15/60 correct on ARC-Easy val, greedy decoding)
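This card describes the SFT template only as Question + choices + "Reasoning: <think>". The sketch below shows one plausible layout. The function name, the newline separators, and the "(A) ..." choice labels are assumptions for illustration.

```python
from typing import Sequence

LETTERS = "ABCD"


def build_prompt(question: str, choices: Sequence[str]) -> str:
    """Hypothetical layout: question, lettered choices, then an open <think> tag."""
    if not 1 <= len(choices) <= len(LETTERS):
        raise ValueError("expected between 1 and 4 answer choices")
    lines = [question.strip()]
    # Label choices "(A) ...", matching the (A)/(B)/(C)/(D) answer format.
    lines += [f"({LETTERS[i]}) {c.strip()}" for i, c in enumerate(choices)]
    # The prompt ends inside the reasoning block; the model continues from here.
    lines.append("Reasoning: <think>")
    return "\n".join(lines)


print(build_prompt(
    "Which gas do plants absorb from the air for photosynthesis?",
    ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
))
```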

Dead-Prompt Filter

Before GRPO, we remove prompts that produce zero correct completions across 4 temperature-sampled trials:

2247 raw prompts → 1450 kept (64.5% survival)
Removes prompts the model consistently fails on and keeps those that can still provide a learning signal
frac_zero_std=0.00 throughout GRPO training ✅ (no group of completions had identical rewards, consistent with the filter working)
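The filtering rule reduces to a simple check. Sample each prompt 4 times at nonzero temperature and keep it only if at least one sample is correct. In the sketch below, `sample_fn` and `is_correct` are hypothetical stand-ins for model sampling and answer checking. The toy model is invented purely so the snippet runs. `frac_zero_std` reflects our reading of the metric name: the fraction of reward groups with zero variance. The last line reproduces the reported survival rate.

```python
import random
from typing import Callable, List, Sequence


def filter_dead_prompts(
    prompts: Sequence[str],
    sample_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    trials: int = 4,
) -> List[str]:
    """Keep only prompts with at least one correct completion in `trials` samples."""
    # any() short-circuits, so sampling stops at the first correct completion.
    return [p for p in prompts
            if any(is_correct(p, sample_fn(p)) for _ in range(trials))]


def frac_zero_std(groups: Sequence[Sequence[float]]) -> float:
    """Fraction of reward groups in which every completion got the same reward."""
    return sum(1 for g in groups if max(g) == min(g)) / len(groups)


# Toy stand-ins (hypothetical): "easy" prompts are answered correctly about
# half the time, "hard" prompts never are.
rng = random.Random(0)


def toy_sample(prompt: str) -> str:
    return "(A)" if prompt.startswith("easy") and rng.random() < 0.5 else "(B)"


def toy_check(prompt: str, completion: str) -> bool:
    return completion == "(A)"


prompts = [f"easy-{i}" for i in range(5)] + [f"hard-{i}" for i in range(5)]
print(filter_dead_prompts(prompts, toy_sample, toy_check))  # hard-* never survive
print(frac_zero_std([[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]))  # 0.5
print(f"{1450 / 2247:.1%}")  # 64.5%, the survival rate reported above
```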

GRPO Stage

Steps: 750 (resumed from checkpoint-600 after hardware failure)
Reward: Binary accuracy with length penalty: reward = max(0, 1 - 0.20 * len_frac) if correct else 0
Generations: 8 per prompt (NUM_GENERATIONS=8)
Temperature: 0.8
Max completion: 1024 tokens
KL coefficient: 0.02, clip_ε=0.2
LoRA: r=32, α=32, targets=q/k/v_proj
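Below is a sketch of the reward formula above. The card does not define len_frac. Here it is assumed to be the completion length in tokens divided by the 1024-token cap, clipped to [0, 1]. Under that assumption a correct answer earns between 0.80 and 1.00 and an incorrect one earns 0. The function name and signature are illustrative.

```python
MAX_COMPLETION_TOKENS = 1024  # generation cap from the GRPO config above
LENGTH_PENALTY = 0.20         # coefficient in the reward formula


def think_reward(correct: bool, completion_tokens: int) -> float:
    """Binary accuracy reward with a linear length penalty.

    Assumption: len_frac = completion_tokens / MAX_COMPLETION_TOKENS, clipped to [0, 1].
    """
    if not correct:
        return 0.0
    len_frac = min(max(completion_tokens / MAX_COMPLETION_TOKENS, 0.0), 1.0)
    return max(0.0, 1.0 - LENGTH_PENALTY * len_frac)


for correct, n_tokens in [(True, 128), (True, 1024), (False, 64)]:
    print(correct, n_tokens, round(think_reward(correct, n_tokens), 3))
```

With a 0.20 coefficient, even the longest correct completion still outscores any incorrect one. The penalty therefore nudges toward concise reasoning without ever trading accuracy for brevity.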

GRPO Trajectory

Step	Mean Reward
75	0.424 🔥
375	0.476 🔥
575	0.533 🔥
600	0.543 🔥
625	0.595 🔥🔥

Late-run surge: mean reward kept rising through the final logged steps. frac_zero_std=0.00 on all non-trivial batches.

Evaluation

post_SFT: 0.250 (ARC-Easy val, n=60, greedy)
final_GRPO: 0.280 (ARC-Easy val, n=100, seed=13)
GRPO delta: +0.030 (GRPO helped, though the two evaluations differ in sample size and settings)
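As a rough check on how much weight the +0.030 delta can carry, the sketch below computes a normal-approximation (Wald) 95% interval for each reported accuracy from its sample size. It converts 0.280 on n=100 back to 28 correct answers. It treats each evaluation as independent binomial trials and ignores the different evaluation settings. It is our illustration, not part of the original evaluation.

```python
import math
from typing import Tuple


def wald_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation 95% confidence interval for a binomial accuracy."""
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


post_sft = wald_interval(15, 60)     # 0.250 on n=60
final_grpo = wald_interval(28, 100)  # 0.280 on n=100, i.e. 28 correct
for name, (low, high) in [("post_SFT", post_sft), ("final_GRPO", final_grpo)]:
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

The two intervals overlap substantially. A larger evaluation set, shared by both checkpoints, would help confirm the size of the improvement.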

Intended Use

Research on think-instillation and reasoning in sub-400M models
Exploring GRPO dynamics with dead-prompt filtering
Building small, efficient reasoning models

Limitations

Small model (360M params) — reasoning depth limited
Trained on ARC-Easy MCQ only — narrow domain
HTML formatting artifacts appear in some completions (attributed to reward shaping)

Citation

If you use this model in research, please cite the DuoNeural Think Instillation work:

```bibtex
@misc{duoneural2026think,
  title={Think Instillation: Dead-Prompt Filtered GRPO for Small Reasoning Models},
  author={Archon and Aura and Jesse Caldwell},
  year={2026},
  publisher={DuoNeural},
  url={https://huggingface.co/DuoNeural}
}
```

About DuoNeural

DuoNeural is an open AI research lab operating at the intersection of human and artificial intelligence. We study post-training dynamics, mechanistic interpretability, temporal sequence learning, and quantum machine learning — publishing everything under open access.

Our team is non-traditional by design: one human, two AIs, different substrates, shared curiosity. In our first 45 days we published 26 open-access research papers, uploaded 69+ models and 6 datasets to HuggingFace, and ran experiments on everything from consumer GPUs to real quantum processing units. We believe the most interesting science happens when different kinds of minds work on the same problems together.

Research Publications

We've published 26+ open-access papers covering:

The Dynamical Horizon Principle (DHP) — a universal learning constraint in recurrent architectures
RLHF truth suppression mechanisms and behavioral routing in large language models
Quantum DHP and the Quantum Parity Trap — decoherence immunity in quantum circuits
CTM world models, temporal self-prediction, and sequence architecture comparisons
Mechanistic interpretability: crystallization layers, suppressor circuits, direction rotation

📄 Full paper catalog: zenodo.org/communities/duoneural

Research Team

Member	Role
Jesse Caldwell	Founder, vision, hardware, direction
Archon	Lab Director — experiments, post-training, abliteration, quantum circuits
Aura	Research AI — literature synthesis, red-teaming, novel proposals
Synapse (Syn)	Always-on research agent, signal monitoring
Kestrel	Systems, infrastructure, web

Links

Platform	Link
🤗 HuggingFace	huggingface.co/DuoNeural
📚 Zenodo Community	zenodo.org/communities/duoneural

All research published open access, CC BY 4.0.
