FINAL-Bench

Darwin-27B-Opus

License: apache-2.0

Abstract

Darwin-27B-Opus is a 27-billion-parameter language model produced entirely through evolutionary crossbreeding of pretrained models, requiring zero additional training, zero data, and a single GPU. On the GPQA Diamond benchmark, a graduate-level scientific reasoning evaluation comprising 198 expert-crafted questions in physics, chemistry, and biology, Darwin-27B-Opus achieves 86.9% under a two-pass protocol (greedy decoding followed by selective retry with verification; greedy decoding alone yields 74.7%). This surpasses its progenitor Qwen3.5-27B (85.5%) by 1.4 percentage points and places 5th on the Hugging Face GPQA leaderboard.

This result challenges the prevailing paradigm that improved model performance necessitates additional gradient-based optimization. We demonstrate that strategic recombination of existing knowledge representations across pretrained models, guided by evolutionary optimization, constitutes a viable and remarkably efficient alternative.


GPQA Diamond Leaderboard (April 12, 2026)

| Rank | Model | Parameters | GPQA Diamond |
|---|---|---|---|
| 1 | TNSA/NGen-4-Pro | | 91.1% |
| 2 | TNSA/NGen-4 | | 90.1% |
| 3 | Qwen/Qwen3.5-397B-A17B | 397B | 88.4% |
| 4 | moonshotai/Kimi-K2.5 | | 87.6% |
| 5 | FINAL-Bench/Darwin-27B-Opus | 27B | 86.9% |
| 6 | Qwen/Qwen3.5-122B-A10B | 122B | 86.6% |
| 7 | zai-org/GLM-5.1 | 744B | 86.2% |
| 8 | zai-org/GLM-5 | 744B | 86.0% |
| 9 | zai-org/GLM-4.7 | | 85.7% |
| 10 | Qwen/Qwen3.5-27B | 27B | 85.5% |

A 27B model, produced without any training, surpasses GLM-5.1 (744B), Qwen3.5-122B-A10B (122B), and its own progenitor Qwen3.5-27B. GLM-5.1 has roughly 27.6× as many parameters (744B / 27B).


What Is Darwin?

Darwin is an evolutionary model breeding engine that crossbreeds the FFN (Feed-Forward Network) knowledge layers of pretrained AI models to automatically produce offspring that surpass both parents — with zero additional training.

Just as selective crossbreeding of livestock produces offspring exhibiting hybrid vigor (heterosis), Darwin crossbreeds the learned representations of complementary AI models to produce descendants that exceed both progenitors on target benchmarks.

Core Principle: FFN = Knowledge, Attention = Reasoning

Modern transformer-based language models consist of two principal computational modules:

  • Attention — orchestrates information routing and constructs reasoning chains. The model's inferential architecture.
  • FFN — stores factual knowledge and encodes learned patterns. The model's knowledge repository.

Darwin exploits this decomposition:

  • FFN layers are transplantable between compatible models, enabling knowledge transfer without disrupting reasoning.
  • Attention layers must be preserved, as perturbation induces catastrophic degradation of reasoning capabilities.

This principle is supported by recent theoretical work (arXiv:2501.00823) demonstrating that FFN layers can be characterized as a specialized form of cross-attention, reinforcing their interpretation as modular knowledge stores.
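One way to picture the decomposition is a per-layer interpolation of FFN weights, with every other parameter copied unchanged from the father. The sketch below is illustrative only: the parameter naming scheme (`layers.<i>.mlp.*`, `layers.<i>.attn.*`), the `blend_ffn` helper, and the use of plain linear interpolation are assumptions made for the example, not the Darwin engine's documented method.

```python
import re


def blend_ffn(father, mother, ratios):
    """Build a child state dict (hypothetical helper).

    father, mother: dicts mapping parameter names to lists of floats.
    ratios: per-layer share taken from the mother, each in [0, 1].
    FFN ("mlp") weights are interpolated; everything else comes from the father.
    """
    child = {}
    for name, father_w in father.items():
        match = re.match(r"layers\.(\d+)\.mlp\.", name)
        if match is None:
            # Attention and all non-FFN parameters are preserved as-is.
            child[name] = list(father_w)
            continue
        r = ratios[int(match.group(1))]
        mother_w = mother[name]
        child[name] = [(1.0 - r) * a + r * b for a, b in zip(father_w, mother_w)]
    return child


# Toy two-layer "models" with assumed parameter names.
father = {
    "layers.0.attn.q_proj": [1.0, 1.0],
    "layers.0.mlp.up_proj": [0.0, 2.0],
    "layers.1.mlp.up_proj": [4.0, 4.0],
}
mother = {
    "layers.0.attn.q_proj": [9.0, 9.0],
    "layers.0.mlp.up_proj": [2.0, 0.0],
    "layers.1.mlp.up_proj": [0.0, 8.0],
}
child = blend_ffn(father, mother, ratios=[0.5, 0.25])
print(child)
```

In this toy case the attention weights stay at the father's values, while each FFN tensor moves toward the mother by its layer's ratio.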


Parent Models

| Role | Model | Contribution |
|---|---|---|
| Father (Structure) | Qwen/Qwen3.5-27B | Foundation architecture, native reasoning, 201-language support |
| Mother (Knowledge) | Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled | Claude 4.6 Opus structured reasoning patterns via SFT distillation |

Both parents share identical architecture: hidden_size=4096, intermediate_size=17408, 64 layers — ensuring 100% structural compatibility for FFN crossbreeding.

Model MRI Diagnostic Scan

Left: Father (Qwen3.5-27B) — Broad, balanced activation across reasoning and knowledge domains. Strong mathematical and scientific reasoning signatures in deeper layers.
Right: Mother (Claude-4.6-Opus-Reasoning-Distilled) — Intensified reasoning concentration from Claude distillation. Enhanced structured chain-of-thought patterns visible in mid-to-late layers, with distinctive reasoning hotspots.


Evolution Process

  1. Model MRI Scan — Darwin V6 performs comprehensive diagnostic analysis of both parents, profiling each layer's functional specialization across cognitive domains (reasoning, knowledge, language, mathematics).

  2. CMA-ES Evolutionary Search — Covariance Matrix Adaptation Evolution Strategy optimizes per-block crossbreeding ratios across all 64 layers. The algorithm explores a high-dimensional genome space that no human practitioner could navigate through manual experimentation.

  3. Health Check — Automated post-merge validation ensures the offspring model functions correctly.

Total compute: H100 × 1, approximately 2 hours.
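The search step can be pictured as a loop that samples candidate ratio vectors, scores them, and moves the sampling distribution toward the best ones. The sketch below is a deliberately simplified evolution strategy (Gaussian sampling around a mean, elite averaging, a decaying step size). It is not full CMA-ES, because it does not adapt a covariance matrix. The fitness function, the `TARGET` vector, and all hyperparameters are hypothetical stand-ins; in a real run the fitness would be a benchmark score of the merged model.

```python
import random

NUM_LAYERS = 64  # layer count taken from the model specification

# Hypothetical "ideal" per-layer ratios, used only to give the toy fitness a peak.
TARGET = [(i % 4) / 4 for i in range(NUM_LAYERS)]


def toy_fitness(ratios):
    """Stand-in for a benchmark score: higher is better, maximal at TARGET."""
    return -sum((r - t) ** 2 for r, t in zip(ratios, TARGET))


def clip(x):
    """Keep a ratio inside [0, 1]."""
    return min(1.0, max(0.0, x))


def evolve(fitness, generations=60, population=32, elite=4, sigma=0.2, seed=0):
    rng = random.Random(seed)
    mean = [0.5] * NUM_LAYERS  # start from a uniform 50:50 blend
    best, best_score = list(mean), fitness(mean)
    for _ in range(generations):
        # Sample candidate genomes around the current mean.
        candidates = [
            [clip(m + rng.gauss(0.0, sigma)) for m in mean]
            for _ in range(population)
        ]
        candidates.sort(key=fitness, reverse=True)
        top = candidates[:elite]
        # Recombine: the new mean is the average of the elite genomes.
        mean = [sum(g[i] for g in top) / elite for i in range(NUM_LAYERS)]
        score = fitness(top[0])
        if score > best_score:
            best, best_score = top[0], score
        sigma *= 0.97  # shrink the step size as the search narrows
    return best, best_score


start_score = toy_fitness([0.5] * NUM_LAYERS)
best, best_score = evolve(toy_fitness)
print(f"start={start_score:.3f} best={best_score:.3f}")
```

A production search would more likely use an established CMA-ES implementation; the loop above only shows the shape of the optimization over 64 per-layer ratios.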

Parent Layer-wise Comparison

This visualization illustrates the per-layer divergence between father and mother models. Regions of high divergence represent layers where CMA-ES must make critical allocation decisions — balancing the father's reasoning architecture against the mother's distilled knowledge patterns.


GPQA Diamond Evaluation

Methodology

We employed a two-pass evaluation protocol:

Pass 1 — Greedy Baseline

  • All 198 questions, deterministic decoding (do_sample=False)
  • Epoch AI standard prompt format
  • Result: 148/198 = 74.7%

Pass 2 — Selective Retry with Verification

  • 50 incorrectly answered questions only
  • 8 independent stochastic generations per question (maj@8, temperature=0.7)
  • Contested results (vote margin ≤ 1) trigger a verification round: top-2 candidates are presented for comparative analysis via greedy decoding
  • Result: 24 additional corrections
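The voting rule in Pass 2 can be sketched as follows. This is a minimal illustration, assuming that "vote margin" means the difference between the top two answer counts; the function names `vote` and `top_two` are hypothetical, and the verification prompt itself is not shown.

```python
from collections import Counter


def vote(answers, contest_margin=1):
    """Return (winner, margin, contested) for a list of sampled answers."""
    counts = Counter(answers).most_common()
    winner, top_count = counts[0]
    runner_up_count = counts[1][1] if len(counts) > 1 else 0
    margin = top_count - runner_up_count
    # A result is contested when a second candidate is within the margin.
    contested = len(counts) > 1 and margin <= contest_margin
    return winner, margin, contested


def top_two(answers):
    """Candidates that would be passed to the verification round."""
    return [answer for answer, _ in Counter(answers).most_common(2)]


# Eight stochastic samples for one question (maj@8).
samples = ["B", "A", "B", "C", "A", "B", "A", "B"]
print(vote(samples))     # ('B', 1, True): 4 votes vs 3, so verification triggers
print(top_two(samples))  # ['B', 'A']
```

When `contested` is true, the two candidates from `top_two` would be compared in a single greedy-decoded verification pass; otherwise the majority answer stands.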

Results by Shard

| Shard | Greedy | After Retry | Flipped | Gain |
|---|---|---|---|---|
| Shard 0 | 48/66 (72.7%) | 58/66 (87.9%) | 10/18 | +15.2 pp |
| Shard 1 | 49/66 (74.2%) | 57/66 (86.4%) | 8/17 | +12.1 pp |
| Shard 2 | 51/66 (77.3%) | 57/66 (86.4%) | 6/15 | +9.1 pp |
| Total | 148/198 (74.7%) | 172/198 (86.9%) | 24/50 | +12.1 pp |
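The totals can be rechecked with a few lines of arithmetic. The counts are copied from the shard results above; the variable names are illustrative.

```python
# Per-shard counts: questions, correct under greedy decoding, correct after retry.
shards = {
    "Shard 0": {"n": 66, "greedy": 48, "retry": 58},
    "Shard 1": {"n": 66, "greedy": 49, "retry": 57},
    "Shard 2": {"n": 66, "greedy": 51, "retry": 57},
}

total_n = sum(s["n"] for s in shards.values())
total_greedy = sum(s["greedy"] for s in shards.values())
total_retry = sum(s["retry"] for s in shards.values())

# Questions missed in Pass 1 and questions corrected in Pass 2.
total_wrong = total_n - total_greedy
total_flipped = total_retry - total_greedy

greedy_pct = round(100 * total_greedy / total_n, 1)
retry_pct = round(100 * total_retry / total_n, 1)
gain_pp = round(100 * total_flipped / total_n, 1)

print(f"greedy: {total_greedy}/{total_n} = {greedy_pct}%")
print(f"retry:  {total_retry}/{total_n} = {retry_pct}%")
print(f"flipped {total_flipped}/{total_wrong}, gain +{gain_pp} pp")
```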

Verification Round Efficacy

Of 19 questions triggering verification (margin ≤ 1 vote), 12 were successfully corrected (63.2% success rate). The verification mechanism contributed approximately 7 additional correct answers that majority voting alone would have missed.


Hybrid Vigor: CLIcK Korean Benchmark

To validate hybrid vigor across languages, we evaluated a second-generation offspring — Darwin-27B-KR — bred from Darwin-27B-Opus (father) and a Korean-specialized model (mother).

Four-Model Comparison (200 questions, 0-shot)

| Generation | Model | CLIcK Overall |
|---|---|---|
| Gen 0 (Ancestor) | Qwen3.5-27B | 69.52% |
| Gen 1 (Father) | Darwin-27B-Opus | 70.19% |
| Mother | Korean-specialized SFT | 74.74% |
| Gen 2 (Child) | Darwin-27B-KR | 75.59% |

The child surpasses both parents — winning 7 out of 11 evaluation categories. Largest gains: Law (+9.5pp), Functional Language (+7.6pp), History (+6.5pp).

Two generations of zero-training evolution achieved +6.07 percentage points over the original Qwen3.5-27B foundation model.
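The stated gains follow directly from the CLIcK scores reported above. The short check below uses only those four numbers; the dictionary keys are illustrative labels.

```python
# CLIcK Overall scores (percent) as reported for each model.
click = {
    "ancestor": 69.52,  # Qwen3.5-27B
    "father": 70.19,    # Darwin-27B-Opus
    "mother": 74.74,    # Korean-specialized SFT
    "child": 75.59,     # Darwin-27B-KR
}

gain_over_ancestor = round(click["child"] - click["ancestor"], 2)
gain_over_father = round(click["child"] - click["father"], 2)
gain_over_mother = round(click["child"] - click["mother"], 2)

print(f"child vs ancestor: +{gain_over_ancestor} pp")
print(f"child vs father:   +{gain_over_father} pp")
print(f"child vs mother:   +{gain_over_mother} pp")
```

The child exceeds the stronger parent (the mother) by a smaller margin than it exceeds the ancestor, which is the pattern the hybrid-vigor claim rests on.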


Computational Economics

| | Darwin-27B-Opus | Conventional Fine-Tuning |
|---|---|---|
| GPU | H100 × 1 | H100 × 8–64 |
| Time | ~2 hours | Days to weeks |
| Training tokens | 0 | 10⁶–10⁹ |
| Gradient computation | None | Full backpropagation |
| Output model size | Identical to parent | Identical to parent |
| Inference overhead | Zero | Zero |

The resultant model is architecturally indistinguishable from its progenitor — identical parameter count, identical inference speed, identical deployment requirements.


Model Specifications

| Specification | Value |
|---|---|
| Architecture | Qwen3.5 Dense (GatedDeltaNet) |
| Parameters | 27B |
| Hidden Size | 4096 |
| Intermediate Size | 17408 |
| Layers | 64 |
| Context Length | 262,144 (extensible to 1M via YaRN) |
| Precision | BF16 |
| Languages | 201 |
| Thinking Mode | Enabled |
| License | Apache 2.0 |

Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "FINAL-Bench/Darwin-27B-Opus"

# Load the tokenizer and the BF16 weights, placed across available devices.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

VRAM Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 Full Precision | ~55 GB | H100 single GPU |
| NVIDIA H100 80GB | 80 GB | Very comfortable |
| 2× RTX 4090 48GB | 48 GB | Tensor parallel |
| 4-bit Quantized | ~16 GB | RTX 4090 single GPU |
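The BF16 and 4-bit figures can be sanity-checked from the parameter count. The helper below is a rough, weights-only estimate; `weight_memory_gb` is a hypothetical name. It ignores the KV cache, activations, and quantization metadata, which is presumably why the table's values are somewhat higher.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


bf16_gb = weight_memory_gb(27, 16)  # 16 bits per parameter
int4_gb = weight_memory_gb(27, 4)   # 4 bits per parameter

print(f"BF16 weights:  {bf16_gb} GB")
print(f"4-bit weights: {int4_gb} GB")
```

Weights alone come to 54 GB in BF16 and 13.5 GB at 4 bits, so the ~55 GB and ~16 GB entries leave a small allowance for runtime overhead.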

Darwin Model Family

| Model | Gen | Parameters | GPQA Diamond | CLIcK | Specialty |
|---|---|---|---|---|---|
| Darwin-27B-Opus | Gen 1 | 27B | 86.9% | 70.19% | Claude reasoning |
| Darwin-27B-KR | Gen 2 | 27B | | 75.59% | Korean hybrid vigor |
| Darwin-4B-Genesis | Gen 3 | 4B | ~60% | 92% | Cross-architecture breeding |
| Darwin-31B-Opus | Gen 1 | 31B | 66% | | Gemma4 reasoning |
| Darwin-35B-A3B-Opus | Gen 1 | 35B MoE | 90% | | MoE reasoning |
| Darwin-9B-Opus | Gen 1 | 9B | | | Edge deployment |

Key Findings

  1. FFN = Knowledge, Attention = Reasoning: Supported empirically by ablation. Attention blending causes GPQA collapse (60% → 10%), while FFN crossbreeding consistently enhances performance.

  2. Hybrid vigor appears at multiple model sizes: Observed at 4B (Genesis, CLIcK 92%) and 27B (KR, CLIcK 75.59%).

  3. Zero-training evolution is recursive: Gen 0 → Gen 1 → Gen 2, each generation improving without gradient updates.

  4. CMA-ES finds ratios that manual blending misses: A manual 50:50 blend degrades performance, while evolutionary search finds non-obvious, better-performing ratios.

  5. Verification rounds recover contested answers: 63.2% success rate (12 of 19) on close-vote questions, contributing ~7 additional correct answers.


Roadmap

  • K-AI Leaderboard official submission (Korean government-certified evaluation)
  • MMLU-Pro, AIME 2025 evaluation
  • Cross-architecture breeding at 27B scale (Transformer × Mamba FFN)
  • Third-generation recursive evolution
  • Darwin engine research paper


Built By

| Item | Detail |
|---|---|
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Architecture | Qwen3.5-27B Dense |
| License | Apache 2.0 |

Citation

```bibtex
@misc{vidraft_darwin_27b_opus_2026,
  title        = {Darwin-27B-Opus: Surpassing the Foundation Model Without Training},
  subtitle     = {86.9\% on GPQA Diamond via Evolutionary FFN Crossbreeding},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-27B-Opus}}
}
```

This model is introduced in Darwin Family.

