License: apache-2.0

Overview

Darwin-31B-Opus is a reasoning-enhanced model created by merging google/gemma-4-31B-it (Father) and TeichAI/gemma-4-31B-it-Claude-Opus-Distill (Mother) using the Darwin V6 engine.

Darwin V6 diagnoses both parent models at the tensor level before merging, assigning an independent optimal ratio to each of the 1,188 tensors. This is fundamentally different from conventional merging tools that apply a single uniform ratio across all tensors.


Parent Models

| Role | Model | Characteristics |
|---|---|---|
| Father | google/gemma-4-31B-it | Gemma 4 Dense 31B, multimodal, 256K context, LMArena 1452 (open model #3) |
| Mother | TeichAI/gemma-4-31B-it-Claude-Opus-Distill | Claude 4.6 Opus high-effort reasoning distillation, code/science/analysis |

Model Diagnostic Scan (MDS)

Left: Father (gemma-4-31B-it) — balanced generalist with low activation across most probes. Right: Mother (Claude-Opus-Distill) — strong REASONING concentration in L50-L60, CODE activation in late layers, KOREAN at start and end. The Mother shows significantly more specialized layer patterns from Claude Opus distillation.


🏆 Benchmark — GPQA Diamond (198 questions)

GPQA Diamond is a 198-question benchmark of PhD-level science reasoning.

| Benchmark | Darwin-31B-Opus | Engine |
|---|---|---|
| GPQA Diamond | 🥇 85.9% | Darwin-DELPHI test-time engine |
| ARC-Challenge | 82.89% | evolutionary-selection metric (loglikelihood, 0-shot, 200Q) |

The 85.9% GPQA Diamond result is produced with the Darwin-DELPHI test-time reasoning engine applied on top of this model. The evaluation methodology is protected; sample counts, staging, and thresholds are trade secrets. The ARC-Challenge score of 82.89% is the internal evolutionary-selection score used during the Darwin V6 merge search.

Note: the Gemma 4 architecture (Gemma4ForConditionalGeneration) has a multimodal wrapper that limits lm-eval loglikelihood compatibility; generative evaluation is the valid path for Gemma 4 based models, and Darwin-DELPHI evaluates generatively accordingly.


Darwin V6 vs Conventional Merging

| Capability | mergekit (DARE-TIES) | Darwin V6 |
|---|---|---|
| Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (1,188 independent ratios) |
| Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
| Ratio formula | Human-set or grid search | combined = static × 0.4 + probe × 0.6, then evolutionary optimization |
| Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
| Post-merge validation | Benchmark score only | Layer-by-layer Health Check: child vs both parents, interference and function loss detection |
| Search method | Manual tuning | CMA-ES evolution with adaptive 14-dimensional genome |
| Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
| GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds) → Phase 2 real merge (top-k only evaluated) |
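The Transplant row can be illustrated with a minimal sketch. It assumes that the ratio is the Mother's share of a tensor and uses plain linear interpolation on Python lists as a stand-in for the actual merge step; the function names `effective_ratio` and `blend` are hypothetical and do not come from Darwin V6.

```python
def effective_ratio(ratio, low=0.15, high=0.85):
    """Snap near-extreme ratios to a pure copy of one parent (transplant)."""
    if ratio < low:
        return 0.0  # take the tensor 100% from the Father
    if ratio > high:
        return 1.0  # take the tensor 100% from the Mother
    return ratio    # otherwise keep the interpolation ratio unchanged


def blend(father, mother, ratio):
    """Linear blend of two flat tensors; ratio is the Mother's share (assumption)."""
    r = effective_ratio(ratio)
    return [(1.0 - r) * f + r * m for f, m in zip(father, mother)]


print(blend([1.0, 2.0], [3.0, 4.0], 0.10))  # Father copied unchanged
print(blend([1.0, 2.0], [3.0, 4.0], 0.90))  # Mother copied unchanged
print(blend([0.0], [2.0], 0.50))            # ordinary interpolation
```

Snapping to 0.0 or 1.0 means the tensor is copied rather than mixed, which is one reading of "zero interpolation noise" in the table.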

How Darwin V6 Works

Darwin V6 does not use mergekit or any external merge library. It re-implements DARE-TIES (DARE: Yu et al., 2023; TIES: Yadav et al., 2023) directly as PyTorch tensor operations, using per-tensor diagnostic ratios.
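For readers unfamiliar with DARE-TIES, the following is a rough sketch of the two ideas on flat Python lists, not Darwin V6's implementation. It assumes task vectors are taken relative to a shared `base` (the text does not say which model plays that role), omits per-tensor ratios, and uses hypothetical function names.

```python
import random


def dare(delta, density, rng):
    """DARE: keep each entry with probability `density`, rescale survivors by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def ties_merge(deltas):
    """TIES-style sign election, then a mean over the entries that agree with it."""
    merged = []
    for column in zip(*deltas):
        sign = 1.0 if sum(column) >= 0 else -1.0
        agreeing = [v for v in column if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


def dare_ties(base, parents, densities, seed=0):
    """Sparsify each parent's delta with DARE, merge with TIES, add back to base."""
    rng = random.Random(seed)  # fixed seed makes the random drop reproducible
    deltas = [
        dare([p - b for p, b in zip(parent, base)], density, rng)
        for parent, density in zip(parents, densities)
    ]
    return [b + m for b, m in zip(base, ties_merge(deltas))]


# With density 1.0 nothing is dropped, so only the sign election matters.
print(dare_ties([0.0, 0.0, 0.0], [[1.0, -2.0, 0.0], [3.0, 2.0, 0.0]], [1.0, 1.0]))
```

In the second position the two deltas disagree in sign, so only the entry matching the elected sign contributes to the merged value.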

Before merging, Darwin performs a Model Diagnostic Scan (MDS) on both parents. For every tensor, it measures Shannon entropy (information density), standard deviation (activation spread), and L2 norm (energy). Additionally, 5 diagnostic probes (REASONING, CODE, MATH, KNOWLEDGE, LANGUAGE) are passed through the model, measuring cosine distance when each layer is skipped to determine functional importance.
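The quantities named above can be sketched in dependency-free Python. This is an illustration under assumptions: the histogram-based entropy estimate and its bin count are choices made here, the text does not specify how Darwin V6 computes entropy, and a real implementation would operate on PyTorch tensors.

```python
import math


def shannon_entropy(values, bins=16):
    """Entropy (in bits) of a histogram of the values; the bin count is an assumption."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant tensor carries no information
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def std(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def l2_norm(values):
    """Euclidean norm of a flat tensor."""
    return math.sqrt(sum(v * v for v in values))


def cosine_distance(u, v):
    """1 - cosine similarity; both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (l2_norm(u) * l2_norm(v))


print(l2_norm([3.0, 4.0]))                       # 5.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 for orthogonal vectors
```

For the probe step, one plausible reading is that `cosine_distance` compares the model's output for a probe with and without a given layer, so a larger distance marks a more important layer.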

The final merge ratio for each tensor:

```
static_score = entropy × 0.3 + std × 0.2 + clamp(norm, 100) × 0.002
probe_score  = Σ(cosine_distance[probe_i] × weight_i)
combined     = static_score × 0.4 + probe_score × 0.6
mri_ratio    = combined_b / (combined_a + combined_b)
final_ratio  = mri_ratio × mri_trust + genome_ratio × (1 - mri_trust)
```

The mri_trust parameter itself is optimized by the CMA-ES evolutionary algorithm, allowing the system to automatically determine the optimal balance between diagnostic prescription and evolutionary search for each model pair.
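The ratio formulas translate almost line for line into code. The sketch below assumes that `clamp(norm, 100)` caps the norm at 100 and that subscript `a` is the Father and `b` the Mother (as the genome's density_a and density_b labels suggest); the probe weights are not given in the text, so any values passed in are hypothetical.

```python
def static_score(entropy, std, norm):
    """Static profile score; the norm is capped at 100 (assumed meaning of clamp)."""
    return entropy * 0.3 + std * 0.2 + min(norm, 100.0) * 0.002


def probe_score(distances, weights):
    """Weighted sum of per-probe cosine distances; the weights are hypothetical inputs."""
    return sum(distances[name] * weights[name] for name in distances)


def combined_score(static, probe):
    """Blend of static profiling (40%) and probe importance (60%)."""
    return static * 0.4 + probe * 0.6


def final_ratio(combined_a, combined_b, mri_trust, genome_ratio):
    """Mix the diagnostic ratio with the genome ratio according to mri_trust."""
    mri_ratio = combined_b / (combined_a + combined_b)
    return mri_ratio * mri_trust + genome_ratio * (1.0 - mri_trust)


# Mother scores 3x the Father, so the diagnostic ratio is 0.75;
# with equal trust in diagnosis and genome (0.25) the result is 0.5.
print(final_ratio(1.0, 3.0, mri_trust=0.5, genome_ratio=0.25))
```

Setting `mri_trust` to 1.0 returns the purely diagnostic ratio, while 0.0 returns the genome ratio unchanged, which is why the evolutionary search can tune how much the diagnosis is trusted.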

After merging, a Health Check compares the child model against both parents layer-by-layer, detecting interference (child importance >> parent max) or function loss (parent importance high but child dropped).
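A layer-by-layer check of this kind might look like the sketch below. The thresholds `spike`, `drop`, and `floor` are hypothetical, since the text gives no numeric criteria, and the inputs are assumed to be per-layer importance scores for each model.

```python
def health_check(father, mother, child, spike=2.0, drop=0.5, floor=0.1):
    """Flag layers whose child importance deviates sharply from both parents.

    All three thresholds are illustrative assumptions, not Darwin V6 values.
    """
    report = []
    for layer, (f, m, c) in enumerate(zip(father, mother, child)):
        parent_max = max(f, m)
        if c > spike * parent_max:
            # Child importance far above both parents: possible interference.
            report.append((layer, "interference"))
        elif parent_max >= floor and c < drop * parent_max:
            # A parent relied on this layer but the child no longer does.
            report.append((layer, "function_loss"))
    return report


issues = health_check(
    father=[0.2, 0.2, 0.2],
    mother=[0.3, 0.3, 0.3],
    child=[0.3, 0.9, 0.1],
)
print(issues or "healthy")
```

An empty report corresponds to the "healthy" status shown in the evolution result.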

Parent Comparison (MDS Result)


Evolution Result

Best Score (ARC-Challenge): 0.8289
Merge Method: DARE-TIES (direct PyTorch)
Tensors Merged: 1,188
Health Check: healthy
Phase 2 Steps: 4 (early stop, patience=5)
Total Time: 134 min
Infrastructure: 4 x NVIDIA H100 NVL (100GB)

Optimal Genome (14-dimensional adaptive):

```
global_ratio:        0.5147  (overall merge ratio)
attn_ratio:          0.3169  (Attention layers, Father dominant)
ffn_ratio:           0.9316  (FFN layers, Mother dominant)
embed_ratio:         0.7748  (Embedding)
density_a:           0.8997  (Father DARE density)
density_b:           0.9539  (Mother DARE density)
block_0_ratio:       0.6628  (L0-L9)
block_1_ratio:       0.6431  (L10-L19)
block_2_ratio:       0.5146  (L20-L29, balanced)
block_3_ratio:       0.5971  (L30-L39)
block_4_ratio:       0.6339  (L40-L49)
block_5_ratio:       0.8583  (L50-L59, reasoning core, Mother dominant)
mri_trust:           0.3631  (MDS 36% + Genome 64%)
merge_method_weight: 0.6897
```

Key observations from the genome: ffn_ratio=0.93 indicates that the FFN layers strongly favor the Mother (Claude Opus Distill), and block_5_ratio=0.86 (L50-L59) shows that the reasoning core layers also favor the Mother. This aligns with the MDS heatmap pattern, in which the Mother's reasoning capability is concentrated in the final layers. Meanwhile, attn_ratio=0.32 preserves the Father's attention structure, maintaining the original Gemma 4 multimodal and long-context capabilities.
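The comparison table states that a genome_hash seed guarantees identical output for an identical genome. One way such a seed could be derived is sketched below; the use of SHA-256 over a canonical JSON encoding is an assumption made here, not a documented detail of Darwin V6, and the helper names are hypothetical.

```python
import hashlib
import json
import random


def genome_hash(genome):
    """Stable hex digest of a genome dict, independent of key order."""
    payload = json.dumps(genome, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def genome_seed(genome):
    """Derive an integer RNG seed from the first 64 bits of the digest."""
    return int(genome_hash(genome)[:16], 16)


genome = {"global_ratio": 0.5147, "attn_ratio": 0.3169, "ffn_ratio": 0.9316}
rng_a = random.Random(genome_seed(genome))
rng_b = random.Random(genome_seed(dict(reversed(list(genome.items())))))

# Same genome, different key order: identical random streams.
print(rng_a.random() == rng_b.random())
```

Seeding the random number generator this way would make the random drop step of DARE repeatable for a given genome.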


Model Specifications

Architecture: Gemma 4 Dense (Hybrid Attention: Sliding Window + Global)
Parameters: 31B
Precision: BF16
Context: 256,072
Languages: 140+
Thinking: chain-of-thought via enable_thinking=True
License: Apache 2.0

Usage

Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "FINAL-Bench/Darwin-31B-Opus"

# Load the tokenizer and the BF16 weights, sharded across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Build the prompt with the chat template; enable_thinking=True turns on chain-of-thought.
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Decode greedily and print only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

VRAM Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 Full Precision | ~62 GB | |
| NVIDIA H100 80GB | 80 GB | Single GPU |
| NVIDIA A100 80GB x 2 | 160 GB | Comfortable |
| NVIDIA RTX 4090 24GB x 4 | 96 GB | device_map=auto |
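The ~62 GB figure follows from the parameter count and precision: 31 billion parameters at 2 bytes each in BF16. The short calculation below covers weights only and uses decimal gigabytes; the KV cache and activations need additional memory that depends on context length and batch size, which this sketch does not estimate.

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Weight-only memory in decimal GB: (params * 1e9 * bytes) / 1e9."""
    return params_billion * bytes_per_param


weights = weight_memory_gb(31, 2)  # BF16 stores each parameter in 2 bytes
print(weights)                     # 62

# Memory left for KV cache and activations on each setup from the table.
for name, total in [("H100 80GB", 80), ("A100 80GB x 2", 160), ("RTX 4090 24GB x 4", 96)]:
    print(name, total - weights)
```

The single H100 leaves about 18 GB of headroom, which is why the two-GPU A100 setup is labelled comfortable.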

Built By

Developer: VIDRAFT
Engine: Darwin V6 (Diagnostic-Guided Evolutionary Merge)
Architecture: Gemma-4-31B
License: Apache 2.0

Citation

```bibtex
@misc{vidraft_darwin_31b_opus,
  title        = {Darwin-31B-Opus: Diagnostic-Guided Evolutionary Merge on Gemma 4},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-31B-Opus}}
}
```

This model is introduced in Darwin Family.

Model provider: FINAL-Bench
Modalities: Text and Image input; Text output