FINAL-Bench

Darwin-4B-David



License: apache-2.0

Overview

Darwin-4B-David is the first second-generation (Generation 2) model in Darwin history — a model evolved from an already-evolved model.

The first-generation Darwin-4B-Opus (Father) was evolved from the original gemma-4-E4B-it using the Darwin V6 engine. Darwin-4B-David was born by crossbreeding this first-generation evolved model with DavidAU's DECKARD-Expresso-Universe (Mother). This is the first realization of Darwin's core concept: "Merge = Evolve" applied recursively.

The name "David" pays tribute to the Mother model's creator DavidAU, while evoking the biblical David who defeated Goliath — symbolizing how a 4.5B small model challenges models many times its size.


Family Tree

Generation Comparison

| | Gen 0 (Original) | Gen 1 (Opus) | Gen 2 (David) |
|---|---|---|---|
| Model | gemma-4-E4B-it | Darwin-4B-Opus | Darwin-4B-David |
| Parents | Google training | Original + Claude distill | Evolved model + DECKARD |
| GPQA Diamond | 58.6% | | 85.0% (+26.4%p) |
| Recursive evolution | None | | 2× (evolution of evolution) |
| Core genes | General-purpose | Claude reasoning | Reasoning + Creativity + Thinking |

Parent Models

| Role | Model | Characteristics |
|---|---|---|
| Father (Gen-1 Evolved) | FINAL-Bench/Darwin-4B-Opus | Darwin V6 Gen-1, ARC-C 82.92%, Claude Opus reasoning distillation |
| Mother | DavidAU/DECKARD-Expresso-Universe | BF16, Unsloth deep tuning (5 in-house datasets), Universe logic/insight enhancement, Thinking mode default |

Model Diagnostic Scan (MDS)

Left: Father (Darwin-4B-Opus) — REASONING concentration in later layers (dist 0.4), MATH activation throughout. Already optimized through Gen-1 evolution.
Right: Mother (DECKARD-Expresso-Universe) — Strong KOREAN hotspot (dist 1.5), signature of Unsloth deep tuning. Remaining regions show uniform distribution.


Benchmarks

Key Results

| Benchmark | gemma-4-E4B-it (Original) | Darwin-4B-David (Gen-2) | Improvement | Conditions |
|---|---|---|---|---|
| GPQA Diamond | 58.6% | 85.0% | +26.4%p | Generative, maj@8, 50Q sampling |
| ARC-Challenge | 64.93% | 64.93% | ±0 | 25-shot, chat template, BF16, loglikelihood |
| KMMLU | 48.47% | 48.46% | ±0 | 5-shot, 225Q, loglikelihood |

GPQA Diamond Evaluation Details

GPQA Diamond (graduate-level scientific reasoning) was evaluated generatively, with thinking mode enabled.

| Setting | Value |
|---|---|
| Dataset | Idavidrein/gpqa, gpqa_diamond split |
| Questions | 50 (sampled from 198 total) |
| Evaluation method | maj@8 (8 independent generations per question, majority vote determines final answer) |
| Prompt format | Epoch AI standard (ANSWER: LETTER) |
| Thinking mode | Enabled (chat_template, enable_thinking) |
| max_new_tokens | 4,096 |
| temperature | 1.0 |
| top_p / top_k | 0.95 / 64 |
| Precision | BF16 |
| Choice shuffling | Fixed seed per question (MD5 hash) |
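The "fixed seed per question (MD5 hash)" setting can be pictured with the sketch below. It is only one plausible reading: the function name `shuffle_choices` and the choice to hash the question text (rather than, say, a question ID) are assumptions for illustration, not the evaluation script itself.

```python
import hashlib
import random


def shuffle_choices(question: str, choices: list[str]) -> list[str]:
    """Return the choices in an order that is stable for a given question."""
    # Assumption: the seed is derived from the MD5 digest of the question text.
    seed = int(hashlib.md5(question.encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    shuffled = list(choices)   # copy, leaving the caller's list unchanged
    rng.shuffle(shuffled)
    return shuffled


if __name__ == "__main__":
    options = ["option 1", "option 2", "option 3", "option 4"]
    print(shuffle_choices("Example question text", options))
```

Because the seed depends only on the question, all 8 generations for that question see the same answer order, and reruns reproduce it.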

Why maj@8:

  • A single sampled generation (pass@1 with do_sample) is vulnerable to stochastic variation
  • 8 independent generations with majority voting reflects the model's stable reasoning capability
  • maj@k is standard practice in frontier model benchmarks (AIME, MATH, etc.)
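As a rough illustration of maj@8 scoring, the sketch below extracts the final `ANSWER: LETTER` line from each generation and takes the most common letter. The helper names, the A-D letter range, and the tie-breaking rule (first letter seen wins) are assumptions; the actual evaluation harness may differ.

```python
import re
from collections import Counter

# Matches the "ANSWER: LETTER" convention; A-D is assumed for four-choice questions.
ANSWER_RE = re.compile(r"ANSWER:\s*([A-D])")


def extract_answer(generation: str):
    """Return the last 'ANSWER: X' letter in a generation, or None if absent."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1] if matches else None


def majority_vote(generations):
    """Return the most frequent extracted letter, or None if nothing parsed."""
    letters = [a for a in map(extract_answer, generations) if a is not None]
    if not letters:
        return None
    # most_common orders equal counts by first occurrence, so ties go to
    # the letter that appeared first.
    return Counter(letters).most_common(1)[0][0]


if __name__ == "__main__":
    samples = ["... ANSWER: B", "... ANSWER: C", "... ANSWER: B", "no final line"]
    print(majority_vote(samples))
```

A question counts as correct when the voted letter matches the gold letter after choice shuffling.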

Note on 50-question sampling:

  • GPQA Diamond contains 198 questions total; 50 questions represent 25.3% of the full set
  • 50 questions × 8 samples = 400 total generations; majority voting over 8 samples reduces per-question sampling noise
  • Full 198-question evaluation is planned
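The arithmetic behind the subset size can be worked through as follows. This is a back-of-the-envelope sketch that treats each question's majority-vote outcome as an independent Bernoulli trial and uses a normal approximation; both are simplifying assumptions, and the helper names are hypothetical.

```python
import math


def binomial_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def normal_approx_interval(accuracy: float, n_questions: int, z: float = 1.96):
    """Approximate 95% interval (z = 1.96), clipped to the range [0, 1]."""
    se = binomial_standard_error(accuracy, n_questions)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


if __name__ == "__main__":
    coverage = 50 / 198                      # share of GPQA Diamond evaluated
    se = binomial_standard_error(0.85, 50)   # reported score, 50 questions
    low, high = normal_approx_interval(0.85, 50)
    print(f"coverage={coverage:.1%} se={se:.3f} interval=({low:.3f}, {high:.3f})")
```

Under these assumptions the standard error is about 5 percentage points, so the planned full 198-question run would tighten the estimate.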

Note on lm-eval Loglikelihood Results

ARC-Challenge and KMMLU scores are effectively unchanged from the original model (identical on ARC-Challenge, 0.01 percentage points lower on KMMLU). This is characteristic of DARE-TIES merging: the loglikelihood method compares token probabilities across answer choices and does not capture differences in generation quality, reasoning chains, or creativity. The evolution effect shows up in generative evaluation (GPQA Diamond), where the difference emerges during step-by-step reasoning in thinking mode.
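A toy example may help show why loglikelihood scoring can leave scores unchanged. The sketch below picks the answer with the highest log-probability; the numbers are invented for illustration and the function is not lm-eval's implementation.

```python
def pick_by_loglikelihood(choice_logprobs: dict[str, float]) -> str:
    """Return the answer choice whose continuation has the highest log-probability."""
    return max(choice_logprobs, key=choice_logprobs.get)


# Made-up log-probabilities for one question under two models.
original_model = {"A": -4.10, "B": -2.30, "C": -5.00, "D": -3.70}
merged_model = {"A": -3.90, "B": -2.10, "C": -5.20, "D": -3.60}

if __name__ == "__main__":
    print(pick_by_loglikelihood(original_model), pick_by_loglikelihood(merged_model))
```

Both models select the same choice even though every probability shifted: only the ranking matters, so changes that do not reorder the choices leave the score untouched.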


MRI-Guided Evolution Recipe

Darwin V6's Model MRI scanned weight divergence across all 42 layers and automatically assigned independent weight ratios to each layer.

| Layer Range | Weight | Strategy |
|---|---|---|
| Layer 0-3 | 0.81 | Absorb Mother's embedding-adjacent layers |
| Layer 15-16 | 0.91 | Maximum Mother creativity/character layer reinforcement |
| Layer 22-25 | 0.95 | Maximum absorption of Mother's KOREAN hotspot |
| Layer 26-27 | 0.40 | Father priority preservation zone |
| Layer 30-40 | 0.48 | Father REASONING/MATH preservation |
| Layer 40-42 | 0.62 | Output layer balance |
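One way to read the recipe table is as a lookup from layer index to merge weight, sketched below. Several points are assumptions: ranges are treated as inclusive, the weight is interpreted as the Mother's share (inferred from the strategy descriptions), layer 40 appears in two rows so the first match is used, and layers not listed in the table return `None` because the table does not give their weights.

```python
# (inclusive layer range, weight) pairs copied from the recipe table.
LAYER_WEIGHTS = [
    (range(0, 4), 0.81),    # Layer 0-3
    (range(15, 17), 0.91),  # Layer 15-16
    (range(22, 26), 0.95),  # Layer 22-25
    (range(26, 28), 0.40),  # Layer 26-27
    (range(30, 41), 0.48),  # Layer 30-40
    (range(40, 43), 0.62),  # Layer 40-42
]


def mother_ratio(layer: int):
    """Return the table weight for a layer, or None if the layer is unlisted."""
    for layers, weight in LAYER_WEIGHTS:
        if layer in layers:
            return weight  # first matching row wins (affects layer 40 only)
    return None


if __name__ == "__main__":
    for layer in (0, 16, 24, 27, 35, 42, 10):
        print(layer, mother_ratio(layer))
```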

Parent Comparison

Evolution Parameters

| Setting | Value |
|---|---|
| Merge method | DARE-TIES (direct PyTorch, no mergekit dependency) |
| Density | 0.800 ~ 0.850 |
| Normalization | normalize: true |
| Evolution method | Darwin mergekit (MRI-guided) |
| Population size | 20 |
| Phase 1 (proxy search) | 200 steps |
| Phase 2 (real merge) | 10 steps, top 5 elite |
| Fitness function | kmmlu_lite (Korean knowledge) |
| Best fitness | 0.8412 (84.12%) |
| Total time | 45.3 minutes (H100 ×1) |
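For readers unfamiliar with DARE-TIES, the sketch below shows the general technique on flat lists of floats: DARE randomly drops entries of each task vector and rescales the survivors by 1/density, then TIES elects a sign per parameter and averages only the contributions that agree with it. This is a simplified, pure-Python illustration rather than Darwin V6's PyTorch code; the shared `base` model, the function names, and the per-model `weights` are assumptions.

```python
import random


def sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def dare(delta, density, rng):
    """DARE: keep each entry with probability `density`, rescale kept entries."""
    return [d / density if rng.random() < density else 0.0 for d in delta]


def dare_ties_merge(base, models, weights, density, seed=0, normalize=True):
    """Merge `models` onto `base` using DARE pruning and TIES sign election."""
    rng = random.Random(seed)
    # Task vectors: each parent's difference from the shared base.
    deltas = [dare([m - b for m, b in zip(model, base)], density, rng)
              for model in models]
    merged = []
    for i, b in enumerate(base):
        contribs = [w * d[i] for w, d in zip(weights, deltas)]
        elected = sign(sum(contribs))  # TIES sign election by weighted sum
        kept = [(w, c) for w, c in zip(weights, contribs)
                if c != 0 and sign(c) == elected]
        total = sum(c for _, c in kept)
        if normalize and kept:
            total /= sum(w for w, _ in kept)  # analogue of normalize: true
        merged.append(b + total)
    return merged


if __name__ == "__main__":
    base = [0.0, 0.0]
    father, mother = [1.0, 1.0], [3.0, -0.5]
    print(dare_ties_merge(base, [father, mother], [0.5, 0.5], density=0.85))
```

With `density=1.0` nothing is dropped, which makes the sign-election step easy to inspect by hand: in the second parameter the Mother's negative delta is discarded because the weighted sum is positive.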

Darwin V6 vs Conventional Merging

| Capability | mergekit (DARE-TIES) | Darwin V6 |
|---|---|---|
| Implementation | Library call (mergekit CLI) | Direct PyTorch tensor operations, no external dependency |
| Ratio selection | Uniform ratio across all tensors | Per-tensor ratio from MDS diagnostic (independent ratios per tensor) |
| Pre-merge analysis | None | Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes) |
| Transplant | Not supported | ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise) |
| Post-merge validation | Benchmark score only | Layer-by-layer Health Check: child vs both parents, interference and function loss detection |
| Search method | Manual tuning | CMA-ES evolution with adaptive genome |
| Reproducibility | Config file | genome_hash seed guarantees identical output for identical genome |
| GPU efficiency | Single merge per run | Phase 1 proxy (200 steps, seconds) → Phase 2 real merge (top-k only evaluated) |
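The Transplant row can be expressed as a small thresholding rule, sketched below. The function name `resolve_ratio` is hypothetical, and the convention that 0.0 means "copy the Father tensor" and 1.0 means "copy the Mother tensor" is an assumption drawn from the table's wording.

```python
def resolve_ratio(ratio: float, low: float = 0.15, high: float = 0.85) -> float:
    """Snap extreme merge ratios to a pure copy of one parent.

    Returns 0.0 (Father 100%) below `low`, 1.0 (Mother 100%) above `high`,
    and the unchanged ratio otherwise.
    """
    if ratio < low:
        return 0.0
    if ratio > high:
        return 1.0
    return ratio


if __name__ == "__main__":
    for r in (0.10, 0.15, 0.48, 0.85, 0.95):
        print(r, "->", resolve_ratio(r))
```

Copying a tensor outright in the extreme cases avoids blending in a small, noisy contribution from the other parent, which appears to be what the table means by "zero interpolation noise".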

Significance of Second-Generation Evolution

  1. Proof of "Evolution of Evolution": The first systematic case of recursive evolution (2+ generations) in the open-source model merging community. Darwin V6 + MRI automates the entire process.

  2. 85% GPQA Diamond at 4.5B parameters: +26.4%p over the original 58.6%. This surpasses the 31B-class gemma-4-31B (84.3%) with only 4.5B parameters — an exceptional result in parameter efficiency.

  3. Apache 2.0 + Edge deployment: Preserves the Gemma 4 E4B architecture, enabling deployment on Jetson Orin NX 16GB and consumer GPUs with no commercial restrictions.

  4. Multimodal preservation: Father's vision encoder (~150M) and audio encoder (~300M) are frozen during evolution, maintaining image/video/audio input capabilities.

  5. Community synergy: Mother model creator DavidAU is an active contributor on HuggingFace. Darwin-4B-David symbolizes collaborative evolution within the open-source ecosystem.


Model Specifications

| | |
|---|---|
| Architecture | Gemma 4 E4B Dense |
| Effective Parameters | 4.5B (8B total with embeddings) |
| Layers | 42 |
| Sliding Window | 512 tokens |
| Precision | BF16 |
| Context | 128K |
| Vocabulary | 262K |
| Languages | 140+ |
| Thinking | enable_thinking=True chain-of-thought |
| Vision Encoder | ~150M (image, video) |
| Audio Encoder | ~300M (speech recognition) |
| License | Apache 2.0 |

Usage

Transformers

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the BF16 weights; device_map="auto" places layers on available devices.
tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-4B-David", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-David",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Build the prompt with the chat template; enable_thinking=True turns on chain-of-thought.
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding, then decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
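The snippet above decodes greedily, whereas the GPQA Diamond evaluation used sampling. A possible set of keyword arguments mirroring the evaluation table is sketched below; the dictionary name `EVAL_SAMPLING_KWARGS` is hypothetical, and passing it as `model.generate(**inputs, **EVAL_SAMPLING_KWARGS)` is a suggestion rather than the documented evaluation script.

```python
# Sampling settings taken from the GPQA Diamond evaluation details.
EVAL_SAMPLING_KWARGS = {
    "do_sample": True,       # sample instead of greedy decoding
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
    "max_new_tokens": 4096,
}

if __name__ == "__main__":
    for name, value in EVAL_SAMPLING_KWARGS.items():
        print(f"{name}={value}")
```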

Disable Thinking Mode

python

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

VRAM Requirements

| Setup | VRAM | Status |
|---|---|---|
| BF16 Full Precision | ~16 GB | |
| NVIDIA RTX 4090 24GB | 24 GB | Single GPU, very comfortable |
| NVIDIA RTX 3090 24GB | 24 GB | Single GPU, comfortable |
| NVIDIA RTX 4080 16GB | 16 GB | Single GPU |
| NVIDIA T4 16GB | 16 GB | Cloud/Colab friendly |
| Jetson Orin NX 16GB | 16 GB | Edge deployment ready |
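The ~16 GB figure is consistent with a simple weights-only estimate: the 8B total parameters listed in the specifications times 2 bytes per BF16 value. The sketch below does that arithmetic; it deliberately ignores the KV cache, activations, and framework overhead, so real usage will be somewhat higher. The function name is hypothetical.

```python
def weight_memory_gb(total_params: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory in decimal gigabytes (BF16 uses 2 bytes per parameter)."""
    return total_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print(weight_memory_gb(8e9))  # 8B total parameters in BF16 -> 16.0
```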

Darwin Opus Family

| Model | Gen | Architecture | Parameters | Context | Base | GPQA Diamond |
|---|---|---|---|---|---|---|
| Darwin-4B-David | 🥈 Gen 2 | Dense (E4B) | 4.5B | 128K | Darwin-4B-Opus × DECKARD | 85.0% |
| Darwin-4B-Opus | Gen 1 | Dense (E4B) | 4.5B | 128K | gemma-4-E4B-it | |
| Darwin-9B-Opus | Gen 1 | Dense | 9B | 131K | Qwen3.5-9B | |
| Darwin-31B-Opus | Gen 1 | Dense | 31B | 256K | gemma-4-31B-it | |
| Darwin-35B-A3B-Opus | Gen 1 | MoE | 35B (3B active) | 256K | Qwen3.5-35B-A3B | 90.0% |

Roadmap

  • Full 198-question GPQA Diamond evaluation (maj@8)
  • MTI (Minimal Test-Time Intervention) serving — expected additional +9-11% reasoning accuracy
  • GRPO + TinyLoRA reinforcement learning
  • SSD self-distillation
  • Cross-architecture breeding research (Transformer × Mamba FFN transplantation)

Built By

| | |
|---|---|
| Developer | VIDRAFT |
| Engine | Darwin V6 (Diagnostic-Guided Evolutionary Merge) |
| Generation | Generation 2 (first in Darwin history) |
| Architecture | Gemma-4-E4B Dense |
| License | Apache 2.0 |

Citation

bibtex

@misc{vidraft_darwin_4b_david_2026,
  title        = {Darwin-4B-David: First Second-Generation Evolutionary Merge Model},
  subtitle     = {Recursive Evolution Achieves 85\% GPQA Diamond with 4.5B Parameters},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-4B-David}}
}

This model is introduced in Darwin Family.

Model provider

FINAL-Bench

Modalities

Input

Text, Image

Output

Text
