FINAL-Bench

Darwin-4B-David

License: apache-2.0

Overview

Darwin-4B-David is the first second-generation (Generation 2) model in Darwin history — a model evolved from an already-evolved model.

The first-generation Darwin-4B-Opus (Father) was evolved from the original gemma-4-E4B-it using the Darwin V6 engine. Darwin-4B-David was born by crossbreeding this first-generation evolved model with DavidAU's DECKARD-Expresso-Universe (Mother). This is the first realization of Darwin's core concept: "Merge = Evolve" applied recursively.

The name "David" pays tribute to the Mother model's creator DavidAU, while evoking the biblical David who defeated Goliath — symbolizing how a 4.5B small model challenges models many times its size.

Family Tree

Generation Comparison

	Gen 0 (Original)	Gen 1 (Opus)	Gen 2 (David)
Model	gemma-4-E4B-it	Darwin-4B-Opus	Darwin-4B-David
Parents	Google training	Original + Claude distill	Evolved model + DECKARD
GPQA Diamond	58.6%	—	85.0% (+26.4%p)
Recursive evolution	None	1×	2× (evolution of evolution)
Core genes	General-purpose	Claude reasoning	Reasoning + Creativity + Thinking

Parent Models

Role	Model	Characteristics
Father (Gen-1 Evolved)	FINAL-Bench/Darwin-4B-Opus	Darwin V6 Gen-1, ARC-C 82.92%, Claude Opus reasoning distillation
Mother	DavidAU/DECKARD-Expresso-Universe	BF16, Unsloth deep tuning (5 in-house datasets), Universe logic/insight enhancement, Thinking mode default

Model Diagnostic Scan (MDS)

Left: Father (Darwin-4B-Opus) — REASONING concentration in later layers (dist 0.4), MATH activation throughout. Already optimized through Gen-1 evolution.
Right: Mother (DECKARD-Expresso-Universe) — Strong KOREAN hotspot (dist 1.5), signature of Unsloth deep tuning. Remaining regions show uniform distribution.

Benchmarks

Key Results

Benchmark	gemma-4-E4B-it (Original)	Darwin-4B-David (Gen-2)	Improvement	Conditions
GPQA Diamond	58.6%	85.0%	+26.4%p	Generative, maj@8, 50Q sampling
ARC-Challenge	64.93%	64.93%	±0	25-shot, chat template, BF16, loglikelihood
KMMLU	48.47%	48.46%	-0.01%p

GPQA Diamond Evaluation Details

GPQA Diamond (graduate-level scientific reasoning) was evaluated using generative (thinking mode) evaluation.

Setting	Value
Dataset	Idavidrein/gpqa, gpqa_diamond split
Questions	50 (sampled from 198 total)
Evaluation method	maj@8 (8 independent generations per question, majority vote determines final answer)
Prompt format	Epoch AI standard (`ANSWER: LETTER`)
Thinking mode	Enabled (chat_template, enable_thinking)
max_new_tokens	4,096
temperature	1.0
top_p / top_k

Why maj@8:

A single sampled generation (pass@1 with do_sample) is vulnerable to stochastic variation
8 independent generations with majority voting reflects the model's stable reasoning capability
maj@k is standard practice in frontier model benchmarks (AIME, MATH, etc.)
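
The voting step can be sketched in a few lines of Python. This is a minimal illustration, assuming each generation contains a line of the form `ANSWER: LETTER` as in the prompt format listed above. The helper names (`extract_answer`, `majority_vote`, `maj_at_k_accuracy`) and the tie-breaking rule are hypothetical and are not taken from the evaluation harness actually used.

```python
import re
from collections import Counter

# Matches "ANSWER: B" style lines; GPQA questions have four choices (A-D).
ANSWER_RE = re.compile(r"ANSWER:\s*([A-D])\b")

def extract_answer(generation):
    """Return the last 'ANSWER: X' letter found in a generation, or None."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1] if matches else None

def majority_vote(generations):
    """Return the most frequent extracted answer; unparsable outputs are ignored."""
    answers = [a for a in map(extract_answer, generations) if a is not None]
    if not answers:
        return None
    # Assumption: ties go to the answer that appeared first (Counter ordering).
    return Counter(answers).most_common(1)[0][0]

def maj_at_k_accuracy(per_question_generations, gold_answers):
    """Fraction of questions whose majority-vote answer equals the gold answer."""
    votes = [majority_vote(gens) for gens in per_question_generations]
    correct = sum(v == g for v, g in zip(votes, gold_answers))
    return correct / len(gold_answers)
```

With k = 8, `per_question_generations` would hold eight strings per question, so a single off-track sample cannot flip the final answer on its own.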

Note on 50-question sampling:

GPQA Diamond contains 198 questions total; 50 questions represent 25.3% of the full set
50 questions × 8 samples = 400 total generations; voting over 8 samples stabilizes each per-question answer, while the question-level sample size remains 50
Full 198-question evaluation is planned
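
To give a rough sense of the uncertainty that comes with a 50-question sample, the sketch below computes a 95% Wilson score interval for the reported 85.0%. It assumes the 50 questions behave like a simple random sample and treats each majority-vote outcome as one Bernoulli trial; the function name `wilson_interval` is illustrative. Under those assumptions the interval spans roughly 73% to 92%.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed over n trials."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Reported accuracy on the 50-question sample.
low, high = wilson_interval(0.85, 50)
print(f"95% interval on 50 questions: {low:.3f} to {high:.3f}")

# The same accuracy on the full 198-question set would give a narrower interval.
low_full, high_full = wilson_interval(0.85, 198)
print(f"95% interval on 198 questions: {low_full:.3f} to {high_full:.3f}")
```

The interval width depends on the number of questions, not on the number of samples per question, which is why the full 198-question run would tighten the estimate.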

Note on lm-eval Loglikelihood Results

ARC-Challenge and KMMLU scores are effectively unchanged from the original model (identical on ARC-Challenge, 0.01%p lower on KMMLU). This is characteristic of DARE-TIES merging under loglikelihood scoring: the method only compares token probabilities across the answer choices, so it does not capture differences in generation quality, reasoning chains, or creativity. The effect of evolution shows up in generative evaluation (GPQA Diamond), where the difference emerges during step-by-step reasoning in thinking mode.
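
A toy example may help show why loglikelihood scoring can hide changes in a model. The sketch below picks the answer choice with the highest summed token log-probability, which is the general idea behind loglikelihood multiple-choice evaluation. The function name `loglikelihood_choice` and all numbers are made up for illustration; they are not measurements from either model.

```python
def loglikelihood_choice(choice_token_logprobs):
    """Return the index of the choice with the highest summed log-probability."""
    totals = [sum(token_logprobs) for token_logprobs in choice_token_logprobs]
    return max(range(len(totals)), key=totals.__getitem__)

# Hypothetical per-token log-probabilities for four answer choices (A-D).
original_model = [[-1.2, -0.4], [-0.9, -0.3], [-2.0, -1.1], [-1.5, -0.9]]
merged_model = [[-1.4, -0.5], [-0.7, -0.2], [-2.2, -1.0], [-1.6, -1.0]]

# The probabilities differ, but the ranking (and thus the score) does not.
print(loglikelihood_choice(original_model))
print(loglikelihood_choice(merged_model))
```

Both calls select the same choice even though every probability has shifted, so the benchmark score stays the same unless the top-ranked choice changes.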

MRI-Guided Evolution Recipe

Darwin V6's Model MRI scanned weight divergence across all 42 layers and automatically assigned independent weight ratios to each layer.

Layer Range	Weight	Strategy
Layer 0-3	0.81	Absorb Mother's embedding-adjacent layers
Layer 15-16	0.91	Maximum Mother creativity/character layer reinforcement
Layer 22-25	0.95	Maximum absorption of Mother's KOREAN hotspot
Layer 26-27	0.40	Father priority preservation zone
Layer 30-40	0.48	Father REASONING/MATH preservation
Layer 40-42
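
The per-layer recipe can be read as a lookup from layer index to blend ratio. The sketch below is one possible encoding, under two assumptions that the table does not state outright: the weight is the Mother's share of the blend (suggested by the strategy column), and the layer ranges are inclusive. The names `LAYER_RATIOS`, `mother_ratio`, and `blend` are hypothetical. The Layer 40-42 row is left out because its weight is not given.

```python
# (start_layer, end_layer) -> assumed Mother share, copied from the recipe table.
LAYER_RATIOS = [
    ((0, 3), 0.81),
    ((15, 16), 0.91),
    ((22, 25), 0.95),
    ((26, 27), 0.40),
    ((30, 40), 0.48),
]

def mother_ratio(layer):
    """Return the Mother share for a layer, or None if the table lists no value."""
    for (start, end), ratio in LAYER_RATIOS:
        if start <= layer <= end:
            return ratio
    return None

def blend(father_value, mother_value, ratio):
    """Linear interpolation between one Father and one Mother parameter value."""
    return (1.0 - ratio) * father_value + ratio * mother_value

for layer in (0, 16, 24, 27, 35):
    print(layer, mother_ratio(layer))
```

Layers outside the listed ranges return `None` here rather than a guessed default, since the table does not say how they were handled.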

Parent Comparison

Evolution Parameters

Setting	Value
Merge method	DARE-TIES (direct PyTorch, no mergekit dependency)
Density	0.800 ~ 0.850
Normalization	normalize: true
Evolution method	Darwin mergekit (MRI-guided)
Population size	20
Phase 1 (proxy search)	200 steps
Phase 2 (real merge)	10 steps, top 5 elite
Fitness function	kmmlu_lite (Korean knowledge)
Best fitness
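
For readers unfamiliar with DARE-TIES, the following pure-Python sketch shows the general technique on flat lists of numbers: DARE randomly drops delta entries and rescales the survivors by 1/density, then TIES elects a sign per parameter and averages only the deltas that agree with it. It is not the Darwin V6 implementation, which operates on PyTorch tensors. The choice of a shared base model, the single `ratio` argument, and all function names are assumptions made for illustration.

```python
import random

def dare(delta, density, rng):
    """DARE: keep each entry with probability `density`, rescale kept entries by 1/density."""
    return [d / density if rng.random() < density else 0.0 for d in delta]

def ties_merge(deltas, weights, normalize=True):
    """TIES: per parameter, elect the sign of the weighted sum, then merge agreeing deltas."""
    merged = []
    for values in zip(*deltas):
        weighted = [w * v for w, v in zip(weights, values)]
        sign = 1.0 if sum(weighted) >= 0 else -1.0
        kept = [(w, wv) for w, wv in zip(weights, weighted) if wv * sign > 0]
        if not kept:
            merged.append(0.0)
            continue
        numerator = sum(wv for _, wv in kept)
        # normalize=True divides by the total weight of the agreeing parents.
        denominator = sum(w for w, _ in kept) if normalize else 1.0
        merged.append(numerator / denominator)
    return merged

def dare_ties(base, father, mother, ratio, density, seed=0):
    """Merge two fine-tuned parameter lists relative to an assumed common base."""
    rng = random.Random(seed)
    father_delta = dare([f - b for f, b in zip(father, base)], density, rng)
    mother_delta = dare([m - b for m, b in zip(mother, base)], density, rng)
    merged = ties_merge([father_delta, mother_delta], [1.0 - ratio, ratio])
    return [b + d for b, d in zip(base, merged)]

# density=1.0 disables dropping, which makes the sign election easy to follow.
print(dare_ties([0.0, 0.0, 0.0], [1.0, -1.0, 2.0], [1.0, 1.0, -4.0], ratio=0.5, density=1.0))
```

With the density of 0.800 to 0.850 listed above, roughly 15 to 20 percent of delta entries would be dropped per parent and the rest scaled up to compensate.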

Darwin V6 vs Conventional Merging

Capability	mergekit (DARE-TIES)	Darwin V6
Implementation	Library call (mergekit CLI)	Direct PyTorch tensor operations, no external dependency
Ratio selection	Uniform ratio across all tensors	Per-tensor ratio from MDS diagnostic (independent ratios per tensor)
Pre-merge analysis	None	Static tensor profiling (entropy, std, norm) + probe-based functional importance (5 probes)
Transplant	Not supported	ratio < 0.15 → Father 100%, ratio > 0.85 → Mother 100% (zero interpolation noise)
Post-merge validation	Benchmark score only	Layer-by-layer Health Check: child vs both parents, interference and function loss detection
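
The transplant rule in the table can be sketched as a thresholding step applied to each ratio before blending. The thresholds 0.15 and 0.85 come from the table; the function names and the assumption that the ratio is the Mother's share are illustrative, and the table does not say whether the recipe weights listed earlier are values before or after this step.

```python
def effective_ratio(ratio, low=0.15, high=0.85):
    """Snap extreme ratios to a pure copy of one parent; keep mid-range ratios as-is."""
    if ratio < low:
        return 0.0  # Father 100%
    if ratio > high:
        return 1.0  # Mother 100%
    return ratio

def merge_value(father_value, mother_value, ratio):
    """Blend one parameter value after applying the transplant thresholds."""
    r = effective_ratio(ratio)
    return (1.0 - r) * father_value + r * mother_value

for ratio in (0.10, 0.40, 0.81, 0.91, 0.95):
    print(ratio, effective_ratio(ratio))
```

Copying a tensor wholesale at the extremes avoids mixing in a small fraction of the other parent, which is what the table calls zero interpolation noise.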

Significance of Second-Generation Evolution

Proof of "Evolution of Evolution": The first systematic case of recursive evolution (2+ generations) in the open-source model merging community. Darwin V6 + MRI automates the entire process.
85% GPQA Diamond at 4.5B parameters: +26.4%p over the original 58.6% (50-question sample, maj@8). This surpasses the 31B-class gemma-4-31B (84.3%) with only 4.5B effective parameters, an exceptional result in parameter efficiency.
Apache 2.0 + Edge deployment: Preserves the Gemma 4 E4B architecture, enabling deployment on Jetson Orin NX 16GB and consumer GPUs with no commercial restrictions.
Multimodal preservation: Father's vision encoder (~150M) and audio encoder (~300M) are frozen during evolution, maintaining image/video/audio input capabilities.
Community synergy: Mother model creator DavidAU is an active contributor on HuggingFace. Darwin-4B-David symbolizes collaborative evolution within the open-source ecosystem.

Model Specifications

Architecture	Gemma 4 E4B Dense
Effective Parameters	4.5B (8B total with embeddings)
Layers	42
Sliding Window	512 tokens
Precision	BF16
Context	128K
Vocabulary	262K
Languages	140+
Thinking	enable_thinking=True chain-of-thought

Usage

Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and BF16 weights; device_map="auto" places layers on the available devices.
tokenizer = AutoTokenizer.from_pretrained("FINAL-Bench/Darwin-4B-David", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "FINAL-Bench/Darwin-4B-David",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Build the prompt with the chat template; enable_thinking=True turns on chain-of-thought.
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens, not the prompt.
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Disable Thinking Mode

```python
# Reuses `tokenizer` and `messages` from the Transformers example.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```

VRAM Requirements

Setup	VRAM	Status
BF16 Full Precision	~16 GB
NVIDIA RTX 4090 24GB	24 GB	Single GPU, very comfortable
NVIDIA RTX 3090 24GB	24 GB	Single GPU, comfortable
NVIDIA RTX 4080 16GB	16 GB	Single GPU
NVIDIA T4 16GB	16 GB	Cloud/Colab friendly
Jetson Orin NX 16GB	16 GB
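
The ~16 GB figure for BF16 is consistent with a back-of-envelope estimate from the parameter counts in Model Specifications: 8B total parameters at 2 bytes each. The sketch below does only that arithmetic. It covers weights alone and leaves out activations, the KV cache, and framework overhead, so real usage will be higher; the function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Weights-only memory in decimal gigabytes (BF16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param / 1e9

total_gb = weight_memory_gb(8e9)        # all parameters, including embeddings
effective_gb = weight_memory_gb(4.5e9)  # effective parameters only

print(f"Total weights:     {total_gb:.1f} GB ({8e9 * 2 / 2**30:.1f} GiB)")
print(f"Effective weights: {effective_gb:.1f} GB")
```

On a 16 GB device this leaves little headroom at full BF16 precision, which is worth keeping in mind when choosing context length and batch size.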

Darwin Opus Family

Model	Gen	Architecture	Parameters	Context	Base	GPQA Diamond
Darwin-4B-David	🥈 Gen 2	Dense (E4B)	4.5B	128K	Darwin-4B-Opus × DECKARD	85.0%
Darwin-4B-Opus	Gen 1	Dense (E4B)	4.5B	128K

Roadmap

Full 198-question GPQA Diamond evaluation (maj@8)
MTI (Minimal Test-Time Intervention) serving — expected additional +9-11% reasoning accuracy
GRPO + TinyLoRA reinforcement learning
SSD self-distillation
Cross-architecture breeding research (Transformer × Mamba FFN transplantation)

References

DARE-TIES: DARE, Yu et al., 2023 (https://arxiv.org/abs/2311.03099); TIES-Merging, Yadav et al., 2023 (https://arxiv.org/abs/2306.01708). Re-implemented, not library-dependent
Darwin V6 Engine: https://huggingface.co/spaces/ginigen-ai/DARWIN-V5-BACKUP
FINAL Bench: https://huggingface.co/spaces/FINAL-Bench/Leaderboard
DavidAU DECKARD Series: https://huggingface.co/DavidAU
MTI: Minimal Test-Time Intervention (arXiv:2510.13940)

Built By

Developer	VIDRAFT
Engine	Darwin V6 (Diagnostic-Guided Evolutionary Merge)
Generation	Generation 2 — First in Darwin history
Architecture	Gemma-4-E4B Dense
License	Apache 2.0

Citation

```bibtex
@misc{vidraft_darwin_4b_david_2026,
  title        = {Darwin-4B-David: First Second-Generation Evolutionary Merge Model},
  subtitle     = {Recursive Evolution Achieves 85\% GPQA Diamond with 4.5B Parameters},
  author       = {VIDRAFT},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-4B-David}}
}
```

This model is introduced in Darwin Family.
