Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Technical Definitions

Before describing the methodology, we define the terms used throughout this document. These are not metaphors — they refer to specific, measurable quantities.

TermDefinitionMeasurement
Model MRILayer-level profiling of expert activation patterns and layer importance1K-sample calibration set, per-layer expert activation frequency, routing entropy, probe cosine distance
Dead ExpertA MoE expert rarely selected by the routerActivation frequency < 5% across calibration dataset
Routing EntropyShannon entropy of the router's softmax distributionH = -sum(p_i * log2(p_i)). Healthy range for top-8-of-256: 3.0-4.5 bits
Expert Activation FrequencySelection rate of each expert by the routerCount per expert across 1K samples, normalized to percentage
MRI-Guided MergePer-block merge ratios derived from parent diagnosticsLayers with high dead-expert counts get higher donor weight; healthy layers retain recipient weight
Health CheckPost-merge structural validationLayer-by-layer importance comparison: child vs both parents. Flags interference or function loss
Golden LayerLayer with highest measured importance for a target capabilityIdentified by peak probe cosine distance (e.g., L38 for reasoning)

Benchmark Results

GPQA Diamond (198 Questions, Graduate-Level Reasoning)

ModelAccuracyMultimodalArchitecture
Darwin-35B-A3B-Opus (Child)90.0%Image/VideoQwen3.5-35B-A3B
Mother (Jackrong Claude 4.6 Opus Distilled)85.0%Text-only trainingQwen3.5-35B-A3B (same)
Father (Qwen3.5-35B-A3B Official)84.2%Image/VideoQwen3.5-35B-A3B

Evaluation: SGLang, context 32768, temperature 0, greedy decoding, official GPQA prompt format

MMMLU (Multilingual Knowledge, 29 Languages)

ModelAccuracy
Darwin-35B-A3B-Opus (Child)85.0%
Father (Qwen3.5-35B-A3B Official)85.2%
  • GPQA vs Father: +6.9% relative improvement
  • GPQA vs Mother: +5.9% relative improvement
  • MMMLU: Father-level multilingual knowledge preserved (85.0% vs 85.2%)

Parent Models

Both parents share the identical Qwen3.5-35B-A3B architecture (40 layers, 256 experts, GDN+MoE hybrid). The Mother is a LoRA SFT on the same base — not a different architecture. "Text-only" refers to the training data (Claude 4.6 Opus reasoning chains), not the model structure.

RoleModelArchitectureTraining
FatherQwen/Qwen3.5-35B-A3BQwen3.5-35B-A3BOriginal pre-training + RLHF
MotherJackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-DistilledQwen3.5-35B-A3B (same)LoRA SFT with text-only Claude reasoning chains

Methodology: Darwin V5

Relationship to Existing Tools

Darwin V5 uses mergekit as its merge backend. We do not claim to have invented evolutionary merging — mergekit's evolve feature already provides this capability. What Darwin adds is a three-phase diagnostic pipeline that wraps mergekit with pre-merge profiling and post-merge verification.

Pipeline

markdown

Standard mergekit evolve:
Random initial params --> Evolve --> Best score
Darwin V5:
Phase 0: Profile both parents (40 layers x 256 experts)
| Measure: expert activation frequency, routing entropy,
| probe cosine distance per layer
v
Phase 1: Evolution with diagnostic-informed initial genome
| Search space constrained by dead expert map + layer importance
v
Phase 2: mergekit DARE-TIES merge + benchmark evaluation
| (same merge backend as standard mergekit)
v
Phase 3: Profile the child, compare against both parents
| Detect: interference, function loss, dead expert inheritance
v
Final model

What Darwin V5 Adds Over Standard mergekit evolve

Capabilitymergekit evolveDarwin V5
Merge backendmergekitmergekit (same)
Evolution algorithmCMA-ES / random searchCMA-ES with diagnostic-informed initial population
Pre-merge parent analysisNoneExpert activation frequency, routing entropy, probe cosine distance across 40L x 256E
Initial search spaceFull parameter spaceConstrained by parent diagnostics
Dead expert awarenessNoneDetects dead experts, adjusts density to compensate
Post-merge validationBenchmark score onlyLayer-by-layer child vs parents comparison
Failure diagnosis"Score went down""L23 interference: child importance 2.3x parent, weight conflict at attention heads"

How Diagnostics Changed the Merge

Without diagnostics (V4 blind evolution):

  • ratio=0.481, attn=0.168, ffn=0.841
  • Uniform across all 40 layers

With diagnostics (V5):

  • L0-L37: t=0.599 (Mother 60%), Mother's router
  • L38: t=0.900 (Mother 90%), Mother's router — identified as reasoning core by probe cosine distance
  • L39: t=0.534 (Father 47%), Father's router — preserves output/multimodal routing

The diagnostic profile identified L38 as having the highest cosine distance on REASONING and CODE probes. This informed the per-block strategy rather than relying on blind search to discover it.


Parent Model Diagnostics

Mother: Expert Activation Analysis

MetricValueInterpretation
Router Entropy~1.0 across all layersHealthy — experts evenly distributed among active ones
Dead Expert %50-65% in middle layersLoRA SFT only updated parameter subsets; multimodal/multilingual experts became inactive
Expert Similarity0.001-0.008Healthy — surviving experts remain diverse

L34-L38 shows high cosine distance across REASONING, CODE, LOGIC probes — this is where the Claude distillation concentrated its reasoning patterns.

Father: Baseline Profile

The Father shows uniform expert activation across all 40 layers — all experts active. This makes it suitable as a donor for the Mother's inactive expert slots.

Parent Comparison

  • Above zero: Father stronger — L0-L5 (embedding/early layers)
  • Below zero: Mother stronger — L5-L35 consistent advantage
  • L34-L38: Mother peaks on REASONING and CODE probes
  • L39: Father recovers — output layer

This advantage map directly informed the 3-block merge recipe.


Merge Configuration

yaml

# Darwin V5 diagnostic-guided layer-wise merge
# Method: DARE-TIES via mergekit
# Genome: ratio=0.800 attn=0.320 ffn=0.590 density=0.799
L0-L37: t=0.5988 (Mother 60%) — router from Mother
L38: t=0.9000 (Mother 90%) — reasoning core
L39: t=0.5336 (Father 47%) — router from Father (output routing)
ParameterV4 (Blind)V5 (Guided)Rationale
global_ratio0.4810.800Mother weight increased — diagnostics confirmed her reasoning layers are high quality
attn_ratio0.1680.320More Mother attention — probe data showed reasoning concentration in attention patterns
ffn_ratio0.8410.590More conservative — Father's FFN experts fill dead slots
density_b0.9710.799Reduced — compensates for Mother's 50-65% dead experts

Post-Merge Health Check

Layer-by-layer importance comparison between the child and both parents:

  • Layer 0 (Embedding): Child 0.42, parents 0.35-0.50. No interference.
  • Layers 1-33: Near-zero across all three. Normal for MoE middle layers.
  • Layers 34-39: Importance rises. Child matches or exceeds parents — reasoning transfer confirmed.
  • Layer 39 (Output): Child 0.48, matching parents. Output intact.

No interference detected. No function loss detected.


Inherited Capabilities

From Father (Qwen3.5-35B-A3B):

  • Multimodal: Image and video understanding
  • 201 Languages: Multilingual coverage
  • 262K Context: Native long-context (extendable to 1M via YaRN)
  • Gated DeltaNet + MoE architecture
  • Multi-Token Prediction

From Mother (Claude 4.6 Opus Distilled):

  • Structured step-by-step reasoning within <think> tags
  • Coding agent compatibility
  • Tool calling stability

Performance

MetricValue
Generation Speed147.8 tok/s
EnvironmentSingle NVIDIA H100 93GB NVL, SGLang, BF16
SetupVRAMStatus
BF16 Full Precision65.5 GiB
Single H100 93GB93 GBComfortable
Single A100 80GB80 GBTight
Q4_K_M Quantized~18 GiB
Single RTX 4090 24GB24 GBComfortable

Model Specifications

ArchitectureQwen3.5 MoE (Gated DeltaNet + MoE)
Total Parameters35B
Active Parameters3B per forward pass
Layers40
Layout10 x (3 x GDN-MoE + 1 x Attention-MoE)
Experts256 (8 routed + 1 shared active)
Context Length262,144 native
Languages201
MultimodalImage and Video
LicenseApache 2.0

Usage

SGLang (Recommended)

bash

python -m sglang.launch_server \
--model-path FINAL-Bench/Darwin-35B-A3B-Opus \
--tp 1 \
--mem-fraction-static 0.90 \
--context-length 32768 \
--trust-remote-code

vLLM

bash

vllm serve FINAL-Bench/Darwin-35B-A3B-Opus \
--trust-remote-code \
--enforce-eager

Transformers

python

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained(
"FINAL-Bench/Darwin-35B-A3B-Opus",
trust_remote_code=True,
use_fast=True,
)
model = AutoModelForCausalLM.from_pretrained(
"FINAL-Bench/Darwin-35B-A3B-Opus",
dtype="bfloat16",
device_map="auto",
trust_remote_code=True,
)

Evolution Details

EngineDarwin V5 (Evolutionary Merge + Layer-Level Diagnostics)
Merge Backendmergekit (DARE-TIES)
EvolutionCMA-ES, Phase 1 (200 steps proxy) + Phase 2 (30 steps real benchmark)
Final real_score0.8405
Merge Time181.6 seconds
Merge Commit109838c2
Infrastructure4 x NVIDIA H100 93GB NVL

Acknowledgements

  • Korean Government — GPU Support Program research grant
  • Qwen Team — Qwen3.5-35B-A3B base architecture
  • Jackrong — Claude 4.6 Opus Reasoning Distilled model
  • mergekit — Merge backend infrastructure
  • nohurry, TeichAI — Distillation datasets

Citation

bibtex

@misc{vidraft_darwin_35b_opus,
title = {Darwin-35B-A3B-Opus: Diagnostic-Guided Evolutionary Merge},
author = {VIDRAFT},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus}}
}

FAQ

This model is introduced in Darwin Family.

Model provider

FINAL-Bench

Model tree

Base

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today