FINAL-Bench

Darwin-398B-JGOS

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Overview

Darwin-398B-JGOS is the largest and highest-scoring member of the Darwin family. Built on Qwen 3.5 397B as the base, it transplants the FFN (expert) strengths of multiple high-performance models through the Darwin V9 platform, producing a 397B-parameter Mixture-of-Experts model with ~17B active parameters per token.

It reaches 90.9 % on GPQA Diamond with pure greedy decoding (single sample) โ€” surpassing Darwin-28B-REASON (89.39 %, achieved with the Darwin-DELPHI test-time engine) without using any test-time engine at all. This is the highest GPQA Diamond score in the Darwin family to date.


๐Ÿงฌ Darwin Platform & Research

Darwin is VIDRAFT's measuring-result-driven reasoning model family โ€” approximately 20 official models plus 400+ community derivatives, ranking among the top open models on GPQA.

  • Darwin V9 platform โ€” evolutionary FFN/expert transplant and trust-weighted merging onto large-scale MoE backbones.
  • FINAL Bench โ€” VIDRAFT's evaluation framework.
  • 4-layer Pre-AGI roadmap โ€” Darwin โ†’ AETHER โ†’ PROMETHEUS โ†’ HEPHAESTUS.

๐Ÿงฌ Model Lineage

Table
RoleModelContribution
BaseQwen 3.5 397B (A17B)397B Mixture-of-Experts backbone (~17B active).
FFN transplantDarwin V9 platform (proprietary)Transplants the FFN (expert) strengths of multiple high-performance models onto the base.
ResultDarwin-398B-JGOS (this model)397B MoE โ†’ 90.9 % GPQA Diamond, pure greedy.

The full Darwin V9 merge recipe โ€” source models, weighting, and density โ€” is proprietary and not disclosed (trade secret).


โš™๏ธ Technical Specifications

Table
ComponentValue
ArchitectureQwen3_5MoeForConditionalGeneration (Qwen 3.5 generation MoE)
Parameters~397 B total / ~17 B active (Mixture-of-Experts)
BaseQwen 3.5 397B (A17B)
Precisionbfloat16
Licenseapache-2.0

๐Ÿ”ฌ Core Technique โ€” Darwin V9 Platform

Darwin V9 transplants the FFN (expert) strengths of multiple high-performance models onto a Qwen 3.5 397B MoE base, then applies trust-weighted evolutionary merging.

The source models, merge weights, and density schedule are proprietary and constitute a trade secret; they are not published.


๐Ÿ† Benchmark โ€” GPQA Diamond (198 questions)

GPQA Diamond is a 198-question, PhD-level graduate science reasoning benchmark.

Table
ModelEngineAccuracy
Darwin-28B-OpusStandard88.89 % (176 / 198)
Darwin-28B-REASONDarwin-DELPHI (test-time)89.39 % (177 / 198)
Darwin-398B-JGOSGreedy (single-sample, no engine)๐Ÿฅ‡ 90.9 % (180 / 198)

Reproducible evaluation settings:

  • Greedy decoding (temperature = 0), single sample โ€” no voting / self-consistency / test-time engine
  • Max generation: 16,384 tokens
  • Answer options shuffled (seed = 42)
  • Hardware: NVIDIA B200 (tensor-parallel 2 ร— pipeline-parallel 3, 6 GPUs)
  • Inference engine: vLLM, bfloat16, max_model_len = 18432

Darwin-398B-JGOS achieves the family's top GPQA Diamond score using nothing but greedy decoding โ€” no Darwin-DELPHI, no majority voting.


๐Ÿ“Š Benchmark โ€” MMLU-Pro (12,032 questions)

MMLU-Pro is a substantially harder successor to MMLU โ€” 10 answer choices (vs 4) and 12,032 reasoning-focused questions across 14 domains.

Darwin-398B-JGOS scores 88.08 % (10,598 / 12,032) with 5-shot Chain-of-Thought and pure greedy decoding (temperature = 0, single sample) โ€” top-tier territory.

Table
CategoryAccuracyCategoryAccuracy
Math95.9 %Computer Science88.5 %
Biology94.7 %Psychology87.7 %
Physics92.6 %Philosophy86.6 %
Chemistry92.3 %Engineering85.3 %
Business92.0 %Other83.4 %
Economics89.3 %Health81.8 %
History80.1 %Law75.3 %
Overall๐Ÿฅ‡ 88.08 %

Reproducible evaluation settings:

  • 5-shot Chain-of-Thought, greedy decoding (temperature = 0), single sample โ€” no voting / self-consistency / test-time engine
  • Max generation: 14,000 tokens
  • Hardware: NVIDIA B200 (tensor-parallel 2 ร— pipeline-parallel 3, 6 GPUs)
  • Inference engine: vLLM, bfloat16, max_model_len = 18432

Strongest in STEM โ€” Math 95.9 %, Biology 94.7 %, Physics 92.6 %, Chemistry 92.3 %.


๐Ÿš€ Usage (vLLM)

bash

vllm serve FINAL-Bench/Darwin-398B-JGOS --tensor-parallel-size 2 --pipeline-parallel-size 3 --dtype bfloat16 --trust-remote-code

  • Graduate-level STEM reasoning (GPQA / science qualifying exams)
  • Mathematical problem solving
  • Complex multi-step chain-of-thought
  • Code generation and debugging
  • Bilingual reasoning (strong English + Korean; also Chinese / Japanese)

โš ๏ธ Limitations

  • 397B MoE in bfloat16 requires multi-GPU serving (e.g. B200 ร—6 with TP2ร—PP3).
  • The 90.9 % figure is a single-run greedy measurement on GPQA Diamond (198 items).
  • Reasoning traces can be verbose โ€” control with max tokens.

๐Ÿ“š Citation

bibtex

@misc{darwin397b_jgos_2026,
title = {Darwin-398B-JGOS: Darwin V9 Platform FFN Transplant on a 397B MoE Base},
author = {FINAL-Bench / Darwin Research Team},
year = {2026},
howpublished = {https://huggingface.co/FINAL-Bench/Darwin-398B-JGOS},
note = {Darwin V9 - 90.9 percent GPQA Diamond (greedy, single-sample)}
}

  • Darwin-28B-REASON โ€” RTD + Darwin-DELPHI, GPQA 89.39 %
  • Darwin-28B-Opus โ€” base, GPQA 88.89 % (HF-official GPQA top tier)
  • Darwin-36B-Opus โ€” MoE 36B, GPQA 88.4 %
  • Darwin-27B-Opus โ€” 27B dense, GPQA 86.9 %
  • Darwin-9B-NEG โ€” 9B Negentropy, GPQA 84.3 %

Darwin-398B-JGOS ยท Darwin V9 Platform ยท 90.9 % GPQA Diamond (pure greedy) ยท FINAL-Bench

Model provider

FINAL-Bench

Model tree

Base

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today