FINAL-Bench
Darwin-398B-JGOS
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Overview
Darwin-398B-JGOS is the largest and highest-scoring member of the Darwin family. Built on Qwen 3.5 397B as the base, it transplants the FFN (expert) strengths of multiple high-performance models through the Darwin V9 platform, producing a 397B-parameter Mixture-of-Experts model with ~17B active parameters per token.
It reaches 90.9 % on GPQA Diamond with pure greedy decoding (single sample) โ surpassing Darwin-28B-REASON (89.39 %, achieved with the Darwin-DELPHI test-time engine) without using any test-time engine at all. This is the highest GPQA Diamond score in the Darwin family to date.
๐งฌ Darwin Platform & Research
Darwin is VIDRAFT's measuring-result-driven reasoning model family โ approximately 20 official models plus 400+ community derivatives, ranking among the top open models on GPQA.
- Darwin V9 platform โ evolutionary FFN/expert transplant and trust-weighted merging onto large-scale MoE backbones.
- FINAL Bench โ VIDRAFT's evaluation framework.
- 4-layer Pre-AGI roadmap โ Darwin โ AETHER โ PROMETHEUS โ HEPHAESTUS.
๐งฌ Model Lineage
| Role | Model | Contribution |
|---|---|---|
| Base | Qwen 3.5 397B (A17B) | 397B Mixture-of-Experts backbone (~17B active). |
| FFN transplant | Darwin V9 platform (proprietary) | Transplants the FFN (expert) strengths of multiple high-performance models onto the base. |
| Result | Darwin-398B-JGOS (this model) | 397B MoE โ 90.9 % GPQA Diamond, pure greedy. |
The full Darwin V9 merge recipe โ source models, weighting, and density โ is proprietary and not disclosed (trade secret).
โ๏ธ Technical Specifications
| Component | Value |
|---|---|
| Architecture | Qwen3_5MoeForConditionalGeneration (Qwen 3.5 generation MoE) |
| Parameters | ~397 B total / ~17 B active (Mixture-of-Experts) |
| Base | Qwen 3.5 397B (A17B) |
| Precision | bfloat16 |
| License | apache-2.0 |
๐ฌ Core Technique โ Darwin V9 Platform
Darwin V9 transplants the FFN (expert) strengths of multiple high-performance models onto a Qwen 3.5 397B MoE base, then applies trust-weighted evolutionary merging.
The source models, merge weights, and density schedule are proprietary and constitute a trade secret; they are not published.
๐ Benchmark โ GPQA Diamond (198 questions)
GPQA Diamond is a 198-question, PhD-level graduate science reasoning benchmark.
| Model | Engine | Accuracy |
|---|---|---|
| Darwin-28B-Opus | Standard | 88.89 % (176 / 198) |
| Darwin-28B-REASON | Darwin-DELPHI (test-time) | 89.39 % (177 / 198) |
| Darwin-398B-JGOS | Greedy (single-sample, no engine) | ๐ฅ 90.9 % (180 / 198) |
Reproducible evaluation settings:
- Greedy decoding (temperature = 0), single sample โ no voting / self-consistency / test-time engine
- Max generation: 16,384 tokens
- Answer options shuffled (seed = 42)
- Hardware: NVIDIA B200 (tensor-parallel 2 ร pipeline-parallel 3, 6 GPUs)
- Inference engine: vLLM, bfloat16,
max_model_len = 18432
Darwin-398B-JGOS achieves the family's top GPQA Diamond score using nothing but greedy decoding โ no Darwin-DELPHI, no majority voting.
๐ Benchmark โ MMLU-Pro (12,032 questions)
MMLU-Pro is a substantially harder successor to MMLU โ 10 answer choices (vs 4) and 12,032 reasoning-focused questions across 14 domains.
Darwin-398B-JGOS scores 88.08 % (10,598 / 12,032) with 5-shot Chain-of-Thought and pure greedy decoding (temperature = 0, single sample) โ top-tier territory.
| Category | Accuracy | Category | Accuracy |
|---|---|---|---|
| Math | 95.9 % | Computer Science | 88.5 % |
| Biology | 94.7 % | Psychology | 87.7 % |
| Physics | 92.6 % | Philosophy | 86.6 % |
| Chemistry | 92.3 % | Engineering | 85.3 % |
| Business | 92.0 % | Other | 83.4 % |
| Economics | 89.3 % | Health | 81.8 % |
| History | 80.1 % | Law | 75.3 % |
| Overall | ๐ฅ 88.08 % |
Reproducible evaluation settings:
- 5-shot Chain-of-Thought, greedy decoding (temperature = 0), single sample โ no voting / self-consistency / test-time engine
- Max generation: 14,000 tokens
- Hardware: NVIDIA B200 (tensor-parallel 2 ร pipeline-parallel 3, 6 GPUs)
- Inference engine: vLLM, bfloat16,
max_model_len = 18432
Strongest in STEM โ Math 95.9 %, Biology 94.7 %, Physics 92.6 %, Chemistry 92.3 %.
๐ Usage (vLLM)
bash
vllm serve FINAL-Bench/Darwin-398B-JGOS --tensor-parallel-size 2 --pipeline-parallel-size 3 --dtype bfloat16 --trust-remote-code
๐ฏ Recommended Use-Cases
- Graduate-level STEM reasoning (GPQA / science qualifying exams)
- Mathematical problem solving
- Complex multi-step chain-of-thought
- Code generation and debugging
- Bilingual reasoning (strong English + Korean; also Chinese / Japanese)
โ ๏ธ Limitations
- 397B MoE in bfloat16 requires multi-GPU serving (e.g. B200 ร6 with TP2รPP3).
- The 90.9 % figure is a single-run greedy measurement on GPQA Diamond (198 items).
- Reasoning traces can be verbose โ control with max tokens.
๐ Citation
bibtex
@misc{darwin397b_jgos_2026,title = {Darwin-398B-JGOS: Darwin V9 Platform FFN Transplant on a 397B MoE Base},author = {FINAL-Bench / Darwin Research Team},year = {2026},howpublished = {https://huggingface.co/FINAL-Bench/Darwin-398B-JGOS},note = {Darwin V9 - 90.9 percent GPQA Diamond (greedy, single-sample)}}
๐ Related Darwin Models
- Darwin-28B-REASON โ RTD + Darwin-DELPHI, GPQA 89.39 %
- Darwin-28B-Opus โ base, GPQA 88.89 % (HF-official GPQA top tier)
- Darwin-36B-Opus โ MoE 36B, GPQA 88.4 %
- Darwin-27B-Opus โ 27B dense, GPQA 86.9 %
- Darwin-9B-NEG โ 9B Negentropy, GPQA 84.3 %
Darwin-398B-JGOS ยท Darwin V9 Platform ยท 90.9 % GPQA Diamond (pure greedy) ยท FINAL-Bench
Model provider
FINAL-Bench
Model tree
Base
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information