ramankrishna10

npc-nano-0.5b-v2-math

License: apache-2.0

What this model is

  • Base: NPC Nano 0.5B (from-scratch, 501M params, 8.93B tokens)
  • Continued pretraining: +15B tokens (60% open-web-math, 30% arxiv, both from EleutherAI/proof-pile-2; 10% fineweb-edu anti-forgetting buffer)
  • Total training: ~24B tokens
  • License: Apache 2.0 (lineage preserved)
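The token accounting above can be checked with a short calculation. This is a minimal sketch that assumes the stated percentages are shares of the 15B continued-pretraining tokens (rather than shares of documents); the variable and function names are illustrative and do not come from the training code.

```python
# Token accounting for the continued-pretraining run, using only figures
# stated in the model card. All names here are illustrative.
BASE_TOKENS = 8.93e9       # from-scratch pretraining budget of the base model
CONTINUED_TOKENS = 15e9    # additional tokens in this run

# Stated sampling mix; assumed to be token shares of the 15B budget.
MIX = {
    "open-web-math": 0.60,
    "arxiv": 0.30,
    "fineweb-edu": 0.10,
}


def tokens_per_source(total_tokens, mix):
    """Split a token budget across sources according to mixture weights."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return {name: total_tokens * weight for name, weight in mix.items()}


budget = tokens_per_source(CONTINUED_TOKENS, MIX)
total_tokens = BASE_TOKENS + CONTINUED_TOKENS

for name, n in budget.items():
    print(f"{name:>14}: {n / 1e9:.1f}B tokens")
print(
    f"total: {total_tokens / 1e9:.2f}B tokens "
    f"({CONTINUED_TOKENS / BASE_TOKENS:.2f}x the base budget added, "
    f"{total_tokens / BASE_TOKENS:.2f}x overall)"
)
```

Under that assumption the split is roughly 9B / 4.5B / 1.5B tokens, and the total of 23.93B matches the stated ~24B.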

Results

The headline finding: 15B additional math-dense tokens (≈1.7× the original 8.93B-token budget, bringing total training to ≈2.7× the original; ≈6× the math weighting) produced no measurable GSM8K improvement at 0.5B parameters.

| Metric | v1 base | v2-math (+15B) | Δ |
|---|---|---|---|
| GSM8K (5-shot, flex) | 1.67% | 1.82% | +0.15pp (within noise) |
| ARC-easy | 49.96% | 58.71% | +8.75pp |
| HellaSwag | 36.82% | 36.77% | −0.05pp |
| PIQA | 65.02% | 65.45% | +0.43pp |
| OpenBookQA | 30.00% | 30.00% | +0.00pp |
| WinoGrande | 49.49% | 50.28% | +0.79pp |

GSM8K trajectory across checkpoints: 1.67% (v1) → 1.82% (+3B) → 1.90% (+7B) → 1.82% (+15B). Every delta is within one standard error (±0.37pp).
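The "within one standard error" claim can be sanity-checked with a binomial standard error. The sketch below assumes the scores come from the standard GSM8K test split of 1,319 problems and that a simple binomial estimate is close to what the evaluation harness reports; neither is stated in the card, and the names are illustrative.

```python
import math

# Assumption: accuracy was measured on the standard GSM8K test split.
N_TEST = 1319

# GSM8K accuracy (fraction correct) at each checkpoint, from the trajectory above.
trajectory = {"v1": 0.0167, "+3B": 0.0182, "+7B": 0.0190, "+15B": 0.0182}


def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# Standard error at the final checkpoint, in percentage points (about 0.37pp).
se_pp = 100 * binomial_se(trajectory["+15B"], N_TEST)

# Change relative to the v1 baseline, in percentage points.
baseline = trajectory["v1"]
deltas_pp = {
    name: 100 * (acc - baseline)
    for name, acc in trajectory.items()
    if name != "v1"
}

print(f"standard error: ±{se_pp:.2f}pp")
for name, delta in deltas_pp.items():
    verdict = "within" if abs(delta) < se_pp else "outside"
    print(f"{name:>4}: {delta:+.2f}pp ({verdict} one standard error)")
```

At an accuracy this close to zero, a 0.15pp change corresponds to about two additional correct answers out of 1,319, which is why none of the checkpoints can be distinguished from the baseline.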

Interpretation

The one real signal is ARC-easy +8.75pp (science multiple-choice), which saturated by +3B tokens. The model demonstrably absorbed the math/science distribution — it improved at recognizing scientific answers — but did not improve at generating multi-step arithmetic solutions (GSM8K).

This sharpens the capacity-bottleneck argument from the v1 paper: at 0.5B parameters, the GSM8K ceiling is not purely a matter of insufficient math exposure during pretraining. Adding substantially more math content moved some reasoning capabilities (science MCQ) but not arithmetic generation. The bottleneck is the model's capacity for the specific skill of multi-step number generation, not its exposure to math content.

Intended use

This model is primarily of interest for:

  • Reproducing the continued-pretraining experiment in the NPC Nano paper's Future Work section
  • Studying capability transfer vs. non-transfer at small scale
  • The improved science-MCQ capability (ARC-easy) if that specific capability is useful

For general use, the v1 SFT model remains the recommended NPC Nano artifact.

Honest notes

  • Held-out perplexity was not measured for this run (the validation split was cleaned during the multi-week training). The 6-task lm-eval suite is the authoritative signal.
  • Training spanned two W&B runs due to one pod restart, cleanly recovered via checkpoint resume. Training was contiguous.
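For readers who want to re-run the 6-task suite, one possible setup with the lm-evaluation-harness Python API is sketched below. It is not the author's evaluation script. The Hugging Face repo id is assumed from the provider and model name on this page, the few-shot counts other than GSM8K's 5-shot are assumed to be 0, and `build_eval_requests` and `run_suite` are hypothetical helper names. If the checkpoint uses a custom architecture, extra model arguments may be needed.

```python
# Hypothetical reproduction sketch; not the author's evaluation script.
MODEL_ID = "ramankrishna10/npc-nano-0.5b-v2-math"  # assumed Hugging Face repo id

# lm-evaluation-harness task names for the six benchmarks in the results table.
# Only GSM8K's 5-shot setting is stated in the card; the rest are assumed 0-shot.
TASK_SHOTS = {
    "gsm8k": 5,
    "arc_easy": 0,
    "hellaswag": 0,
    "piqa": 0,
    "openbookqa": 0,
    "winogrande": 0,
}


def build_eval_requests(model_id=MODEL_ID, task_shots=TASK_SHOTS):
    """Group tasks by few-shot count into keyword arguments for simple_evaluate."""
    by_shots = {}
    for task, shots in task_shots.items():
        by_shots.setdefault(shots, []).append(task)
    return [
        {
            "model": "hf",
            "model_args": f"pretrained={model_id}",
            "tasks": tasks,
            "num_fewshot": shots,
        }
        for shots, tasks in sorted(by_shots.items())
    ]


def run_suite():
    """Run every request and merge the per-task results into one dict."""
    import lm_eval  # requires `pip install lm-eval`; imported lazily on purpose

    results = {}
    for kwargs in build_eval_requests():
        output = lm_eval.simple_evaluate(**kwargs)
        results.update(output["results"])
    return results


if __name__ == "__main__":
    for task, metrics in run_suite().items():
        print(task, metrics)
```

The "flex" label in the results table presumably refers to the harness's flexible answer-extraction filter for GSM8K; the returned metrics for that task should be read with that filter in mind.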

Citation

Built on the methodology documented in:

@misc{bachu2026npcnano,
  author    = {Bachu, Rama Krishna},
  title     = {NPC Nano 0.5B: From-Scratch Pretraining and the Post-Training Capability Ceiling at Sub-1B Parameters},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20349362},
  url       = {https://doi.org/10.5281/zenodo.20349362}
}

Attribution

Continued-pretraining data: EleutherAI/proof-pile-2 (open-web-math, arxiv subsets) and HuggingFaceFW/fineweb-edu.

Author: Rama Krishna Bachu / Bottensor (Independent Research). ORCID 0009-0000-1298-0681.

Model provider

ramankrishna10

Model tree

Base

ramankrishna10/npc-nano-0.5b-base

Fine-tuned

this model

Modalities

Input

Text

Output

Text
