gabrielebeltramo/NemotronH-300M-stories API & Inference Endpoint

Quick start

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gabrielebeltramo/NemotronH-300M-stories"

INPUT_PROMPTS = [
    "In a land full of wonder",
    "A loud cry broke the silence of the woods",
]

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    tokenizer.padding_side = "left"

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        dtype=torch.bfloat16,
        device_map="auto",
    )
    model.eval()

    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": input_prompt}],
            tokenize=False,
            add_generation_prompt=False,
        )
        for input_prompt in INPUT_PROMPTS
    ]

    input_ids_attn_mask = tokenizer(
        prompts,
        padding=True,
        truncation=False,
        add_special_tokens=False,
        return_tensors="pt",
    )
    input_ids = input_ids_attn_mask["input_ids"]
    attention_mask = input_ids_attn_mask["attention_mask"]

    input_ids = input_ids.to(model.device)
    attention_mask = attention_mask.to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1024,
            num_beams=5,
            do_sample=True,
            temperature=0.15,
            use_cache=True,
            cache_implementation="dynamic",
        )

    for idx, input_prompt in enumerate(INPUT_PROMPTS):
        generated_token_ids = output_ids[idx][input_ids.shape[-1] :]
        generated_text = tokenizer.decode(generated_token_ids, skip_special_tokens=True)

        print(f"{input_prompt=}")
        print(f"{generated_text=}")
        print("-" * 30)

Intended use

✅ In scope	❌ Out of scope
Short creative / children's story generation	Factual question answering
Experimenting with the NemotronH architecture	Safety-critical applications
Studying small-model story coherence	Long-document generation (> ~1 k tokens)
Educational / research use	Languages other than English

This model was trained for research and experimentation, not production deployment. It has not been instruction-tuned, RLHF-aligned, or red-teamed.

Model architecture

NemotronH-300M-stories uses the NemotronHForCausalLM architecture — a hybrid that interleaves Mamba SSM layers, Mixture-of-Experts (MoE) feed-forward blocks, and occasional full attention layers.

Layer schedule

markdown
mamba → moe → mamba → moe → mamba → attention → moe → mamba → moe → mamba → moe → mamba
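
The training log further down reports the same schedule in compact form as `hybrid_override_pattern='MEMEM*EMEMEM'`. The sketch below expands that string into layer names. The mapping (M = Mamba, E = MoE, * = attention) is an assumption inferred by lining the pattern up with the schedule above, and `expand_pattern` is a hypothetical helper, not a transformers function.

```python
# Assumed mapping, inferred by aligning the logged pattern with the layer schedule.
LAYER_TYPES = {"M": "mamba", "E": "moe", "*": "attention"}


def expand_pattern(pattern: str) -> list[str]:
    """Expand a compact hybrid pattern string into one layer name per character."""
    return [LAYER_TYPES[ch] for ch in pattern]


layers = expand_pattern("MEMEM*EMEMEM")
print(" -> ".join(layers))
print(len(layers), "layers")
```

Under this mapping the pattern yields 12 layers with a single attention layer in sixth position, which matches the schedule and the "Number of layers" entry below.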

Key dimensions

Hyperparameter	Value
Hidden size	512
Intermediate size (MLP / MoE)	2 048
Number of layers	12
Attention heads	8
Head dim	64
Mamba heads / head dim	16 / 32
Routed experts	16
Active experts per token	2
Shared experts	1
Max position embeddings	8 192
Chunk size (Mamba)	128
Conv kernel (Mamba)	4
Dtype	bfloat16
Approx. total parameters	~321 M (321,141,536 per the training log)

MoE uses relu² activations; Mamba SSM uses silu.

Training data

Field	Detail
Dataset	SimpleStories
Language	English
Domain	Short children's / simple narrative stories
Preprocessing	Standard tokenization; no deduplication beyond what the dataset provides

The model was trained from scratch — no pretrained checkpoint was used as a starting point.

Training procedure

A custom single-GPU training loop was used (no Accelerate or DeepSpeed).

Setting	Value
Hardware	NVIDIA A4000 16 GB
Precision	bfloat16 weights
Optimizer	AdamW, weight_decay = 0.001
Num epochs	1
Batch size	1 (micro-batch)
Gradient accumulation	128 steps (effective batch ≈ 128)
Peak LR	2e-4
Initial LR (warmup start)	3e-5
Minimum LR (cosine end)	5e-5
LR schedule	Linear warmup → cosine annealing
Step time (128 micro-batches)	~7.4 s
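
A minimal sketch of the schedule described in the table, linear warmup from the initial LR to the peak followed by cosine annealing down to the minimum. The three learning rates come from the table. The function name `get_lr`, its signature, and the `warmup_steps` / `total_steps` values are assumptions for illustration; the actual `_get_lr` used in training is not shown in this README.

```python
import math

INITIAL_LR = 3e-5  # warmup start (from the table above)
PEAK_LR = 2e-4     # reached at the end of warmup
MIN_LR = 5e-5      # floor reached at the end of cosine annealing


def get_lr(step: int, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from INITIAL_LR to PEAK_LR, then cosine decay to MIN_LR."""
    if step < warmup_steps:
        return INITIAL_LR + (PEAK_LR - INITIAL_LR) * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# warmup_steps and total_steps are illustrative values, not those of the real run.
for step in (0, 500, 1_000, 8_000, 15_000):
    print(step, f"{get_lr(step, warmup_steps=1_000, total_steps=15_000):.2e}")
```

With this shape the final learning rate equals the minimum LR, consistent with the `Final LR => 5.00e-05` line in the training logs.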

Estimated cost

Initial experimentation: ~$9
Main training run: ~$9 (about 33 hours on GPU; the log reports a total train time of 119,567.91 seconds)
Miscellaneous experiments: ~$0–$10
Total: approximately $18–$28

Training code

python
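import gc
import time

import torch

# Excerpt from the training script. model, optimizer, train_dataloader, device,
# logger, n_epochs, grad_accum_steps, warmup_steps, max_grad_norm, eval_freq,
# and _get_lr are defined elsewhere in the script.
train_losses, track_lrs = [], []
tokens_seen, global_step = 0, 0
accum_counter = 0  # micro-batches accumulated since the last optimizer step
accum_loss = 0.0   # running sum of scaled losses, i.e. the mean loss per step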

model.train(True)
for epoch in range(n_epochs):
    start_batch_time = time.time()
    for (input_ids,) in train_dataloader:
        input_ids_cuda = input_ids.to(device)

        # --- Forward + scaled loss ---
        output = model(input_ids=input_ids_cuda, labels=input_ids_cuda)
        # Divide loss by accum steps so gradients are averaged,
        # not summed, across the accumulated micro-batches.
        loss = output.loss / grad_accum_steps
        loss.backward()

        accum_loss += loss.item()
        accum_counter += 1
        tokens_seen += input_ids.numel()

        del output, loss, input_ids_cuda

        # --- Optimizer step only every grad_accum_steps micro-batches ---
        if accum_counter == grad_accum_steps:
            # --- LR schedule ---
            lr = _get_lr(global_step)
            for param_group in optimizer.param_groups:
                param_group["lr"] = lr
            track_lrs.append(lr)

            # --- Grad clipping (after warmup) ---
            if global_step >= warmup_steps:
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(),
                    max_norm=max_grad_norm,
                )

            optimizer.step()
            optimizer.zero_grad()

            # --- Logging ---
            if (global_step % eval_freq) == 0:
                train_losses.append(accum_loss)
                logger.info(
                    f"Ep {epoch + 1} (Step {global_step:06d}) | "
                    f"Loss: {accum_loss:.4f} | "
                    f"LR: {lr:.2e} | "
                    f"Tokens seen: {tokens_seen:,}"
                )

            if (global_step % 10) == 0:
                gc.collect()
                torch.cuda.synchronize()
                torch.cuda.empty_cache()

            global_step += 1
            accum_counter = 0
            accum_loss = 0.0
            end_batch_time = time.time()
            if (global_step % eval_freq) == 0:
                logger.info(
                    f"Time for {grad_accum_steps} micro-batches accumulated and then "
                    f"used in one optimizer.step() was {end_batch_time - start_batch_time:.2f} seconds"
                )
            start_batch_time = time.time()
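
The loop above divides each micro-batch loss by `grad_accum_steps` so that the accumulated gradient equals the gradient of the mean loss over all accumulated micro-batches. The pure-Python sketch below illustrates that equivalence on a one-parameter model. The model, the data, and the helper `grad_squared_error` are made up for illustration and are not part of the training script.

```python
def grad_squared_error(w: float, x: float, y: float) -> float:
    """d/dw of (w * x - y) ** 2 for a one-parameter linear model."""
    return 2.0 * (w * x - y) * x


w = 0.5
micro_batches = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.5)]  # toy (x, y) pairs
grad_accum_steps = len(micro_batches)

# Accumulate gradients of loss / grad_accum_steps, one micro-batch at a time.
accumulated = 0.0
for x, y in micro_batches:
    accumulated += grad_squared_error(w, x, y) / grad_accum_steps

# Gradient of the mean loss over the same samples, computed in one go.
full_batch = sum(grad_squared_error(w, x, y) for x, y in micro_batches) / grad_accum_steps

print(accumulated, full_batch)
```

Both values agree, so the optimizer step behaves as if it had seen one batch of 128 samples, which is what the "effective batch ≈ 128" entry in the table refers to.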

Training logs

text
2026-06-02 17:38:37,118 - __main__ - INFO - Starting training NemotronH
2026-06-02 17:38:38,380 - __main__ - INFO - tokenizer.vocab_size=131072
2026-06-02 17:38:38,381 - __main__ - INFO - nemotron_config.num_hidden_layers=12
2026-06-02 17:38:38,382 - __main__ - INFO - nemotron_config.hybrid_override_pattern='MEMEM*EMEMEM'
2026-06-02 17:38:42,093 - __main__ - INFO - Total number of parameters: 321_141_536
--snip--
2026-06-04 03:23:30,844 - __main__ - INFO - Time for 128 micro-batches accumulated and then used in one optimizer.step() was 7.56 seconds
2026-06-04 03:23:38,414 - __main__ - INFO - Ep 1 (Step 015584) | Loss: 2.0747 | LR: 5.00e-05 | Tokens seen: 589,322,724
2026-06-04 03:24:30,851 - __main__ - INFO - Total train time: 119567.91 seconds
2026-06-04 03:24:30,851 - __main__ - INFO - Final loss  => 2.0747
2026-06-04 03:24:30,851 - __main__ - INFO - Final LR    => 5.00e-05
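
The final log lines are enough to estimate wall-clock time and average throughput. The short calculation below uses only the two figures printed above; the variable names are illustrative.

```python
# Figures copied from the final log lines above.
total_train_seconds = 119_567.91
tokens_seen = 589_322_724

train_hours = total_train_seconds / 3600
tokens_per_second = tokens_seen / total_train_seconds

print(f"Wall-clock training time: {train_hours:.1f} h")
print(f"Average throughput: {tokens_per_second:,.0f} tokens/s")
```

This works out to roughly 33 hours of training at a little over 4,900 tokens per second on the single GPU.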

Evaluation

⚠️ Formal benchmark evaluation is pending. The notes below are qualitative.

Informal inspection shows the model produces grammatically coherent short stories with simple vocabulary, consistent with the SimpleStories training distribution. Stories typically resolve within a few paragraphs.

Known limitations

Trained exclusively on a simple English children's story corpus — vocabulary and sentence complexity are intentionally constrained.
No formal evaluation against standard LM benchmarks (perplexity, HellaSwag, etc.) has been run yet; contributions welcome.
May hallucinate character names or produce repetitive endings at low temperatures.
Not suitable for any task outside short story generation.

Requirements

Tested on Python 3.12 with the following package versions:

markdown
causal-conv1d==1.6.2.post1
kaggle==2.1.2
kagglehub==1.0.1
mamba-ssm==2.3.1
numpy==2.3.5
nvidia-cuda-runtime==13.0.96
tokenizers==0.22.2
torch==2.11.0+cu130
transformers==5.9.0
triton==3.6.0

Minimum VRAM: ~1 GB (bfloat16). CPU inference is possible but slow.
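
As a rough sanity check on the VRAM figure, the weight footprint can be estimated from the parameter count in the training log. This is a back-of-the-envelope sketch covering weights only; activations, generation caches, and CUDA context overhead are not modeled and presumably account for the rest of the ~1 GB.

```python
total_params = 321_141_536  # "Total number of parameters" from the training log
bytes_per_param = 2         # bfloat16 stores each value in 2 bytes

weight_bytes = total_params * bytes_per_param
print(f"Weights: {weight_bytes / 1e6:.0f} MB ({weight_bytes / 2**30:.2f} GiB)")
```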

Repository contents

markdown
.
|-- README.md
|-- config.json
|-- generation_config.json
|-- model.safetensors
|-- tokenizer.json
`-- tokenizer_config.json

Citation

If you use this model, a simple acknowledgment pointing to this repo is appreciated. No formal citation is required.

Contact / feedback

Open a Discussion on the Hub — bug reports, generated story samples, and benchmark results are all welcome.
