README

License: apache-2.0

Quick start

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gabrielebeltramo/NemotronH-300M-stories"
INPUT_PROMPTS = [
    "In a land full of wonder",
    "A loud cry broke the silence of the woods",
]

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    # Left-pad so that every prompt ends at the same position and generation
    # continues directly from the last prompt token.
    tokenizer.padding_side = "left"
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        dtype=torch.bfloat16,
        device_map="auto",
    )
    model.eval()

    # Wrap each prompt in the tokenizer's chat template (as plain text).
    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": input_prompt}],
            tokenize=False,
            add_generation_prompt=False,
        )
        for input_prompt in INPUT_PROMPTS
    ]
    input_ids_attn_mask = tokenizer(
        prompts,
        padding=True,
        truncation=False,
        add_special_tokens=False,
        return_tensors="pt",
    )
    input_ids = input_ids_attn_mask["input_ids"].to(model.device)
    attention_mask = input_ids_attn_mask["attention_mask"].to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1024,
            num_beams=5,
            do_sample=True,
            temperature=0.15,
            use_cache=True,
            cache_implementation="dynamic",
        )

    for idx, input_prompt in enumerate(INPUT_PROMPTS):
        # Drop the (padded) prompt tokens and keep only the newly generated ones.
        generated_token_ids = output_ids[idx][input_ids.shape[-1] :]
        generated_text = tokenizer.decode(generated_token_ids, skip_special_tokens=True)
        print(f"{input_prompt=}")
        print(f"{generated_text=}")
        print("-" * 30)

Intended use

| ✅ In scope | ❌ Out of scope |
| --- | --- |
| Short creative / children's story generation | Factual question answering |
| Experimenting with the NemotronH architecture | Safety-critical applications |
| Studying small-model story coherence | Long-document generation (> ~1 k tokens) |
| Educational / research use | Languages other than English |

This model was trained for research and experimentation, not production deployment. It has not been instruction-tuned, RLHF-aligned, or red-teamed.


Model architecture

NemotronH-300M-stories uses the NemotronHForCausalLM architecture, a hybrid that interleaves Mamba SSM layers and Mixture-of-Experts (MoE) feed-forward blocks with a single full attention layer.

Layer schedule

markdown

mamba → moe → mamba → moe → mamba → attention → moe → mamba → moe → mamba → moe → mamba
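
The schedule above lines up character by character with the `hybrid_override_pattern` value `'MEMEM*EMEMEM'` that appears in the training logs. A minimal sketch of that mapping follows; the reading of `M` as Mamba, `E` as MoE, and `*` as attention is inferred from comparing the two strings, and `decode_pattern` is a hypothetical helper, not part of the model code.

```python
# Assumed meaning of each pattern character, inferred by comparing the
# layer schedule with the logged hybrid_override_pattern.
LAYER_TYPES = {"M": "mamba", "E": "moe", "*": "attention"}


def decode_pattern(pattern: str) -> list[str]:
    """Expand a pattern string into one layer name per character."""
    return [LAYER_TYPES[char] for char in pattern]


schedule = decode_pattern("MEMEM*EMEMEM")
print(" → ".join(schedule))
print({name: schedule.count(name) for name in LAYER_TYPES.values()})
```

Under this reading, the 12 layers split into 6 Mamba layers, 5 MoE blocks, and 1 attention layer.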

Key dimensions

| Hyperparameter | Value |
| --- | --- |
| Hidden size | 512 |
| Intermediate size (MLP / MoE) | 2048 |
| Number of layers | 12 |
| Attention heads | 8 |
| Head dim | 64 |
| Mamba heads / head dim | 16 / 32 |
| Routed experts | 16 |
| Active experts per token | 2 |
| Shared experts | 1 |
| Max position embeddings | 8192 |
| Chunk size (Mamba) | 128 |
| Conv kernel (Mamba) | 4 |
| Dtype | bfloat16 |
| Approx. total parameters | ~300 M |
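
To illustrate what "16 routed experts, 2 active per token" means, here is a rough sketch of top-k routing for a single token. It is not taken from the model code: `route_token` is a hypothetical helper, and normalizing the weights with a softmax over the selected logits is an assumption, since the card does not describe the router's scoring function.

```python
import math

NUM_ROUTED_EXPERTS = 16  # "Routed experts" in the table above
TOP_K = 2                # "Active experts per token" in the table above


def route_token(router_logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only.
    """
    if len(router_logits) != NUM_ROUTED_EXPERTS:
        raise ValueError("expected one logit per routed expert")
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Subtract the max before exponentiating for numerical stability.
    max_logit = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - max_logit) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


logits = [0.0] * NUM_ROUTED_EXPERTS
logits[3], logits[7] = 2.0, 1.0
routing = route_token(logits)
print(routing)
```

In a layer of this kind, the token's output would combine the selected experts' outputs using these weights, with the shared expert applied to every token regardless of routing.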

MoE uses relu² activations; Mamba SSM uses silu.
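
For reference, the two activations named above can be written as scalar functions. This is only a plain-Python illustration of the math (relu²(x) = max(x, 0)² and silu(x) = x · sigmoid(x)), not the tensor implementation the model uses.

```python
import math


def relu_squared(x: float) -> float:
    """relu²: zero for negative inputs, x squared otherwise."""
    return max(x, 0.0) ** 2


def silu(x: float) -> float:
    """SiLU (swish): x multiplied by sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for value in (-2.0, 0.0, 1.0, 3.0):
    print(f"x={value:+.1f}  relu2={relu_squared(value):.4f}  silu={silu(value):.4f}")
```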


Training data

| Field | Detail |
| --- | --- |
| Dataset | SimpleStories |
| Language | English |
| Domain | Short children's / simple narrative stories |
| Preprocessing | Standard tokenization; no deduplication beyond what the dataset provides |

The model was trained from scratch — no pretrained checkpoint was used as a starting point.


Training procedure

A custom single-GPU training loop was used (no Accelerate or DeepSpeed).

| Setting | Value |
| --- | --- |
| Hardware | NVIDIA A4000 16 GB |
| Precision | bfloat16 weights |
| Optimizer | AdamW, weight_decay = 0.001 |
| Num epochs | 1 |
| Batch size | 1 (micro-batch) |
| Gradient accumulation | 128 steps (effective batch ≈ 128) |
| Peak LR | 2e-4 |
| Initial LR (warmup start) | 3e-5 |
| Minimum LR (cosine end) | 5e-5 |
| LR schedule | Linear warmup → cosine annealing |
| Step time (128 micro-batches) | ~7.4 s |
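
The learning-rate settings above describe a linear warmup followed by cosine annealing. The sketch below shows one common way to implement such a schedule with the three listed rates. The function name `get_lr` and the `warmup_steps` / `total_steps` arguments are assumptions for illustration; the card does not give the warmup length or the body of the `_get_lr` helper used in training.

```python
import math

INITIAL_LR = 3e-5  # warmup start
PEAK_LR = 2e-4     # reached at the end of warmup
MIN_LR = 5e-5      # floor reached at the end of cosine annealing


def get_lr(step: int, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from INITIAL_LR to PEAK_LR, then cosine decay to MIN_LR."""
    if step < warmup_steps:
        return INITIAL_LR + (PEAK_LR - INITIAL_LR) * step / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


# Hypothetical step counts, chosen only to exercise the function.
for step in (0, 50, 100, 550, 1000):
    print(step, f"{get_lr(step, warmup_steps=100, total_steps=1000):.2e}")
```

With any choice of step counts, the schedule ends at 5e-5, which matches the final LR reported in the training logs.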

Estimated cost

  • Initial experimentation: ~ $9
  • Main training run: ~ $9 (about 2 days on GPU)
  • Miscellaneous experiments: ~ $0–10
  • Total: approximately $18–28

Training code

python

import gc
import time

import torch

# This excerpt assumes that model, optimizer, train_dataloader, device, logger,
# _get_lr, n_epochs, grad_accum_steps, warmup_steps, max_grad_norm, and
# eval_freq are defined earlier in the training script.
train_losses, track_lrs = [], []
tokens_seen, global_step = 0, 0
accum_counter = 0
accum_loss = 0.0
model.train(True)
for epoch in range(n_epochs):
    start_batch_time = time.time()
    for (input_ids,) in train_dataloader:
        input_ids_cuda = input_ids.to(device)
        # --- Forward + scaled loss ---
        output = model(input_ids=input_ids_cuda, labels=input_ids_cuda)
        # Divide loss by accum steps so gradients are averaged,
        # not summed, across the accumulated micro-batches.
        loss = output.loss / grad_accum_steps
        loss.backward()
        accum_loss += loss.item()
        accum_counter += 1
        tokens_seen += input_ids.numel()
        del output, loss, input_ids_cuda
        # --- Optimizer step only every grad_accum_steps micro-batches ---
        if accum_counter == grad_accum_steps:
            # --- LR schedule ---
            lr = _get_lr(global_step)
            for param_group in optimizer.param_groups:
                param_group["lr"] = lr
            track_lrs.append(lr)
            # --- Grad clipping (after warmup) ---
            if global_step >= warmup_steps:
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(),
                    max_norm=max_grad_norm,
                )
            optimizer.step()
            optimizer.zero_grad()
            # --- Logging ---
            if (global_step % eval_freq) == 0:
                train_losses.append(accum_loss)
                logger.info(
                    f"Ep {epoch + 1} (Step {global_step:06d}) | "
                    f"Loss: {accum_loss:.4f} | "
                    f"LR: {lr:.2e} | "
                    f"Tokens seen: {tokens_seen:,}"
                )
            # --- Periodic memory cleanup ---
            if (global_step % 10) == 0:
                gc.collect()
                torch.cuda.synchronize()
                torch.cuda.empty_cache()
            global_step += 1
            accum_counter = 0
            accum_loss = 0.0
            end_batch_time = time.time()
            if (global_step % eval_freq) == 0:
                logger.info(
                    f"Time for {grad_accum_steps} micro-batches accumulated and then "
                    f"used in one optimizer.step() was {end_batch_time - start_batch_time:.2f} seconds"
                )
            start_batch_time = time.time()
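
The loop above divides each micro-batch loss by `grad_accum_steps` before calling `backward()`. The toy example below, which uses a one-parameter squared-error model and hand-written gradients rather than PyTorch, is a sketch of why that scaling makes the accumulated gradient equal to the gradient of the mean loss over the whole effective batch. All function names and sample values here are illustrative assumptions.

```python
def grad_sq_error(w: float, x: float, y: float) -> float:
    """Derivative with respect to w of the per-sample loss (w * x - y) ** 2."""
    return 2.0 * (w * x - y) * x


def accumulated_grad(w: float, samples: list[tuple[float, float]]) -> float:
    """Accumulate gradients of (loss / n), one micro-batch (sample) at a time."""
    n = len(samples)
    total = 0.0
    for x, y in samples:
        total += grad_sq_error(w, x, y) / n  # same scaling as loss / grad_accum_steps
    return total


def full_batch_grad(w: float, samples: list[tuple[float, float]]) -> float:
    """Gradient of the mean loss computed over the whole batch at once."""
    return sum(grad_sq_error(w, x, y) for x, y in samples) / len(samples)


samples = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (0.5, 0.0)]
print(accumulated_grad(0.7, samples), full_batch_grad(0.7, samples))
```

Without the division, the accumulated gradient would be `len(samples)` times larger, which would act like an unintended increase in the learning rate.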

Training logs

text

2026-06-02 17:38:37,118 - __main__ - INFO - Starting training NemotronH
2026-06-02 17:38:38,380 - __main__ - INFO - tokenizer.vocab_size=131072
2026-06-02 17:38:38,381 - __main__ - INFO - nemotron_config.num_hidden_layers=12
2026-06-02 17:38:38,382 - __main__ - INFO - nemotron_config.hybrid_override_pattern='MEMEM*EMEMEM'
2026-06-02 17:38:42,093 - __main__ - INFO - Total number of parameters: 321_141_536
--snip--
2026-06-04 03:23:30,844 - __main__ - INFO - Time for 128 micro-batches accumulated and then used in one optimizer.step() was 7.56 seconds
2026-06-04 03:23:38,414 - __main__ - INFO - Ep 1 (Step 015584) | Loss: 2.0747 | LR: 5.00e-05 | Tokens seen: 589,322,724
2026-06-04 03:24:30,851 - __main__ - INFO - Total train time: 119567.91 seconds
2026-06-04 03:24:30,851 - __main__ - INFO - Final loss => 2.0747
2026-06-04 03:24:30,851 - __main__ - INFO - Final LR => 5.00e-05
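
A few derived figures can be read off the log lines above. The sketch below only does arithmetic on the logged numbers. The results are approximate: the total train time includes logging and cleanup overhead, and the last logged step index is used as the step count.

```python
# Values copied from the training log.
TOTAL_TRAIN_SECONDS = 119567.91
TOKENS_SEEN = 589_322_724
LAST_LOGGED_STEP = 15_584
GRAD_ACCUM_STEPS = 128  # micro-batches per optimizer step

hours = TOTAL_TRAIN_SECONDS / 3600
tokens_per_second = TOKENS_SEEN / TOTAL_TRAIN_SECONDS
seconds_per_step = TOTAL_TRAIN_SECONDS / LAST_LOGGED_STEP
tokens_per_micro_batch = TOKENS_SEEN / (LAST_LOGGED_STEP * GRAD_ACCUM_STEPS)

print(f"Wall-clock training time: {hours:.1f} h")
print(f"Average throughput:       {tokens_per_second:,.0f} tokens/s")
print(f"Average time per step:    {seconds_per_step:.2f} s")
print(f"Average micro-batch size: {tokens_per_micro_batch:.0f} tokens")
```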

Evaluation

⚠️ Formal benchmark evaluation is pending. The notes below are qualitative.

Informal inspection shows the model produces grammatically coherent short stories with simple vocabulary, consistent with the SimpleStories training distribution. Stories typically resolve within a few paragraphs.

Known limitations

  • Trained exclusively on a simple English children's story corpus — vocabulary and sentence complexity are intentionally constrained.
  • No formal evaluation against standard LM benchmarks (perplexity, HellaSwag, etc.) has been run yet; contributions welcome.
  • May hallucinate character names or produce repetitive endings at low temperatures.
  • Not suitable for any task outside short story generation.

Requirements

Tested on Python 3.12 with the following Python packages.

markdown

causal-conv1d==1.6.2.post1
kaggle==2.1.2
kagglehub==1.0.1
mamba-ssm==2.3.1
numpy==2.3.5
nvidia-cuda-runtime==13.0.96
tokenizers==0.22.2
torch==2.11.0+cu130
transformers==5.9.0
triton==3.6.0

Minimum VRAM: ~1 GB (bfloat16). CPU inference is possible but slow.
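
As a rough check on the VRAM figure, the weights alone can be sized from the parameter count in the training logs (321,141,536) at 2 bytes per bfloat16 value. This sketch covers weights only; activations, the generation cache, and framework overhead come on top and are not estimated here.

```python
TOTAL_PARAMS = 321_141_536  # "Total number of parameters" in the training logs
BYTES_PER_BF16 = 2          # bfloat16 stores each value in 16 bits

weight_bytes = TOTAL_PARAMS * BYTES_PER_BF16
weight_gib = weight_bytes / 2**30

print(f"Weights: {weight_bytes:,} bytes (~{weight_gib:.2f} GiB)")
```

The weights take roughly 0.6 GiB, which is consistent with a minimum of about 1 GB once runtime overhead is included.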


Repository contents

markdown

.
|-- README.md
|-- config.json
|-- generation_config.json
|-- model.safetensors
|-- tokenizer.json
`-- tokenizer_config.json

Citation

If you use this model, a simple acknowledgment pointing to this repo is appreciated. No formal citation is required.


Contact / feedback

Open a Discussion on the Hub — bug reports, generated story samples, and benchmark results are all welcome.

Model provider

gabrielebeltramo

