License: apache-2.0
Quick start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "gabrielebeltramo/NemotronH-300M-stories"
INPUT_PROMPTS = [
    "In a land full of wonder",
    "A loud cry broke the silence of the woods",
]

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
    # Left padding keeps each prompt adjacent to its generated continuation.
    tokenizer.padding_side = "left"

    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        dtype=torch.bfloat16,
        device_map="auto",
    )
    model.eval()

    prompts = [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": input_prompt}],
            tokenize=False,
            add_generation_prompt=False,
        )
        for input_prompt in INPUT_PROMPTS
    ]

    input_ids_attn_mask = tokenizer(
        prompts,
        padding=True,
        truncation=False,
        add_special_tokens=False,
        return_tensors="pt",
    )
    input_ids = input_ids_attn_mask["input_ids"].to(model.device)
    attention_mask = input_ids_attn_mask["attention_mask"].to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=1024,
            num_beams=5,
            do_sample=True,
            temperature=0.15,
            use_cache=True,
            cache_implementation="dynamic",
        )

    for idx, input_prompt in enumerate(INPUT_PROMPTS):
        # Drop the (padded) prompt tokens and keep only the continuation.
        generated_token_ids = output_ids[idx][input_ids.shape[-1] :]
        generated_text = tokenizer.decode(generated_token_ids, skip_special_tokens=True)
        print(f"{input_prompt=}")
        print(f"{generated_text=}")
        print("-" * 30)
```
Intended use
| ✅ In scope | ❌ Out of scope |
|---|---|
| Short creative / children's story generation | Factual question answering |
| Experimenting with the NemotronH architecture | Safety-critical applications |
| Studying small-model story coherence | Long-document generation (> ~1 k tokens) |
| Educational / research use | Languages other than English |
This model was trained for research and experimentation, not production deployment. It has not been instruction-tuned, RLHF-aligned, or red-teamed.
Model architecture
NemotronH-300M-stories uses the NemotronHForCausalLM architecture — a hybrid that
interleaves Mamba SSM layers, Mixture-of-Experts (MoE) feed-forward blocks, and
occasional full attention layers.
Layer schedule
```markdown
mamba → moe → mamba → moe → mamba → attention → moe → mamba → moe → mamba → moe → mamba
```
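
The training log reports `hybrid_override_pattern='MEMEM*EMEMEM'`, which lines up one-to-one with the schedule above. The following is a minimal sketch of that correspondence. The symbol mapping (`M` = mamba, `E` = moe, `*` = attention) is inferred from comparing the two strings, and `decode_pattern` is a hypothetical helper, not part of the model code.

```python
from collections import Counter

# Pattern string as reported in the training log.
PATTERN = "MEMEM*EMEMEM"

# Assumed mapping, inferred by aligning the pattern with the layer schedule.
SYMBOL_TO_LAYER = {"M": "mamba", "E": "moe", "*": "attention"}


def decode_pattern(pattern: str) -> list[str]:
    """Translate each pattern symbol into a layer-type name."""
    return [SYMBOL_TO_LAYER[symbol] for symbol in pattern]


layers = decode_pattern(PATTERN)
layer_counts = Counter(layers)

print(" → ".join(layers))
print(dict(layer_counts))
```

Under this mapping the 12 layers break down into 6 Mamba layers, 5 MoE layers, and a single attention layer at the sixth position.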
Key dimensions
| Hyperparameter | Value |
|---|---|
| Hidden size | 512 |
| Intermediate size (MLP / MoE) | 2 048 |
| Number of layers | 12 |
| Attention heads | 8 |
| Head dim | 64 |
| Mamba heads / head dim | 16 / 32 |
| Routed experts | 16 |
| Active experts per token | 2 |
| Shared experts | 1 |
| Max position embeddings | 8 192 |
| Chunk size (Mamba) | 128 |
| Conv kernel (Mamba) | 4 |
| Dtype | bfloat16 |
| Total parameters | ~321 M (321,141,536 per the training log) |
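
As a quick sanity check on the table, the sketch below recomputes a few derived quantities from the listed values. The dictionary keys are illustrative names chosen here and may not match the fields in `config.json`.

```python
# Values copied from the "Key dimensions" table; key names are hypothetical.
DIMS = {
    "hidden_size": 512,
    "intermediate_size": 2048,
    "attention_heads": 8,
    "head_dim": 64,
    "mamba_heads": 16,
    "mamba_head_dim": 32,
    "routed_experts": 16,
    "active_experts": 2,
}

# Total width covered by the attention heads and by the Mamba heads.
attention_width = DIMS["attention_heads"] * DIMS["head_dim"]
mamba_width = DIMS["mamba_heads"] * DIMS["mamba_head_dim"]

# Expansion factor of the feed-forward / expert blocks.
mlp_expansion = DIMS["intermediate_size"] / DIMS["hidden_size"]

# Fraction of routed experts that process any given token.
active_fraction = DIMS["active_experts"] / DIMS["routed_experts"]

print(f"attention width: {attention_width}")
print(f"mamba width:     {mamba_width}")
print(f"MLP expansion:   {mlp_expansion:.0f}x")
print(f"active experts:  {active_fraction:.1%} of routed experts per token")
```

Both head products come out to 512, the hidden size, and only 2 of the 16 routed experts (12.5%) run for each token, in addition to the shared expert.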
MoE uses relu² activations; Mamba SSM uses silu.
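
For readers unfamiliar with the two activations, here is a scalar sketch using their standard definitions: relu² squares the output of ReLU, and silu multiplies the input by its sigmoid. The function names are chosen for illustration; the model itself applies these element-wise to tensors.

```python
import math


def relu2(x: float) -> float:
    """Squared ReLU: max(0, x) ** 2."""
    positive_part = max(0.0, x)
    return positive_part * positive_part


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for value in (-2.0, 0.0, 2.0):
    print(f"x={value:+.1f}  relu2={relu2(value):.4f}  silu={silu(value):.4f}")
```

relu² is exactly zero for all negative inputs, while silu lets a small negative signal through.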
Training data
| Field | Detail |
|---|---|
| Dataset | SimpleStories |
| Language | English |
| Domain | Short children's / simple narrative stories |
| Preprocessing | Standard tokenization; no deduplication beyond what the dataset provides |
The model was trained from scratch — no pretrained checkpoint was used as a starting point.
Training procedure
A custom single-GPU training loop was used (no Accelerate or DeepSpeed).
| Setting | Value |
|---|---|
| Hardware | NVIDIA A4000 16 GB |
| Precision | bfloat16 weights |
| Optimizer | AdamW, weight_decay = 0.001 |
| Num epochs | 1 |
| Batch size | 1 (micro-batch) |
| Gradient accumulation | 128 steps (effective batch = 128 sequences) |
| Peak LR | 2e-4 |
| Initial LR (warmup start) | 3e-5 |
| Minimum LR (cosine end) | 5e-5 |
| LR schedule | Linear warmup → cosine annealing |
| Step time (128 micro-batches) | ~7.4 s |
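
The learning-rate schedule in the table (linear warmup from 3e-5 to 2e-4, then cosine annealing down to 5e-5) can be sketched as below. This is not the author's `_get_lr`, which is not shown in this card; `get_lr` is a hypothetical stand-in, and the `warmup_steps` and `total_steps` values in the usage lines are assumptions, since the card does not state the warmup length.

```python
import math

# Values from the training-procedure table.
INITIAL_LR = 3e-5
PEAK_LR = 2e-4
MIN_LR = 5e-5


def get_lr(step: int, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine annealing to MIN_LR."""
    if step < warmup_steps:
        # Linear interpolation from INITIAL_LR to PEAK_LR.
        return INITIAL_LR + (PEAK_LR - INITIAL_LR) * step / warmup_steps
    # Progress through the cosine phase, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


# Assumed values for illustration only.
WARMUP_STEPS = 500
TOTAL_STEPS = 15_584

for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {get_lr(step, WARMUP_STEPS, TOTAL_STEPS):.2e}")
```

With any choice of warmup length, the schedule starts at the initial LR, reaches the peak at the end of warmup, and ends at the minimum LR, which matches the final `5.00e-05` reported in the training log.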
Estimated cost
- Initial experimentation: ~$9
- Main training run: ~$9 (about 33 hours on GPU, per the training log)
- Miscellaneous experiments: ~$0 to $10
- Total: approximately $18 to $28
Training code
```python
import gc
import time

import torch

# model, optimizer, train_dataloader, device, logger, _get_lr and the
# hyperparameters (n_epochs, grad_accum_steps, warmup_steps, max_grad_norm,
# eval_freq) are defined elsewhere in the training script.

train_losses, track_lrs = [], []
tokens_seen, global_step = 0, 0
accum_counter = 0
accum_loss = 0.0

model.train(True)

for epoch in range(n_epochs):
    start_batch_time = time.time()

    for (input_ids,) in train_dataloader:
        input_ids_cuda = input_ids.to(device)

        # --- Forward + scaled loss ---
        output = model(input_ids=input_ids_cuda, labels=input_ids_cuda)
        # Divide loss by accum steps so gradients are averaged,
        # not summed, across the accumulated micro-batches.
        loss = output.loss / grad_accum_steps
        loss.backward()

        accum_loss += loss.item()
        accum_counter += 1
        tokens_seen += input_ids.numel()
        del output, loss, input_ids_cuda

        # --- Optimizer step only every grad_accum_steps micro-batches ---
        if accum_counter == grad_accum_steps:
            # --- LR schedule ---
            lr = _get_lr(global_step)
            for param_group in optimizer.param_groups:
                param_group["lr"] = lr
            track_lrs.append(lr)

            # --- Grad clipping (after warmup) ---
            if global_step >= warmup_steps:
                torch.nn.utils.clip_grad_norm_(
                    model.parameters(),
                    max_norm=max_grad_norm,
                )

            optimizer.step()
            optimizer.zero_grad()

            # --- Logging ---
            if (global_step % eval_freq) == 0:
                train_losses.append(accum_loss)
                logger.info(
                    f"Ep {epoch + 1} (Step {global_step:06d}) | "
                    f"Loss: {accum_loss:.4f} | "
                    f"LR: {lr:.2e} | "
                    f"Tokens seen: {tokens_seen:,}"
                )

            # Periodically release cached memory.
            if (global_step % 10) == 0:
                gc.collect()
                torch.cuda.synchronize()
                torch.cuda.empty_cache()

            global_step += 1
            accum_counter = 0
            accum_loss = 0.0

            end_batch_time = time.time()
            if (global_step % eval_freq) == 0:
                logger.info(
                    f"Time for {grad_accum_steps} micro-batches accumulated and then "
                    f"used in one optimizer.step() was {end_batch_time - start_batch_time:.2f} seconds"
                )
            start_batch_time = time.time()
```
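
The comment in the loop about dividing the loss by the number of accumulation steps can be checked on a toy problem. The sketch below uses a one-parameter squared-error model in plain Python rather than the actual network; `grad_squared_error` and the sample data are invented for illustration.

```python
def grad_squared_error(w: float, x: float, y: float) -> float:
    """Gradient of (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


w = 0.5
# Four hypothetical micro-batches of one (x, y) sample each.
micro_batches = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 4.0)]
accum_steps = len(micro_batches)

# Reference: gradient of the mean loss over the full batch.
full_batch_grad = sum(grad_squared_error(w, x, y) for x, y in micro_batches) / accum_steps

# Accumulation: each micro-batch contributes the gradient of (loss / accum_steps),
# mirroring `loss = output.loss / grad_accum_steps` followed by `loss.backward()`.
accumulated_grad = 0.0
for x, y in micro_batches:
    accumulated_grad += grad_squared_error(w, x, y) / accum_steps

print(f"full-batch gradient:  {full_batch_grad}")
print(f"accumulated gradient: {accumulated_grad}")
```

The two values agree, which is why scaling each micro-batch loss makes 128 accumulated micro-batches behave like one batch of 128. Without the division, the accumulated gradient would be 128 times larger.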
Training logs
```text
2026-06-02 17:38:37,118 - __main__ - INFO - Starting training NemotronH
2026-06-02 17:38:38,380 - __main__ - INFO - tokenizer.vocab_size=131072
2026-06-02 17:38:38,381 - __main__ - INFO - nemotron_config.num_hidden_layers=12
2026-06-02 17:38:38,382 - __main__ - INFO - nemotron_config.hybrid_override_pattern='MEMEM*EMEMEM'
2026-06-02 17:38:42,093 - __main__ - INFO - Total number of parameters: 321_141_536
--snip--
2026-06-04 03:23:30,844 - __main__ - INFO - Time for 128 micro-batches accumulated and then used in one optimizer.step() was 7.56 seconds
2026-06-04 03:23:38,414 - __main__ - INFO - Ep 1 (Step 015584) | Loss: 2.0747 | LR: 5.00e-05 | Tokens seen: 589,322,724
2026-06-04 03:24:30,851 - __main__ - INFO - Total train time: 119567.91 seconds
2026-06-04 03:24:30,851 - __main__ - INFO - Final loss => 2.0747
2026-06-04 03:24:30,851 - __main__ - INFO - Final LR => 5.00e-05
```
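
A few throughput figures can be derived from the numbers in the log. The sketch below is back-of-the-envelope arithmetic only. It treats the last logged step index (15,584) as the number of optimizer steps, which is an approximation that may be off by one.

```python
# Figures copied from the training log above.
TOTAL_TRAIN_SECONDS = 119_567.91
TOKENS_SEEN = 589_322_724
LAST_LOGGED_STEP = 15_584  # approximate count of optimizer steps
MICRO_BATCHES_PER_STEP = 128

train_hours = TOTAL_TRAIN_SECONDS / 3600
tokens_per_second = TOKENS_SEEN / TOTAL_TRAIN_SECONDS
tokens_per_step = TOKENS_SEEN / LAST_LOGGED_STEP
tokens_per_micro_batch = tokens_per_step / MICRO_BATCHES_PER_STEP

print(f"training time:          {train_hours:.1f} h")
print(f"throughput:             {tokens_per_second:,.0f} tokens/s")
print(f"tokens per step:        {tokens_per_step:,.0f}")
print(f"tokens per micro-batch: {tokens_per_micro_batch:,.0f}")
```

This works out to roughly 33 hours of training at a little under 5,000 tokens per second, with an average micro-batch of about 300 tokens.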
Evaluation
⚠️ Formal benchmark evaluation is pending. The notes below are qualitative.
Informal inspection shows the model produces grammatically coherent short stories with simple vocabulary, consistent with the SimpleStories training distribution. Stories typically resolve within a few paragraphs.
Known limitations
- Trained exclusively on a simple English children's story corpus — vocabulary and sentence complexity are intentionally constrained.
- No formal evaluation against standard LM benchmarks (perplexity, HellaSwag, etc.) has been run yet; contributions welcome.
- May hallucinate character names or produce repetitive endings at low temperatures.
- Not suitable for any task outside short story generation.
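
To illustrate why low temperatures can encourage repetitive output, the sketch below applies temperature scaling to a softmax over three made-up logits. It is a generic illustration of temperature sampling, not the model's decoding code, and the logit values are invented.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by the temperature (numerically stabilised)."""
    scaled = [logit / temperature for logit in logits]
    max_scaled = max(scaled)
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical next-token logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]

probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_low = softmax_with_temperature(logits, temperature=0.15)

print("T=1.00:", [round(p, 4) for p in probs_default])
print("T=0.15:", [round(p, 4) for p in probs_low])
```

At a temperature of 0.15 (the value used in the quick start), nearly all probability mass lands on the top token, so sampling behaves almost greedily and the same continuations tend to recur.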
Requirements
Tested on Python 3.12 with the following Python packages.
```markdown
causal-conv1d==1.6.2.post1
kaggle==2.1.2
kagglehub==1.0.1
mamba-ssm==2.3.1
numpy==2.3.5
nvidia-cuda-runtime==13.0.96
tokenizers==0.22.2
torch==2.11.0+cu130
transformers==5.9.0
triton==3.6.0
```
Minimum VRAM: ~1 GB (bfloat16). CPU inference is possible but slow.
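
The VRAM figure can be sanity-checked from the parameter count in the training log. This sketch covers the weights only; activations, generation caches, and the CUDA context add to it, which is consistent with the ~1 GB estimate above.

```python
# Parameter count reported in the training log.
NUM_PARAMETERS = 321_141_536
BYTES_PER_BFLOAT16 = 2

weight_bytes = NUM_PARAMETERS * BYTES_PER_BFLOAT16
weight_mb = weight_bytes / 1e6
weight_gib = weight_bytes / 1024**3

print(f"bfloat16 weights: {weight_mb:.0f} MB ({weight_gib:.2f} GiB)")
```

The weights alone take roughly 0.6 GiB in bfloat16.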
Repository contents
```markdown
.
|-- README.md
|-- config.json
|-- generation_config.json
|-- model.safetensors
|-- tokenizer.json
`-- tokenizer_config.json
```
Citation
If you use this model, a simple acknowledgment pointing to this repo is appreciated. No formal citation is required.
Contact / feedback
Open a Discussion on the Hub — bug reports, generated story samples, and benchmark results are all welcome.
Model provider
gabrielebeltramo