build-small-hackathon/noir-verdict-nemotron-4b-merged API & Inference Endpoint

What it is

Architecture: Nemotron-H (hybrid Mamba-2 / Transformer), 4B params, BF16
Source LoRA: build-small-hackathon/noir-verdict-nemotron-4b-lora
Merge method: save_pretrained_merged(..., save_method="merged_16bit") (Unsloth)
Trust remote code: yes (Nemotron 3 hybrid uses custom modeling code)

How to use

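```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "build-small-hackathon/noir-verdict-nemotron-4b-merged"
# trust_remote_code=True is required: the hybrid architecture ships custom modeling code.
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
# Load the merged BF16 weights, move them to the GPU, and switch to inference mode.
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, trust_remote_code=True,
).cuda().eval()
```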
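Before moving the model to the GPU, it can help to estimate how much memory the weights alone will take. The sketch below is back-of-the-envelope arithmetic, assuming the nominal 4B parameter count stated on this card (the exact count may differ) and ignoring activations and cache state. The helper name `weight_gib` is hypothetical and not part of this repo.

```python
# Rough weight-memory estimate. The parameter count is the nominal "4B"
# from this card, not an exact figure; activations and caches are excluded.
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}


def weight_gib(n_params: float, dtype: str = "bf16") -> float:
    """Return the approximate size of the weights in GiB for a given dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    # About 7.45 GiB for 4e9 parameters stored in BF16.
    print(f"{weight_gib(4e9):.2f} GiB")
```

Under these assumptions the BF16 weights need roughly 7.5 GiB, so the remaining GPU memory has to cover the prompt, generation state, and framework overhead.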

Chat template

The model uses the Nemotron 3 chat template with enable_thinking=False baked in, so thinking mode is off by default. During an active interrogation, the system prompt is built by build_system_prompt(...) in engine/prompts.py.

```python
messages = [
    {"role": "system", "content": "You are Greta Lindholm, junior continuity writer at WJBK. ..."},
    {"role": "user",   "content": "Where were you at the time of the theft?"},
]
# Render the conversation to a prompt string that ends with the assistant header.
text = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
```
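For an interrogation that runs over several turns, the same messages list has to grow with each exchange. The following is a minimal sketch of one way to assemble it. The helper `build_messages` and its arguments are hypothetical and not part of this repo; the system prompt is taken as a plain string because the signature of build_system_prompt(...) is not shown here.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_messages(
    system_prompt: str,
    turns: List[Tuple[str, str]],
    question: str,
) -> List[Message]:
    """Assemble system prompt, prior (user, assistant) turns, and a new question."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The new question goes last so the generation prompt follows a user turn.
    messages.append({"role": "user", "content": question})
    return messages
```

The returned list can be passed to tok.apply_chat_template(...) exactly like the hand-written messages list above.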

Inference tips

n_ctx ≥ 4096
temperature 0.6–0.7, top_p 0.9–0.95
max_new_tokens 180–280 per turn
Stop on <|im_end|>
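The tips above can be collected into a small helper. This is a sketch under assumptions, not the app's actual inference code: the function names are hypothetical, the defaults are simply values picked from inside the recommended ranges, and the stop handling is done on the decoded string instead of through a tokenizer-level stop criterion.

```python
from typing import Any, Dict

STOP = "<|im_end|>"


def generation_kwargs(
    temperature: float = 0.65,
    top_p: float = 0.9,
    max_new_tokens: int = 256,
) -> Dict[str, Any]:
    """Sampling settings whose defaults sit inside the recommended ranges."""
    return {
        "do_sample": True,  # temperature and top_p only apply when sampling
        "temperature": temperature,
        "top_p": top_p,
        "max_new_tokens": max_new_tokens,
    }


def truncate_at_stop(text: str, stop: str = STOP) -> str:
    """Keep only the text before the first stop marker, if one is present."""
    return text.split(stop, 1)[0].rstrip()
```

With a loaded model, the dictionary can be unpacked into the generate call, for example model.generate(**inputs, **generation_kwargs()), and the decoded reply passed through truncate_at_stop before it is shown to the player.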

How it was built

Image: nvidia/cuda:12.8.1-devel-ubuntu22.04 + Python 3.13
Fine-tune: Unsloth LoRA on A10G, 240 steps, Nemotron 3 Nano 4B
Merge: model.save_pretrained_merged(..., save_method="merged_16bit") in the same Modal job
Orchestrator: train/modal_finetune.py

Companion artifacts

LoRA: build-small-hackathon/noir-verdict-nemotron-4b-lora (40.5 MB)
Q4_K_M GGUF: build-small-hackathon/noir-verdict-nemotron-4b-gguf (2.84 GB)
App: build-small-hackathon/noir-verdict

License

Apache-2.0. The base Nemotron 3 Nano weights are governed by NVIDIA's model license; the merged checkpoint and training code in this repo are Apache-2.0.
