License: other
Key Improvements & Availability
- Reduced Latency: Compared to Holo3 Flash, this model significantly reduces latency, enabling more responsive real-time agentic workflows.
- Try it in HoloTab: You can experience the model's capabilities firsthand in HoloTab, our browser-based AI agent platform.
- Open Access: The model is available on Hugging Face under the NVIDIA Open Model License.
H Company is part of the NVIDIA Inception Program.
Why We Built Holotron 3 Nano
Holotron 3 Nano continues the legacy of Holotron-12B as a specialized policy model for agents that perceive and act within interactive environments. By outperforming other leading models like GPT-5.4 and Sonnet 4.6 at a lower price point, the Holotron 3 Nano model is Pareto-optimal in terms of price-performance.
Requirements
bash
pip install mamba-ssm causal-conv1d # required for the hybrid Mamba LLM backbone
The vision encoder (nvidia/C-RADIOv2-H) is fetched from the Hub on first load via trust_remote_code=True.
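Before loading the model, it can help to confirm that the two packages above are importable in the current environment. The following is a minimal sketch, not part of the model's own tooling: the helper name `missing_modules` is hypothetical, and the import names `mamba_ssm` and `causal_conv1d` are assumed to correspond to the `mamba-ssm` and `causal-conv1d` pip packages.

```python
from importlib.util import find_spec

# Import names assumed for the two pip packages above:
# mamba-ssm -> mamba_ssm, causal-conv1d -> causal_conv1d.
REQUIRED_MODULES = ("mamba_ssm", "causal_conv1d")


def missing_modules(names=REQUIRED_MODULES):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("Hybrid Mamba backbone dependencies look importable.")
```

This only checks that the modules can be located; it does not verify that their CUDA extensions were built against the installed PyTorch version.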
Usage
Note: We recommend using vLLM to serve this model. A cleaner modeling implementation better aligned with the transformers conventions will be released soon.
python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "Hcompany/Holotron-3-Nano"

# Load the model and processor; trust_remote_code also pulls in the vision encoder.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Build a single-turn conversation with one image and one text prompt.
image = Image.open("your_image.jpg").convert("RGB")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; only the newly generated tokens are decoded and printed.
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
    )

print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
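Since the note above recommends vLLM for serving, a client would typically talk to the server's OpenAI-compatible chat-completions endpoint rather than call `generate` directly. The sketch below only builds the request body, using the standard library. It assumes the model is served under the same ID and accepts images as base64 data URLs in `image_url` content parts; the helper name `build_chat_payload` and its defaults are illustrative, not taken from the model card.

```python
import base64
import json


def build_chat_payload(image_bytes, prompt, model="Hcompany/Holotron-3-Nano",
                       mime_type="image/jpeg", max_tokens=256):
    """Build a chat-completions request body with one inline image and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # assumed to match the name the server was started with
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:{mime_type};base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding, mirroring do_sample=False above
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real image read with open(path, "rb").
    payload = build_chat_payload(b"\xff\xd8\xff", "Describe this image.")
    print(json.dumps(payload, indent=2)[:200])
```

The resulting dictionary can be sent as JSON in a POST to the server's `/v1/chat/completions` route (for a local vLLM server with default settings, typically `http://localhost:8000`). Whether extra server flags are needed for this particular model is not stated in the source.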
Model provider
Stanisz
Model tree
Base
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Fine-tuned
this model
Modalities
Input
Video, Audio, Text, Image
Output
Text
Pricing
Dedicated Endpoints
Supported Functionality
Model APIs
Dedicated Endpoints
Container