Stanisz

holotron-torchao-streamed

Key Improvements & Availability

Reduced Latency: Compared to Holo3 Flash, this model significantly reduces latency, enabling more responsive real-time agentic workflows.
Try it in HoloTab: You can experience the model's capabilities firsthand in HoloTab, our browser-based AI agent platform.
Open Access: The model is available on Hugging Face under the NVIDIA Open Model License.

H Company is part of the NVIDIA Inception Program.

Why We Built Holotron 3 Nano

Holotron 3 Nano continues the legacy of Holotron-12B as a specialized policy model for agents that perceive and act within interactive environments. It outperforms other leading models such as GPT-5.4 and Sonnet 4.6 at a lower price point, which makes it Pareto-optimal in terms of price-performance.

Requirements

bash
pip install mamba-ssm causal-conv1d  # required for the hybrid Mamba LLM backbone

The vision encoder (nvidia/C-RADIOv2-H) is fetched from the Hub on first load via trust_remote_code=True.
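
Before loading the model, it can help to confirm that both backbone dependencies are actually importable. The following is a minimal sketch, not part of the original model card: the helper `missing_packages` is hypothetical, and the import names `mamba_ssm` and `causal_conv1d` are assumed to be the module names the two pip packages install.

```python
import importlib.util

# Assumption: pip package name -> importable module name.
REQUIRED_MODULES = {
    "mamba-ssm": "mamba_ssm",
    "causal-conv1d": "causal_conv1d",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All backbone dependencies found.")
```

The check only looks up the module specs, so it does not import the CUDA extensions or verify that they were built for the installed PyTorch version.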

Usage

Note: We recommend serving this model with vLLM. A cleaner modeling implementation, better aligned with transformers conventions, will be released soon.

python
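import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "Hcompany/Holotron-3-Nano"

# Load the weights in bfloat16 and let device_map="auto" place them on the available devices.
# trust_remote_code=True is needed because the modeling code is fetched from the Hub.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Build a single-turn chat message containing one image and one text prompt.
image = Image.open("your_image.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Apply the chat template, tokenize, and move the tensors to the model's device.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding (do_sample=False) without gradient tracking.
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        pad_token_id=processor.tokenizer.eos_token_id,
    )

# Slice off the prompt tokens so that only the newly generated text is decoded.
print(processor.tokenizer.decode(
    out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
))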
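
Since the note above recommends vLLM for serving, here is a rough sketch of the same image-description request sent to a vLLM OpenAI-compatible chat endpoint. It is not taken from the model card. It assumes a server is already running locally on vLLM's default port (for example, started with `vllm serve Hcompany/Holotron-3-Nano --trust-remote-code`) and that the served model accepts images as base64 data URLs. The helper names `build_chat_payload` and `send_chat_request` and the `ENDPOINT` value are hypothetical.

```python
import base64
import json
import urllib.request

MODEL_ID = "Hcompany/Holotron-3-Nano"
# Assumption: a local vLLM OpenAI-compatible server on its default port.
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def build_chat_payload(image_bytes, prompt, mime="image/jpeg", max_tokens=256):
    """Build an OpenAI-style chat request with one inline image and one text prompt."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic decoding, mirroring do_sample=False
        "messages": [{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:{mime};base64,{encoded}"},
                },
                {"type": "text", "text": prompt},
            ],
        }],
    }


def send_chat_request(payload, endpoint=ENDPOINT):
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, usage would look like `send_chat_request(build_chat_payload(open("your_image.jpg", "rb").read(), "Describe this image."))`. Keeping payload construction separate from the network call makes the request shape easy to inspect without a GPU.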

Model provider: Stanisz

Model tree

Base: nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Fine-tuned: this model

Modalities

Input: Video, Audio, Text, Image
Output: Text
