Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: other

Holo 3.1 35B A3B Mixed NVFP4 BF16-Head Overlay

This repository contains only the unique patched overlay files for the loadable mixed NVFP4 runtime variant. It does not reupload the full base NVFP4 checkpoint.

Base checkpoint:

text

Hcompany/Holo-3.1-35B-A3B-NVFP4

Patch purpose:

  • keep the base mixed ModelOpt NVFP4/FP8 model body
  • replace packed lm_head.weight with a dequantized BF16 full-width head
  • filter stale lm_head.* tensors out of shard 3 so vLLM does not load the packed head
  • preserve the patched config.json and model.safetensors.index.json

Reconstruct on a RunPod volume:

bash

BASE=/workspace/holo3/models/Holo-3.1-35B-A3B-NVFP4
PATCH=/workspace/holo3/models/Holo-3.1-35B-A3B-NVFP4-bf16-head
mkdir -p "$PATCH"
for f in "$BASE"/*; do ln -s "$f" "$PATCH/$(basename "$f")" 2>/dev/null || true; done
hf download akzaidan/holo31-mixed-nvfp4-bf16-head-overlay --local-dir /tmp/holo31-overlay
cp -f /tmp/holo31-overlay/config.json "$PATCH/config.json"
cp -f /tmp/holo31-overlay/model.safetensors.index.json "$PATCH/model.safetensors.index.json"
cp -f /tmp/holo31-overlay/model-00003-of-00003.safetensors "$PATCH/model-00003-of-00003.safetensors"
cp -f /tmp/holo31-overlay/model-lm-head-bf16.safetensors "$PATCH/model-lm-head-bf16.safetensors"
cp -f /tmp/holo31-overlay/start_vllm_nvfp4_bf16_head.sh /workspace/holo3/scripts/start_vllm_nvfp4_bf16_head.sh
chmod +x /workspace/holo3/scripts/start_vllm_nvfp4_bf16_head.sh

Launch:

bash

/workspace/holo3/scripts/start_vllm_nvfp4_bf16_head.sh

Served model name:

text

holo3-1-35b-a3b-mixed-nvfp4

For normal non-thinking OpenAI-compatible chat responses, send:

json

{"chat_template_kwargs":{"enable_thinking":false}}

With default thinking enabled and --reasoning-parser qwen3, vLLM routes open <think> text into the reasoning field, so simple prompts may return content: null until the model emits answer text after </think>.

Model provider

akzaidan

Model tree

Base

Hcompany/Holo-3.1-35B-A3B-NVFP4

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today