Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: otherHolo 3.1 35B A3B Mixed NVFP4 BF16-Head Overlay
This repository contains only the unique patched overlay files for the loadable mixed NVFP4 runtime variant. It does not reupload the full base NVFP4 checkpoint.
Base checkpoint:
text
Hcompany/Holo-3.1-35B-A3B-NVFP4
Patch purpose:
- keep the base mixed ModelOpt NVFP4/FP8 model body
- replace packed
lm_head.weightwith a dequantized BF16 full-width head - filter stale
lm_head.*tensors out of shard 3 so vLLM does not load the packed head - preserve the patched
config.jsonandmodel.safetensors.index.json
Reconstruct on a RunPod volume:
bash
BASE=/workspace/holo3/models/Holo-3.1-35B-A3B-NVFP4PATCH=/workspace/holo3/models/Holo-3.1-35B-A3B-NVFP4-bf16-headmkdir -p "$PATCH"for f in "$BASE"/*; do ln -s "$f" "$PATCH/$(basename "$f")" 2>/dev/null || true; donehf download akzaidan/holo31-mixed-nvfp4-bf16-head-overlay --local-dir /tmp/holo31-overlaycp -f /tmp/holo31-overlay/config.json "$PATCH/config.json"cp -f /tmp/holo31-overlay/model.safetensors.index.json "$PATCH/model.safetensors.index.json"cp -f /tmp/holo31-overlay/model-00003-of-00003.safetensors "$PATCH/model-00003-of-00003.safetensors"cp -f /tmp/holo31-overlay/model-lm-head-bf16.safetensors "$PATCH/model-lm-head-bf16.safetensors"cp -f /tmp/holo31-overlay/start_vllm_nvfp4_bf16_head.sh /workspace/holo3/scripts/start_vllm_nvfp4_bf16_head.shchmod +x /workspace/holo3/scripts/start_vllm_nvfp4_bf16_head.sh
Launch:
bash
/workspace/holo3/scripts/start_vllm_nvfp4_bf16_head.sh
Served model name:
text
holo3-1-35b-a3b-mixed-nvfp4
For normal non-thinking OpenAI-compatible chat responses, send:
json
{"chat_template_kwargs":{"enable_thinking":false}}
With default thinking enabled and --reasoning-parser qwen3, vLLM routes open <think> text into the reasoning field, so simple prompts may return content: null until the model emits answer text after </think>.
Model provider
akzaidan
Model tree
Base
Hcompany/Holo-3.1-35B-A3B-NVFP4
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information