License: apache-2.0
Overview
NVFP4-quantized version of NousResearch/Hermes-4-14B, an instruction-following and function-calling model built on Qwen/Qwen3-14B. It is not a newly trained model. This is an unofficial community quantization; NousResearch is the original model author.
NVFP4 is NVIDIA's 4-bit floating-point quantization format. This variant targets high-throughput inference on NVIDIA hardware with NVFP4 acceleration. For details of the NVFP4 format and its hardware support, consult NVIDIA's TensorRT-LLM documentation.
Source
| Field | Value |
|---|---|
| Upstream model | NousResearch/Hermes-4-14B |
| Upstream source revision | d6ce765c8b83f847357b98254be079afa0c6ca76 |
| Export tool/script | NVIDIA TensorRT Model Optimizer (modelopt) 0.45.0 |
| Quantization recipe | NVFP4 (NVIDIA 4-bit floating point) safetensors |
| Base model (weights) | Qwen/Qwen3-14B |
Files
| File | Size | Description |
|---|---|---|
| model-00001-of-00002.safetensors | ~8.99 GB | NVFP4 transformer weights: U8-packed 4-bit + FP8 (E4M3) group scales + FP32 global/input scales + BF16 norms |
| model-00002-of-00002.safetensors | ~1.56 GB | BF16 embeddings / output head (excluded from NVFP4) |
| model.safetensors.index.json | - | Weight map / shard index |
| config.json | - | Model + quantization config |
| hf_quant_config.json | - | modelopt NVFP4 quantization config |
| generation_config.json | - | Generation defaults |
| chat_template.jinja | - | Chat template |
| tokenizer.json + tokenizer_config.json | - | Tokenizer |
| .quant_summary.txt | - | Per-module quantization summary log |
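To make the weight layout in the first shard more concrete, here is a minimal pure-Python sketch of how one group of U8-packed 4-bit values could be expanded back to floats. Several details are assumptions not stated in this card: that the 4-bit codes use the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), that the low nibble of each byte holds the first value, and that the group scale and global scale are simply multiplied together. The function names are hypothetical; check the modelopt and TensorRT-LLM documentation for the authoritative layout.

```python
# Magnitudes representable by a 4-bit E2M1 code, indexed by its low three bits.
# Bit 3 is the sign bit.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) to a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_nibbles(packed: bytes) -> list[int]:
    """Split each byte into two 4-bit codes.

    ASSUMPTION: the low nibble comes first. Verify against the exporter.
    """
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize_group(packed: bytes, group_scale: float, global_scale: float) -> list[float]:
    """Expand one packed group to floats.

    ASSUMPTION: value = e2m1 * group_scale * global_scale, where group_scale
    has already been converted from FP8 (E4M3) to a float.
    """
    return [decode_e2m1(c) * group_scale * global_scale for c in unpack_nibbles(packed)]


if __name__ == "__main__":
    # One byte 0x21 holds codes 1 (0.5) and 2 (1.0) under the low-nibble-first assumption.
    print(dequantize_group(bytes([0x21]), group_scale=2.0, global_scale=0.5))
```

In a real runtime this expansion happens inside fused GPU kernels; the sketch is only meant to show why each packed weight tensor is accompanied by scale tensors.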
Intended Use
A general instruction-following and function-calling LLM, quantized to NVFP4 for high-throughput inference on NVIDIA hardware. The NVFP4 variant targets NVIDIA GPUs with native FP4 tensor-core support; on other GPUs, NVFP4 weights may require emulation or upcast, which reduces the throughput benefit (verify with your TensorRT-LLM version).
Runtime Notes
- Library: Hugging Face Transformers (or TensorRT-LLM for NVFP4 acceleration).
- NVFP4 native acceleration requires NVIDIA GPU support; validate on the target hardware and TensorRT-LLM version before production use.
- Context length: inherited from Qwen3-14B / Hermes-4-14B; consult the upstream model card.
Precision and Packaging
Export tooling, precision, and quantization are recorded in the Source table above. This packaging mirror does not publish independent parity benchmarks; validate output quality on your target runtime and hardware before production use.
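One lightweight way to check the packaging yourself is to read the JSON header of a safetensors shard and count tensors per dtype, without loading any weights. The sketch below relies only on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header). The helper names are hypothetical, and the dtype strings you would expect to see (for example `U8`, `F8_E4M3`, `F32`, `BF16`) are an assumption based on the Files table rather than something verified against this repository.

```python
import json
import struct
from collections import Counter


def parse_header(data: bytes) -> dict:
    """Parse the JSON header from the leading bytes of a safetensors file."""
    (header_len,) = struct.unpack("<Q", data[:8])
    return json.loads(data[8:8 + header_len])


def read_header(path: str) -> dict:
    """Read only the header of a safetensors file on disk."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        (header_len,) = struct.unpack("<Q", prefix)
        return parse_header(prefix + f.read(header_len))


def dtype_counts(header: dict) -> Counter:
    """Count tensors per dtype, skipping the optional __metadata__ entry."""
    return Counter(
        info["dtype"] for name, info in header.items() if name != "__metadata__"
    )


if __name__ == "__main__":
    import sys

    # Usage: python inspect_shard.py model-00001-of-00002.safetensors
    for dtype, count in sorted(dtype_counts(read_header(sys.argv[1])).items()):
        print(f"{dtype}: {count} tensors")
```

If the counts do not match the layout described in the Files table, treat that as a signal to re-download or re-check the export before benchmarking.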
Limitations
- NVFP4 native acceleration is hardware-specific; not all NVIDIA GPUs support native FP4 tensor-core operations.
- Quantization introduces approximation error relative to the FP16/BF16 baseline.
- No repository-specific quality benchmark is documented here.
- English-dominant model; multilingual quality may vary.
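The approximation error mentioned above can be illustrated with a toy round trip: scale a small group of values so the largest magnitude maps to 6.0, snap each value to the nearest representable 4-bit magnitude, and scale back. This is a simplified sketch, not the modelopt recipe. It assumes the E2M1 value grid, ignores the FP8 rounding of the group scale and the FP32 global scale, and breaks ties toward the smaller magnitude, which may differ from hardware rounding. The function names are hypothetical.

```python
# Non-negative magnitudes representable in E2M1 (assumed 4-bit value grid).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quantize_group(values: list[float]) -> list[float]:
    """Quantize a group to the E2M1 grid and immediately dequantize it."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return [0.0] * len(values)
    scale = amax / 6.0  # the largest magnitude maps to the top of the grid
    out = []
    for v in values:
        # Pick the grid point closest to the scaled magnitude.
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(nearest * scale if v >= 0 else -nearest * scale)
    return out


def max_abs_error(values: list[float]) -> float:
    """Largest absolute difference between the input and its round trip."""
    return max(abs(a - b) for a, b in zip(values, fake_quantize_group(values)))


if __name__ == "__main__":
    group = [0.1, 0.7, -1.2, 3.0]
    print(fake_quantize_group(group))
    print(max_abs_error(group))
```

Errors of this kind accumulate differently per layer and per task, which is why measuring quality on your own workload matters more than any single-group illustration.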
License
Apache 2.0, inherited from NousResearch/Hermes-4-14B. This packaging repo adds no new license terms.
Model provider
tonythethompson
Model tree
Base
NousResearch/Hermes-4-14B
Quantized
this model
Modalities
Input
Text
Output
Text