README

License: apache-2.0

Overview

NVFP4-quantized version of NousResearch/Hermes-4-14B, an instruction-following and function-calling model built on Qwen/Qwen3-14B. It is not a newly trained model. This is an unofficial community quantization; NousResearch is the original model author.

NVFP4 is NVIDIA's 4-bit floating-point quantization format. This variant targets high-throughput inference on NVIDIA hardware with NVFP4 acceleration. For details of the NVFP4 format and its hardware support, consult NVIDIA's TensorRT-LLM documentation.
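To make the format concrete, the sketch below decodes packed 4-bit values back to floats. It assumes the E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit), two codes per byte with the low nibble first, and a dequantized value equal to code value × block scale × global scale. The nibble order, the way the scales compose, and all function names are illustrative assumptions, not the exporter's documented behavior; check them against the modelopt and TensorRT-LLM documentation.

```python
from typing import List, Tuple

# Magnitudes representable by a 4-bit E2M1 float, indexed by the low 3 bits.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 select the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def unpack_byte(byte: int) -> Tuple[float, float]:
    """Split one packed byte into two values (ASSUMPTION: low nibble comes first)."""
    return decode_e2m1(byte & 0x0F), decode_e2m1(byte >> 4)


def dequantize_block(packed: bytes, block_scale: float, global_scale: float) -> List[float]:
    """Dequantize one block (ASSUMPTION: value * block_scale * global_scale)."""
    values: List[float] = []
    for byte in packed:
        values.extend(unpack_byte(byte))
    return [v * block_scale * global_scale for v in values]
```

For example, `dequantize_block(bytes([0x72]), 0.5, 2.0)` decodes the low nibble `0x2` to 1.0 and the high nibble `0x7` to 6.0, then applies a combined scale of 1.0.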

Source

| Field | Value |
| --- | --- |
| Upstream model | NousResearch/Hermes-4-14B |
| Upstream source revision | d6ce765c8b83f847357b98254be079afa0c6ca76 |
| Export tool/script | NVIDIA TensorRT Model Optimizer (modelopt) 0.45.0 |
| Quantization recipe | NVFP4 (NVIDIA 4-bit floating point) safetensors |
| Base model (weights) | Qwen/Qwen3-14B |

Files

| File | Size | Description |
| --- | --- | --- |
| model-00001-of-00002.safetensors | ~8.99 GB | NVFP4 transformer weights: U8-packed 4-bit + FP8 (E4M3) group scales + FP32 global/input scales + BF16 norms |
| model-00002-of-00002.safetensors | ~1.56 GB | BF16 embeddings / output head (excluded from NVFP4) |
| model.safetensors.index.json | | Weight map / shard index |
| config.json | | Model + quantization config |
| hf_quant_config.json | | modelopt NVFP4 quantization config |
| generation_config.json | | Generation defaults |
| chat_template.jinja | | Chat template |
| tokenizer.json + tokenizer_config.json | | Tokenizer |
| .quant_summary.txt | | Per-module quantization summary log |
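To check how tensors are split across the two shards, the shard index can be summarized with the standard library alone. This is a minimal sketch that assumes the usual Hugging Face index layout, a top-level `weight_map` object mapping tensor names to shard filenames. The tensor names in the example are hypothetical placeholders, not names read from this repository.

```python
import json
from collections import Counter


def tensors_per_shard(index_text: str) -> Counter:
    """Count how many tensors the index assigns to each shard file."""
    index = json.loads(index_text)
    return Counter(index["weight_map"].values())


# Hypothetical, hand-written index used only to illustrate the structure.
example_index = json.dumps({
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.self_attn.q_proj.weight_scale": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
})

counts = tensors_per_shard(example_index)
for shard, n in sorted(counts.items()):
    print(f"{shard}: {n} tensors")
```

Against a real download, pass the contents of `model.safetensors.index.json` instead of `example_index`.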

Intended Use

A general instruction-following and function-calling LLM, quantized to NVFP4 for high-throughput inference on NVIDIA hardware. The NVFP4 variant targets NVIDIA GPUs with native FP4 tensor-core support; on other GPUs, NVFP4 weights may require emulation or upcast, which reduces the throughput benefit (verify with your TensorRT-LLM version).

Runtime Notes

  • Library: Hugging Face Transformers (or TensorRT-LLM for NVFP4 acceleration).
  • Native NVFP4 acceleration requires an NVIDIA GPU with FP4 tensor-core support; validate on the target hardware and TensorRT-LLM version before production use.
  • Context length: inherited from Qwen3-14B / Hermes-4-14B; consult the upstream model card.
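A quick capability check can be a first step in that hardware validation. The sketch below assumes that native FP4 tensor cores begin at CUDA compute capability 10.0 (Blackwell-class GPUs); that threshold and the helper names are assumptions for illustration, so confirm the cutoff against NVIDIA's documentation for your GPU and TensorRT-LLM version. PyTorch is imported lazily and is optional.

```python
from typing import Optional, Tuple

# ASSUMPTION: native FP4 tensor cores start at compute capability 10.0.
MIN_NATIVE_FP4_CAPABILITY = (10, 0)


def has_native_fp4(capability: Tuple[int, int]) -> bool:
    """Return True if (major, minor) meets the assumed FP4 threshold."""
    return tuple(capability) >= MIN_NATIVE_FP4_CAPABILITY


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be read."""
    try:
        import torch  # optional dependency, only needed for detection
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No CUDA device detected (or PyTorch is not installed).")
else:
    print(f"Compute capability {capability}: native FP4 = {has_native_fp4(capability)}")
```

A passing check only indicates that the hardware class may qualify; whether the runtime actually dispatches native FP4 kernels still has to be verified on the target stack.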

Precision and Packaging

Export tooling, precision, and quantization recipe are recorded in the Source table above. This packaging mirror does not publish independent parity benchmarks; validate accuracy on your target hardware and runtime before production use.

Limitations

  • NVFP4 native acceleration is hardware-specific; not all NVIDIA GPUs support native FP4 tensor-core operations.
  • Quantization introduces approximation error relative to the FP16/BF16 baseline.
  • No repository-specific quality benchmark is documented here.
  • English-dominant model; multilingual quality may vary.
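The approximation error mentioned above can be illustrated with a small round-trip experiment. This is a simplified sketch, not the exporter's algorithm: it snaps each value in a block to the nearest E2M1 magnitude after scaling the block so its largest absolute value maps to 6.0, and it keeps the scale in full precision, whereas a real NVFP4 export also rounds the scale to FP8. Function names are hypothetical.

```python
from typing import List, Sequence

# Magnitudes representable by a 4-bit E2M1 float.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize_block(block: Sequence[float]) -> List[float]:
    """Quantize and dequantize one block with a single shared scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 6.0  # largest magnitude in the block maps to 6.0
    result = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        result.append(nearest * scale if x >= 0 else -nearest * scale)
    return result


def max_abs_error(block: Sequence[float]) -> float:
    """Largest absolute difference between a block and its round trip."""
    return max(abs(a - b) for a, b in zip(block, fake_quantize_block(block)))
```

Values that already sit on the scaled grid, such as `[6.0, 3.0, 1.0, 0.0]`, round-trip exactly, while `[6.0, 2.4]` maps 2.4 to 2.0. Running the same measurement on real weight blocks gives a rough feel for the error, but it does not replace an end-to-end quality evaluation.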

License

Apache 2.0 — inherited from NousResearch/Hermes-4-14B. This packaging repo adds no new license terms.

Model provider: tonythethompson

Model tree: NousResearch/Hermes-4-14B (base) → this model (quantized)

Modalities: text input, text output
