tonythethompson/Hermes-4-14B-NVFP4 API & Inference Endpoint

Overview

This repository contains an NVFP4-quantized version of NousResearch/Hermes-4-14B, an instruction-following and function-calling model built on Qwen/Qwen3-14B. It is a quantization of existing weights, not a newly trained model. This is an unofficial community quantization; NousResearch is the original model author.

NVFP4 is NVIDIA's 4-bit floating-point quantization format. This variant targets high-throughput inference on NVIDIA hardware with NVFP4 acceleration. For details of the NVFP4 format and its hardware support, consult NVIDIA's TensorRT-LLM documentation.
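To make the format more concrete, the sketch below decodes a few packed 4-bit codes into floats. It is an illustration under stated assumptions, not the loader any runtime actually uses: it assumes the E2M1 value grid (one sign bit, two exponent bits, one mantissa bit), that each byte holds two codes with the low nibble first, and that a dequantized weight is the decoded value times a per-block scale times a per-tensor global scale. Real NVFP4 blocks are commonly 16 elements (8 bytes); the example uses 2 bytes for brevity. The function names are hypothetical.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code (0..15) into a Python float."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def unpack_bytes(packed: bytes) -> list:
    """Split each byte into two decoded values.

    ASSUMPTION: the low nibble holds the first element of each pair.
    """
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0x0F))
        values.append(decode_e2m1(byte >> 4))
    return values


def dequantize_block(packed: bytes, block_scale: float, global_scale: float) -> list:
    """Apply the per-block scale and the per-tensor global scale (assumed formula)."""
    return [v * block_scale * global_scale for v in unpack_bytes(packed)]


if __name__ == "__main__":
    # 0x21 -> codes 1, 2; 0xF7 -> codes 7, 15
    print(dequantize_block(bytes([0x21, 0xF7]), block_scale=2.0, global_scale=0.5))
```

In the stored checkpoint the block scale would itself be an FP8 (E4M3) value and the global scale an FP32 value; here both are plain floats to keep the arithmetic visible.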

Source

Field	Value
Upstream model	NousResearch/Hermes-4-14B
Upstream source revision	`d6ce765c8b83f847357b98254be079afa0c6ca76`
Export tool/script	NVIDIA TensorRT Model Optimizer (modelopt) 0.45.0
Quantization recipe	NVFP4 (NVIDIA 4-bit floating point) safetensors
Base model (weights)	Qwen/Qwen3-14B

Files

File	Size	Description
`model-00001-of-00002.safetensors`	~8.99 GB	NVFP4 transformer weights: U8-packed 4-bit + FP8 (E4M3) group scales + FP32 global/input scales + BF16 norms
`model-00002-of-00002.safetensors`	~1.56 GB	BF16 embeddings / output head (excluded from NVFP4)
`model.safetensors.index.json`	—	Weight map / shard index
`config.json`	—	Model + quantization config
`hf_quant_config.json`	—	modelopt NVFP4 quantization config
`generation_config.json`	—	Generation defaults
`chat_template.jinja`	—	Chat template
`tokenizer.json` + `tokenizer_config.json`	—	Tokenizer
`.quant_summary.txt`	—	Per-module quantization summary log
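To see how tensors are split across the two shards, the index file can be summarized with a few lines of Python. This is a minimal sketch assuming the usual Hugging Face sharded-checkpoint layout, in which `model.safetensors.index.json` contains a `weight_map` object mapping tensor names to shard filenames. The tensor names in the example are illustrative placeholders, not taken from this repository.

```python
import json
from collections import Counter


def load_index(path: str) -> dict:
    """Parse a model.safetensors.index.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def tensors_per_shard(index: dict) -> dict:
    """Count how many tensors the index assigns to each shard file."""
    return dict(Counter(index["weight_map"].values()))


if __name__ == "__main__":
    # Illustrative index; tensor names are placeholders.
    sample = {
        "metadata": {},
        "weight_map": {
            "model.layers.0.example_proj.weight": "model-00001-of-00002.safetensors",
            "model.layers.0.example_proj.weight_scale": "model-00001-of-00002.safetensors",
            "lm_head.weight": "model-00002-of-00002.safetensors",
        },
    }
    print(tensors_per_shard(sample))
```

With the real file, `tensors_per_shard(load_index("model.safetensors.index.json"))` would return one count per shard.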

Intended Use

A general instruction-following and function-calling LLM, quantized to NVFP4 for high-throughput inference on NVIDIA hardware. The NVFP4 variant targets NVIDIA GPUs with native FP4 tensor-core support; on other GPUs, NVFP4 weights may require emulation or upcast, which reduces the throughput benefit (verify with your TensorRT-LLM version).
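As a rough pre-flight check, the heuristic below flags whether a GPU is likely to have native FP4 tensor cores based on its CUDA compute capability. It rests on an assumption that this card does not state: that native FP4 support begins with Blackwell-class GPUs, which report compute capability 10.0 or higher. Treat the result as a hint and confirm against NVIDIA's documentation and your TensorRT-LLM version. The function name is hypothetical.

```python
def likely_has_native_fp4(major: int, minor: int) -> bool:
    """Heuristic only.

    ASSUMPTION: native FP4 tensor cores are present on GPUs whose CUDA
    compute capability is 10.0 or higher (Blackwell-class and later).
    """
    return (major, minor) >= (10, 0)


if __name__ == "__main__":
    for capability in [(8, 9), (9, 0), (10, 0), (12, 0)]:
        print(capability, likely_has_native_fp4(*capability))
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed straight in as `likely_has_native_fp4(*torch.cuda.get_device_capability())`.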

Runtime Notes

Library: Hugging Face Transformers (or TensorRT-LLM for NVFP4 acceleration).
NVFP4 native acceleration requires NVIDIA GPU support; validate on the target hardware and TensorRT-LLM version before production use.
Context length: inherited from Qwen3-14B / Hermes-4-14B; consult the upstream model card.

Precision and Packaging

Export tooling, precision, and quantization format are recorded in the Source table above. This repository does not publish independent parity benchmarks against the unquantized model; validate output quality on your target hardware and runtime before production use.
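Before deploying, it can help to confirm that the quantization config shipped with the weights matches what the Source table says. The sketch below reads the algorithm name from the text of `hf_quant_config.json`. It assumes a layout with a top-level `quantization` object containing a `quant_algo` string; that layout is an assumption about modelopt exports and is not confirmed by this card, so inspect the actual file first. The sample JSON is illustrative.

```python
import json
from typing import Optional


def read_quant_algo(config_text: str) -> Optional[str]:
    """Return the quantization algorithm name, or None if the key is absent.

    ASSUMPTION: the file looks like {"quantization": {"quant_algo": "..."}}.
    """
    config = json.loads(config_text)
    return config.get("quantization", {}).get("quant_algo")


if __name__ == "__main__":
    # Illustrative content, not copied from this repository.
    sample_text = '{"quantization": {"quant_algo": "NVFP4"}}'
    algo = read_quant_algo(sample_text)
    print("quant_algo:", algo)
    if algo != "NVFP4":
        print("warning: config does not declare NVFP4")
```

Returning `None` instead of raising keeps the check usable on files that follow a different layout.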

Limitations

NVFP4 native acceleration is hardware-specific; not all NVIDIA GPUs support native FP4 tensor-core operations.
Quantization introduces approximation error relative to the FP16/BF16 baseline.
No repository-specific quality benchmark is documented here.
English-dominant model; multilingual quality may vary.
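The approximation error mentioned above can be illustrated on a single block of values. The sketch below rounds each value to the nearest point on a 4-bit E2M1 grid after dividing by one shared scale, then reports the largest absolute error. It is a simplification: it keeps the scale as a plain float, whereas the stored format also rounds block scales to FP8 and applies a separate global scale. The function names and input values are made up for illustration.

```python
# Non-negative values representable in E2M1; the sign is handled separately.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(values):
    """Round a block to the E2M1 grid with one shared scale.

    Returns (dequantized_values, scale). The scale maps the largest
    magnitude in the block onto 6.0, the largest grid point.
    """
    peak = max(abs(v) for v in values)
    scale = peak / 6.0 if peak > 0 else 1.0
    dequantized = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(target - g))
        dequantized.append((nearest if v >= 0 else -nearest) * scale)
    return dequantized, scale


def max_abs_error(values) -> float:
    """Largest absolute difference between the inputs and their round trip."""
    dequantized, _ = quantize_block(values)
    return max(abs(a - b) for a, b in zip(values, dequantized))


if __name__ == "__main__":
    block = [6.0, 4.8, 1.2, -0.3]
    print(quantize_block(block))
    print(max_abs_error(block))
```

Because the grid is coarse near its top end (4.0 then 6.0), the largest errors tend to fall on values just below the block maximum, which is one reason end-to-end validation on real prompts is still needed.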

License

Apache 2.0 — inherited from NousResearch/Hermes-4-14B. This packaging repo adds no new license terms.
