dlsxj101/A.X-3.1-NVFP4

License: apache-2.0

Model Details

| Property | Value |
|---|---|
| Base Model | skt/A.X-3.1 (35B params) |
| Architecture | LlamaForCausalLM |
| Quantization | NVFP4 (4-bit floating point, Blackwell-native) |
| Quantization Tool | nvidia-modelopt v0.44.0 |
| Quantization Config | `NVFP4_DEFAULT_CFG` (max algorithm) |
| Model Size | ~20.5 GB (3 shards) |
| Original Size | ~64.6 GB (FP16) |
| Compression Ratio | 3.15x |
| Context Length | 32,768 tokens |
| Vocab Size | 102,400 |

Performance

Benchmarked on NVIDIA DGX Spark (Blackwell GB10, 128GB unified LPDDR5X):

| Metric | NVFP4 (this model) | FP16 Original |
|---|---|---|
| PPL (8 Korean eval texts) | 4.49 | 4.88 |
| Speed (vLLM 0.19.1) | ~10 t/s | ~3.5 t/s |
| Memory | 20.5 GB | 64.6 GB |

PPL (Perplexity) measured on 8 diverse Korean texts (289 tokens total) using vLLM logprobs API. Lower is better.
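For reference, perplexity is the exponential of the negative mean token log-probability. The sketch below assumes the per-token log-probabilities have already been collected from the server's logprobs output. The function name and sample values are illustrative, and this is not the evaluation script behind the numbers above.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))) over all scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical natural-log probabilities for a four-token text.
example = [-1.2, -0.4, -2.3, -1.5]
print(round(perplexity(example), 2))
```

To pool several texts, concatenate their token log-probabilities before calling the function, so that longer texts weigh proportionally more.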

Key finding: NVFP4 quantization achieves virtually identical quality to FP16 on this small perplexity sample while running about 2.9x faster (~10 vs ~3.5 t/s) and using 3.15x less memory (20.5 vs 64.6 GB).
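The speed and memory ratios follow directly from the figures in the table. A quick check, with illustrative variable names:

```python
# Values copied from the Performance table.
fp16_size_gb, nvfp4_size_gb = 64.6, 20.5
fp16_tps, nvfp4_tps = 3.5, 10.0  # approximate tokens per second

compression = fp16_size_gb / nvfp4_size_gb
speedup = nvfp4_tps / fp16_tps

print(f"compression: {compression:.2f}x")  # 3.15x
print(f"speedup:     {speedup:.2f}x")      # 2.86x
```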

Benchmark Results (Accuracy vs Original)

Evaluated using the same Chat CoT protocol as the original model (0-shot, chat template applied, exact_match on the generated answer — the Llama 3 evaluation methodology SKT used for A.X-3.1). This ensures a fair, apples-to-apples comparison between the original FP16 model and the NVFP4 quantized version.

| Category | Benchmark | A.X-3.1 (Original FP16) | A.X-3.1-NVFP4 | Recovery |
|---|---|---|---|---|
| Knowledge | KMMLU (Chat CoT, 0-shot) | 69.73% | 67.08% | 96.2% |
| Knowledge | CLIcK (Chat CoT, 0-shot) | 77.09% | 76.99% | 99.9% |
| Knowledge | MMLU (CoT, 0-shot, test) | 75.20% | | |

Average recovery 97.8% across 5 benchmarks — NVFP4 4-bit quantization preserves nearly all of the original model's accuracy. On CLIcK the gap is just 0.10pp (essentially lossless).
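Recovery appears to be the quantized score divided by the original score. That definition is inferred from the table values, not stated by the author. A minimal sketch that reproduces the two rows listing both scores:

```python
def recovery(quantized_pct, original_pct):
    """Share of the original score retained by the quantized model, in percent."""
    return 100.0 * quantized_pct / original_pct

# (original FP16, NVFP4) scores from the rows that list both values.
scores = {"KMMLU": (69.73, 67.08), "CLIcK": (77.09, 76.99)}
for name, (orig, quant) in scores.items():
    print(f"{name}: {recovery(quant, orig):.1f}% recovery, gap {orig - quant:.2f}pp")
```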

Per-domain breakdown:

| KMMLU (45 subjects, 35,030 Q) | STEM | HUMSS | Applied Science | Other |
|---|---|---|---|---|
| Accuracy | 69.40% | 69.22% | 65.48% | 65.25% |

| CLIcK (1,995 Q) | Culture | Language |
|---|---|---|
| Accuracy | 78.96% | 72.92% |

| MMLU (14,042 Q, test) | STEM | Social Sciences | Other | Humanities |
|---|---|---|---|---|
| Accuracy | 80.7% | 80.4% | 75.7% | 61.9% |

| MATH (5,000 Q) | Algebra | Prealgebra | Num. Theory | Counting | Precalc | Geometry | Int. Algebra |
|---|---|---|---|---|---|---|---|
| Accuracy | 88.3% | 79.8% | 69.8% | 69.6% | 67.2% | 62.2% | 62.2% |

IFEval (4 sub-metrics): prompt-strict 81.89% · inst-strict 87.29% · prompt-loose 83.36% · inst-loose 88.61% (avg 85.29%)
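The reported IFEval average is the unweighted mean of the four sub-metrics (85.2875, shown as 85.29 after rounding). A quick check:

```python
ifeval = {
    "prompt-strict": 81.89,
    "inst-strict": 87.29,
    "prompt-loose": 83.36,
    "inst-loose": 88.61,
}
# Unweighted mean of the four sub-metrics.
average = sum(ifeval.values()) / len(ifeval)
print(average)
```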

Evaluation: lm-evaluation-harness via local-chat-completions, vLLM 0.19.1 on NVIDIA DGX Spark. Original FP16 scores from the skt/A.X-3.1 model card. Knowledge benchmarks use the 0-shot Chat CoT protocol (chat template + step-by-step reasoning + exact_match); MMLU uses flexible-extract on the full test split, and MATH uses math_verify (symbolic equivalence) — both to match the original's methodology. IFEval recovery is vs the 4-metric average.

How to Use

With vLLM (Recommended)

```bash
# Requires NVIDIA Blackwell GPU (sm_121a) and vLLM with NVFP4 support
vllm serve dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```

With vLLM Docker

```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  ghcr.io/bjk110/vllm-spark:v019-ngc2603 \
  python3 -m vllm.entrypoints.openai.api_server \
  --model dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 --port 8000
```

OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="dlsxj101/A.X-3.1-NVFP4",
    messages=[{"role": "user", "content": "한국의 AI 산업 현황을 설명해주세요."}],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
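If the `openai` package is not installed, the same endpoint can be called with the Python standard library. This is a minimal sketch, assuming the server from the previous sections is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: the vLLM server started above

def build_chat_request(prompt, max_tokens=1024, temperature=0.7):
    """Return the JSON body for POST /v1/chat/completions."""
    return {
        "model": "dlsxj101/A.X-3.1-NVFP4",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt, timeout=120):
    """Send one chat request and return the generated text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]

# Usage (requires a running server):
#   print(chat("한국의 AI 산업 현황을 설명해주세요."))
```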

Hardware Requirements

GPU: NVIDIA Blackwell architecture (GB10, GB100, GB200, B100, B200)
- NVFP4 is a Blackwell-native format computed directly on Tensor Cores
- Not compatible with pre-Blackwell GPUs (A100, H100, etc.)
Memory: ~21 GB GPU memory minimum
Software: vLLM >= 0.19.0 with NVFP4 support
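A quick way to check the GPU requirement before downloading the weights is to inspect the CUDA compute capability. The sketch below assumes that Blackwell parts report a major version of 10 or higher (the serve command above targets `sm_121a`, while A100 and H100 report 8.0 and 9.0). The function names are illustrative, and whether NVFP4 kernels are actually available still depends on the installed vLLM build.

```python
def is_blackwell_or_newer(major, minor=0):
    """Assumption: Blackwell GPUs report compute capability major >= 10."""
    return major >= 10

def local_gpu_supports_nvfp4():
    """Check GPU 0 via PyTorch; returns False when no CUDA device is visible."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability(0)
    return is_blackwell_or_newer(major, minor)
```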

Quantization Details

Algorithm: max (NVFP4_DEFAULT_CFG) — measures maximum activation values per tensor
Group Size: 16
Excluded Modules: lm_head (kept in FP16)
Calibration: 8 English text samples (sufficient for max algorithm)
Quantization Time: ~1 minute on DGX Spark
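To illustrate what max calibration with a group size of 16 means, here is a simplified pure-Python sketch of block-wise 4-bit (E2M1) quantization. It is not ModelOpt's implementation. Real NVFP4 stores each block scale in FP8 alongside a per-tensor scale, whereas this sketch keeps the scale in full precision, and the weights are made up.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16

def quantize_block(block):
    """Max calibration for one block: the largest magnitude maps to 6.0."""
    amax = max(abs(x) for x in block)
    scale = amax / E2M1_VALUES[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        # Snap the scaled magnitude to the nearest representable value.
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(-mag if x < 0 else mag)
    return scale, codes

def fake_quantize(values):
    """Quantize then dequantize a flat list in groups of GROUP_SIZE."""
    out = []
    for start in range(0, len(values), GROUP_SIZE):
        scale, codes = quantize_block(values[start:start + GROUP_SIZE])
        out.extend(c * scale for c in codes)
    return out

weights = [0.01 * i - 0.15 for i in range(32)]  # hypothetical weights
restored = fake_quantize(weights)
print(max(abs(a - b) for a, b in zip(weights, restored)))  # worst-case error
```

Because each group of 16 values gets its own scale, one outlier only degrades the precision of its own block and not the whole tensor.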

Qualitative Evaluation

Tested across 8 categories (Korean knowledge, logic, creative writing, coding, summarization, math, fact-checking, English):

Korean Knowledge: Accurate, well-structured responses identical to FP16
Logic/Reasoning: Correct problem-solving with proper mathematical notation
Creative Writing: Natural Korean poetry with appropriate imagery
Coding: Correct Python code with proper explanations
Summarization: Concise and accurate 3-sentence summaries
Math: Correct differentiation with step-by-step solutions
Fact-Checking: Accurate historical information
English: Clear, well-organized English explanations

License

This model is released under the Apache 2.0 license, same as the base model skt/A.X-3.1.

Acknowledgments

Quantum Nexus — Quantization, benchmarking, and deployment performed on Quantum Nexus's NVIDIA DGX Spark (Blackwell GB10, 128GB)
SKT for the original A.X-3.1 model
NVIDIA for ModelOpt quantization toolkit and DGX Spark hardware
vLLM team for NVFP4 inference support



