Model Details
| Property | Value |
|---|---|
| Base Model | skt/A.X-3.1 (35B params) |
| Architecture | LlamaForCausalLM |
| Quantization | NVFP4 (4-bit floating point, Blackwell-native) |
| Quantization Tool | nvidia-modelopt v0.44.0 |
| Quantization Config | NVFP4_DEFAULT_CFG (max algorithm) |
| Model Size | ~20.5 GB (3 shards) |
| Original Size | ~64.6 GB (FP16) |
| Compression Ratio | 3.15x |
| Context Length | 32,768 tokens |
| Vocab Size | 102,400 |
Benchmarked on NVIDIA DGX Spark (Blackwell GB10, 128GB unified LPDDR5X):
| Metric | NVFP4 (this model) | FP16 Original |
|---|---|---|
| PPL (8 Korean eval texts) | 4.49 | 4.88 |
| Speed (vLLM 0.19.1) | ~10 t/s | ~3.5 t/s |
| Memory | 20.5 GB | 64.6 GB |
PPL (perplexity) was measured on 8 diverse Korean texts (289 tokens total) using the vLLM logprobs API. Lower is better.
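As a minimal sketch of how such a figure can be derived, perplexity is the exponential of the negative mean token log-probability. The helper names below are hypothetical, and pooling all tokens across the texts (rather than averaging per-text perplexities) is an assumption; the card does not say which variant was used.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean(logprob))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def corpus_perplexity(per_text_logprobs):
    """Pool the tokens of every text, then compute one perplexity (assumed scheme)."""
    pooled = [lp for text in per_text_logprobs for lp in text]
    return perplexity(pooled)


if __name__ == "__main__":
    # Toy input: two "texts" whose tokens were each predicted with probability 0.25.
    toy = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
    print(f"{corpus_perplexity(toy):.2f}")
```

With only 289 tokens in total, a difference of a few tenths of a perplexity point is likely within sampling noise.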
Key finding: on this small sample, NVFP4 perplexity is on par with FP16 (4.49 vs 4.88), while generation is about 2.9x faster (~10 vs ~3.5 t/s) and the weights take 3.15x less memory (20.5 GB vs 64.6 GB).
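The ratios behind this summary follow directly from the table; the short check below just redoes the arithmetic with the reported numbers (variable names are illustrative).

```python
# Figures copied from the benchmark table above.
fp16_size_gb, nvfp4_size_gb = 64.6, 20.5
fp16_tps, nvfp4_tps = 3.5, 10.0  # approximate tokens per second

memory_ratio = fp16_size_gb / nvfp4_size_gb  # how many times smaller the weights are
speedup = nvfp4_tps / fp16_tps               # how many times faster generation is

print(f"memory: {memory_ratio:.2f}x smaller, speed: {speedup:.2f}x faster")
```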
Benchmark Results (Accuracy vs Original)
Evaluated using the same Chat CoT protocol as the original model: 0-shot, chat template applied, exact_match on the generated answer (the Llama 3 evaluation methodology SKT used for A.X-3.1). This keeps the comparison between the original FP16 model and the NVFP4 quantized version like for like.
| Category | Benchmark | A.X-3.1 (Original FP16) | A.X-3.1-NVFP4 | Recovery |
|---|---|---|---|---|
| Knowledge | KMMLU (Chat CoT, 0-shot) | 69.73% | 67.08% | 96.2% |
| Knowledge | CLIcK (Chat CoT, 0-shot) | 77.09% | 76.99% | 99.9% |
| Knowledge | MMLU (CoT, 0-shot, test) | 75.20% | |
Average recovery is 97.8% across 5 benchmarks, so NVFP4 4-bit quantization preserves nearly all of the original model's accuracy. On CLIcK the gap is just 0.10 pp (essentially lossless).
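Recovery here reads as the quantized score expressed as a percentage of the original score. A small sketch, using only the two complete rows from the table above (the function name is illustrative):

```python
def recovery(quantized_score, original_score):
    """Quantized accuracy as a percentage of the original accuracy."""
    return 100.0 * quantized_score / original_score


# (original FP16, NVFP4) accuracy in percent, taken from the table above.
scores = {
    "KMMLU": (69.73, 67.08),
    "CLIcK": (77.09, 76.99),
}

for name, (original, quantized) in scores.items():
    gap_pp = original - quantized  # absolute gap in percentage points
    print(f"{name}: recovery {recovery(quantized, original):.1f}%, gap {gap_pp:.2f} pp")
```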
Per-domain breakdown:
| Benchmark | STEM | HUMSS | Applied Science | Other |
|---|---|---|---|---|
| KMMLU (45 subjects, 35,030 Q) | 69.40% | 69.22% | 65.48% | 65.25% |
| Benchmark | Culture | Language |
|---|---|---|
| CLIcK (1,995 Q) | 78.96% | 72.92% |
| Benchmark | STEM | Social Sciences | Other | Humanities |
|---|---|---|---|---|
| MMLU (14,042 Q, test) | 80.7% | 80.4% | 75.7% | 61.9% |
| Benchmark | Algebra | Prealgebra | Num. Theory | Counting | Precalc | Geometry | Int. Algebra |
|---|---|---|---|---|---|---|---|
| MATH (5,000 Q) | 88.3% | 79.8% | 69.8% | 69.6% | 67.2% | 62.2% | 62.2% |
IFEval (4 sub-metrics): prompt-strict 81.89% · inst-strict 87.29% · prompt-loose 83.36% · inst-loose 88.61% (avg 85.29%)
Evaluation: lm-evaluation-harness via local-chat-completions, vLLM 0.19.1 on NVIDIA DGX Spark. Original FP16 scores are from the skt/A.X-3.1 model card. Knowledge benchmarks use the 0-shot Chat CoT protocol (chat template + step-by-step reasoning + exact_match); MMLU uses flexible-extract on the full test split, and MATH uses math_verify (symbolic equivalence), both chosen to match the original's methodology. IFEval recovery is computed against the 4-metric average.
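For reference, the IFEval average quoted above is the plain mean of the four sub-metrics. The dictionary keys below are shorthand labels, not harness field names.

```python
# IFEval sub-metrics for the NVFP4 model, in percent, as reported above.
ifeval = {
    "prompt_strict": 81.89,
    "inst_strict": 87.29,
    "prompt_loose": 83.36,
    "inst_loose": 88.61,
}

ifeval_avg = sum(ifeval.values()) / len(ifeval)
print(f"IFEval 4-metric average: {ifeval_avg:.2f}%")
```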
How to Use
With vLLM (Recommended)
# Requires NVIDIA Blackwell GPU (sm_121a) and vLLM with NVFP4 support
vllm serve dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
With vLLM Docker
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  ghcr.io/bjk110/vllm-spark:v019-ngc2603 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model dlsxj101/A.X-3.1-NVFP4 \
    --quantization fp4 \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --host 0.0.0.0 --port 8000
OpenAI-Compatible API
from openai import OpenAI

# vLLM does not check the API key by default; any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="dlsxj101/A.X-3.1-NVFP4",
    messages=[{"role": "user", "content": "한국의 AI 산업 현황을 설명해주세요."}],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
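To get a rough tokens-per-second figure against the same endpoint, one option is to time a single request and divide the reported completion tokens by the elapsed time. This is only a sketch: the function names are hypothetical, the timing includes prompt processing, and it is not necessarily how the speed numbers in this card were measured.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure_generation_speed(client, model, prompt, max_tokens=256):
    """Time one chat completion on an OpenAI-compatible client (end to end)."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,  # greedy decoding keeps runs comparable
    )
    elapsed = time.perf_counter() - start
    # usage.completion_tokens is the server's count of generated tokens.
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

Calling `measure_generation_speed(client, "dlsxj101/A.X-3.1-NVFP4", "...")` with the `client` from the example above returns a single-request estimate; averaging several runs gives a steadier number.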
Hardware Requirements
- GPU: NVIDIA Blackwell architecture (GB10, GB100, GB200, B100, B200)
  - NVFP4 is a Blackwell-native format computed directly on Tensor Cores
  - Not compatible with pre-Blackwell GPUs (A100, H100, etc.)
- Memory: ~21 GB GPU memory minimum
- Software: vLLM >= 0.19.0 with NVFP4 support
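A quick pre-flight check can catch an unsupported GPU before the server starts. The sketch below uses the CUDA compute capability as a heuristic: A100 reports 8.0 and H100 reports 9.0, while Blackwell parts report major version 10 or higher (GB10 is sm_121). The threshold and function names are assumptions for illustration, and PyTorch is assumed to be installed for the second helper.

```python
BLACKWELL_MIN_MAJOR = 10  # assumed cutoff: Hopper is 9.x, Blackwell starts at 10.x


def supports_nvfp4(capability):
    """Heuristic: True if a (major, minor) compute capability is Blackwell or newer."""
    major, _minor = capability
    return major >= BLACKWELL_MIN_MAJOR


def current_gpu_supports_nvfp4(device_index=0):
    """Apply the heuristic to a local CUDA device (requires PyTorch)."""
    import torch  # imported lazily so the pure helper works without PyTorch

    if not torch.cuda.is_available():
        return False
    return supports_nvfp4(torch.cuda.get_device_capability(device_index))
```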
Quantization Details
- Algorithm: max (NVFP4_DEFAULT_CFG), which measures maximum activation values per tensor
- Group Size: 16
- Excluded Modules: lm_head (kept in FP16)
- Calibration: 8 English text samples (sufficient for the max algorithm)
- Quantization Time: ~1 minute on DGX Spark
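To make the group size and max-based scaling concrete, here is a simplified fake-quantization sketch in plain Python. It rounds each block of 16 values to the FP4 (E2M1) magnitudes using a scale derived from the block maximum. It is an illustration under assumptions, not ModelOpt's implementation: scales are kept as Python floats here, whereas the real format stores them at reduced precision, and all function names are hypothetical.

```python
import math

# Magnitudes representable in FP4 E2M1; the sign is handled separately.
E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16  # block size listed above


def fake_quantize_block(block):
    """Quantize and dequantize one block; returns (values, scale)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 0.0
    scale = amax / E2M1_LEVELS[-1]  # the largest magnitude maps onto 6.0
    out = []
    for x in block:
        target = abs(x) / scale
        level = min(E2M1_LEVELS, key=lambda v: abs(v - target))  # nearest level
        out.append(math.copysign(level * scale, x))
    return out, scale


def fake_quantize(values, group_size=GROUP_SIZE):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), group_size):
        quantized, _scale = fake_quantize_block(values[start:start + group_size])
        out.extend(quantized)
    return out
```

Because each block gets its own scale, a single outlier only coarsens the 16 values in its block rather than the whole tensor.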
Qualitative Evaluation
Tested across 8 categories (Korean knowledge, logic, creative writing, coding, summarization, math, fact-checking, English):
- Korean Knowledge: Accurate, well-structured responses identical to FP16
- Logic/Reasoning: Correct problem-solving with proper mathematical notation
- Creative Writing: Natural Korean poetry with appropriate imagery
- Coding: Correct Python code with proper explanations
- Summarization: Concise and accurate 3-sentence summaries
- Math: Correct differentiation with step-by-step solutions
- Fact-Checking: Accurate historical information
- English: Clear, well-organized English explanations
License
This model is released under the Apache 2.0 license, same as the base model skt/A.X-3.1.
Acknowledgments
- Quantum Nexus for quantization, benchmarking, and deployment, all performed on its NVIDIA DGX Spark (Blackwell GB10, 128GB)
- SKT for the original A.X-3.1 model
- NVIDIA for ModelOpt quantization toolkit and DGX Spark hardware
- vLLM team for NVFP4 inference support