License: apache-2.0
Model Details
| Property | Value |
|---|---|
| Base Model | skt/A.X-3.1 (35B params) |
| Architecture | LlamaForCausalLM |
| Quantization | NVFP4 (4-bit floating point, Blackwell-native) |
| Quantization Tool | nvidia-modelopt v0.44.0 |
| Quantization Config | NVFP4_DEFAULT_CFG (max algorithm) |
| Model Size | ~20.5 GB (3 shards) |
| Original Size | ~64.6 GB (FP16) |
| Compression Ratio | 3.15x |
| Context Length | 32,768 tokens |
| Vocab Size | 102,400 |
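
The compression ratio in the table follows directly from the two reported checkpoint sizes. A quick check, with illustrative variable names that are not part of the model repository:

```python
# Checkpoint sizes in GB, as reported in the Model Details table.
original_fp16_gb = 64.6
nvfp4_gb = 20.5

# Ratio of original size to quantized size.
compression_ratio = original_fp16_gb / nvfp4_gb
# Share of the original footprint that quantization removes.
size_reduction_pct = (1 - nvfp4_gb / original_fp16_gb) * 100

print(f"Compression ratio: {compression_ratio:.2f}x")
print(f"Size reduction: {size_reduction_pct:.1f}%")
```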
Performance
Benchmarked on NVIDIA DGX Spark (Blackwell GB10, 128GB unified LPDDR5X):
| Metric | NVFP4 (this model) | FP16 Original |
|---|---|---|
| PPL (8 Korean eval texts) | 4.49 | 4.88 |
| Speed (vLLM 0.19.1) | ~10 t/s | ~3.5 t/s |
| Memory | 20.5 GB | 64.6 GB |
Perplexity (PPL) was measured on 8 diverse Korean texts (289 tokens total) using the vLLM logprobs API. Lower is better.
Key finding: NVFP4 quantization keeps perplexity on par with FP16 on this small sample (4.49 vs. 4.88; with only 289 tokens, the gap is likely within noise) while generating ~3x faster and using ~3x less memory.
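
For reference, perplexity is the exponential of the negative mean token log-probability. The sketch below shows that computation, assuming natural-log values as returned by an OpenAI-style logprobs field and pooling over all tokens. The card does not say whether its PPL was pooled across the 8 texts or averaged per text, and the sample values are hypothetical, not benchmark data.

```python
import math

def perplexity(token_logprobs):
    """Return exp(-mean(logprob)) for a list of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical log-probabilities for illustration only.
example_logprobs = [-1.2, -0.4, -2.3, -0.9]
print(f"PPL = {perplexity(example_logprobs):.2f}")
```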
Benchmark Results (Accuracy vs Original)
Evaluated using the same Chat CoT protocol as the original model (0-shot, chat template applied, exact_match on the generated answer — the Llama 3 evaluation methodology SKT used for A.X-3.1). This ensures a fair, apples-to-apples comparison between the original FP16 model and the NVFP4 quantized version.
| Category | Benchmark | A.X-3.1 (Original FP16) | A.X-3.1-NVFP4 | Recovery |
|---|---|---|---|---|
| Knowledge | KMMLU (Chat CoT, 0-shot) | 69.73% | 67.08% | 96.2% |
| Knowledge | CLIcK (Chat CoT, 0-shot) | 77.09% | 76.99% | 99.9% |
| Knowledge | MMLU (CoT, 0-shot, test) | 75.20% | 73.22% | 97.4% |
| Instruction | IFEval (0-shot) | 87.11% | 85.29% | 97.9% |
| Math | MATH (CoT, 0-shot) | 75.40% | 73.54% | 97.5% |
| Average | | | | 97.8% |
Average recovery 97.8% across 5 benchmarks — NVFP4 4-bit quantization preserves nearly all of the original model's accuracy. On CLIcK the gap is just 0.10pp (essentially lossless).
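
The Recovery column is the NVFP4 score divided by the original FP16 score. The sketch below recomputes it from the table values; the dictionary layout is only for illustration.

```python
# (original FP16 score, NVFP4 score) in percent, copied from the table above.
scores = {
    "KMMLU": (69.73, 67.08),
    "CLIcK": (77.09, 76.99),
    "MMLU": (75.20, 73.22),
    "IFEval": (87.11, 85.29),
    "MATH": (75.40, 73.54),
}

# Recovery = quantized score as a percentage of the original score.
recovery = {
    name: round(nvfp4 / fp16 * 100, 1)
    for name, (fp16, nvfp4) in scores.items()
}
average_recovery = round(sum(recovery.values()) / len(recovery), 1)

for name, pct in recovery.items():
    print(f"{name}: {pct}%")
print(f"Average: {average_recovery}%")
```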
Per-domain breakdown:
| KMMLU (45 subjects, 35,030 Q) | STEM | HUMSS | Applied Science | Other |
|---|---|---|---|---|
| NVFP4 | 69.40% | 69.22% | 65.48% | 65.25% |

| CLIcK (1,995 Q) | Culture | Language |
|---|---|---|
| NVFP4 | 78.96% | 72.92% |

| MMLU (14,042 Q, test) | STEM | Social Sciences | Other | Humanities |
|---|---|---|---|---|
| NVFP4 | 80.7% | 80.4% | 75.7% | 61.9% |

| MATH (5,000 Q) | Algebra | Prealgebra | Num. Theory | Counting | Precalc | Geometry | Int. Algebra |
|---|---|---|---|---|---|---|---|
| NVFP4 | 88.3% | 79.8% | 69.8% | 69.6% | 67.2% | 62.2% | 62.2% |

IFEval (4 sub-metrics): prompt-strict 81.89% · inst-strict 87.29% · prompt-loose 83.36% · inst-loose 88.61% (avg 85.29%)
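
The IFEval figure used in the comparison table is the plain mean of the four sub-metrics. A small check, with key names chosen here for readability rather than taken from the evaluation output:

```python
# NVFP4 IFEval sub-metrics in percent, as listed above.
ifeval_nvfp4 = {
    "prompt_strict": 81.89,
    "inst_strict": 87.29,
    "prompt_loose": 83.36,
    "inst_loose": 88.61,
}

# Unweighted mean of the four sub-metrics.
ifeval_avg = sum(ifeval_nvfp4.values()) / len(ifeval_nvfp4)
print(f"IFEval 4-metric average: {ifeval_avg:.2f}%")
```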
Evaluation: lm-evaluation-harness via local-chat-completions, vLLM 0.19.1 on NVIDIA DGX Spark. Original FP16 scores from the skt/A.X-3.1 model card. Knowledge benchmarks use the 0-shot Chat CoT protocol (chat template + step-by-step reasoning + exact_match); MMLU uses flexible-extract on the full test split, and MATH uses math_verify (symbolic equivalence) — both to match the original's methodology. IFEval recovery is vs the 4-metric average.
How to Use
With vLLM (Recommended)
```bash
# Requires NVIDIA Blackwell GPU (sm_121a) and vLLM with NVFP4 support
vllm serve dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```
With vLLM Docker
```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  ghcr.io/bjk110/vllm-spark:v019-ngc2603 \
  python3 -m vllm.entrypoints.openai.api_server \
  --model dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 --port 8000
```
OpenAI-Compatible API
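```python
from openai import OpenAI

# The local vLLM server does not check the API key; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="dlsxj101/A.X-3.1-NVFP4",
    # Prompt: "Please describe the current state of Korea's AI industry."
    messages=[{"role": "user", "content": "한국의 AI 산업 현황을 설명해주세요."}],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
```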
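
If the `openai` package is not available, the same endpoint can be called with the standard library alone. The sketch below assumes the server address and model name from the commands above; `build_chat_request` and `chat` are hypothetical helper names introduced here for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: same local server as above
MODEL = "dlsxj101/A.X-3.1-NVFP4"

def build_chat_request(prompt, max_tokens=1024, temperature=0.7):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (requires the vLLM server to be running):
# print(chat("Briefly describe the current state of Korea's AI industry."))
```

Keeping request construction separate from sending makes the payload easy to inspect without a running server.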
Hardware Requirements
- GPU: NVIDIA Blackwell architecture (GB10, GB100, GB200, B100, B200)
  - NVFP4 is a Blackwell-native format computed directly on Tensor Cores
  - Not compatible with pre-Blackwell GPUs (A100, H100, etc.)
- Memory: ~21 GB GPU memory minimum
- Software: vLLM >= 0.19.0 with NVFP4 support
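
As a back-of-the-envelope illustration of how the memory figure relates to `--gpu-memory-utilization 0.85`: vLLM treats that flag as the fraction of GPU memory the server may claim, and what remains after loading weights is mostly available for the KV cache and activations. The sketch below uses the 128 GB DGX Spark from the benchmarks and ignores runtime overhead and the fact that its memory is shared with the CPU, so treat the result as a rough upper bound. `memory_budget_gb` is a hypothetical helper.

```python
def memory_budget_gb(total_gb, utilization, weights_gb):
    """Return (server budget, remainder after weights) in GB. Rough estimate only."""
    budget = total_gb * utilization
    return budget, budget - weights_gb

# Values from this card: 128 GB device, 0.85 utilization, 20.5 GB of weights.
budget, remaining = memory_budget_gb(total_gb=128, utilization=0.85, weights_gb=20.5)
print(f"vLLM budget: {budget:.1f} GB, left after weights: {remaining:.1f} GB")
```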
Quantization Details
- Algorithm: `max` (`NVFP4_DEFAULT_CFG`), which measures maximum activation values per tensor
- Group Size: 16
- Excluded Modules: `lm_head` (kept in FP16)
- Calibration: 8 English text samples (sufficient for the `max` algorithm)
- Quantization Time: ~1 minute on DGX Spark
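
To give a feel for what max-based scaling with a group size of 16 does, the sketch below fake-quantizes a list of floats to the eight magnitudes a 4-bit E2M1 float can represent. It is a simplified illustration, not ModelOpt's implementation: it keeps each group's scale in full precision (real NVFP4 stores group scales in FP8 alongside a per-tensor scale) and it does not reproduce activation calibration. All function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16

def fake_quantize_group(group):
    """Quantize then dequantize one group using max-based scaling."""
    amax = max(abs(x) for x in group)
    if amax == 0.0:
        return list(group)
    scale = amax / E2M1_VALUES[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in group:
        # Snap the scaled magnitude to the nearest representable value.
        nearest = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(nearest * scale, x))
    return out

def fake_quantize(values, group_size=GROUP_SIZE):
    """Apply group-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize_group(values[start:start + group_size]))
    return out

weights = [0.01 * i - 0.15 for i in range(32)]  # toy data, two groups of 16
dequantized = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max abs error: {max_error:.4f}")
```

Because the widest gap between representable magnitudes is 2 (between 4 and 6), the rounding error within a group is at most one sixth of that group's largest magnitude.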
Qualitative Evaluation
Tested across 8 categories (Korean knowledge, logic, creative writing, coding, summarization, math, fact-checking, English):
- Korean Knowledge: Accurate, well-structured responses identical to FP16
- Logic/Reasoning: Correct problem-solving with proper mathematical notation
- Creative Writing: Natural Korean poetry with appropriate imagery
- Coding: Correct Python code with proper explanations
- Summarization: Concise and accurate 3-sentence summaries
- Math: Correct differentiation with step-by-step solutions
- Fact-Checking: Accurate historical information
- English: Clear, well-organized English explanations
License
This model is released under the Apache 2.0 license, same as the base model skt/A.X-3.1.
Acknowledgments
- Quantum Nexus — Quantization, benchmarking, and deployment performed on Quantum Nexus's NVIDIA DGX Spark (Blackwell GB10, 128GB)
- SKT for the original A.X-3.1 model
- NVIDIA for ModelOpt quantization toolkit and DGX Spark hardware
- vLLM team for NVFP4 inference support