
License: apache-2.0

Model Details

| Property | Value |
| --- | --- |
| Base Model | skt/A.X-3.1 (35B params) |
| Architecture | LlamaForCausalLM |
| Quantization | NVFP4 (4-bit floating point, Blackwell-native) |
| Quantization Tool | nvidia-modelopt v0.44.0 |
| Quantization Config | NVFP4_DEFAULT_CFG (max algorithm) |
| Model Size | ~20.5 GB (3 shards) |
| Original Size | ~64.6 GB (FP16) |
| Compression Ratio | 3.15x |
| Context Length | 32,768 tokens |
| Vocab Size | 102,400 |

Performance

Benchmarked on NVIDIA DGX Spark (Blackwell GB10, 128GB unified LPDDR5X):

| Metric | NVFP4 (this model) | FP16 Original |
| --- | --- | --- |
| PPL (8 Korean eval texts) | 4.49 | 4.88 |
| Speed (vLLM 0.19.1) | ~10 t/s | ~3.5 t/s |
| Memory | 20.5 GB | 64.6 GB |

Perplexity (PPL) was measured on 8 diverse Korean texts (289 tokens total) using the vLLM logprobs API. Lower is better.
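As a rough illustration of how a perplexity figure can be derived from per-token log-probabilities, here is a minimal sketch. It assumes the log-probabilities are natural logs and that tokens from all texts are pooled before averaging; the model card does not say whether the 8 texts were pooled or averaged per text, and the function names are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the negative mean log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def corpus_perplexity(texts_logprobs):
    """Pool the tokens of several texts, then compute one perplexity.

    Assumption: token-weighted pooling. Averaging per-text perplexities
    would give a different number.
    """
    pooled = [lp for text in texts_logprobs for lp in text]
    return perplexity(pooled)


# Toy check: if every token has probability 1/4, perplexity is about 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```

With only 289 tokens in total, a figure computed this way is sensitive to individual tokens, so small differences between two models should be read with caution.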

Key finding: on this small sample, NVFP4 perplexity is no worse than FP16 (4.49 vs 4.88), while the model runs roughly 2.9x faster (~10 vs ~3.5 t/s) and uses about 3.15x less memory (20.5 GB vs 64.6 GB).

Benchmark Results (Accuracy vs Original)

Evaluated using the same Chat CoT protocol as the original model (0-shot, chat template applied, exact_match on the generated answer — the Llama 3 evaluation methodology SKT used for A.X-3.1). This ensures a fair, apples-to-apples comparison between the original FP16 model and the NVFP4 quantized version.

| Category | Benchmark | A.X-3.1 (Original FP16) | A.X-3.1-NVFP4 | Recovery |
| --- | --- | --- | --- | --- |
| Knowledge | KMMLU (Chat CoT, 0-shot) | 69.73% | 67.08% | 96.2% |
| Knowledge | CLIcK (Chat CoT, 0-shot) | 77.09% | 76.99% | 99.9% |
| Knowledge | MMLU (CoT, 0-shot, test) | 75.20% | 73.22% | 97.4% |
| Instruction | IFEval (0-shot) | 87.11% | 85.29% | 97.9% |
| Math | MATH (CoT, 0-shot) | 75.40% | 73.54% | 97.5% |
| Average | | | | 97.8% |

Average recovery is 97.8% across the 5 benchmarks, so NVFP4 4-bit quantization preserves nearly all of the original model's accuracy. The largest drop is on KMMLU (2.65 pp); on CLIcK the gap is just 0.10 pp (essentially lossless).
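The Recovery column can be reproduced from the two score columns as quantized score divided by original score. The sketch below is an illustration of that arithmetic using the numbers from the table; the variable and function names are hypothetical, and it averages the unrounded per-benchmark values (the card does not state whether rounded or unrounded values were averaged, though both give 97.8%).

```python
# (original FP16 score, NVFP4 score) in percent, copied from the table above.
scores = {
    "KMMLU": (69.73, 67.08),
    "CLIcK": (77.09, 76.99),
    "MMLU": (75.20, 73.22),
    "IFEval": (87.11, 85.29),
    "MATH": (75.40, 73.54),
}


def recovery(original, quantized):
    """Share of the original score retained after quantization, in percent."""
    return 100.0 * quantized / original


per_benchmark = {name: recovery(o, q) for name, (o, q) in scores.items()}
average = sum(per_benchmark.values()) / len(per_benchmark)

for name, value in per_benchmark.items():
    print(f"{name}: {value:.1f}%")
print(f"Average: {average:.1f}%")
```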

Per-domain breakdown:

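KMMLU (45 subjects, 35,030 Q)

| STEM | HUMSS | Applied Science | Other |
| --- | --- | --- | --- |
| 69.40% | 69.22% | 65.48% | 65.25% |

CLIcK (1,995 Q)

| Culture | Language |
| --- | --- |
| 78.96% | 72.92% |

MMLU (14,042 Q, test)

| STEM | Social Sciences | Other | Humanities |
| --- | --- | --- | --- |
| 80.7% | 80.4% | 75.7% | 61.9% |

MATH (5,000 Q)

| Algebra | Prealgebra | Num. Theory | Counting | Precalc | Geometry | Int. Algebra |
| --- | --- | --- | --- | --- | --- | --- |
| 88.3% | 79.8% | 69.8% | 69.6% | 67.2% | 62.2% | 62.2% |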

IFEval (4 sub-metrics): prompt-strict 81.89% · inst-strict 87.29% · prompt-loose 83.36% · inst-loose 88.61% (avg 85.29%)
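For reference, the IFEval headline figure is the plain mean of the four sub-metrics, and the recovery value is that mean divided by the original model's 87.11%. A small sketch of the arithmetic, with hypothetical variable names:

```python
# IFEval sub-metrics for the NVFP4 model, in percent (from the line above).
ifeval = {
    "prompt-strict": 81.89,
    "inst-strict": 87.29,
    "prompt-loose": 83.36,
    "inst-loose": 88.61,
}

# Unweighted mean of the four sub-metrics.
ifeval_avg = sum(ifeval.values()) / len(ifeval)

# Recovery relative to the original FP16 score reported in the results table.
original_ifeval = 87.11
ifeval_recovery = 100.0 * ifeval_avg / original_ifeval

print(f"avg = {ifeval_avg:.4f}%, recovery = {ifeval_recovery:.2f}%")
```

The mean comes out at about 85.2875, which the card reports as 85.29%.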

Evaluation: lm-evaluation-harness via local-chat-completions, vLLM 0.19.1 on NVIDIA DGX Spark. Original FP16 scores from the skt/A.X-3.1 model card. Knowledge benchmarks use the 0-shot Chat CoT protocol (chat template + step-by-step reasoning + exact_match); MMLU uses flexible-extract on the full test split, and MATH uses math_verify (symbolic equivalence) — both to match the original's methodology. IFEval recovery is vs the 4-metric average.

How to Use

With vLLM (Recommended)

```bash
# Requires an NVIDIA Blackwell GPU (tested on GB10, sm_121a) and a vLLM build with NVFP4 support
vllm serve dlsxj101/A.X-3.1-NVFP4 \
  --quantization fp4 \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```

With vLLM Docker

```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  ghcr.io/bjk110/vllm-spark:v019-ngc2603 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model dlsxj101/A.X-3.1-NVFP4 \
    --quantization fp4 \
    --dtype float16 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85 \
    --host 0.0.0.0 --port 8000
```

OpenAI-Compatible API

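```python
from openai import OpenAI

# The local vLLM server does not check the API key, so any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="dlsxj101/A.X-3.1-NVFP4",
    # Prompt (Korean): "Please explain the current state of Korea's AI industry."
    messages=[{"role": "user", "content": "한국의 AI 산업 현황을 설명해주세요."}],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)
```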
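To get a rough tokens-per-second figure comparable to the Speed row in the Performance section, one option is to time a single request and divide the reported completion tokens by the elapsed time. The sketch below assumes a client object like the one in the example above and a server that returns a `usage` field. The helper names are hypothetical, and the card does not describe how its own ~10 t/s figure was measured, so this end-to-end number (which includes prompt processing) may not match it exactly.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed_completion(client, model, prompt, max_tokens=256):
    """Send one chat request and return its end-to-end tokens per second.

    `client` is expected to behave like openai.OpenAI; it is passed in so
    that this helper has no hard dependency on the openai package.
    """
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=0.0,
    )
    elapsed = time.perf_counter() - start
    return tokens_per_second(response.usage.completion_tokens, elapsed)
```

Calling `timed_completion(client, "dlsxj101/A.X-3.1-NVFP4", "...")` several times and averaging the results should give a more stable estimate than a single request.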

Hardware Requirements

  • GPU: NVIDIA Blackwell architecture (GB10, GB100, GB200, B100, B200)
    • NVFP4 is a Blackwell-native format computed directly on Tensor Cores
    • Not compatible with pre-Blackwell GPUs (A100, H100, etc.)
  • Memory: ~21 GB GPU memory minimum
  • Software: vLLM >= 0.19.0 with NVFP4 support

Quantization Details

  • Algorithm: max (NVFP4_DEFAULT_CFG) — measures maximum activation values per tensor
  • Group Size: 16
  • Excluded Modules: lm_head (kept in FP16)
  • Calibration: 8 English text samples (sufficient for max algorithm)
  • Quantization Time: ~1 minute on DGX Spark
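The group size also explains why the observed compression is below the naive 4x. As a back-of-the-envelope sketch, assume each group of 16 four-bit weights carries one 8-bit scale (the layout in NVIDIA's published NVFP4 format description; the per-tensor scale is ignored here). The constant names are hypothetical and the sizes are the ones reported in Model Details.

```python
# Assumed NVFP4 layout: 4-bit values plus one 8-bit scale per group of 16.
FP4_BITS = 4
SCALE_BITS = 8
GROUP_SIZE = 16
FP16_BITS = 16

# Effective storage cost per quantized weight, in bits.
effective_bits = FP4_BITS + SCALE_BITS / GROUP_SIZE

# Best-case compression if every weight were quantized.
theoretical_ratio = FP16_BITS / effective_bits

# Compression actually reported in Model Details (64.6 GB to 20.5 GB).
observed_ratio = 64.6 / 20.5

print(f"effective bits per weight: {effective_bits}")
print(f"theoretical ratio: {theoretical_ratio:.2f}x")
print(f"observed ratio: {observed_ratio:.2f}x")
```

Part of the gap between the two ratios comes from lm_head being kept in FP16. Other tensors left unquantized, such as embeddings, may account for the rest, though the card does not list them.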

Qualitative Evaluation

Tested across 8 categories (Korean knowledge, logic, creative writing, coding, summarization, math, fact-checking, English):

  • Korean Knowledge: Accurate, well-structured responses identical to FP16
  • Logic/Reasoning: Correct problem-solving with proper mathematical notation
  • Creative Writing: Natural Korean poetry with appropriate imagery
  • Coding: Correct Python code with proper explanations
  • Summarization: Concise and accurate 3-sentence summaries
  • Math: Correct differentiation with step-by-step solutions
  • Fact-Checking: Accurate historical information
  • English: Clear, well-organized English explanations

License

This model is released under the Apache 2.0 license, same as the base model skt/A.X-3.1.

Acknowledgments

  • Quantum Nexus — Quantization, benchmarking, and deployment performed on Quantum Nexus's NVIDIA DGX Spark (Blackwell GB10, 128GB)
  • SKT for the original A.X-3.1 model
  • NVIDIA for ModelOpt quantization toolkit and DGX Spark hardware
  • vLLM team for NVFP4 inference support

Model provider: dlsxj101

Model tree: skt/A.X-3.1 (base), this model (quantized)

Modalities: text input, text output
