XReyRobert

Nex-N2-mini-GPTQ-Pro

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Source And Credits

Source model:

Quantization tooling and reference recipe:

Artifact Summary

Table
FieldValue
Source modelnex-agi/Nex-N2-mini
ArchitectureQwen3_5MoeForConditionalGeneration
Model typeqwen3_5_moe
Tensor files5
Safetensors size19.23 GiB
Indexed tensors124576
Quantized qweight tensors30970
mtp.* tensors in indexfalse
vision/visual tensors in indextrue
Index metadata size matches shardstrue

The source index/logs showed no mtp.* tensors. This artifact therefore normalizes text_config.mtp_num_hidden_layers to 0 and records the change under artifact_notes.mtp.

Quantization Recipe

Table
SettingValue
MethodGPTQ-Pro / GPTQModel
Quantizergptqmodel:6.1.0-dev
Bits4
Group size128
Symmetric quantizationtrue
Desc actfalse
True sequentialtrue
Calibration datasetWikiText
Calibration samples256
Calibration sequence length2048
MSE2.0
Damp percent0.05
Damp auto increment0.01
FOEM alpha0.25
FOEM beta0.2
FOEM devicecuda:0
MoE routingExpertsRoutingBypass
MoE bypass batch size320
Dense VRAM strategyexclusive
MoE VRAM strategybalanced
Pack implementationcpu

Fallback smoothing was enabled for difficult groups with threshold 0.5%.

Intended Serving Shape

This checkpoint is intended for advanced users testing text-only GPTQ serving for Qwen3.6-style MoE models.

A starting vLLM shape for text-only testing:

bash

vllm serve XReyRobert/Nex-N2-mini-GPTQ-Pro \
--served-model-name nex-n2-mini-gptq-pro \
--language-model-only \
--dtype float16 \
--quantization gptq_marlin \
--tensor-parallel-size 1 \
--max-model-len 262144 \
--max-num-seqs 1 \
--kv-cache-dtype fp8_e5m2 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--trust-remote-code

Treat this as a starting point. Loader compatibility depends on vLLM, Transformers, GPTQModel, GPTQ-Marlin, and Qwen3.6 MoE support.

The RTX 3090 image above reflects separate 262k-context serving validation.

Validation And Benchmarks

Completed artifact checks:

  • Local shard index inspection completed before upload.
  • Remote file list verified after upload.
  • Remote model.safetensors.index.json verified after upload.
  • Index metadata total size matches the local safetensor shards.
  • The remote artifact contains the expected five safetensor shards.

Terminal-Bench 2.0 Smoke24 result and associated vLLM serving measurements. This Smoke24 run used max_model_len=131072 for apples-to-apples comparison with the other local models in this publication batch:

Table
RunScoreSuccess rateWall-timeOutput tokensObserved decodeLLM API time
nex-n2-mini-gptq-pro14/2458.3%314.6m1670.6k140.8 tok/s197.4m

Smoke24 is a fixed 24-task Terminal-Bench 2.0 comparison corpus, not a full Terminal-Bench leaderboard run. In this harness, Nex-N2-mini GPTQ-Pro tied the Qwen3.6 27B GPTQ reference on solved tasks but used more wall time and far more output tokens. That makes it a useful candidate for further serving and generation-control tuning, not an efficiency leader in this specific test.

Task list and harness shape:

MTP And Vision Status

  • mtp.* tensors are not present in this artifact.
  • text_config.mtp_num_hidden_layers was normalized to 0.
  • Do not enable MTP speculative decoding for this artifact.
  • Vision/visual tensors are present, but multimodal serving has not been validated for this quantized artifact.

Limitations

  • Experimental quantization.
  • Terminal-Bench Smoke24 is a small local comparison corpus, not a full benchmark submission.
  • Nex-N2-mini was verbose and reasoning-heavy in the Smoke24 harness; generation controls may need further tuning.
  • MTP speculative decoding is not supported by this artifact.
  • Vision tensors are preserved, but vision behavior has not been validated.
  • Loader behavior may vary across vLLM, Transformers, GPTQModel, and GPTQ-Marlin versions.

Files

Key files:

  • model.safetensors.index.json
  • model-00001-of-00005.safetensors through model-00005-of-00005.safetensors
  • config.json
  • quantize_config.json
  • processor_config.json
  • tokenizer.json
  • UPLOAD_MANIFEST.json

UPLOAD_MANIFEST.json records the upload guardrail checks and artifact inspection summary.

References

Individual Project Notice

This repository is an individual research project. It is not affiliated with, sponsored by, or endorsed by any employer or organization.

Model provider

XReyRobert

Model tree

Base

nex-agi/Nex-N2-mini

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today