License: apache-2.0

Headline calibration-scaling comparison
Canonical tx4/quality3 evaluation was run on the equivalent legacy/original exports against the original BF16 HF model. Packed and legacy exports contain the same quantized tensors, so quality metrics are expected to be identical.
| Model | Calibration | Optimizer recipe | Packed BPW ↓ | PPL ↓ | KL nats ↓ | ΔNLL ↓ | RMS Δp % ↓ | Top-1 % ↑ | Max KL ↓ |
|---|---|---|---|---|---|---|---|---|---|
| PARO full2048-e1 | 2048×2048 | early recipe | 4.6799 | 6.6829 | 0.036681 | +0.018721 | 5.329 | 91.683 | 17.3496 |
| PARO full4096-e5-packed | 4096×2048 | previous recipe | 4.6799 | 6.6216 | 0.034684 | +0.009506 | 5.170 | 92.000 | 11.0422 |
| PARO full4096-rbparams-e5-packed | 4096×2048 | runbook params | 4.6799 | 6.6116 | 0.028336 | +0.007996 | 4.730 | 92.816 | 9.7888 |
| PARO full8192-oldfresh-rbparams-e5-packed | 8192×2048 | runbook params | 4.6799 | 6.6090 | 0.027939 | +0.007594 | 4.646 | 92.856 | 6.3961 |
Relative to the previous 4096 runbook-parameter packed checkpoint:
| Metric | 4096 rbparams | 8192 old+fresh rbparams | Change |
|---|---|---|---|
| PPL ↓ | 6.6116 | 6.6090 | -0.0027 / -0.04% |
| KL nats ↓ | 0.028336 | 0.027939 | -0.000397 / -1.4% |
| ΔNLL ↓ | +0.007996 | +0.007594 | -0.000402 / -5.0% |
| RMS Δp % ↓ | 4.730 | 4.646 | -0.084 / -1.8% |
| Top-1 % ↑ | 92.816 | 92.856 | +0.039 pp |
| Max KL ↓ | 9.7888 | 6.3961 | -3.3927 / -34.7% |
Relative to the first 2048-sample row:
- PPL improved 6.6829 → 6.6090 (-0.0739, -1.1%)
- KL improved 0.036681 → 0.027939 (-0.008742, -23.8%)
- Top-1 agreement improved 91.683% → 92.856% (+1.17 pp)
The 8192 run is a modest but consistent quality improvement over the 4096 runbook-parameter run, with the largest visible gain in outlier control (Max KL).
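The change column in the 4096-versus-8192 comparison can be reproduced from the published values. The sketch below is illustrative only: the helper name `change` is hypothetical, and it works from the rounded table entries, so the last digit can differ slightly from a figure that was presumably computed from unrounded metrics (for example the PPL delta).

```python
# Values copied from the comparison table (rounded as published):
# metric -> (4096 rbparams, 8192 old+fresh rbparams)
rows = {
    "PPL": (6.6116, 6.6090),
    "KL nats": (0.028336, 0.027939),
    "ΔNLL": (0.007996, 0.007594),
    "RMS Δp %": (4.730, 4.646),
    "Max KL": (9.7888, 6.3961),
}


def change(old: float, new: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent)."""
    delta = new - old
    return delta, 100.0 * delta / old


for name, (old, new) in rows.items():
    delta, pct = change(old, new)
    print(f"{name:10s} {old:>10.6f} -> {new:>10.6f}  {delta:+.6f} / {pct:+.2f}%")
```

Top-1 agreement is already a percentage, so its change is reported as a plain difference in percentage points rather than as a relative change.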
Full canonical quality table
Evaluation protocol:
- Reference: original BF16 HF model
- Validation source: held-out tx4/quality3 calibration validation mix
- Context/window length: `2048` tokens
- Stride: `1023` tokens
- Scored target positions/window: `1025..2047` inclusive
- Windows: `127`
- Prompt tokens/model: `260,096`
- Scored tokens/model: `129,921`
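The token counts in the protocol follow from the window geometry. The sketch below checks that arithmetic. The `window_starts` helper is hypothetical, and laying the windows out over a single concatenated 64×2048-token stream is an assumption made here for illustration, not something the protocol states.

```python
WINDOW = 2048      # context/window length in tokens
STRIDE = 1023      # tokens the window advances between evaluations
NUM_WINDOWS = 127  # number of windows reported in the protocol

# Scored target positions inside each window: 1025..2047 inclusive.
scored_positions = range(1025, 2047 + 1)

prompt_tokens = NUM_WINDOWS * WINDOW
scored_tokens = NUM_WINDOWS * len(scored_positions)


def window_starts(total_tokens: int, window: int = WINDOW, stride: int = STRIDE) -> list[int]:
    """Start offsets of every full window that fits in total_tokens.

    Assumes one contiguous token stream (an illustrative assumption).
    """
    return list(range(0, total_tokens - window + 1, stride))


print(prompt_tokens, scored_tokens, len(window_starts(64 * 2048)))
```

Because the number of scored positions per window equals the stride, consecutive windows score adjacent, non-overlapping token ranges, and every scored token sees roughly a thousand tokens of preceding context.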
| Model | Kind | Reference | Artifact BPW ↓ | Packed BPW est. ↓ | PPL ↓ | Ref PPL | Mean NLL ↓ | Ref NLL | ΔNLL ↓ | KL nats ↓ | Max KL ↓ | RMS Δp % ↓ | Top-1 % ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original BF16 HF | HF/Transformers | self | 16.435 | 16.435 | 6.5590 | 6.5590 | 1.880836 | 1.880836 | +0.000000 | 0.000000 | 0.000000 | 0.000 | 100.000 |
| PARO full2048-e1 | HF/ParoQuant | Original BF16 HF | 4.680 | 4.677 | 6.6829 | 6.5590 | 1.899558 | 1.880836 | +0.018721 | 0.036681 | 17.349611 | 5.329 | 91.683 |
| PARO full4096-e5 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6216 | 6.5590 | 1.890342 | 1.880836 | +0.009506 | 0.034684 | 11.042196 | 5.170 | 92.000 |
| PARO full4096-rbparams-e5 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6116 | 6.5590 | 1.888832 | 1.880836 | +0.007996 | 0.028336 | 9.788753 | 4.730 | 92.816 |
| PARO full8192-oldfresh-rbparams-e5 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6090 | 6.5590 | 1.888431 | 1.880836 | +0.007594 | 0.027939 | 6.396052 | 4.646 | 92.856 |
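The PPL and ΔNLL columns are consistent with the usual definitions: perplexity as the exponential of the mean NLL in nats, and ΔNLL as the mean NLL minus the reference mean NLL. A minimal check using the published, rounded values follows. The dictionary layout is just for illustration, and the sixth decimal of ΔNLL can differ by one unit because the inputs are themselves rounded.

```python
import math

REF_NLL = 1.880836  # mean NLL of the original BF16 HF model

# model -> (mean NLL, published PPL), copied from the table above
table = {
    "Original BF16 HF": (1.880836, 6.5590),
    "PARO full2048-e1": (1.899558, 6.6829),
    "PARO full4096-e5": (1.890342, 6.6216),
    "PARO full4096-rbparams-e5": (1.888832, 6.6116),
    "PARO full8192-oldfresh-rbparams-e5": (1.888431, 6.6090),
}

for name, (nll, published_ppl) in table.items():
    ppl = math.exp(nll)        # perplexity from mean NLL (nats)
    delta_nll = nll - REF_NLL  # degradation relative to the reference
    print(f"{name:36s} PPL={ppl:.4f} (published {published_ppl:.4f})  ΔNLL={delta_nll:+.6f}")
```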
Training and calibration details
Training run:
- Optimizer run name: `full8192-oldfresh-rbparams-e5`
- Started: `2026-05-31T19:18:10+09:00`
- Finished: `2026-06-07T00:28:57+09:00`
- Wall time: about `149h 11m` (`6d 5h 11m`)
- Layer-loop time reported by tqdm: `149:07:56`
- GPU: single GPU, `CUDA_VISIBLE_DEVICES=2`
- Activation spill: local NVMe spill directory under `/models/qwen36-paroquant-spill/full8192-oldfresh-rbparams-e5`
- Peak observed spill footprint during monitoring: about `130G`
- Quantization: W4A16 ParoQuant, `bits=4`, `group_size=128`, `krot=8`
- Batch size: `8`
- Gradient accumulation: `2`
- Cache shards: `8`
- Loss: `smooth_l1`
- Skipped modules: `mlp.gate`, `mlp.shared_expert_gate`, `linear_attn.in_proj_a`, `linear_attn.in_proj_b`
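The reported wall time can be cross-checked against the start and finish timestamps using only the standard library. Reading the gap between wall time and the tqdm layer-loop time as setup and teardown outside the loop is an assumption made here, not something the run log is quoted as saying.

```python
from datetime import datetime

started = datetime.fromisoformat("2026-05-31T19:18:10+09:00")
finished = datetime.fromisoformat("2026-06-07T00:28:57+09:00")
elapsed = finished - started

total_seconds = int(elapsed.total_seconds())
hours, rem = divmod(total_seconds, 3600)
minutes, seconds = divmod(rem, 60)
print(f"wall time: {hours}h {minutes}m {seconds}s "
      f"({elapsed.days}d {elapsed.seconds // 3600}h {(elapsed.seconds % 3600) // 60}m)")

# Layer-loop time as reported by tqdm, in H:MM:SS form.
loop_h, loop_m, loop_s = map(int, "149:07:56".split(":"))
loop_seconds = loop_h * 3600 + loop_m * 60 + loop_s

# Time spent outside the layer loop (assumed: loading, export, and similar).
outside_loop_seconds = total_seconds - loop_seconds
print(f"outside layer loop: {outside_loop_seconds}s")
```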
Calibration data:
| Split/source | Rows in JSONL | Chars | Qwen token metadata | Notes |
|---|---|---|---|---|
| Old 4096 train mix | 8,068 | 26,070,158 | 8,388,652 | prior 4096 tx4/codebreadth/chotto mix |
| Fresh 4096 no-overlap train mix | 7,838 | 26,154,425 | 8,388,608 | fresh sample set; row-hash overlap with old mix was checked as 0 |
| Combined train target | 15,906 | 52,224,583 | ~16.78M | optimized as 8192×2048 token blocks; one deterministic duplicate block was padded to preserve fixed batch shapes |
| Held-out validation | 207 | 409,969 | 131,116 | canonical 64×2048-token validation mix |
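The combined row of the calibration table is the sum of the old and fresh mixes, and the token total sits very close to the 8192×2048 block budget. A small sanity check follows. The dictionary keys are illustrative names, and the characters-per-token figure is a derived ratio, not a number from the source.

```python
old = {"rows": 8_068, "chars": 26_070_158, "tokens": 8_388_652}
fresh = {"rows": 7_838, "chars": 26_154_425, "tokens": 8_388_608}

# Combined train target: element-wise sum of the two mixes.
combined = {key: old[key] + fresh[key] for key in old}

# Token budget implied by "8192×2048 token blocks".
block_tokens = 8192 * 2048

print(combined)
print(f"block budget: {block_tokens:,} tokens")
print(f"metadata tokens minus block budget: {combined['tokens'] - block_tokens}")
print(f"chars per token: {combined['chars'] / combined['tokens']:.3f}")
```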
Combined train rows by group:
| Group | Rows |
|---|---|
| english_general | 3,916 |
| code_breadth | 3,212 |
| japanese | 3,162 |
| chat_translation | 1,549 |
| chinese | 1,485 |
| other_multilingual | 1,415 |
| math_stem | 1,166 |
| calibration_padding | 1 |
Validation rows by group:
| Group | Rows |
|---|---|
| code_breadth | 54 |
| english_general | 40 |
| japanese | 40 |
| other_multilingual | 24 |
| math_stem | 17 |
| chinese | 16 |
| chat_translation | 15 |
| calibration_padding | 1 |
Top train sources include chotto-20260107-sft, abeja-cc-ja, fineweb2-zh, fineweb-edu-sample, fineweb-sample, wikipedia-en, finemath-4plus, and multiple stack-edu-* code sources.
Packed artifact details
The packed artifact was produced from the legacy/original export with:
```bash
python3 scripts/strip_paro_safetensors.py \
  --input-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5 \
  --output-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5-packed \
  --mode packed \
  --overwrite
```
Packed changes:
- Removed every duplicate fp16 `.weight` fallback tensor where the same module has `.qweight`
- Removed tensors: 250
- Removed tensor bytes: 2,810,183,680
- `model.safetensors`: 20,474,495,512 bytes
- Actual packed BPW: 4.6799 using a 35B denominator
- Verified duplicate shared-expert fallback count after stripping: 0
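The packed BPW figure follows directly from the file size and the 35B parameter denominator. The sketch below also estimates the pre-strip figure by adding the removed tensor bytes back, which assumes the legacy file differs from the packed one only by those tensors (header size differences are ignored). The helper name `bits_per_weight` is hypothetical.

```python
PACKED_BYTES = 20_474_495_512   # size of the packed model.safetensors
REMOVED_BYTES = 2_810_183_680   # bytes of removed fp16 fallback tensors
PARAM_DENOMINATOR = 35e9        # "35B denominator" used in the source


def bits_per_weight(num_bytes: int, params: float = PARAM_DENOMINATOR) -> float:
    """File size in bits divided by the nominal parameter count."""
    return num_bytes * 8 / params


packed_bpw = bits_per_weight(PACKED_BYTES)
# Assumption: legacy export size ≈ packed size + removed tensor bytes.
legacy_bpw_estimate = bits_per_weight(PACKED_BYTES + REMOVED_BYTES)

print(f"packed BPW: {packed_bpw:.4f}")
print(f"estimated legacy BPW: {legacy_bpw_estimate:.3f}")
```

Under that assumption the estimate lands on the 5.322 artifact BPW listed for the unpacked e5 exports in the quality table.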
Related checkpoints:
- 4096 runbook packed release: `shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-rbparams-e5-packed`
- Previous packed 4096/e5 release: `shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed`
- Legacy/original-format 8192 export: `Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5`
Notes
This artifact requires a packed-aware ParoQuant-compatible loader/runtime; legacy loaders that expect duplicate fp16 fallback .weight tensors will not load this format.
See strip_paro_safetensors_report.json for the exact stripping report.
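One way to tell a packed checkpoint from a legacy one before loading it is to inspect the tensor names in the safetensors header. The sketch below is not the project's script or loader. Both function names are hypothetical, it relies only on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header), and it assumes a single-file checkpoint such as `model.safetensors`.

```python
import json
import struct


def read_safetensors_names(path: str) -> list[str]:
    """Return tensor names from a .safetensors header without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def duplicate_fallbacks(names) -> list[str]:
    """Module prefixes that carry both a quantized .qweight and an fp16 .weight."""
    names = set(names)
    suffix = ".qweight"
    return sorted(
        name[: -len(suffix)]
        for name in names
        if name.endswith(suffix) and name[: -len(suffix)] + ".weight" in names
    )
```

With these helpers, `duplicate_fallbacks(read_safetensors_names("model.safetensors"))` returning an empty list would suggest the packed layout, while a non-empty list would suggest a legacy export that still contains fallback tensors.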
Model provider: shisa-ai
Base model: Qwen/Qwen3.6-35B-A3B (this model is a quantized derivative)
Input modalities: video, text, image
Output modalities: text