License: apache-2.0
Headline comparison vs previous packed PARO checkpoint
Canonical tx4/quality3 evaluation was run on the equivalent legacy/original exports against the original BF16 HF model. Packed and legacy exports contain the same quantized tensors, so quality metrics are expected to be identical.
| Model | Calibration | Optimizer recipe | Packed BPW ↓ | PPL ↓ | Δ PPL vs prev packed | KL nats ↓ | Δ KL vs prev packed | ΔNLL ↓ | Top-1 % ↑ |
|---|---|---|---|---|---|---|---|---|---|
| PARO full4096-e5-packed | 4096×2048 | previous recipe | 4.6799 | 6.6216 | baseline | 0.034684 | baseline | +0.009506 | 92.000 |
| PARO full4096-rbparams-e5-packed | 4096×2048 | runbook params | 4.6799 | 6.6116 | -0.0100 / -0.15% | 0.028336 | -0.006348 / -18.3% | +0.007996 | 92.816 |
| PARO full8192-oldfresh-rbparams-e5-packed | 8192×2048 | runbook params | pending | pending | pending | pending | pending | pending | pending |
The new runbook-parameter checkpoint is the best PARO result so far on the canonical held-out validation protocol: lower PPL, lower ΔNLL, lower KL divergence, lower RMS true-token probability drift, and higher top-1 agreement than the previous packed 4096/e5 release.
When the 8192-sample run finishes, this headline section should be updated into a calibration-scaling table covering the first 2048-sample run, this 4096-sample run, and the 8192-sample run. The older 2048 row should be re-evaluated under the same canonical protocol before mixing it into this table.
Full canonical quality table
Evaluation protocol:
- Reference: original BF16 HF model
- Validation source: held-out tx4/quality3 calibration validation mix
- Context/window length: `2048` tokens
- Stride: `1023` tokens
- Scored target positions/window: `1025..2047` inclusive
- Windows: `127`
- Scored tokens/model: `129,921`
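The scored-token count follows directly from the window settings above. A minimal sketch of that bookkeeping is given here; the constant names are illustrative, and the assumption that windows start at multiples of the stride over one contiguous token stream is not stated in the protocol.

```python
# Sliding-window bookkeeping for the protocol above (values from the list).
WINDOW = 2048         # context/window length in tokens
STRIDE = 1023         # tokens between consecutive window starts
FIRST_SCORED = 1025   # first scored target position inside a window
LAST_SCORED = 2047    # last scored target position (inclusive)
NUM_WINDOWS = 127

scored_per_window = LAST_SCORED - FIRST_SCORED + 1
total_scored = NUM_WINDOWS * scored_per_window

# Assumption: windows start at 0, STRIDE, 2*STRIDE, ... over one token stream,
# so this is the shortest stream that can hold all 127 windows.
min_stream_tokens = (NUM_WINDOWS - 1) * STRIDE + WINDOW

print(scored_per_window)   # 1023
print(total_scored)        # 129921
print(min_stream_tokens)   # 130946
```

Because the number of scored positions per window equals the stride, consecutive windows score adjacent, non-overlapping spans under that assumption, and every scored token sees roughly the first half of its window as context.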
| Model | Kind | Reference | Artifact BPW ↓ | Packed BPW est. ↓ | PPL ↓ | Ref PPL | Mean NLL ↓ | Ref NLL | ΔNLL ↓ | KL nats ↓ | Max KL ↓ | RMS Δp % ↓ | Top-1 % ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Original BF16 HF | HF/Transformers | self | 16.435 | 16.435 | 6.5590 | 6.5590 | 1.880836 | 1.880836 | +0.000000 | 0.000000 | 0.000000 | 0.000 | 100.000 |
| PARO full4096-e1 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6569 | 6.5590 | 1.895660 | 1.880836 | +0.014824 | 0.034055 | 6.379075 | 5.098 | 92.036 |
| PARO full4096-e5 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6216 | 6.5590 | 1.890342 | 1.880836 | +0.009506 | 0.034684 | 11.042196 | 5.170 | 92.000 |
| PARO full4096-rbparams-e5 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6116 | 6.5590 | 1.888832 | 1.880836 | +0.007996 | 0.028336 | 9.788753 | 4.730 | 92.816 |
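For readers who want to reproduce the column definitions, the sketch below shows one plausible way to compute them from per-token logits. It is not the evaluation script used for this table: the function names are hypothetical, the KL direction (reference to quantized) and the reading of "RMS Δp" as drift in the true-token probability are assumptions, and a real run would use batched tensors rather than Python lists.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def compare_to_reference(ref_logits, q_logits, targets):
    """Compare per-token logit rows of a quantized model against a reference."""
    n = len(targets)
    ref_nll = q_nll = kl_sum = kl_max = dp_sq = 0.0
    agree = 0
    for ref_row, q_row, t in zip(ref_logits, q_logits, targets):
        ref_lp, q_lp = log_softmax(ref_row), log_softmax(q_row)
        ref_nll -= ref_lp[t] / n
        q_nll -= q_lp[t] / n
        # Assumed direction: KL(reference || quantized), in nats.
        kl = sum(math.exp(a) * (a - b) for a, b in zip(ref_lp, q_lp))
        kl_sum += kl
        kl_max = max(kl_max, kl)
        # Assumed meaning of RMS Δp: drift in the true-token probability.
        dp_sq += (math.exp(q_lp[t]) - math.exp(ref_lp[t])) ** 2
        # Top-1 agreement: both models pick the same argmax token.
        agree += ref_row.index(max(ref_row)) == q_row.index(max(q_row))
    return {
        "ppl": math.exp(q_nll),
        "ref_ppl": math.exp(ref_nll),
        "delta_nll": q_nll - ref_nll,
        "kl_mean": kl_sum / n,
        "kl_max": kl_max,
        "rms_dp_pct": 100 * math.sqrt(dp_sq / n),
        "top1_pct": 100 * agree / n,
    }
```

Under these definitions PPL is simply `exp(mean NLL)`, which matches the table: `exp(1.888832)` is approximately 6.6116 for the rbparams row, and ΔNLL is the difference of the two mean NLL columns.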
Packed artifact details
The packed artifact was produced from the legacy/original export with:
```bash
python3 scripts/strip_paro_safetensors.py \
  --input-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full4096-rbparams-e5 \
  --output-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full4096-rbparams-e5-packed \
  --mode packed \
  --overwrite
```
Packed changes:
- Removed every duplicate fp16 `.weight` fallback tensor where the same module has `.qweight`
- Removed tensors: 250
- Removed tensor bytes: 2,810,183,680
- `model.safetensors`: 20,474,495,512 bytes
- Actual packed BPW: 4.6799 using a 35B denominator
- Verified duplicate shared-expert fallback count after stripping: 0
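The BPW figure can be checked from the byte counts in the list above. This is a small arithmetic sketch, assuming the denominator is exactly 35e9 parameters and that the unstripped export size is approximately the packed file plus the removed tensor bytes; the helper name is illustrative.

```python
PACKED_BYTES = 20_474_495_512    # model.safetensors after stripping
REMOVED_BYTES = 2_810_183_680    # duplicate fp16 .weight fallback tensors
PARAM_DENOMINATOR = 35e9         # nominal "35B" denominator (assumed exact)

def bits_per_weight(num_bytes, num_params=PARAM_DENOMINATOR):
    """File-level bits per weight: total bits on disk over parameter count."""
    return num_bytes * 8 / num_params

packed_bpw = bits_per_weight(PACKED_BYTES)
# Assumption: the unstripped export is roughly packed bytes + removed bytes.
unstripped_bpw = bits_per_weight(PACKED_BYTES + REMOVED_BYTES)

print(f"{packed_bpw:.4f}")      # 4.6799
print(f"{unstripped_bpw:.3f}")  # 5.322
```

Note that this is a file-level figure: it counts the safetensors header and every tensor that stays in higher precision, so it sits above the nominal 4 bits of the quantized weights.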
Related checkpoints:
- Previous packed 4096/e5 release: `shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed`
- Legacy/original-format rbparams export: `Qwen3.6-35B-A3B-PARO-full4096-rbparams-e5`
Training/calibration notes
- Quantization: W4A16 ParoQuant, `bits=4`, `group_size=128`, `krot=8`
- Calibration size: `4096` samples × `2048` tokens
- Validation size: `64` samples × `2048` tokens
- Batch size: `8`
- Gradient accumulation: `2`
- Skipped modules: `mlp.gate`, `mlp.shared_expert_gate`, `linear_attn.in_proj_a`, `linear_attn.in_proj_b`
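The batch settings above imply the following per-step quantities. This is only a back-of-the-envelope sketch: it assumes the listed batch size is the global batch (not per device) and that one pass visits every calibration sample once; the variable names are illustrative.

```python
CALIBRATION_SAMPLES = 4096
SEQ_LEN = 2048
BATCH_SIZE = 8      # assumed to be the global batch size
GRAD_ACCUM = 2

effective_batch = BATCH_SIZE * GRAD_ACCUM                 # sequences per optimizer step
steps_per_pass = CALIBRATION_SAMPLES // effective_batch   # optimizer steps per full pass
tokens_per_step = effective_batch * SEQ_LEN
calibration_tokens = CALIBRATION_SAMPLES * SEQ_LEN

print(effective_batch)      # 16
print(steps_per_pass)       # 256
print(tokens_per_step)      # 32768
print(calibration_tokens)   # 8388608
```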
Notes
This artifact requires a packed-aware ParoQuant-compatible loader/runtime; legacy loaders that expect duplicate fp16 fallback .weight tensors will not load this format.
See strip_paro_safetensors_report.json for the exact stripping report.
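To tell whether a given export is in the packed layout before loading it, the safetensors header can be inspected without reading any tensor data. The sketch below uses only the standard library and the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names are hypothetical and are not part of ParoQuant or of `strip_paro_safetensors.py`; the check simply mirrors the rule described above, that no module should carry both `.qweight` and `.weight`.

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian u64
        return json.loads(f.read(header_len))

def duplicate_fallbacks(tensor_names):
    """Module prefixes that have both a `.qweight` and a `.weight` tensor."""
    names = set(tensor_names)
    suffix = ".qweight"
    return sorted(
        name[: -len(suffix)]
        for name in names
        if name.endswith(suffix) and name[: -len(suffix)] + ".weight" in names
    )

def looks_packed(path):
    """True when no module keeps a duplicate fp16 `.weight` fallback."""
    header = read_safetensors_header(path)
    names = [key for key in header if key != "__metadata__"]
    return not duplicate_fallbacks(names)
```

For example, `looks_packed("model.safetensors")` would be expected to return `True` for this artifact and `False` for the legacy/original export, assuming both store all tensors in a single file.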
Model provider: shisa-ai
Base model: Qwen/Qwen3.6-35B-A3B (this model is a quantized derivative)
Modalities: Video, Text, Image input; Text output