shisa-ai

Qwen3.6-35B-A3B-PARO-packed


Headline calibration-scaling comparison

Canonical tx4/quality3 evaluation was run on the equivalent legacy/original exports against the original BF16 HF model. Packed and legacy exports contain the same quantized tensors, so quality metrics are expected to be identical.

Model	Calibration	Optimizer recipe	Packed BPW ↓	PPL ↓	KL nats ↓	ΔNLL ↓	RMS Δp % ↓	Top-1 % ↑	Max KL ↓
PARO full2048-e1	2048×2048	early recipe	4.6799	6.6829	0.036681	+0.018721	5.329	91.683	17.3496
PARO full4096-e5-packed	4096×2048	previous recipe	4.6799	6.6216	0.034684	+0.009506	5.170	92.000	11.0422
PARO full4096-rbparams-e5-packed	4096×2048	runbook params	4.6799	6.6116	0.028336	+0.007996	4.730	92.816	9.7888
PARO full8192-oldfresh-rbparams-e5-packed	8192×2048	runbook params	4.6799	6.6090	0.027939	+0.007594	4.646	92.856	6.3961

Relative to the previous 4096 runbook-parameter packed checkpoint:

Metric	4096 rbparams	8192 old+fresh rbparams	Change
PPL ↓	6.6116	6.6090	-0.0027 / -0.04%
KL nats ↓	0.028336	0.027939	-0.000397 / -1.4%
ΔNLL ↓	+0.007996	+0.007594	-0.000402 / -5.0%
RMS Δp % ↓	4.730	4.646	-0.084 / -1.8%

Relative to the first 2048-sample row:

PPL improved 6.6829 → 6.6090 (-0.0739, -1.1%)
KL improved 0.036681 → 0.027939 (-0.008742, -23.8%)
Top-1 agreement improved 91.683% → 92.856% (+1.17 pp)

The 8192 run is a modest but consistent quality improvement over the 4096 runbook-parameter run. The largest visible gain is in outlier control: Max KL drops from 9.7888 to 6.3961 (about -34.7%).
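
The "Change" figures above are plain absolute and relative differences. A minimal sketch of that arithmetic, using only the rounded values printed in the headline table (the helper name `relative_change` is illustrative, not part of any released tooling):

```python
def relative_change(old: float, new: float) -> tuple[float, float]:
    """Return (absolute change, percent change) going from old to new."""
    delta = new - old
    return delta, 100.0 * delta / old


# Rounded values copied from the headline table:
# (4096 rbparams, 8192 old+fresh rbparams)
metrics = {
    "PPL": (6.6116, 6.6090),
    "KL nats": (0.028336, 0.027939),
    "dNLL": (0.007996, 0.007594),
    "RMS dp %": (4.730, 4.646),
    "Max KL": (9.7888, 6.3961),
}

for name, (old, new) in metrics.items():
    delta, pct = relative_change(old, new)
    print(f"{name}: {old} -> {new} ({delta:+.6f}, {pct:+.1f}%)")
```

Because the inputs here are already rounded, the last digit can differ slightly from a figure computed on unrounded metrics (for example, the PPL difference).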

Full canonical quality table

Evaluation protocol:

Reference: original BF16 HF model
Validation source: held-out tx4/quality3 calibration validation mix
Context/window length: 2048 tokens
Stride: 1023 tokens
Scored target positions/window: 1025..2047 inclusive
Windows: 127
Prompt tokens/model: 260,096
Scored tokens/model: 129,921
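
The token counts in the protocol follow directly from the window settings. A short sketch of the accounting, using only the numbers listed above (the variable names are illustrative):

```python
WINDOW_LEN = 2048          # context/window length in tokens
FIRST_SCORED = 1025        # first scored target position in a window
LAST_SCORED = 2047         # last scored target position (inclusive)
NUM_WINDOWS = 127

# Inclusive range, so add 1.
scored_per_window = LAST_SCORED - FIRST_SCORED + 1

prompt_tokens = NUM_WINDOWS * WINDOW_LEN
scored_tokens = NUM_WINDOWS * scored_per_window

print(scored_per_window, prompt_tokens, scored_tokens)
```

The number of scored positions per window equals the 1023-token stride, and both totals match the prompt and scored token counts listed in the protocol.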

Model	Kind	Reference	Artifact BPW ↓	Packed BPW est. ↓	PPL ↓	Ref PPL	Mean NLL ↓	Ref NLL	ΔNLL ↓	KL nats ↓	Max KL ↓	RMS Δp % ↓	Top-1 % ↑
Original BF16 HF	HF/Transformers	self	16.435	16.435

Training and calibration details

Training run:

Optimizer run name: full8192-oldfresh-rbparams-e5
Started: 2026-05-31T19:18:10+09:00
Finished: 2026-06-07T00:28:57+09:00
Wall time: about 149h 11m (6d 5h 11m)
Layer-loop time reported by tqdm: 149:07:56
GPU: single GPU, CUDA_VISIBLE_DEVICES=2
Activation spill: local NVMe spill directory under /models/qwen36-paroquant-spill/full8192-oldfresh-rbparams-e5
Peak observed spill footprint during monitoring: about 130G
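
The reported wall time can be reproduced from the two timestamps. A minimal sketch using the standard library:

```python
from datetime import datetime

# Timestamps copied from the training-run details (both in +09:00).
started = datetime.fromisoformat("2026-05-31T19:18:10+09:00")
finished = datetime.fromisoformat("2026-06-07T00:28:57+09:00")

elapsed = finished - started
total_seconds = int(elapsed.total_seconds())
hours, rem = divmod(total_seconds, 3600)
minutes, seconds = divmod(rem, 60)

print(f"{elapsed.days}d total, {hours}h {minutes}m {seconds}s")
```

This gives 149h 10m 47s, which rounds to the "about 149h 11m" figure and sits a few minutes above the tqdm layer-loop time.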

Calibration data:

Split/source	Rows in JSONL	Chars	Qwen token metadata	Notes
Old 4096 train mix	8,068	26,070,158	8,388,652	prior 4096 tx4/codebreadth/chotto mix
Fresh 4096 no-overlap train mix	7,838	26,154,425	8,388,608	fresh sample set; row-hash overlap with old mix was checked as 0
Combined train target	15,906	52,224,583	~16.78M
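
The table notes that row-hash overlap between the old and fresh mixes was checked as 0. The exact hashing scheme is not described, so the following is only a sketch of one way such a check could be done, assuming each JSONL row is hashed over its canonical JSON serialization (`row_hash`, `overlap_count`, and `load_jsonl` are hypothetical helpers):

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Hash a row over a canonical (key-sorted) JSON serialization."""
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def overlap_count(rows_a, rows_b) -> int:
    """Count distinct row hashes that appear in both collections."""
    hashes_a = {row_hash(r) for r in rows_a}
    hashes_b = {row_hash(r) for r in rows_b}
    return len(hashes_a & hashes_b)


def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Tiny in-memory example; real use would pass load_jsonl(...) results.
old_rows = [{"text": "alpha"}, {"text": "beta"}]
fresh_rows = [{"text": "gamma"}, {"text": "delta"}]
print(overlap_count(old_rows, fresh_rows))
```

An exact-hash check like this only detects byte-identical rows; near-duplicates would need a different approach.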

Combined train rows by group:

Group	Rows
english_general	3,916
code_breadth	3,212
japanese	3,162
chat_translation	1,549
chinese	1,485
other_multilingual	1,415
math_stem	1,166
calibration_padding	1
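
As a consistency check, the per-split figures and the per-group row counts should both add up to the combined train target. A short sketch using the values from the two tables above:

```python
# Per-split figures copied from the calibration data table.
old_mix = {"rows": 8_068, "chars": 26_070_158, "tokens": 8_388_652}
fresh_mix = {"rows": 7_838, "chars": 26_154_425, "tokens": 8_388_608}
combined = {key: old_mix[key] + fresh_mix[key] for key in old_mix}

# Per-group rows copied from the combined train rows table.
train_groups = {
    "english_general": 3_916,
    "code_breadth": 3_212,
    "japanese": 3_162,
    "chat_translation": 1_549,
    "chinese": 1_485,
    "other_multilingual": 1_415,
    "math_stem": 1_166,
    "calibration_padding": 1,
}
group_total = sum(train_groups.values())

print(combined, group_total)
```

The combined token count is 16,777,260, about 16.78M, which is close to 8192 × 2048 = 16,777,216.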

Validation rows by group:

Group	Rows
code_breadth	54
english_general	40
japanese	40
other_multilingual	24
math_stem	17
chinese	16
chat_translation	15
calibration_padding	1

Top train sources include chotto-20260107-sft, abeja-cc-ja, fineweb2-zh, fineweb-edu-sample, fineweb-sample, wikipedia-en, finemath-4plus, and multiple stack-edu-* code sources.

Packed artifact details

The packed artifact was produced from the legacy/original export with:

```bash
python3 scripts/strip_paro_safetensors.py \
  --input-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5 \
  --output-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5-packed \
  --mode packed \
  --overwrite
```

Packed changes:

Removed every duplicate fp16 .weight fallback tensor where the same module has .qweight
Removed tensors: 250
Removed tensor bytes: 2,810,183,680
model.safetensors: 20,474,495,512 bytes
Actual packed BPW: 4.6799 using a 35B denominator
Verified duplicate shared-expert fallback count after stripping: 0
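
The packed BPW figure is the file size in bits divided by the stated 35B parameter denominator. A minimal sketch of that calculation, which also expresses the removed bytes in the same unit:

```python
FILE_BYTES = 20_474_495_512      # model.safetensors size after stripping
REMOVED_BYTES = 2_810_183_680    # bytes of removed fallback tensors
PARAM_DENOMINATOR = 35e9         # "35B denominator" stated above

packed_bpw = FILE_BYTES * 8 / PARAM_DENOMINATOR
removed_bpw = REMOVED_BYTES * 8 / PARAM_DENOMINATOR

print(f"packed BPW: {packed_bpw:.4f}")
print(f"removed tensors, in BPW terms: {removed_bpw:.2f}")
```

Note that this is bits per nominal parameter over the whole file, so it includes scales, metadata, and any unquantized tensors rather than only the quantized weights.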

Related checkpoints:

4096 runbook packed release: shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-rbparams-e5-packed
Previous packed 4096/e5 release: shisa-ai/Qwen3.6-35B-A3B-PARO-full4096-e5-packed
Legacy/original-format 8192 export: Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5

Notes

This artifact requires a packed-aware ParoQuant-compatible loader/runtime; legacy loaders that expect duplicate fp16 fallback .weight tensors will not load this format.
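
To tell the two layouts apart, one can look for modules that carry both a `.qweight` and a `.weight` tensor. The sketch below works on a plain list of tensor names and is not the logic of `strip_paro_safetensors.py`; the function name and the example tensor names are hypothetical:

```python
def duplicate_fallbacks(tensor_names) -> list[str]:
    """Return .weight tensors whose module also has a .qweight tensor."""
    names = set(tensor_names)
    suffix = ".weight"
    return sorted(
        name
        for name in names
        if name.endswith(suffix)
        and name[: -len(suffix)] + ".qweight" in names
    )


# Hypothetical tensor names, for illustration only.
legacy_names = [
    "layers.0.mlp.shared_expert.up_proj.qweight",
    "layers.0.mlp.shared_expert.up_proj.weight",   # duplicate fallback
    "layers.0.input_layernorm.weight",             # no .qweight, so kept
]
packed_names = [n for n in legacy_names if n not in duplicate_fallbacks(legacy_names)]

print(duplicate_fallbacks(legacy_names))
print(duplicate_fallbacks(packed_names))
```

For a real checkpoint, the tensor names could be read from the safetensors file and passed to the same function; an empty result would be consistent with the packed layout described above.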

See strip_paro_safetensors_report.json for the exact stripping report.

Model provider: shisa-ai
Base model: Qwen/Qwen3.6-35B-A3B (this model is a quantized variant)
Input modalities: Video, Text, Image
Output modalities: Text


