README

License: apache-2.0

Headline calibration-scaling comparison

Canonical tx4/quality3 evaluation was run on the equivalent legacy/original exports against the original BF16 HF model. Packed and legacy exports contain the same quantized tensors, so quality metrics are expected to be identical.

| Model | Calibration | Optimizer recipe | Packed BPW ↓ | PPL ↓ | KL nats ↓ | ΔNLL ↓ | RMS Δp % ↓ | Top-1 % ↑ | Max KL ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PARO full2048-e1 | 2048×2048 | early recipe | 4.6799 | 6.6829 | 0.036681 | +0.018721 | 5.329 | 91.683 | 17.3496 |
| PARO full4096-e5-packed | 4096×2048 | previous recipe | 4.6799 | 6.6216 | 0.034684 | +0.009506 | 5.170 | 92.000 | 11.0422 |
| PARO full4096-rbparams-e5-packed | 4096×2048 | runbook params | 4.6799 | 6.6116 | 0.028336 | +0.007996 | 4.730 | 92.816 | 9.7888 |
| PARO full8192-oldfresh-rbparams-e5-packed | 8192×2048 | runbook params | 4.6799 | 6.6090 | 0.027939 | +0.007594 | 4.646 | 92.856 | 6.3961 |

Relative to the previous 4096 runbook-parameter packed checkpoint:

| Metric | 4096 rbparams | 8192 old+fresh rbparams | Change |
| --- | --- | --- | --- |
| PPL ↓ | 6.6116 | 6.6090 | -0.0027 / -0.04% |
| KL nats ↓ | 0.028336 | 0.027939 | -0.000397 / -1.4% |
| ΔNLL ↓ | +0.007996 | +0.007594 | -0.000402 / -5.0% |
| RMS Δp % ↓ | 4.730 | 4.646 | -0.084 / -1.8% |
| Top-1 % ↑ | 92.816 | 92.856 | +0.039 pp |
| Max KL ↓ | 9.7888 | 6.3961 | -3.3927 / -34.7% |
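The Change column can be re-derived from the two value columns. The sketch below does this for the rounded values shown in the table; the helper name `change` is illustrative, and because the published deltas were presumably computed from unrounded metrics, the last digit may differ slightly (for example for PPL).

```python
def change(old: float, new: float) -> tuple[float, float]:
    """Return (absolute change, relative change in percent) going from old to new."""
    delta = new - old
    return delta, 100.0 * delta / old


# Rounded values copied from the table: 4096 rbparams -> 8192 old+fresh rbparams.
rows = {
    "PPL": (6.6116, 6.6090),
    "KL nats": (0.028336, 0.027939),
    "dNLL": (0.007996, 0.007594),
    "RMS dp %": (4.730, 4.646),
    "Max KL": (9.7888, 6.3961),
}

for name, (old, new) in rows.items():
    delta, pct = change(old, new)
    print(f"{name:10s} {delta:+.6f} / {pct:+.2f}%")
```

Top-1 is omitted from the loop because the table reports it as a difference in percentage points rather than as a relative change.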

Relative to the first 2048-sample row:

  • PPL improved 6.6829 → 6.6090 (-0.0739, -1.1%)
  • KL improved 0.036681 → 0.027939 (-0.008742, -23.8%)
  • Top-1 agreement improved 91.683% → 92.856% (+1.17 pp)

The 8192 run is a modest but consistent quality improvement over the 4096 runbook-parameter run, with the largest visible gain in outlier control (Max KL).
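For readers who want to reproduce this kind of comparison, here is a minimal sketch of how the reported metrics could be computed from per-position next-token distributions. It is not the evaluation script used for this card. The function name `compare`, the KL direction (reference relative to quantized), and taking Max KL as the largest per-position KL are assumptions; RMS Δp is left out because its exact definition is not stated here.

```python
import math


def compare(ref_probs, q_probs, targets):
    """Compare a quantized model's next-token distributions against a reference.

    ref_probs, q_probs: one probability list per scored position (softmax outputs,
    so every entry is assumed to be strictly positive).
    targets: the ground-truth token id for each scored position.
    """
    n = len(targets)
    ref_nll = -sum(math.log(p[t]) for p, t in zip(ref_probs, targets)) / n
    q_nll = -sum(math.log(p[t]) for p, t in zip(q_probs, targets)) / n

    # Assumption: KL(reference || quantized) per position, in nats.
    kls = [
        sum(r * math.log(r / q) for r, q in zip(rp, qp) if r > 0.0)
        for rp, qp in zip(ref_probs, q_probs)
    ]

    # Top-1 agreement: the quantized argmax matches the reference argmax.
    agree = sum(
        max(range(len(rp)), key=rp.__getitem__) == max(range(len(qp)), key=qp.__getitem__)
        for rp, qp in zip(ref_probs, q_probs)
    )

    return {
        "ppl": math.exp(q_nll),          # perplexity = exp(mean NLL)
        "ref_ppl": math.exp(ref_nll),
        "delta_nll": q_nll - ref_nll,
        "kl_mean": sum(kls) / n,
        "kl_max": max(kls),
        "top1_pct": 100.0 * agree / n,
    }
```

Under these definitions, a model compared against itself gives ΔNLL 0, KL 0, and 100% Top-1 agreement.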

Full canonical quality table

Evaluation protocol:

  • Reference: original BF16 HF model
  • Validation source: held-out tx4/quality3 calibration validation mix
  • Context/window length: 2048 tokens
  • Stride: 1023 tokens
  • Scored target positions/window: 1025..2047 inclusive
  • Windows: 127
  • Prompt tokens/model: 260,096
  • Scored tokens/model: 129,921

| Model | Kind | Reference | Artifact BPW ↓ | Packed BPW est. ↓ | PPL ↓ | Ref PPL | Mean NLL ↓ | Ref NLL | ΔNLL ↓ | KL nats ↓ | Max KL ↓ | RMS Δp % ↓ | Top-1 % ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original BF16 HF | HF/Transformers | self | 16.435 | 16.435 | 6.5590 | 6.5590 | 1.880836 | 1.880836 | +0.000000 | 0.000000 | 0.000000 | 0.000 | 100.000 |
| PARO full2048-e1 | HF/ParoQuant | Original BF16 HF | 4.680 | 4.677 | 6.6829 | 6.5590 | 1.899558 | 1.880836 | +0.018721 | 0.036681 | 17.349611 | 5.329 | 91.683 |
| PARO full4096-e5 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6216 | 6.5590 | 1.890342 | 1.880836 | +0.009506 | 0.034684 | 11.042196 | 5.170 | 92.000 |
| PARO full4096-rbparams-e5 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6116 | 6.5590 | 1.888832 | 1.880836 | +0.007996 | 0.028336 | 9.788753 | 4.730 | 92.816 |
| PARO full8192-oldfresh-rbparams-e5 | HF/ParoQuant | Original BF16 HF | 5.322 | 4.677 | 6.6090 | 6.5590 | 1.888431 | 1.880836 | +0.007594 | 0.027939 | 6.396052 | 4.646 | 92.856 |
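The window counts in the evaluation protocol follow from the window length and stride. The sketch below is a minimal illustration of one such sliding-window layout. The generator name `windows` is hypothetical, and feeding it the 131,116-token validation figure from the calibration data table is an assumption about what the evaluator actually consumed.

```python
def windows(n_tokens: int, window: int = 2048, stride: int = 1023):
    """Yield (start, first_scored, end) token offsets for every full window.

    Each window sees `window` tokens but only its last `stride` positions are
    scored, so scored regions of consecutive windows tile the stream without
    overlap while every scored token keeps at least 1025 tokens of context.
    """
    start = 0
    while start + window <= n_tokens:
        yield start, start + window - stride, start + window
        start += stride


ws = list(windows(131_116))
n_windows = len(ws)
prompt_tokens = n_windows * 2048
scored_tokens = sum(end - first for _, first, end in ws)
print(n_windows, prompt_tokens, scored_tokens)
```

Within a window, `first_scored - start` is 1025 and `end - start` is 2048, which corresponds to scored positions 1025..2047 inclusive.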

Training and calibration details

Training run:

  • Optimizer run name: full8192-oldfresh-rbparams-e5
  • Started: 2026-05-31T19:18:10+09:00
  • Finished: 2026-06-07T00:28:57+09:00
  • Wall time: about 149h 11m (6d 5h 11m)
  • Layer-loop time reported by tqdm: 149:07:56
  • GPU: single GPU, CUDA_VISIBLE_DEVICES=2
  • Activation spill: local NVMe spill directory under /models/qwen36-paroquant-spill/full8192-oldfresh-rbparams-e5
  • Peak observed spill footprint during monitoring: about 130G
  • Quantization: W4A16 ParoQuant, bits=4, group_size=128, krot=8
  • Batch size: 8
  • Gradient accumulation: 2
  • Cache shards: 8
  • Loss: smooth_l1
  • Skipped modules: mlp.gate, mlp.shared_expert_gate, linear_attn.in_proj_a, linear_attn.in_proj_b
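The reported wall time can be checked against the start and finish timestamps listed above. This is a small standard-library sketch; the variable names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the training run details (both in +09:00).
started = datetime.fromisoformat("2026-05-31T19:18:10+09:00")
finished = datetime.fromisoformat("2026-06-07T00:28:57+09:00")

elapsed = finished - started
hours, rem = divmod(int(elapsed.total_seconds()), 3600)
minutes, seconds = divmod(rem, 60)
print(f"{elapsed.days}d total, {hours}h {minutes}m {seconds}s")
```

The small gap between this figure and the tqdm layer-loop time is the portion of the run spent outside the layer loop.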

Calibration data:

| Split/source | Rows in JSONL | Chars | Qwen token metadata | Notes |
| --- | --- | --- | --- | --- |
| Old 4096 train mix | 8,068 | 26,070,158 | 8,388,652 | prior 4096 tx4/codebreadth/chotto mix |
| Fresh 4096 no-overlap train mix | 7,838 | 26,154,425 | 8,388,608 | fresh sample set; row-hash overlap with old mix was checked as 0 |
| Combined train target | 15,906 rows | 52,224,583 | ~16.78M | optimized as 8192×2048 token blocks; one deterministic duplicate block was padded to preserve fixed batch shapes |
| Held-out validation | 207 | 409,969 | 131,116 | canonical 64×2048-token validation mix |
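As a sanity check on the combined row, the totals can be recomputed from the two source mixes and compared with the 8192×2048 block target. The variable names below are illustrative and the numbers are copied from the table.

```python
SEQ_LEN = 2048

# Per-mix figures from the table: (rows, chars, Qwen tokens).
old_mix = (8_068, 26_070_158, 8_388_652)
fresh_mix = (7_838, 26_154_425, 8_388_608)

combined_rows = old_mix[0] + fresh_mix[0]
combined_chars = old_mix[1] + fresh_mix[1]
combined_tokens = old_mix[2] + fresh_mix[2]

# Token capacity of the 8192 fixed-length calibration blocks.
target_tokens = 8192 * SEQ_LEN

print(combined_rows, combined_chars, combined_tokens, target_tokens)
print(f"{combined_tokens / 1e6:.2f}M tokens vs {target_tokens / 1e6:.2f}M block capacity")
```

The fresh mix is exactly 4096×2048 tokens, while the old mix's token metadata is a few tokens above that figure.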

Combined train rows by group:

| Group | Rows |
| --- | --- |
| english_general | 3,916 |
| code_breadth | 3,212 |
| japanese | 3,162 |
| chat_translation | 1,549 |
| chinese | 1,485 |
| other_multilingual | 1,415 |
| math_stem | 1,166 |
| calibration_padding | 1 |

Validation rows by group:

| Group | Rows |
| --- | --- |
| code_breadth | 54 |
| english_general | 40 |
| japanese | 40 |
| other_multilingual | 24 |
| math_stem | 17 |
| chinese | 16 |
| chat_translation | 15 |
| calibration_padding | 1 |

Top train sources include chotto-20260107-sft, abeja-cc-ja, fineweb2-zh, fineweb-edu-sample, fineweb-sample, wikipedia-en, finemath-4plus, and multiple stack-edu-* code sources.

Packed artifact details

The packed artifact was produced from the legacy/original export with:

```bash
python3 scripts/strip_paro_safetensors.py \
  --input-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5 \
  --output-dir /models/qwen36-quant/Qwen3.6-35B-A3B-PARO-full8192-oldfresh-rbparams-e5-packed \
  --mode packed \
  --overwrite
```

Packed changes:

  • Removed every duplicate fp16 .weight fallback tensor where the same module has .qweight
  • Removed tensors: 250
  • Removed tensor bytes: 2,810,183,680
  • model.safetensors: 20,474,495,512 bytes
  • Actual packed BPW: 4.6799 using a 35B denominator
  • Verified duplicate shared-expert fallback count after stripping: 0
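The packed BPW figure follows directly from the file size and the 35B denominator. The sketch below reproduces it and also estimates the pre-stripping figure. Treating the legacy size as packed size plus removed tensor bytes is an assumption that ignores any change in header size, and `bits_per_weight` is an illustrative helper name.

```python
def bits_per_weight(file_bytes: int, n_params: float = 35e9) -> float:
    """Bits stored on disk per model parameter, using a nominal parameter count."""
    return 8 * file_bytes / n_params


packed_bytes = 20_474_495_512   # model.safetensors after stripping
removed_bytes = 2_810_183_680   # duplicate fp16 fallback tensors that were removed

print(f"packed BPW: {bits_per_weight(packed_bytes):.4f}")
print(f"legacy BPW (estimate): {bits_per_weight(packed_bytes + removed_bytes):.3f}")
```

The 35B denominator is a nominal round number, so the result is a storage-density figure rather than an exact per-parameter bit count.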

Notes

This artifact requires a packed-aware ParoQuant-compatible loader/runtime; legacy loaders that expect duplicate fp16 fallback .weight tensors will not load this format.
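One way to tell whether a given checkpoint is in the packed layout is to inspect the safetensors header, which is an 8-byte little-endian length followed by a JSON index of tensor names. The sketch below uses only the standard library and is not part of any ParoQuant tooling. The function names are hypothetical, and the `.weight` / `.qweight` suffix convention is taken from the packed-changes list above.

```python
import json
import struct


def read_tensor_names(path: str) -> set[str]:
    """Read tensor names from a .safetensors header without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return {name for name in header if name != "__metadata__"}


def duplicate_fallbacks(names: set[str]) -> list[str]:
    """Return .weight tensors whose module also stores a .qweight tensor."""
    suffix = ".weight"
    return sorted(
        n for n in names
        if n.endswith(suffix) and n[: -len(suffix)] + ".qweight" in names
    )


def looks_packed(path: str) -> bool:
    """True when no quantized module still carries a duplicate .weight fallback."""
    return not duplicate_fallbacks(read_tensor_names(path))
```

For a legacy export, `duplicate_fallbacks` would be expected to list the fallback tensors that the stripping step removes; for a packed artifact it should return an empty list.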

See strip_paro_safetensors_report.json for the exact stripping report.

Model provider

shisa-ai

Model tree

Base

Qwen/Qwen3.6-35B-A3B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text
