rdtand

Qwen3.6-27B-PrismaAURA-5.5bit-vllm

README

License: apache-2.0

Tool-use fidelity (ToolEvalBench, hardmode, deterministic)

Tool-use is the metric we weight most: a small probability shift at a decision point can flip a tool call. On ToolEvalBench (--no-think --hardmode, sequential, temperature=0, seed=1234), PrismaAURA scores the highest of the entire family — above full precision:

Table with columns: Artifact, ToolEvalBench
Artifact	ToolEvalBench
Qwen3.6-27B PrismaAURA 5.5 (this)	91 / 100 (134/148)
Qwen3.6-27B PrismaSCOUT 5.31 (prior flagship)	85 / 100
Qwen3.6-27B BF16 (full precision)	86 / 100

Same harness, same seed for all three. PrismaAURA preserves tool-calling behavior at least as well as the unquantized model on this benchmark, at 5.5 bpp.

Served KL-vs-BF16

KL divergence measures how far the quantized model's full output distribution has drifted from the original full-precision model (0 = identical). Measured on a held WikiText split (exact vLLM, n=8 × seqlen 512, vs the BF16 teacher in the same session):

Served KL-vs-BF16: 0.0342

Against the prior AURA research build at the same bpp (NVFP4+BF16 only, earlier render/export code), this is a −40.9% reduction in served KL — driven by the full FP8 menu, a corrected per-Linear render (fixed GPTQ damping, scale-faithful NVFP4 export), and a corrected calibration probe. (Single-draw served KL; the direction is corroborated by the deterministic ToolEvalBench result above.)

What AURA does

A modern LLM has thousands of weight matrices, each storable at one of several hardware precision formats. AURA splits quantization into two questions and answers the hard one by measurement:

Local (well studied): given a fixed format, round this one matrix well — GPTQ, implicit clipping, activation-order. PrismaAURA runs the full deliberate render under every Linear.
Global (PrismaQuant's contribution): how many bits should each Linear get, and in which format? AURA prices each (Linear, format) by a KL–Fisher quadratic — the second-order effect of that Linear's quantization error on the model's output distribution, measured with stochastic probes through the real model — and solves a multiple-choice knapsack over the bit budget. A heterogeneous per-Linear assignment extracts quality no single-format method structurally can.

Artifact details

Source model: Qwen/Qwen3.6-27B
Export format: vLLM compressed-tensors, mixed precision
Format menu: NVFP4 (group 16) + FP8 (E4M3 dynamic) + BF16, allocated per-Linear by AURA
Target hardware: NVIDIA Blackwell (NVFP4-native)
MTP tensors: included (BF16 passthrough)
Size on disk: ~23 GB (~5.5 bpp over quantizable parameters)
Passthrough dtype policy: source dtype preserved (no silent FP32 upcasting)

This is a quality-first operating point: it is larger than the 20.17 GB PrismaSCOUT artifact, not smaller. A matched-footprint AURA point is in progress. Downstream task evals (GSM8K / IFEval / MMLU) are forthcoming; the numbers above are what has been directly measured.

Serving

bash
vllm serve rdtand/Qwen3.6-27B-PrismaAURA-5.5bit-vllm \
  --quantization compressed-tensors \
  --trust-remote-code \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'

PrismaQuant — mixed-precision LLM quantization that chooses the right format per Linear on real end-to-end KL. Contact: robert.tand@icloud.com

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

rdtand

Model Tree

Base

Qwen/Qwen3.6-27B

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities