rdtand

Qwen3.6-27B-PrismaAURA-5.5bit-vllm

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Tool-use fidelity (ToolEvalBench, hardmode, deterministic)

Tool-use is the metric we weight most: a small probability shift at a decision point can flip a tool call. On ToolEvalBench (--no-think --hardmode, sequential, temperature=0, seed=1234), PrismaAURA scores the highest of the entire family — above full precision:

Table
ArtifactToolEvalBench
Qwen3.6-27B PrismaAURA 5.5 (this)91 / 100 (134/148)
Qwen3.6-27B PrismaSCOUT 5.31 (prior flagship)85 / 100
Qwen3.6-27B BF16 (full precision)86 / 100

Same harness, same seed for all three. PrismaAURA preserves tool-calling behavior at least as well as the unquantized model on this benchmark, at 5.5 bpp.

Served KL-vs-BF16

KL divergence measures how far the quantized model's full output distribution has drifted from the original full-precision model (0 = identical). Measured on a held WikiText split (exact vLLM, n=8 × seqlen 512, vs the BF16 teacher in the same session):

  • Served KL-vs-BF16: 0.0342

Against the prior AURA research build at the same bpp (NVFP4+BF16 only, earlier render/export code), this is a −40.9% reduction in served KL — driven by the full FP8 menu, a corrected per-Linear render (fixed GPTQ damping, scale-faithful NVFP4 export), and a corrected calibration probe. (Single-draw served KL; the direction is corroborated by the deterministic ToolEvalBench result above.)

What AURA does

A modern LLM has thousands of weight matrices, each storable at one of several hardware precision formats. AURA splits quantization into two questions and answers the hard one by measurement:

  • Local (well studied): given a fixed format, round this one matrix well — GPTQ, implicit clipping, activation-order. PrismaAURA runs the full deliberate render under every Linear.
  • Global (PrismaQuant's contribution): how many bits should each Linear get, and in which format? AURA prices each (Linear, format) by a KL–Fisher quadratic — the second-order effect of that Linear's quantization error on the model's output distribution, measured with stochastic probes through the real model — and solves a multiple-choice knapsack over the bit budget. A heterogeneous per-Linear assignment extracts quality no single-format method structurally can.

Artifact details

  • Source model: Qwen/Qwen3.6-27B
  • Export format: vLLM compressed-tensors, mixed precision
  • Format menu: NVFP4 (group 16) + FP8 (E4M3 dynamic) + BF16, allocated per-Linear by AURA
  • Target hardware: NVIDIA Blackwell (NVFP4-native)
  • MTP tensors: included (BF16 passthrough)
  • Size on disk: ~23 GB (~5.5 bpp over quantizable parameters)
  • Passthrough dtype policy: source dtype preserved (no silent FP32 upcasting)

This is a quality-first operating point: it is larger than the 20.17 GB PrismaSCOUT artifact, not smaller. A matched-footprint AURA point is in progress. Downstream task evals (GSM8K / IFEval / MMLU) are forthcoming; the numbers above are what has been directly measured.

Serving

bash

vllm serve rdtand/Qwen3.6-27B-PrismaAURA-5.5bit-vllm \
--quantization compressed-tensors \
--trust-remote-code \
--max-model-len 32768 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'

PrismaQuant — mixed-precision LLM quantization that chooses the right format per Linear on real end-to-end KL. Contact: robert.tand@icloud.com

Model provider

rdtand

Model tree

Base

Qwen/Qwen3.6-27B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today