rdtand
Qwen3.6-27B-PrismaAURA-5.5bit-vllm
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Tool-use fidelity (ToolEvalBench, hardmode, deterministic)
Tool-use is the metric we weight most: a small probability shift at a decision
point can flip a tool call. On ToolEvalBench (--no-think --hardmode, sequential,
temperature=0, seed=1234), PrismaAURA scores the highest of the entire
family — above full precision:
| Artifact | ToolEvalBench |
|---|---|
| Qwen3.6-27B PrismaAURA 5.5 (this) | 91 / 100 (134/148) |
| Qwen3.6-27B PrismaSCOUT 5.31 (prior flagship) | 85 / 100 |
| Qwen3.6-27B BF16 (full precision) | 86 / 100 |
Same harness, same seed for all three. PrismaAURA preserves tool-calling behavior at least as well as the unquantized model on this benchmark, at 5.5 bpp.
Served KL-vs-BF16
KL divergence measures how far the quantized model's full output distribution has drifted from the original full-precision model (0 = identical). Measured on a held WikiText split (exact vLLM, n=8 × seqlen 512, vs the BF16 teacher in the same session):
- Served KL-vs-BF16: 0.0342
Against the prior AURA research build at the same bpp (NVFP4+BF16 only, earlier render/export code), this is a −40.9% reduction in served KL — driven by the full FP8 menu, a corrected per-Linear render (fixed GPTQ damping, scale-faithful NVFP4 export), and a corrected calibration probe. (Single-draw served KL; the direction is corroborated by the deterministic ToolEvalBench result above.)
What AURA does
A modern LLM has thousands of weight matrices, each storable at one of several hardware precision formats. AURA splits quantization into two questions and answers the hard one by measurement:
- Local (well studied): given a fixed format, round this one matrix well — GPTQ, implicit clipping, activation-order. PrismaAURA runs the full deliberate render under every Linear.
- Global (PrismaQuant's contribution): how many bits should each Linear get,
and in which format? AURA prices each
(Linear, format)by a KL–Fisher quadratic — the second-order effect of that Linear's quantization error on the model's output distribution, measured with stochastic probes through the real model — and solves a multiple-choice knapsack over the bit budget. A heterogeneous per-Linear assignment extracts quality no single-format method structurally can.
Artifact details
- Source model:
Qwen/Qwen3.6-27B - Export format: vLLM
compressed-tensors, mixed precision - Format menu: NVFP4 (group 16) + FP8 (E4M3 dynamic) + BF16, allocated per-Linear by AURA
- Target hardware: NVIDIA Blackwell (NVFP4-native)
- MTP tensors: included (BF16 passthrough)
- Size on disk: ~23 GB (~5.5 bpp over quantizable parameters)
- Passthrough dtype policy: source dtype preserved (no silent FP32 upcasting)
This is a quality-first operating point: it is larger than the 20.17 GB PrismaSCOUT artifact, not smaller. A matched-footprint AURA point is in progress. Downstream task evals (GSM8K / IFEval / MMLU) are forthcoming; the numbers above are what has been directly measured.
Serving
bash
vllm serve rdtand/Qwen3.6-27B-PrismaAURA-5.5bit-vllm \--quantization compressed-tensors \--trust-remote-code \--max-model-len 32768 \--kv-cache-dtype fp8 \--enable-prefix-caching \--reasoning-parser qwen3 \--enable-auto-tool-choice \--tool-call-parser qwen3_coder \--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
PrismaQuant — mixed-precision LLM quantization that chooses the right format per Linear on real end-to-end KL. Contact: robert.tand@icloud.com
Model provider
rdtand
Model tree
Base
Qwen/Qwen3.6-27B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information