mattbucci

Qwen3.6-REAM-A3B-AWQ

README

License: apache-2.0

Vision (restored 2026-05-28)

Earlier revisions of this repo shipped text-only: the quantization path dropped the vision-tower weights (0 model.visual.* tensors) while the config still declared the multimodal architecture, so image inputs produced NaN logits. This was not a structural REAM limitation — REAM only merges the MoE experts and leaves the vision tower untouched; the tower was simply lost in a text-only build step.

Fix: splice the 333-tensor model.visual.* tower back from the upstream BF16 base into model-vision.safetensors (FP16). The pruned INT4 experts are unchanged. validate_capabilities.py now passes 4/4 — basic, thinking, image, and video.
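A quick way to confirm whether a checkpoint shard actually carries the vision tower is to read only the safetensors header and count the model.visual.* entries. The sketch below relies on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names safetensors_tensor_names and count_with_prefix are hypothetical and are not part of this repo or of validate_capabilities.py.

```python
import json
import struct


def safetensors_tensor_names(path):
    """List tensor names from a .safetensors header without loading any weights."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def count_with_prefix(names, prefix="model.visual."):
    """Count tensor names that start with the given prefix."""
    return sum(1 for name in names if name.startswith(prefix))
```

Running count_with_prefix(safetensors_tensor_names("model-vision.safetensors")) against a local copy of this repo should report the tower size stated above (333) if the splice is present; a result of 0 would match the earlier text-only builds.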

Build provenance

Built on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4):

  1. Pull upstream Qwen/Qwen3.6-35B-A3B BF16 weights.
  2. REAM-merge experts (256 → 192) using Samsung SAIL merge.py (saliency=reap, grouping=ream, merging=logits+weights).
  3. Calibrate with llmcompressor GPTQModifier, 1024 samples, balanced thinking/text mix.
  4. Convert compressed-tensors output to native AWQ INT4 (group-size 128).
  5. Splice the vision tower (model.visual.*) back from the upstream BF16 base as model-vision.safetensors (FP16) so the multimodal path works.
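As a rough sanity check on steps 2 and 4, the arithmetic below estimates the fraction of experts kept and the effective storage cost per quantized weight. It is a back-of-the-envelope sketch, assuming one FP16 scale and one 4-bit zero point per group of 128 weights (a common AWQ layout); the function name awq_bits_per_weight is hypothetical, and the exact on-disk overhead of this checkpoint may differ.

```python
def awq_bits_per_weight(weight_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight once per-group scale and zero point are amortized.

    Assumption: each group stores one scale and one zero point.
    """
    return weight_bits + (scale_bits + zero_bits) / group_size


# Expert counts from step 2 of the build list.
experts_before, experts_after = 256, 192
kept = experts_after / experts_before

print(kept)                   # 0.75
print(awq_bits_per_weight())  # 4.15625
```

Under these assumptions the merge keeps three quarters of the experts, and the group-wise metadata adds about 0.16 bits on top of the nominal 4 bits per weight.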

A few layer-0 expert scales are flagged as low-density by the calibration audit. This is a benign artifact of the REAM merge in the first transformer block: top-k routing avoids these tiny merged experts, so they contribute ≈0. It does not affect vision. An earlier hypothesis blaming these scales for the image NaNs was ruled out; the real cause was the missing vision tower, now restored.

Usage

bash

python -m sglang.launch_server \
  --model-path mattbucci/Qwen3.6-REAM-A3B-AWQ \
  --quantization moe_wna16 \
  --dtype bfloat16 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --port 23334
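Loading the weights can take a while, so a client script may want to wait until the server answers before sending requests. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible /v1/models route on the port used above; http_probe and wait_until_ready are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(url, probe=http_probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll `url` until the probe succeeds or the attempts run out."""
    for i in range(attempts):
        if probe(url):
            return True
        if i < attempts - 1:
            sleep(delay)
    return False
```

For the launch command above, the call would be wait_until_ready("http://localhost:23334/v1/models"). The probe and sleep arguments are injectable so the loop can be exercised without a running server.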

Send chat completions to the OpenAI-compatible endpoint. To use the thinking path, pass enable_thinking: true via chat_template_kwargs; image inputs use the standard OpenAI image_url content type (a base64 data URL or a regular URL):

python

import openai

# Point the OpenAI client at the local SGLang server; the API key is a placeholder.
client = openai.OpenAI(base_url="http://localhost:23334/v1", api_key="x")

resp = client.chat.completions.create(
    model="mattbucci/Qwen3.6-REAM-A3B-AWQ",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What's in this image?"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ]}],
    # Enable the thinking path through the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(resp.choices[0].message.content)
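The image_url value in the request is a data URL built from the raw image bytes. A minimal sketch of that encoding step follows, using only the standard library; to_data_url and image_message are hypothetical helper names, and the MIME type must match the actual image format.

```python
import base64


def to_data_url(data, mime="image/png"):
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(prompt, image_bytes, mime="image/png"):
    """Build a user message with one text part and one image_url part."""
    return {"role": "user", "content": [
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": {"url": to_data_url(image_bytes, mime)}},
    ]}
```

In practice the bytes would come from reading a local file in binary mode, and the returned dict can be passed directly in the messages list of the request.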

vLLM / Transformers

The model loads under the standard AWQ paths in vLLM (≥0.6) and in transformers with AutoAWQ installed.

Notes

  • Single-user 256K decode ≈ 20–21 tok/s on a 2× R9700 (RDNA4) box with SGLang; ≈ 133 tok/s (text/thinking) with the parent Qwen3.6-35B-A3B DFlash draft via speculative decode.
  • For NVIDIA Ampere or Apple Silicon cross-validation, see the sister-team READMEs linked from the project repo.

Model provider: mattbucci

Model tree: base Qwen/Qwen3.6-35B-A3B; this model is a quantized derivative

Modalities: input Video, Text, Image; output Text
