mattbucci

Qwen3.6-REAM-A3B-AWQ

License: apache-2.0

Vision (restored 2026-05-28)

Earlier revisions of this repo shipped text-only: the quantization path dropped the vision-tower weights (0 model.visual.* tensors) while the config still declared the multimodal architecture, so image inputs produced NaN logits. This was not a structural REAM limitation — REAM only merges the MoE experts and leaves the vision tower untouched; the tower was simply lost in a text-only build step.

Fix: splice the 333-tensor model.visual.* tower back from the upstream BF16 base into model-vision.safetensors (FP16). The pruned INT4 experts are unchanged. validate_capabilities.py now passes 4/4 — basic, thinking, image, and video.
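The tensor counts quoted above (0 before the fix, 333 after) can be checked without loading any weights, because a .safetensors file starts with an 8-byte little-endian header length followed by a JSON header that lists every tensor name. The sketch below is only an illustration of that check; the helper names and the directory layout are assumptions, not part of validate_capabilities.py.

```python
import json
import struct
from pathlib import Path


def tensor_names(path):
    """Return the tensor names listed in a .safetensors header (no weights are read)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def count_prefix(repo_dir, prefix="model.visual."):
    """Count tensors whose name starts with `prefix` across all shards in a directory."""
    total = 0
    for shard in sorted(Path(repo_dir).glob("*.safetensors")):
        total += sum(name.startswith(prefix) for name in tensor_names(shard))
    return total
```

Run against a local snapshot of this repo (the path is up to you), count_prefix(snapshot_dir) should report 333 if the vision tower is present, going by the figure stated above.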

Build provenance

Built on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4):

1. Pull upstream Qwen/Qwen3.6-35B-A3B BF16 weights.
2. REAM-merge experts (256 → 192) using Samsung SAIL merge.py (saliency=reap, grouping=ream, merging=logits+weights).
3. Calibrate with llmcompressor GPTQModifier, 1024 samples, balanced thinking/text mix.
4. Convert compressed-tensors output to native AWQ INT4 (group-size 128).
5. Splice the vision tower (model.visual.*) back from the upstream BF16 base as model-vision.safetensors (FP16) so the multimodal path works.
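The final splice step can be sketched roughly as follows. This is one plausible way to do it with the safetensors and torch libraries, not necessarily the script used for this repo; the function names and the shard list passed in are hypothetical.

```python
def is_vision_tensor(name, prefix="model.visual."):
    """True for tensors that belong to the vision tower."""
    return name.startswith(prefix)


def splice_vision_tower(base_shards, out_path, prefix="model.visual."):
    """Copy vision-tower tensors from BF16 base shards into one FP16 file.

    Returns the number of tensors written.
    """
    # Imported lazily so the name filter above works without torch installed.
    import torch
    from safetensors import safe_open
    from safetensors.torch import save_file

    tensors = {}
    for shard in base_shards:
        with safe_open(shard, framework="pt", device="cpu") as f:
            for name in f.keys():
                if is_vision_tensor(name, prefix):
                    # BF16 -> FP16 cast; values beyond the FP16 range would become inf.
                    tensors[name] = f.get_tensor(name).to(torch.float16).contiguous()
    save_file(tensors, out_path, metadata={"format": "pt"})
    return len(tensors)
```

Called with the upstream base shards and "model-vision.safetensors" as the output path, the return value would be expected to match the 333 tensors reported in this README. If the checkpoint ships a model.safetensors.index.json, its weight_map would presumably also need entries pointing the model.visual.* names at the new file; whether this repo does that is not stated here.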

A few layer-0 expert scales are flagged low-density by the calibration audit. This is a benign artifact of the REAM merge in the first transformer block: router top-k routes around them, and the tiny merged experts contribute ≈0. It does not affect vision. An earlier hypothesis blaming these scales for the image NaNs was ruled out; the real cause was the missing vision tower, now restored.

Usage

SGLang (recommended on AMD ROCm)

```bash
python -m sglang.launch_server \
  --model-path mattbucci/Qwen3.6-REAM-A3B-AWQ \
  --quantization moe_wna16 \
  --dtype bfloat16 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --port 23334
```
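Loading a model of this size can take a while, so a client script may want to wait for the server before sending requests. The following is a minimal sketch, assuming the server answers on the OpenAI-compatible /v1/models route once it is up; the function name and the default timeouts are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:23334", timeout_s=600.0, poll_s=5.0):
    """Poll {base_url}/v1/models until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as r:
                if r.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

A caller would typically check the return value and abort if it is False, rather than sending completions to a server that never came up.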

Send chat completions with enable_thinking: true in chat_template_kwargs for the thinking path. Image inputs use the standard OpenAI image_url content type, given either as a base64 data URL or as a regular URL:

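```python
import openai

# api_key is a placeholder; the port matches --port in the launch command above.
client = openai.OpenAI(base_url="http://localhost:23334/v1", api_key="x")
resp = client.chat.completions.create(
    model="mattbucci/Qwen3.6-REAM-A3B-AWQ",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What's in this image?"},
        # Replace "..." with the base64-encoded image bytes.
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ]}],
    # Turns on the thinking path through the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(resp.choices[0].message.content)
```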
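The data URL in the request is elided, so here is a small helper for building one from a local file. It is a sketch using only the standard library; the function name image_data_url is an assumption, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(path):
    """Encode a local image file as a data URL for an image_url content part."""
    path = Path(path)
    mime, _ = mimetypes.guess_type(path.name)  # based on the extension only
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

The result can be passed as the "url" value of the image_url part. With --reasoning-parser set, SGLang is expected to return the thinking text separately from the answer (commonly as reasoning_content on the message); treat that field name as an assumption to verify against your SGLang version.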

vLLM / Transformers

Loads through the standard AWQ paths in vLLM (≥0.6) and in transformers with AutoAWQ (pip package autoawq) installed.

Notes

Single-user 256K decode ≈ 20–21 tok/s on a 2× R9700 (RDNA4) box with SGLang; ≈ 133 tok/s (text/thinking) with the parent Qwen3.6-35B-A3B DFlash draft via speculative decode.
For NVIDIA Ampere or Apple Silicon cross-validation, see the sister-team READMEs linked from the project repo.

Model Details

Model Provider: mattbucci

Model Tree: Qwen/Qwen3.6-35B-A3B (base) → this model (quantized)

Input Modalities: Text, Image, Video

Output Modalities: Text
