mattbucci
Qwen3.6-REAM-A3B-AWQ
License: apache-2.0
Vision (restored 2026-05-28)
Earlier revisions of this repo shipped text-only: the quantization path dropped the vision-tower weights (0 model.visual.* tensors) while the config still declared the multimodal architecture, so image inputs produced NaN logits. This was not a structural REAM limitation — REAM only merges the MoE experts and leaves the vision tower untouched; the tower was simply lost in a text-only build step.
Fix: splice the 333-tensor model.visual.* tower back from the upstream BF16 base into model-vision.safetensors (FP16). The pruned INT4 experts are unchanged. validate_capabilities.py now passes 4/4 — basic, thinking, image, and video.
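To confirm that a checkpoint file actually carries the vision tower, the tensor names can be read straight from the safetensors header without loading any weights. The following is a minimal sketch, not the repo's validate_capabilities.py: the helper names are hypothetical, and it assumes the tensors are stored under the literal model.visual. prefix quoted above.

```python
import json
import struct


def tensor_names(path):
    """Return the tensor names listed in a .safetensors file header."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length,
        # followed by that many bytes of JSON describing each tensor.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return [name for name in header if name != "__metadata__"]


def count_with_prefix(path, prefix="model.visual."):
    """Count tensors whose name starts with the given prefix (assumed layout)."""
    return sum(name.startswith(prefix) for name in tensor_names(path))
```

If the description above holds, `count_with_prefix("model-vision.safetensors")` should report the 333 spliced tensors, while the earlier text-only revisions would report 0 across all shards.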
Build provenance
Built on 2× AMD Radeon AI PRO R9700 (gfx1201, RDNA4):
- Pull upstream Qwen/Qwen3.6-35B-A3B BF16 weights.
- REAM-merge experts (256 → 192) using Samsung SAIL merge.py (saliency=reap, grouping=ream, merging=logits+weights).
- Calibrate with llmcompressor GPTQModifier, 1024 samples, balanced thinking/text mix.
- Convert compressed-tensors output to native AWQ INT4 (group-size 128).
- Splice the vision tower (model.visual.*) back from the upstream BF16 base as model-vision.safetensors (FP16) so the multimodal path works.
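For readers unfamiliar with the "INT4 (group-size 128)" layout named in the conversion step, the sketch below shows the general idea: every run of 128 weights shares one scale and one zero point, and each weight is stored as a 4-bit code. This is plain round-to-nearest quantization written for illustration only. It is not the GPTQ calibration or the AWQ packing used in this build, and the function names are hypothetical.

```python
def quantize_int4_groups(weights, group_size=128):
    """Round-to-nearest asymmetric INT4: one (scale, zero point) per group."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    codes, scales, zeros = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Include 0.0 in the range so the zero point always fits in 0..15.
        lo, hi = min(min(group), 0.0), max(max(group), 0.0)
        scale = (hi - lo) / 15
        if scale == 0.0:
            scale = 1.0  # all-zero group; any scale reproduces it exactly
        zero = round(-lo / scale)
        scales.append(scale)
        zeros.append(zero)
        # Clamp each code into the 4-bit range 0..15.
        codes.extend(max(0, min(15, round(w / scale) + zero)) for w in group)
    return codes, scales, zeros


def dequantize_int4_groups(codes, scales, zeros, group_size=128):
    """Map 4-bit codes back to floats using their group's scale and zero point."""
    return [
        (q - zeros[i // group_size]) * scales[i // group_size]
        for i, q in enumerate(codes)
    ]
```

A smaller group size gives each scale fewer weights to cover, which lowers the rounding error at the cost of storing more scales and zero points.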
A few layer-0 expert scales are flagged low-density by the calibration audit — a benign artifact of the REAM merge in the first transformer block (router top-k routes around them; the tiny merged experts contribute ≈0). This does not affect vision: an earlier hypothesis blaming these scales for the image NaNs was ruled out — the real cause was the missing vision tower, now restored.
Usage
SGLang (recommended on AMD ROCm)
```bash
python -m sglang.launch_server \
  --model-path mattbucci/Qwen3.6-REAM-A3B-AWQ \
  --quantization moe_wna16 \
  --dtype bfloat16 \
  --context-length 262144 \
  --reasoning-parser qwen3 \
  --port 23334
```
Post chat completions with enable_thinking: true for the thinking path; image inputs use the standard OpenAI image_url content type (base64 data URL or URL):
```python
import openai

# Point the OpenAI client at the local SGLang server started above.
client = openai.OpenAI(base_url="http://localhost:23334/v1", api_key="x")

resp = client.chat.completions.create(
    model="mattbucci/Qwen3.6-REAM-A3B-AWQ",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
        ],
    }],
    # Enable the thinking path through the chat template.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
```
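The base64 data URL in the request can be built from a local file with the standard library alone. This is a small sketch under the assumption that the image type can be guessed from the file extension; the helper names `image_to_data_url` and `image_message` are illustrative and not part of SGLang or the OpenAI client.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Encode a local image file as a data URL for an image_url content part."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path, prompt):
    """Build a user message with a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
        ],
    }
```

The returned dictionary can be passed as an element of `messages` in a chat completion call such as `client.chat.completions.create(...)`.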
vLLM / Transformers
Loads under standard AWQ paths in vLLM (≥0.6) and transformers with auto-awq installed.
Notes
- Single-user 256K decode ≈ 20–21 tok/s on a 2× R9700 (RDNA4) box with SGLang; ≈ 133 tok/s (text/thinking) with the parent Qwen3.6-35B-A3B DFlash draft via speculative decode.
- For NVIDIA Ampere or Apple Silicon cross-validation, see the sister-team READMEs linked from the project repo.
Model provider: mattbucci
Base model: Qwen/Qwen3.6-35B-A3B (this model is a quantized derivative)
Input modalities: Video, Text, Image
Output modalities: Text