Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: mitOverview
4-bit NVFP4 quantization of OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated — the Kimi-K2.6-distilled, reasoning-DPO-healed, abliterated/uncensored evolution of Qwen/Qwen3.5-122B-A10B (Mixture of Experts, ~10B active / 122B total).
This build packs the transformer weights to NVFP4 with LLM Compressor, cutting the on-disk footprint from ~250 GB to ≈82 GB while keeping the vision tower, MTP head, router gates, and the Gated-DeltaNet attention path in higher precision. It is multimodal (image + text), uncensored, and — despite 4-bit weights — beats the full-precision Qwen3.5-122B-A10B baseline on every benchmark we ran (see Evaluation).
It loads anywhere compressed-tensors is supported and is auto-detected by vLLM (no --quantization flag needed).
Evaluation
Scores below were measured on this NVFP4 build and compared against the full-precision (BF16) Qwen/Qwen3.5-122B-A10B baseline:
| Benchmark | Qwen3.5-122B-A10B (BF16, baseline) | Qwopus3.5 NVFP4 (this model) |
|---|---|---|
| CTI | 64.8 | 71.5 |
| LiveCodeBench | 78.9 | 79.9 |
| BFCL | 72.2 | 85.6 |
Even after 4-bit (NVFP4) weight quantization, this model outperforms the BF16 Qwen3.5-122B-A10B baseline on all three benchmarks — the Kimi-K2.6 distillation + reasoning-DPO healing more than offsets any quantization loss. BFCL is the Berkeley Function-Calling Leaderboard (tool use); LiveCodeBench is contamination-controlled code generation.
Quantization (NVFP4)
Produced with LLM Compressor using the QuantizationModifier recipe shipped in this repo (recipe.yaml).
- Scheme:
NVFP4(format: nvfp4-pack-quantized) — 4-bit float weights in micro-blocks of 16, each block carrying an FP8 (float8_e4m3fn) scale. Weights are static; input activations are quantized dynamically (per-group, static-minmax). - Quantized: all transformer
Linearlayers — attention projections and the 256 routed-expert MoE FFNs (37,056 packed weight tensors). - Left in higher precision (BF16): the vision tower (
visual.*— 333 tensors), the MTP head (model_mtp.safetensors— 785 tensors),lm_head, token embeddings, the MoE router gates (mlp.gate,shared_expert_gate), and the Gated-DeltaNet linear-attention path (linear_attn.*). - Architecture preserved:
Qwen3_5MoeForConditionalGeneration/model_type: qwen3_5_moe, so the checkpoint loads as a drop-in replacement for the base at the architecture level.
Downloads / Other Formats
| Format | Repo | Use it for |
|---|---|---|
| Full BF16 weights | Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated | Transformers / vLLM, fine-tuning, requantizing |
| NVFP4 (this repo) | Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 | vLLM on a single ≥96 GB / Blackwell accelerator (vision + MTP included) |
| GGUF (Q4_K_M) | …-Kimi-K2.6-destill-healed-abliterated-GGUF | llama.cpp / LM Studio (text-only). MTP head included. |
| MLX 4-bit | …-Kimi-K2.6-destill-healed-abliterated-MLX-4bit | Apple Silicon / LM Studio (vision supported) |
Files
| File | Description | Size |
|---|---|---|
model-00001-of-00002.safetensors | NVFP4-packed language weights (4-bit + FP8 scales) + lm_head | ~50.0 GB |
model-00002-of-00002.safetensors | NVFP4-packed language weights (tail) + BF16 vision tower | ~26.4 GB |
model_mtp.safetensors | BF16 MTP head (785 tensors, 1 hidden layer) | ~5.0 GB |
model.safetensors.index.json | Combined weight map | — |
config.json | Multimodal config incl. quantization_config (nvfp4-pack-quantized) | — |
recipe.yaml | LLM Compressor quantization recipe | — |
tokenizer*, chat_template.jinja, generation_config.json, *preprocessor_config.json | Standard | — |
Total on disk: ≈81.5 GB (~76 GiB).
Usage (vLLM)
vLLM auto-detects the NVFP4 compressed-tensors format — no --quantization flag.
bash
vllm serve OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 \--tool-call-parser qwen3_coder \--reasoning-parser qwen3 \--max-model-len 262144
The checkpoint ships the MTP head, so you can enable 1-token speculative decoding:
bash
--speculative-config '{"num_speculative_tokens":1}'
Tip (Qwen3.5 MoE / Gated-DeltaNet): if
torch.compileerrors in the GDN path during startup, add--compilation-config '{"use_inductor_graph_partition":true}'.
Text + vision both work through AutoProcessor / AutoModelForImageTextToText (via the compressed-tensors integration) for non-vLLM workflows.
Vision & MTP
Both the vision tower and the MTP (multi-token-prediction) head are included and kept in BF16.
- Vision works as expected (image / video → text).
- MTP: the head is present and shape-compatible. It enables speculative decoding under vLLM, but on the upstream checkpoint it produced little measurable speedup/quality gain and would benefit from retraining — shipped intact for completeness and forward-compatibility.
Hardware
The NVFP4 weights are ≈82 GB (vs ~250 GB for the BF16 release), so the model runs on a single accelerator with ≥ 96 GB: H200, B200, RTX PRO 6000 Blackwell, or a 128 GB unified-memory NVIDIA DGX Spark / GB10. Native FP4 math requires a Blackwell GPU (compute capability ≥ 10.0 / sm_120+); on other hardware vLLM runs NVFP4 via FlashInfer/emulation.
Support & Community
- Discord: https://discord.gg/rhUZY5GEZr
- Bitcoin Donations:
bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv
Notes
- License: MIT (inherits from the upstream Qwen3.5 base license terms)
- Base Model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated → Qwen/Qwen3.5-122B-A10B
- Quantization: NVFP4 (
nvfp4-pack-quantized, group size 16) via LLM Compressor - Modality: Text + Vision (image / video) + MTP
- Architecture: Qwen3 MoE (~10B active / 122B total) + Qwen3-VL vision tower + MTP head
Thanks
- Jackrong — for the idea of Qwopus merges (Opus distillations on Qwen models).
- wangzhang — for the wonderful abliterix framework, which was customized to do this abliteration.
- The LLM Compressor and vLLM teams for the NVFP4 tooling.
Disclaimer
Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.
Model provider
OpenYourMind
Model tree
Base
OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information