Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: mit

Overview

4-bit NVFP4 quantization of OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated — the Kimi-K2.6-distilled, reasoning-DPO-healed, abliterated/uncensored evolution of Qwen/Qwen3.5-122B-A10B (Mixture of Experts, ~10B active / 122B total).

This build packs the transformer weights to NVFP4 with LLM Compressor, cutting the on-disk footprint from ~250 GB to ≈82 GB while keeping the vision tower, MTP head, router gates, and the Gated-DeltaNet attention path in higher precision. It is multimodal (image + text), uncensored, and — despite 4-bit weights — beats the full-precision Qwen3.5-122B-A10B baseline on every benchmark we ran (see Evaluation).

It loads anywhere compressed-tensors is supported and is auto-detected by vLLM (no --quantization flag needed).

Evaluation

Scores below were measured on this NVFP4 build and compared against the full-precision (BF16) Qwen/Qwen3.5-122B-A10B baseline:

BenchmarkQwen3.5-122B-A10B (BF16, baseline)Qwopus3.5 NVFP4 (this model)
CTI64.871.5
LiveCodeBench78.979.9
BFCL72.285.6

Even after 4-bit (NVFP4) weight quantization, this model outperforms the BF16 Qwen3.5-122B-A10B baseline on all three benchmarks — the Kimi-K2.6 distillation + reasoning-DPO healing more than offsets any quantization loss. BFCL is the Berkeley Function-Calling Leaderboard (tool use); LiveCodeBench is contamination-controlled code generation.

Quantization (NVFP4)

Produced with LLM Compressor using the QuantizationModifier recipe shipped in this repo (recipe.yaml).

  • Scheme: NVFP4 (format: nvfp4-pack-quantized) — 4-bit float weights in micro-blocks of 16, each block carrying an FP8 (float8_e4m3fn) scale. Weights are static; input activations are quantized dynamically (per-group, static-minmax).
  • Quantized: all transformer Linear layers — attention projections and the 256 routed-expert MoE FFNs (37,056 packed weight tensors).
  • Left in higher precision (BF16): the vision tower (visual.* — 333 tensors), the MTP head (model_mtp.safetensors — 785 tensors), lm_head, token embeddings, the MoE router gates (mlp.gate, shared_expert_gate), and the Gated-DeltaNet linear-attention path (linear_attn.*).
  • Architecture preserved: Qwen3_5MoeForConditionalGeneration / model_type: qwen3_5_moe, so the checkpoint loads as a drop-in replacement for the base at the architecture level.

Downloads / Other Formats

FormatRepoUse it for
Full BF16 weightsQwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliteratedTransformers / vLLM, fine-tuning, requantizing
NVFP4 (this repo)Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4vLLM on a single ≥96 GB / Blackwell accelerator (vision + MTP included)
GGUF (Q4_K_M)…-Kimi-K2.6-destill-healed-abliterated-GGUFllama.cpp / LM Studio (text-only). MTP head included.
MLX 4-bit…-Kimi-K2.6-destill-healed-abliterated-MLX-4bitApple Silicon / LM Studio (vision supported)

Files

FileDescriptionSize
model-00001-of-00002.safetensorsNVFP4-packed language weights (4-bit + FP8 scales) + lm_head~50.0 GB
model-00002-of-00002.safetensorsNVFP4-packed language weights (tail) + BF16 vision tower~26.4 GB
model_mtp.safetensorsBF16 MTP head (785 tensors, 1 hidden layer)~5.0 GB
model.safetensors.index.jsonCombined weight map
config.jsonMultimodal config incl. quantization_config (nvfp4-pack-quantized)
recipe.yamlLLM Compressor quantization recipe
tokenizer*, chat_template.jinja, generation_config.json, *preprocessor_config.jsonStandard

Total on disk: ≈81.5 GB (~76 GiB).

Usage (vLLM)

vLLM auto-detects the NVFP4 compressed-tensors format — no --quantization flag.

bash

vllm serve OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--max-model-len 262144

The checkpoint ships the MTP head, so you can enable 1-token speculative decoding:

bash

--speculative-config '{"num_speculative_tokens":1}'

Tip (Qwen3.5 MoE / Gated-DeltaNet): if torch.compile errors in the GDN path during startup, add --compilation-config '{"use_inductor_graph_partition":true}'.

Text + vision both work through AutoProcessor / AutoModelForImageTextToText (via the compressed-tensors integration) for non-vLLM workflows.

Vision & MTP

Both the vision tower and the MTP (multi-token-prediction) head are included and kept in BF16.

  • Vision works as expected (image / video → text).
  • MTP: the head is present and shape-compatible. It enables speculative decoding under vLLM, but on the upstream checkpoint it produced little measurable speedup/quality gain and would benefit from retraining — shipped intact for completeness and forward-compatibility.

Hardware

The NVFP4 weights are ≈82 GB (vs ~250 GB for the BF16 release), so the model runs on a single accelerator with ≥ 96 GB: H200, B200, RTX PRO 6000 Blackwell, or a 128 GB unified-memory NVIDIA DGX Spark / GB10. Native FP4 math requires a Blackwell GPU (compute capability ≥ 10.0 / sm_120+); on other hardware vLLM runs NVFP4 via FlashInfer/emulation.

Support & Community

Notes

Thanks

  • Jackrong — for the idea of Qwopus merges (Opus distillations on Qwen models).
  • wangzhang — for the wonderful abliterix framework, which was customized to do this abliteration.
  • The LLM Compressor and vLLM teams for the NVFP4 tooling.

Disclaimer

Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.

Model provider

OpenYourMind

Model tree

Base

OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today