Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 API & Inference Endpoint

Support & Community

☕ If these models are useful to you, consider supporting my work — it funds compute for more & larger abliterations.

buymeacoffee.com/oym.kuato

💬 Discord: discord.gg/rhUZY5GEZr · ₿ Bitcoin: bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv

Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4

Overview

4-bit NVFP4 quantization of OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated — the Kimi-K2.6-distilled, reasoning-DPO-healed, abliterated/uncensored evolution of Qwen/Qwen3.5-122B-A10B (Mixture of Experts, ~10B active / 122B total).

This build packs the transformer weights to NVFP4 with LLM Compressor, cutting the on-disk footprint from ~250 GB to ≈82 GB while keeping the vision tower, MTP head, router gates, and the Gated-DeltaNet attention path in higher precision. It is multimodal (image + text), uncensored, and — despite 4-bit weights — beats the full-precision Qwen3.5-122B-A10B baseline on every benchmark we ran (see Evaluation).

It loads anywhere compressed-tensors is supported and is auto-detected by vLLM (no --quantization flag needed).

Evaluation

Scores below were measured on this NVFP4 build and compared against the full-precision (BF16) Qwen/Qwen3.5-122B-A10B baseline:

Table with columns: Benchmark, Qwen3.5-122B-A10B (BF16, baseline), Qwopus3.5 NVFP4 (this model)
Benchmark	Qwen3.5-122B-A10B (BF16, baseline)	Qwopus3.5 NVFP4 (this model)
CTI	64.8	71.5
LiveCodeBench	78.9	79.9
BFCL	72.2	85.6

Even after 4-bit (NVFP4) weight quantization, this model outperforms the BF16 Qwen3.5-122B-A10B baseline on all three benchmarks — the Kimi-K2.6 distillation + reasoning-DPO healing more than offsets any quantization loss. BFCL is the Berkeley Function-Calling Leaderboard (tool use); LiveCodeBench is contamination-controlled code generation.

Quantization (NVFP4)

Produced with LLM Compressor using the QuantizationModifier recipe shipped in this repo (recipe.yaml).

Scheme: NVFP4 (format: nvfp4-pack-quantized) — 4-bit float weights in micro-blocks of 16, each block carrying an FP8 (float8_e4m3fn) scale. Weights are static; input activations are quantized dynamically (per-group, static-minmax).
Quantized: all transformer Linear layers — attention projections and the 256 routed-expert MoE FFNs (37,056 packed weight tensors).
Left in higher precision (BF16): the vision tower (visual.* — 333 tensors), the MTP head (model_mtp.safetensors — 785 tensors), lm_head, token embeddings, the MoE router gates (mlp.gate, shared_expert_gate), and the Gated-DeltaNet linear-attention path (linear_attn.*).
Architecture preserved: / , so the checkpoint loads as a drop-in replacement for the base at the architecture level.

Downloads / Other Formats

Table with columns: Format, Repo, Use it for
Format	Repo	Use it for
Full BF16 weights	Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated	Transformers / vLLM, fine-tuning, requantizing
NVFP4 (this repo)	Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4	vLLM on a single ≥96 GB / Blackwell accelerator (vision + MTP included)
GGUF (Q4_K_M)	…-Kimi-K2.6-destill-healed-abliterated-GGUF	llama.cpp / LM Studio (text-only). MTP head included.

Files

Table with columns: File, Description, Size
File	Description	Size
`model-00001-of-00002.safetensors`	NVFP4-packed language weights (4-bit + FP8 scales) + `lm_head`	~50.0 GB
`model-00002-of-00002.safetensors`	NVFP4-packed language weights (tail) + BF16 vision tower	~26.4 GB
`model_mtp.safetensors`	BF16 MTP head (785 tensors, 1 hidden layer)	~5.0 GB
`model.safetensors.index.json`

Total on disk: ≈81.5 GB (~76 GiB).

Usage (vLLM)

vLLM auto-detects the NVFP4 compressed-tensors format — no --quantization flag.

bash
vllm serve OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-model-len 262144

The checkpoint ships the MTP head, so you can enable 1-token speculative decoding:

bash
--speculative-config '{"num_speculative_tokens":1}'

Tip (Qwen3.5 MoE / Gated-DeltaNet): if torch.compile errors in the GDN path during startup, add --compilation-config '{"use_inductor_graph_partition":true}'.

Text + vision both work through AutoProcessor / AutoModelForImageTextToText (via the compressed-tensors integration) for non-vLLM workflows.

Vision & MTP

Both the vision tower and the MTP (multi-token-prediction) head are included and kept in BF16.

Vision works as expected (image / video → text).
MTP: the head is present and shape-compatible. It enables speculative decoding under vLLM, but on the upstream checkpoint it produced little measurable speedup/quality gain and would benefit from retraining — shipped intact for completeness and forward-compatibility.

Hardware

The NVFP4 weights are ≈82 GB (vs ~250 GB for the BF16 release), so the model runs on a single accelerator with ≥ 96 GB: H200, B200, RTX PRO 6000 Blackwell, or a 128 GB unified-memory NVIDIA DGX Spark / GB10. Native FP4 math requires a Blackwell GPU (compute capability ≥ 10.0 / sm_120+); on other hardware vLLM runs NVFP4 via FlashInfer/emulation.

Notes

License: MIT (inherits from the upstream Qwen3.5 base license terms)
Base Model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated → Qwen/Qwen3.5-122B-A10B
Quantization: NVFP4 (nvfp4-pack-quantized, group size 16) via LLM Compressor
Modality: Text + Vision (image / video) + MTP
Architecture: Qwen3 MoE (~10B active / 122B total) + Qwen3-VL vision tower + MTP head

Thanks

Jackrong — for the idea of Qwopus merges (Opus distillations on Qwen models).
wangzhang — for the wonderful abliterix framework, which was customized to do this abliteration.
The LLM Compressor and vLLM teams for the NVFP4 tooling.

Disclaimer

Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.

Support & Community

☕ If these models are useful to you, consider supporting my work — it funds compute for more & larger abliterations.

buymeacoffee.com/oym.kuato

💬 Discord: discord.gg/rhUZY5GEZr · ₿ Bitcoin: bc1qsvfduzj9fjs9fugpc52yver3f2g8fp7xjxecdv

Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4

Overview

It loads anywhere compressed-tensors is supported and is auto-detected by vLLM (no --quantization flag needed).

Evaluation

Scores below were measured on this NVFP4 build and compared against the full-precision (BF16) Qwen/Qwen3.5-122B-A10B baseline:

Table with columns: Benchmark, Qwen3.5-122B-A10B (BF16, baseline), Qwopus3.5 NVFP4 (this model)
Benchmark	Qwen3.5-122B-A10B (BF16, baseline)	Qwopus3.5 NVFP4 (this model)
CTI	64.8	71.5
LiveCodeBench	78.9	79.9
BFCL	72.2	85.6

Quantization (NVFP4)

Produced with LLM Compressor using the QuantizationModifier recipe shipped in this repo (recipe.yaml).

Scheme: NVFP4 (format: nvfp4-pack-quantized) — 4-bit float weights in micro-blocks of 16, each block carrying an FP8 (float8_e4m3fn) scale. Weights are static; input activations are quantized dynamically (per-group, static-minmax).
Quantized: all transformer Linear layers — attention projections and the 256 routed-expert MoE FFNs (37,056 packed weight tensors).
Left in higher precision (BF16): the vision tower (visual.* — 333 tensors), the MTP head (model_mtp.safetensors — 785 tensors), lm_head, token embeddings, the MoE router gates (mlp.gate, shared_expert_gate), and the Gated-DeltaNet linear-attention path (linear_attn.*).
Architecture preserved: / , so the checkpoint loads as a drop-in replacement for the base at the architecture level.

Downloads / Other Formats

Table with columns: Format, Repo, Use it for
Format	Repo	Use it for
Full BF16 weights	Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated	Transformers / vLLM, fine-tuning, requantizing
NVFP4 (this repo)	Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4	vLLM on a single ≥96 GB / Blackwell accelerator (vision + MTP included)
GGUF (Q4_K_M)	…-Kimi-K2.6-destill-healed-abliterated-GGUF	llama.cpp / LM Studio (text-only). MTP head included.

Files

Table with columns: File, Description, Size
File	Description	Size
`model-00001-of-00002.safetensors`	NVFP4-packed language weights (4-bit + FP8 scales) + `lm_head`	~50.0 GB
`model-00002-of-00002.safetensors`	NVFP4-packed language weights (tail) + BF16 vision tower	~26.4 GB
`model_mtp.safetensors`	BF16 MTP head (785 tensors, 1 hidden layer)	~5.0 GB
`model.safetensors.index.json`

Total on disk: ≈81.5 GB (~76 GiB).

Usage (vLLM)

vLLM auto-detects the NVFP4 compressed-tensors format — no --quantization flag.

bash
vllm serve OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-model-len 262144

The checkpoint ships the MTP head, so you can enable 1-token speculative decoding:

bash
--speculative-config '{"num_speculative_tokens":1}'

Tip (Qwen3.5 MoE / Gated-DeltaNet): if torch.compile errors in the GDN path during startup, add --compilation-config '{"use_inductor_graph_partition":true}'.

Text + vision both work through AutoProcessor / AutoModelForImageTextToText (via the compressed-tensors integration) for non-vLLM workflows.

Vision & MTP

Both the vision tower and the MTP (multi-token-prediction) head are included and kept in BF16.

Vision works as expected (image / video → text).
MTP: the head is present and shape-compatible. It enables speculative decoding under vLLM, but on the upstream checkpoint it produced little measurable speedup/quality gain and would benefit from retraining — shipped intact for completeness and forward-compatibility.

Hardware

Notes

License: MIT (inherits from the upstream Qwen3.5 base license terms)
Base Model: OpenYourMind/Qwopus3.5-122B-A10B-Kimi-K2.6-destill-healed-abliterated → Qwen/Qwen3.5-122B-A10B
Quantization: NVFP4 (nvfp4-pack-quantized, group size 16) via LLM Compressor
Modality: Text + Vision (image / video) + MTP
Architecture: Qwen3 MoE (~10B active / 122B total) + Qwen3-VL vision tower + MTP head

Thanks

Jackrong — for the idea of Qwopus merges (Opus distillations on Qwen models).
wangzhang — for the wonderful abliterix framework, which was customized to do this abliteration.
The LLM Compressor and vLLM teams for the NVFP4 tooling.

Disclaimer

Use is the responsibility of the user. Ensure your usage complies with applicable laws, platform rules, and deployment requirements.

Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4

README

Support & Community

Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4

Overview

Evaluation

Quantization (NVFP4)

Downloads / Other Formats

Files

Usage (vLLM)

Vision & MTP

Hardware

Notes

Thanks

Disclaimer

Explore FriendliAI today

README

Support & Community

Qwopus3.5-122B-A10B-Kimi-K2.6-destilled-abliterated-NVFP4

Overview

Evaluation

Quantization (NVFP4)

Downloads / Other Formats

Files

Usage (vLLM)

Vision & MTP

Hardware

Notes

Thanks

Disclaimer