Quantizations
Three release lines:
GGUF (llama.cpp / ollama / text-generation-webui)
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF — 31 quants + F16, all imatrix-quantized with bartowski's calibration_datav5. imatrix.dat archived alongside the quants for reproducibility/audit.
Also published as ollama tags: mannix/omnimerge-v4.
The vision tower's mmproj projector lives in bartowski/Qwen_Qwen3.6-27B-GGUF and works unchanged with the v4 GGUFs (vision tower is preserved verbatim from the base).
MLX 4-bit — text-only (Apple Silicon)
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit — text-only 4-bit MLX (group_size 64, 4.501 bits/weight), ~15 GB, loads via mlx_lm.load. Use this if you don't need vision and want a slightly smaller download.
from mlx_lm import load, generate
model, tokenizer = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit")
print(generate(model, tokenizer, prompt="...", max_tokens=512, verbose=True))
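The 4.501 bits/weight figure can be sanity-checked with a short calculation. This is a back-of-the-envelope sketch, assuming each group of 64 weights stores one 16-bit scale and one 16-bit bias next to the 4-bit values; that layout is an assumption about MLX group-wise quantization, not something stated in this card, and the helper name is illustrative.

```python
def effective_bits_per_weight(bits: int, group_size: int,
                              overhead_bits_per_group: int = 32) -> float:
    """Payload bits plus per-group metadata amortized over the group.

    overhead_bits_per_group=32 assumes one fp16 scale + one fp16 bias
    per group (an assumption, see the lead-in).
    """
    return bits + overhead_bits_per_group / group_size


print(effective_bits_per_weight(4, 64))  # 4.5
```

The small remainder above 4.5 in the reported figure would then come from tensors that are not quantized to 4 bits, which is also an assumption here.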
MLX 4-bit — Vision-Language (Apple Silicon, multimodal)
ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit — full multimodal 4-bit MLX (group_size 64, 4.695 bits/weight — vision tower kept at higher precision), ~16 GB, loads via mlx_vlm.load. Use this for image + video input.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config
repo = "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit"
model, processor = load(repo)
config = load_config(repo)
prompt = apply_chat_template(processor, config,
                             "Describe the image in detail.", num_images=1)
print(generate(model, processor, prompt,
               max_tokens=512, verbose=True, image=["path/to/image.png"]))
Sources
Method: omnimerge_v2 (DARE-TIES base + OBIM-lite + DAREx q + EMR election). Density 0.53, DAREx q 0.75, seed 42.
Benchmark Results (Q6_K quantization)
All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama.cpp server with --reasoning-format deepseek --reasoning-budget 8192. Sampler greedy (do_sample=False, T=0.0, top_p=1.0, top_k=0) across all benches — this is the canonical recipe for cross-cohort comparison. Earlier revisions used T=0.6 for GPQA to match v2's published recipe; the canonical 2026-05-22 re-run on pod 37268930 uses greedy throughout and supersedes those numbers.
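For readers who want to probe the server by hand, the greedy recipe corresponds roughly to a request body like the one below. This is a sketch only: the field names follow the OpenAI-style completions schema, `top_k` is assumed to be accepted as a server-side extension, the `max_tokens` default is a placeholder, and this is not the exact request that lm_eval sends.

```python
import json


def greedy_completion_payload(prompt: str, max_tokens: int = 8192) -> dict:
    """Build a raw /v1/completions body that requests greedy decoding.

    max_tokens default is a placeholder, not a value taken from the
    eval scripts.
    """
    return {
        "prompt": prompt,          # raw text, no chat template applied
        "max_tokens": max_tokens,
        "temperature": 0.0,        # T=0.0
        "top_p": 1.0,              # nucleus sampling disabled
        "top_k": 0,                # top-k disabled (assumed server extension)
    }


print(json.dumps(greedy_completion_payload("def add(a, b):"), indent=2))
```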
v4-MLP vs Qwen3.6 base + Omnimerge-v2 (head-to-head, same eval methodology)
All three columns scored under identical conditions: same llama.cpp server config (--reasoning-format deepseek --reasoning-budget 8192 --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0 -c 65536), same lm_eval invocation (local-completions raw /v1/completions, no chat template), same gen kwargs. v4-MLP columns reflect the canonical 2026-05-22 full-bench greedy re-run on pod 37268930.
| Benchmark | Qwen3.6 base Q6_K (bartowski) | Omnimerge-v2 (Qwen3.5 base) | Omnimerge-v4-MLP (Qwen3.6 base) | Δ vs base | Δ vs v2 |
|---|---|---|---|---|---|
| HumanEval pass@1 (164q) | 84.76% (139/164) | 79.27% | 83.54% (137/164) | −1.22 pp | +4.27 pp |
| MBPP pass@1 (500q), raw lm_eval | 56.20% | n/a | 68.40% | +12.20 pp | n/a |
Key observations:
- HumanEval is effectively level with base: 137/164 versus 139/164 in the table above, a two-problem gap (−1.22 pp). With MLP-passthrough preserving the base MLPs and HumanEval being mostly elementary Python function completion, the merged attn + linear_attn deltas barely move the score. This doubles as a sanity check that the MLP-passthrough surgery did its job: the model's "elementary coding" behavior stays close to the base it inherited its MLPs from.
- MBPP is where the merge value shows — +15.8 pp over Qwen3.6 base on the corrected score, and essentially tied with v2 (Qwen3.5-base merge). MBPP exercises a wider range of algorithms and control flow than HumanEval, where the merged reasoning + attention deltas help.
- GPQA is the strongest reasoning lift — +9.09 pp over v2 on the full-bench greedy comparison. Note this is smaller than the previous partial-cache estimate (≈ +15.5 pp) because v2 was sampled at T=0.6 with budget 16384 (an easier configuration for verbose reasoning) while v4 is now measured under greedy at budget 8192. The marquee win is real, but the magnitude is the +9.09 pp greedy figure, not the +15.5 pp partial-sampled figure.
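The Δ columns are differences in percentage points (pp) between pass rates, not relative changes. A small sketch that recomputes the HumanEval and raw MBPP deltas from the figures in the table (helper names are illustrative):

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate in percent."""
    return 100.0 * passed / total


def delta_pp(score: float, reference: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(score - reference, 2)


base_he = pass_rate(139, 164)   # Qwen3.6 base, HumanEval
v4_he = pass_rate(137, 164)     # Omnimerge-v4-MLP, HumanEval

print(f"{base_he:.2f}% {v4_he:.2f}% {delta_pp(v4_he, base_he):+.2f} pp")
# 84.76% 83.54% -1.22 pp
print(f"{delta_pp(68.40, 56.20):+.2f} pp")  # raw MBPP vs base: +12.20 pp
```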
§ GPQA Diamond full greedy re-measurement (2026-05-22, pod 37268930). Sampler do_sample=False, T=0.0, --reasoning-budget 8192, max_gen_toks=8192. Wall time 4 h 55 min on 3090 Q6_K. Companion strict-match (rigid Answer: X template) is 7.58 % — the model emits CoT verbosely rather than the strict template, so the flexible-extract 78.28 % is the real quality signal. The earlier partial 84.75 % (177 of 198, sampled T=0.6, budget=16384) was a methodology artifact, not a model regression — re-measuring v2 under greedy at budget=8192 would also drop several points. The new 78.28 % is the canonical figure going forward.
* MBPP score correction (important): lm_eval's mbpp scorer evaluates exec(prompt + completion + tests). When a model emits <think>...</think>\n\ndef foo(): ..., the literal < character causes a Python SyntaxError even though the function code below is valid and would pass the tests. We re-scored by stripping <think>...</think> blocks (and unclosed <think>...EOF truncations) before exec.
- v4-MLP: 68.40% → 73.40% (+5.0 pp, recovered 25/500 valid-code-but-SyntaxError generations).
- Qwen3.6 base: 56.20% → 57.60% (+1.4 pp, recovered 7/500). Base closes its think tags more reliably than v4-MLP (0% unclosed vs 4.8%) and emits them less often, which is why the correction is smaller.
- v2 (Qwen3.5 base) had a much lower native think-rate so the correction is negligible at that scale; the published 74.60% was the lm_eval raw score.
Re-scoring script: scripts/rescore_mbpp_strip_think.py. The corrected scores are the apples-to-apples comparison; raw lm_eval scores are kept in the table for transparency.
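The correction can be sketched as follows. This is a minimal illustration, not the contents of scripts/rescore_mbpp_strip_think.py: the function names and regexes are assumptions, and the pass check simply mirrors the exec(prompt + completion + tests) description above.

```python
import re

# A closed block: <think> ... </think> plus any trailing whitespace.
_CLOSED_THINK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)
# An unclosed block: <think> running to end of text (truncated generation).
_UNCLOSED_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_think(completion: str) -> str:
    """Remove closed think blocks first, then any unclosed tail."""
    text = _CLOSED_THINK.sub("", completion)
    return _UNCLOSED_THINK.sub("", text)


def passes(prompt: str, completion: str, tests: str, strip: bool = True) -> bool:
    """Return True if prompt + completion + tests executes without error."""
    body = strip_think(completion) if strip else completion
    # The extra newline only guards against a completion with no trailing newline.
    program = prompt + body + "\n" + tests
    try:
        exec(program, {})  # runs generated code: sandbox this in real use
        return True
    except Exception:      # SyntaxError, AssertionError, NameError, ...
        return False


raw = "<think>Add the two numbers.</think>\n\ndef add(a, b):\n    return a + b\n"
tests = "assert add(1, 2) == 3"
print(passes("", raw, tests, strip=False))  # False: '<' triggers a SyntaxError
print(passes("", raw, tests, strip=True))   # True: valid code is recovered
```

A completion that is truncated inside an unclosed think block strips down to nothing, so it still fails the tests; only generations with valid code after the think block are recovered.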
‡ GPQA Diamond eval history (resolved 2026-05-22). The original 2026-05-13 run hit an aiohttp lifecycle bug in lm_eval.models.api_models.amodel_call that crashed on the at-budget reasoning tail (16384-token responses outlasting the ClientSession); we produced a partial 84.75 % (150/177 matched cached responses sampled at T=0.6, budget=16384) and kept restarting until 192/198 cached. The 2026-05-22 canonical re-run on pod 37268930 ran the full 198 under greedy decoding with budget=8192 and max_length=32768, having patched lm_eval's api_models.py:545 UnboundLocalError upstream (it crashed on transient TimeoutError before outputs was assigned) — see the quantize_gguf.py chain script + the omnimergekit pod_v4_q6k_eval_chain.sh for the bit-exact recipe. The canonical headline going forward is 78.28 % flexible-extract / 7.58 % strict-match on 198/198. The earlier 84.75 % partial-sampled figure is superseded but kept here for transparency about the prior methodology drift.
Why "MLP-passthrough"
When we merged Qwen3.6 the same way we'd successfully merged Qwen3.5 (Omnimerge-v2), the resulting model emitted unclosed <think> tags 80% of the time on coding prompts — pass@1 collapsed to ~20%. Forensic per-tensor delta inspection (see scripts/inspect_v4_delta.py) localized the failure mode to the mlp.gate_proj / mlp.up_proj / mlp.down_proj tensors in mid-to-late MLP layers (peak deltas in layers 27-52, max rel-L2 ≈ 2.1%). lm_head and embed_tokens were byte-identical to base — the policy attractor lived in MLP, not in token-emission logits.
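The per-tensor metric can be illustrated with a small sketch. It assumes "rel-L2" means the L2 norm of (merged − base) divided by the L2 norm of base, which this card does not define explicitly. The function works on flat lists of floats for illustration, whereas scripts/inspect_v4_delta.py streams safetensors shards.

```python
import math
from typing import Sequence


def rel_l2(base: Sequence[float], merged: Sequence[float]) -> float:
    """Relative L2 delta: ||merged - base||_2 / ||base||_2 (assumed definition)."""
    if len(base) != len(merged):
        raise ValueError("tensors must have the same number of elements")
    diff = math.sqrt(sum((m - b) ** 2 for b, m in zip(base, merged)))
    norm = math.sqrt(sum(b * b for b in base))
    if norm == 0.0:
        # Degenerate case: an all-zero base tensor.
        return 0.0 if diff == 0.0 else float("inf")
    return diff / norm


# Toy tensors: one element nudged by 0.1 against a base norm of 5.0.
print(f"{rel_l2([3.0, 4.0], [3.0, 4.1]):.1%}")  # 2.0%
```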
We rebuilt v4 with mlp.{gate,up,down}_proj copied verbatim from clean Qwen3.6 base (scripts/v4_mlp_passthrough.py) and everything else (attn, linear_attn, norms, embed/head) kept from the merge. The leak went to 0% on a 10-prompt isolation test, MBPP pass@1 jumped to 50% on the same isolation set, and full-eval scores (above) confirmed the surgery rescued the merge.
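Conceptually, the passthrough is a selective copy over tensor names. The sketch below works on plain name-to-tensor mappings and is not the code in scripts/v4_mlp_passthrough.py; the tensor names in the example are hypothetical, and the patterns are the ones listed for --skip-patterns in the Scripts section.

```python
from typing import Dict, Mapping, TypeVar

T = TypeVar("T")

MLP_PATTERNS = ("mlp.gate_proj", "mlp.up_proj", "mlp.down_proj")


def mlp_passthrough(merged: Mapping[str, T], base: Mapping[str, T]) -> Dict[str, T]:
    """Keep merged tensors, except MLP projections, which are taken from base."""
    out = dict(merged)
    for name in merged:
        if any(pattern in name for pattern in MLP_PATTERNS):
            out[name] = base[name]  # verbatim copy from the clean base
    return out


# Hypothetical tensor names; strings stand in for real tensors.
merged = {
    "model.layers.27.mlp.gate_proj.weight": "merged",
    "model.layers.27.self_attn.q_proj.weight": "merged",
}
base = {
    "model.layers.27.mlp.gate_proj.weight": "base",
    "model.layers.27.self_attn.q_proj.weight": "base",
}
print(mlp_passthrough(merged, base))
```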
Key finding: Qwen3.6's think-policy is fragile to small MLP perturbations
| Test | Clean Qwen3.6 base | v4 (full merge, broken) | v4-MLP (this model) |
|---|---|---|---|
| <think> open rate (mbpp-10 isolation) | 40% | 80% | 0% |
| Unclosed </think> | 0/4 | 88% of opens | 0/10 |
| MBPP pass@1 (mbpp-10 isolation) | 40% | 20% | 50% |
| Empty response (chat-completions) | | | |
Identical hyperparameters on Qwen3.5 base (Omnimerge-v2) produced 0.2% leak — so this is a Qwen3.6-specific fragility, not a general merge problem. Plausible cause: Qwen3.6 was post-trained later with reasoning-specific data that tightened the policy decision boundary; small (1-2% rel L2) MLP perturbations push it across.
The cost of MLP-passthrough is that we lose the merged MLP uplift on coding tasks. Even so, the full MBPP/HumanEval results show that the attn + linear_attn deltas alone are enough to lift HumanEval by 4.27 pp over Qwen3.5-Omnimerge-v2 while staying roughly tied on MBPP (73.40% corrected vs. 74.60%).
Compatibility
Architecture: qwen3_5 (unified Qwen3.5 / Qwen3.6 family). Vision tower preserved (mmproj available via the Q6_K GGUF release — multimodal works exactly like clean Qwen3.6).
Inference works under:
- transformers (BF16): both use_cache=True and use_cache=False paths
- llama.cpp (GGUF): recommended args --reasoning-format deepseek --reasoning-budget 8192
- vLLM (untested at time of publish, expected to work)
Scripts
All merge tooling is in the scripts/ directory of this repo:
| Script | Purpose |
|---|---|
| dare_ties_merge.py | Main merger. --method omnimerge_v2 is the published method. Auto-detects Qwen3.6 base via config.output_gate_type and auto-applies --skip-patterns 'mlp.gate_proj,mlp.up_proj,mlp.down_proj' (override with --no-auto-mlp-skip). |
| v4_mlp_passthrough.py | Post-process tool: rebuild merged dir with MLP layers copied from base. Refuses to run on Qwen3.5 base (where MLP merging is safe; see v2). Use as final pre-quant step for any external merger output (mergekit, eX-LRP) targeting Qwen3.6. |
| inspect_v4_delta.py | Per-tensor delta-magnitude forensics vs base. Streams safetensors shards, no full model load. Used to localize the policy-leak weight region. |
Reproducing the merge
python scripts/dare_ties_merge.py \
  --method omnimerge_v2 \
  --base /path/to/Qwen3.6-27B \
  --source /path/to/Qwen3.6-rico03 \
  --source /path/to/Qwen3.6-Esper3.1 \
  --source /path/to/Qwen3.6-Opus-Reasoning-anchor \
  --weights 0.40,0.35,0.25 \
  --density 0.53 \
  --darex-q 0.75 \
  --output ./Qwen3.6-27B-Omnimerge-v4 \
  --seed 42
# (auto-applies MLP-skip on Qwen3.6 base; no extra flag needed)
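To give a feel for what --density and --seed control, here is a sketch of the generic DARE drop-and-rescale step on a flat delta (fine-tuned weights minus base). It is not the omnimerge_v2 implementation: TIES sign election, OBIM-lite, DAREx q and EMR election are not shown, and the function name is illustrative.

```python
import random
from typing import List, Sequence


def dare_sparsify(delta: Sequence[float], density: float, seed: int) -> List[float]:
    """Keep each delta element with probability `density`, rescaled by 1/density.

    Dropped elements become 0.0, so the expected value of each element
    is unchanged by the sparsification.
    """
    if not 0.0 < density <= 1.0:
        raise ValueError("density must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed makes the mask reproducible
    return [d / density if rng.random() < density else 0.0 for d in delta]


delta = [0.01] * 10
sparse = dare_sparsify(delta, density=0.53, seed=42)
kept = sum(1 for x in sparse if x != 0.0)
print(f"kept {kept}/10 elements, each rescaled to {0.01 / 0.53:.5f}")
```

With the same seed the mask is identical across runs, which is what makes a fixed --seed useful for reproducing a merge.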
Caveats
- Qwen3.6 has a higher native think-rate than Qwen3.5 on coding prompts. Use raw /v1/completions for code benchmarks; chat-completions + --apply_chat_template + deepseek extraction will strip think blocks and return empty for prompts where the model thinks before answering. See pod_omnimerge_v4mlp_eval_raw.sh for the working config.
- MBPP scoring without think-stripping under-reports pass@1 by ~5 pp on this model (see "MBPP score correction" note above).
Acknowledgements