ManniX-ITA

Qwen3.6-27B-Omnimerge-v4

README

License: apache-2.0

Quantizations

Three release lines:

GGUF (`llama.cpp` / `ollama` / `text-generation-webui`)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-GGUF — 31 quants + F16, all imatrix-quantized with bartowski's calibration_datav5. imatrix.dat archived alongside the quants for reproducibility/audit.

Also published as ollama tags: mannix/omnimerge-v4.

The vision tower's mmproj projector lives in bartowski/Qwen_Qwen3.6-27B-GGUF and works unchanged with the v4 GGUFs (vision tower is preserved verbatim from the base).

MLX 4-bit — text-only (Apple Silicon)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit — text-only 4-bit MLX (group_size 64, 4.501 bits/weight), ~15 GB, loads via mlx_lm.load. Use this if you don't need vision and want a slightly smaller download.

```python
from mlx_lm import load, generate
model, tokenizer = load("ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-4bit")
print(generate(model, tokenizer, prompt="...", max_tokens=512, verbose=True))
```

MLX 4-bit — Vision-Language (Apple Silicon, multimodal)

ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit — full multimodal 4-bit MLX (group_size 64, 4.695 bits/weight — vision tower kept at higher precision), ~16 GB, loads via mlx_vlm.load. Use this for image + video input.

```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

repo = "ManniX-ITA/Qwen3.6-27B-Omnimerge-v4-MLX-VL-4bit"
model, processor = load(repo)
config = load_config(repo)

prompt = apply_chat_template(processor, config,
    "Describe the image in detail.", num_images=1)
print(generate(model, processor, prompt,
    max_tokens=512, verbose=True, image=["path/to/image.png"]))
```

Sources

| Source | Weight | Role |
| --- | --- | --- |
| Qwen/Qwen3.6-27B | base | base + chat template |
| rico03/Qwen3.6-27B-rico03 | 0.40 | general capability |
| ValiantLabs/Qwen3.6-27B-Esper3.1 | 0.35 | code + reasoning |
| kai-os/Qwen3.6-Opus-Reasoning (LoRA→base anchor) | 0.25 | |

Method: omnimerge_v2 (DARE-TIES base + OBIM-lite + DAREx q + EMR election). Density 0.53, DAREx q 0.75, seed 42.
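To make the density parameter concrete, here is a minimal sketch of the generic DARE drop-and-rescale step only. It is not the omnimerge_v2 implementation: OBIM-lite, DAREx q, and EMR election are not shown, and the function name `dare_sparsify` and the toy weights are assumptions made for illustration.

```python
import random


def dare_sparsify(delta, density, seed):
    """Keep each delta entry with probability `density`; rescale survivors by 1/density."""
    rng = random.Random(seed)  # seeded so the drop mask is reproducible
    out = []
    for d in delta:
        keep = rng.random() < density
        out.append(d / density if keep else 0.0)
    return out


# Toy example: one fine-tune's task vector relative to the base (values are made up).
base = [0.10, -0.20, 0.30, 0.05]
finetuned = [0.12, -0.25, 0.30, 0.09]
delta = [f - b for f, b in zip(finetuned, base)]

sparse = dare_sparsify(delta, density=0.53, seed=42)
# Apply the sparsified delta with this source's merge weight (0.40 here).
merged = [b + 0.40 * s for b, s in zip(base, sparse)]
print(sparse)
print(merged)
```

Rescaling by 1/density keeps the expected value of each delta entry unchanged, which is why a density of 0.53 can drop roughly half the entries without systematically shrinking the task vector.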

Benchmark Results (Q6_K quantization)

All numbers from lm_eval with --model local-completions (raw /v1/completions) on a llama.cpp server with --reasoning-format deepseek --reasoning-budget 8192. Sampler greedy (do_sample=False, T=0.0, top_p=1.0, top_k=0) across all benches — this is the canonical recipe for cross-cohort comparison. Earlier revisions used T=0.6 for GPQA to match v2's published recipe; the canonical 2026-05-22 re-run on pod 37268930 uses greedy throughout and supersedes those numbers.

v4-MLP vs Qwen3.6 base + Omnimerge-v2 (head-to-head, same eval methodology)

All three columns scored under identical conditions: same llama.cpp server config (--reasoning-format deepseek --reasoning-budget 8192 --parallel 2 --cache-type-k q8_0 --cache-type-v q8_0 -c 65536), same lm_eval invocation (local-completions raw /v1/completions, no chat template), same gen kwargs. v4-MLP columns reflect the canonical 2026-05-22 full-bench greedy re-run on pod 37268930.

| Benchmark | Qwen3.6 base Q6_K (bartowski) | Omnimerge-v2 (Qwen3.5 base) | Omnimerge-v4-MLP (Qwen3.6 base) | Δ vs base | Δ vs v2 |
| --- | --- | --- | --- | --- | --- |
| HumanEval pass@1 (164q) | 84.76% (139/164) | 79.27% | 83.54% (137/164) | −1.22 pp | +4.27 pp |
| MBPP pass@1 (500q), raw lm_eval | 56.20% | n/a | 68.40% | +12.20 pp | n/a |

Key observations:

HumanEval is essentially level with base: 137/164 vs 139/164, a two-problem gap (−1.22 pp) in the canonical greedy re-run. With MLP-passthrough preserving base MLPs and HumanEval being mostly elementary Python function completion, the merged attn + linear_attn deltas barely move the needle. This doubles as a sanity check that the MLP-passthrough surgery did its job: the model's "elementary coding" behavior stays close to the base it inherited MLPs from.
MBPP is where the merge value shows — +15.8 pp over Qwen3.6 base on the corrected score, and essentially tied with v2 (Qwen3.5-base merge). MBPP exercises a wider range of algorithms and control flow than HumanEval, where the merged reasoning + attention deltas help.
GPQA is the strongest reasoning lift — +9.09 pp over v2 on the full-bench greedy comparison. Note this is smaller than the previous partial-cache estimate (≈ +15.5 pp) because v2 was sampled at T=0.6 with budget 16384 (an easier configuration for verbose reasoning) while v4 is now measured under greedy at budget 8192. The marquee win is real, but the magnitude is the +9.09 pp greedy figure, not the +15.5 pp partial-sampled figure.
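The percentage-point figures quoted for HumanEval and MBPP can be recomputed from the counts and scores reported in this README. The helper names below are hypothetical; every input number is copied from the text.

```python
def pass_rate(passed, total):
    """Pass@1 as a percentage."""
    return 100.0 * passed / total


def pp_delta(a, b):
    """Difference in percentage points, rounded to two decimals."""
    return round(a - b, 2)


base_he = pass_rate(139, 164)  # Qwen3.6 base, HumanEval
v4_he = pass_rate(137, 164)    # Omnimerge-v4-MLP, HumanEval

print(f"HumanEval base {base_he:.2f}%  v4-MLP {v4_he:.2f}%")
print("HumanEval delta vs base:", pp_delta(v4_he, base_he), "pp")
print("HumanEval delta vs v2:  ", pp_delta(v4_he, 79.27), "pp")
print("MBPP raw delta vs base: ", pp_delta(68.40, 56.20), "pp")
print("MBPP corrected vs base: ", pp_delta(73.40, 57.60), "pp")
print("MBPP recovered by think-stripping:", pass_rate(25, 500), "pp")
```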

§ GPQA Diamond full greedy re-measurement (2026-05-22, pod 37268930). Sampler do_sample=False, T=0.0, --reasoning-budget 8192, max_gen_toks=8192. Wall time 4 h 55 min on 3090 Q6_K. Companion strict-match (rigid Answer: X template) is 7.58 % — the model emits CoT verbosely rather than the strict template, so the flexible-extract 78.28 % is the real quality signal. The earlier partial 84.75 % (177 of 198, sampled T=0.6, budget=16384) was a methodology artifact, not a model regression — re-measuring v2 under greedy at budget=8192 would also drop several points. The new 78.28 % is the canonical figure going forward.
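The gap between strict-match and flexible-extract scoring can be illustrated with two toy extractors. Both regexes are illustrative assumptions, not lm_eval's actual filter definitions: the strict one accepts only a bare `Answer: X` line, while the flexible one takes the last standalone option letter found anywhere in the response.

```python
import re

# Hypothetical patterns for illustration only.
STRICT = re.compile(r"^Answer: \(?([A-D])\)?\s*$")
FLEXIBLE = re.compile(r"\(?\b([A-D])\b\)?")


def strict_match(response):
    """Return the letter only if the whole response is the rigid template."""
    m = STRICT.match(response.strip())
    return m.group(1) if m else None


def flexible_extract(response):
    """Return the last standalone option letter in the response, if any."""
    hits = FLEXIBLE.findall(response)
    return hits[-1] if hits else None


verbose = "Let me compare the options. (A) violates conservation, (B) is plausible, so I pick (B)."
print(strict_match(verbose), flexible_extract(verbose))
print(strict_match("Answer: C"), flexible_extract("Answer: C"))
```

A verbose chain-of-thought response carries a recoverable answer but fails the rigid template, which is the pattern the 7.58 % vs 78.28 % split describes.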

* MBPP score correction (important): lm_eval's mbpp scorer evaluates exec(prompt + completion + tests). When a model emits <think>...</think>\n\ndef foo(): ..., the literal < character causes a Python SyntaxError even though the function code below is valid and would pass the tests. We re-scored by stripping <think>...</think> blocks (and unclosed <think>...EOF truncations) before exec.

v4-MLP: 68.40% → 73.40% (+5.0 pp, recovered 25/500 valid-code-but-SyntaxError generations).
Qwen3.6 base: 56.20% → 57.60% (+1.4 pp, recovered 7/500). Base closes its think tags more reliably than v4-MLP (0% unclosed vs 4.8%) and emits them less often, which is why the correction is smaller.
v2 (Qwen3.5 base) had a much lower native think-rate so the correction is negligible at that scale; the published 74.60% was the lm_eval raw score.

Re-scoring script: scripts/rescore_mbpp_strip_think.py. The corrected scores are the apples-to-apples comparison; raw lm_eval scores are kept in the table for transparency.
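The correction described above can be sketched as follows. This is a minimal illustration of the idea, not the contents of scripts/rescore_mbpp_strip_think.py; the function names `strip_think` and `passes` are assumptions, and the sample completion is made up.

```python
import re

CLOSED_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
UNCLOSED_THINK = re.compile(r"<think>.*\Z", re.DOTALL)  # truncated before </think>


def strip_think(completion):
    """Remove closed think blocks, then any unclosed block running to end of text."""
    text = CLOSED_THINK.sub("", completion)
    text = UNCLOSED_THINK.sub("", text)
    return text.lstrip("\n")


def passes(prompt, completion, tests, strip=True):
    """Mimic exec(prompt + completion + tests); any exception counts as a failure."""
    body = strip_think(completion) if strip else completion
    program = prompt + body + "\n" + tests
    try:
        exec(program, {})  # untrusted model output: sandbox this in real use
        return True
    except Exception:
        return False


raw = "<think>Just add the two numbers.</think>\n\ndef add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\n"

print("raw score:      ", passes("", raw, tests, strip=False))
print("corrected score:", passes("", raw, tests, strip=True))
```

An unclosed block strips down to an empty completion, so truncated generations still fail the tests; only valid code hidden behind a closed think block is recovered.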

‡ GPQA Diamond eval history (resolved 2026-05-22). The original 2026-05-13 run hit an aiohttp lifecycle bug in lm_eval.models.api_models.amodel_call that crashed on the at-budget reasoning tail (16384-token responses outlasting the ClientSession); we produced a partial 84.75 % (150/177 matched cached responses sampled at T=0.6, budget=16384) and kept restarting until 192/198 cached. The 2026-05-22 canonical re-run on pod 37268930 ran the full 198 under greedy decoding with budget=8192 and max_length=32768, having patched lm_eval's api_models.py:545 UnboundLocalError upstream (it crashed on transient TimeoutError before outputs was assigned) — see the quantize_gguf.py chain script + the omnimergekit pod_v4_q6k_eval_chain.sh for the bit-exact recipe. The canonical headline going forward is 78.28 % flexible-extract / 7.58 % strict-match on 198/198. The earlier 84.75 % partial-sampled figure is superseded but kept here for transparency about the prior methodology drift.

Why "MLP-passthrough"

When we merged Qwen3.6 the same way we'd successfully merged Qwen3.5 (Omnimerge-v2), the resulting model opened a <think> block on 80% of coding prompts in the mbpp-10 isolation set and left 88% of those blocks unclosed, and pass@1 collapsed to ~20%. Forensic per-tensor delta inspection (see scripts/inspect_v4_delta.py) localized the failure mode to the mlp.gate_proj / mlp.up_proj / mlp.down_proj tensors in mid-to-late layers (peak deltas in layers 27-52, max rel-L2 ≈ 2.1%). lm_head and embed_tokens were byte-identical to base, so the policy attractor lived in the MLPs, not in the token-emission logits.
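A per-tensor relative-L2 delta of the kind quoted above can be computed as in the sketch below. It assumes rel-L2 means ||merged − base|| / ||base||, uses plain Python lists in place of real tensors, and uses hypothetical tensor names; it is not the code in scripts/inspect_v4_delta.py.

```python
import math


def rel_l2(base, merged):
    """Return ||merged - base||_2 / ||base||_2 for two flat weight lists."""
    diff = math.sqrt(sum((m - b) ** 2 for b, m in zip(base, merged)))
    norm = math.sqrt(sum(b * b for b in base))
    if norm == 0.0:
        return 0.0 if diff == 0.0 else math.inf
    return diff / norm


def rank_by_delta(base_tensors, merged_tensors):
    """Sort tensor names by relative delta, largest first."""
    scores = {name: rel_l2(base_tensors[name], merged_tensors[name])
              for name in base_tensors}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy state dicts (names and values are illustrative).
base_tensors = {
    "model.layers.30.mlp.gate_proj.weight": [3.0, 4.0],
    "lm_head.weight": [1.0, 2.0],
}
merged_tensors = {
    "model.layers.30.mlp.gate_proj.weight": [3.0, 4.1],
    "lm_head.weight": [1.0, 2.0],
}

for name, score in rank_by_delta(base_tensors, merged_tensors):
    print(f"{name}: {score:.2%}")
```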

We rebuilt v4 with mlp.{gate,up,down}_proj copied verbatim from clean Qwen3.6 base (scripts/v4_mlp_passthrough.py) and everything else (attn, linear_attn, norms, embed/head) kept from the merge. The leak went to 0% on a 10-prompt isolation test, MBPP pass@1 jumped to 50% on the same isolation set, and full-eval scores (above) confirmed the surgery rescued the merge.
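The passthrough step amounts to a name-based selection between two state dicts. The sketch below uses plain dicts and hypothetical tensor names, and matches on the same three substrings as the skip patterns; it is not the code in scripts/v4_mlp_passthrough.py, which works on safetensors shards.

```python
MLP_PATTERNS = ("mlp.gate_proj", "mlp.up_proj", "mlp.down_proj")


def is_mlp_tensor(name):
    """True if the tensor name belongs to one of the three MLP projections."""
    return any(pattern in name for pattern in MLP_PATTERNS)


def mlp_passthrough(base_tensors, merged_tensors):
    """Take MLP projections from base and everything else from the merge."""
    out = {}
    for name, tensor in merged_tensors.items():
        out[name] = base_tensors[name] if is_mlp_tensor(name) else tensor
    return out


# Toy state dicts; string values stand in for tensors.
base_tensors = {
    "model.layers.27.mlp.gate_proj.weight": "base-mlp",
    "model.layers.27.self_attn.q_proj.weight": "base-attn",
}
merged_tensors = {
    "model.layers.27.mlp.gate_proj.weight": "merged-mlp",
    "model.layers.27.self_attn.q_proj.weight": "merged-attn",
}

print(mlp_passthrough(base_tensors, merged_tensors))
```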

Key finding: Qwen3.6's think-policy is fragile to small MLP perturbations

| Test | Clean Qwen3.6 base | v4 (full merge, broken) | v4-MLP (this model) |
| --- | --- | --- | --- |
| `<think>` open rate (mbpp-10 isolation) | 40% | 80% | 0% |
| Unclosed `</think>` | 0/4 | 88% of opens | 0/10 |
| MBPP pass@1 (mbpp-10 isolation) | 40% | 20% | 50% |
| Empty response (chat-completions) | | | |

Identical hyperparameters on Qwen3.5 base (Omnimerge-v2) produced 0.2% leak — so this is a Qwen3.6-specific fragility, not a general merge problem. Plausible cause: Qwen3.6 was post-trained later with reasoning-specific data that tightened the policy decision boundary; small (1-2% rel L2) MLP perturbations push it across.

The cost of MLP-passthrough is that we lose the merged MLP uplift on coding tasks. Even so, the full MBPP/HumanEval results show that the attn + linear_attn deltas alone are enough to lift HumanEval by 4.27 pp over Qwen3.5-Omnimerge-v2 while staying roughly tied on MBPP (73.40% corrected vs v2's published 74.60%).

Compatibility

Architecture: qwen3_5 (unified Qwen3.5 / Qwen3.6 family). Vision tower preserved (mmproj available via the Q6_K GGUF release — multimodal works exactly like clean Qwen3.6).

Inference works under:

transformers (BF16) — both use_cache=True and False paths
llama.cpp (GGUF) — recommended args: --reasoning-format deepseek --reasoning-budget 8192
vLLM (untested at time of publish, expected to work)

Scripts

All merge tooling is in the scripts/ directory of this repo:

| Script | Purpose |
| --- | --- |
| `dare_ties_merge.py` | Main merger. `--method omnimerge_v2` is the published method. Auto-detects Qwen3.6 base via `config.output_gate_type` and auto-applies `--skip-patterns 'mlp.gate_proj,mlp.up_proj,mlp.down_proj'` (override with `--no-auto-mlp-skip`). |
| `v4_mlp_passthrough.py` | Post-process tool: rebuild merged dir with MLP layers copied from base. Refuses to run on Qwen3.5 base (where MLP merging is safe; see v2). Use as final pre-quant step for any external merger output (mergekit, eX-LRP) targeting Qwen3.6. |
| `inspect_v4_delta.py` | Per-tensor delta-magnitude forensics vs base. Streams safetensors shards, no full model load. Used to localize the policy-leak weight region. |

Reproducing the merge

```bash
python scripts/dare_ties_merge.py \
    --method omnimerge_v2 \
    --base /path/to/Qwen3.6-27B \
    --source /path/to/Qwen3.6-rico03 \
    --source /path/to/Qwen3.6-Esper3.1 \
    --source /path/to/Qwen3.6-Opus-Reasoning-anchor \
    --weights 0.40,0.35,0.25 \
    --density 0.53 \
    --darex-q 0.75 \
    --output ./Qwen3.6-27B-Omnimerge-v4 \
    --seed 42
# (auto-applies MLP-skip on Qwen3.6 base; no extra flag needed)
```

Caveats

Qwen3.6 has a higher native think-rate than Qwen3.5 on coding prompts. Use raw /v1/completions for code benchmarks; chat-completions + --apply_chat_template + deepseek extraction will strip think blocks and return empty for prompts where the model thinks before answering. See pod_omnimerge_v4mlp_eval_raw.sh for the working config.
MBPP scoring without think-stripping under-reports pass@1 by ~5 pp on this model (see "MBPP score correction" note above).

Acknowledgements

Qwen team for the Qwen3.6 base
rico03, ValiantLabs, kai-os for the fine-tunes
DARE / TIES / DARE-TIES authors and the arcee-ai/mergekit community

Model Details

Model Provider

ManniX-ITA

Model Tree

Base

Qwen/Qwen3.6-27B

Merged

this model

Input Modalities

Text, Image, Video

Output Modalities

Text
