zmzfpc

crane-30b

License: apache-2.0

How it was made (CRANE)

CRANE is a training-free, parameter-editing weight merge that injects reasoning ability from a "Thinking" donor into a tool-disciplined Instruct / code base, while constraining the edit so the base model's output format and tool-calling behavior are preserved. It treats the Thinking − Instruct delta \(\delta = \theta_{\text{think}} - \theta_{\text{inst}}\) as a pool of candidate reasoning edits, and applies three composable stages per layer \(l\) and parameter component \(c\):

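\[
\theta_{\text{merged}}^{(l,c)} = \theta_{\text{inst}}^{(l,c)} + \underbrace{\Pi_{\tau,q(l,c)}^{\text{GSP}}}_{\text{Stage 3}}\Big(\alpha\,\underbrace{S_{\text{CTG}}(c,l)}_{\text{Stage 2}}\;\underbrace{T\big(\delta^{(l,c)}\big)}_{\text{Stage 1}}\Big)
\]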

Figure: CRANE / GSP merge pipeline.

Three small calibration sets drive the stages — \(\mathcal{D}_R\) (reasoning transfer), \(\mathcal{D}_A\) (agent-behavior / tool-use preservation), and \(\mathcal{D}_F\) (format preservation):

  • Stage 1: Magnitude thresholding \(T(\delta)\). A deterministic median-magnitude threshold keeps only the larger (top-half) delta coordinates and rescales them by 2, discarding low-confidence noise.
  • Stage 2: Conservative Taylor Gate \(S_{\text{CTG}}\). From a signed, direction-aware score \(s_K(j) = -g_{K,j}\,\delta_j\) per calibration loss, CTG keeps the positive part of the per-coordinate minimum over the reasoning and agent-behavior objectives, \(p_j = [\min\{s_R(j), s_A(j)\}]_+\), rewarding a coordinate only when the edit helps both. These aggregate into the per-component, per-layer coefficient \(S_{\text{CTG}}(c,l)\), scaled by the single global merge strength \(\alpha\).
  • Stage 3: Graduated Sigmoidal Projection (GSP). From the SVD of format-critical Instruct activations \(H_q = U_q\Sigma_q V_q^{\top}\), a smooth sigmoidal weight \(\mathbf{w}_q\) (set by singular amplitude and threshold \(\tau\)) gives the projector \(\Pi_{\tau,q}^{\text{GSP}}(\Delta_q) = \Delta_q - \Delta_q V_q \operatorname{diag}(\mathbf{w}_q) V_q^{\top}\), attenuating high-amplitude format directions so reasoning is injected without perturbing chat-template tokens, tool-call delimiters, or JSON/schema structure.
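
Stages 1 and 2 above can be illustrated on a flat list of coordinates. This is a minimal sketch rather than the CRANE implementation: the function names are hypothetical, ties at the median are dropped by assumption, and the aggregation of \(p_j\) into \(S_{\text{CTG}}(c,l)\) is left out because its exact form is not given here.

```python
from statistics import median


def magnitude_threshold(delta):
    """Stage 1 sketch: keep coordinates above the median magnitude, rescaled by 2.

    Assumption: coordinates exactly at the median are dropped.
    """
    cutoff = median([abs(d) for d in delta])
    return [2.0 * d if abs(d) > cutoff else 0.0 for d in delta]


def ctg_positive_min(delta, grad_reasoning, grad_agent):
    """Stage 2 sketch: p_j = [min(s_R(j), s_A(j))]_+ with s_K(j) = -g_{K,j} * delta_j.

    grad_reasoning and grad_agent stand in for the gradients of the
    reasoning and agent-behavior calibration losses.
    """
    scores = []
    for d, g_r, g_a in zip(delta, grad_reasoning, grad_agent):
        s_r = -g_r * d  # first-order estimate of the reasoning-loss decrease
        s_a = -g_a * d  # first-order estimate of the agent-loss decrease
        scores.append(max(min(s_r, s_a), 0.0))
    return scores


if __name__ == "__main__":
    print(magnitude_threshold([0.1, -0.4, 0.05, 0.3]))
    print(ctg_positive_min([0.5, -0.2, 0.4], [-1.0, 2.0, -1.0], [-0.5, 1.0, 1.0]))
```

In the second call the third coordinate scores zero: the edit lowers the reasoning loss but raises the agent-behavior loss, so the minimum is negative and the positive part discards it.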

The result is a merge that gains planning / reflection / recovery reasoning while keeping the base agent's compact, tool-call-disciplined behavior — the entire merge is a closed-form edit of the Instruct weights, with no fine-tuning.
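
How the Stage 3 projector leaves most of an edit intact while suppressing format directions can be seen in a small NumPy sketch. The exact sigmoid used by CRANE is not given here, so the weight below (a logistic function of relative singular amplitude, with a hypothetical `sharpness` parameter) is an assumption, as are the function name and the toy activations.

```python
import numpy as np


def gsp_project(delta, activations, tau=0.03, sharpness=200.0):
    """Sketch of Delta - Delta V diag(w) V^T for one component.

    delta:       (d_out, d_in) candidate weight edit
    activations: (n_samples, d_in) format-critical activations H_q
    Assumption: w_i = sigmoid(sharpness * (sigma_i / sigma_max - tau)).
    """
    _, sing, vt = np.linalg.svd(activations, full_matrices=False)
    rel = sing / sing.max()
    w = 1.0 / (1.0 + np.exp(-sharpness * (rel - tau)))
    coords = delta @ vt.T          # edit expressed in the right-singular basis
    return delta - (coords * w) @ vt  # subtract the weighted format component


if __name__ == "__main__":
    # Toy activations: one dominant direction (axis 0) and one negligible one (axis 1).
    H = np.array([[10.0, 0.0, 0.0],
                  [0.0, 0.001, 0.0]])
    edit = np.array([[1.0, 2.0, 3.0]])
    print(gsp_project(edit, H))
```

In this toy case the component of the edit along the dominant activation direction is removed almost entirely, while the other two components pass through nearly unchanged.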

This checkpoint's recipe

This checkpoint merges Qwen/Qwen3-30B-A3B-Instruct-2507 (base) and Qwen/Qwen3-30B-A3B-Thinking-2507 (donor) with:

  • Global injection strength — \(\alpha = 0.25\), multiplied by the per-component CTG coefficients, so the Thinking delta is added at roughly a quarter strength.
  • Per-layer / per-component gating — attention, expert (FFN), norm, and router components each get their own \(S_{\text{CTG}}(c,l)\) coefficient, varying by layer index rather than a single flat scalar.
  • GSP projector — a freshly rebuilt Qwen3-30B graduated-sigmoidal projector (sigmoid threshold \(\tau = 0.03\)) protects the format / tool-call subspace before injection.
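
As a worked illustration of how the global strength combines with the per-component gate, the sketch below multiplies \(\alpha = 0.25\) by a few coefficients. The coefficient values are hypothetical placeholders; the checkpoint's real \(S_{\text{CTG}}(c,l)\) values are not listed here.

```python
ALPHA = 0.25  # global injection strength stated for this checkpoint

# Hypothetical S_CTG(c, l) values, for illustration only.
hypothetical_ctg = {
    ("attention", 0): 1.0,
    ("expert", 0): 0.8,
    ("norm", 0): 0.0,
    ("router", 0): 0.4,
}


def effective_strength(component, layer):
    """Scale applied to the thresholded delta for one (component, layer) pair."""
    return ALPHA * hypothetical_ctg[(component, layer)]


if __name__ == "__main__":
    for key in hypothetical_ctg:
        print(key, effective_strength(*key))
```

A coefficient of 1.0 reproduces the quarter-strength figure, smaller coefficients inject less, and a coefficient of 0.0 leaves that component at its Instruct weights.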

Architecture

The merge preserves the standard Qwen3-30B-A3B (MoE) topology unchanged:

| Property | Value |
| --- | --- |
| model_type | qwen3_moe |
| Architecture class | Qwen3MoeForCausalLM |
| Total params | ~30B |
| Active params | ~3B |
| hidden_size | 2048 |
| num_hidden_layers | 48 |
| num_experts | 128 |
| num_experts_per_tok | 8 |
| num_attention_heads | 32 |
| num_key_value_heads | 4 |
| head_dim | 128 |
| moe_intermediate_size | 768 |
| max_position_embeddings | 262144 |
| vocab_size | 151936 |
| dtype | bfloat16 |
| rope_theta | 10000000 |
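
A few derived quantities help relate the table to the "~30B total / ~3B active" figures. The sketch below uses only the values listed above; the per-expert count assumes each expert is a gated MLP with three projection matrices (gate, up, down) and no biases, which is an assumption about the layer layout rather than something stated in this card.

```python
# Values copied from the architecture table.
hidden_size = 2048
num_hidden_layers = 48
num_experts = 128
num_experts_per_tok = 8
num_attention_heads = 32
num_key_value_heads = 4
head_dim = 128
moe_intermediate_size = 768

# Grouped-query attention: query heads sharing each key/value head.
queries_per_kv_head = num_attention_heads // num_key_value_heads

# Projection widths; the query width exceeds hidden_size because head_dim is set explicitly.
q_width = num_attention_heads * head_dim
kv_width = num_key_value_heads * head_dim

# Assumed gated MLP: gate, up, and down projections per expert.
params_per_expert = 3 * hidden_size * moe_intermediate_size
total_expert_params = num_hidden_layers * num_experts * params_per_expert
active_expert_params = num_hidden_layers * num_experts_per_tok * params_per_expert

if __name__ == "__main__":
    print("query heads per KV head:", queries_per_kv_head)
    print("q / kv projection width:", q_width, kv_width)
    print("fraction of experts active:", num_experts_per_tok / num_experts)
    print("expert params, total:", total_expert_params)
    print("expert params, active per token:", active_expert_params)
```

Under that assumption the experts account for roughly 29B parameters in total and about 1.8B per token, with attention, embeddings, routers, and norms making up the remainder of the quoted totals.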

A config_1m.json is also included for the extended long-context variant. It keeps the same rope_scaling (null) and max_position_embeddings (262144) as the default config, and adds a dual_chunk_attention_config block that enables dual chunk attention (with original_max_position_embeddings = 131072) together with sparse-attention settings for longer-context inference.
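
One way to check what the long-context config changes is to diff the two JSON files key by key. The sketch below is illustrative: `config_diff` and `load_config` are hypothetical helpers, and the stand-in dictionaries contain only the fields mentioned above, not the full configs.

```python
import json
from pathlib import Path


def load_config(path):
    """Read a JSON config file into a dict."""
    return json.loads(Path(path).read_text())


def config_diff(base, variant):
    """Return {key: (base_value, variant_value)} for keys whose values differ."""
    keys = sorted(set(base) | set(variant))
    return {k: (base.get(k), variant.get(k)) for k in keys if base.get(k) != variant.get(k)}


# Stand-ins holding only the fields described in the text.
base_config = {"max_position_embeddings": 262144, "rope_scaling": None}
long_context_config = {
    "max_position_embeddings": 262144,
    "rope_scaling": None,
    "dual_chunk_attention_config": {
        "original_max_position_embeddings": 131072,
        # sparse-attention settings omitted: their names and values are not given here
    },
}

if __name__ == "__main__":
    for key, (old, new) in config_diff(base_config, long_context_config).items():
        print(key, ":", old, "->", new)
```

With the real files, `config_diff(load_config("config.json"), load_config("config_1m.json"))` would list every differing field, assuming both files sit in the current directory.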

Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zmzfpc/crane-30b"

# Load the tokenizer and the bfloat16 weights, placing them across available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a Python function that returns the nth Fibonacci number."},
]

# Render the chat template and tokenize; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```

Requires a recent transformers with Qwen3-MoE support (transformers >= 4.51).

Citation / attribution

If you use this model or the CRANE method, please cite:

```bibtex
@misc{zhu2026crane,
  title         = {CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing},
  author        = {Zhu, Mingzhi and Merler, Michele and Pavuluri, Raju and Patterson, Stacy},
  year          = {2026},
  eprint        = {2605.14084},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2605.14084}
}
```

Project page: https://rpi-nsl.github.io/CRANE/ · Code: github.com/rpi-nsl/CRANE

Base models: built from two Apache-2.0 checkpoints, Qwen/Qwen3-30B-A3B-Instruct-2507 (base) and Qwen/Qwen3-30B-A3B-Thinking-2507 (donor).

License: Apache-2.0 (consistent with both base models and the CRANE code).
