zmzfpc
crane-30b
License: apache-2.0
How it was made (CRANE)
CRANE is a training-free, parameter-editing weight merge that injects reasoning ability from a "Thinking" donor into a tool-disciplined Instruct / code base, while constraining the edit so the base model's output format and tool-calling behavior are preserved. It treats the Thinking − Instruct delta \(\delta = \theta_{\text{think}} - \theta_{\text{inst}}\) as a pool of candidate reasoning edits, and applies three composable stages per layer \(l\) and parameter component \(c\):
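\[\theta_{\text{merged}}^{(l,c)} = \theta_{\text{inst}}^{(l,c)} + \underbrace{\Pi_{\tau,q(l,c)}^{\text{GSP}}}_{\text{Stage 3}}\Big(\alpha \cdot \underbrace{S_{\text{CTG}}(c,l)}_{\text{Stage 2}} \cdot \underbrace{T\big(\delta^{(l,c)}\big)}_{\text{Stage 1}}\Big)\]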
Three small calibration sets drive the stages — \(\mathcal{D}_R\) (reasoning transfer), \(\mathcal{D}_A\) (agent-behavior / tool-use preservation), and \(\mathcal{D}_F\) (format preservation):
- Stage 1: Magnitude thresholding \(T(\delta)\). A deterministic median-magnitude threshold keeps only the larger (top-half) delta coordinates and rescales them by 2, discarding low-confidence noise.
- Stage 2: Conservative Taylor Gate \(S_{\text{CTG}}\). From a signed, direction-aware score \(s_K(j) = -g_{K,j}\,\delta_j\) per calibration loss, CTG keeps the positive part of the per-coordinate minimum over the reasoning and agent-behavior objectives, \(p_j = [\min\{s_R(j), s_A(j)\}]_+\), rewarding a coordinate only when the edit helps both. These aggregate into the per-component, per-layer coefficient \(S_{\text{CTG}}(c,l)\), scaled by the single global merge strength \(\alpha\).
- Stage 3: Graduated Sigmoidal Projection (GSP). From the SVD of format-critical Instruct activations \(H_q = U_q\Sigma_q V_q^{\top}\), a smooth sigmoidal weight \(\mathbf{w}_q\) (set by singular amplitude and threshold \(\tau\)) gives the projector \(\Pi_{\tau,q}^{\text{GSP}}(\Delta_q) = \Delta_q - \Delta_q V_q \operatorname{diag}(\mathbf{w}_q) V_q^{\top}\), attenuating high-amplitude format directions so reasoning is injected without perturbing chat-template tokens, tool-call delimiters, or JSON/schema structure.
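Stages 1 and 2 can be sketched per coordinate in plain Python. This is a minimal illustration under stated assumptions, not the CRANE code: the function names are hypothetical, ties at the median are dropped (the text does not say how they are handled), and the aggregation of \(p_j\) into \(S_{\text{CTG}}(c,l)\) is left out because its exact form is not given here.

```python
from statistics import median


def magnitude_threshold(delta):
    """Stage 1 sketch: keep coordinates above the median magnitude, rescaled by 2."""
    cutoff = median(abs(d) for d in delta)
    # Assumption: strict '>' so coordinates tied with the median are zeroed.
    return [2.0 * d if abs(d) > cutoff else 0.0 for d in delta]


def ctg_scores(grad_reasoning, grad_agent, delta):
    """Stage 2 sketch: p_j = [min(s_R(j), s_A(j))]_+ with s_K(j) = -g_{K,j} * delta_j."""
    scores = []
    for g_r, g_a, d in zip(grad_reasoning, grad_agent, delta):
        s_r = -g_r * d  # first-order estimate of the reasoning-loss decrease
        s_a = -g_a * d  # first-order estimate of the agent-behavior-loss decrease
        # A coordinate scores above zero only if it helps both objectives.
        scores.append(max(0.0, min(s_r, s_a)))
    return scores


# Toy values, chosen only for illustration.
delta = [0.1, -0.4, 0.05, 0.3]
grad_reasoning = [-1.0, 2.0, 1.0, -0.5]
grad_agent = [-2.0, -1.0, 1.0, -1.0]

print(magnitude_threshold(delta))
print(ctg_scores(grad_reasoning, grad_agent, delta))
```

In the toy example the second coordinate helps the reasoning loss but hurts the agent-behavior loss, so the gate scores it at zero even though its magnitude survives Stage 1.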
The result is a merge that gains planning / reflection / recovery reasoning while keeping the base agent's compact, tool-call-disciplined behavior — the entire merge is a closed-form edit of the Instruct weights, with no fine-tuning.
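The closed-form nature of the edit can be illustrated with a small NumPy sketch of the Stage 3 projector and the final composition. Several details are assumptions made for illustration only: the function names, the specific sigmoid (singular values normalized by the largest one and centered at \(\tau\)), and the `sharpness` parameter are not taken from the source, which only says the weight is set by singular amplitude and \(\tau\).

```python
import numpy as np


def gsp_project(update, activations, tau=0.03, sharpness=200.0):
    """Sketch of Pi(Delta) = Delta - Delta V diag(w) V^T.

    update:      (d_out, d_in) candidate weight edit Delta_q
    activations: (n_samples, d_in) format-critical activations H_q
    """
    # H_q = U Sigma V^T; the rows of vt are the right-singular vectors.
    _, sigma, vt = np.linalg.svd(activations, full_matrices=False)
    v = vt.T
    # Hypothetical weight: close to 1 for high-amplitude directions, close to 0 otherwise.
    relative = sigma / max(sigma.max(), 1e-12)
    weights = 1.0 / (1.0 + np.exp(-sharpness * (relative - tau)))
    # Subtract the weighted component of the update along each direction.
    return update - ((update @ v) * weights) @ v.T


def merge_component(theta_inst, thresholded_delta, s_ctg, activations, alpha=0.25):
    """Compose the stages: theta_inst + Pi(alpha * S_CTG * T(delta))."""
    return theta_inst + gsp_project(alpha * s_ctg * thresholded_delta, activations)


# Toy activations dominated by the first input feature.
H = np.array([[10.0, 0.0, 0.0], [8.0, 0.0, 0.0], [0.0, 0.001, 0.0]])
edit = np.ones((2, 3))
print(gsp_project(edit, H))
```

With these toy activations the edit's first column, which lies along the dominant activation direction, is almost fully removed, while the other columns pass through nearly unchanged.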
This checkpoint's recipe
This checkpoint merges Qwen/Qwen3-30B-A3B-Instruct-2507 (base) and Qwen/Qwen3-30B-A3B-Thinking-2507 (donor) with:
- Global injection strength — \(\alpha = 0.25\), multiplied by the per-component CTG coefficients, so the Thinking delta is added at roughly a quarter strength.
- Per-layer / per-component gating — attention, expert (FFN), norm, and router components each get their own \(S_{\text{CTG}}(c,l)\) coefficient, varying by layer index rather than a single flat scalar.
- GSP projector — a freshly rebuilt Qwen3-30B graduated-sigmoidal projector (sigmoid threshold \(\tau = 0.03\)) protects the format / tool-call subspace before injection.
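To make the per-layer, per-component gating concrete, here is a rough sketch of how parameter names might be mapped to a (layer, component) key and an injection scale. The name patterns assume the usual Hugging Face Qwen3-MoE layout (for example `model.layers.3.mlp.gate.weight` for a router), and the helper names, the choice to file attention q/k norms under "norm", and the zero scale for parameters outside the decoder layers are all illustrative assumptions rather than details from the CRANE recipe.

```python
import re

_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def classify_parameter(name):
    """Return (layer_index, component), or None for parameters outside the decoder layers."""
    match = _LAYER_RE.search(name)
    if match is None:
        return None  # e.g. embeddings, final norm, lm_head
    layer = int(match.group(1))
    if "norm" in name:
        component = "norm"
    elif ".mlp.experts." in name:
        component = "expert"
    elif ".mlp.gate." in name:
        component = "router"
    elif ".self_attn." in name:
        component = "attention"
    else:
        component = "other"
    return layer, component


def injection_scale(name, s_ctg, alpha=0.25):
    """alpha * S_CTG(c, l) for this parameter; 0.0 when no coefficient applies (assumption)."""
    key = classify_parameter(name)
    if key is None:
        return 0.0
    return alpha * s_ctg.get(key, 0.0)


# Hypothetical coefficients, for illustration only.
coefficients = {(3, "router"): 0.8, (3, "expert"): 0.4}
print(classify_parameter("model.layers.3.mlp.experts.17.gate_proj.weight"))
print(injection_scale("model.layers.3.mlp.gate.weight", coefficients))
```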
Architecture
The merge preserves the standard Qwen3-30B-A3B (MoE) topology unchanged:
| Property | Value |
|---|---|
| model_type | qwen3_moe |
| Architecture class | Qwen3MoeForCausalLM |
| Total params | ~30B |
| Active params | ~3B |
| hidden_size | 2048 |
| num_hidden_layers | 48 |
| num_experts | 128 |
| num_experts_per_tok | 8 |
| num_attention_heads | 32 |
| num_key_value_heads | 4 |
| head_dim | 128 |
| moe_intermediate_size | 768 |
| max_position_embeddings | 262144 |
| vocab_size | 151936 |
| dtype | bfloat16 |
| rope_theta | 10000000 |
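As a worked example of what these values imply at inference time, the sketch below estimates the key/value cache size from the table. It assumes a plain, unquantized bfloat16 cache for a single sequence with no paging or framework overhead; the function name is hypothetical.

```python
# Values copied from the architecture table.
NUM_HIDDEN_LAYERS = 48
NUM_KEY_VALUE_HEADS = 4
HEAD_DIM = 128
MAX_POSITION_EMBEDDINGS = 262144
NUM_EXPERTS = 128
NUM_EXPERTS_PER_TOK = 8

BYTES_PER_VALUE = 2  # bfloat16


def kv_cache_bytes(num_tokens):
    """Bytes for keys plus values across all layers for one sequence."""
    per_token = 2 * NUM_HIDDEN_LAYERS * NUM_KEY_VALUE_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


print(kv_cache_bytes(1) / 1024, "KiB per token")
print(kv_cache_bytes(MAX_POSITION_EMBEDDINGS) / 1024**3, "GiB at full context")
print(NUM_EXPERTS_PER_TOK / NUM_EXPERTS, "fraction of experts routed per token")
```

Under those assumptions the cache costs 98,304 bytes (96 KiB) per token, or 24 GiB for one full 262,144-token sequence, and each token is routed to 8 of 128 experts (6.25%).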
A config_1m.json is also included for the extended long-context variant. It keeps the same rope_scaling (null) and max_position_embeddings (262144) but adds a dual_chunk_attention_config (dual chunk attention with original_max_position_embeddings = 131072, plus sparse-attention settings) for longer-context inference.
Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zmzfpc/crane-30b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a Python function that returns the nth Fibonacci number."},
]

# Render the chat template and tokenize the prompt in one step.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
Requires a recent `transformers` with Qwen3-MoE support (`transformers >= 4.51`).
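A quick way to check the version requirement before loading the model is sketched below. The helper names are hypothetical, and the comparison only looks at the major and minor version numbers, so suffixes such as `.dev0` or `rc1` are ignored.

```python
import re
from importlib import metadata


def version_tuple(version, parts=2):
    """Parse the leading numeric components, e.g. '4.51.0.dev0' -> (4, 51)."""
    numbers = []
    for piece in version.split(".")[:parts]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)


def has_qwen3_moe_support(installed=None, minimum="4.51"):
    """True if the given (or installed) transformers version is at least `minimum`."""
    if installed is None:
        try:
            installed = metadata.version("transformers")
        except metadata.PackageNotFoundError:
            return False  # transformers is not installed at all
    return version_tuple(installed) >= version_tuple(minimum)


print(has_qwen3_moe_support())
```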
Citation / attribution
If you use this model or the CRANE method, please cite:
```bibtex
@misc{zhu2026crane,
  title         = {CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing},
  author        = {Zhu, Mingzhi and Merler, Michele and Pavuluri, Raju and Patterson, Stacy},
  year          = {2026},
  eprint        = {2605.14084},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE},
  url           = {https://arxiv.org/abs/2605.14084}
}
```
Project page: https://rpi-nsl.github.io/CRANE/ · Code: github.com/rpi-nsl/CRANE
Base models — built from two Apache-2.0 checkpoints:
- Qwen/Qwen3-30B-A3B-Instruct-2507 (base / backbone)
- Qwen/Qwen3-30B-A3B-Thinking-2507 (reasoning donor)
License: Apache-2.0 (consistent with both base models and the CRANE code).
Modalities: text input, text output.