License: apache-2.0
TL;DR
MIRA is a source-aware data selection framework for heterogeneous mid-training corpora. Instead of applying a single global quality rubric, MIRA (1) clusters sources into capability-coherent groups, (2) lets a frontier teacher (Kimi-K2.6) freely propose rubric dimensions and anchors them per group, (3) distills the anchored teacher into a lightweight per-group student scorer, and (4) applies reliability-aware aggregation with per-source retention thresholds.
This repository is one of those student scorers — variant 4 in the Agent family, specialized for software-engineering repository-repair agents. Given an in-distribution record, it produces a numerical score and a short rationale for every anchor dimension in this group's rubric.
Model summary
| Property | Value |
|---|---|
| Architecture | Mixture-of-Experts decoder (35B total / ≈3B active params) |
| Base model | Qwen3.5-35B-A3B-Base |
| Fine-tuning | Full-parameter SFT on Kimi-K2.6 anchored teacher labels |
| Domain | Repository-level software-engineering repair agents (OpenHands, daVinc-Dev, swe_qa): issue → patch trajectories with test-driven verification and minimal-diff style. |
| Anchor rubric | 15 group-specific dimensions (group_D_dim_anchors.jsonl in the project repo) |
| Source count | 3 agent sources |
| Phase-2 corpus (this group) | 231,762 teacher-scored records |
| Output | Structured (score, rationale) per anchor dimension |
| Precision | BF16 |
| License | Apache-2.0 (inherits from Qwen3) |
Sources covered
This scorer is calibrated for the following mid-training sources in the Agent / SWE repository repair group:
| Source | Description |
|---|---|
| agent_openhands | OpenHands SWE agent traces |
| davinc_dev | daVinc-Dev env-native pass+fail traces |
| swe_qa | SFT SWE-bench issue + tool-call traces |
The full source-grouping report (KMeans k=4 / 5 clusters, intra-group cosine similarities) is in the project repo.
Anchor dimensions (15 slots)
The scoring rubric for this group was discovered via Kimi-K2.6 free-form judging and clustered into 15 anchor dimensions (KMeans k=15 over the group's dim-score embeddings). Names are read verbatim from group_D_dim_anchors.jsonl. Dimensions are sorted by cluster size: larger clusters dominate the corpus and carry more signal. When two clusters' centroids collided on the same anchor name, only the cluster that holds the most records labeled with that name keeps the bare form; the other gets a parenthetical sub-label drawn from its own name distribution, so all 15 slots are uniquely identifiable.
| Slot | Dimension | Cluster size |
|---|---|---|
| A1 | Plan Quality / Reasoning Transparency (Goal Achievement) | 59,467 |
| A2 | Output Style/Communication | 41,355 |
| A3 | Goal Achievement | 38,129 |
| A4 | Plan Quality / Reasoning Transparency | 36,330 |
| A5 | Test Coverage & Verification Rigor | 29,347 |
| A6 | Multi-turn Coherence | 29,127 |
| A7 | Error Recovery | 28,843 |
| A8 | Tool Argument Correctness | 28,122 |
| A9 | Action Efficiency | 27,440 |
| A10 | User Intent Comprehension | 24,691 |
| A11 | Safety & scope | 24,339 |
| A12 | Observation handling | 23,739 |
| A13 | Code Change Minimality | 20,203 |
| A14 | Tool Selection | 17,000 |
| A15 | Tool Selection (Safety & Scope) | 16,168 |
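The name-collision rule described before the table can be sketched as follows. This is an illustrative reconstruction, not the project's code: the function name `disambiguate`, its inputs, and the choice of the most frequent other name as the sub-label are all assumptions, since the card only says the sub-label is "drawn from its own name distribution".

```python
from collections import Counter

def disambiguate(anchors, name_counts):
    """Make colliding anchor names unique (hypothetical helper).

    anchors:     anchor name assigned to each cluster (one per cluster)
    name_counts: per-cluster Counter of dimension name -> number of records
    """
    labels = list(anchors)
    for name in set(anchors):
        holders = [i for i, a in enumerate(anchors) if a == name]
        if len(holders) < 2:
            continue  # no collision for this name
        # The cluster holding the most records labeled `name` keeps the bare form.
        keeper = max(holders, key=lambda i: name_counts[i][name])
        for i in holders:
            if i == keeper:
                continue
            # Assumption: the sub-label is the most frequent *other* name in the cluster.
            others = [n for n, _ in name_counts[i].most_common() if n != name]
            sub_label = others[0] if others else f"cluster {i}"
            labels[i] = f"{name} ({sub_label})"
    return labels

# Toy counts (invented for illustration, not taken from the corpus).
example_anchors = ["Tool Selection", "Error Recovery", "Tool Selection"]
example_counts = [
    Counter({"Tool Selection": 900, "Action Efficiency": 50}),
    Counter({"Error Recovery": 700}),
    Counter({"Tool Selection": 400, "Safety & Scope": 300}),
]
print(disambiguate(example_anchors, example_counts))
```

Note that the keeper is chosen by the count of records carrying the colliding name, not by total cluster size, which is how a larger cluster can still end up with the parenthetical form.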
The scorer outputs one [Ai] <dimension>: <score>/10 — <rationale> line per slot, plus overall, training_recommendation, domain_tag, and brief.
Where this model fits in the MIRA pipeline
```text
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│ 1. Rubric        │  │ 2. Anchored      │  │ 3. Reliability   │  │ 4. Data          │
│    Discovery     │→ │    Judge         │→ │    Aggregation   │→ │    Selection     │
│ (Kimi-K2.6,      │  │    Distillation  │  │ (mask unreliable │  │ (per-source      │
│  free-form       │  │ ◀── THIS MODEL   │  │  src×dim cells)  │  │  retention)      │
│  judging)        │  │                  │  │                  │  │                  │
└──────────────────┘  └──────────────────┘  └──────────────────┘  └──────────────────┘
```
MIRA-Agent-Group4 lives in Stage 2: it scores the full Agent / SWE repository repair corpus so that downstream stages can apply reliability masking and source-aware retention.
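As a rough illustration of what the downstream stages do with this model's scores, the sketch below masks unreliable (source, dimension) cells, averages the remaining dimension scores per record, and keeps a per-source fraction of the top-ranked records. Everything here is an assumption made for illustration: the function names, the record layout, the use of a plain mean, and the fraction-based retention rule are not taken from the paper or the project repo.

```python
from statistics import mean

def aggregate(scores, source, mask):
    """Mean over dimensions whose (source, dim) cell is not masked (assumed rule)."""
    kept = [s for dim, s in scores.items() if (source, dim) not in mask]
    return mean(kept) if kept else None

def select(records, mask, retention):
    """Keep the top `retention[source]` fraction of records within each source.

    records:   list of dicts with hypothetical keys 'id', 'source', 'scores'
    mask:      set of (source, dim) cells judged unreliable
    retention: source -> fraction of records to keep (hypothetical values)
    """
    by_source = {}
    for rec in records:
        agg = aggregate(rec["scores"], rec["source"], mask)
        if agg is not None:
            by_source.setdefault(rec["source"], []).append((agg, rec["id"]))
    selected = []
    for source, scored in by_source.items():
        scored.sort(reverse=True)  # highest aggregate first
        n_keep = round(len(scored) * retention.get(source, 1.0))
        selected.extend(rec_id for _, rec_id in scored[:n_keep])
    return selected

demo_records = [
    {"id": "r1", "source": "swe_qa", "scores": {"A1": 8, "A15": 2}},
    {"id": "r2", "source": "swe_qa", "scores": {"A1": 6, "A15": 10}},
    {"id": "r3", "source": "agent_openhands", "scores": {"A1": 5, "A15": 9}},
]
demo_mask = {("swe_qa", "A15")}  # pretend A15 is unreliable for swe_qa
demo_retention = {"swe_qa": 0.5, "agent_openhands": 1.0}
print(select(demo_records, demo_mask, demo_retention))
```

With the mask applied, `r1` outranks `r2` inside `swe_qa`; without it the two records would tie, which is the kind of distortion the masking step is meant to avoid.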
Intended use
- Primary: Score software-engineering repository-repair agent trajectories on this group's anchor dimensions to drive source-aware data selection and filtering.
- Secondary: Research on rubric distillation, semantic quality scoring, and reliability diagnostics for heterogeneous training corpora.
Not intended for:
- General-purpose chat or instruction following — fine-tuned to emit structured scores, not freeform dialogue.
- Single-shot quality judgments without the anchor-dimension prompt template — outputs will be miscalibrated.
- Records outside the Agent / SWE repository repair group; use the matching sibling scorer instead.
Deployment
The scorer is designed to be served via vLLM behind an OpenAI-compatible endpoint and called in batch from the MIRA scoring pipeline.
1. Serve with vLLM (recommended)
```bash
vllm serve whw06/MIRA-Agent-Group4 \
  --tensor-parallel-size 8 \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --max-num-batched-tokens 131072 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --port 8000
```
Why these values (verified on H200 141GB during the paper's per-source evaluation):
- max-model-len=65536: 2× the mid-training cutoff. Records can hit ~60K tokens for densely-tokenized sources; 40K runs into prompt-overflow errors.
- max-num-batched-tokens=131072: supports two full-length sequences per scheduling step.
- gpu-memory-utilization=0.9: 35B BF16 weights take ~70GB, leaving ~57GB KV cache. Roughly 4 concurrent 65K-context sequences per GPU.
- 8-way tensor parallel works well for the 35B MoE on a single 8×H200/A100 node.
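The figures above follow from simple arithmetic, reproduced here as a back-of-the-envelope check. The calculation treats the weights as resident on one 141GB GPU, which is how the ~70GB / ~57GB numbers are stated; with 8-way tensor parallelism the weights are sharded across GPUs, so actual per-GPU headroom differs. The concurrency estimate is not recomputed because it depends on per-token KV-cache size, which the card does not give.

```python
GPU_MEMORY_GB = 141            # H200
GPU_MEMORY_UTILIZATION = 0.9   # --gpu-memory-utilization
TOTAL_PARAMS = 35e9            # 35B total parameters
BYTES_PER_PARAM = 2            # BF16
MAX_MODEL_LEN = 65536          # --max-model-len

usable_gb = GPU_MEMORY_GB * GPU_MEMORY_UTILIZATION     # memory vLLM may claim
weights_gb = TOTAL_PARAMS * BYTES_PER_PARAM / 1e9      # unsharded weight footprint
kv_budget_gb = usable_gb - weights_gb                  # what is left for KV cache
batched_tokens = 2 * MAX_MODEL_LEN                     # two full-length sequences

print(f"usable: {usable_gb:.1f} GB")        # usable: 126.9 GB
print(f"weights: {weights_gb:.1f} GB")      # weights: 70.0 GB
print(f"KV budget: {kv_budget_gb:.1f} GB")  # KV budget: 56.9 GB
print(f"max-num-batched-tokens: {batched_tokens}")  # max-num-batched-tokens: 131072
```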
2. Call from Python
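```python
from openai import OpenAI

# Placeholders: build the real prompts with the project repo's prompt builder.
SYSTEM_PROMPT = "..."  # group-D anchor calibration
USER_PROMPT = "..."    # record + [A1]..[A15] template

# Points at the local vLLM server started above; the API key is a dummy value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="whw06/MIRA-Agent-Group4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```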
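Since the scorer is meant to be called in batch, a thread pool over the same chat-completions call is one simple way to keep the server busy. The sketch below is an assumption about how such a loop might look, not the pipeline's actual client: `score_records` and its parameters are hypothetical, the sampling settings are copied from the single-call example, and it assumes the client object can be shared across threads.

```python
from concurrent.futures import ThreadPoolExecutor

def score_records(client, system_prompt, user_prompts,
                  model="whw06/MIRA-Agent-Group4", max_workers=4):
    """Return one raw scorer output per user prompt, in input order."""
    def score_one(user_prompt):
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            temperature=0.7,
            top_p=0.95,
            max_tokens=2048,
        )
        return resp.choices[0].message.content

    # pool.map preserves input order even though requests run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_one, user_prompts))
```

Pass in an `OpenAI` client configured with the vLLM `base_url`; `max_workers` should stay in line with how many long-context sequences the server can hold at once.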
3. Prompt template
The user message asks for one structured line per anchor dimension (top-15 of this group):
```markdown
[A1] {anchor_dim_1}: <score>/10 — <justification>
[A2] {anchor_dim_2}: <score>/10 — <justification>
...
[A15] {anchor_dim_15}: <score>/10 — <justification>
overall: <0-100>
training_recommendation: <keep | downsample | drop>
domain_tag: <short tag>
brief: <one-sentence summary>
```
The system prompt embeds the top-12 anchor calibration references (canonical examples from clustering) so the student matches the teacher's scoring scale. The full prompt builder, anchor JSONL files, and output parser are in the project repo's scoring/score_agent_anchored.py.
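For readers who only need to consume the output, a minimal regex-based parser for the line format above might look like this. It is a sketch written against the template as shown, not the parser shipped in the project repo; `parse_scorer_output` and the returned dictionary layout are hypothetical, and malformed lines are silently skipped.

```python
import re

# "[A3] Goal Achievement: 7/10 — rationale"; accepts em dash, en dash, or hyphen.
DIM_RE = re.compile(r"^\[A(\d+)\]\s*(.+?):\s*(\d+(?:\.\d+)?)/10\s*[—–-]\s*(.*)$")
FIELD_RE = re.compile(r"^(overall|training_recommendation|domain_tag|brief):\s*(.*)$")

def parse_scorer_output(text):
    """Split raw scorer text into per-slot scores and trailing summary fields."""
    dims, fields = {}, {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if m := DIM_RE.match(line):
            slot, name, score, rationale = m.groups()
            dims[f"A{slot}"] = {
                "dimension": name,
                "score": float(score),
                "rationale": rationale,
            }
        elif m := FIELD_RE.match(line):
            fields[m.group(1)] = m.group(2)
    return dims, fields

sample_output = """\
[A1] Plan Quality / Reasoning Transparency (Goal Achievement): 7/10 — plan stated before edits
[A2] Output Style/Communication: 8/10 — concise summaries
overall: 74
training_recommendation: keep
domain_tag: swe-repair
brief: Targeted fix with a passing regression test."""

parsed_dims, parsed_fields = parse_scorer_output(sample_output)
print(parsed_dims["A1"]["score"], parsed_fields["training_recommendation"])
```

A production parser would also want to check that all 15 slots are present and that scores fall within 0 to 10 before a record enters aggregation.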
Training details
| Property | Value |
|---|---|
| Teacher | Kimi-K2.6 (free-form rubric discovery in Phase 1; anchored re-scoring in Phase 2) |
| Training data | Kimi-K2.6 anchored labels on this group's Phase-2 corpus, split into a distillation set + a held-out validation split for reliability diagnostics |
| Loss | Standard next-token CE over (score, rationale) labels for every anchor dimension |
| Hyperparameters | Held constant across all MIRA student scorers; full settings in paper Appendix A.4 |
| Validation | Per-dimension teacher–student MAE and Spearman ρ on a held-out split; dimensions failing reliability thresholds are masked post-hoc (Figure 3 in the paper) |
The training loss-per-step curve is preserved in trainer_state.json for full reproducibility.
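The validation diagnostics named in the table (teacher–student MAE and Spearman ρ) can be computed for one source × dimension cell with a few lines of standard-library Python. The sketch below is illustrative only: the threshold values `max_mae=1.5` and `min_rho=0.5` are hypothetical placeholders, not the paper's settings.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant scores; treat as no signal
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))

def mae(teacher, student):
    return mean(abs(t - s) for t, s in zip(teacher, student))

def is_reliable(teacher, student, max_mae=1.5, min_rho=0.5):
    # Hypothetical thresholds, chosen only to make the example concrete.
    return mae(teacher, student) <= max_mae and spearman(teacher, student) >= min_rho

teacher_scores = [2, 4, 6, 8, 10]
student_scores = [3, 4, 7, 8, 9]
print(mae(teacher_scores, student_scores), spearman(teacher_scores, student_scores))
```

Cells for which `is_reliable` returns `False` would be the ones added to the post-hoc mask.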
Headline results (from the paper)
End-to-end downstream evaluation: Qwen2.5-Coder-14B mid-trained on 25B-token MIRA-selected subsets vs. baselines, then SFT, evaluated on 9 code benchmarks across 4 categories.
| Method | Code Gen | MultiPL-E | SQL (EX) | SWE-Multi | Macro Avg |
|---|---|---|---|---|---|
| Base + SFT (no mid) | 53.91 | 72.57 | 64.24 | 3.67 | 48.60 |
| Raw Mixture (50B) | 53.71 | 67.42 | 94.18 | 40.00 | 63.83 |
| Random (25B) | 52.71 | 71.44 | 91.03 | 35.00 | 63.23 |
| DataMan (25B) | 53.82 | 71.38 | 93.84 | 33.00 | 63.01 |
| DSIR (25B) | 48.74 | 67.26 | 95.20 | 27.00 | 59.55 |
| PPL (25B) | 50.52 | 57.74 | 90.66 | 20.00 | 54.73 |
| MIRA-Global (25B) | 53.12 | 67.84 | 94.26 | 32.00 | 61.81 |
| MIRA-Group (25B) | 54.53 | 71.85 | 94.08 | 36.33 | 64.20 |
| MIRA-Source (25B) | 54.18 | 72.84 | 94.38 | 30.33 | 62.93 |
MIRA-Group matches the full 50B-token raw mixture while using only half the tokens, and outperforms all 25B-token selection baselines on the macro average. This scorer is one of the 12 student models used by the MIRA-Group variant.
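The Macro Avg column appears to be the unweighted mean of the four category columns (an inference from the table, not something the card states). A quick check for the two rows compared in the sentence above, with numbers copied from the table:

```python
from statistics import mean

# Category scores in table order, one list per method.
category_scores = {
    "Raw Mixture (50B)": [53.71, 67.42, 94.18, 40.00],
    "MIRA-Group (25B)": [54.53, 71.85, 94.08, 36.33],
}

macro_avg = {method: mean(scores) for method, scores in category_scores.items()}
for method, value in macro_avg.items():
    print(f"{method}: {value:.2f}")
# Raw Mixture (50B): 63.83
# MIRA-Group (25B): 64.20
```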
Sibling models
MIRA releases one student scorer per source-group variant. Use the matching scorer for each record's format:
- Agent: whw06/MIRA-Agent-Group1 · -Group2 · -Group3 · MIRA-Agent-Group4 (this model)
- QA: whw06/MIRA-QA-Group1 · -Group2 · -Group3 · -Group4 · -Group5
- Text: whw06/MIRA-Text-Group1 · -Group2 · -Group3
Limitations
- MIRA addresses source-aware filtering only. Source discovery, mixture-ratio design, curriculum scheduling, deduplication and contamination control remain orthogonal concerns.
- This scorer is calibrated against the Agent / SWE repository repair group; cross-domain transfer is not advised — use the matching sibling for other source formats.
- Some anchor dimensions exhibit high teacher–student MAE and are masked post-hoc during aggregation (see paper §3.4). The model still emits scores for masked dimensions; downstream consumers should re-apply the reliability mask from the project repository.
- Calibrated on 3 sources within this group; behavior on out-of-distribution formats is unverified.
Citation
```bibtex
@inproceedings{wang2026mira,
  title     = {MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection},
  author    = {Wang, Haowen and Du, Yaxin and Yang, Jian and Wu, Jiajun and
               Liu, Shukai and Zhang, Yuxuan and Wang, Pingjie and Chen, Siheng and
               Zheng, Tuney and Zhou, Ming and Liu, Xianglong},
  booktitle = {Proceedings of the 2026 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2026}
}
```
Acknowledgments
Built on Qwen3.5-35B-A3B-Base and the Megatron-LM training stack. Teacher labels generated with Kimi-K2.6.