dilr/Mira-Q2 API & Inference Endpoint

Comprehensive Evaluation (782 docs across 4 test sets)

Eval Set	N	Type	JSON Validity [95% CI]	Identifier Leak	Field-F1 [95% CI]
test_gold	200	Same distribution (held-out)	100.0% [100.0-100.0]	0.0%	1.000 [0.999-1.000]
synthetic_v2	150	Different formatting dialect	100.0% [100.0-100.0]	0.0%	n/a (unlabeled)
extraction_relevant	150	Real physician docs (on-schema)	94.7% [90.7-98.0]	0.0%	n/a (unlabeled)
mtsamples	282	Real physician docs (39 specialties)	85.8% [81.9-89.7]	0.0%	n/a (unlabeled)

95% bootstrap CIs (1000 resamples). Zero identifier leaks across all 782 documents.
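
The intervals in the table are bootstrap confidence intervals. As a rough illustration, a percentile bootstrap for a per-document pass/fail metric such as JSON validity could look like the sketch below. The helper name `bootstrap_ci`, the seed, and the 142-of-150 outcome vector (which is merely consistent with the 94.7% row) are assumptions for illustration, not the evaluation code behind this card.

```python
import random

def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes, in percent."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample n documents with replacement and record the pass rate.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(100.0 * sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = 100.0 * sum(outcomes) / n
    return point, lo, hi

# Hypothetical outcomes: 142 valid documents out of 150.
outcomes = [1] * 142 + [0] * 8
point, lo, hi = bootstrap_ci(outcomes)
print(f"{point:.1f}% [{lo:.1f}-{hi:.1f}]")
```

When every document passes, each resample also passes, so the interval collapses to a single point. That is why the 100% rows have a zero-width interval.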

Three-Way Comparison

Model	Training Data	Validity (test_gold)	F1 (test_gold)
Qwen2.5-3B zero-shot	—	0% (invents own schema)	0.0
Mira-Q1 (v1)	3,438 examples	98% (50-example eval)	—
Mira-Q2 (this model)	8,400 examples	100% (200-example eval)	1.000

Training

Parameter	Value
Base model	Qwen/Qwen2.5-3B-Instruct (via Unsloth)
Method	QLoRA (4-bit, r=16, alpha=32)
Training data	8,400 examples (6,400 gold-by-construction + 2,000 schema variants)
Data sources	Real ICD-10 codes (71K), NLM drug names, curated lab reference ranges
Schema variants	Renamed fields, dropped fields, minimal schemas (for generalization)
Epochs	2
Final train loss	0.132
Final eval loss	0.142
Overfit gap	0.010 (healthy)

Loss Curve

markdown
Step   50: 1.0723  (epoch 0.1)
Step  200: 0.1556  (epoch 0.4)
Step  525: 0.1414  (epoch 1.0) — checkpoint
Step  750: 0.1318  (epoch 1.4) — lowest
Step 1050: 0.1320  (epoch 2.0) — final
Eval:      0.1418  (epoch 2.0)
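
The figures quoted in the Training table can be re-derived from the logged values above. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the steps-per-epoch value is inferred from step 525 being logged as epoch 1.0.

```python
# Values copied from the loss curve above (step -> training loss).
train_loss = {50: 1.0723, 200: 0.1556, 525: 0.1414, 750: 0.1318, 1050: 0.1320}
eval_loss_final = 0.1418

best_step = min(train_loss, key=train_loss.get)   # step with the lowest training loss
gap = eval_loss_final - train_loss[1050]          # eval loss minus final training loss
steps_per_epoch = 525                             # inferred: step 525 is epoch 1.0

print(best_step)                         # 750
print(round(gap, 4))                     # 0.0098
print(round(750 / steps_per_epoch, 1))   # 1.4
```

The 0.010 overfit gap in the table is the same difference computed from the rounded losses (0.142 minus 0.132).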

What's New vs Mira-Q1

2.4x the training data (8,400 vs 3,438 examples)
Gold-by-construction data — real ICD-10 codes, NLM drugs, real lab reference ranges (not Synthea-rendered)
Schema-variant training — 2,000 examples with modified schemas for schema-as-input generalization
8% lower loss (0.132 vs 0.143)
100% validity on 200-example gold eval (vs 98% on 50 examples)
Comprehensive eval on 782 docs including real physician dictations
Zero identifier leaks verified across all test sets

Synthetic-to-Real Gap

The honest finding: Mira-Q2 reaches 100% JSON validity on training-distribution data (test_gold) but 85.8% on general real physician prose (MTSamples). This is expected for a model trained on synthetic data: it learned our generator's patterns well but struggles with document types it never saw, such as operative notes and physical exams. On on-schema real docs (extraction_relevant) validity is 94.7%, which narrows the gap to about 5 percentage points.

We expect this gap to close with retraining on real partner data (v1), broader document-type coverage in training, and OCR pipeline integration.

Usage

python
# IMPORTANT: load with Unsloth, not a standard PeftModel (quantization mismatch)
from unsloth import FastLanguageModel
import torch  # required for dtype=torch.float16 below

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="dilr/Mira-Q2",
    max_seq_length=4096,
    dtype=torch.float16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": "You are a clinical information extraction system..."},
    {"role": "user", "content": "Patient: 45/M\nHb 12.5 g/dL (13-17) LOW\nWBC 8.2 x10^9/L (4-11) Normal"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False,
                          eos_token_id=[tokenizer.eos_token_id,
                                        tokenizer.convert_tokens_to_ids("<|im_end|>")])
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Note: Do NOT load with PeftModel.from_pretrained(base, "dilr/Mira-Q2") — the adapter was trained with Unsloth's quantization which differs from standard bitsandbytes. Use FastLanguageModel as shown above.
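
Because JSON validity is below 100% on real physician documents, downstream code should not assume the decoded text always parses. The sketch below shows one way to handle that using only the standard library. `parse_extraction` is a hypothetical helper, not part of this repository, and it assumes the model emits a bare JSON object with no surrounding text.

```python
import json

def parse_extraction(raw_text):
    """Return (record, error); record is a dict only when the output is a valid JSON object."""
    try:
        record = json.loads(raw_text.strip())
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "top-level JSON value is not an object"
    return record, None

# A well-formed output parses into a dict.
record, error = parse_extraction('{"document_type": "lab_report", "labs": []}')
print(record["document_type"], error)   # lab_report None

# A truncated output (for example, generation cut off) is reported, not raised.
bad_record, bad_error = parse_extraction('{"document_type": "lab_report"')
print(bad_record, bad_error is not None)  # None True
```

Outputs that fail to parse can then be routed to human review instead of being dropped silently.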

Schema

Extracts 10 required fields:

document_type: lab_report | medication_list | discharge_summary | pathology_report | intake_form | progress_note | other
patient: {age, sex} — de-identified, never includes names/MRN
encounter: {date (ISO), department}
vitals[], labs[], medications[], diagnoses[], procedures[], allergies[]
extraction_notes
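
A parsed output can be checked against the field list above before it reaches a reviewer. The sketch below is a minimal structural check, not the validator used for the scorecard. It assumes that the fields written with `[]` are JSON arrays and that `patient` carries only `age` and `sex`; the name `check_record` is hypothetical.

```python
REQUIRED_FIELDS = [
    "document_type", "patient", "encounter", "vitals", "labs",
    "medications", "diagnoses", "procedures", "allergies", "extraction_notes",
]
DOCUMENT_TYPES = {
    "lab_report", "medication_list", "discharge_summary", "pathology_report",
    "intake_form", "progress_note", "other",
}
LIST_FIELDS = ["vitals", "labs", "medications", "diagnoses", "procedures", "allergies"]
ALLOWED_PATIENT_KEYS = {"age", "sex"}  # assumption: no other patient keys are allowed

def check_record(record):
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    if "document_type" in record and record["document_type"] not in DOCUMENT_TYPES:
        problems.append(f"unknown document_type: {record['document_type']}")
    for f in LIST_FIELDS:
        if f in record and not isinstance(record[f], list):
            problems.append(f"{f} is not a list")
    patient = record.get("patient")
    if isinstance(patient, dict):
        extra = sorted(set(patient) - ALLOWED_PATIENT_KEYS)
        if extra:
            problems.append(f"unexpected patient keys (possible identifier leak): {extra}")
    return problems

good = {f: [] for f in LIST_FIELDS}
good.update(document_type="lab_report", patient={"age": 45, "sex": "M"},
            encounter={"date": "2024-01-15", "department": "Hematology"},
            extraction_notes="")
print(check_record(good))   # []

leaky = dict(good, patient={"age": 45, "sex": "M", "name": "example"})
print(check_record(leaky))
```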

Architecture: Schema-as-Input

Mira-Q2 is trained with schema-variant examples — the model learns to follow any extraction schema injected in the system prompt, not just the clinical one. This enables customer onboarding with zero code changes (schema file + seed examples only).
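
In practice, schema-as-input means the schema travels with the request rather than living in code. This card does not document the exact prompt layout used during training, so the sketch below is only one plausible arrangement. `build_messages`, the "Schema:" label, and the example schema are hypothetical, and the base instruction is left as elided in the Usage section.

```python
import json

def build_messages(base_instruction, schema, document_text):
    """Build chat messages with the extraction schema embedded in the system prompt."""
    schema_json = json.dumps(schema, indent=2, sort_keys=True)
    system = f"{base_instruction}\n\nSchema:\n{schema_json}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": document_text},
    ]

# Hypothetical customer schema; field names are illustrative only.
custom_schema = {"document_type": "string", "labs": "array", "extraction_notes": "string"}

messages = build_messages(
    "You are a clinical information extraction system...",
    custom_schema,
    "Patient: 45/M\nHb 12.5 g/dL (13-17) LOW",
)
print(messages[0]["content"])
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` exactly as in the Usage example, so switching schemas changes only the data passed in.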

Eval Data

The eval/ directory contains:

comprehensive_scorecard.json — full results with bootstrap CIs
test_gold_200_result.json — test_gold scorecard
mtsamples_282_result.json — real MTSamples probe
extraction_relevant_150_result.json — on-schema real docs
synthetic_v2_150_result.json — format robustness probe

Limitations

English only
Trained on synthetic data; retraining on real clinical documents is expected to improve accuracy (v1 with design partner)
85.8% JSON validity on general real docs (MTSamples, 39 specialties); strongest on the lab, discharge, and medication document types it was trained on
Every output is a draft for human review — not for autonomous clinical decisions
Must load with Unsloth (not vanilla PeftModel)

License

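Apache-2.0 for this adapter. The base model, Qwen/Qwen2.5-3B-Instruct, is released under the Qwen Research License rather than Apache-2.0, so its terms also apply.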
