License: MIT
Intended Use
This adapter is intended for:
- research on medical QA reliability
- selective prediction and abstention experiments
- evaluation of confidence-based and learned abstention methods
- educational and portfolio demonstration
It is not intended for clinical decision-making.
Out-of-Scope Use
Do not use this model for:
- real patient diagnosis
- treatment recommendations
- clinical triage
- emergency medical decisions
- replacing medical professionals
Training Data
Dataset: GBaker/MedQA-USMLE-4-options
Task format:
- Medical question
- Four answer options: A, B, C, D
- Model predicts the correct option
The dataset contains USMLE-style multiple-choice questions and does not represent open-ended clinical consultation.
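The task format above can be illustrated with a small prompt builder. This is a minimal sketch only: the template wording, the `format_medqa_prompt` helper, and the `question` / `options` / `answer_idx` field names are assumptions for illustration, not the prompt actually used to train or evaluate this adapter, and the sample item is made up.

```python
def format_medqa_prompt(question: str, options: dict[str, str]) -> str:
    """Render one four-option item as a prompt string (hypothetical template)."""
    letters = ["A", "B", "C", "D"]
    missing = [k for k in letters if k not in options]
    if missing:
        raise ValueError(f"missing options: {missing}")
    lines = [f"Question: {question.strip()}"]
    # One line per option, always in A-D order.
    lines += [f"{k}. {options[k].strip()}" for k in letters]
    # The model is expected to continue with a single option letter.
    lines.append("Answer:")
    return "\n".join(lines)


# Made-up item for illustration; it is not taken from the dataset.
example = {
    "question": "Which vitamin deficiency causes scurvy?",
    "options": {
        "A": "Vitamin A",
        "B": "Vitamin B12",
        "C": "Vitamin C",
        "D": "Vitamin D",
    },
    "answer_idx": "C",
}
prompt = format_medqa_prompt(example["question"], example["options"])
print(prompt)
```

Because reported performance depends on prompt format, any comparison against the numbers in this card should state the exact template used.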
Method
Base model: mistralai/Mistral-7B-v0.3
Fine-tuning method:
- QLoRA
- 4-bit quantization
- LoRA rank: 16
- LoRA alpha: 32
- Target modules: q_proj, v_proj
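To give a sense of scale for these settings, the sketch below estimates how many trainable parameters a rank-16 LoRA adds when applied to `q_proj` and `v_proj`. The layer shapes (hidden size 4096, 8 key/value heads of dimension 128, 32 decoder layers) are assumed from the published Mistral-7B configuration, and the sketch assumes the adapter is attached to every layer; neither is stated in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSpec:
    rank: int = 16   # LoRA rank from this card
    alpha: int = 32  # LoRA alpha from this card


# Assumed Mistral-7B attention shapes (not stated in this card).
HIDDEN = 4096
KV_DIM = 8 * 128  # 8 key/value heads x head dimension 128
NUM_LAYERS = 32
TARGETS = {"q_proj": (HIDDEN, HIDDEN), "v_proj": (HIDDEN, KV_DIM)}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


spec = LoraSpec()
per_layer = sum(lora_params(i, o, spec.rank) for i, o in TARGETS.values())
total_trainable = per_layer * NUM_LAYERS
scaling = spec.alpha / spec.rank  # standard LoRA scaling factor alpha / r

print(f"trainable LoRA parameters: {total_trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapter trains roughly 6.8 million parameters, a small fraction of the 7B base model.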
Evaluation Summary
On the MedQA-USMLE test split:
| Model | Accuracy | Coverage | Dataset Wrong Rate |
|---|---|---|---|
| Base Mistral-7B | 49.33% | 100% | 50.67% |
| SFT adapter | 52.24% | 100% | 47.76% |
The SFT adapter was also used as the starting point for post-hoc confidence thresholding and later warm-start + DPO learned-abstention experiments.
Note: this Hugging Face repository contains the SFT QLoRA adapter. DPO learned-abstention checkpoints are reported in the GitHub repository results, but are not hosted here.
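The table's columns can be computed as in the sketch below. The definitions are an assumption inferred from the table: at 100% coverage, accuracy and dataset wrong rate sum to 100%, so the wrong rate is taken as wrong answers divided by all questions, with abstentions counted as neither correct nor wrong. The `selective_metrics` helper and the toy predictions are hypothetical.

```python
from typing import Optional, Sequence


def selective_metrics(
    preds: Sequence[Optional[str]], golds: Sequence[str]
) -> dict[str, float]:
    """Compute coverage, accuracy on answered items, and dataset-level wrong rate.

    A prediction of None means the model abstained on that question.
    """
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and the same length")
    total = len(golds)
    answered = sum(p is not None for p in preds)
    correct = sum(p == g for p, g in zip(preds, golds))  # None never matches
    wrong = answered - correct
    return {
        "coverage": answered / total,
        "selective_accuracy": correct / answered if answered else 0.0,
        "dataset_wrong_rate": wrong / total,
    }


# Toy data for illustration only.
preds = ["A", None, "C", "B", None]
golds = ["A", "B", "D", "B", "C"]
metrics = selective_metrics(preds, golds)
print(metrics)
```

With no abstentions, coverage is 1.0 and the wrong rate is simply one minus accuracy, which matches the two rows reported above.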
Selective Prediction Context
The larger project studies two abstention approaches:
- Post-hoc confidence thresholding using A/B/C/D answer probabilities.
- Learned abstention using warm-start SFT + DPO, where the model learns an explicit abstention completion:
"I cannot answer confidently."
Final learned-abstention checkpoints reduced dataset-level wrong answers from about 48% to 7-17%, depending on the selected coverage/safety operating point.
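The post-hoc thresholding approach can be sketched as follows: take the model's scores for the four option letters, normalize them with a softmax, and abstain when the top probability falls below a threshold. This is a minimal sketch, not the project's implementation; the helper names, the example logits, and the 0.6 threshold are illustrative assumptions.

```python
import math
from typing import Optional, Sequence

LETTERS = ("A", "B", "C", "D")


def option_probs(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over the four option-letter logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def answer_or_abstain(logits: Sequence[float], threshold: float) -> Optional[str]:
    """Return the top option letter, or None (abstain) if confidence is too low."""
    if len(logits) != len(LETTERS):
        raise ValueError("expected one logit per option letter")
    probs = option_probs(logits)
    best = max(range(len(LETTERS)), key=probs.__getitem__)
    return LETTERS[best] if probs[best] >= threshold else None


# Illustrative logits, not real model outputs.
confident = answer_or_abstain([0.1, 4.0, 0.3, 0.2], threshold=0.6)
unsure = answer_or_abstain([1.0, 1.1, 0.9, 1.0], threshold=0.6)
print(confident, unsure)
```

Raising the threshold lowers coverage and typically lowers the dataset-level wrong rate, which is the coverage/safety trade-off behind the operating points mentioned above.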
Limitations
- Evaluated only on multiple-choice USMLE-style questions.
- Not evaluated for open-ended clinical advice.
- Not clinically validated.
- May produce incorrect medical answers.
- Should not be used for real medical decisions.
- Performance depends on prompt format and evaluation method.
Ethical and Safety Notes
Medical QA models can produce plausible but incorrect answers. This adapter is provided for research and educational purposes only.
Any clinical use would require:
- domain expert validation
- extensive safety testing
- calibration on held-out data
- uncertainty estimation
- regulatory and institutional review
Citation / Attribution
If using this adapter or project, please cite or link the GitHub repository:
Model provider: Primeinvincible