License: other
Model Summary
| Field | Value |
|---|---|
| Base model | meta-llama/Meta-Llama-3-8B |
| Adapter type | PEFT LoRA |
| Fine-tuning method | QLoRA |
| Task | ICD-10-CM code generation |
| Dataset | generative-technologies/synth-ehr-icd10-llama3-format |
| Training hardware | Kaggle 2× NVIDIA T4 |
| Language | English |
This repository contains only the LoRA adapter weights. You must have access to the gated meta-llama/Meta-Llama-3-8B base model on Hugging Face to use it.
Problem
Medical coding converts clinical documentation into standardized ICD-10-CM diagnosis codes. This project explores whether a compact LoRA adapter can teach an open-weight LLM to map synthetic EHR-style notes to ICD-10-CM codes.
Example task:
- Input: Patient is a 58-year-old male with type 2 diabetes mellitus without complications and essential hypertension.
- Output: E11.9, I10
Training Setup
| Setting | Value |
|---|---|
| Train samples | 50,000 |
| Validation samples | 2,000 |
| Epochs | 1 |
| Max sequence length | 768 |
| Quantization | 4-bit NF4 |
| Double quantization | Enabled |
| Compute dtype | float16 |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj |
| Optimizer | paged_adamw_8bit |
| Learning rate | 2e-4 |
| Gradient accumulation steps | 16 |
| Gradient checkpointing | Enabled |
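To give a sense of how compact the adapter is, the sketch below estimates the number of trainable LoRA parameters implied by the rank and target modules in the table. The layer count and projection shapes are assumptions taken from the public Llama 3 8B configuration (hidden size 4096, 32 layers, 8 key/value heads of dimension 128); they are not stated in this card.

```python
# Assumed Llama 3 8B attention shapes (from the public model config, not this card)
HIDDEN = 4096
KV_DIM = 8 * 128  # 1024: grouped-query attention shrinks k_proj and v_proj outputs
NUM_LAYERS = 32

# Values from the Training Setup table
RANK = 16
ALPHA = 32

# (in_features, out_features) of each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    # LoRA adds A (rank x in) and B (out x rank) beside the frozen weight
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGETS.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (alpha / rank): {scaling}")
```

Under these assumptions the adapter has roughly 13.6 million trainable parameters, a small fraction of the 8B frozen base weights.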
Training used completion-only loss: prompt tokens were masked, and loss was computed only on the ICD-10-CM answer tokens.
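A minimal sketch of what completion-only masking looks like at the label level. The card does not say which utility performed the masking, so `build_labels` is a hypothetical helper and the token ids are toy values. The one fixed point is `-100`, the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, answer_ids):
    """Return (input_ids, labels) with every prompt position masked out."""
    input_ids = list(prompt_ids) + list(answer_ids)
    # Prompt tokens contribute no loss; answer tokens keep their own ids as labels
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


# Toy ids standing in for a tokenized prompt and ICD-10-CM answer
prompt_ids = [101, 102, 103, 104]
answer_ids = [201, 202, 203]

input_ids, labels = build_labels(prompt_ids, answer_ids)
print(labels)
```

The model still attends to the full prompt; only the gradient signal is restricted to the answer tokens.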
Dataset Processing
Before training, examples were cleaned and normalized:
- Removed empty clinical notes and empty targets
- Normalized ICD-10-CM code formatting
- Filtered examples with ICD code leakage in the prompt
- Removed duplicate examples
- Converted each example into Llama 3 chat format
Prompt format:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert medical coder. Read the clinical note and return only the correct ICD-10-CM code or codes.
<|eot_id|><|start_header_id|>user<|end_header_id|>

{clinical_note}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{icd_codes}<|eot_id|>
```
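The cleaning steps listed in this section could be sketched roughly as follows. This is an illustration, not the author's preprocessing script: the field names `note` and `codes`, the helper names, and the exact leakage rule (a target code appearing verbatim in the note) are all assumptions.

```python
import re


def normalize_code(raw):
    """Uppercase, drop punctuation, and place the dot after the third character."""
    code = re.sub(r"[^0-9A-Za-z]", "", raw).upper()
    return code if len(code) <= 3 else f"{code[:3]}.{code[3:]}"


def has_leakage(note, codes):
    """True if any target code appears in the note, with or without its dot."""
    text = note.upper()
    for code in codes:
        for variant in (code, code.replace(".", "")):
            if re.search(rf"\b{re.escape(variant)}\b", text):
                return True
    return False


def clean(examples):
    seen, kept = set(), []
    for ex in examples:
        note = ex.get("note", "").strip()
        codes = [normalize_code(c) for c in ex.get("codes", []) if c.strip()]
        if not note or not codes:      # empty note or empty target
            continue
        if has_leakage(note, codes):   # answer visible in the prompt
            continue
        key = (note, tuple(codes))
        if key in seen:                # duplicate after normalization
            continue
        seen.add(key)
        kept.append({"note": note, "codes": codes})
    return kept


raw = [
    {"note": "Type 2 diabetes without complications.", "codes": ["e119"]},
    {"note": "Type 2 diabetes without complications.", "codes": ["E11.9"]},
    {"note": "", "codes": ["I10"]},
    {"note": "Essential hypertension (I10).", "codes": ["I10"]},
    {"note": "Essential hypertension.", "codes": []},
]
cleaned = clean(raw)
print(cleaned)
```

Only the first toy example survives: the second is a duplicate once codes are normalized, the third and fifth are empty, and the fourth leaks its answer.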
Results
| Split | Examples | Precision | Recall | F1 | Exact Match | Micro F1 |
|---|---|---|---|---|---|---|
| Validation | 2,000 | 0.8705 | 0.8705 | 0.8705 | 0.8705 | 0.8714 |
| Test | 500 | 0.8720 | 0.8720 | 0.8720 | 0.8720 | 0.8755 |
Test evaluation was run on 500 examples due to Kaggle runtime constraints. Validation evaluation was run on 2,000 examples.
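The card does not spell out how each metric is defined. One plausible reading, sketched below as an assumption, treats each answer as a set of codes: precision, recall, and F1 are averaged per example, exact match requires the predicted set to equal the reference set, and micro F1 pools true positives, false positives, and false negatives across all examples. The function names and toy data are hypothetical.

```python
def parse_codes(text):
    """Split a comma-separated answer into a set of uppercase codes."""
    return {c.strip().upper() for c in text.split(",") if c.strip()}


def evaluate(predictions, references):
    tp = fp = fn = exact = 0
    p_sum = r_sum = f_sum = 0.0
    for pred_text, ref_text in zip(predictions, references):
        pred, ref = parse_codes(pred_text), parse_codes(ref_text)
        hit = len(pred & ref)
        tp += hit
        fp += len(pred - ref)
        fn += len(ref - pred)
        exact += pred == ref
        p = hit / len(pred) if pred else 0.0
        r = hit / len(ref) if ref else 0.0
        p_sum += p
        r_sum += r
        f_sum += 2 * p * r / (p + r) if p + r else 0.0
    n = len(references)
    micro_den = 2 * tp + fp + fn
    return {
        "precision": p_sum / n,
        "recall": r_sum / n,
        "f1": f_sum / n,
        "exact_match": exact / n,
        "micro_f1": 2 * tp / micro_den if micro_den else 0.0,
    }


# Toy data, not drawn from the evaluation splits
predictions = ["E11.9, I10", "I10", "J45.909"]
references = ["E11.9, I10", "I10, E78.5", "J45.901"]
scores = evaluate(predictions, references)
print(scores)
```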
Usage
Load the model
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model = "meta-llama/Meta-Llama-3-8B"
adapter_id = "anasxs/icd10-llama3-8b-qlora-adapter"

# 4-bit NF4 quantization with double quantization, matching the training setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

# Load the quantized base model, then attach the LoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()
```
Generate ICD-10-CM codes
```python
clinical_note = "Patient is a 58-year-old male with type 2 diabetes mellitus without complications and essential hypertension."

# Same chat template as training, ending at the assistant header
prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert medical coder. Read the clinical note and return only the correct ICD-10-CM code or codes.
<|eot_id|><|start_header_id|>user<|end_header_id|>

{clinical_note}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; ICD-10-CM answers are short, so 32 new tokens is enough
with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,
        num_beams=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, skipping the prompt
prediction = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
).strip()

print(prediction)
# Expected: E11.9, I10
```
Intended Use
This adapter is intended for:
- ML engineering portfolio demonstration
- QLoRA and PEFT experimentation
- Educational examples of fine-tuning open-weight LLMs
- Synthetic EHR medical coding research practice
Limitations
- Trained on synthetic EHR data, not real patient records, so the results do not demonstrate clinical usefulness
- May generate malformed text around ICD-10-CM codes
- A production system would require stronger output parsing, clinical validation, privacy review, monitoring, and expert human oversight
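As one illustration of stricter output parsing, the sketch below pulls only well-formed codes out of a generation and discards surrounding text. The regular expression checks the general ICD-10-CM shape (a letter, a digit, an alphanumeric, then an optional dot and up to four more alphanumerics); it does not confirm that a code exists in the official code set. `extract_codes` is a hypothetical helper, not part of this repository.

```python
import re

# Shape check only: 3 to 7 characters with a dot after the third
ICD10CM_RE = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def extract_codes(generated_text):
    """Return well-formed codes in order of appearance, without duplicates."""
    seen, codes = set(), []
    for match in ICD10CM_RE.finditer(generated_text.upper()):
        code = match.group(0)
        if code not in seen:
            seen.add(code)
            codes.append(code)
    return codes


# Toy generation with trailing debris after the answer
noisy = "E11.9, I10<|eot_id|>assistant E11.9"
print(extract_codes(noisy))
```

A shape check like this can still let through non-code tokens that happen to fit the pattern, so a real pipeline would also validate against the current code set.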
Out-of-Scope Use
Do not use this model for:
- Real patient care
- Billing or insurance claims
- Diagnosis or treatment decisions
- Clinical coding automation without expert review
- Any regulated medical workflow
Reproducibility
Trained with:
- PyTorch · Hugging Face Transformers · PEFT · TRL · bitsandbytes · datasets
- Weights & Biases (experiment tracking)
- Kaggle T4 ×2
- Hugging Face Hub checkpointing (to survive Kaggle session resets)
License
This adapter depends on meta-llama/Meta-Llama-3-8B. Users must comply with the Llama 3 Community License and Hugging Face access requirements.
Model provider
anasxs