README

License: other

Model Summary

| Field | Value |
| --- | --- |
| Base model | meta-llama/Meta-Llama-3-8B |
| Adapter type | PEFT LoRA |
| Fine-tuning method | QLoRA |
| Task | ICD-10-CM code generation |
| Dataset | generative-technologies/synth-ehr-icd10-llama3-format |
| Training hardware | Kaggle 2× NVIDIA T4 |
| Language | English |

This repository contains only the LoRA adapter weights. You must have access to the gated meta-llama/Meta-Llama-3-8B base model on Hugging Face to use it.


Problem

Medical coding converts clinical documentation into standardized ICD-10-CM diagnosis codes. This project explores whether a compact LoRA adapter can teach an open-weight LLM to map synthetic EHR-style notes to ICD-10-CM codes.

Example task:

Input: Patient is a 58-year-old male with type 2 diabetes mellitus without complications and essential hypertension.
Output: E11.9, I10


Training Setup

| Setting | Value |
| --- | --- |
| Train samples | 50,000 |
| Validation samples | 2,000 |
| Epochs | 1 |
| Max sequence length | 768 |
| Quantization | 4-bit NF4 |
| Double quantization | Enabled |
| Compute dtype | float16 |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj |
| Optimizer | paged_adamw_8bit |
| Learning rate | 2e-4 |
| Gradient accumulation steps | 16 |
| Gradient checkpointing | Enabled |
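
To give a sense of how small the adapter is, the following sketch estimates the number of trainable LoRA parameters implied by the settings above. The projection shapes and layer count are taken from the published Llama 3 8B architecture, not from this model card, and the calculation assumes plain LoRA without bias terms, so treat the result as an estimate rather than a reported figure.

```python
# Llama 3 8B attention projection shapes as (in_features, out_features).
# Assumption: values come from the public base-model config, not this card.
# K/V are narrower because the model uses grouped-query attention.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32  # transformer blocks in Llama 3 8B
RANK = 16        # LoRA rank from the training setup
ALPHA = 32       # LoRA alpha from the training setup


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in PROJECTION_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # factor applied to the low-rank update

print(f"trainable LoRA parameters: {total:,}")
print(f"LoRA scaling (alpha / rank): {scaling}")
```

Under these assumptions the adapter has roughly 13.6 million trainable parameters, a small fraction of the 8B-parameter base model.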

Training used completion-only loss: prompt tokens were masked, and loss was computed only on the ICD-10-CM answer tokens.
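
A minimal sketch of what completion-only masking means in practice is shown below. It is not the training code used for this adapter: the helper name `mask_prompt_labels` and the toy token IDs are illustrative. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels, hiding the first prompt_length tokens from the loss."""
    if not 0 <= prompt_length <= len(input_ids):
        raise ValueError("prompt_length must lie within the sequence")
    return [IGNORE_INDEX] * prompt_length + list(input_ids[prompt_length:])


# Toy token IDs (hypothetical): 5 prompt tokens followed by 3 answer tokens.
input_ids = [101, 7, 8, 9, 102, 501, 502, 503]
labels = mask_prompt_labels(input_ids, prompt_length=5)
print(labels)
```

Only the last three positions keep real labels, so gradients come from the answer tokens alone.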


Dataset Processing

Before training, examples were cleaned and normalized:

  • Removed empty clinical notes and empty targets
  • Normalized ICD-10-CM code formatting
  • Filtered examples with ICD code leakage in the prompt
  • Removed duplicate examples
  • Converted each example into Llama 3 chat format
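
The cleaning steps listed above could be sketched roughly as follows. The field names `note` and `codes`, the normalization rule, and the leakage regex are all assumptions made for illustration; the card does not specify the exact rules that were applied.

```python
import re

# Heuristic ICD-10-CM shape: letter, digit, alphanumeric, optional ".xxxx" suffix.
# Assumption: a simplified pattern, not the author's actual leakage filter.
ICD_PATTERN = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def normalize_code(code: str) -> str:
    """Upper-case the code and insert the dot after the 3-character category."""
    code = code.strip().upper().replace(" ", "")
    if "." not in code and len(code) > 3:
        code = code[:3] + "." + code[3:]
    return code


def clean_examples(examples):
    """Drop empty, leaking, and duplicate examples; normalize the target codes."""
    seen = set()
    cleaned = []
    for ex in examples:
        note = ex.get("note", "").strip()
        codes = [normalize_code(c) for c in ex.get("codes", []) if c.strip()]
        if not note or not codes:
            continue  # empty clinical note or empty target
        if ICD_PATTERN.search(note):
            continue  # the note already contains something that looks like a code
        key = (note, tuple(codes))
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        cleaned.append({"note": note, "codes": codes})
    return cleaned


raw = [
    {"note": "Type 2 diabetes without complications.", "codes": ["e119"]},
    {"note": "", "codes": ["I10"]},
    {"note": "Hypertension, coded I10.", "codes": ["I10"]},
    {"note": "Type 2 diabetes without complications.", "codes": ["E11.9"]},
]
cleaned = clean_examples(raw)
print(cleaned)
```

A regex like this will also match strings that are not diagnosis codes (for example a lab abbreviation such as "A1C"), so a real pipeline would likely need a stricter check against the ICD-10-CM code list.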

Prompt format:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are an expert medical coder. Read the clinical note and return only the correct ICD-10-CM code or codes.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    {clinical_note}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    {icd_codes}<|eot_id|>
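
As an illustration, a small helper can build this prompt for both training (with the answer appended) and inference (stopping where the assistant answer begins). The function name `build_prompt` is hypothetical, and the exact line breaks are inferred from the usage snippet in this card, so they should be checked against the dataset before reuse.

```python
SYSTEM_PROMPT = (
    "You are an expert medical coder. Read the clinical note and return only "
    "the correct ICD-10-CM code or codes."
)


def build_prompt(clinical_note, icd_codes=None):
    """Return the Llama 3 style prompt; append the answer when icd_codes is given."""
    prompt = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        f"{SYSTEM_PROMPT}\n"
        "<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
        f"{clinical_note.strip()}\n"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    )
    if icd_codes is not None:
        # Training example: answer tokens followed by the end-of-turn marker.
        prompt += f"{', '.join(icd_codes)}<|eot_id|>"
    return prompt


note = "Patient with essential hypertension."
inference_prompt = build_prompt(note)
training_text = build_prompt(note, ["I10"])
print(training_text)
```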


Results

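| Split | Examples | Precision | Recall | F1 | Exact Match | Micro F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Validation | 2,000 | 0.8705 | 0.8705 | 0.8705 | 0.8705 | 0.8714 |
| Test | 500 | 0.8720 | 0.8720 | 0.8720 | 0.8720 | 0.8755 |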

Test evaluation was run on 500 examples due to Kaggle runtime constraints. Validation evaluation was run on 2,000 examples.
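
The card does not say how these metrics were computed. One common way to score multi-code predictions is to compare the predicted and reference code sets, as in the sketch below. The function names and the comma-separated answer format are assumptions, and the numbers it produces are for toy data only.

```python
def parse_codes(text):
    """Split a comma-separated answer into a set of upper-cased codes."""
    return {code.strip().upper() for code in text.split(",") if code.strip()}


def evaluate(predictions, references):
    """Exact match per example plus micro-averaged precision, recall, and F1."""
    tp = fp = fn = exact = 0
    for pred_text, ref_text in zip(predictions, references):
        pred, ref = parse_codes(pred_text), parse_codes(ref_text)
        tp += len(pred & ref)   # codes predicted and correct
        fp += len(pred - ref)   # codes predicted but not in the reference
        fn += len(ref - pred)   # reference codes that were missed
        exact += int(pred == ref)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "exact_match": exact / len(references) if references else 0.0,
        "micro_precision": precision,
        "micro_recall": recall,
        "micro_f1": f1,
    }


# Toy data, not drawn from the evaluation sets.
predictions = ["E11.9, I10", "I10", "J45.909"]
references = ["E11.9, I10", "E11.9", "J45.909"]
metrics = evaluate(predictions, references)
print(metrics)
```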


Usage

Load the model

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model = "meta-llama/Meta-Llama-3-8B"
adapter_id = "anasxs/icd10-llama3-8b-qlora-adapter"

# Match the quantization used during training: 4-bit NF4 with double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# The base tokenizer has no pad token, so reuse EOS and pad on the left for generation.
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach the LoRA adapter weights to the quantized base model.
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()
```

Generate ICD-10-CM codes

```python
clinical_note = "Patient is a 58-year-old male with type 2 diabetes mellitus without complications and essential hypertension."

# Same layout as the training prompt, ending where the assistant answer begins.
prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are an expert medical coder. Read the clinical note and return only the correct ICD-10-CM code or codes.
<|eot_id|><|start_header_id|>user<|end_header_id|>
{clinical_note}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The training format ends each answer with <|eot_id|>, so stop on it as well
# as on the tokenizer's default EOS token.
stop_token_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

with torch.inference_mode():
    outputs = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,  # greedy decoding
        num_beams=1,
        eos_token_id=stop_token_ids,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the newly generated tokens, not the prompt.
prediction = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
).strip()
print(prediction)
# Expected: E11.9, I10
```

Intended Use

This adapter is intended for:

  • ML engineering portfolio demonstration
  • QLoRA and PEFT experimentation
  • Educational examples of fine-tuning open-weight LLMs
  • Synthetic EHR medical coding research practice

Limitations

  • Trained on synthetic EHR data, not real patient records — results do not prove clinical usefulness
  • May generate malformed text around ICD-10-CM codes
  • A production system would require stronger output parsing, clinical validation, privacy review, monitoring, and expert human oversight
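
As a first step toward stronger output parsing, generated text can be filtered down to strings that at least look like ICD-10-CM codes. The sketch below is a heuristic only: the function name `extract_codes` and the simplified regex are assumptions, and matching the pattern does not mean a code is valid or clinically correct.

```python
import re

# Heuristic ICD-10-CM shape: letter, digit, alphanumeric, optional ".xxxx" suffix.
ICD_PATTERN = re.compile(r"\b[A-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def extract_codes(generated_text):
    """Return code-like strings in order of first appearance, without duplicates."""
    codes = []
    for match in ICD_PATTERN.findall(generated_text):
        if match not in codes:
            codes.append(match)
    return codes


print(extract_codes("The codes are E11.9, I10. E11.9"))
print(extract_codes("No codes could be determined."))
```

A stricter parser would also check each candidate against the official ICD-10-CM code list for the relevant release year.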

Out-of-Scope Use

Do not use this model for:

  • Real patient care
  • Billing or insurance claims
  • Diagnosis or treatment decisions
  • Clinical coding automation without expert review
  • Any regulated medical workflow

Reproducibility

Trained with:

  • PyTorch · Hugging Face Transformers · PEFT · TRL · bitsandbytes · datasets
  • Weights & Biases (experiment tracking)
  • Kaggle T4 ×2
  • Hugging Face Hub checkpointing (to survive Kaggle session resets)

License

This adapter depends on meta-llama/Meta-Llama-3-8B. Users must comply with the Llama 3 Community License and Hugging Face access requirements.

Model provider

anasxs

Model tree

Base

meta-llama/Meta-Llama-3-8B

Adapter

this model

Modalities

Input

Text

Output

Text
