Model Description
This adapter lets researchers query the CpG Atlas database in plain English. CpG Atlas is a centralized multi-layer knowledgebase covering 18 functional annotation layers across more than 1.2 million CpG sites. The model was fine-tuned on domain-specific question-SQL pairs formatted in ChatML. Each system message contains the relevant table schemas and column descriptions, so the model learns to read and apply schema context at inference time.
Task: Natural Language to SQL (NL-to-SQL) for the CpG Atlas DuckDB database
Example input:
Find all CpGs associated with mortality that have ICC greater than 0.75 and are immune cell invariant
Example output:
SELECT e.ProbeID, e.trait, e.beta, i.icc_sugden
FROM ewas_atlas e
JOIN icc_data i ON e.ProbeID = i.ProbeID
JOIN immune_invariant inv ON e.ProbeID = inv.ProbeID
WHERE e.trait LIKE '%mortality%'
AND i.icc_sugden > 0.75;
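To make the ChatML layout mentioned in the description concrete, here is a minimal sketch of how one question-SQL pair could be rendered. The system prompt wording is taken from the Usage section. The one-line-per-table schema text is an assumption for illustration only, built from the column names that appear in the example query; the actual schema format used in training is not documented here. `to_chatml` is a hypothetical helper, not part of any library.

```python
def to_chatml(messages):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

SYSTEM_TEMPLATE = (
    "You are an expert bioinformatics data engineer. Write one executable "
    "DuckDB SQL query to answer the question based on the schema. "
    "Return only SQL. Do not include markdown.\n\nSchema:\n{schema_context}"
)

# Assumed schema rendering: table name followed by its columns.
schema_context = "\n".join([
    "ewas_atlas(ProbeID, trait, beta)",
    "icc_data(ProbeID, icc_sugden)",
    "immune_invariant(ProbeID)",
])

example = [
    {"role": "system", "content": SYSTEM_TEMPLATE.format(schema_context=schema_context)},
    {"role": "user", "content": (
        "Find all CpGs associated with mortality that have ICC greater "
        "than 0.75 and are immune cell invariant"
    )},
    {"role": "assistant", "content": (
        "SELECT e.ProbeID, e.trait, e.beta, i.icc_sugden\n"
        "FROM ewas_atlas e\n"
        "JOIN icc_data i ON e.ProbeID = i.ProbeID\n"
        "JOIN immune_invariant inv ON e.ProbeID = inv.ProbeID\n"
        "WHERE e.trait LIKE '%mortality%'\n"
        "AND i.icc_sugden > 0.75;"
    )},
]

print(to_chatml(example))
```

At inference time the assistant turn is omitted and the tokenizer's chat template produces the same layout, ending with an open assistant turn for the model to complete.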
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
| Method | LoRA (Low-Rank Adaptation) |
| Rank (r) | 32 |
| Alpha | 64 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Quantization | 4-bit NF4 (double quantization, bfloat16 compute) |
| Epochs | 5 |
| Learning rate | 1e-4 |
| Scheduler | Cosine with 0.05 warmup ratio |
| Batch size | 2 (with 8 gradient accumulation steps) |
| Max sequence length | 4096 |
| Optimizer | paged_adamw_32bit |
| Training examples | 541 |
| Validation examples | 135 |
| Hardware | NVIDIA A100/H100 GPU |
Training loss decreased from 1.62 at the start of training to 0.005 at the end.
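For readers who want to set up a comparable run, the hyperparameters in the table map onto PEFT, Transformers, and bitsandbytes keyword arguments roughly as follows. This is a sketch of that mapping, not the authors' training script: `task_type` and `output_dir` are assumptions, and the maximum sequence length (4096) is left out because the argument name for it differs between TRL versions.

```python
# Values copied from the training table; keyword names follow the
# peft.LoraConfig, transformers.BitsAndBytesConfig and
# transformers.TrainingArguments signatures.
LORA_KWARGS = dict(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",  # assumption: not stated in the table
)

QUANT_KWARGS = dict(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype="bfloat16",
)

TRAIN_KWARGS = dict(
    num_train_epochs=5,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    optim="paged_adamw_32bit",
)


def build_configs(output_dir="cpgatlas-lora"):  # hypothetical output directory
    """Instantiate the config objects (needs peft, transformers, bitsandbytes)."""
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig, TrainingArguments

    return (
        LoraConfig(**LORA_KWARGS),
        BitsAndBytesConfig(**QUANT_KWARGS),
        TrainingArguments(output_dir=output_dir, **TRAIN_KWARGS),
    )
```

With a per-device batch size of 2 and 8 gradient accumulation steps, each optimizer update sees 16 examples per GPU.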
Evaluation
The model was evaluated on a held-out benchmark of 135 queries spanning five complexity classes: (1) simple single-table lookups, (2) multi-table joins, (3) filtering, (4) concept-mapping queries requiring biological synonym understanding, and (5) complex queries requiring additional analysis steps.
| Model | Exact Accuracy | Accuracy (incl. partial) |
|---|---|---|
| Qwen2.5-Coder-7B-Instruct (base) | 30% | 86% |
| This model (fine-tuned) | 61% | 88% |
| GPT-5.4 mini (API mode) | 80% | 98% |
Usage
With PEFT
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)
model = PeftModel.from_pretrained(base_model, "vandijklab/CpGAtlas-NL-to-SQL-Qwen2.5-Coder-7B-LoRA")
tokenizer = AutoTokenizer.from_pretrained("vandijklab/CpGAtlas-NL-to-SQL-Qwen2.5-Coder-7B-LoRA")

# Placeholder: replace with the schemas and column descriptions of the tables relevant to the question.
schema_context = "<table schemas and column descriptions>"

messages = [
    {"role": "system", "content": f"You are an expert bioinformatics data engineer. Write one executable DuckDB SQL query to answer the question based on the schema. Return only SQL. Do not include markdown.\n\nSchema:\n{schema_context}"},
    {"role": "user", "content": "What transposable element classes are in the transposon dataset?"},
]

# Render the chat template, generate, and decode only the newly generated tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Self-Correcting Inference
The model is designed to be used with a self-correcting execution loop: the generated SQL is executed against DuckDB, and if execution fails, the error message is appended to the prompt for the model to revise its output (up to 3 retry attempts).
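The loop described above can be sketched as follows. This is one possible shape under stated assumptions, not the authors' implementation: `run_with_self_correction`, `generate_sql`, and `execute_sql` are hypothetical names, and the way the error is fed back (the failed SQL as an assistant turn, the error text as a new user turn) is an assumed format. "Up to 3 retry attempts" is read here as one initial attempt plus at most three retries.

```python
from typing import Any, Callable, Dict, List, Tuple

Message = Dict[str, str]


def run_with_self_correction(
    messages: List[Message],
    generate_sql: Callable[[List[Message]], str],
    execute_sql: Callable[[str], Any],
    max_retries: int = 3,
) -> Tuple[str, Any]:
    """Generate SQL, execute it, and feed execution errors back to the model.

    Returns (sql, result) from the first attempt that executes without error.
    Raises RuntimeError if the initial attempt and every retry fail.
    """
    history = list(messages)  # copy so the caller's list is not mutated
    last_error = None
    for _ in range(max_retries + 1):  # one initial attempt plus the retries
        sql = generate_sql(history).strip()
        try:
            return sql, execute_sql(sql)
        except Exception as exc:  # e.g. a parser or binder error from the database
            last_error = exc
            # Assumed feedback format: show the model its failed query and the error.
            history.append({"role": "assistant", "content": sql})
            history.append({
                "role": "user",
                "content": (
                    f"The query failed with this error:\n{exc}\n"
                    "Return a corrected SQL query."
                ),
            })
    raise RuntimeError(
        f"No executable SQL after {max_retries} retries: {last_error}"
    )
```

In practice `generate_sql` would wrap the `apply_chat_template` and `model.generate` calls from the PEFT example, and `execute_sql` could be something like `lambda sql: con.execute(sql).fetchall()` for a connection opened with `duckdb.connect(path, read_only=True)`. Opening the database read-only is a cautious default when executing model-generated SQL.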
Intended Use
This model is intended for use with the CpG Atlas database and its associated schema. It is designed to support researchers in querying multi-dimensional DNA methylation annotations without requiring SQL expertise. The model can run entirely locally without an internet connection or API key.
Limitations
- Trained on a specific query distribution over the CpG Atlas schema; may underperform on highly novel query patterns beyond its training scope
- Users should inspect generated SQL before treating results as definitive
- Requires enough compute to run a 7B-parameter model (quantized inference needs about 6 GB of VRAM)
Citation
If you use this model, please cite the CpG Atlas paper (citation forthcoming).
Framework Versions
- PEFT 0.12.0
- Transformers (compatible with Qwen2.5)
- TRL (SFTTrainer)